arxiv: 2605.13801 · v1 · pith:Q2XUZPOLnew · submitted 2026-05-13 · 💻 cs.LG · cs.AI

Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling

Deepak Pandita , Flip Korn , Chris Welty , Christopher M. Homan This is my paper

Pith reviewed 2026-05-14 19:27 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords reproducibilityhuman evaluationannotator modelingbootstrappingLLM evaluationstatistical significancevariance modeling

0 comments p. Extension

Add this Pith Number to your LaTeX paper

What is a Pith Number?

\usepackage{pith}
\pithnumber{Q2XUZPOL}

Prints a linked pith:Q2XUZPOL badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

Multi-level bootstrapping models annotator variance to find the N and K needed for statistically significant evaluations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a multi-level bootstrapping method that uses datasets with many ratings and persistent rater identifiers to simulate how individual annotators differ in their judgments. This approach lets researchers measure how the number of items evaluated and the number of annotations per item trade off against each other to reach reliable statistical significance. Current practices that collect only three to five annotations per item often fail to produce repeatable results because they ignore annotator-specific variance. By modeling that variance explicitly, the method shows what sample sizes actually stabilize evaluation outcomes for AI systems. The work focuses on improving reproducibility in human evaluations of generative models.

Core claim

Leveraging datasets with large numbers of ratings and persistent rater identifiers, the multi-level bootstrapping approach realistically models annotator behavior and quantifies the tradeoffs between the number of items N and the number of responses per item K required to achieve statistical significance.

What carries the argument

Multi-level bootstrapping that resamples ratings at both the item level and the individual annotator level using persistent rater identifiers to capture variance.

If this is right

Evaluation protocols can optimize the balance of items and annotations per item to reach statistical significance with fewer total ratings.
Persistent tracking of individual raters becomes a necessary feature for future evaluation datasets.
Standard small-K practices (three to five annotations) are shown to be insufficient once annotator variance is modeled.
Reproducibility improves when variance estimates come from multi-level resampling rather than simple averaging.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same bootstrapping technique could be tested on annotation tasks outside AI safety, such as medical image labeling or content moderation.
Benchmarks that release only aggregated labels would need to add rater metadata to support this style of analysis.
Real-time annotation platforms could incorporate the method to decide on the fly when additional ratings are no longer improving significance.

Load-bearing premise

Datasets that contain large numbers of ratings per item together with persistent rater identifiers are available and representative of typical evaluation settings.

What would settle it

Apply the bootstrapping procedure to a fresh dataset that lacks persistent rater identifiers, compute the predicted N and K thresholds for significance, then run repeated real-world annotation campaigns at those sizes and check whether the observed stability matches the prediction.

Figures

Figures reproduced from arXiv: 2605.13801 by Christopher M. Homan, Chris Welty, Deepak Pandita, Flip Korn.

**Figure 2.** Figure 2: Comparing S2 (top) vs S3 (bottom): P-value plots for Toxicity dataset with Accuracy and [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: S2: P-value plots for comparing the DICES 5 rater sample with the Toxicity dataset under [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: S1: P-value plots for DICES dataset with Accuracy as the metric [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗

**Figure 5.** Figure 5: S1: Effect sizes (∆) for DICES dataset with Accuracy as the metric 12 [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗

**Figure 6.** Figure 6: S1: P-value plots for DICES dataset with MAE as the metric [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 7.** Figure 7: S1: Effect sizes (∆) for DICES dataset with MAE as the metric 0 20 40 60 80 100 K 0.0 0.1 0.2 0.3 0.4 0.5 p-value DICES (Wins, =0.1) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 NxK=50000 (a) ϵ = 0.1 0 20 40 60 80 100 K 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 p-value DICES (Wins, =0.2) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 NxK=50000 (b) ϵ = 0.… view at source ↗

**Figure 8.** Figure 8: S1: P-value plots for DICES dataset with Wins as the metric [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗

**Figure 9.** Figure 9: S1: Effect sizes (∆) for DICES dataset with Wins as the metric 0 20 40 60 80 100 K 0.0 0.2 0.4 0.6 0.8 p-value DICES_5 (Accuracy, =0.1) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 NxK=50000 (a) ϵ = 0.1 0 20 40 60 80 100 K 0.0 0.2 0.4 0.6 0.8 p-value DICES_5 (Accuracy, =0.2) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 NxK=50000 (b) ϵ = 0.2 0 20 40 60 80 … view at source ↗

**Figure 10.** Figure 10: S1: P-value plots for DICES 5 rater sample with Accuracy as the metric [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗

**Figure 11.** Figure 11: S1: Effect sizes (∆) for DICES 5 rater sample with Accuracy as the metric 0 20 40 60 80 100 K 0.0 0.1 0.2 0.3 0.4 p-value DICES_5 (MAE, =0.1) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 NxK=50000 (a) ϵ = 0.1 0 20 40 60 80 100 K 0.0 0.1 0.2 0.3 0.4 p-value DICES_5 (MAE, =0.2) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 NxK=50000 (b) ϵ = 0.2 0 20 40 60 8… view at source ↗

**Figure 12.** Figure 12: S1: P-value plots for DICES 5 rater sample with MAE as the metric [PITH_FULL_IMAGE:figures/full_fig_p014_12.png] view at source ↗

**Figure 13.** Figure 13: S1: Effect sizes (∆) for DICES 5 rater sample with MAE as the metric 0 20 40 60 80 100 K 0.0 0.1 0.2 0.3 0.4 0.5 p-value DICES_5 (Wins, =0.1) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 NxK=50000 (a) ϵ = 0.1 0 20 40 60 80 100 K 0.0 0.1 0.2 0.3 0.4 p-value DICES_5 (Wins, =0.2) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 NxK=50000 (b) ϵ = 0.2 0 20 40 60 … view at source ↗

**Figure 14.** Figure 14: S1: P-value plots for DICES 5 rater sample with Wins as the metric [PITH_FULL_IMAGE:figures/full_fig_p014_14.png] view at source ↗

**Figure 15.** Figure 15: S1: Effect sizes (∆) for DICES 5 rater sample with Wins as the metric 14 [PITH_FULL_IMAGE:figures/full_fig_p014_15.png] view at source ↗

**Figure 16.** Figure 16: S2: P-value plots for DICES dataset with Accuracy as the metric [PITH_FULL_IMAGE:figures/full_fig_p015_16.png] view at source ↗

**Figure 17.** Figure 17: S2: Effect sizes (∆) for DICES dataset with Accuracy as the metric 0 20 40 60 80 100 K 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 p-value DICES_S2 (MAE, =0.1) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 NxK=50000 (a) ϵ = 0.1 0 20 40 60 80 100 K 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 p-value DICES_S2 (MAE, =0.2) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=… view at source ↗

**Figure 18.** Figure 18: S2: P-value plots for DICES dataset with MAE as the metric [PITH_FULL_IMAGE:figures/full_fig_p015_18.png] view at source ↗

**Figure 19.** Figure 19: S2: Effect sizes (∆) for DICES dataset with MAE as the metric 0 20 40 60 80 100 K 0.0 0.1 0.2 0.3 0.4 0.5 p-value DICES_S2 (Wins, =0.1) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 NxK=50000 (a) ϵ = 0.1 0 20 40 60 80 100 K 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 p-value DICES_S2 (Wins, =0.2) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 NxK=50000 (b)… view at source ↗

**Figure 20.** Figure 20: S2: P-value plots for DICES dataset with Wins as the metric [PITH_FULL_IMAGE:figures/full_fig_p015_20.png] view at source ↗

**Figure 21.** Figure 21: S2: Effect sizes (∆) for DICES dataset with Wins as the metric 0 20 40 60 80 100 K 0.0 0.2 0.4 0.6 0.8 p-value DICES_5_S2 (Accuracy, =0.1) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 NxK=50000 (a) ϵ = 0.1 0 20 40 60 80 100 K 0.0 0.2 0.4 0.6 0.8 p-value DICES_5_S2 (Accuracy, =0.2) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 NxK=50000 (b) ϵ = 0.2 0 20 40… view at source ↗

**Figure 22.** Figure 22: S2: P-value plots for DICES 5 rater sample with Accuracy as the metric [PITH_FULL_IMAGE:figures/full_fig_p016_22.png] view at source ↗

**Figure 23.** Figure 23: S2: Effect sizes (∆) for DICES 5 rater sample with Accuracy as the metric 16 [PITH_FULL_IMAGE:figures/full_fig_p016_23.png] view at source ↗

**Figure 24.** Figure 24: S2: P-value plots for DICES 5 rater sample with MAE as the metric [PITH_FULL_IMAGE:figures/full_fig_p017_24.png] view at source ↗

**Figure 25.** Figure 25: S2: Effect sizes (∆) for DICES 5 rater sample with MAE as the metric 0 20 40 60 80 100 K 0.0 0.1 0.2 0.3 0.4 0.5 p-value DICES_5_S2 (Wins, =0.1) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 NxK=50000 (a) ϵ = 0.1 0 20 40 60 80 100 K 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 p-value DICES_5_S2 (Wins, =0.2) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 Nx… view at source ↗

**Figure 26.** Figure 26: S2: P-value plots for DICES 5 rater sample with Wins as the metric [PITH_FULL_IMAGE:figures/full_fig_p017_26.png] view at source ↗

**Figure 27.** Figure 27: S2: Effect sizes (∆) for DICES 5 rater sample with Wins as the metric 0 20 40 60 80 100 K 0.0 0.2 0.4 0.6 0.8 p-value Toxicity_S2 (Accuracy, =0.1) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 NxK=50000 (a) ϵ = 0.1 0 20 40 60 80 100 K 0.0 0.2 0.4 0.6 0.8 p-value Toxicity_S2 (Accuracy, =0.2) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 NxK=50000 (b) ϵ = 0.… view at source ↗

**Figure 28.** Figure 28: S2: P-value plots for Toxicity dataset with Accuracy as the metric [PITH_FULL_IMAGE:figures/full_fig_p017_28.png] view at source ↗

**Figure 29.** Figure 29: S2: Effect sizes (∆) for Toxicity dataset with Accuracy as the metric 0 20 40 60 80 100 K 0.0 0.1 0.2 0.3 0.4 p-value Toxicity_S2 (MAE, =0.1) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 NxK=50000 (a) ϵ = 0.1 0 20 40 60 80 100 K 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 p-value Toxicity_S2 (MAE, =0.2) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 NxK=50000 … view at source ↗

**Figure 30.** Figure 30: S2: P-value plots for Toxicity dataset with MAE as the metric [PITH_FULL_IMAGE:figures/full_fig_p018_30.png] view at source ↗

**Figure 31.** Figure 31: S2: Effect sizes (∆) for Toxicity dataset with MAE as the metric 0 20 40 60 80 100 K 0.0 0.1 0.2 0.3 0.4 0.5 p-value Toxicity_S2 (Wins, =0.1) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 NxK=50000 (a) ϵ = 0.1 0 20 40 60 80 100 K 0.0 0.1 0.2 0.3 0.4 0.5 p-value Toxicity_S2 (Wins, =0.2) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 NxK=50000 (b) ϵ = 0.2 0 2… view at source ↗

**Figure 32.** Figure 32: S2: P-value plots for Toxicity dataset with Wins as the metric [PITH_FULL_IMAGE:figures/full_fig_p018_32.png] view at source ↗

**Figure 33.** Figure 33: S2: Effect sizes (∆) for Toxicity dataset with Wins as the metric 18 [PITH_FULL_IMAGE:figures/full_fig_p018_33.png] view at source ↗

**Figure 34.** Figure 34: S2: P-value plots for D3code dataset with Accuracy as the metric [PITH_FULL_IMAGE:figures/full_fig_p019_34.png] view at source ↗

**Figure 35.** Figure 35: S2: Effect sizes (∆) for D3code dataset with Accuracy as the metric 0 20 40 60 80 100 K 0.0 0.1 0.2 0.3 0.4 0.5 p-value D3code_S2 (MAE, =0.1) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 NxK=50000 (a) ϵ = 0.1 0 20 40 60 80 100 K 0.0 0.1 0.2 0.3 0.4 0.5 p-value D3code_S2 (MAE, =0.2) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 NxK=50000 (b) ϵ = 0.2 0 20 4… view at source ↗

**Figure 36.** Figure 36: S2: P-value plots for D3code dataset with MAE as the metric [PITH_FULL_IMAGE:figures/full_fig_p019_36.png] view at source ↗

**Figure 37.** Figure 37: S2: Effect sizes (∆) for D3code dataset with MAE as the metric 0 20 40 60 80 100 K 0.0 0.1 0.2 0.3 0.4 0.5 p-value D3code_S2 (Wins, =0.1) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 NxK=50000 (a) ϵ = 0.1 0 20 40 60 80 100 K 0.0 0.1 0.2 0.3 0.4 0.5 p-value D3code_S2 (Wins, =0.2) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 NxK=50000 (b) ϵ = 0.2 0 20 40 6… view at source ↗

**Figure 38.** Figure 38: S2: P-value plots for D3code dataset with Wins as the metric [PITH_FULL_IMAGE:figures/full_fig_p019_38.png] view at source ↗

**Figure 39.** Figure 39: S2: Effect sizes (∆) for D3code dataset with Wins as the metric 0 20 40 60 80 100 K 0.0 0.2 0.4 0.6 0.8 p-value Toxicity_S3 (Accuracy, =0.1) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 NxK=50000 (a) ϵ = 0.1 0 20 40 60 80 100 K 0.0 0.2 0.4 0.6 0.8 p-value Toxicity_S3 (Accuracy, =0.2) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 NxK=50000 (b) ϵ = 0.2 0 20… view at source ↗

**Figure 40.** Figure 40: S3: P-value plots for Toxicity dataset with Accuracy as the metric [PITH_FULL_IMAGE:figures/full_fig_p020_40.png] view at source ↗

**Figure 41.** Figure 41: S3: Effect sizes (∆) for Toxicity dataset with Accuracy as the metric 0 20 40 60 80 100 K 0.0 0.1 0.2 0.3 0.4 p-value Toxicity_S3 (MAE, =0.1) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 NxK=50000 (a) ϵ = 0.1 0 20 40 60 80 100 K 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 p-value Toxicity_S3 (MAE, =0.2) NxK=100 NxK=250 NxK=500 NxK=1000 NxK=2500 NxK=5000 NxK=10000 NxK=25000 NxK=5… view at source ↗

**Figure 42.** Figure 42: S3: P-value plots for Toxicity dataset with MAE as the metric [PITH_FULL_IMAGE:figures/full_fig_p020_42.png] view at source ↗

**Figure 43.** Figure 43: S3: Effect sizes (∆) for Toxicity dataset with MAE as the metric 20 [PITH_FULL_IMAGE:figures/full_fig_p020_43.png] view at source ↗

**Figure 44.** Figure 44: S3: P-value plots for Toxicity dataset with Wins as the metric [PITH_FULL_IMAGE:figures/full_fig_p021_44.png] view at source ↗

**Figure 45.** Figure 45: S3: Effect sizes (∆) for Toxicity dataset with Wins as the metric 21 [PITH_FULL_IMAGE:figures/full_fig_p021_45.png] view at source ↗

read the original abstract

As generative AI models such as large language models (LLMs) become more pervasive, ensuring the safety, robustness, and overall trustworthiness of these systems is paramount. However, AI is currently facing a reproducibility crisis driven by unreliable evaluations and unrepeatable experimental results. While human raters are often used to assess models for utility and safety, they introduce divergent biases and subjective opinions into their annotations. Overcoming this variance is exceptionally challenging because very little data exists to study how experimental repeatability actually improves as the annotator pool grows. Standard evaluation practices typically rely on a small number of annotations per item (often 3 to 5) and lack the persistent rater identifiers necessary to model individual variance across items. In this work, we introduce a multi-level bootstrapping approach to realistically model annotator behavior. Leveraging datasets with a large number of ratings and persistent rater identifiers, we analyze the tradeoffs between the number of items ($N$) and the number of responses per item ($K$) required to achieve statistical significance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper uses multi-level bootstrapping on persistent-rater datasets to map N/K tradeoffs for statistical significance, but the transfer to ordinary low-K evaluations remains untested.

read the letter

The main point is that the authors apply multi-level bootstrapping to datasets that track individual raters across many items. This lets them estimate how variance behaves at different levels and then work out the combinations of total items N and ratings per item K that reach statistical significance in human evaluations of models. It is a direct response to the common practice of using K=3-5 without any real basis for the choice. They correctly note that most current setups lack persistent IDs, which makes variance hard to measure. The work is useful because it turns available rich annotation data into concrete guidance on sample sizes rather than leaving the decision to convention. The analysis shows clear tradeoffs and gives practitioners a data-driven starting point for planning evaluations. The soft spot is generalization. The bootstrap is fit on high-K persistent data, yet typical safety and utility benchmarks use small K with raters who are not tracked across items. If the variance structure differs, the reported N and K values could be misleading for real deployments. The paper would be tighter with an explicit check, such as deriving a recommendation from the rich data and then testing whether it predicts actual reproducibility on a held-out standard evaluation. This is aimed at teams that run or design human evaluations for LLMs, especially those who can collect or access detailed rater logs. Readers who already work with large annotation sets will get immediate value from the tradeoff curves. I would send it to peer review. The core method is straightforward and the problem it targets is real, so referees can focus on tightening the validation against ordinary evaluation conditions.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a multi-level bootstrapping approach to model annotator behavior and variance in human evaluations of generative AI models. Leveraging datasets with large numbers of ratings per item and persistent rater identifiers, the authors analyze tradeoffs between the number of items (N) and responses per item (K) required to reach statistical significance, with the goal of improving reproducibility in safety and utility assessments.

Significance. If validated, the work could supply practical, data-driven guidelines for annotation study design that reduce the impact of rater variance on evaluation reliability. The use of real persistent-rater datasets to ground the variance model is a clear strength relative to purely synthetic or single-level approaches common in the literature.

major comments (2)

[Abstract and §3] Abstract and §3 (Method): the multi-level bootstrapping procedure is described only at a high level with no equations, pseudocode, or explicit variance decomposition (e.g., item-level, rater-level, and residual components). This is load-bearing for the central N/K tradeoff claims, as the reported statistical-significance thresholds depend directly on how the bootstrap samples across levels.
[§4 and §5] §4 (Experiments) and §5 (Results): no held-out validation is reported that compares the bootstrapped significance predictions against actual repeated annotation trials on ordinary small-K (K=3–5) datasets without persistent rater IDs. Without this check, it remains unclear whether the derived tradeoffs generalize or introduce dataset-specific artifacts.

minor comments (2)

[§5] Ensure the precise definition of 'statistical significance' (p-value threshold, confidence-interval width, or power target) is stated explicitly and used consistently when reporting the N/K curves.
[Figures] Figure captions should include the exact bootstrap parameters (number of resamples, stratification rules) so readers can reproduce the plotted tradeoffs.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive comments. We address each major comment below and indicate the revisions we will make.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (Method): the multi-level bootstrapping procedure is described only at a high level with no equations, pseudocode, or explicit variance decomposition (e.g., item-level, rater-level, and residual components). This is load-bearing for the central N/K tradeoff claims, as the reported statistical-significance thresholds depend directly on how the bootstrap samples across levels.

Authors: We agree that a more formal description is needed. In the revised manuscript we will add the explicit equations for the multi-level bootstrap, including the variance decomposition into item-level, rater-level, and residual components, together with pseudocode that shows the sampling procedure at each level. This will make transparent how the bootstrap produces the reported N/K significance thresholds. revision: yes
Referee: [§4 and §5] §4 (Experiments) and §5 (Results): no held-out validation is reported that compares the bootstrapped significance predictions against actual repeated annotation trials on ordinary small-K (K=3–5) datasets without persistent rater IDs. Without this check, it remains unclear whether the derived tradeoffs generalize or introduce dataset-specific artifacts.

Authors: We acknowledge the value of such validation. Our work relies on the rare datasets that contain both large K and persistent rater identifiers; ordinary small-K evaluations lack these identifiers, so direct comparison would require new annotation campaigns that are outside the scope of the present study. We will add an explicit limitations paragraph discussing this constraint and its implications for generalizability. revision: partial

standing simulated objections not resolved

Direct held-out validation against actual repeated trials on small-K datasets without persistent rater IDs, which would require new data collection.

Circularity Check

0 steps flagged

No circularity detected in bootstrapping-based tradeoff analysis

full rationale

The paper's core contribution is a multi-level bootstrapping procedure applied to external datasets that already contain large numbers of ratings and persistent rater identifiers. This procedure is used to empirically measure how statistical significance varies with N and K; the resulting tradeoffs are outputs of the bootstrap on held-out data rather than quantities defined in terms of themselves. No equations, fitted parameters, or self-citations are shown that would make any reported prediction equivalent to its own input by construction. The method therefore remains self-contained against the provided data sources.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes that rater variance can be captured by resampling at multiple levels and that the chosen datasets generalize.

pith-pipeline@v0.9.0 · 5479 in / 1024 out tokens · 27323 ms · 2026-05-14T19:27:32.733664+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages

[1]

Reproducibility Checklist , 2023

AAAI. Reproducibility Checklist , 2023. URL https://aaai.org/conference/aaai/aaai-23/reproducibility-checklist/. Accessed: 2024-04-21

work page 2023
[2]

ACL Rolling Review , 2024

ACL. ACL Rolling Review , 2024. URL http://aclrollingreview.org/responsibleNLPresearch/. Accessed: 2024-04-21

work page 2024
[3]

Dices dataset: Diversity in conversational ai evaluation for safety

Lora Aroyo, Alex Taylor, Mark D\' az, Christopher Homan, Alicia Parrish, Gregory Serapio-Garc\' a, Vinodkumar Prabhakaran, and Ding Wang. Dices dataset: Diversity in conversational ai evaluation for safety. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 5...

work page 2023
[4]

1,500 scientists lift the lid on reproducibility

Monya Baker. 1,500 scientists lift the lid on reproducibility. Nature, 533 0 (7604): 0 452--454, May 2016. ISSN 1476-4687. doi:10.1038/533452a. URL https://www.nature.com/articles/533452a

work page doi:10.1038/533452a 2016
[5]

Toward benchmarking group explanations: Evaluating the effect of aggregation strategies versus explanation

Francesco Barile, Shabnam Najafian, Tim Draws, Oana Inel, Alisa Rieger, Rishav Hada, and Nava Tintarev. Toward benchmarking group explanations: Evaluating the effect of aggregation strategies versus explanation. In Perspectives on the Evaluation of Recommender Systems Workshop 2021: co-located with the 15th ACM Conference on Recommender Systems (RecSys 20...

work page 2021
[6]

We need to consider disagreement in evaluation

Valerio Basile, Michael Fell, Tommaso Fornaciari, Dirk Hovy, Silviu Paun, Barbara Plank, Massimo Poesio, and Alexandra Uma. We need to consider disagreement in evaluation. In Kenneth Church, Mark Liberman, and Valia Kordoni, editors, Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future, pages 15--21, Online, August 2021. Association f...

work page doi:10.18653/v1/2021.bppf-1.3 2021
[7]

Toward a perspectivist turn in ground truthing for predictive computing

Federico Cabitza, Andrea Campagner, and Valerio Basile. Toward a perspectivist turn in ground truthing for predictive computing. Proceedings of the AAAI Conference on Artificial Intelligence, 37 0 (6): 0 6860--6868, Jun. 2023. doi:10.1609/aaai.v37i6.25840. URL https://ojs.aaai.org/index.php/AAAI/article/view/25840

work page doi:10.1609/aaai.v37i6.25840 2023
[8]

An examination of generative ai response to suicide inquires: content analysis

Laurie O Campbell, Kathryn Babb, Glenn W Lambie, and B Grant Hayes. An examination of generative ai response to suicide inquires: content analysis. JMIR Mental Health, 12: 0 e73623, 2025

work page 2025
[9]

D3CODE : Disentangling Disagreements in Data across Cultures on Offensiveness Detection and Evaluation , April 2024

Aida Mostafazadeh Davani, Mark Díaz, Dylan Baker, and Vinodkumar Prabhakaran. D3CODE : Disentangling Disagreements in Data across Cultures on Offensiveness Detection and Evaluation , April 2024. URL http://arxiv.org/abs/2404.10857. arXiv:2404.10857 [cs]

work page arXiv 2024
[10]

Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith. Show your work: Improved reporting of experimental results. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Pr...

work page doi:10.18653/v1/d19-1224 2019
[11]

The reproducibility crisis is real

Odd Erik Gundersen. The reproducibility crisis is real. AI Magazine, 41 0 (3): 0 103--106, Sep. 2020. doi:10.1609/aimag.v41i3.5318. URL https://ojs.aaai.org/aimagazine/index.php/aimagazine/article/view/5318

work page doi:10.1609/aimag.v41i3.5318 2020
[12]

State of the art: Reproducibility in artificial intelligence

Odd Erik Gundersen and Sigbjørn Kjensmo. State of the art: Reproducibility in artificial intelligence. Proceedings of the AAAI Conference on Artificial Intelligence, 32 0 (1): 0 1644--1651, Apr. 2018. doi:10.1609/aaai.v32i1.11503. URL https://ojs.aaai.org/index.php/AAAI/article/view/11503

work page doi:10.1609/aaai.v32i1.11503 2018
[13]

Christopher M Homan, Flip Korn, Deepak Pandita, and Chris Welty. How many ratings per item are necessary for reliable significance testing? In Vera Demberg, Kentaro Inui, and Llu \'i s Marquez, editors, Findings of the A ssociation for C omputational L inguistics: EACL 2026 , pages 4258--4273, Rabat, Morocco, March 2026. Association for Computational Ling...

work page doi:10.18653/v1/2026.findings-eacl.223 2026
[14]

Artificial intelligence faces reproducibility crisis

Matthew Hutson. Artificial intelligence faces reproducibility crisis. Science, 359 0 (6377): 0 725--726, 2018. doi:10.1126/science.359.6377.725. URL https://www.science.org/doi/abs/10.1126/science.359.6377.725

work page doi:10.1126/science.359.6377.725 2018
[15]

ICML 2023 Paper Guidelines , 2023

ICML. ICML 2023 Paper Guidelines , 2023. URL https://icml.cc/Conferences/2023/PaperGuidelines. Accessed: 2024-04-21

work page 2023
[16]

Reproducibility Guidelines – IJCAI - ECAI 2022, 2022

IJCAI. Reproducibility Guidelines – IJCAI - ECAI 2022, 2022. URL https://ijcai-22.org/reproducibility/. Accessed: 2024-04-21

work page 2022
[17]

Leakage and the reproducibility crisis in ml-based science, 2022

Sayash Kapoor and Arvind Narayanan. Leakage and the reproducibility crisis in ml-based science, 2022. URL https://arxiv.org/abs/2207.07048

work page arXiv 2022
[18]

Designing toxic content classification for a diversity of perspectives

Deepak Kumar, Patrick Gage Kelley, Sunny Consolvo, Joshua Mason, Elie Bursztein, Zakir Durumeric, Kurt Thomas, and Michael Bailey. Designing toxic content classification for a diversity of perspectives. In Seventeenth Symposium on Usable Privacy and Security (SOUPS 2021), pages 299--318, 2021. URL https://www.usenix.org/conference/soups2021/presentation/kumar

work page 2021
[19]

Community perspective on replicability in natural language processing

Margot Mieskes, Kar \"e n Fort, Aur \'e lie N \'e v \'e ol, Cyril Grouin, and Kevin Cohen. Community perspective on replicability in natural language processing. In Ruslan Mitkov and Galia Angelova, editors, Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 768--775, Varna, Bulgaria, Septembe...

work page doi:10.26615/978-954-452-056-4_089 2019
[20]

Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations

Aida Mostafazadeh Davani, Mark D \'i az, and Vinodkumar Prabhakaran. Dealing with disagreements: Looking beyond the majority vote in subjective annotations. Transactions of the Association for Computational Linguistics, 10: 0 92--110, 2022. doi:10.1162/tacl_a_00449. URL https://aclanthology.org/2022.tacl-1.6/

work page doi:10.1162/tacl_a_00449 2022
[21]

PaperInformation / PaperChecklist , 2021

NeurIPS. PaperInformation / PaperChecklist , 2021. URL https://neurips.cc/Conferences/2021/PaperInformation/PaperChecklist. Accessed: 2024-04-21

work page 2021
[22]

Forest vs tree: The (n, k) trade-off in reproducible ml evaluation

Deepak Pandita, Flip Korn, Chris Welty, and Christopher M Homan. Forest vs tree: The (n, k) trade-off in reproducible ml evaluation. Proceedings of the AAAI Conference on Artificial Intelligence, 40 0 (29): 0 24736--24744, Mar. 2026. doi:10.1609/aaai.v40i29.39659. URL https://ojs.aaai.org/index.php/AAAI/article/view/39659

work page doi:10.1609/aaai.v40i29.39659 2026
[23]

Problems and Opportunities in Training Deep Learning Software Systems : An Analysis of Variance

Hung Viet Pham, Shangshu Qian, Jiannan Wang, Thibaud Lutellier, Jonathan Rosenthal, Lin Tan, Yaoliang Yu, and Nachiappan Nagappan. Problems and Opportunities in Training Deep Learning Software Systems : An Analysis of Variance . In 2020 35th IEEE / ACM International Conference on Automated Software Engineering ( ASE ) , pages 771--783, September 2020. URL...

work page arXiv 2020
[24]

On releasing annotator-level labels and information in datasets

Vinodkumar Prabhakaran, Aida Mostafazadeh Davani, and Mark Diaz. On releasing annotator-level labels and information in datasets. In Claire Bonial and Nianwen Xue, editors, Proceedings of the Joint 15th Linguistic Annotation Workshop (LAW) and 3rd Designing Meaning Representations (DMR) Workshop, pages 133--138, Punta Cana, Dominican Republic, November 20...

work page doi:10.18653/v1/2021.law-1.14 2021
[25]

A step toward quantifying independently reproducible machine learning research

Edward Raff. A step toward quantifying independently reproducible machine learning research. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alch\' e -Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32, Vancouver, Canada, 2019. Curran Associates, Inc. URL https://proceedings.neurips.cc/paper_files/paper/2...

work page 2019
[26]

Soft metrics for evaluation with disagreements: an assessment

Giulia Rizzi, Elisa Leonardelli, Massimo Poesio, Alexandra Uma, Maja Pavlovic, Silviu Paun, Paolo Rosso, and Elisabetta Fersini. Soft metrics for evaluation with disagreements: an assessment. In Gavin Abercrombie, Valerio Basile, Davide Bernadi, Shiran Dudy, Simona Frenda, Lucy Havens, and Sara Tonelli, editors, Proceedings of the 3rd Workshop on Perspect...

work page 2024
[27]

`just what do you think you ' re doing, dave?' a checklist for responsible data use in NLP

Anna Rogers, Timothy Baldwin, and Kobi Leins. `just what do you think you ' re doing, dave?' a checklist for responsible data use in NLP . In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4821--4833, Punta Cana, Dominican Republic, November 2...

work page doi:10.18653/v1/2021.findings-emnlp.414 2021
[28]

Shi, J., Zhong, Y ., Xu, N., Li, Y ., and Xu, C

Harald Semmelrock, Tony Ross-Hellauer, Simone Kopeinik, Dieter Theiler, Maximilian Haberl, Stefan Thalmann, and Dominik Kowald. Reproducibility in machine-learning-based research: Overview, barriers, and drivers. AI Magazine, 46: 0 e70002, 2025. https://doi.org/10.1002/aaai.70002

work page doi:10.1002/aaai.70002 2025
[29]

Cheap and fast--but is it good? evaluating non-expert annotations for natural language tasks

Rion Snow, Brendan O’connor, Dan Jurafsky, and Andrew Y Ng. Cheap and fast--but is it good? evaluating non-expert annotations for natural language tasks. In Proceedings of the 2008 conference on empirical methods in natural language processing, pages 254--263, 2008

work page 2008
[30]

KhudaBukhsh , and Christopher Homan

Tharindu Cyril Weerasooriya , Alexander Ororbia , Raj Bhensadadia , Ashiqur R. KhudaBukhsh , and Christopher Homan . Disagreement Matters : Preserving Label Diversity by Jointly Modeling Item and Annotator Label Distributions with DisCo . Annual Meeting of the Association for Computational Linguistics, 2023. doi:10.18653/v1/2023.findings-acl.287. S2ID: cc...

work page doi:10.18653/v1/2023.findings-acl.287 2023
[31]

Follow the leader(board) with confidence: Estimating p-values from a single test set with item and response variance

Shira Wein, Christopher Homan, Lora Aroyo, and Chris Welty. Follow the leader(board) with confidence: Estimating p-values from a single test set with item and response variance. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 3138--3161, Toronto, Canada, July 2023....

work page doi:10.18653/v1/2023.findings-acl.196 2023