Metric-Dependent Annotation Saturation for Learning from Label Distributions

Guneet Kohli

arxiv: 2605.29797 · v1 · pith:PELZ7BNSnew · submitted 2026-05-28 · 💻 cs.CL

Pith reviewed 2026-06-29 08:03 UTC · model grok-4.3

classification 💻 cs.CL

keywords label distributions · annotation saturation · natural language inference · soft labels · entropy correlation · KL divergence · label smoothing · ChaosNLI

The pith

The annotator count needed to learn from label distributions depends on the evaluation metric.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper demonstrates that saturation in model performance on soft labels varies sharply by metric in a 3-class NLI task. Distributional match measured by KL divergence reaches most of its gains with roughly 10 annotators, while identifying items that elicit disagreement through entropy correlation continues to improve until 20-50 annotators are reached. The advantage of soft labels over label smoothing arises because only the former preserves item-specific differences between ambiguous and unambiguous cases. These patterns hold across multiple model seeds, two architectures, a non-NLI-pretrained baseline, and an exploratory content-safety evaluation.
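The contrast between soft labels and label smoothing can be made concrete with a small sketch. It assumes the common form of smoothing that mixes a one-hot target with the uniform distribution; the vote shares, the `eps` value, and the helper names are illustrative and not taken from the paper.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def smooth(hard_index, num_classes, eps):
    """Label smoothing: mix a one-hot target with the uniform distribution."""
    return [(1 - eps) * (1.0 if c == hard_index else 0.0) + eps / num_classes
            for c in range(num_classes)]

# Hypothetical annotator vote shares for two items (illustrative values only).
clear_item = [0.95, 0.03, 0.02]      # near-unanimous judgments
ambiguous_item = [0.45, 0.40, 0.15]  # annotators split

# Smoothed targets are built from each item's majority label only,
# so both items receive the same target distribution.
smoothed_clear = smooth(0, 3, eps=0.1)
smoothed_ambiguous = smooth(0, 3, eps=0.1)

# Entropy gap between the ambiguous and the clear item under each scheme.
soft_gap = entropy(ambiguous_item) - entropy(clear_item)
smoothed_gap = entropy(smoothed_ambiguous) - entropy(smoothed_clear)

print(f"soft-label entropy gap: {soft_gap:.3f}")
print(f"smoothed entropy gap:   {smoothed_gap:.3f}")
```

Under this assumption every smoothed target with the same `eps` has the same entropy, so the training signal carries no per-item ambiguity for the model to learn from.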

Core claim

In experiments that subsample label distributions from the 100-annotator ChaosNLI corpus, KL divergence between model outputs and true distributions saturates by N approximately 10, capturing 87-95 percent of total improvement, whereas correlation between predicted and true per-item entropy requires N in the 20-50 range to converge. Soft labels achieve higher entropy correlation (r = 0.643) than any of five label-smoothing intensities (r clustered at 0.45-0.49), because smoothing cannot separate ambiguous items from clear ones on a per-item basis.
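The two metrics can be sketched as follows. This is a minimal illustration, assuming KL is taken from the human distribution to the model distribution and that "entropy correlation" is the Pearson correlation of per-item entropies; the summary does not state either choice, and the toy distributions and function names are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0) when q has a zero entry."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def evaluate(human, predicted):
    """Return (mean per-item KL, correlation of per-item entropies)."""
    mean_kl = sum(kl_divergence(h, m) for h, m in zip(human, predicted)) / len(human)
    ent_r = pearson([entropy(h) for h in human], [entropy(m) for m in predicted])
    return mean_kl, ent_r

# Toy 3-class distributions (illustrative values, not data from the paper).
human = [[0.90, 0.05, 0.05], [0.40, 0.40, 0.20], [0.34, 0.33, 0.33]]
predicted = [[0.80, 0.10, 0.10], [0.50, 0.30, 0.20], [0.40, 0.30, 0.30]]

mean_kl, ent_r = evaluate(human, predicted)
print(f"mean KL = {mean_kl:.4f}, entropy correlation = {ent_r:.4f}")
```

The first number scores how closely each predicted distribution matches its target, while the second only asks whether the model ranks items by disagreement in the same order as the annotators, which is why the two can saturate at different N.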

What carries the argument

Subsampling of per-item label distributions from a fixed 100-annotator pool, evaluated separately by KL divergence for distributional fidelity and by entropy correlation for disagreement identification.
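A minimal sketch of the subsampling step, assuming annotations are drawn without replacement from each item's pool and converted to label shares. The label order, the grid of N values, the example pool, and the function name are assumptions made for illustration; the summary does not specify them.

```python
import random
from collections import Counter

LABELS = ("entailment", "neutral", "contradiction")

def subsample_distribution(annotations, n, rng):
    """Draw n of an item's annotations without replacement; return label shares."""
    drawn = rng.sample(annotations, n)
    counts = Counter(drawn)
    return [counts[label] / n for label in LABELS]

# Hypothetical item with 100 judgments (illustrative split, not real data).
pool = ["entailment"] * 55 + ["neutral"] * 35 + ["contradiction"] * 10

rng = random.Random(0)  # fixed seed so the draw is reproducible
soft_labels = {n: subsample_distribution(pool, n, rng)
               for n in (1, 5, 10, 20, 50, 100)}

for n, dist in soft_labels.items():
    print(n, dist)
```

At N = 100 the subsample is the full pool, so the soft label equals the reference distribution; smaller N gives noisier estimates of the same item-level shares.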

If this is right

Annotation budgets can be set lower when the target metric is distributional match than when it is disagreement identification.
Soft labels must be retained rather than replaced by smoothing if the goal includes recovering item-level ambiguity signals.
Models trained on limited annotations can still produce accurate distribution predictions even when they cannot yet rank disagreement items correctly.
The same metric-dependent pattern appears across DeBERTa, RoBERTa, a non-NLI baseline, and at least one cross-domain setting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Annotation planning tools could include metric-specific cost curves rather than a single recommended N.
The same subsampling method could be applied to other tasks that already possess large multi-annotator corpora to check whether saturation points differ by domain.
If real-time collection of new annotations produces different curves, the field would need separate guidelines for incremental versus retrospective annotation design.

Load-bearing premise

Subsampling from one large fixed pool of annotations produces the same saturation curves that would appear if smaller numbers of new independent annotations were collected for each item.

What would settle it

Collect fresh independent annotations in successive batches of 5, 10, 20, and 50 on a new set of NLI items and test whether the KL saturation point remains near 10 while entropy correlation saturation remains near 20-50.

Figures

Figures reproduced from arXiv: 2605.29797 by Guneet Kohli.

**Figure 2.** Annotation efficiency curve showing the ...

**Figure 3.** Annotation efficiency curve for all six metrics

Original abstract

When annotators disagree on a label, the disagreement itself carries signal -- and the number of annotators needed to capture it depends on the evaluation metric. We fine-tune NLI models on label distributions subsampled from ChaosNLI, a dataset providing 100 independent annotator judgments per item, and identify metric-dependent saturation. In our 3-class NLI setting, entropy correlation -- whether the model identifies which items elicit disagreement -- requires N ~ 20-50 annotators to converge, while distributional match (KL divergence) saturates by N ~ 10 (87-95% of improvement across five model seeds). This finding rests on a prior observation: soft labels carry item-specific signal that label smoothing cannot replicate. Across five smoothing intensities, entropy correlation clusters at r ~ 0.45-0.49, while soft labels reach r = 0.643 (p < 0.001); per-item analysis traces this gap to smoothing's inability to distinguish ambiguous items from clear ones. The soft-label advantage replicates across two architectures (DeBERTa, RoBERTa), a non-NLI-pretrained baseline, and an exploratory cross-domain evaluation on content safety. These results suggest that annotation budgets should be informed by the target evaluation metric rather than set uniformly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

Annotation needs for NLI soft labels depend on the metric, but the subsampling approach leaves the saturation numbers open to question.

Annotation saturation for these NLI models depends on the metric. KL divergence saturates by N around 10 while entropy correlation needs 20-50 annotators. Soft labels also give a clearer signal for item disagreement than label smoothing does.

The paper quantifies those saturation points by subsampling from ChaosNLI and compares soft labels to five smoothing intensities. It does well to check the pattern across two transformer architectures, a non-NLI-pretrained baseline, and an exploratory content-safety evaluation. The per-item analysis showing why smoothing misses the ambiguous items adds some insight.

The main concern is whether subsampling the existing 100 annotations really matches what happens when you collect fewer fresh ones. If the original annotators have any shared biases or the pool isn't fully i.i.d., the numbers could shift. The abstract also skips over the precise statistical handling and variance estimates, so the 87-95% claim is hard to assess without more.

This paper is aimed at NLP folks who collect annotations for tasks with natural disagreement. A reader thinking about how many labels to buy for different evaluation goals would get something out of it.

I would send it for peer review. The idea is straightforward and worth testing more carefully on other datasets or with actual variable-N collections.

Referee Report

2 major / 2 minor

Summary. The paper claims that in 3-class NLI, annotation saturation is metric-dependent: KL divergence for distributional match reaches 87-95% of its improvement by N~10 annotators, while entropy correlation (identifying disagreement-eliciting items) requires N~20-50 to converge. This is shown via subsampling from the 100-annotator ChaosNLI dataset, with soft labels outperforming label smoothing (r=0.643 vs. 0.45-0.49) across DeBERTa, RoBERTa, and other models; the authors conclude that annotation budgets should be metric-dependent.

Significance. If the central empirical results are robust, the work provides actionable guidance for efficient annotation allocation in disagreement-heavy NLP tasks, distinguishing signal capture for different metrics. The replication across architectures and seeds, plus the soft-label vs. smoothing comparison, adds empirical value; the finding that soft labels distinguish ambiguous items where smoothing cannot is a useful observation for label-distribution learning.

major comments (2)

[Methods (subsampling procedure)] Subsampling procedure (Methods section): The headline saturation results (KL at N~10; entropy correlation at N~20-50) are obtained by repeatedly drawing k annotations per item from the fixed 100-annotator ChaosNLI pool. This procedure assumes the 100 judgments are i.i.d. draws equivalent to independent collection of k fresh annotations; potential annotator-pool selection effects, fatigue, or item-specific expertise variation are not tested, leaving the metric-dependent budget recommendation dependent on an unvalidated equivalence.
[Results] Statistical reporting (Results section): Saturation points and percentage improvements (87-95%) are reported across five seeds without accompanying error bars, confidence intervals, or formal tests for convergence; the reader's note on unverifiable statistical methods indicates this detail is needed to support the precise N thresholds claimed.

minor comments (2)

The abstract and text would benefit from explicit discussion of how the subsampling procedure could be validated (e.g., via comparison to a smaller independent collection) to address the i.i.d. assumption.
Notation for the entropy correlation metric and KL divergence should be defined with equations in the main text rather than assumed from prior work.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We respond point-by-point to the major comments on the subsampling procedure and statistical reporting, indicating where revisions will be made to strengthen the manuscript.

Referee: [Methods (subsampling procedure)] Subsampling procedure (Methods section): The headline saturation results (KL at N~10; entropy correlation at N~20-50) are obtained by repeatedly drawing k annotations per item from the fixed 100-annotator ChaosNLI pool. This procedure assumes the 100 judgments are i.i.d. draws equivalent to independent collection of k fresh annotations; potential annotator-pool selection effects, fatigue, or item-specific expertise variation are not tested, leaving the metric-dependent budget recommendation dependent on an unvalidated equivalence.

Authors: We agree this is a valid methodological caveat. The subsampling approach is standard for leveraging fixed large annotation pools such as ChaosNLI, but it does rest on an untested equivalence to fresh independent annotations. In revision we will add explicit discussion of this assumption (including potential effects from annotator selection or fatigue) to the Methods and Limitations sections, clarifying that the metric-dependent budget guidance should be interpreted with this caveat in mind. revision: yes

Referee: [Results] Statistical reporting (Results section): Saturation points and percentage improvements (87-95%) are reported across five seeds without accompanying error bars, confidence intervals, or formal tests for convergence; the reader's note on unverifiable statistical methods indicates this detail is needed to support the precise N thresholds claimed.

Authors: We accept that variability measures and clearer convergence criteria would improve transparency. The original submission reported trends across seeds but omitted error bars. In the revised manuscript we will add standard deviations across the five seeds to the relevant figures and tables, and include a short description of the convergence heuristic (percentage of total improvement achieved) used to identify the reported N thresholds. revision: yes
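The convergence heuristic named in this response (percentage of total improvement achieved) could be computed along the following lines. This is a sketch under stated assumptions: improvement is measured from the smallest N to the full pool, and the per-seed KL values are made up for illustration and are not the paper's numbers.

```python
from statistics import mean, stdev

def fraction_of_improvement(curve, n):
    """Share of the total change (smallest N to largest N) already achieved at n."""
    ns = sorted(curve)
    start, end = curve[ns[0]], curve[ns[-1]]
    return (curve[n] - start) / (end - start)

# Hypothetical mean KL per annotator count for three seeds (lower is better).
kl_by_seed = [
    {1: 0.50, 5: 0.20, 10: 0.14, 20: 0.120, 50: 0.105, 100: 0.10},
    {1: 0.52, 5: 0.22, 10: 0.15, 20: 0.125, 50: 0.110, 100: 0.10},
    {1: 0.48, 5: 0.19, 10: 0.13, 20: 0.115, 50: 0.100, 100: 0.10},
]

# Fraction of the total KL reduction reached by N = 10, one value per seed.
fractions = [fraction_of_improvement(curve, 10) for curve in kl_by_seed]
summary = (mean(fractions), stdev(fractions))

print([round(f, 3) for f in fractions])
print(f"mean = {summary[0]:.3f}, std across seeds = {summary[1]:.3f}")
```

Reporting the standard deviation next to the mean gives the kind of per-seed variability the referee asks for, and the same function works for metrics where higher is better because it normalizes by the signed total change.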

Circularity Check

0 steps flagged

No circularity: empirical results from external dataset subsampling

The paper derives its saturation findings (KL divergence saturating at N~10 vs. entropy correlation at N~20-50) solely through repeated subsampling of the external ChaosNLI dataset (100 annotations per item) followed by standard fine-tuning and metric evaluation across model seeds. No equations define a quantity in terms of itself, no fitted parameters are relabeled as predictions, and no self-citations supply load-bearing uniqueness theorems or ansatzes. The central claims rest on direct comparison of empirical curves rather than any reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on empirical observations from subsampled data from an existing dataset and standard ML training practices rather than new theoretical constructs, fitted parameters, or invented entities.

axioms (2)

domain assumption: Subsampling from a fixed set of 100 annotations per item is representative of varying annotation counts.
The paper uses this to simulate different N without new data collection.
standard math: Standard fine-tuning procedures and loss functions for NLI models on label distributions.
Assumes cross-entropy or similar distributional training is appropriate.
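For the second axiom, a cross-entropy loss against a soft target can be sketched as below. This is an assumption about the general form of such a loss rather than the paper's training code; the logits, target, and function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

# Hypothetical soft label for one item (illustrative values only).
target = [0.5, 0.3, 0.2]

# Logits whose softmax equals the target exactly, versus uninformative logits.
loss_match = soft_cross_entropy([math.log(t) for t in target], target)
loss_uniform = soft_cross_entropy([0.0, 0.0, 0.0], target)

print(f"loss when prediction equals target: {loss_match:.4f}")
print(f"loss for a uniform prediction:      {loss_uniform:.4f}")
```

Because this loss equals the target's entropy plus KL(target || prediction), it is minimized when the predicted distribution matches the soft label, and it reduces to ordinary cross-entropy when the target is one-hot.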
