Don't Label Twice: Quantity Beats Quality when Comparing Binary Classifiers on a Budget
Pith reviewed 2026-05-24 03:48 UTC · model grok-4.3
The pith
To identify the better of two binary classifiers, spending a labeling budget on one noisy label per sample beats collecting multiple labels per sample.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under a budget of noisy labels and an independent noise model, the probability of correctly ranking two binary classifiers is strictly higher when the budget is used to obtain one label on each of many samples than when it is used to obtain multiple labels on fewer samples, as established by a large-deviations analysis via Cramér's theorem.
What carries the argument
Cramér's theorem applied to the large-deviation rate functions of the empirical accuracy estimators for the two classifiers under the independent label-noise model.
If this is right
- Benchmark construction should favor labeling more unique examples rather than repeated labeling of the same examples.
- Sample-size calculations for pairwise classifier comparisons can use the new bounds, which improve on those derived from Hoeffding's inequality.
- Label aggregation via majority vote is not optimal when the sole goal is to rank two classifiers.
- The ordering of labeling strategies reverses if the objective shifts from ranking to estimating absolute accuracy.
Where Pith is reading between the lines
- The same quantity-over-quality logic may apply to other pairwise ranking tasks that rely on noisy observations, such as A/B testing with noisy user feedback.
- Crowdsourcing platforms could redesign task allocation to prioritize breadth over depth when the downstream use is model comparison.
- If label noise turns out to be correlated across samples, the single-label strategy could lose its advantage and repeated labeling might regain value.
Load-bearing premise
The noisy labels for different samples are independent and follow the paper's specific noise model so that the large-deviation rates govern the comparison error.
What would settle it
A simulation or crowdsourced experiment, under the paper's noise model, in which the probability of correctly identifying the superior classifier is higher when the budget is spent on repeated labels than on single labels across more samples.
Figures
read the original abstract
We study how to best spend a budget of noisy labels to compare the accuracy of two binary classifiers. It's common practice to collect and aggregate multiple noisy labels for a given data point into a less noisy label via a majority vote. We prove a theorem that runs counter to conventional wisdom. If the goal is to identify the better of two classifiers, we show it's best to spend the budget on collecting a single label for more samples. Our result follows from a non-trivial application of Cram\'er's theorem, a staple in the theory of large deviations. We discuss the implications of our work for the design of machine learning benchmarks, where they overturn some time-honored recommendations. In addition, our results provide sample size bounds superior to what follows from Hoeffding's bound.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that, when allocating a fixed budget of noisy labels to identify which of two binary classifiers has higher accuracy, it is optimal to collect one label per sample across as many samples as possible rather than obtaining multiple labels per sample and aggregating them (e.g., by majority vote). The result is derived from a non-trivial application of Cramér's theorem to the large-deviation rate of the event that the empirical accuracy of the worse classifier exceeds that of the better one; the per-sample random variable depends on the number k of labels collected per point, and the rate is shown to be maximized at k=1 under the paper's noise model.
Significance. If the central theorem holds under the stated assumptions, the result supplies sample-size bounds tighter than those obtained from Hoeffding's inequality and directly challenges conventional practice in benchmark construction. It also supplies a concrete, falsifiable prediction about the relative value of label quantity versus label quality that can be tested on real annotation pipelines.
major comments (2)
- [Theorem and proof (likely §3–4)] The application of Cramér's theorem (presumably in the proof of the main result) is performed on the empirical means of per-sample random vectors whose distribution depends on k. The manuscript must explicitly state the noise model (independence of labels, fixed flip probability, etc.) in the section containing the theorem statement and verify that the resulting rate function I_k satisfies I_1 > I_k for k>1; without these details the inequality cannot be checked.
- [Noise model and discussion of assumptions] The skeptic concern is load-bearing: if labels for the same point are correlated (same annotator, batch effects, etc.), the effective per-sample law changes and the ordering of the rates may reverse. The paper should either prove robustness or add a remark quantifying how much correlation is tolerable before the k=1 optimum fails.
minor comments (2)
- [Introduction or final discussion] Add a short paragraph contrasting the new bounds with the Hoeffding-derived bounds mentioned in the abstract, including the explicit functional form of the improvement.
- [Problem formulation] Clarify the precise budget constraint (total labels B fixed, or total samples n fixed?) and how the comparison is performed when the two classifiers are evaluated on the same labeled set.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address the major points below.
read point-by-point responses
-
Referee: [Theorem and proof (likely §3–4)] The application of Cramér's theorem (presumably in the proof of the main result) is performed on the empirical means of per-sample random vectors whose distribution depends on k. The manuscript must explicitly state the noise model (independence of labels, fixed flip probability, etc.) in the section containing the theorem statement and verify that the resulting rate function I_k satisfies I_1 > I_k for k>1; without these details the inequality cannot be checked.
Authors: We agree. In the revised manuscript we will state the noise model (independent labels, each with fixed flip probability) explicitly in the theorem section and include a verification (or appendix reference) that the rate function satisfies I_1 > I_k for k>1. revision: yes
-
Referee: [Noise model and discussion of assumptions] The skeptic concern is load-bearing: if labels for the same point are correlated (same annotator, batch effects, etc.), the effective per-sample law changes and the ordering of the rates may reverse. The paper should either prove robustness or add a remark quantifying how much correlation is tolerable before the k=1 optimum fails.
Authors: We will add a remark in the discussion section acknowledging the independence assumption and noting that correlations could reverse the ordering. A precise quantification of the tolerable correlation threshold lies outside the scope of the present analysis. revision: partial
- Precise quantification of the correlation level at which the k=1 optimum fails.
Circularity Check
No significant circularity; derivation applies external Cramér theorem
full rationale
The paper derives its central comparison of large-deviation rates by applying Cramér's theorem (a standard, externally established result in large deviations theory) to the empirical means of per-sample random vectors whose distribution is fixed by the stated i.i.d. label-noise model. This application yields the rate function I_k and the inequality rate(k=1) > rate(k>1) as a mathematical consequence rather than by redefinition, fitting, or self-citation. No load-bearing step reduces to the paper's own inputs by construction; the noise-model assumptions are explicit and the theorem is independent. The result is therefore self-contained against the external benchmark of Cramér's theorem.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Aroyo, L. and Welty, C. Crowd truth: Harnessing disagreement in crowdsourcing a relation extraction gold standard. WebSci2013. ACM, 2013(2013),
work page 2013
-
[2]
A large annotated corpus for learning natural language inference
Bowman, S. R., Angeli, G., Potts, C., and Manning, C. D. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326 ,
work page internal anchor Pith review Pith/arXiv arXiv
- [3]
-
[4]
Cheplygina, V. and Pluim, J. P. Crowd disagreement about medical images is informative. In Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis: 7th Joint International Workshop, CVII-STENT 2018 and Third International Workshop, LABELS 2018, Held in Conjunction with MICCAI 2018, Gran...
work page 2018
-
[5]
Denton, E., D´ ıaz, M., Kivlichan, I., Prabhakaran, V., and Rosen, R. Whose ground truth? accounting for individual and collective identities underlying dataset annotation. arXiv preprint arXiv:2112.04554,
-
[6]
E., Peychev, M., Konstantinov, N., Goel, N., Ash, E., and Vechev, M
Dorner, F. E., Peychev, M., Konstantinov, N., Goel, N., Ash, E., and Vechev, M. Human-guided fair classification for natural language processing. arXiv preprint arXiv:2212.10154 ,
-
[7]
Gordon, M. L., Lam, M. S., Park, J. S., Patel, K., Hancock, J., Hashimoto, T., and Bernstein, M. S. Jury learning: Integrating dissenting voices into machine learning models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems , pp. 1–19,
work page 2022
- [8]
-
[9]
R., Stevens, K., Barhoum, A., Duc, N
K¨ opf, A., Kilcher, Y., von R¨ utte, D., Anagnostidis, S., Tam, Z.-R., Stevens, K., Barhoum, A., Duc, N. M., Stanley, O., Nagyfi, R., et al. Openassistant conversations–democratizing large language model alignment. arXiv preprint arXiv:2304.07327 ,
-
[10]
Mania, H. and Sra, S. Why do classifier accuracies show linear trends under distribution shift? arXiv preprint arXiv:2012.15483,
-
[11]
Marelli, M., Bentivogli, L., Baroni, M., Bernardi, R., Menini, S., and Zamparelli, R. Semeval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In Proceedings of the 8th international workshop on semantic evaluation (SemEval 2014), pp. 1–8,
work page 2014
-
[12]
On Releasing Annotator-Level Labels and Information in Datasets
Prabhakaran, V., Davani, A. M., and Diaz, M. On releasing annotator-level labels and information in datasets. arXiv preprint arXiv:2110.05699 ,
-
[13]
Ramponi, A. and Leonardelli, E. Dh-fbk at semeval-2022 task 4: leveraging annotators’ disagreement and multiple data views for patronizing language detection. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022) , pp. 324–334,
work page 2022
- [14]
-
[15]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhar- gava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288,
work page internal anchor Pith review Pith/arXiv arXiv
-
[16]
Williams, A., Nangia, N., and Bowman, S. R. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
Proof. We use an upper bound version of Stirling’s approximation based on Theorem 2.6 in (Stanica, 2001): m − 1 m−1 2 < 4 m−1 2 q π m−1 2 , 18 the fact that q(1 − q) is maximized at q = 0.5 and the monontonicity of 1√ k to estimate 2σ(m, q) = 2 m−2X k uneven k ⌈ k 2 ⌉ q⌈ k 2 ⌉(1 − q)⌈ k 2 ⌉ = 2 m−2X k uneven k k+1 2 q k+1 2 (1 − q) k+1 2 = 2 m−2X k uneven...
work page 2001
-
[18]
For G as defined as in Section 3.2, m > 1 and qb, qw, pw, p0 b, p1 b fixed such that assumption 1 holds, there exist an N ∈ N such that for n > N P nX i Gi(Mm(q), p) > 0 ! < P mnX i Gi(q, p) > 0 ! . Under assumption 2, this implies that the single label strategy outperforms the m−label strategy for these n. The proof of theorem 2 is based on Cram´ er’s Th...
work page 2013
-
[19]
2n n Z q 0 xn(1 − x)ndx (Boland et al., 1989). Correspondingly, (11) holds if and only if (1 − 2Mm(q))m m − 1 m−1 2 q m−1 2 (1 − q) m−1 2 ≥ m(1 − 2q) p f ∗(q) 2 p f ∗(q) + c p g∗(q) 2 p g∗(q) + c . 24 For 0.5 < q < 1, this is equivalent to 2Mm(q) − 1 2q − 1 m − 1 m−1 2 q m−1 2 (1 − q) m−1 2 ≤ p f ∗(q) 2 p f ∗(q) + c p g∗(q) 2 p g∗(q) + c . (12) Lemma D.1....
work page 1989
-
[20]
Then c1A−c2B c1C−c2D ≤ A C is true if and only if CB ≥ DA. 29 Proof. c1A − c2B c1C − c2D ≤ A C ⇐ ⇒ c1A − c2B ≥ A(c1C − c2D) C ⇐ ⇒ C(c1A − c2B) ≤ A(c1C − c2D) ⇐ ⇒ c1CA − c2CB ≤ c1CA − c2DA ⇐ ⇒ −c2CB ≤ −c2DA ⇐ ⇒ CB ≥ DA We set A = (1 − 2Mm(q + δ)), B = (1 − 2Mm(q)), C = (1 − 2(q + δ)), D = (1 − 2q), such that CB ≥ DA is equivalent to (1 − 2(q + δ))(1 − 2Mm(...
work page 1989
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.