Recognition: 3 theorem links
· Lean theorem
Two Calls, Two Moments, and the Vote-Accuracy Curve of Repeated LLM Inference
Pith reviewed 2026-05-08 18:39 UTC · model grok-4.3
The pith
Two labeled calls suffice to produce sharp distribution-free bounds on majority-vote accuracy for any LLM sampling budget.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
From the first two moments of the latent success probability q, every fixed majority-vote budget admits a sharp distribution-free interval. The infinite-dimensional moment problem is solved exactly by three-atom extremizers and quadratic dual certificates for each finite budget. The three-vote case has a closed form with width at most 1/8 and a certified-improvement criterion; the infinite-vote endpoint is also sharply bounded yet remains sensitive to latent mass near q = 1/2. Maximum-entropy and latent-difficulty Gaussian-probit completions supply point estimates inside the intervals, and empirical voting accuracies on QNLI and QQP fall inside the projected two-call regions.
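For orientation, the three-vote case makes the reduction concrete. The sketch below is standard probability, not a quotation from the paper; the notation m_k for the k-th moment of q is ours and the paper's symbols may differ.

```latex
% Majority of three conditionally i.i.d. calls, each correct with probability q:
A_3(q) = 3q^2(1-q) + q^3 = 3q^2 - 2q^3 ,
\qquad
\mathbb{E}\!\left[A_3\right] = 3m_2 - 2m_3 , \quad m_k := \mathbb{E}[q^k].
% Two labeled calls identify m_1 and m_2; the only remaining freedom is m_3,
% and extremizing m_3 over distributions on [0,1] with the given (m_1, m_2)
% yields the sharp two-call interval for three-vote accuracy.
```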
What carries the argument
The two-moment problem for the binary correctness layer under conditional i.i.d. sampling, solved via three-atom extremal distributions and quadratic dual certificates.
If this is right
- Every majority-vote budget receives exact two-call bounds.
- Three-vote accuracy has closed form and a certified-improvement test.
- Infinite-vote accuracy is sharply bounded but threshold-sensitive.
- Maximum-entropy and Gaussian-probit point completions tighten the intervals.
- Observed accuracies on QNLI and QQP remain inside the projected regions.
Where Pith is reading between the lines
- Two-call probes could guide allocation of test-time compute without requiring full distributional knowledge.
- Temperature changes or model mixtures can produce voting gains not ordered by single-call accuracy.
- The same two-moment reduction may apply to non-binary or multi-class correctness layers.
- Practitioners could use the certified-improvement criterion to decide when extra votes are worthwhile on a given task.
Load-bearing premise
Repeated calls are conditionally independent and identically distributed given the latent per-example success probability q.
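This premise can be exercised in a small simulation (a sketch under an assumed Beta(2, 2) latent distribution, which is our illustrative choice, not the paper's): under conditional i.i.d. calls, the rate at which two labeled calls are both correct recovers the second moment of q, and hence the same-example correctness correlation.

```python
import random

random.seed(0)

# Conditional-i.i.d. model: each example carries a latent success
# probability q; repeated calls are independent Bernoulli(q) given q.
# Beta(2, 2) is an assumed latent distribution for illustration only.
N = 200_000
pairs = []
for _ in range(N):
    q = random.betavariate(2, 2)          # latent per-example difficulty
    a = 1 if random.random() < q else 0   # labeled call 1
    b = 1 if random.random() < q else 0   # labeled call 2
    pairs.append((a, b))

m1 = sum(a for a, _ in pairs) / N        # estimates E[q]   (one labeled call)
m2 = sum(a * b for a, b in pairs) / N    # estimates E[q^2] (both calls correct)
rho = (m2 - m1 * m1) / (m1 * (1 - m1))   # same-example correctness correlation

# For Beta(2,2): E[q] = 1/2, E[q^2] = 1/20 + 1/4 = 0.3, so rho = 0.05/0.25 = 0.2.
print(round(m1, 2), round(m2, 2), round(rho, 2))
```

The correlation rho is exactly the quantity the pith says separates stable errors (rho near 1) from recoverable call-level randomness (rho near 0).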
What would settle it
Compute the two-call moments on a dataset, derive the predicted interval for three-vote accuracy, then perform three-vote inference on the same dataset and verify whether the observed accuracy falls outside the interval.
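That settling experiment can be sketched in a few lines. The paper's exact three-atom certificates are not reproduced here; instead this uses the classical two-point principal representations for the third moment given two moments on [0, 1] (in the spirit of Karlin and Studden), which is our reconstruction and already reproduces the 1/8 width cap quoted in this review.

```python
def three_vote_interval(m1, m2):
    """Interval for E[3q^2 - 2q^3] given the first two moments of q on [0, 1].

    Sketch only: uses classical lower/upper principal representations for
    the third moment (atoms {0, m2/m1} and {x, 1}), not the paper's
    quadratic dual certificates.
    """
    assert 0 < m1 < 1 and m1 * m1 <= m2 <= m1  # moment feasibility on [0, 1]
    m3_min = m2 * m2 / m1                      # Cauchy-Schwarz; atoms {0, m2/m1}
    # Same bound applied to 1 - q: E[(1-q)^3] >= E[(1-q)^2]^2 / E[1-q].
    m3_max = 1 - 3 * m1 + 3 * m2 - (1 - 2 * m1 + m2) ** 2 / (1 - m1)
    lower = 3 * m2 - 2 * m3_max               # E[A3] = 3*m2 - 2*m3
    upper = 3 * m2 - 2 * m3_min
    return lower, upper

# Worst case of the identity U - L = 2*mu*(1-mu)*rho*(1-rho):
# mu = 1/2, rho = 1/2 gives m2 = 0.375 and width exactly 1/8.
lo, hi = three_vote_interval(0.5, 0.375)
print(lo, hi, hi - lo)  # -> 0.4375 0.5625 0.125
```

If three-vote inference on the same dataset lands outside [lo, hi], either the conditional-i.i.d. premise or the moment estimates are at fault.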
Original abstract
Repeated sampling is a standard way to spend test-time compute, but its benefit is controlled by the latent distribution of correctness across examples, not by one-call accuracy alone. We study the binary correctness layer of repeated LLM inference under conditional-i.i.d. calls. One labeled call identifies the mean latent success probability; two labeled calls identify its second moment and hence the same-example correctness correlation that separates stable errors from recoverable call-level randomness. From these two moments, every fixed majority-vote budget has a sharp distribution-free two-call interval. The key technical reduction is that the infinite-dimensional moment problem has three-atom extremizers and quadratic dual certificates for every finite budget, so the bounds are exact rather than discretized or parametric. The first useful budget, three votes, has a closed form, width at most $1/8$, and a certified-improvement criterion. The infinite-vote endpoint is the limit of majority voting as the number of calls tends to infinity; it is also sharply bounded, but remains threshold-sensitive because it depends on latent mass around $q=1/2$. We add maximum-entropy and latent-difficulty Gaussian-probit point completions, and experiments on LLM calls over QNLI and QQP show that empirical three- and five-vote accuracies are contained in the projected two-call regions while temperature changes and randomized model mixtures can create voting gains not ordered by one-call accuracy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that, under the conditional-i.i.d. model for repeated LLM calls given a latent per-example success probability q, two labeled calls suffice to identify the first two moments of the distribution of q. These moments determine sharp, distribution-free intervals for the accuracy of majority voting at any fixed budget via reduction to an infinite-dimensional moment problem on [0,1]. The reduction is asserted to admit three-atom extremizers together with quadratic dual certificates, yielding exact (non-relaxed) bounds. A closed-form expression is given for the three-vote case (width at most 1/8) along with a certified-improvement criterion; the infinite-vote limit is also bounded but remains sensitive to mass near q=1/2. Maximum-entropy and Gaussian-probit completions are supplied as point estimates, and experiments on QNLI and QQP are reported to show that empirical three- and five-vote accuracies lie inside the projected two-call intervals.
Significance. If the central reduction holds, the work supplies a principled, low-cost method for quantifying the value of repeated sampling in LLMs by separating stable errors from recoverable call-level noise. The distribution-free character of the intervals, their exactness via three-atom extremal measures, and the closed-form three-vote result constitute a clear technical advance over parametric or simulation-based approaches. The empirical containment on standard benchmarks and the analysis of the infinite-vote endpoint add practical relevance for test-time compute allocation.
major comments (2)
- [moment-problem reduction (abstract, §4)] The central technical claim (abstract and §4) is that the moment problem admits three-atom extremizers and quadratic dual certificates for every finite budget, delivering exact rather than relaxed bounds. The manuscript must exhibit the explicit dual-certificate construction (including the quadratic form and the verification that it certifies the optimum for the majority-vote objective) so that the asserted exactness can be checked; without this, the reduction remains a statement rather than a demonstrated result.
- [three-vote case (abstract, §5)] The three-vote closed form (abstract) is stated to have width at most 1/8 and to supply a certified-improvement criterion. The derivation of this closed form from the three-atom extremizer must be supplied in full, together with the algebraic verification that the width bound holds uniformly over all feasible first- and second-moment pairs.
minor comments (2)
- [experiments] The projection of the two-call moment intervals onto finite-budget accuracies (experiments section) should include an explicit algorithmic description or pseudocode for how the bounds are computed from the observed moments, to facilitate reproduction.
- [preliminaries] Notation for the latent variable q and the majority-vote accuracy functional should be introduced once in a dedicated preliminary section and used consistently thereafter.
Simulated Author's Rebuttal
We thank the referee for the thorough review and the recommendation of minor revision. The comments highlight important points for strengthening the presentation of the technical results. We address each major comment below and will revise the manuscript accordingly.
Point-by-point responses
-
Referee: [moment-problem reduction (abstract, §4)] The central technical claim (abstract and §4) is that the moment problem admits three-atom extremizers and quadratic dual certificates for every finite budget, delivering exact rather than relaxed bounds. The manuscript must exhibit the explicit dual-certificate construction (including the quadratic form and the verification that it certifies the optimum for the majority-vote objective) so that the asserted exactness can be checked; without this, the reduction remains a statement rather than a demonstrated result.
Authors: We agree that providing the explicit dual-certificate construction is essential to substantiate the claim of exact bounds. In the revised version, we will augment Section 4 with a detailed construction of the quadratic dual certificate for the majority-vote objective. This will include the specific quadratic form in terms of the moments and a verification step showing that the dual objective equals the value attained by the three-atom extremal measure for any feasible first and second moments. We believe this will fully demonstrate the exactness of the reduction. revision: yes
-
Referee: [three-vote case (abstract, §5)] The three-vote closed form (abstract) is stated to have width at most 1/8 and to supply a certified-improvement criterion. The derivation of this closed form from the three-atom extremizer must be supplied in full, together with the algebraic verification that the width bound holds uniformly over all feasible first- and second-moment pairs.
Authors: We will expand the presentation in Section 5 to provide the complete derivation of the closed-form bounds from the three-atom extremizer. This will detail the optimization steps leading to the explicit expressions for the lower and upper bounds on three-vote accuracy. Additionally, we will include the algebraic verification that the width of these bounds is at most 1/8 for all pairs of moments (m1, m2) satisfying the feasibility constraints (i.e., 0 ≤ m2 ≤ m1 ≤ 1 and m2 ≥ m1²). The verification proceeds by parameterizing the feasible region and showing the bound holds by direct (if tedious) computation or by analyzing the extremal cases. revision: yes
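For what it is worth, the uniform width cap follows in one line from the interval-width identity quoted elsewhere on this page. The verification below is our paraphrase, with μ the latent mean and ρ the same-example correlation, not the authors' derivation.

```latex
U_1 - L_1 \;=\; 2\,\mu(1-\mu)\,\rho(1-\rho)
\;\le\; 2\cdot\tfrac{1}{4}\cdot\tfrac{1}{4} \;=\; \tfrac{1}{8},
% since t(1-t) \le 1/4 for all t in [0,1], with equality iff t = 1/2;
% the cap is attained exactly at \mu = \rho = 1/2.
```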
Circularity Check
No significant circularity in derivation chain
full rationale
The paper identifies the mean and second moment of the latent success probability q from one and two labeled calls respectively, then reduces the problem of bounding majority-vote accuracy for any fixed budget to a moment problem on [0,1]. It asserts that this infinite-dimensional problem admits three-atom extremizers together with quadratic dual certificates, yielding exact (non-relaxed) intervals. This reduction is presented as a technical fact about the moment problem under the stated conditional-i.i.d. model; it does not rename a fitted parameter as a prediction, define a quantity in terms of itself, or rely on a load-bearing self-citation whose content is unverified. The three-vote closed form and infinite-vote endpoint are derived consequences of the same moment reduction rather than inputs. The maximum-entropy and Gaussian-probit completions are explicitly labeled as separate point estimates. No step in the described chain reduces by construction to its own inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Repeated LLM calls are conditionally independent and identically distributed given the latent per-example success probability q.
- domain assumption: Correctness per call is binary.
Lean theorems connected to this paper
-
Cost.FunctionalEquation.washburn_uniqueness_aczel (for contrast): not invoked or paralleled; the 1/8 here comes from polynomial moment extremization, not J-cost.
Tag: unclear (the relation between the paper passage and the cited Recognition theorem is ambiguous).
Paper passage: U_1(μ,ν) − L_1(μ,ν) = 2μ(1−μ)ρ(1−ρ) ≤ 1/8
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
When Can Voting Help, Hurt, or Change Course? Exact Structure of Binary Test-Time Aggregation
The voting curve from repeated binary predictions is exactly equivalent to a signed voting signature capturing excess latent mass above the majority threshold at binomial variance scales, via signed Hausdorff moments.
Reference graph
Works this paper leans on
-
[1]
Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs
P. Aggarwal, A. Madaan, Y. Yang, and Mausam. Let's sample step by step: Adaptive-consistency for efficient reasoning and coding with LLMs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12375–12396, Singapore, 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.761. URL https:...
-
[2]
J. H. Albert. Bayesian estimation of normal ogive item response curves using Gibbs sampling. Journal of Educational Statistics, 17(3):251–269, 1992. doi: 10.3102/10769986017003251. URL https://doi.org/10.3102/10769986017003251
-
[3]
B. Atıl, S. Aykent, A. Chittams, L. Fu, R. J. Passonneau, E. Radcliffe, G. R. Rajagopal, A. Sloan, T. Tudrej, F. Ture, Z. Wu, L. Xu, and B. Baldwin. Non-determinism of “deterministic” LLM system settings in hosted environments. In Proceedings of the 5th Workshop on Evaluation and Comparison of NLP Systems, pages 135–148, Mumbai, India, 2025. Associatio...
-
[4]
D. Bertsimas and I. Popescu. Optimal inequalities in probability theory: A convex optimization approach. SIAM Journal on Optimization, 15(3):780–804, 2005. doi: 10.1137/S1052623401399903. URL https://doi.org/10.1137/S1052623401399903
-
[5]
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
B. Brown, J. Juravsky, R. S. Ehrlich, R. Clark, Q. V. Le, C. Ré, and A. Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024. doi: 10.48550/arXiv.2407.21787. URL https://arxiv.org/abs/2407.21787
-
[6]
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Her...
-
[7]
S. Farquhar, J. Kossen, L. Kuhn, and Y. Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017):625–630, 2024. doi: 10.1038/s41586-024-07421-0. URL https://doi.org/10.1038/s41586-024-07421-0
-
[8]
R. J. Gallo, M. Baiocchi, T. R. Savage, and J. H. Chen. Establishing best practices in large language model research: An application to repeat prompting. Journal of the American Medical Informatics Association, 32(2):386–390, 2025. doi: 10.1093/jamia/ocae294. URL https://doi.org/10.1093/jamia/ocae294
-
[9]
E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620–630, 1957. doi: 10.1103/PhysRev.106.620. URL https://doi.org/10.1103/PhysRev.106.620
-
[11]
Tchebycheff Systems: With Applications in Analysis and Statistics
S. Karlin and W. J. Studden. Tchebycheff Systems: With Applications in Analysis and Statistics. Interscience Publishers, New York, 1966.
-
[12]
L. Kuhn, Y. Gal, and S. Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=VD-AYtP0dve
-
[13]
F. M. Lord. Applications of Item Response Theory to Practical Testing Problems. Lawrence Erlbaum Associates, Hillsdale, NJ, 1980.
-
[14]
P. Manakul, A. Liusie, and M. J. F. Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004–9017, Singapore, 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.557. URL https...
-
[15]
R. Nowak. Estimating the self-consistency of LLMs. arXiv preprint arXiv:2509.19489, 2025. doi: 10.48550/arXiv.2509.19489. URL https://arxiv.org/abs/2509.19489
-
[16]
Generate a completion
Ollama. Generate a completion. Documentation, 2026. URL https://docs.ollama.com/api/generate. Accessed 2026-05-03.
-
[17]
llama3.1:8b model page
Ollama. llama3.1:8b model page. Model documentation, 2026. URL https://ollama.com/library/llama3.1:8b. Accessed 2026-05-03.
-
[18]
phi4-mini model page
Ollama. phi4-mini model page. Model documentation, 2026. URL https://ollama.com/library/phi4-mini. Accessed 2026-05-03.
-
[19]
qwen2.5:7b model page
Ollama. qwen2.5:7b model page. Model documentation, 2026. URL https://ollama.com/library/qwen2.5:7b. Accessed 2026-05-03.
-
[20]
I. Pinelis. On the extreme points of moments sets. Mathematical Methods of Operations Research, 83(3):325–349, 2016. doi: 10.1007/s00186-015-0530-0. URL https://doi.org/10.1007/s00186-015-0530-0
-
[21]
SQuAD: 100,000+ questions for machine comprehension of text
P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas, 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1264. URL https://aclanthology.org/D16-1264/
-
[22]
T. Savage, J. Wang, R. Gallo, A. Boukil, V. Patel, S. A. A. Safavi-Naini, A. Soroush, and J. H. Chen. Large language model uncertainty proxies: Discrimination and calibration for medical diagnosis and treatment. Journal of the American Medical Informatics Association, 32(1):139–149, 2025. doi: 10.1093/jamia/ocae254. URL https://doi.org/10.1093/jamia/ocae254
-
[23]
A. Taubenfeld, T. Sheffer, E. Ofek, A. Feder, A. Goldstein, Z. Gekhman, and G. Yona. Confidence improves self-consistency in LLMs. In Findings of the Association for Computational Linguistics: ACL 2025, pages 20090–20111, Vienna, Austria, 2025. Association for Computational Linguistics. doi: 10.18653/v1/2025.findings-acl.1030. URL https://aclanthology...
-
[24]
Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph
R. Vashurin, E. Fadeeva, A. Vazhentsev, L. Rvanova, D. Vasilev, A. Tsvigun, S. Petrakov, R. Xing, A. Sadallah, K. Grishchenkov, A. Panchenko, T. Baldwin, P. Nakov, M. Panov, and A. Shelmanov. Benchmarking uncertainty quantification methods for large language models with LM-Polygraph. Transactions of the Association for Computational Linguistics, 13:220–248, 2025. doi: 10.1162/tacl_a_00737. URL https://aclanthology.org/2025.tacl-1.11/
-
[26]
A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJ4km2R5t7
-
[27]
X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=1PL1NIMMrw
-
[28]
Z. Wang, J. Duan, L. Cheng, Y. Zhang, Q. Wang, X. Shi, K. Xu, H. T. Shen, and X. Zhu. ConU: Conformal uncertainty in large language models with correctness coverage guarantees. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 6886–6898, Miami, Florida, USA, 2024. Association for Computational Linguistics. doi: 10.18653/v1/2...
-
[29]
G. Winkler. Extreme points of moment sets. Mathematics of Operations Research, 13(4):581–587, 1988. doi: 10.1287/moor.13.4.581. URL https://doi.org/10.1287/moor.13.4.581
-
[30]
Q. Xiao, D. Bhattacharjya, B. Ganesan, R. Marinescu, K. Mirylenka, N. H. Pham, M. Glass, and J. Lee. The consistency hypothesis in uncertainty quantification for large language models. In Proceedings of the Forty-First Conference on Uncertainty in Artificial Intelligence, volume 286 of Proceedings of Machine Learning Research, pages 4636–4651. PMLR, 2025. U...