pith. machine review for the scientific record.

arxiv: 2605.05873 · v1 · submitted 2026-05-07 · 📊 stat.ML · cs.AI · cs.LG · math.ST · stat.ME · stat.TH

Recognition: unknown

CITE: Anytime-Valid Statistical Inference in LLM Self-Consistency

Hirofumi Ota, Junpei Komiyama, Masaaki Imaizumi, Naoto Iwase, Yuki Ichihara

Pith reviewed 2026-05-08 05:24 UTC · model grok-4.3

classification 📊 stat.ML · cs.AI · cs.LG · math.ST · stat.ME · stat.TH

keywords anytime-valid inference · LLM self-consistency · mode certification · e-processes · sequential testing · intersection-union testing · false certification control · data-dependent stopping

The pith

The CITE algorithm certifies that a prespecified answer is the unique mode of LLM responses with error control that holds for any data-driven stopping rule and without knowing the set of possible answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a statistical method to certify, with rigorous guarantees, that one chosen answer is the most common output from an LLM when multiple responses are sampled. This matters for self-consistency techniques because users often stop sampling once the answers look consistent, yet the decision to stop depends on the data itself and the full list of possible answers is unknown in advance. The proposed CITE procedure uses e-processes inside an intersection-union test to deliver false-certification control at any chosen level while also giving a bound on the number of samples needed that does not grow with the size of the answer set.

Core claim

The authors introduce the Certification by Intersection-union Testing with E-processes (CITE) algorithm. It provably bounds the probability of falsely declaring a target answer to be the unique mode of the response distribution at any prescribed level alpha, and the bound holds for every possible data-dependent stopping time. The construction requires no prior knowledge of the full answer category set and yields a stopping-time upper bound whose leading term is independent of that set size; matching minimax lower bounds are also shown in the main regime, and the same framework is extended to confidence-weighted voting.

What carries the argument

The CITE algorithm, which performs intersection-union testing of the event that the target answer is the unique mode by combining e-processes for each possible competing answer.
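The intersection-union mechanism can be sketched in a few lines. What follows is an illustrative reconstruction, not the paper's exact construction: one simple betting e-process per observed competitor (the function name, the bet size `lam`, and the update rule are all assumptions), with certification declared only when every pairwise e-value clears 1/alpha.

```python
import random

def cite_sketch(sample, target, alpha=0.05, lam=0.5, max_n=10000):
    """Illustrative sketch of intersection-union mode certification:
    one betting e-process per observed competitor. `lam` is an assumed
    fixed bet size, not the paper's tuning."""
    e = {}  # pairwise e-process per competitor, each started at 1
    for n in range(1, max_n + 1):
        x = sample()
        if x != target:
            e.setdefault(x, 1.0)  # new category: fresh e-process at 1
        # The pairwise process for {target, a} moves only when one of
        # the pair is drawn. Under the null p(a) >= p(target), the
        # conditional mean of each factor is <= 1 (supermartingale).
        for a in e:
            if x == target:
                e[a] *= 1.0 + lam  # target drawn: every pairwise bet pays
            elif x == a:
                e[a] *= 1.0 - lam  # competitor a drawn: that bet loses
        # Intersection-union: certify only when the *smallest* pairwise
        # e-value clears 1/alpha, i.e. every competitor is beaten.
        if e and min(e.values()) >= 1.0 / alpha:
            return n  # number of samples at certification
    return None  # budget exhausted without certifying
```

By Ville's inequality each pairwise supermartingale exceeds 1/alpha with probability at most alpha under its null, which is what makes the minimum a valid intersection-union statistic at any data-dependent stopping time.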

If this is right

  • The probability of false certification remains below any prescribed alpha no matter when sampling is stopped on the basis of the observed answers.
  • The expected number of samples required admits an upper bound whose dominant term does not depend on the number of possible answer categories.
  • Minimax lower bounds on the stopping time match the upper bounds up to constant factors in the primary regime.
  • The same guarantees carry over to a weighted-voting version of the procedure.
  • Empirical checks on both synthetic data and actual LLM outputs confirm that the error level is respected and that certification succeeds faster when response tails are diffuse.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method could be inserted into existing LLM pipelines to attach statistical reliability statements to self-consistency outputs without changing the underlying model.
  • Similar intersection-union constructions with e-processes may apply to other adaptive sampling tasks where the support of the distribution is unknown.
  • One could test whether the procedure still controls error when the unique-mode assumption is only approximately satisfied.
  • Because the sample-size bound is category-set-size free, the approach may scale to settings with very large or even continuous answer spaces.

Load-bearing premise

The sampled responses are i.i.d. draws from a fixed but unknown distribution that possesses a unique mode.

What would settle it

Repeated trials in which the empirical rate of false certifications exceeds the target level alpha when a data-dependent stopping rule is used, or when the certified answer is deliberately chosen not to be the true mode.
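That settling experiment can be mocked up directly: deliberately target a non-mode answer, stop adaptively, and count false certifications across repeated trials. Everything here (the distribution, bet size, trial count, and the simplified betting e-process) is an illustrative assumption, not the paper's implementation.

```python
import random

def trial(p, target, alpha=0.05, lam=0.5, budget=1000, rng=None):
    """One run of a simplified CITE-like procedure: pairwise betting
    e-processes over observed competitors, certifying (and stopping,
    data-dependently) when every one exceeds 1/alpha."""
    rng = rng or random
    labels = list(p)
    weights = [p[l] for l in labels]
    e = {}
    for _ in range(budget):
        x = rng.choices(labels, weights=weights)[0]
        if x != target:
            e.setdefault(x, 1.0)
        for a in e:
            if x == target:
                e[a] *= 1.0 + lam
            elif x == a:
                e[a] *= 1.0 - lam
        if e and min(e.values()) >= 1.0 / alpha:
            return True  # certified at a data-dependent stop
    return False

# Deliberately certify a non-mode target ("B" when "A" holds more mass):
# the empirical false-certification rate should stay below alpha.
rng = random.Random(1)
p = {"A": 0.5, "B": 0.3, "C": 0.2}
false_rate = sum(trial(p, target="B", rng=rng) for _ in range(300)) / 300
```

An empirical rate persistently above alpha in this kind of experiment is exactly what would refute the anytime-valid guarantee.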

read the original abstract

Large language models often improve reasoning by sampling multiple outputs and aggregating their final answers, but precise and efficient control of error levels remains a challenging task. In particular, deciding when to stop sampling remains difficult when the stopping rule is data-dependent and the set of possible answers is not known in advance. We study anytime-valid certification of a prespecified target answer as the unique mode of the model's response distribution, a guarantee distinct from answer correctness. We propose the Certification by Intersection-union Testing with E-processes (CITE) algorithm, which provably controls false certification at any prescribed level under arbitrary data-driven stopping, without requiring prior knowledge of the answer category set. We also prove a category-set-size-free stopping-time rate, establish matching minimax lower bounds up to constants in the main regime, and extend the construction to confidence-weighted voting. Simulations and LLM self-consistency experiments show empirical error control and improved certification in diffuse-tail settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes the CITE algorithm for anytime-valid certification that a prespecified target answer is the unique mode of an LLM's response distribution. It claims that the algorithm, based on intersection-union testing with e-processes, provably controls the false certification rate at any prescribed level under arbitrary data-driven stopping times, without requiring prior knowledge of the answer category set. Additional claims include a category-set-size-free stopping-time rate, matching minimax lower bounds up to constants, and an extension to confidence-weighted voting, supported by simulations and LLM self-consistency experiments.

Significance. If the central guarantees hold, the work offers a rigorous, anytime-valid framework for error control in adaptive sampling procedures common to LLM reasoning, addressing a practical gap where stopping rules are data-dependent and answer spaces are open-ended. The matching lower bounds and empirical demonstrations of improved certification in diffuse-tail regimes would strengthen its contribution to statistical inference in machine learning.

major comments (3)
  1. [§3.2] §3.2, CITE e-process construction (intersection-union form): The supermartingale property under the composite null (union over all possible non-target categories a with P(a) ≥ P(target)) must be verified explicitly when the observed support grows dynamically. The current definition appears to condition only on currently observed categories; without an adaptation rule that preserves E[M_{t+1} | F_t] ≤ M_t after a new category appears, optional stopping may fail and invalidate the false-certification control for data-dependent stopping.
  2. [Theorem 4.1] Theorem 4.1 (anytime-valid control): The proof sketch relies on the responses being i.i.d. from a fixed distribution with a unique mode, yet the manuscript does not list this as an explicit assumption in the theorem statement or discuss robustness when the mode is not unique or the distribution changes. This assumption is load-bearing for the intersection-union null and the claimed guarantee without prior category-set knowledge.
  3. [§5] §5, minimax lower bounds: The claimed matching lower bounds (up to constants) are stated for the main regime, but it is unclear whether they account for the dynamic growth of the category set or only for fixed finite support; the lower-bound construction should be checked against the upper-bound stopping-time rate to confirm tightness in the unknown-support setting.
minor comments (2)
  1. Notation for the e-process (e.g., the intersection and union operators) would benefit from a small worked numerical example in the main text to clarify updates upon new category observation.
  2. The abstract and introduction should explicitly state the i.i.d. and unique-mode assumptions that underpin the theoretical results.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major point below with clarifications and proposed revisions. We believe the concerns raised can be resolved without altering the core claims.

read point-by-point responses
  1. Referee: [§3.2] §3.2, CITE e-process construction (intersection-union form): The supermartingale property under the composite null (union over all possible non-target categories a with P(a) ≥ P(target)) must be verified explicitly when the observed support grows dynamically. The current definition appears to condition only on currently observed categories; without an adaptation rule that preserves E[M_{t+1} | F_t] ≤ M_t after a new category appears, optional stopping may fail and invalidate the false-certification control for data-dependent stopping.

    Authors: We appreciate the referee's emphasis on this technical detail. The CITE e-process is constructed so that each pairwise e-process (target vs. observed non-target) is a supermartingale under its respective null, and the intersection-union form takes the minimum over the current set of such processes. When a new category appears at step t+1, it is added to the composite null and a new pairwise e-process is initialized at value 1; the previous processes continue unchanged. Because the new process starts at 1 and the filtration includes the new observation, the conditional expectation E[M_{t+1} | F_t] remains ≤ M_t. We will insert an explicit lemma in the revised §3.2 proving this preservation under dynamic support growth, thereby confirming validity of optional stopping. revision: yes

  2. Referee: [Theorem 4.1] Theorem 4.1 (anytime-valid control): The proof sketch relies on the responses being i.i.d. from a fixed distribution with a unique mode, yet the manuscript does not list this as an explicit assumption in the theorem statement or discuss robustness when the mode is not unique or the distribution changes. This assumption is load-bearing for the intersection-union null and the claimed guarantee without prior category-set knowledge.

    Authors: The i.i.d. sampling from a fixed distribution with a unique mode is stated in the problem setup (Section 2) but, as the referee notes, should be restated in the theorem. We will add it explicitly to the statement of Theorem 4.1. When the mode is not unique the null hypothesis of the intersection-union test holds, so the procedure correctly withholds certification and the false-certification bound is preserved. For time-varying distributions the pathwise supermartingale property still yields anytime-valid control conditional on the realized sequence, but the unique-mode guarantee is stated for the fixed-distribution case. A short remark on these points will be added. revision: yes

  3. Referee: [§5] §5, minimax lower bounds: The claimed matching lower bounds (up to constants) are stated for the main regime, but it is unclear whether they account for the dynamic growth of the category set or only for fixed finite support; the lower-bound construction should be checked against the upper-bound stopping-time rate to confirm tightness in the unknown-support setting.

    Authors: The lower-bound construction in §5 is designed for the unknown-support regime: an adversary may introduce new categories at arbitrary times, and the information-theoretic argument accounts for the worst-case growth of the support. This yields a lower bound on the expected stopping time that is independent of the eventual support size, matching the category-set-size-free upper bound up to constants. We will expand the proof sketch in the revision to make the dynamic-support adversary explicit and to verify that the resulting rate is tight against the CITE upper bound in the growing-support setting. revision: yes
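The rebuttal's new-category argument (response 1) is easy to sanity-check numerically. Using a simple betting factor as a stand-in for the paper's pairwise e-process (an assumption; `lam` and the factor form are illustrative), the conditional mean of one step is at most 1 exactly when the null p(a) ≥ p(target) holds, and a process initialised at 1 can only lower, never raise, the intersection-union minimum.

```python
lam = 0.5  # assumed bet size for the illustrative pairwise factor

def expected_factor(p_target, p_a):
    """Conditional mean of one betting step: (1 + lam) if the target is
    drawn, (1 - lam) if competitor a is drawn, 1 otherwise. Equals
    1 + lam * (p_target - p_a), so it is <= 1 iff p_a >= p_target."""
    return p_target * (1 + lam) + p_a * (1 - lam) + (1 - p_target - p_a)

# Under the null p(a) >= p(target), each step has mean < 1 (or exactly 1
# at the boundary p_a == p_target), so the process is a supermartingale.
for p_t, p_a in [(0.2, 0.25), (0.1, 0.3), (0.25, 0.4)]:
    assert expected_factor(p_t, p_a) < 1.0

# A newly observed category enters with e-value 1, so adding it can
# only lower the running intersection-union minimum:
running = [3.2, 7.5]
assert min(running + [1.0]) <= min(running)
```

This is consistent with the authors' point that initialising the new pairwise process at 1 preserves E[M_{t+1} | F_t] ≤ M_t under dynamic support growth.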

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The CITE construction applies standard e-process supermartingale properties to an intersection-union test for certifying a prespecified target as unique mode. The anytime-valid false-certification bound under data-dependent stopping and growing category sets is derived from the e-process definition and optional-stopping theorem without reducing to a fitted parameter or self-referential definition. The category-set-size-free rate and minimax lower bounds are stated as separate results with matching constants in the main regime. No load-bearing equation equates a claimed prediction to its own input by construction, and external benchmarks (simulations, LLM experiments) are presented separately from the theoretical guarantees.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The abstract provides no explicit list of free parameters or axioms; the construction implicitly relies on standard properties of e-processes and the existence of a unique mode in the response distribution.

axioms (2)
  • domain assumption Responses are i.i.d. samples from a fixed discrete distribution possessing a unique mode.
    Required for the mode-certification target to be well-defined; not stated explicitly but presupposed by the problem setup.
  • standard math E-processes exist and can be constructed for the intersection-union testing problem under unknown support.
    Central to the CITE construction; assumed from the broader anytime-valid inference literature.
invented entities (1)
  • CITE algorithm no independent evidence
    purpose: Provides the concrete procedure for anytime-valid mode certification.
    The algorithm itself is the novel contribution; no independent evidence beyond the claimed proofs is given in the abstract.

pith-pipeline@v0.9.0 · 5486 in / 1515 out tokens · 17507 ms · 2026-05-08T05:24:11.347009+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

30 extracted references · 17 canonical work pages · 6 internal anchors
