arxiv: 2602.03814 · v2 · submitted 2026-02-03 · 💻 cs.AI · cs.LG

Recognition: 2 theorem links


Conformal Thinking: Risk Control for Reasoning on a Compute Budget

Xi Wang, Anushri Suresh, Alvin Zhang, Rishi More, William Jurayj, Benjamin Van Durme, Mehrdad Farajtabar, Daniel Khashabi, Eric Nalisnick


Pith reviewed 2026-05-16 07:43 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords: risk control, LLM reasoning, adaptive stopping, token budget, early exit, distribution-free guarantees, compute efficiency


The pith

Distribution-free risk control sets upper and lower thresholds so LLMs stop reasoning early while keeping error rates below a user target.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes token-budget decisions for reasoning LLMs as a risk-control task: limit the chance of wrong answers while cutting average compute. It introduces an upper threshold that halts once the model is confident and a parametric lower threshold that aborts likely unsolvable cases before wasting tokens. Given a target risk level and a validation set, distribution-free methods fix the thresholds to meet the bound. When several stopping rules are available, an efficiency term selects the cheapest one that still respects the risk limit. Experiments on multiple tasks and models show the method delivers the promised error control together with measurable savings in tokens.

Core claim

Given a target risk level and a validation set, distribution-free risk control can be used to choose both an upper threshold that stops reasoning when the model becomes confident and a novel parametric lower threshold that stops on instances unlikely to be solved. The resulting procedure bounds the probability that the final output is incorrect while minimizing expected token use; when multiple candidate rules exist, an efficiency loss selects the cheapest rule that still satisfies the risk bound.
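
The dual-threshold rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `dual_threshold_exit`, the per-chunk confidence list, and the idea of passing the lower threshold as a function of the step index are all assumptions made for the example.

```python
from typing import Callable, Sequence, Tuple


def dual_threshold_exit(
    confidences: Sequence[float],
    upper: float,
    lower: Callable[[int], float],
) -> Tuple[int, str]:
    """Return (stop_step, reason) for one confidence trajectory.

    confidences: hypothetical per-chunk confidence signal in [0, 1].
    upper: fixed upper threshold; reaching it means "answer now".
    lower: hypothetical step-dependent lower threshold; falling below it
           means "give up on this instance".
    """
    if not confidences:
        raise ValueError("need at least one confidence value")
    for step, conf in enumerate(confidences):
        if conf >= upper:
            return step, "confident"   # stop and output the current answer
        if conf < lower(step):
            return step, "abandoned"   # stop early on a likely unsolvable case
    # Neither threshold fired: the full token budget was used.
    return len(confidences) - 1, "budget_exhausted"
```

For example, `dual_threshold_exit([0.2, 0.5, 0.95], 0.9, lambda t: 0.0)` stops at step 2 because the upper threshold is reached, while a rising lower threshold such as `lambda t: 0.1 + 0.05 * t` cuts off a trajectory whose confidence keeps falling.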

What carries the argument

Distribution-free risk control applied to upper and lower stopping thresholds, with a parametric lower threshold that identifies unsolvable cases.
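
To make the calibration step concrete, here is a hedged sketch of one common way to pick a threshold with a finite-sample correction. The page mentions an upper confidence bound (UCB) but does not say which bound the paper uses, so the Hoeffding bound below is an assumption, as are the function names and the input layout (one list of validation losses in [0, 1] per candidate threshold). Candidates are scanned from most to least conservative and the scan stops at the first failure.

```python
import math
from typing import Optional, Sequence


def hoeffding_ucb(mean_loss: float, n: int, delta: float) -> float:
    """Upper confidence bound on the true risk (Hoeffding; an assumed choice)."""
    return mean_loss + math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def calibrate_upper_threshold(
    thresholds: Sequence[float],
    losses_by_threshold: Sequence[Sequence[float]],
    epsilon: float,
    delta: float = 0.1,
) -> Optional[float]:
    """Pick the least conservative threshold whose risk bound stays below epsilon.

    thresholds: candidates ordered from most conservative to least conservative.
    losses_by_threshold: validation losses in [0, 1] for each candidate.
    Returns None if even the most conservative candidate fails.
    """
    chosen = None
    for thr, losses in zip(thresholds, losses_by_threshold):
        n = len(losses)
        if n == 0:
            raise ValueError("each candidate needs validation losses")
        if hoeffding_ucb(sum(losses) / n, n, delta) > epsilon:
            break  # stop at the first candidate that fails the bound
        chosen = thr
    return chosen
```

With 200 validation examples and `delta = 0.1`, the correction term is about 0.076, so a candidate with empirical risk 0.05 passes a target of 0.2 while one with empirical risk 0.2 does not. A naive rule that compares the raw empirical risk to the target would accept both.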

If this is right

The probability that a returned output is incorrect remains below the user-specified risk target.
Early termination on unsolvable problems reduces average token count without violating the risk bound.
When several stopping rules are considered, the efficiency loss selects the one with lowest compute among those that meet the risk target.
The same calibration procedure applies to any reasoning LLM and any task for which a validation set can be collected.
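
The selection step in the third point can be illustrated with a small sketch. It assumes each candidate stopping rule has already been summarized by a risk bound and an average token cost; the function name, the dictionary layout, and the signal names and numbers in the usage note are illustrative only and are not taken from the paper.

```python
from typing import Dict, Optional, Tuple


def select_cheapest_rule(
    candidates: Dict[str, Tuple[float, float]],
    epsilon: float,
) -> Optional[str]:
    """Return the name of the cheapest rule whose risk bound is within epsilon.

    candidates maps a rule name to (risk_bound, mean_tokens).
    Returns None when no rule satisfies the risk target.
    """
    valid = {
        name: mean_tokens
        for name, (risk_bound, mean_tokens) in candidates.items()
        if risk_bound <= epsilon
    }
    if not valid:
        return None
    # Among the rules that respect the target, minimize expected compute.
    return min(valid, key=valid.get)
```

For instance, given `{"confidence": (0.08, 4200.0), "probe": (0.09, 3100.0), "token_count": (0.15, 2000.0)}` and a target of 0.1, the function returns `"probe"`: the token-count rule is cheaper but does not meet the risk target.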

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same calibration could be applied to other adaptive-compute choices such as model selection or tool use.
Extending the parametric lower threshold to multiple risk types at once would allow joint control over accuracy and other costs.
Deployed systems could let users set different risk targets per query type, with the validation set updated periodically to maintain the bound.

Load-bearing premise

The validation set must be drawn from the same distribution as future queries so the risk bound carries over, and the parametric form chosen for the lower threshold must separate unsolvable instances without stopping too many solvable ones.

What would settle it

Apply the calibrated thresholds to a fresh test set drawn from the same distribution; if the observed error rate exceeds the target risk level, the guarantee does not hold.
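
A single finite test set will fluctuate around the true risk, so in practice the check is usually repeated over several random validation/test splits (Figure 4 reports 40 of them). The helper below is a minimal sketch of that bookkeeping; the function names are hypothetical, and the per-example losses are assumed to be 0/1 indicators of an incorrect output.

```python
from typing import Sequence


def empirical_risk(losses: Sequence[float]) -> float:
    """Mean loss on one test split (losses assumed to lie in [0, 1])."""
    if not losses:
        raise ValueError("need at least one loss value")
    return sum(losses) / len(losses)


def violation_fraction(test_risks: Sequence[float], epsilon: float) -> float:
    """Fraction of splits whose observed test risk exceeds the target epsilon."""
    if not test_risks:
        raise ValueError("need at least one split")
    return sum(r > epsilon for r in test_risks) / len(test_risks)
```

If the calibration is meant to hold with probability at least 1 - delta, a violation fraction well above delta across many splits is evidence against the guarantee, whereas a single split slightly above the target is not conclusive on its own.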

Figures

Figures reproduced from arXiv: 2602.03814 by Alvin Zhang, Anushri Suresh, Benjamin Van Durme, Daniel Khashabi, Eric Nalisnick, Mehrdad Farajtabar, Rishi More, William Jurayj, Xi Wang.

**Figure 2.** Dual-threshold early exit via risk-controlled confidence dynamics. We plot confidence trajectories as a function of token usage under Qwen3-8B on AIME questions. Left: an unsolvable instance, model confidence fluctuates and fails to reach the upper threshold; the reasoning is halted early by the parametric lower threshold, preventing unnecessary token consumption. Right: a solvable instance, where confid…

**Figure 3.** Visualization of the proposed correctness and efficiency losses under different thresholds. Lines of different colors (purple to pink) denote different threshold curves. Numbers in the box show the correctness and efficiency loss for each threshold. The top row shows the upper-threshold correctness and efficiency loss (Eq. (6) and (8)). Bottom two figures show lower-threshold sigmoid curves (Eq. (12)) and…

**Figure 4.** Empirical verification of risk control. We plot the empirical test risk (y-axis) against the user-specified target risk ϵ (x-axis). Solid lines and shaded regions indicate the mean and standard deviation over 40 random test-validation splits. Different colors denote different early-stopping signals. The left panel (NAIVE) selects thresholds on the validation set without finite-sample correction, leading t…

**Figure 5.** Ensemble of signals improves efficiency. Under four models, we consider upper-threshold only early stopping. Given a target tolerance ϵ, risk control framework picks the signal that minimizes the efficiency loss (Eq. (8)), forming an ensemble of signals, which translates to superior efficiency on the test set (better accuracy vs. token trade-off).

**Figure 6.** Lower-threshold gains grow when unsolvable instances are prevalent. We evaluate Qwen3-8B with confidence as the uncertainty signal on datasets with solvable:unsolvable ratios of 3:1, 1:1, and 1:3 (constructed by pooling AIME and GPQA, labeling instances by solvability under the full token budget, and subsampling to match each ratio). Top: test accuracy (instances abstained by lower threshold considered as …

**Figure 7.** Ablation on validation set size. Top row: False positive risk; Bottom row: False negative risk. Principled risk control approach (UCB) shows better risk control than Naive cross-validation under small validation set size, as well as for …

**Figure 8.** Ablation on length shift between validation and test set. Top row: False positive risk; Bottom row: False negative risk. Short to long shift (first column) brings more challenges to risk control. For upper threshold (top row), principled risk control alleviates excessive risk for all signals except for token-based. False negative risk, controlled by the lower threshold, shows a lack of robustness against l…

**Figure 9.** Ablation on dataset shift between validation and test set. Top row: False positive risk; Bottom row: False negative risk. Consistent with previous observations, using principled risk control again yields more controlled risk under distribution shift.

Original abstract

Reasoning Large Language Models (LLMs) enable test-time scaling, with dataset-level accuracy improving as the token budget increases, motivating adaptive reasoning -- spending tokens when they improve reliability and stopping early when additional computation is unlikely to help. However, setting the token budget, as well as the threshold for adaptive reasoning, is a practical challenge that entails a fundamental risk-accuracy trade-off. We re-frame the budget setting problem as risk control, limiting the error rate while minimizing compute. Our framework introduces an upper threshold that stops reasoning when the model is confident (risking incorrect output) and a novel parametric lower threshold that preemptively stops unsolvable instances (risking premature stoppage). Given a target risk and a validation set, we use distribution-free risk control to optimally specify these stopping mechanisms. For scenarios with multiple budget controlling criteria, we incorporate an efficiency loss to select the most computationally efficient exiting mechanism. Empirical results across diverse reasoning tasks and models demonstrate the effectiveness of our risk control approach, demonstrating computational efficiency gains from the lower threshold and ensemble stopping mechanisms while adhering to the user-specified risk target.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper applies conformal risk control to set stopping thresholds for LLM reasoning, adding a parametric lower threshold for unsolvable cases and an efficiency loss for multi-criterion selection, but the joint optimization on validation data likely weakens the distribution-free guarantees.


The main takeaway is a practical framing of token budget setting as risk control: an upper threshold stops when the model is confident, while a new parametric lower threshold tries to cut off unsolvable instances early. They add an efficiency loss to pick the cheapest mechanism when several criteria are available. This combination is the actual novelty beyond standard conformal or early-exit work.

The empirical section claims efficiency gains across tasks while meeting the target risk level, which is the part that could matter for people running test-time scaling on reasoning models. The approach is straightforward to implement if the calibration works as described.

The soft spot is the guarantee itself. Optimizing the lower threshold parameters and the selection rule on the same validation set breaks exchangeability, so the marginal coverage no longer follows from the usual conformal argument. The abstract gives no sign of sample splitting or a fixed rule to restore it, and the stress-test concern holds up on that point. Without explicit verification that the final decision rule was calibrated separately, the distribution-free claim is weaker than stated.

This is aimed at researchers working on efficient inference for LLMs rather than a broad audience. A reader already using conformal methods for calibration would find the lower threshold and efficiency loss worth looking at. It deserves peer review so the authors can clarify how the bounds survive the optimization step.

Referee Report

2 major / 2 minor

Summary. The paper proposes 'Conformal Thinking,' a framework that reframes token-budget setting for reasoning LLMs as a risk-control problem. It introduces an upper threshold to stop when the model is confident and a novel parametric lower threshold to preemptively halt unsolvable instances, both calibrated via distribution-free risk control on a validation set to meet a user-specified target risk while minimizing compute. For multiple stopping criteria an efficiency loss is used to select the most efficient mechanism. Empirical results across reasoning tasks and models are reported to show efficiency gains while respecting the risk target.

Significance. If the distribution-free guarantees survive the optimization step, the work supplies a practical, non-parametric method for managing the accuracy-compute trade-off in test-time scaling of LLMs. The combination of conformal calibration with a parametric early-stopping rule and an explicit efficiency objective is a concrete contribution that could be adopted in production reasoning pipelines.

major comments (2)

[Abstract / risk-control procedure] Abstract and the risk-control procedure: the claim that distribution-free risk control is used to 'optimally specify' the stopping mechanisms is undermined by the joint optimization of the parametric lower-threshold parameters via the efficiency loss on the same validation set. This step selects both the mechanism and its cutoff, rendering the final decision rule data-dependent and violating the exchangeability assumption required for standard conformal risk control to deliver the stated marginal coverage guarantee.
[risk-control procedure] The description of the efficiency-loss optimization: no sample-splitting, hold-out set, or post-hoc correction is mentioned that would restore validity after the lower threshold is tuned on the calibration data. Without such a separation the reported risk control may be optimistically biased and the empirical adherence to the target risk on held-out data does not by itself establish the distribution-free property.

minor comments (2)

[Abstract] The abstract supplies no equations, pseudocode, or concrete definition of the parametric lower threshold or the efficiency loss; these omissions make it impossible to verify the claimed optimality or to reproduce the calibration step from the text alone.
[Empirical evaluation] The manuscript should clarify whether the validation set used for threshold selection is disjoint from any data used to report final risk and efficiency numbers, and should include a table or plot showing achieved risk versus target risk across multiple random splits.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the risk-control procedure. The comments highlight an important subtlety regarding the interaction between efficiency-loss optimization and conformal calibration. We address each point below and will revise the manuscript to incorporate sample splitting, thereby restoring the distribution-free guarantees while preserving the core contributions.


Referee: [Abstract / risk-control procedure] Abstract and the risk-control procedure: the claim that distribution-free risk control is used to 'optimally specify' the stopping mechanisms is undermined by the joint optimization of the parametric lower-threshold parameters via the efficiency loss on the same validation set. This step selects both the mechanism and its cutoff, rendering the final decision rule data-dependent and violating the exchangeability assumption required for standard conformal risk control to deliver the stated marginal coverage guarantee.

Authors: We agree that performing the efficiency-loss optimization jointly with threshold selection on the same validation set used for conformal calibration renders the final rule data-dependent and can violate the exchangeability assumption underlying standard conformal risk control. To correct this, we will revise the method to use explicit sample splitting: the validation set will be partitioned into a calibration subset for conformal risk control and an independent tuning subset for optimizing the efficiency loss and selecting among stopping mechanisms. The abstract and Section 3 will be updated to describe this two-stage procedure, and we will add a brief argument showing that the marginal coverage guarantee is preserved on the calibration subset. This constitutes a substantive but localized change to the algorithm. revision: yes
Referee: [risk-control procedure] The description of the efficiency-loss optimization: no sample-splitting, hold-out set, or post-hoc correction is mentioned that would restore validity after the lower threshold is tuned on the calibration data. Without such a separation the reported risk control may be optimistically biased and the empirical adherence to the target risk on held-out data does not by itself establish the distribution-free property.

Authors: We acknowledge that the current manuscript does not describe sample splitting or any post-hoc correction, so the theoretical distribution-free property is not rigorously established by the existing procedure. In the revision we will introduce a dedicated hold-out tuning set for the efficiency-loss optimization, perform conformal calibration exclusively on the remaining calibration data, and re-run all experiments under this corrected protocol. The revised text will include both the updated algorithm and a short proof sketch confirming that the marginal risk bound continues to hold. Empirical results on the original held-out test sets will be retained for comparison, but the primary claims will now rest on the split-validation procedure. revision: yes
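
The split-validation protocol proposed in the responses above can be sketched as follows. This is an illustration of generic sample splitting under stated assumptions, not the revised algorithm itself: the function name, the 50/50 default, and the fixed seed are hypothetical choices.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_validation(
    examples: Sequence[T],
    tune_fraction: float = 0.5,
    seed: int = 0,
) -> Tuple[List[T], List[T]]:
    """Randomly partition validation data into (tuning, calibration) subsets.

    The tuning subset is used to pick the stopping mechanism and any shape
    parameters via the efficiency loss; the calibration subset is used only
    for the final risk-controlled cutoff.
    """
    if not 0.0 < tune_fraction < 1.0:
        raise ValueError("tune_fraction must be strictly between 0 and 1")
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)  # seeded shuffle for reproducibility
    n_tune = int(len(indices) * tune_fraction)
    tuning = [examples[i] for i in indices[:n_tune]]
    calibration = [examples[i] for i in indices[n_tune:]]
    return tuning, calibration
```

Because the rule chosen on the tuning subset is fixed before the calibration subset is touched, the calibration data stay exchangeable with future queries with respect to that rule. The cost is that fewer examples are available for each stage, which loosens the finite-sample bound.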

Circularity Check

0 steps flagged

Minor data-dependence from efficiency loss but no reduction by construction


The framework applies distribution-free risk control on a held-out validation set to set upper/lower thresholds and uses an efficiency loss only for tie-breaking among multiple criteria. No quoted equation or derivation shows the final risk bound or selected mechanism reducing to a fitted parameter by construction; the calibration step remains separate from the efficiency selection. This is standard conformal practice and does not match any enumerated circularity pattern. Score kept at 2 to reflect the potential exchangeability concern raised by joint optimization without claiming an explicit self-referential collapse.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard conformal prediction assumptions plus the existence of a suitable parametric family for the lower threshold.

free parameters (2)

target risk level
User-specified input that determines the calibration of both thresholds on the validation set.
efficiency loss weighting
Hyperparameter used to trade off among valid stopping mechanisms when multiple criteria are active.

axioms (1)

domain assumption: Distribution-free risk control guarantees transfer from the validation set to the test distribution
Invoked when the paper states that thresholds are set to meet the target risk on future instances.

pith-pipeline@v0.9.0 · 5520 in / 1256 out tokens · 39731 ms · 2026-05-16T07:43:16.978409+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear


parametric lower threshold... $\lambda^{-}(t; c) = \sigma\big(c(\omega t - B/2)\big)$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Process Supervision of Confidence Margin for Calibrated LLM Reasoning
cs.LG 2026-04 unverdicted novelty 6.0

RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.
