
arXiv: 2605.16874 · v1 · pith:3RES4GZV · submitted 2026-05-16 · 💻 cs.AI

Reasoning Can Be Restored by Correcting a Few Decision Tokens

Pith reviewed 2026-05-19 20:52 UTC · model grok-4.3

classification 💻 cs.AI
keywords: reasoning, base, model, tokens, early, models, benchmarks, challenging

The pith

Base LLMs lose most reasoning ability at a few early planning tokens that can be fixed by brief stronger-model intervention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper quantifies token-level disagreement between base LLMs and stronger reasoning models using likelihood divergences. It shows the performance gap is sparse, concentrated in roughly 8 percent of early tokens that are heavily enriched for planning decisions and coincide with high base-model uncertainty. Building on this, the authors test a simple intervention that lets the reasoning model generate only at those high-disagreement positions before handing control back to the base model, recovering most of the lost accuracy on challenging reasoning benchmarks with a small token budget.
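
The disagreement measure described above can be illustrated with a minimal sketch. It assumes each model exposes a next-token probability vector over a shared vocabulary at every position. The function names, the toy distributions, and the argument order of the divergences (base first, reasoning second) are assumptions made for illustration, not details confirmed by the paper.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_v p[v] * log q[v]; eps guards against log(0)."""
    return -sum(pv * math.log(max(qv, eps)) for pv, qv in zip(p, q))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_v p[v] * log(p[v] / q[v]), skipping zero-mass entries of p."""
    return sum(pv * math.log(pv / max(qv, eps)) for pv, qv in zip(p, q) if pv > 0)

def disagreement_scores(base_dists, reason_dists, divergence=cross_entropy):
    """One score per position: divergence between the two next-token distributions."""
    return [divergence(pb, pr) for pb, pr in zip(base_dists, reason_dists)]

# Toy example with hypothetical numbers: the two models agree at position 0
# and prefer different tokens at position 1.
base = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]
reason = [[0.9, 0.05, 0.05], [0.05, 0.05, 0.9]]
scores = disagreement_scores(base, reason)
```

In this toy case the score at position 1 is much larger than at position 0, which is the kind of spike the analysis looks for. Passing `divergence=kl_divergence` swaps in a KL-based score with the same interface.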

Core claim

The reasoning advantage of large reasoning models over base models is sparse and localized to early planning-related decision tokens. On Qwen3-0.6B, only about 8 percent of generated tokens drive the salient distributional disagreement; these tokens appear early, occur 17 times more often in planning decisions than elsewhere, and align with elevated uncertainty in the base model. This pattern indicates that base models mainly derail at initial planning points that steer the rest of the trajectory.
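
One way to read the sparsity and earliness claims is as two simple statistics over a list of per-token disagreement scores. The sketch below is an assumption about how such statistics could be computed, not the paper's procedure: `token_fraction_for_mass` and `normalized_positions_of_top` are hypothetical helper names, and the mass share used to define "salient" disagreement is a free choice here.

```python
def token_fraction_for_mass(scores, mass_share):
    """Smallest fraction of tokens whose scores sum to at least mass_share of the total."""
    total = sum(scores)
    running = 0.0
    for count, s in enumerate(sorted(scores, reverse=True), start=1):
        running += s
        if running >= mass_share * total:
            return count / len(scores)
    return 1.0

def normalized_positions_of_top(scores, k):
    """Normalized positions u = t / T (t is 1-indexed) of the k highest-scoring tokens."""
    T = len(scores)
    top = sorted(range(T), key=lambda t: scores[t], reverse=True)[:k]
    return sorted((t + 1) / T for t in top)

# Toy trajectory (hypothetical): one early token carries most of the disagreement.
scores = [9.0, 0.5, 0.5] + [0.0] * 7
fraction = token_fraction_for_mass(scores, 0.5)
early = normalized_positions_of_top(scores, 1)
```

A small `fraction` together with values of `early` close to 0 would correspond to the "sparse and early" pattern reported for Qwen3-0.6B.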

What carries the argument

Disagreement-guided token intervention: a one-token takeover by the reasoning model performed only at positions of high likelihood-based divergence, followed by an immediate switch back to the base model for the remainder of generation.
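
The takeover scheme described above can be sketched as a gated decoding loop. This is a minimal illustration under stated assumptions, not the authors' implementation: models are stand-in callables that return a next-token probability list, decoding is greedy, the threshold is a nearest-rank quantile over calibration scores, and only the global condition (score above the threshold) is applied. The sketch also queries both models at every step to compute the score; whether the paper does so is not confirmed here.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_v p[v] * log q[v]; the argument order is an assumption."""
    return -sum(pv * math.log(max(qv, eps)) for pv, qv in zip(p, q))

def calibrate_threshold(scores, r):
    """Nearest-rank (1 - r)-quantile, so roughly the top r share of scores exceed it."""
    ordered = sorted(scores)
    rank = max(1, math.ceil((1 - r) * len(ordered)))
    return ordered[rank - 1]

def argmax(dist):
    return max(range(len(dist)), key=dist.__getitem__)

def guided_decode(base_model, reason_model, prompt, tau, max_tokens):
    """Greedy decoding with a one-token takeover wherever disagreement exceeds tau."""
    tokens, takeovers = [], []
    for t in range(max_tokens):
        context = prompt + tokens
        p_base, p_reason = base_model(context), reason_model(context)
        if cross_entropy(p_base, p_reason) > tau:
            tokens.append(argmax(p_reason))  # reasoning model emits this one token
            takeovers.append(t)
        else:
            tokens.append(argmax(p_base))    # control stays with the base model
    return tokens, takeovers

# Hypothetical toy models over a 3-token vocabulary: they disagree only on the
# first generated token and agree everywhere else.
def base_model(context):
    return [0.1, 0.8, 0.1] if len(context) == 1 else [0.9, 0.05, 0.05]

def reason_model(context):
    return [0.05, 0.05, 0.9] if len(context) == 1 else [0.9, 0.05, 0.05]

tau_example = calibrate_threshold([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], r=0.25)
tokens, takeovers = guided_decode(base_model, reason_model, [0], tau=1.0, max_tokens=4)
```

In the toy run the reasoning model supplies only the first token and the base model produces the rest, which mirrors the "one-token takeover, then switch back" behavior in miniature.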

Load-bearing premise

That the positions of high distributional disagreement are precisely the causal points where early planning errors derail the base model's subsequent reasoning, rather than merely correlated symptoms of some other problem.

What would settle it

A controlled test showing that swapping in the reasoning model's token at the identified high-disagreement positions produces no improvement in final reasoning accuracy would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.16874 by An Zhang, Changshuo Shen, Leheng Sheng, Xiang Wang, Yuxin Chen.

Figure 1: Disagreement-guided token intervention and its effectiveness. Intervening on 8% critical tokens, triggered by high token-level distributional disagreement through a strong LRM (i.e., top figure), enables a 0.6B base model to recover most reasoning capabilities of the 8B strong LRM (i.e., bottom figure) and outperform its post-trained 0.6B variant.
Figure 2: Token-level disagreement is sparse, early, and uncertainty-aligned across benchmarks. (a) Lorenz curves show that disagreement mass is highly concentrated in a small fraction of tokens. (b) Normalized-position density of the top-1% highest-disagreement tokens is strongly skewed toward early steps (u = t/T), indicating early-to-mid trajectory concentration relative to a uniform baseline. (c) IoU between top-p…
Figure 3: Illustration of Disagreement-Gated Intervention. In Step I, we calibrate the disagreement threshold τ and factor λ using the top-r% of scores from base rollouts. In Step II, the gate g_t dynamically switches between the Base model and the Intervene model (reasoning expert). Intervention is triggered only when the token-level disagreement s_t satisfies both the global constraint (s_t > τ) and the local relati…
Figure 4: The distribution of top-100 mean disagreement scores shows a distinct separation, where incorrect samples (red) exhibit significantly higher values than correct ones (green). GSM8K: 0.851 vs. 0.817 AUROC). This result closes the loop with the preceding structure findings: the same sparse, early, planning-linked spikes that dominate disagreement mass also carry actionable, sample-level predictive signal. Th…
Figure 5: Performance recovers sharply with few replacements. A small budget can outperform a same-size reasoning model.
Figure 6: Qualitative pattern: takeover induces a “stop-and-check” planning step. Green highlights tokens generated under takeover (reasoning model); white tokens are generated by the base model. In this example, a short self-check triggered at an ambiguous clause fixes the variable binding, after which control returns to the base model for routine execution.
Figure 7: Stability of realized replacement budget. Violin plots of per-problem replacement rate ρ_i across benchmarks for Qwen3-0.6B under (top) Cross-Entropy gating and (bottom) reverse-KL gating, each shown at two calibrated spike ratios (r = 0.03 and r = 0.05). Concentrated distributions indicate that the realized intervention budget is stable across problems rather than dominated by a small subset of outliers. C…
Figure 8: Case study prompt. The following figure visualizes the token-level takeover trajectory for this example.
Figure 9: Case study rollout. Highlighted tokens denote reasoning model replacements triggered by disagreement spikes.
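
The sample-level signal mentioned in the Figure 4 caption (top-100 mean disagreement separating incorrect from correct samples, summarized by AUROC) can be sketched as follows. This is an illustrative reconstruction under assumptions: `top_k_mean` and `auroc` are hypothetical helper names, incorrect samples are treated as the positive class, and AUROC is computed by direct pairwise comparison rather than by whatever tooling the authors used.

```python
def top_k_mean(scores, k):
    """Mean of the k largest per-token disagreement scores in one sample."""
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)

def auroc(positive_scores, negative_scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical per-sample token scores: incorrect samples have larger spikes.
incorrect = [top_k_mean(s, 2) for s in ([0.2, 3.0, 2.0], [0.1, 4.0, 4.0])]
correct = [top_k_mean(s, 2) for s in ([0.2, 1.0, 1.0], [0.3, 2.0, 1.0])]
separation = auroc(incorrect, correct)
```

An AUROC near 0.5 would mean the statistic carries no information about correctness, while values approaching 1.0 indicate that higher disagreement reliably flags incorrect samples.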
Original abstract

Large reasoning models (LRMs) substantially outperform their base LLM counterparts on challenging reasoning benchmarks, yet it remains poorly understood where base models go wrong during token-by-token generation and how to narrow this gap efficiently. We study the base-reasoning gap through quantifying token-level distributional disagreement between a base model and a stronger reasoning model using likelihood-based divergences. Across benchmarks, we find that the reasoning advantage is highly sparse and concentrates on a small set of early, planning-related decision tokens. For instance, on Qwen3-0.6B, only ~8% of generated tokens account for the salient disagreement, and these tokens concentrate early in the response, are strongly enriched in planning-related decisions (17x), and coincide with high base-model uncertainty -- suggesting that base models fail mainly at early planning points that steer the subsequent reasoning trajectory. Building on these findings, we propose disagreement-guided token intervention, a simple inference-time delegation scheme that performs a one-token takeover by the reasoning model only at high-disagreement positions and immediately switches back to the base model. With a small intervention budget, this sparse delegation substantially recovers and can even surpass the performance of a same-size reasoning model on challenging reasoning tasks. Code is available at https://github.com/AlphaLab-USTC/RRTokenIntervention.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper quantifies token-level distributional disagreement (via likelihood divergences) between base LLMs and larger reasoning models across benchmarks. It reports that the reasoning advantage is sparse (~8% of tokens on Qwen3-0.6B) and concentrated on early decision tokens that are enriched 17x for planning decisions and coincide with high base-model uncertainty. Building on this, it introduces a disagreement-guided one-token intervention that delegates only at high-disagreement positions, and it claims that this recovers (or surpasses) same-size reasoning-model performance with a small budget.

Significance. If the sparsity and intervention results hold under proper controls, the work would be significant for providing an empirical map of where base models derail on reasoning trajectories and a lightweight inference-time fix. The reproducible code and cross-benchmark consistency are strengths; the approach could inform efficient hybrid decoding strategies.

major comments (2)
  1. [Intervention Experiment] Intervention section: the performance recovery is shown for the disagreement-guided scheme, but no ablations are reported that compare it to intervening at random early tokens, low-disagreement positions, or uncertainty-matched positions while holding the one-token takeover budget fixed. Without these controls it remains possible that any early delegation to the stronger model produces the gains, weakening the claim that high-disagreement tokens are the specific causal sites.
  2. [Token Disagreement Analysis] Token analysis (around the Qwen3-0.6B ~8% figure and 17x planning enrichment): the positions of high disagreement are shown to correlate with uncertainty and planning labels, yet the manuscript does not include trajectory-level analysis or counterfactual rollouts demonstrating that errors at these exact tokens propagate to derail later reasoning. This leaves the 'steer the subsequent trajectory' interpretation as an association rather than a demonstrated causal mechanism.
minor comments (2)
  1. [Methods] Methods: specify whether the planning-related token labeling criteria were pre-registered or chosen after inspecting the data; if post-hoc, discuss potential selection bias.
  2. [Results] Figures and tables: ensure all reported percentages (e.g., 8%, 17x) are accompanied by exact token counts, benchmark breakdowns, and confidence intervals or standard errors.
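
The budget-matched controls requested in the first major comment can be sketched as three position-selection policies that all return the same number of takeover positions. This is a minimal sketch under assumptions: the function names and the `early_fraction` parameter are hypothetical, positions are chosen offline from a complete score list, and the uncertainty-matched control is not covered.

```python
import random

def select_top_disagreement(scores, budget):
    """Positions of the `budget` highest disagreement scores."""
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:budget])

def select_low_disagreement(scores, budget):
    """Positions of the `budget` lowest disagreement scores."""
    ranked = sorted(range(len(scores)), key=lambda t: scores[t])
    return sorted(ranked[:budget])

def select_random_early(num_tokens, budget, early_fraction, seed=0):
    """`budget` random positions from the first early_fraction of the sequence."""
    cutoff = max(budget, int(num_tokens * early_fraction))
    return sorted(random.Random(seed).sample(range(cutoff), budget))

# Hypothetical per-token disagreement scores for one rollout.
scores = [0.1, 2.0, 0.3, 1.5, 0.2, 0.05, 0.4, 0.6, 0.7, 0.8]
budget = 2
top = select_top_disagreement(scores, budget)
low = select_low_disagreement(scores, budget)
rand = select_random_early(len(scores), budget, early_fraction=0.5)
```

Because every policy spends the same budget, any accuracy difference between them can be attributed to where the takeover happens rather than to how often it happens.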

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments highlight important aspects of our experimental design and causal interpretations. We address each major comment below and have revised the manuscript accordingly to strengthen the claims.

Point-by-point responses
  1. Referee: [Intervention Experiment] Intervention section: the performance recovery is shown for the disagreement-guided scheme, but no ablations are reported that compare it to intervening at random early tokens, low-disagreement positions, or uncertainty-matched positions while holding the one-token takeover budget fixed. Without these controls it remains possible that any early delegation to the stronger model produces the gains, weakening the claim that high-disagreement tokens are the specific causal sites.

    Authors: We agree that these controls are necessary to isolate the role of high-disagreement positions. In the revised manuscript we have added a new ablation subsection (Section 4.3) that reports exactly these comparisons: intervening at random early tokens, at low-disagreement positions, and at uncertainty-matched positions, all under an identical one-token takeover budget. The results confirm that only the disagreement-guided selection recovers (and in some cases exceeds) the reasoning-model performance; random or uncertainty-matched early interventions yield substantially smaller gains. These new figures and tables are now included.

    Revision: yes

  2. Referee: [Token Disagreement Analysis] Token analysis (around the Qwen3-0.6B ~8% figure and 17x planning enrichment): the positions of high disagreement are shown to correlate with uncertainty and planning labels, yet the manuscript does not include trajectory-level analysis or counterfactual rollouts demonstrating that errors at these exact tokens propagate to derail later reasoning. This leaves the 'steer the subsequent trajectory' interpretation as an association rather than a demonstrated causal mechanism.

    Authors: We acknowledge that explicit counterfactual rollouts would provide the strongest causal demonstration. Performing full trajectory-level counterfactuals (forcing the base model to adopt the reasoning-model token at each high-disagreement site and then continuing generation) is computationally expensive and was outside the scope of the original submission. However, the one-token intervention results themselves constitute a direct causal test: correcting only those specific tokens and immediately returning control to the base model is sufficient to restore performance. This outcome is difficult to explain unless errors at precisely those positions derail the subsequent trajectory. We have added a dedicated paragraph in Section 5 discussing this interpretation, the correlational nature of the token-level statistics, and the limitation regarding exhaustive counterfactuals.

    Revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical token-disagreement measurements and intervention are self-contained

Full rationale

The paper quantifies token-level distributional disagreement via likelihood-based divergences between base and reasoning models, reports empirical sparsity and enrichment statistics on planning tokens, and evaluates a disagreement-guided one-token intervention. These steps consist of direct distributional comparisons and controlled experiments on benchmarks; none reduce by construction to fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations. The derivation chain is therefore independent of its own outputs and remains falsifiable against external task performance.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is primarily empirical and relies on standard assumptions of autoregressive token generation and likelihood-based divergence measures; no new free parameters, axioms, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5766 in / 1170 out tokens · 32818 ms · 2026-05-19T20:52:50.412735+00:00 · methodology


