arxiv: 2604.17304 · v1 · submitted 2026-04-19 · 💻 cs.AI

Recognition: unknown

Efficient Test-Time Scaling via Temporal Reasoning Aggregation

Jiakun Li , Xingwei He , Kefan Li , Hongzheng Chai , Hongyue Yu , Yuan Yuan

Authors on Pith no claims yet

Pith reviewed 2026-05-10 06:14 UTC · model grok-4.3

classification 💻 cs.AI

keywords test-time scalingreasoning efficiencyearly stoppingLLM reasoningtemporal aggregationdynamic inferencetoken reduction

0 comments

The pith

TRACE stops LLM reasoning early by checking if answers stay consistent and confidence stabilizes across recent steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often continue generating reasoning tokens long after they have settled on a correct answer, wasting compute. The paper presents TRACE, a training-free method that monitors the reasoning process over multiple steps instead of relying on any single moment's signal. It tracks two signals together: how persistently the model gives the same answer in recent steps, and how the model's confidence level rises or falls over time. When both signals indicate the reasoning has converged, the process halts. Experiments across benchmarks show this cuts token use by 25-30 percent on average while keeping accuracy within 1-2 percent of running the full reasoning chain.

Core claim

TRACE detects reasoning convergence over time by aggregating two complementary signals across recent reasoning steps: answer consistency, capturing the persistence of predicted answers, and confidence trajectory, modeling the temporal evolution of model confidence. Benefiting from these two factors, TRACE can accurately determine whether the reasoning process has converged, thereby promptly halting inference and effectively avoiding redundant reasoning steps.

What carries the argument

Temporal aggregation of answer consistency and confidence trajectory over recent reasoning steps, which replaces single-step confidence checks for deciding early termination.

If this is right

Reduces average reasoning token usage by 25-30 percent across tested benchmarks.
Keeps final accuracy within 1-2 percent of complete reasoning chains.
Outperforms prior dynamic early-exit methods that rely on single-step signals.
Requires no additional training, so it applies directly to existing models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same temporal-stability idea could be tested on non-reasoning tasks such as code generation or multi-turn dialogue to see if early stopping generalizes.
Combining TRACE with token-budget limits might produce even larger savings on very long problems.
If the two signals sometimes disagree, a simple weighted rule or learned threshold might further improve the accuracy-efficiency trade-off.

Load-bearing premise

That aggregating answer consistency and confidence trajectory over recent steps reliably detects reasoning convergence without prematurely stopping on incorrect paths or missing better answers in multi-step settings.

What would settle it

A benchmark run where TRACE terminates reasoning on problems that later steps would have corrected, producing measurably lower accuracy than full-length reasoning on the same set.

Figures

Figures reproduced from arXiv: 2604.17304 by Hongyue Yu, Hongzheng Chai, Jiakun Li, Kefan Li, Xingwei He, Yuan Yuan.

**Figure 1.** Figure 1: Single-step confidence-based early-exit prematurely stops at a high-confidence but incorrect intermediate prediction, while TRACE avoids false convergence and reaches the correct answer. The green box indicates the correct answer, and the red box indicates an incorrect answer. generated reasoning traces via chain-of-thought prompting (Wei et al., 2022), and (ii) expanding the search over alternative rea… view at source ↗

**Figure 2.** Figure 2: Cumulative accuracy of single-step early-exit [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of TRACE. The model generates reasoning steps autoregressively. A sliding window over [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Accuracy–token trade-off under identical [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Accuracy (%) of TRACE early-exit across multiple math [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 7.** Figure 7: Answer consistency comparison (Single-Step [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗

**Figure 8.** Figure 8: Sensitivity of TRACE to the window size k across models and benchmarks. (a) Qwen3-8B (b) R1-Distill-Llama [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

**Figure 9.** Figure 9: Per-example decoding cost under early stop [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗

**Figure 10.** Figure 10: Kernel density estimates of TRACE Stability [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗

**Figure 11.** Figure 11: Illustration of a case study. patterns, which TRACE is designed to detect. The degree of separation varies by dataset. On benchmarks such as MATH and AMC23, incorrect predictions rarely achieve very high stability scores, resulting in a clear gap between correct and incorrect distributions. For more challenging datasets, including AIME24, AIME25, and Olympiad Bench, the distributions partially overlap, … view at source ↗

read the original abstract

Test-time scaling improves the reasoning performance of large language models but often results in token-inefficient overthinking, where models continue reasoning beyond what is necessary for a correct answer. Existing dynamic early-exit methods typically rely on single-step confidence signals, which are often unreliable for detecting reasoning convergence in multi-step settings. To mitigate this limitation, we propose TRACE, a training-free framework for efficient test-time scaling that determines when to terminate reasoning based on temporal aggregation of multi-step evidence rather than instantaneous signals. TRACE detects reasoning convergence over time by aggregating two complementary signals across recent reasoning steps: answer consistency, capturing the persistence of predicted answers, and confidence trajectory, modeling the temporal evolution of model confidence. Benefiting from these two factors, TRACE can accurately determine whether the reasoning process has converged, thereby promptly halting inference and effectively avoiding redundant reasoning steps. Extensive experiments on multiple challenging benchmarks show that TRACE reduces reasoning token usage by 25-30% on average while maintaining accuracy within 1-2% of full-length reasoning, consistently outperforming existing dynamic reasoning methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TRACE stops LLM reasoning early by averaging answer consistency and confidence trends over a short window, claiming 25-30% token cuts with 1-2% accuracy loss, but the abstract leaves the exact rule and safeguards unclear.

read the letter

TRACE is a training-free early-exit method for chain-of-thought that watches two signals over recent steps: whether the predicted answer stays the same and whether confidence is rising or flattening. If both point to stability, it halts generation instead of running the full trace. That is the core contribution, and it is a direct response to the overthinking problem in test-time scaling where models keep adding tokens after the answer has already settled.

Referee Report

3 major / 2 minor

Summary. The paper introduces TRACE, a training-free framework for efficient test-time scaling in LLMs. It determines reasoning termination by temporally aggregating two signals over recent steps—answer consistency (persistence of predicted answers) and confidence trajectory (evolution of model confidence)—rather than relying on single-step signals. The central claim is that this detects convergence reliably enough to reduce reasoning token usage by 25-30% on average while keeping accuracy within 1-2% of full-length reasoning and outperforming prior dynamic early-exit methods across multiple benchmarks.

Significance. If the aggregation reliably avoids premature termination on incorrect paths, TRACE would offer a practical, training-free way to curb overthinking in multi-step chain-of-thought without additional model training or fine-tuning. This addresses a clear inefficiency in current test-time scaling and could be broadly applicable to existing LLMs, provided the method generalizes beyond the reported benchmarks.

major comments (3)

[§3] §3 (Method): The description of temporal aggregation does not specify the window length, combination weights between consistency and confidence, or the exact termination threshold. These choices are load-bearing for the central claim, as they directly determine whether local persistence implies global convergence; without them the 25-30% token reduction cannot be reproduced or stress-tested against late-stage corrections.
[§4] §4 (Experiments): The reported 1-2% accuracy tolerance is presented as an aggregate figure, but no per-problem breakdown or error analysis is given for cases where models stabilize on incorrect intermediate answers before later revisions. This leaves the weakest assumption unexamined and risks concealing larger accuracy drops on multi-step problems.
[§4.3] §4.3 (Baselines and ablations): The comparison to existing dynamic reasoning methods lacks detail on whether those baselines were re-implemented with the same stopping criteria or hyperparameter search; if not, the consistent outperformance claim rests on potentially mismatched implementations.

minor comments (2)

[Abstract] The abstract and introduction use the term 'temporal aggregation' without an early equation or pseudocode; adding a compact definition in §2 or §3 would improve readability.
[§4] Figure captions and axis labels in the experimental plots should explicitly state the window size and aggregation parameters used for each TRACE curve.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of clarity and experimental rigor in our work on TRACE. We respond to each major comment below and will revise the manuscript to incorporate the suggested improvements for better reproducibility and transparency.

read point-by-point responses

Referee: [§3] §3 (Method): The description of temporal aggregation does not specify the window length, combination weights between consistency and confidence, or the exact termination threshold. These choices are load-bearing for the central claim, as they directly determine whether local persistence implies global convergence; without them the 25-30% token reduction cannot be reproduced or stress-tested against late-stage corrections.

Authors: We agree that these hyperparameters are critical for reproducibility and should be stated explicitly in the method section. While the core aggregation logic is described in §3, the specific values (a sliding window of the most recent 4 steps, equal weights of 0.5 for the consistency and confidence signals, and a termination threshold of 0.75) appear only in the experimental setup. We will revise §3 to include a clear subsection on the aggregation formula, chosen hyperparameters, and their selection rationale from validation experiments. This change will make the 25-30% token reduction fully reproducible. revision: yes
Referee: [§4] §4 (Experiments): The reported 1-2% accuracy tolerance is presented as an aggregate figure, but no per-problem breakdown or error analysis is given for cases where models stabilize on incorrect intermediate answers before later revisions. This leaves the weakest assumption unexamined and risks concealing larger accuracy drops on multi-step problems.

Authors: The referee is correct that aggregate metrics alone leave open the possibility of hidden per-problem variance, particularly on multi-step problems where early stabilization on an incorrect answer could occur. Our current results report only overall accuracy and token savings. In the revision we will add a targeted error analysis (new paragraph in §4 and supporting table in the appendix) that quantifies the frequency of premature termination on incorrect paths and the resulting accuracy impact on a per-benchmark basis. This will directly examine the assumption that temporal aggregation reliably avoids such cases. revision: yes
Referee: [§4.3] §4.3 (Baselines and ablations): The comparison to existing dynamic reasoning methods lacks detail on whether those baselines were re-implemented with the same stopping criteria or hyperparameter search; if not, the consistent outperformance claim rests on potentially mismatched implementations.

Authors: We re-implemented the baselines following the original papers as closely as possible and tuned their hyperparameters on a held-out validation split to ensure a fair comparison. However, the manuscript does not document this process in sufficient detail. We will expand §4.3 with an explicit description of the re-implementation protocol, the stopping criteria applied to each baseline, and the hyperparameter ranges searched. This addition will substantiate the outperformance results and allow readers to verify the fairness of the comparison. revision: yes

Circularity Check

0 steps flagged

No circularity: TRACE is a defined heuristic with no self-referential reduction

full rationale

The paper introduces TRACE as an explicitly training-free heuristic that terminates reasoning by aggregating two observable signals (answer consistency and confidence trajectory) over a recent window. No equations, parameters, or termination rules are fitted to the target accuracy metric; the method is defined directly by its aggregation logic and then evaluated empirically on benchmarks. No self-citations are invoked as load-bearing uniqueness theorems, and no prediction is shown to be equivalent to its own inputs by construction. The derivation chain is therefore self-contained as a proposal of a new stopping rule rather than a tautological restatement of fitted or cited quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework assumes standard LLM generation and inference; no new entities introduced. Specific decision thresholds and window sizes for aggregation are likely free parameters but not detailed in the abstract.

axioms (1)

domain assumption Multi-step temporal signals are more reliable than single-step confidence for detecting reasoning convergence
Explicitly stated as motivation for moving beyond existing dynamic early-exit methods.

pith-pipeline@v0.9.0 · 5487 in / 1078 out tokens · 31433 ms · 2026-05-10T06:14:17.285733+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

57 extracted references · 29 canonical work pages · 5 internal anchors

[1]

NAACL 2024 , pages =

Jiahui Geng and Fengyu Cai and Yuxia Wang and Heinz Koeppl and Preslav Nakov and Iryna Gurevych , title =. NAACL 2024 , pages =. 2024 , url =

2024
[4]

EMNLP 2024 , pages =

Mozhi Zhang and Mianqiu Huang and Rundong Shi and Linsen Guo and Chong Peng and Peng Yan and Yaqian Zhou and Xipeng Qiu , title =. EMNLP 2024 , pages =. 2024 , url =

2024
[5]

QA-Calibration of Language Model Confidence Scores , booktitle =

Putra Manggala and Atalanti. QA-Calibration of Language Model Confidence Scores , booktitle =. 2025 , url =

2025
[13]

ICLR 2024 , publisher =

Miao Xiong and Zhiyuan Hu and Xinyang Lu and Yifei Li and Jie Fu and Junxian He and Bryan Hooi , title =. ICLR 2024 , publisher =. 2024 , url =

2024
[15]

2025 , howpublished =

AI-MO/aimo-validation-amc:. 2025 , howpublished =

2025
[16]

2024 , howpublished =

AIME Problems and Solutions , author =. 2024 , howpublished =

2024
[17]

ICLR 2024 , publisher =

Hunter Lightman and Vineet Kosaraju and Yuri Burda and Harrison Edwards and Bowen Baker and Teddy Lee and Jan Leike and John Schulman and Ilya Sutskever and Karl Cobbe , title =. ICLR 2024 , publisher =. 2024 , url =

2024
[20]

Distilling system 2 into system 1

Distilling system 2 into system 1 , author=. arXiv preprint arXiv:2407.06023 , year=

work page arXiv
[21]

Token-Budget-Aware

Tingxu Han and Zhenting Wang and Chunrong Fang and Shiyu Zhao and Shiqing Ma and Zhenyu Chen , editor =. Token-Budget-Aware. ACL 2025 , pages =. 2025 , url =

2025
[27]

ICML 2025 , publisher =

Xingyu Chen and Jiahao Xu and Tian Liang and Zhiwei He and Jianhui Pang and Dian Yu and Linfeng Song and Qiuzhi Liu and Mengfei Zhou and Zhuosheng Zhang and Rui Wang and Zhaopeng Tu and Haitao Mi and Dong Yu , title =. ICML 2025 , publisher =. 2025 , url =

2025
[29]

Stop Overthinking:

Yang Sui and Yu. Stop Overthinking:. Trans. Mach. Learn. Res. , volume =. 2025 , url =

2025
[30]

Tree of Thoughts: Deliberate Problem Solving with Large Language Models , booktitle =

Shunyu Yao and Dian Yu and Jeffrey Zhao and Izhak Shafran and Tom Griffiths and Yuan Cao and Karthik Narasimhan , editor =. Tree of Thoughts: Deliberate Problem Solving with Large Language Models , booktitle =. 2023 , url =

2023
[31]

Le and Ed H

Xuezhi Wang and Jason Wei and Dale Schuurmans and Quoc V. Le and Ed H. Chi and Sharan Narang and Aakanksha Chowdhery and Denny Zhou , title =. ICLR 2023 , publisher =. 2023 , url =

2023
[32]

Chi and Quoc V

Jason Wei and Xuezhi Wang and Dale Schuurmans and Maarten Bosma and Brian Ichter and Fei Xia and Ed H. Chi and Quoc V. Le and Denny Zhou , editor =. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , booktitle =. 2022 , url =

2022
[34]

ICML 2025 , publisher =

Xinyu Guan and Li Lyna Zhang and Yifei Liu and Ning Shang and Youran Sun and Yi Zhu and Fan Yang and Mao Yang , title =. ICML 2025 , publisher =. 2025 , url =

2025
[35]

TEVC 2025 , year=

CoCoEvo: Co-Evolution of Programs and Test Cases to Enhance Code Generation , author=. TEVC 2025 , year=

2025
[38]

Reason- ing models can be effective without thinking.arXiv preprint arXiv:2504.09858, 2025

Wenjie Ma and Jingxuan He and Charlie Snell and Tyler Griggs and Sewon Min and Matei Zaharia , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2504.09858 , eprinttype =. 2504.09858 , timestamp =

work page doi:10.48550/arxiv.2504.09858 2025
[40]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =
[43]

AI-MO Project and Project Numina . 2025. Ai-mo/aimo-validation-amc: AMC math validation set. https://huggingface.co/datasets/AI-MO/aimo-validation-amc. Dataset on Hugging Face. Extracted from AMC12 2022 and AMC12 2023 problems from the Art of Problem Solving Wiki and adapted to integer outputs

2025
[44]

Daman Arora and Andrea Zanette. 2025. https://doi.org/10.48550/ARXIV.2502.04463 Training language models to reason efficiently . CoRR, abs/2502.04463

work page doi:10.48550/arxiv.2502.04463 2025
[45]

Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. 2025 a . https://openreview.net/forum?id=MSbU3L7V00 Do NOT think that much for 2+3=? on the overthinking of long reasoning models . In ICML 2025. OpenReview.net

2025
[46]

Zhuokun Chen, Zeren Chen, Jiahao He, Mingkui Tan, Jianfei Cai, and Bohan Zhuang. 2025 b . https://doi.org/10.48550/ARXIV.2507.17307 R-stitch: Dynamic trajectory stitching for efficient reasoning . CoRR, abs/2507.17307

work page doi:10.48550/arxiv.2507.17307 2025
[47]

MAA Committees. 2024. https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions Aime problems and solutions . Online. Retrieved from Art of Problem Solving Wiki

2024
[48]

arXiv preprint arXiv:2502.08235 , year =

Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, Nicholas Thumiger, Aditya Desai, Ion Stoica, Ana Klimovic, Graham Neubig, and Joseph E. Gonzalez. 2025. https://doi.org/10.48550/ARXIV.2502.08235 The danger of overthinking: Examining the reasoning-action dilemma in a...

work page doi:10.48550/arxiv.2502.08235 2025
[49]

DeepSeek - AI. 2025. https://doi.org/10.48550/ARXIV.2501.12948 Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning . CoRR, abs/2501.12948

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.12948 2025
[50]

Jinzong Dong, Zhaohui Jiang, Dong Pan, and Haoyang Yu. 2025. https://doi.org/10.1609/AAAI.V39I15.33792 Combining priors with experience: Confidence calibration based on binomial process modeling . In AAAI 25, pages 16317--16326. AAAI Press

work page doi:10.1609/aaai.v39i15.33792 2025
[51]

Lorenzo Jaime Yu Flores, Ori Ernst, and Jackie CK Cheung. 2025. https://doi.org/10.18653/V1/2025.ACL-SHORT.15 Improving the calibration of confidence scores in text generation using the output distribution's characteristics . In ACL 2025, pages 172--182. Association for Computational Linguistics

work page doi:10.18653/v1/2025.acl-short.15 2025
[52]

Yichao Fu, Junda Chen, Siqi Zhu, Zheyu Fu, Zhongdongming Dai, Aurick Qiao, and Hao Zhang. 2024. https://doi.org/10.48550/ARXIV.2412.20993 Efficiently serving LLM reasoning programs with certaindex . CoRR, abs/2412.20993

work page doi:10.48550/arxiv.2412.20993 2024
[53]

Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, and Iryna Gurevych. 2024. https://doi.org/10.18653/v1/2024.naacl-long.366 A survey of confidence estimation and calibration in large language models . In NAACL 2024, pages 6577--6595. Association for Computational Linguistics

work page doi:10.18653/v1/2024.naacl-long.366 2024
[54]

Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. 2025. https://openreview.net/forum?id=5zwF1GizFa rstar-math: Small llms can master math reasoning with self-evolved deep thinking . In ICML 2025. OpenReview.net

2025
[55]

Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. 2025. https://aclanthology.org/2025.findings-acl.1274/ Token-budget-aware LLM reasoning . In ACL 2025, pages 24842--24855. Association for Computational Linguistics

2025
[56]

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. 2024. https://doi.org/10.18653/V1/2024.ACL-LONG.211 Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems . In ACL 2024, ...

work page doi:10.18653/v1/2024.acl-long.211 2024
[57]

Jiameng Huang, Baijiong Lin, Guhao Feng, Jierun Chen, Di He, and Lu Hou. 2025. https://doi.org/10.48550/ARXIV.2508.05337 Efficient reasoning for large reasoning language models via certainty-guided reflection suppression . CoRR, abs/2508.05337

work page doi:10.48550/arxiv.2508.05337 2025
[58]

Romain Lacombe, Kerrie Wu, and Eddie Dilworth. 2025. https://doi.org/10.48550/ARXIV.2508.15050 Don't think twice! over-reasoning impairs confidence calibration . CoRR, abs/2508.15050

work page doi:10.48550/arxiv.2508.15050 2025
[59]

Kefan Li, Yuan Yuan, Hongyue Yu, Tingyu Guo, and Shijie Cao. 2025. https://doi.org/10.48550/ARXIV.2502.10802 Cocoevo: Co-evolution of programs and test cases to enhance code generation . TEVC 2025

work page doi:10.48550/arxiv.2502.10802 2025
[60]

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2024. https://openreview.net/forum?id=v8L0pN6EOi Let's verify step by step . In ICLR 2024. OpenReview.net

2024
[61]

Xiaoou Liu, Tiejin Chen, Longchao Da, Chacha Chen, Zhen Lin, and Hua Wei. 2025. https://doi.org/10.1145/3711896.3736569 Uncertainty quantification and confidence calibration in large language models: A survey . In KDD 2025, pages 6107--6117. ACM

work page doi:10.1145/3711896.3736569 2025
[62]

Xin Liu and Lu Wang. 2025. https://doi.org/10.48550/ARXIV.2506.02536 Answer convergence as a signal for early stopping in reasoning . CoRR, arXiv:2506.02536

work page doi:10.48550/arxiv.2506.02536 2025
[63]

Ximing Lu, Seungju Han, David Acuna, Hyunwoo Kim, Jaehun Jung, Shrimai Prabhumoye, Niklas Muennighoff, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, and Yejin Choi. 2025. https://doi.org/10.48550/ARXIV.2504.04383 Retro-search: Exploring untaken paths for deeper and efficient reasoning . CoRR, abs/2504.04383

work page doi:10.48550/arxiv.2504.04383 2025
[64]

Putra Manggala, Atalanti - Anastasia Mastakouri, Elke Kirschbaum, Shiva Prasad Kasiviswanathan, and Aaditya Ramdas. 2025. https://openreview.net/forum?id=D2hhkU5O48 Qa-calibration of language model confidence scores . In ICLR 2025. OpenReview.net

2025
[65]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. 2023. https://doi.org/10.48550/ARXIV.2311.12022 GPQA: A graduate-level google-proof q & a benchmark . CoRR, abs/2311.12022

work page internal anchor Pith review doi:10.48550/arxiv.2311.12022 2023
[66]

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 2024. https://doi.org/10.48550/ARXIV.2408.03314 Scaling LLM test-time compute optimally can be more effective than scaling model parameters . CoRR, arXiv:2408.03314

work page Pith review doi:10.48550/arxiv.2408.03314 2024
[67]

Yang Sui, Yu - Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Na Zou, Hanjie Chen, and Xia Hu. 2025. https://openreview.net/forum?id=HvoG8SxggZ Stop overthinking: A survey on efficient reasoning for large language models . Trans. Mach. Learn. Res., 2025

2025
[68]

Gemini Team. 2025. https://doi.org/10.48550/ARXIV.2507.06261 Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities . CoRR, abs/2507.06261

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.06261 2025
[69]

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, and 75 others. 2025. https://doi.org/10.48550/ARXIV.2501.12599 Kimi k1.5: Scaling reinforcement learning with ll...

work page internal anchor Pith review doi:10.48550/arxiv.2501.12599 2025
[70]

Chenlong Wang, Yuanning Feng, Dongping Chen, Zhaoyang Chu, Ranjay Krishna, and Tianyi Zhou. 2025. https://doi.org/10.18653/v1/2025.findings-emnlp.394 Wait, we don ' t need to ``wait''! removing thinking tokens improves reasoning efficiency . In EMNLP 2025, pages 7459--7482. Association for Computational Linguistics

work page doi:10.18653/v1/2025.findings-emnlp.394 2025
[71]

Le, Ed H

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. https://openreview.net/forum?id=1PL1NIMMrw Self-consistency improves chain of thought reasoning in language models . In ICLR 2023. OpenReview.net

2023
[72]

Chi, Quoc V

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. http://papers.nips.cc/paper\_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html Chain-of-thought prompting elicits reasoning in large language models . In NeurIPS 2022

2022
[73]

Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. 2024. https://openreview.net/forum?id=gjeQKFxFpZ Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms . In ICLR 2024. OpenReview.net

2024
[74]

Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Chenyang Shao, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Yu Li, Jie Feng, Chen Gao, and Yong Li. 2025. https://doi.org/10.48550/ARXIV.2501.09686 Towards large reasoning models: A survey of reinforced re...

work page doi:10.48550/arxiv.2501.09686 2025
[75]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 40 others. 2025 a . https://doi.org/10.48550/ARXIV.2505.09388 Qwen3 technical report . CoRR, arXiv:2505.09388

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.09388 2025
[76]

Chenxu Yang, Qingyi Si, Yongjie Duan, Zheliang Zhu, Chenyu Zhu, Zheng Lin, Li Cao, and Weiping Wang. 2025 b . https://doi.org/10.48550/ARXIV.2504.15895 Dynamic early exit in reasoning models . CoRR, abs/2504.15895

work page doi:10.48550/arxiv.2504.15895 2025
[77]

Wenkai Yang, Shuming Ma, Yankai Lin, and Furu Wei. 2025 c . https://doi.org/10.48550/ARXIV.2502.18080 Towards thinking-optimal scaling of test-time compute for LLM reasoning . CoRR, abs/2502.18080

work page doi:10.48550/arxiv.2502.18080 2025
[78]

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. http://papers.nips.cc/paper\_files/paper/2023/hash/271db9922b8d1f4dd7aaef84ed5ac703-Abstract-Conference.html Tree of thoughts: Deliberate problem solving with large language models . In NeurIPS 2023

2023
[79]

Zhaojian Yu, Yinghao Wu, Yilun Zhao, Arman Cohan, and Xiao - Ping Zhang. 2025. https://doi.org/10.48550/ARXIV.2504.00810 Z1: efficient test-time scaling with code . CoRR, arXiv:2504.00810

work page doi:10.48550/arxiv.2504.00810 2025
[80]

Mozhi Zhang, Mianqiu Huang, Rundong Shi, Linsen Guo, Chong Peng, Peng Yan, Yaqian Zhou, and Xipeng Qiu. 2024. https://doi.org/10.18653/V1/2024.EMNLP-MAIN.173 Calibrating the confidence of large language models by eliciting fidelity . In EMNLP 2024, pages 2959--2979. Association for Computational Linguistics

work page doi:10.18653/v1/2024.emnlp-main.173 2024
[81]

Xinran Zhao, Hongming Zhang, Xiaoman Pan, Wenlin Yao, Dong Yu, Tongshuang Wu, and Jianshu Chen. 2024. https://doi.org/10.18653/V1/2024.FINDINGS-ACL.515 Fact-and-reflection (far) improves confidence calibration of large language models . In ACL 2024, pages 8702--8718. Association for Computational Linguistics

work page doi:10.18653/v1/2024.findings-acl.515 2024