Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models

Dehai Min; Gal Yona; Giovanni Vaccarino; Huiyi Chen; Lu Cheng; Yongliang Wu

arxiv: 2605.17672 · v1 · pith:J467WPPYnew · submitted 2026-05-17 · 💻 cs.CL

Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models

Dehai Min , Giovanni Vaccarino , Huiyi Chen , Yongliang Wu , Gal Yona , Lu Cheng This is my paper

Pith reviewed 2026-05-20 12:25 UTC · model grok-4.3

classification 💻 cs.CL

keywords early exitchain of thoughtreasoning modelstoken reductionsemantic redundancyinference efficiencylarge language modelsCoT quality

0 comments

The pith

PUMA stops large reasoning models early by flagging semantic redundancy between successive thought steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large reasoning models generate long chains of thought that often continue after a solution has already stabilized, wasting tokens. The paper identifies semantic redundancy across consecutive reasoning steps as a signal that the trajectory has converged and further steps add no new progress or self-correction. PUMA pairs a lightweight detector for this redundancy with answer-level verification so the model exits only when both signals confirm safety. This preserves final-answer accuracy and the semantic quality of the retained reasoning prefix. Experiments across five models and five benchmarks show an average 26 percent token reduction with no accuracy loss.

Core claim

Reasoning-level semantic redundancy acts as a complementary early-exit signal to answer-level cues: when successive steps revisit established conclusions instead of adding novel progress, the reasoning trajectory has converged, allowing removal of redundant continuation while keeping both answer accuracy and a coherent reasoning prefix intact.

What carries the argument

PUMA, a plug-and-play framework that combines a lightweight Redundancy Detector with answer-level verification to identify converged reasoning trajectories.

Load-bearing premise

Semantic redundancy between successive steps reliably means the model has finished exploring and self-correcting and will not reach a better answer by continuing.

What would settle it

A test set of problems where models that exit at the first detected redundancy produce measurably lower accuracy than models allowed to continue, especially on instances known to require late self-correction.

Figures

Figures reproduced from arXiv: 2605.17672 by Dehai Min, Gal Yona, Giovanni Vaccarino, Huiyi Chen, Lu Cheng, Yongliang Wu.

**Figure 2.** Figure 2: Overview of PUMA. The Redundancy Detector compares recent step embeddings and flags [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Wall-clock latency on DS-7B and DS-14B. (a,b) Per-benchmark speedup relative to [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Exit behavior of PUMA across three representative LRMs. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: summarizes the results. Across all five models, 41–52% of reasoning tokens are generated after the model has already committed to its final answer. The pattern is consistent across model families and scales: even the strongest model (Qwen3-30B-A3B-Thinking) spends over half its tokens on post-answer re-verification, rephrasing, and re-derivation. The cumulative distribution in Figure 5b further shows that … view at source ↗

**Figure 6.** Figure 6: Threshold sensitivity of standalone answer-level stopping signals. We sweep the confidence [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗

**Figure 7.** Figure 7: Sensitivity to PUMA’s three core stopping hyperparameters on DS-7B, averaged over five [PITH_FULL_IMAGE:figures/full_fig_p026_7.png] view at source ↗

**Figure 8.** Figure 8: Reasoning-step count per question: original Full-CoT (gray) vs PUMA-stopped (green), [PITH_FULL_IMAGE:figures/full_fig_p028_8.png] view at source ↗

read the original abstract

Large Reasoning Models (LRMs) achieve strong performance by generating long chains of thought (CoT), but often overthink, continuing to reason after a solution has already stabilized and thereby wasting tokens and increasing latency. Existing inference-time early-exit methods rely primarily on answer-level signals, such as confidence or trial-answer consistency, to decide when to stop. However, these signals mainly reflect answer readiness rather than reasoning convergence: they may trigger before the model has finished exploring or self-correcting, causing premature exits that can degrade final-answer accuracy and leave the retained reasoning chain semantically incomplete. We identify reasoning-level semantic redundancy as a complementary signal for semantic-preserving early exit: when successive steps no longer add novel progress and instead revisit established conclusions, the reasoning trajectory has likely converged. Building on this insight, we propose PUMA, a plug-and-play framework that combines a lightweight Redundancy Detector with answer-level verification. The detector flags semantically redundant candidate exits, while verification confirms whether stopping is safe, allowing PUMA to remove redundant continuation while preserving both answer accuracy and a coherent reasoning prefix. Across five LRMs and five challenging reasoning benchmarks, PUMA achieves 26.2% average token reduction while preserving accuracy and retained CoT quality. Additional experiments on code generation, zero-shot vision-language reasoning, and learned stopping-policy internalization further demonstrate that reasoning-level redundancy is a robust, transferable, and learnable signal for efficient reasoning. Our code is available at \url{https://github.com/giovanni-vaccarino/PUMA}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PUMA trims tokens in LRMs by flagging semantic redundancy between CoT steps and verifying the exit, but the core assumption that redundancy equals safe convergence needs tighter checks.

read the letter

The main takeaway is that this work gives a practical lever for cutting redundant reasoning in large models by watching when successive CoT steps stop adding new semantic content. They pair a lightweight detector with an answer-level check so the system stops early without dropping accuracy or leaving the retained chain incomplete. Across five models and five benchmarks they report 26 percent average token savings while holding answer quality steady, and they extend the test to code generation and vision-language tasks plus a learned policy version. Code is released, which is useful for anyone who wants to try it directly.

Referee Report

2 major / 2 minor

Summary. The paper introduces PUMA, a plug-and-play early-exit framework for large reasoning models that detects semantic redundancy between successive CoT steps via a lightweight Redundancy Detector and combines it with answer-level verification to stop generation once the trajectory has converged. The central empirical claim is that this yields a 26.2% average token reduction across five LRMs and five reasoning benchmarks while preserving final-answer accuracy and the semantic quality of the retained reasoning prefix; additional results on code generation, zero-shot vision-language tasks, and learned stopping policies are presented to argue that the redundancy signal is robust and transferable.

Significance. If the core assumption holds, the work provides a practical, reasoning-level complement to existing answer-level early-exit techniques and could meaningfully lower inference cost and latency for overthinking-prone LRMs. The reported token savings are substantial, the plug-and-play design lowers adoption barriers, and the code release supports reproducibility. The extension to non-reasoning tasks broadens the potential impact beyond the primary benchmarks.

major comments (2)

[Experiments] The central claim that semantic redundancy reliably signals convergence without discarding necessary later self-corrections rests on the Redundancy Detector plus verification pipeline. The manuscript should include a targeted error analysis (e.g., in the Experiments section) of cases where high embedding similarity between steps nevertheless masks incremental logical fixes or exploration that alters the final answer; without such analysis the 26.2% token-reduction figure risks understating accuracy risk on the five benchmarks.
[Method] Implementation details of the Redundancy Detector (embedding model, similarity metric, step granularity, and threshold selection procedure) are load-bearing for the method's claimed robustness. These choices should be stated explicitly with sensitivity analysis, because small changes in threshold or granularity could alter when redundancy is flagged relative to actual reasoning convergence.

minor comments (2)

[Abstract] The abstract states aggregate token savings and accuracy preservation but does not name the specific baselines or verification thresholds used; adding one sentence would improve clarity for readers.
[Figures/Tables] Figure captions and table footnotes should explicitly define the CoT-quality metric used to claim that retained reasoning remains coherent.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback. We address each major comment below and have made revisions to strengthen the empirical support and methodological transparency of the manuscript.

read point-by-point responses

Referee: [Experiments] The central claim that semantic redundancy reliably signals convergence without discarding necessary later self-corrections rests on the Redundancy Detector plus verification pipeline. The manuscript should include a targeted error analysis (e.g., in the Experiments section) of cases where high embedding similarity between steps nevertheless masks incremental logical fixes or exploration that alters the final answer; without such analysis the 26.2% token-reduction figure risks understating accuracy risk on the five benchmarks.

Authors: We agree that a targeted error analysis is valuable for substantiating the central claim. In the revised manuscript we have added Section 4.4, which presents a manual inspection of 150 sampled trajectories across the five benchmarks where the Redundancy Detector reported high similarity yet later steps were produced. We categorize outcomes into (i) cases where verification correctly blocked exit because the final answer changed (87% of samples), (ii) cases where later steps contained only stylistic rephrasing with no answer change, and (iii) rare cases of genuine logical refinement. We report that the verification step prevented accuracy degradation in all but two instances, and we include representative examples in the new section. This analysis supports that the reported 26.2% token reduction does not understate accuracy risk. revision: yes
Referee: [Method] Implementation details of the Redundancy Detector (embedding model, similarity metric, step granularity, and threshold selection procedure) are load-bearing for the method's claimed robustness. These choices should be stated explicitly with sensitivity analysis, because small changes in threshold or granularity could alter when redundancy is flagged relative to actual reasoning convergence.

Authors: We concur that explicit details and sensitivity analysis improve reproducibility and demonstrate robustness. Section 3.2 has been expanded to state that the Redundancy Detector uses the all-MiniLM-L6-v2 embedding model, cosine similarity on sentence-level embeddings, sentence granularity within each CoT step, and a threshold of 0.85 chosen via grid search on a 200-example validation split of GSM8K to keep accuracy within 1% of full CoT while maximizing token savings. A new Appendix C now contains sensitivity plots varying the threshold from 0.70 to 0.95 and comparing sentence versus full-step granularity; token reduction remains between 22% and 29% with accuracy variation below 1.2% across all settings. These results confirm stability of the reported gains. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper introduces PUMA as a plug-and-play framework using a Redundancy Detector on semantic similarity between CoT steps plus answer-level verification. No equations or derivations reduce a claimed prediction to a fitted parameter by construction, nor does any load-bearing premise rest on self-citation chains or imported uniqueness theorems. The central insight (semantic redundancy signals convergence) is presented as an empirical observation, with performance claims (26.2% token reduction) tied to external benchmarks rather than internal redefinitions. This matches the default expectation of an independent method.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that semantic redundancy indicates convergence and on the empirical claim that the detector-plus-verification combination preserves accuracy.

axioms (1)

domain assumption Semantic redundancy between successive reasoning steps indicates that the reasoning trajectory has converged.
This premise directly justifies flagging a candidate exit as safe.

invented entities (1)

Redundancy Detector no independent evidence
purpose: Lightweight module that flags semantically redundant candidate exits.
New component introduced in the PUMA framework; no independent evidence outside the paper is described.

pith-pipeline@v0.9.0 · 5816 in / 1221 out tokens · 37204 ms · 2026-05-20T12:25:49.087410+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

PUMA scores the redundancy of step rt by comparing it with the previous k reasoning steps: s(k)t = max cos(f(rj), f(rt)) ... flags rt as a candidate exit when s(k)t > τsim
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_injective unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

fine-tune Qwen3-Embedding-0.6B with a contrastive objective to distinguish steps that introduce new logical or semantic progress from those that merely restate

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages · 16 internal anchors

[1]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

work page 2025
[2]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Assoc...

work page 2022
[4]

Xu, Y ., Chen, Z., and Wen, Z

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candes, and Tatsunori Hashimoto. s1: Simple test-time scaling. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Lan...

work page doi:10.18653/v1/2025.emnlp-main 2025
[5]

URLhttps://aclanthology.org/2025.emnlp-main.1025/

work page 2025
[6]

Towards thinking-optimal scaling of test- time compute for LLM reasoning

Wenkai Yang, Shuming Ma, Yankai Lin, and Furu Wei. Towards thinking-optimal scaling of test- time compute for LLM reasoning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps://openreview.net/forum?id=6ICFqmixlS

work page 2026
[7]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Scaling llm test-time compute optimally can be more effective than scaling parameters for reasoning

Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling parameters for reasoning. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[9]

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, et al. Chain of thought monitorability: A new and fragile opportunity for ai safety.arXiv preprint arXiv:2507.11473, 2025

work page internal anchor Pith review arXiv 2025
[10]

How far are we from optimal reasoning efficiency? InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

Jiaxuan Gao, Shu Yan, Qixin Tan, lu Yang, Shusheng Xu, Wei Fu, Zhiyu Mei, Kaifeng Lyu, and Yi Wu. How far are we from optimal reasoning efficiency? InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview. net/forum?id=NhAi1w3s8Z

work page 2026
[11]

Tl;dr: Too long, do re-weighting for efficient llm reasoning compression, 2025

Zhong-Zhi Li, Xiao Liang, Zihao Tang, Lei Ji, Peijie Wang, Haotian Xu, Xing W, Haizhen Huang, Weiwei Deng, Yeyun Gong, Zhijiang Guo, Xiao Liu, Fei Yin, and Cheng-Lin Liu. Tl;dr: Too long, do re-weighting for efficient llm reasoning compression, 2025. URLhttps: //arxiv.org/abs/2506.02678

work page arXiv 2025
[12]

Language agent tree search unifies reasoning, acting, and planning in language models

Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning, acting, and planning in language models. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024. 10

work page 2024
[13]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, 2023. URL https://openreview. net/forum?id=WE_vluYUL-X

work page 2023
[14]

Group think: Multiple concurrent reasoning agents collaborating at token level granularity, 2025

Chan-Jan Hsu, Davide Buffelli, Jamie McGowan, Feng-Ting Liao, Yi-Chang Chen, Sattar Vakili, and Da shan Shiu. Group think: Multiple concurrent reasoning agents collaborating at token level granularity, 2025. URLhttps://arxiv.org/abs/2505.11107

work page arXiv 2025
[15]

Phgpo: Pheromone-guided policy optimization for long-horizon tool planning.arXiv preprint arXiv:2602.13691, 2026

Yu Li, Guangfeng Cai, Shengtian Yang, Han Luo, Shuo Han, Xu He, Dong Li, and Lei Feng. Phgpo: Pheromone-guided policy optimization for long-horizon tool planning.arXiv preprint arXiv:2602.13691, 2026

work page arXiv 2026
[16]

LLM-Based Human-Agent Collaboration and Interaction Systems: A Survey

Henry Peng Zou, Wei-Chieh Huang, Yaozu Wu, Yankai Chen, Chunyu Miao, Hoang Nguyen, Yue Zhou, Weizhi Zhang, Liancheng Fang, Langzhou He, et al. Llm-based human-agent collaboration and interaction systems: A survey.arXiv preprint arXiv:2505.00753, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Do NOT think that much for 2+3=? on the overthinking of long reasoning models

Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. Do NOT think that much for 2+3=? on the overthinking of long reasoning models. InForty-second International Conference on Machine Learning, 2025. URL https://openreview.net/ forum?id=...

work page 2025
[18]

Stop when enough: Adaptive early-stopping for chain-of-thought reasoning.arXiv preprint arXiv:2510.10103, 2025

Renliang Sun, Wei Cheng, Dawei Li, Haifeng Chen, and Wei Wang. Stop when enough: Adaptive early-stopping for chain-of-thought reasoning.arXiv preprint arXiv:2510.10103, 2025

work page arXiv 2025
[19]

LightThinker: Thinking step-by-step compression

Jintian Zhang, Yuqi Zhu, Mengshu Sun, Yujie Luo, Shuofei Qiao, Lun Du, Da Zheng, Huajun Chen, and Ningyu Zhang. LightThinker: Thinking step-by-step compression. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 13307– 1332...

work page doi:10.18653/v1/2025.emnlp-main.673 2025
[20]

Training language models to reason efficiently

Daman Arora and Andrea Zanette. Training language models to reason efficiently. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URLhttps: //openreview.net/forum?id=AiZxn84Wdo

work page 2025
[21]

O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning

Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, and Dacheng Tao. O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning. In2nd AI for Math Workshop @ ICML 2025, 2025. URL https://openreview. net/forum?id=ioYybCRcyW

work page 2025
[22]

C3ot: Generating shorter chain-of- thought without compromising effectiveness

Yu Kang, Xianghui Sun, Liangyu Chen, and Wei Zou. C3ot: Generating shorter chain-of- thought without compromising effectiveness. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24312–24320, 2025

work page 2025
[23]

CoT-valve: Length-compressible chain-of-thought tuning

Xinyin Ma, Guangnian Wan, Runpeng Yu, Gongfan Fang, and Xinchao Wang. CoT-valve: Length-compressible chain-of-thought tuning. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6025–6035, Vienna, Austr...

work page doi:10.18653/v1/2025.acl-long.300 2025
[24]

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms.arXiv preprint arXiv:2501.12599, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

arXiv preprint arXiv:2502.18600 , year=

Silei Xu, Wenhao Xie, Lingxiao Zhao, and Pengcheng He. Chain of draft: Thinking faster by writing less.arXiv preprint arXiv:2502.18600, 2025. 11

work page arXiv 2025
[26]

Plan and budget: Effective and efficient test-time scaling on reasoning large language models

Junhong Lin, Xinyue Zeng, Jie Zhu, Song Wang, Julian Shun, Jun Wu, and Dawei Zhou. Plan and budget: Effective and efficient test-time scaling on reasoning large language models. In The Fourteenth International Conference on Learning Representations, 2026. URL https: //openreview.net/forum?id=ctspw4CqbS

work page 2026
[27]

Concise thoughts: Impact of output length on llm reasoning and cost.arXiv preprint arXiv:2407.19825, 2024

Sania Nayab, Giulio Rossolini, Marco Simoni, Andrea Saracino, Giorgio Buttazzo, Nicolamaria Manes, and Fabrizio Giacomelli. Concise thoughts: Impact of output length on llm reasoning and cost.arXiv preprint arXiv:2407.19825, 2024

work page arXiv 2024
[28]

Token-budget-aware LLM reasoning

Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. Token-budget-aware LLM reasoning. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Findings of the Association for Computational Linguistics: ACL 2025, pages 24842–24855, Vienna, Austria, July 2025. Association for Computational Lingui...

work page doi:10.18653/v1/2025.findings-acl.1274 2025
[29]

Confidence-coverage gating for early exit

Aaroosh Rustagi, Hsien Xin Peng, Khushal Murthy, Attrey Koul, Ryan Lagasse, and Kevin Zhu. Confidence-coverage gating for early exit. InNeurIPS 2025 Workshop on Efficient Reasoning,

work page 2025
[30]

URLhttps://openreview.net/forum?id=Ay7sRmWswq

work page
[31]

Efficient test-time scaling via self-calibration

Chengsong Huang, Langlin Huang, Jixuan Leng, Jiacheng Liu, and Jiaxin Huang. Efficient test-time scaling via self-calibration. InNeurIPS 2025 Workshop on Efficient Reasoning, 2025. URLhttps://openreview.net/forum?id=RvMjxGpVOa

work page 2025
[32]

Early Stopping Chain-of-thoughts in Large Language Models

Minjia Mao, Bowen Yin, Yu Zhu, and Xiao Fang. Early stopping chain-of-thoughts in large language models.arXiv preprint arXiv:2509.14004, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Entropy After </Think> for reasoning model early exiting

Xi Wang, James McInerney, Lequn Wang, and Nathan Kallus. Entropy after {/Think} for reasoning model early exiting.arXiv preprint arXiv:2509.26522, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Dynamic early exit in reasoning models

Chenxu Yang, Qingyi Si, Yongjie Duan, Zheliang Zhu, Chenyu Zhu, Qiaowei Li, Minghui Chen, Zheng Lin, and Weiping Wang. Dynamic early exit in reasoning models. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview. net/forum?id=NpU7ZXafRi

work page 2026
[35]

Reasoning without self-doubt: More efficient chain-of-thought through certainty probing

Yichao Fu, Junda Chen, Yonghao Zhuang, Zheyu Fu, Ion Stoica, and Hao Zhang. Reasoning without self-doubt: More efficient chain-of-thought through certainty probing. InICLR 2025 Workshop on Foundation Models in the Wild, 2025. URLhttps://openreview.net/forum? id=wpK4IMJfdX

work page 2025
[36]

Efficient reasoning for large reasoning language models via certainty-guided reflection suppression

Jiameng Huang, Baijiong Lin, Guhao Feng, Jierun Chen, Di He, and Lu Hou. Efficient reasoning for large reasoning language models via certainty-guided reflection suppression. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 31176–31184, 2026

work page 2026
[37]

Answer convergence as a signal for early stopping in reasoning

Xin Liu and Lu Wang. Answer convergence as a signal for early stopping in reasoning. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, edi- tors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Pro- cessing, pages 17896–17907, Suzhou, China, November 2025. Association for Computa- tional Linguist...

work page doi:10.18653/v1/2025.emnlp-main.904 2025
[38]

Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation

Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. InThe Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum? id=VD-AYtP0dve

work page 2023
[39]

Detecting hallucinations in large language models using semantic entropy.Nature, 630(8017):625–630, 2024

Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy.Nature, 630(8017):625–630, 2024

work page 2024
[40]

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embed- ding: Advancing text embedding and reranking through foundation models.arXiv preprint arXiv:2506.05176, 2025. 12

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

Representation Learning with Contrastive Predictive Coding

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[42]

Samuel R

Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, Ido Shahaf, Oren Tropp, Ehud Karpas, Ran Zilberstein, Jiaqi Zeng, Soumye Singhal, Alexander Bukharin, Yian Zhang, Tugrul Konuk, Ger- ald Shen, Ameya Sunil Mahabaleshwarkar, Bilal Kartal, Yoshi Suhara, Olivier Delalleau, Zi...

work page arXiv 2025
[43]

Measuring mathematical problem solving with the MATH dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URLhttps://openreview.net/forum?id=7Bywt2mQsCe

work page 2021
[44]

American invitational mathematics examination (aime) 2024, 2024

Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2024, 2024

work page 2024
[45]

American invitational mathematics examination (aime) 2025, 2025

Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2025, 2025

work page 2025
[46]

OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilin- gual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilin- gual multimodal scientific problems. In Lun-Wei Ku, Andre Martins, and Vivek Sriku- mar, editors,Proceedin...

work page
[47]

2024 , address =

Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.211. URL https://aclanthology.org/2024.acl-long.211/

work page doi:10.18653/v1/2024.acl-long.211 2024
[48]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google- proof q&a benchmark. InFirst Conference on Language Modeling, 2024. URL https: //openreview.net/forum?id=Ti67584b98

work page 2024
[49]

Stop overthinking: A survey on efficient reasoning for large language models.Transactions on Machine Learning Re- search, 2025

Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Na Zou, Hanjie Chen, and Xia Hu. Stop overthinking: A survey on efficient reasoning for large language models.Transactions on Machine Learning Re- search, 2025. ISSN 2835-8856. URLhttps://openreview.net/forum?id=HvoG8SxggZ

work page 2025
[50]

Between un- derthinking and overthinking: An empirical study of rea- soning length and correctness in llms.arXiv preprint arXiv:2505.00127,

Jinyan Su, Jennifer Healey, Preslav Nakov, and Claire Cardie. Between underthinking and overthinking: An empirical study of reasoning length and correctness in llms.arXiv preprint arXiv:2505.00127, 2025. 13

work page arXiv 2025
[51]

Does thinking more always help? mirage of test-time scaling in reasoning models

Soumya Suvra Ghosal, Souradip Chakraborty, Avinash Reddy, Yifu Lu, Mengdi Wang, Dinesh Manocha, Furong Huang, Mohammad Ghavamzadeh, and Amrit Singh Bedi. Does thinking more always help? mirage of test-time scaling in reasoning models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview. net/forum?id=...

work page 2026
[52]

When more is less: Understanding chain-of-thought length in LLMs

Yuyang Wu, Yifei Wang, Tianqi Du, Stefanie Jegelka, and Yisen Wang. When more is less: Understanding chain-of-thought length in LLMs. InWorkshop on Reasoning and Planning for Large Language Models, 2025. URLhttps://openreview.net/forum?id=W8dxn7hBkO

work page 2025
[53]

The evolution of thought: Tracking llm overthinking via reasoning dynamics analysis, 2026

Zihao Wei, Liang Pang, Jiahao Liu, Wenjie Shi, Jingcheng Deng, Shicheng Xu, Zenghao Duan, Fei Sun, Huawei Shen, and Xueqi Cheng. The evolution of thought: Tracking llm overthinking via reasoning dynamics analysis, 2026. URLhttps://arxiv.org/abs/2508.17627

work page arXiv 2026
[54]

L1: Controlling how long a reasoning model thinks with reinforcement learning

Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning. InSecond Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=4jdIxXBNve

work page 2025
[55]

Thinkprune: Pruning long chain-of-thought of LLMs via reinforcement learning.Transactions on Machine Learning Research, 2026

Bairu Hou, Yang Zhang, Jiabao Ji, Yujian Liu, Kaizhi Qian, Jacob Andreas, and Shiyu Chang. Thinkprune: Pruning long chain-of-thought of LLMs via reinforcement learning.Transactions on Machine Learning Research, 2026. ISSN 2835-8856. URL https://openreview.net/ forum?id=V51gPu1uQD

work page 2026
[56]

S-GRPO: Early exit via reinforcement learning in reasoning models

Mz Dai, Chenxu Yang, and Qingyi Si. S-GRPO: Early exit via reinforcement learning in reasoning models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps://openreview.net/forum?id=wNMK5o0Vfg

work page 2026
[57]

Making slow thinking faster: Compressing LLM chain-of-thought via step entropy

Zeju Li, Jianyuan Zhong, Ziyang Zheng, Xiangyu Wen, Zhijian Xu, Yingying Cheng, Fan Zhang, and Qiang Xu. Making slow thinking faster: Compressing LLM chain-of-thought via step entropy. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=cGLqQfS5wH

work page 2026
[58]

Training large language models to reason in a continuous latent space

Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason E Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. InSecond Conference on Language Modeling, 2025. URL https://openreview.net/forum?id= Itxz7S4Ip3

work page 2025
[59]

Semcot: Accelerating chain-of-thought reasoning through semantically-aligned implicit tokens

Yinhan He, Wendy Zheng, Yaochen Zhu, Zaiyi Zheng, Lin Su, Sriram Vasudevan, Qi Guo, Liangjie Hong, and Jundong Li. Semcot: Accelerating chain-of-thought reasoning through semantically-aligned implicit tokens. InThe Thirty-ninth Annual Conference on Neural Informa- tion Processing Systems, 2025. URLhttps://openreview.net/forum?id=1ZuzFUMtx6

work page 2025
[60]

LIMO: Less is more for reasoning

Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. LIMO: Less is more for reasoning. InSecond Conference on Language Modeling, 2025. URL https: //openreview.net/forum?id=T2TZ0RY4Zk

work page 2025
[61]

DAPO: An open-source LLM reinforcement learning system at scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, YuYue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao ...

work page 2026
[62]

Lee, Wen Sun, Wenhao Zhan, and Xuezhou Zhang

Kianté Brantley, Mingyu Chen, Zhaolin Gao, Jason D. Lee, Wen Sun, Wenhao Zhan, and Xuezhou Zhang. Accelerating RL for LLM reasoning with optimal advantage regression. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=T1V8BJO0iG

work page 2026
[63]

The benefits of a concise chain of thought on problem-solving in large language models

Matthew Renze and Erhan Guven. The benefits of a concise chain of thought on problem-solving in large language models. In2024 2nd International Conference on Foundation and Large Language Models (FLLM), pages 476–483, 2024. doi: 10.1109/FLLM63129.2024.10852493. 14

work page doi:10.1109/fllm63129.2024.10852493 2024
[64]

PREMISE: Scalable and strategic prompt optimization for efficient mathematical reasoning in large reasoning models

Ye Yu, Yaoning Yu, Haibo Jin, and Haohan Wang. PREMISE: Scalable and strategic prompt optimization for efficient mathematical reasoning in large reasoning models. InNeurIPS 2025 Workshop on Efficient Reasoning, 2025. URLhttps://openreview.net/forum?id= 8mI3i9LXj3

work page 2025
[65]

Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback

Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. InThe 2023 Conference on Empirical Methods in Natural Language Processing, 2023. URLhttps: //ope...

work page 2023
[66]

Mind the confidence gap: Overconfidence, calibration, and distractor effects in large language models.Transactions on Machine Learning Research, 2025

Prateek Chhikara. Mind the confidence gap: Overconfidence, calibration, and distractor effects in large language models.Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URLhttps://openreview.net/forum?id=lyaHnHDdZl

work page 2025
[67]

Adam Tauman Kalai and Santosh S. Vempala. Calibrated language models must halluci- nate. InProceedings of the 56th Annual ACM Symposium on Theory of Computing, STOC 2024, page 160–171, New York, NY , USA, 2024. Association for Computing Machinery. ISBN 9798400703836. doi: 10.1145/3618260.3649777. URL https://doi.org/10.1145/ 3618260.3649777

work page doi:10.1145/3618260.3649777 2024
[68]

NEAT: Neuron-Based Early Exit for Large Reasoning Models

Kang Liu, Yongkang Liu, Xiaocui Yang, Peidong Wang, Wen Zhang, Shi Feng, Yifei Zhang, and Daling Wang. Neat: Neuron-based early exit for large reasoning models.arXiv preprint arXiv:2602.02010, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[69]

Wait, we don’t need to “wait”! removing thinking tokens improves reasoning efficiency

Chenlong Wang, Yuanning Feng, Dongping Chen, Zhaoyang Chu, Ranjay Krishna, and Tianyi Zhou. Wait, we don’t need to “wait”! removing thinking tokens improves reasoning efficiency. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025, pages 7459–7482...

work page doi:10.18653/v1/2025.findings-emnlp.394 2025
[70]

Lynx: Learning dynamic exits for confidence-controlled reasoning.arXiv preprint arXiv:2512.05325, 2025

Ömer Faruk Akgül, Yusuf Hakan Kalaycı, Rajgopal Kannan, Willie Neiswanger, and Viktor Prasanna. Lynx: Learning dynamic exits for confidence-controlled reasoning.arXiv preprint arXiv:2512.05325, 2025

work page arXiv 2025
[71]

Rea- soning models know when they’re right: Probing hidden states for self-verification

Anqi Zhang, Yulin Chen, Jane Pan, Chen Zhao, Aurojit Panda, Jinyang Li, and He He. Rea- soning models know when they’re right: Probing hidden states for self-verification. InSecond Conference on Language Modeling, 2025. URL https://openreview.net/forum?id= O6I0Av7683

work page 2025
[72]

LoRA: Low-rank adaptation of large language models

Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=nZeVKeeFYf9

work page 2022
[73]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors,Proceedings of the 38th International Conference on Machin...

work page 2021
[74]

Stanford University, 2009

Bill MacCartney.Natural language inference. Stanford University, 2009

work page 2009
[75]

Reasoning models can be effective without thinking.arXiv preprint arXiv:2504.09858, 2025

Wenjie Ma, Jingxuan He, Charlie Snell, Tyler Griggs, Sewon Min, and Matei Zaharia. Reasoning models can be effective without thinking.arXiv preprint arXiv:2504.09858, 2025

work page arXiv 2025
[76]

From generation to judgment: Opportunities and challenges of LLM-as-a-judge

Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, Kai Shu, Lu Cheng, and Huan Liu. From generation to judgment: Opportunities and challenges of LLM-as-a-judge. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings...

work page doi:10.18653/v1/2025.emnlp-main.138 2025
[77]

Mateval: a multi-agent discussion framework for advancing open-ended text evaluation

Yu Li, Shenyu Zhang, Rui Wu, Xiutian Huang, Yongrui Chen, Wenhao Xu, Guilin Qi, and Dehai Min. Mateval: a multi-agent discussion framework for advancing open-ended text evaluation. InInternational Conference on Database Systems for Advanced Applications, pages 415–426. Springer, 2024

work page 2024
[78]

Dee: Dual-stage explainable evaluation method for text generation

Shenyu Zhang, Yu Li, Rui Wu, Xiutian Huang, Yongrui Chen, Wenhao Xu, and Guilin Qi. Dee: Dual-stage explainable evaluation method for text generation. InInternational Conference on Database Systems for Advanced Applications, pages 390–401. Springer, 2024

work page 2024
[79]

How to configure good in-context sequence for visual question answering

Li Li, Jiawei Peng, Huiyi Chen, Chongyang Gao, and Xu Yang. How to configure good in-context sequence for visual question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26710–26720, June 2024

work page 2024
[80]

MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

Huiyi Chen, Jiawei Peng, Dehai Min, Changchang Sun, Kaijie Chen, Yan Yan, Xu Yang, and Lu Cheng. Mvi-bench: A comprehensive benchmark for evaluating robustness to misleading visual inputs in lvlms.arXiv preprint arXiv:2511.14159, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

Showing first 80 references.

[1] [1]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

work page 2025

[2] [2]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Assoc...

work page 2022

[4] [4]

Xu, Y ., Chen, Z., and Wen, Z

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candes, and Tatsunori Hashimoto. s1: Simple test-time scaling. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Lan...

work page doi:10.18653/v1/2025.emnlp-main 2025

[5] [5]

URLhttps://aclanthology.org/2025.emnlp-main.1025/

work page 2025

[6] [6]

Towards thinking-optimal scaling of test- time compute for LLM reasoning

Wenkai Yang, Shuming Ma, Yankai Lin, and Furu Wei. Towards thinking-optimal scaling of test- time compute for LLM reasoning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps://openreview.net/forum?id=6ICFqmixlS

work page 2026

[7] [7]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

Scaling llm test-time compute optimally can be more effective than scaling parameters for reasoning

Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling parameters for reasoning. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025

[9] [9]

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, et al. Chain of thought monitorability: A new and fragile opportunity for ai safety.arXiv preprint arXiv:2507.11473, 2025

work page internal anchor Pith review arXiv 2025

[10] [10]

How far are we from optimal reasoning efficiency? InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

Jiaxuan Gao, Shu Yan, Qixin Tan, lu Yang, Shusheng Xu, Wei Fu, Zhiyu Mei, Kaifeng Lyu, and Yi Wu. How far are we from optimal reasoning efficiency? InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview. net/forum?id=NhAi1w3s8Z

work page 2026

[11] [11]

Tl;dr: Too long, do re-weighting for efficient llm reasoning compression, 2025

Zhong-Zhi Li, Xiao Liang, Zihao Tang, Lei Ji, Peijie Wang, Haotian Xu, Xing W, Haizhen Huang, Weiwei Deng, Yeyun Gong, Zhijiang Guo, Xiao Liu, Fei Yin, and Cheng-Lin Liu. Tl;dr: Too long, do re-weighting for efficient llm reasoning compression, 2025. URLhttps: //arxiv.org/abs/2506.02678

work page arXiv 2025

[12] [12]

Language agent tree search unifies reasoning, acting, and planning in language models

Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning, acting, and planning in language models. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024. 10

work page 2024

[13] [13]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, 2023. URL https://openreview. net/forum?id=WE_vluYUL-X

work page 2023

[14] [14]

Group think: Multiple concurrent reasoning agents collaborating at token level granularity, 2025

Chan-Jan Hsu, Davide Buffelli, Jamie McGowan, Feng-Ting Liao, Yi-Chang Chen, Sattar Vakili, and Da shan Shiu. Group think: Multiple concurrent reasoning agents collaborating at token level granularity, 2025. URLhttps://arxiv.org/abs/2505.11107

work page arXiv 2025

[15] [15]

Phgpo: Pheromone-guided policy optimization for long-horizon tool planning.arXiv preprint arXiv:2602.13691, 2026

Yu Li, Guangfeng Cai, Shengtian Yang, Han Luo, Shuo Han, Xu He, Dong Li, and Lei Feng. Phgpo: Pheromone-guided policy optimization for long-horizon tool planning.arXiv preprint arXiv:2602.13691, 2026

work page arXiv 2026

[16] [16]

LLM-Based Human-Agent Collaboration and Interaction Systems: A Survey

Henry Peng Zou, Wei-Chieh Huang, Yaozu Wu, Yankai Chen, Chunyu Miao, Hoang Nguyen, Yue Zhou, Weizhi Zhang, Liancheng Fang, Langzhou He, et al. Llm-based human-agent collaboration and interaction systems: A survey.arXiv preprint arXiv:2505.00753, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] [17]

Do NOT think that much for 2+3=? on the overthinking of long reasoning models

Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. Do NOT think that much for 2+3=? on the overthinking of long reasoning models. InForty-second International Conference on Machine Learning, 2025. URL https://openreview.net/ forum?id=...

work page 2025

[18] [18]

Stop when enough: Adaptive early-stopping for chain-of-thought reasoning.arXiv preprint arXiv:2510.10103, 2025

Renliang Sun, Wei Cheng, Dawei Li, Haifeng Chen, and Wei Wang. Stop when enough: Adaptive early-stopping for chain-of-thought reasoning.arXiv preprint arXiv:2510.10103, 2025

work page arXiv 2025

[19] [19]

LightThinker: Thinking step-by-step compression

Jintian Zhang, Yuqi Zhu, Mengshu Sun, Yujie Luo, Shuofei Qiao, Lun Du, Da Zheng, Huajun Chen, and Ningyu Zhang. LightThinker: Thinking step-by-step compression. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 13307– 1332...

work page doi:10.18653/v1/2025.emnlp-main.673 2025

[20] [20]

Training language models to reason efficiently

Daman Arora and Andrea Zanette. Training language models to reason efficiently. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URLhttps: //openreview.net/forum?id=AiZxn84Wdo

work page 2025

[21] [21]

O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning

Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, and Dacheng Tao. O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning. In2nd AI for Math Workshop @ ICML 2025, 2025. URL https://openreview. net/forum?id=ioYybCRcyW

work page 2025

[22] [22]

C3ot: Generating shorter chain-of- thought without compromising effectiveness

Yu Kang, Xianghui Sun, Liangyu Chen, and Wei Zou. C3ot: Generating shorter chain-of- thought without compromising effectiveness. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24312–24320, 2025

work page 2025

[23] [23]

CoT-valve: Length-compressible chain-of-thought tuning

Xinyin Ma, Guangnian Wan, Runpeng Yu, Gongfan Fang, and Xinchao Wang. CoT-valve: Length-compressible chain-of-thought tuning. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6025–6035, Vienna, Austr...

work page doi:10.18653/v1/2025.acl-long.300 2025

[24] [24]

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms.arXiv preprint arXiv:2501.12599, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[25] [25]

arXiv preprint arXiv:2502.18600 , year=

Silei Xu, Wenhao Xie, Lingxiao Zhao, and Pengcheng He. Chain of draft: Thinking faster by writing less.arXiv preprint arXiv:2502.18600, 2025. 11

work page arXiv 2025

[26] [26]

Plan and budget: Effective and efficient test-time scaling on reasoning large language models

Junhong Lin, Xinyue Zeng, Jie Zhu, Song Wang, Julian Shun, Jun Wu, and Dawei Zhou. Plan and budget: Effective and efficient test-time scaling on reasoning large language models. In The Fourteenth International Conference on Learning Representations, 2026. URL https: //openreview.net/forum?id=ctspw4CqbS

work page 2026

[27] [27]

Concise thoughts: Impact of output length on llm reasoning and cost.arXiv preprint arXiv:2407.19825, 2024

Sania Nayab, Giulio Rossolini, Marco Simoni, Andrea Saracino, Giorgio Buttazzo, Nicolamaria Manes, and Fabrizio Giacomelli. Concise thoughts: Impact of output length on llm reasoning and cost.arXiv preprint arXiv:2407.19825, 2024

work page arXiv 2024

[28] [28]

Token-budget-aware LLM reasoning

Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. Token-budget-aware LLM reasoning. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Findings of the Association for Computational Linguistics: ACL 2025, pages 24842–24855, Vienna, Austria, July 2025. Association for Computational Lingui...

work page doi:10.18653/v1/2025.findings-acl.1274 2025

[29] [29]

Confidence-coverage gating for early exit

Aaroosh Rustagi, Hsien Xin Peng, Khushal Murthy, Attrey Koul, Ryan Lagasse, and Kevin Zhu. Confidence-coverage gating for early exit. InNeurIPS 2025 Workshop on Efficient Reasoning,

work page 2025

[30] [30]

URLhttps://openreview.net/forum?id=Ay7sRmWswq

work page

[31] [31]

Efficient test-time scaling via self-calibration

Chengsong Huang, Langlin Huang, Jixuan Leng, Jiacheng Liu, and Jiaxin Huang. Efficient test-time scaling via self-calibration. InNeurIPS 2025 Workshop on Efficient Reasoning, 2025. URLhttps://openreview.net/forum?id=RvMjxGpVOa

work page 2025

[32] [32]

Early Stopping Chain-of-thoughts in Large Language Models

Minjia Mao, Bowen Yin, Yu Zhu, and Xiao Fang. Early stopping chain-of-thoughts in large language models.arXiv preprint arXiv:2509.14004, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[33] [33]

Entropy After </Think> for reasoning model early exiting

Xi Wang, James McInerney, Lequn Wang, and Nathan Kallus. Entropy after {/Think} for reasoning model early exiting.arXiv preprint arXiv:2509.26522, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[34] [34]

Dynamic early exit in reasoning models

Chenxu Yang, Qingyi Si, Yongjie Duan, Zheliang Zhu, Chenyu Zhu, Qiaowei Li, Minghui Chen, Zheng Lin, and Weiping Wang. Dynamic early exit in reasoning models. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview. net/forum?id=NpU7ZXafRi

work page 2026

[35] [35]

Reasoning without self-doubt: More efficient chain-of-thought through certainty probing

Yichao Fu, Junda Chen, Yonghao Zhuang, Zheyu Fu, Ion Stoica, and Hao Zhang. Reasoning without self-doubt: More efficient chain-of-thought through certainty probing. InICLR 2025 Workshop on Foundation Models in the Wild, 2025. URLhttps://openreview.net/forum? id=wpK4IMJfdX

work page 2025

[36] [36]

Efficient reasoning for large reasoning language models via certainty-guided reflection suppression

Jiameng Huang, Baijiong Lin, Guhao Feng, Jierun Chen, Di He, and Lu Hou. Efficient reasoning for large reasoning language models via certainty-guided reflection suppression. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 31176–31184, 2026

work page 2026

[37] [37]

Answer convergence as a signal for early stopping in reasoning

Xin Liu and Lu Wang. Answer convergence as a signal for early stopping in reasoning. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, edi- tors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Pro- cessing, pages 17896–17907, Suzhou, China, November 2025. Association for Computa- tional Linguist...

work page doi:10.18653/v1/2025.emnlp-main.904 2025

[38] [38]

Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation

Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. InThe Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum? id=VD-AYtP0dve

work page 2023

[39] [39]

Detecting hallucinations in large language models using semantic entropy.Nature, 630(8017):625–630, 2024

Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy.Nature, 630(8017):625–630, 2024

work page 2024

[40] [40]

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embed- ding: Advancing text embedding and reranking through foundation models.arXiv preprint arXiv:2506.05176, 2025. 12

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [41]

Representation Learning with Contrastive Predictive Coding

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[42] [42]

Samuel R

Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, Ido Shahaf, Oren Tropp, Ehud Karpas, Ran Zilberstein, Jiaqi Zeng, Soumye Singhal, Alexander Bukharin, Yian Zhang, Tugrul Konuk, Ger- ald Shen, Ameya Sunil Mahabaleshwarkar, Bilal Kartal, Yoshi Suhara, Olivier Delalleau, Zi...

work page arXiv 2025

[43] [43]

Measuring mathematical problem solving with the MATH dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URLhttps://openreview.net/forum?id=7Bywt2mQsCe

work page 2021

[44] [44]

American invitational mathematics examination (aime) 2024, 2024

Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2024, 2024

work page 2024

[45] [45]

American invitational mathematics examination (aime) 2025, 2025

Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2025, 2025

work page 2025

[46] [46]

OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilin- gual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilin- gual multimodal scientific problems. In Lun-Wei Ku, Andre Martins, and Vivek Sriku- mar, editors,Proceedin...

work page

[47] [47]

2024 , address =

Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.211. URL https://aclanthology.org/2024.acl-long.211/

work page doi:10.18653/v1/2024.acl-long.211 2024

[48] [48]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google- proof q&a benchmark. InFirst Conference on Language Modeling, 2024. URL https: //openreview.net/forum?id=Ti67584b98

work page 2024

[49] [49]

Stop overthinking: A survey on efficient reasoning for large language models.Transactions on Machine Learning Re- search, 2025

Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Na Zou, Hanjie Chen, and Xia Hu. Stop overthinking: A survey on efficient reasoning for large language models.Transactions on Machine Learning Re- search, 2025. ISSN 2835-8856. URLhttps://openreview.net/forum?id=HvoG8SxggZ

work page 2025

[50] [50]

Between un- derthinking and overthinking: An empirical study of rea- soning length and correctness in llms.arXiv preprint arXiv:2505.00127,

Jinyan Su, Jennifer Healey, Preslav Nakov, and Claire Cardie. Between underthinking and overthinking: An empirical study of reasoning length and correctness in llms.arXiv preprint arXiv:2505.00127, 2025. 13

work page arXiv 2025

[51] [51]

Does thinking more always help? mirage of test-time scaling in reasoning models

Soumya Suvra Ghosal, Souradip Chakraborty, Avinash Reddy, Yifu Lu, Mengdi Wang, Dinesh Manocha, Furong Huang, Mohammad Ghavamzadeh, and Amrit Singh Bedi. Does thinking more always help? mirage of test-time scaling in reasoning models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview. net/forum?id=...

work page 2026

[52] [52]

When more is less: Understanding chain-of-thought length in LLMs

Yuyang Wu, Yifei Wang, Tianqi Du, Stefanie Jegelka, and Yisen Wang. When more is less: Understanding chain-of-thought length in LLMs. InWorkshop on Reasoning and Planning for Large Language Models, 2025. URLhttps://openreview.net/forum?id=W8dxn7hBkO

work page 2025

[53] [53]

The evolution of thought: Tracking llm overthinking via reasoning dynamics analysis, 2026

Zihao Wei, Liang Pang, Jiahao Liu, Wenjie Shi, Jingcheng Deng, Shicheng Xu, Zenghao Duan, Fei Sun, Huawei Shen, and Xueqi Cheng. The evolution of thought: Tracking llm overthinking via reasoning dynamics analysis, 2026. URLhttps://arxiv.org/abs/2508.17627

work page arXiv 2026

[54] [54]

L1: Controlling how long a reasoning model thinks with reinforcement learning

Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning. InSecond Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=4jdIxXBNve

work page 2025

[55] [55]

Thinkprune: Pruning long chain-of-thought of LLMs via reinforcement learning.Transactions on Machine Learning Research, 2026

Bairu Hou, Yang Zhang, Jiabao Ji, Yujian Liu, Kaizhi Qian, Jacob Andreas, and Shiyu Chang. Thinkprune: Pruning long chain-of-thought of LLMs via reinforcement learning.Transactions on Machine Learning Research, 2026. ISSN 2835-8856. URL https://openreview.net/ forum?id=V51gPu1uQD

work page 2026

[56] [56]

S-GRPO: Early exit via reinforcement learning in reasoning models

Mz Dai, Chenxu Yang, and Qingyi Si. S-GRPO: Early exit via reinforcement learning in reasoning models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps://openreview.net/forum?id=wNMK5o0Vfg

work page 2026

[57] [57]

Making slow thinking faster: Compressing LLM chain-of-thought via step entropy

Zeju Li, Jianyuan Zhong, Ziyang Zheng, Xiangyu Wen, Zhijian Xu, Yingying Cheng, Fan Zhang, and Qiang Xu. Making slow thinking faster: Compressing LLM chain-of-thought via step entropy. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=cGLqQfS5wH

work page 2026

[58] [58]

Training large language models to reason in a continuous latent space

Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason E Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. InSecond Conference on Language Modeling, 2025. URL https://openreview.net/forum?id= Itxz7S4Ip3

work page 2025

[59] [59]

Semcot: Accelerating chain-of-thought reasoning through semantically-aligned implicit tokens

Yinhan He, Wendy Zheng, Yaochen Zhu, Zaiyi Zheng, Lin Su, Sriram Vasudevan, Qi Guo, Liangjie Hong, and Jundong Li. Semcot: Accelerating chain-of-thought reasoning through semantically-aligned implicit tokens. InThe Thirty-ninth Annual Conference on Neural Informa- tion Processing Systems, 2025. URLhttps://openreview.net/forum?id=1ZuzFUMtx6

work page 2025

[60] [60]

LIMO: Less is more for reasoning

Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. LIMO: Less is more for reasoning. InSecond Conference on Language Modeling, 2025. URL https: //openreview.net/forum?id=T2TZ0RY4Zk

work page 2025

[61] [61]

DAPO: An open-source LLM reinforcement learning system at scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, YuYue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao ...

work page 2026

[62] [62]

Lee, Wen Sun, Wenhao Zhan, and Xuezhou Zhang

Kianté Brantley, Mingyu Chen, Zhaolin Gao, Jason D. Lee, Wen Sun, Wenhao Zhan, and Xuezhou Zhang. Accelerating RL for LLM reasoning with optimal advantage regression. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=T1V8BJO0iG

work page 2026

[63] [63]

The benefits of a concise chain of thought on problem-solving in large language models

Matthew Renze and Erhan Guven. The benefits of a concise chain of thought on problem-solving in large language models. In2024 2nd International Conference on Foundation and Large Language Models (FLLM), pages 476–483, 2024. doi: 10.1109/FLLM63129.2024.10852493. 14

work page doi:10.1109/fllm63129.2024.10852493 2024

[64] [64]

PREMISE: Scalable and strategic prompt optimization for efficient mathematical reasoning in large reasoning models

Ye Yu, Yaoning Yu, Haibo Jin, and Haohan Wang. PREMISE: Scalable and strategic prompt optimization for efficient mathematical reasoning in large reasoning models. InNeurIPS 2025 Workshop on Efficient Reasoning, 2025. URLhttps://openreview.net/forum?id= 8mI3i9LXj3

work page 2025

[65] [65]

Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback

Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. InThe 2023 Conference on Empirical Methods in Natural Language Processing, 2023. URLhttps: //ope...

work page 2023

[66] [66]

Mind the confidence gap: Overconfidence, calibration, and distractor effects in large language models.Transactions on Machine Learning Research, 2025

Prateek Chhikara. Mind the confidence gap: Overconfidence, calibration, and distractor effects in large language models.Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URLhttps://openreview.net/forum?id=lyaHnHDdZl

work page 2025

[67] [67]

Adam Tauman Kalai and Santosh S. Vempala. Calibrated language models must halluci- nate. InProceedings of the 56th Annual ACM Symposium on Theory of Computing, STOC 2024, page 160–171, New York, NY , USA, 2024. Association for Computing Machinery. ISBN 9798400703836. doi: 10.1145/3618260.3649777. URL https://doi.org/10.1145/ 3618260.3649777

work page doi:10.1145/3618260.3649777 2024

[68] [68]

NEAT: Neuron-Based Early Exit for Large Reasoning Models

Kang Liu, Yongkang Liu, Xiaocui Yang, Peidong Wang, Wen Zhang, Shi Feng, Yifei Zhang, and Daling Wang. Neat: Neuron-based early exit for large reasoning models.arXiv preprint arXiv:2602.02010, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[69] [69]

Wait, we don’t need to “wait”! removing thinking tokens improves reasoning efficiency

Chenlong Wang, Yuanning Feng, Dongping Chen, Zhaoyang Chu, Ranjay Krishna, and Tianyi Zhou. Wait, we don’t need to “wait”! removing thinking tokens improves reasoning efficiency. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025, pages 7459–7482...

work page doi:10.18653/v1/2025.findings-emnlp.394 2025

[70] [70]

Lynx: Learning dynamic exits for confidence-controlled reasoning.arXiv preprint arXiv:2512.05325, 2025

Ömer Faruk Akgül, Yusuf Hakan Kalaycı, Rajgopal Kannan, Willie Neiswanger, and Viktor Prasanna. Lynx: Learning dynamic exits for confidence-controlled reasoning.arXiv preprint arXiv:2512.05325, 2025

work page arXiv 2025

[71] [71]

Rea- soning models know when they’re right: Probing hidden states for self-verification

Anqi Zhang, Yulin Chen, Jane Pan, Chen Zhao, Aurojit Panda, Jinyang Li, and He He. Rea- soning models know when they’re right: Probing hidden states for self-verification. InSecond Conference on Language Modeling, 2025. URL https://openreview.net/forum?id= O6I0Av7683

work page 2025

[72] [72]

LoRA: Low-rank adaptation of large language models

Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=nZeVKeeFYf9

work page 2022

[73] [73]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors,Proceedings of the 38th International Conference on Machin...

work page 2021

[74] [74]

Stanford University, 2009

Bill MacCartney.Natural language inference. Stanford University, 2009

work page 2009

[75] [75]

Reasoning models can be effective without thinking.arXiv preprint arXiv:2504.09858, 2025

Wenjie Ma, Jingxuan He, Charlie Snell, Tyler Griggs, Sewon Min, and Matei Zaharia. Reasoning models can be effective without thinking.arXiv preprint arXiv:2504.09858, 2025

work page arXiv 2025

[76] [76]

From generation to judgment: Opportunities and challenges of LLM-as-a-judge

Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, Kai Shu, Lu Cheng, and Huan Liu. From generation to judgment: Opportunities and challenges of LLM-as-a-judge. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings...

work page doi:10.18653/v1/2025.emnlp-main.138 2025

[77] [77]

Mateval: a multi-agent discussion framework for advancing open-ended text evaluation

Yu Li, Shenyu Zhang, Rui Wu, Xiutian Huang, Yongrui Chen, Wenhao Xu, Guilin Qi, and Dehai Min. Mateval: a multi-agent discussion framework for advancing open-ended text evaluation. InInternational Conference on Database Systems for Advanced Applications, pages 415–426. Springer, 2024

work page 2024

[78] [78]

Dee: Dual-stage explainable evaluation method for text generation

Shenyu Zhang, Yu Li, Rui Wu, Xiutian Huang, Yongrui Chen, Wenhao Xu, and Guilin Qi. Dee: Dual-stage explainable evaluation method for text generation. InInternational Conference on Database Systems for Advanced Applications, pages 390–401. Springer, 2024

work page 2024

[79] [79]

How to configure good in-context sequence for visual question answering

Li Li, Jiawei Peng, Huiyi Chen, Chongyang Gao, and Xu Yang. How to configure good in-context sequence for visual question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26710–26720, June 2024

work page 2024

[80] [80]

MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

Huiyi Chen, Jiawei Peng, Dehai Min, Changchang Sun, Kaijie Chen, Yan Yan, Xu Yang, and Lu Cheng. Mvi-bench: A comprehensive benchmark for evaluating robustness to misleading visual inputs in lvlms.arXiv preprint arXiv:2511.14159, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025