Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
Pith reviewed 2026-05-12 08:35 UTC · model grok-4.3
The pith
Long CoT with deep reasoning, exploration, and reflection enables LLMs to solve more complex tasks than short chains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The survey establishes that Long CoT, distinguished from Short CoT by its incorporation of deep reasoning, extensive exploration, and feasible reflection, equips reasoning large language models to address more complex tasks and deliver more efficient, coherent outcomes.
What carries the argument
A novel taxonomy that organizes reasoning paradigms by contrasting Long CoT against Short CoT, anchored in the three characteristics of deep reasoning, extensive exploration, and feasible reflection.
If this is right
- Models equipped with Long CoT can tackle more intricate problems in domains such as mathematics and coding than those restricted to Short CoT.
- The emergence of Long CoT produces observable effects including overthinking and gains from inference-time scaling.
- Future systems can build on the taxonomy to integrate multi-modal reasoning and refine knowledge frameworks for greater efficiency.
- A shared structure for these processes supports clearer progress in developing logical reasoning capabilities for AI.
Where Pith is reading between the lines
- The taxonomy could serve as a basis for creating standardized benchmarks that measure depth of exploration separately from final accuracy.
- Techniques to detect and shorten unnecessary reflection steps might reduce overthinking while preserving the benefits of Long CoT.
- Inference-time scaling implies that deployment budgets could shift from larger models to longer allowed thinking time on smaller ones.
- Combining the survey's framework with visual inputs could create unified reasoning systems that handle both text and image-based problems.
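The inference-time scaling point above can be made concrete with a toy budget calculation: under a fixed compute budget, a smaller model can afford proportionally more "thinking" tokens than a larger one. This is a hypothetical sketch with illustrative numbers, not measurements from the survey; the cost proxy (model size times tokens) is an assumption.

```python
# Hypothetical sketch of the inference-time scaling trade-off: spending a
# fixed budget on longer thinking with a small model vs. fewer tokens on a
# large model. All numbers are illustrative, not from the survey.

def compute_cost(params_b: float, thinking_tokens: int) -> float:
    """Rough cost proxy: model size (billions of params) times tokens generated."""
    return params_b * thinking_tokens

def allocate_budget(budget: float, params_b: float) -> int:
    """Thinking tokens affordable for a model of a given size under the budget."""
    return int(budget // params_b)

budget = 70_000.0
small_tokens = allocate_budget(budget, params_b=7.0)   # 10x the thinking time...
large_tokens = allocate_budget(budget, params_b=70.0)  # ...of the 70B model
assert small_tokens == 10 * large_tokens
assert compute_cost(7.0, small_tokens) == compute_cost(70.0, large_tokens)
```

Whether the extra thinking tokens buy equivalent accuracy is exactly the empirical question the survey's inference-time scaling discussion raises.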
Load-bearing premise
The differences between Long CoT and Short CoT are fundamental enough to support a clean taxonomy without substantial overlap or missing categories.
What would settle it
An empirical result in which Short CoT models achieve equal or better performance than Long CoT models on complex tasks without added mechanisms, or a new reasoning approach that blends both forms in a way the taxonomy cannot classify.
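The falsification test above reduces to a simple paired comparison: the load-bearing premise weakens if Short CoT matches or beats Long CoT on the same complex-task suite. A minimal sketch, with hypothetical per-task results:

```python
# Minimal sketch of the settling criterion described above. The result lists
# are hypothetical, not benchmark data from the paper.

def mean(xs):
    return sum(xs) / len(xs)

def premise_challenged(short_cot_correct, long_cot_correct, margin=0.0):
    """True when Short CoT accuracy reaches Long CoT accuracy minus a margin."""
    return mean(short_cot_correct) >= mean(long_cot_correct) - margin

# Per-task correctness (1 = solved) on a hypothetical complex-task benchmark.
short_results = [1, 0, 0, 1, 0, 0, 0, 1]  # 37.5%
long_results  = [1, 1, 0, 1, 1, 0, 1, 1]  # 75.0%
assert not premise_challenged(short_results, long_results)
```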
read the original abstract
Recent advancements in reasoning with large language models (RLLMs), such as OpenAI-O1 and DeepSeek-R1, have demonstrated their impressive capabilities in complex domains like mathematics and coding. A central factor in their success lies in the application of long chain-of-thought (Long CoT) characteristics, which enhance reasoning abilities and enable the solution of intricate problems. However, despite these developments, a comprehensive survey on Long CoT is still lacking, limiting our understanding of its distinctions from traditional short chain-of-thought (Short CoT) and complicating ongoing debates on issues like "overthinking" and "inference-time scaling." This survey seeks to fill this gap by offering a unified perspective on Long CoT. (1) We first distinguish Long CoT from Short CoT and introduce a novel taxonomy to categorize current reasoning paradigms. (2) Next, we explore the key characteristics of Long CoT: deep reasoning, extensive exploration, and feasible reflection, which enable models to handle more complex tasks and produce more efficient, coherent outcomes compared to the shallower Short CoT. (3) We then investigate key phenomena such as the emergence of Long CoT with these characteristics, including overthinking, and inference-time scaling, offering insights into how these processes manifest in practice. (4) Finally, we identify significant research gaps and highlight promising future directions, including the integration of multi-modal reasoning, efficiency improvements, and enhanced knowledge frameworks. By providing a structured overview, this survey aims to inspire future research and further the development of logical reasoning in artificial intelligence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a survey paper on long chain-of-thought (Long CoT) for reasoning large language models (RLLMs). It distinguishes Long CoT from traditional short CoT, introduces a novel taxonomy for current reasoning paradigms, examines key characteristics of Long CoT (deep reasoning, extensive exploration, and feasible reflection), analyzes phenomena including overthinking and inference-time scaling, and identifies research gaps with future directions such as multi-modal integration, efficiency improvements, and enhanced knowledge frameworks.
Significance. If the distinctions and taxonomy are robust, this work provides a timely organizational synthesis for the emerging 'reasoning era' in LLMs, building on models like o1 and R1. The structured overview of characteristics, phenomena, and gaps can serve as a useful reference to guide research on complex task handling and coherence, with the explicit identification of future directions as a particular strength of the synthesis.
major comments (2)
- §3 (Taxonomy): The central claim that the proposed taxonomy comprehensively categorizes reasoning paradigms without significant overlap or omission is load-bearing for the unified perspective but lacks explicit validation; the manuscript should add a systematic mapping or comparison table against prior CoT/reasoning taxonomies to demonstrate completeness and novelty.
- §4 (Characteristics): The assertion that Long CoT characteristics enable 'more efficient, coherent outcomes' compared to Short CoT is presented as a key distinction, yet the survey does not quantify or cite specific metrics (e.g., token efficiency or error rates) across representative models to support this over the qualitative description.
minor comments (3)
- Abstract and §1: The phrasing 'feasible reflection' is introduced without an immediate operational definition or example; add a brief clarifying sentence or footnote on first use.
- §5 (Phenomena): The discussion of 'overthinking' would benefit from a dedicated subsection or table listing observed instances across models (e.g., o1 vs. R1) to improve readability and allow direct comparison.
- References: Several foundational CoT papers (pre-2023) appear underrepresented relative to recent model-specific works; expand the citation list for balance in the literature synthesis.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and recommendation of minor revision. The comments on validating the taxonomy and strengthening the characteristics section are helpful. We address each point below, indicating revisions where made.
read point-by-point responses
-
Referee: §3 (Taxonomy): The central claim that the proposed taxonomy comprehensively categorizes reasoning paradigms without significant overlap or omission is load-bearing for the unified perspective but lacks explicit validation; the manuscript should add a systematic mapping or comparison table against prior CoT/reasoning taxonomies to demonstrate completeness and novelty.
Authors: We agree that an explicit validation strengthens the taxonomy's contribution. In the revised manuscript, we have added a new Table 2 that provides a systematic mapping of our taxonomy categories (deep reasoning, extensive exploration, feasible reflection, and related paradigms) against prior CoT and reasoning taxonomies, including those from Wei et al. (2022), Kojima et al. (2022), and recent surveys on LLM reasoning. The table highlights distinctions, such as our coverage of inference-time scaling and reflection in Long CoT models like o1 and R1, while noting areas of overlap and confirming no major omissions in current paradigms. This addition directly addresses the request for demonstrated completeness and novelty. revision: yes
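The mapping the response describes can be sketched as a coverage table in which each Long CoT category is checked against prior taxonomies, making overlaps and gaps explicit. The entries below are illustrative placeholders, not the actual contents of the revised Table 2:

```python
# Toy sketch of a taxonomy-coverage mapping (hypothetical entries): each
# Long CoT category is checked against prior CoT taxonomies so that gaps
# and overlaps become explicit rather than asserted.

taxonomy_coverage = {
    "deep reasoning":        {"Wei et al. 2022": True,  "Kojima et al. 2022": True},
    "extensive exploration": {"Wei et al. 2022": False, "Kojima et al. 2022": False},
    "feasible reflection":   {"Wei et al. 2022": False, "Kojima et al. 2022": False},
}

def uncovered(category: str) -> list:
    """Prior taxonomies that do not cover the given Long CoT category."""
    return [t for t, covered in taxonomy_coverage[category].items() if not covered]

# Exploration and reflection are the categories prior taxonomies leave open.
assert uncovered("deep reasoning") == []
```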
-
Referee: §4 (Characteristics): The assertion that Long CoT characteristics enable 'more efficient, coherent outcomes' compared to Short CoT is presented as a key distinction, yet the survey does not quantify or cite specific metrics (e.g., token efficiency or error rates) across representative models to support this over the qualitative description.
Authors: The referee is correct that the current version relies on qualitative description for the efficiency and coherence claims. As a survey, we do not conduct new experiments, but we have revised Section 4 to incorporate citations to empirical studies on models such as OpenAI-o1 and DeepSeek-R1. These include reported metrics on benchmark accuracy (e.g., MATH, GSM8K), token usage comparisons showing longer but more effective reasoning paths, and reduced error rates in complex tasks relative to short CoT baselines. This provides concrete support for the distinction while remaining within survey scope. We can further expand the citations if additional references are suggested. revision: partial
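One concrete shape the requested metric could take is accuracy normalized by generation cost. The figures below are placeholders, not results reported in the survey; they illustrate why quantification matters, since a token-normalized metric can favor Short CoT even when raw accuracy favors Long CoT:

```python
# Sketch of a token-efficiency metric for the Long vs. Short CoT comparison.
# All inputs are hypothetical placeholders, not reported benchmark values.

def accuracy_per_kilotoken(correct: int, total: int, tokens_used: int) -> float:
    """Accuracy divided by generation cost in thousands of tokens."""
    accuracy = correct / total
    return accuracy / (tokens_used / 1000)

# Hypothetical runs: Long CoT solves more tasks but spends far more tokens.
long_cot  = accuracy_per_kilotoken(correct=80, total=100, tokens_used=4000)
short_cot = accuracy_per_kilotoken(correct=50, total=100, tokens_used=500)
assert short_cot > long_cot  # efficiency metric can invert the accuracy ranking
```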
Circularity Check
No significant circularity; literature synthesis without derivations
full rationale
This survey paper organizes existing literature on Long CoT versus Short CoT in reasoning LLMs, drawing distinctions and proposing a taxonomy based on observed characteristics from models such as OpenAI-O1 and DeepSeek-R1. No mathematical derivations, equations, fitted parameters, or quantitative predictions appear that could reduce to the paper's own inputs by construction. Claims about deep reasoning, exploration, reflection, overthinking, and inference-time scaling are presented as syntheses of prior work rather than self-referential definitions or load-bearing self-citations. The structure remains self-contained as an organizational review with no steps that equate outputs to inputs via ansatz, renaming, or uniqueness theorems imported from the authors' prior work.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Long CoT is meaningfully distinct from Short CoT and enables superior handling of complex tasks.
Forward citations
Cited by 31 Pith papers
-
Unsupervised Process Reward Models
Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.
-
CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation
CA-SQL achieves 51.72% execution accuracy on the challenging tier of the BIRD benchmark using GPT-4o-mini by scaling exploration breadth according to estimated task difficulty, evolutionary prompt seeding, and candida...
-
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
-
Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
DataPRM is a new process reward model for data analysis agents that detects silent errors via environment interaction and ternary rewards, yielding 7-11% gains on benchmarks and further RL improvements.
-
ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation
ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.
-
Video-R1: Reinforcing Video Reasoning in MLLMs
Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.
-
STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes
STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage 19-42% with little accuracy loss on math benchmarks.
-
Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
SCOLAR addresses information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens from LLM hidden states, extending acceptable CoT length over 30x and achieving +14.12% gains on b...
-
RuPLaR : Efficient Latent Compression of LLM Reasoning Chains with Rule-Based Priors From Multi-Step to One-Step
RuPLaR replaces multi-step latent CoT with a single-model one-step generator guided by rule-based priors and a joint consistency-plus-alignment loss, delivering 11.1 percent higher accuracy at lower token cost.
-
AIPO: Learning to Reason from Active Interaction
AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...
-
Training Transformers for KV Cache Compressibility
Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.
-
Training Transformers for KV Cache Compressibility
KV compressibility is a property of learned transformer representations that can be improved by training with KV sparsification, leading to better quality-budget tradeoffs in downstream compression for retrieval, QA, ...
-
Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding
CoRD uses collaborative multi-teacher step-wise decoding with perplexity-guided beam search to generate higher-quality Long-CoT data that lets smaller models reach near-teacher performance with less supervision.
-
VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model
VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.
-
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
-
OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model
OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.
-
SAVOIR: Learning Social Savoir-Faire via Shapley-based Reward Attribution
SAVOIR combines prospective expected utility valuation with Shapley values for fair credit assignment in social dialogue RL, achieving SOTA on SOTOPIA where a 7B model matches or exceeds GPT-4o and Claude-3.5-Sonnet.
-
Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards
PDDL planning problems are used to generate about one million precise reasoning steps for training Process Reward Models, and adding this data to existing datasets improves LLM performance on both mathematical and non...
-
Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale
Relax is a new RL training engine with omni-native design and async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.
-
TSUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context Distillation
TSUBASA improves long-horizon personalization in LLMs via dynamic memory evolution for writing and context-distillation self-learning for reading, outperforming Mem0 and Memory-R1 on Qwen-3 benchmarks while reducing t...
-
An Agentic AI Framework with Large Language Models and Chain-of-Thought for UAV-Assisted Logistics Scheduling with Mobile Edge Computing
An agentic AI framework with LLMs generates formulations for coupled UAV product collection and MEC task scheduling, solved by hierarchical PPO that reaches 99.6% collection success and 100% deadline compliance in sim...
-
MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading
MemReread improves agent long-context reasoning by triggering rereading on insufficient final memory to recover discarded indirect facts, outperforming baselines at linear complexity.
-
SiriusHelper: An LLM Agent-Based Operations Assistant for Big Data Platforms
SiriusHelper deploys an LLM agent with intent routing, DeepSearch multi-hop retrieval, and automated SOP distillation to outperform alternatives and reduce ticket volume by 20.8% on Tencent's big data platform.
-
Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering
LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.
-
Towards Robust Endogenous Reasoning: Unifying Drift Adaptation in Non-Stationary Tuning
CPO++ adapts reinforcement fine-tuning of MLLMs to endogenous multi-modal concept drift through counterfactual reasoning and preference optimization, yielding better coherence and cross-domain robustness in safety-cri...
-
StaRPO: Stability-Augmented Reinforcement Policy Optimization
StaRPO improves LLM reasoning by adding autocorrelation function and path efficiency stability metrics to RL policy optimization, yielding higher accuracy and fewer logic errors on reasoning benchmarks.
-
OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks
OpenVLThinkerV2 applies a new Gaussian GRPO training objective with response and entropy shaping to outperform prior open-source and proprietary models on 18 visual reasoning benchmarks.
-
Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Framework
A 16-factor structured prompt framework strengthens CoT reasoning in LLMs for security analysis, yielding up to 40% reasoning gains in smaller models and stable accuracy improvements validated by human raters with Coh...
-
Pragmos: A Process Agentic Modeling System
Pragmos is a hybrid interactive system that decomposes process modeling into explainable steps using LLMs augmented by behavioral-relation tools to produce sound and comprehensible models.
-
A Survey of Context Engineering for Large Language Models
The survey organizes Context Engineering into retrieval, processing, management, and integrated systems like RAG and multi-agent setups while identifying an asymmetry where LLMs handle complex inputs well but struggle...
-
Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs
EngGPT2MoE-16B-A3B matches or beats other Italian models on most international benchmarks but trails top international models such as GPT-5 nano and Qwen3-8B.
Reference graph
Works this paper leans on
-
[1]
Medec: A benchmark for medical error detection and correction in clinical notes
Asma Ben Abacha, Wen-wai Yim, Yujuan Fu, Zhaoyi Sun, Meliha Yetisgen, Fei Xia, and Thomas Lin. Medec: A benchmark for medical error detection and correction in clinical notes. arXiv preprint arXiv:2412.19260, 2024
-
[2]
Inference scaling vs reasoning: An empirical analysis of compute-optimal llm problem-solving
Marwan AbdElhameed and Pavly Halim. Inference scaling vs reasoning: An empirical analysis of compute-optimal llm problem-solving. arXiv preprint arXiv:2412.16260, 2024
-
[3]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023
-
[4]
Nemotron-4 340b technical report
Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, et al. Nemotron-4 340b technical report. arXiv preprint arXiv:2406.11704, 2024
-
[5]
The unreasonable effectiveness of entropy minimization in llm reasoning
Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. The unreasonable effectiveness of entropy minimization in llm reasoning. arXiv preprint arXiv:2505.15134, 2025
-
[6]
L1: Controlling how long a reasoning model thinks with reinforcement learning
Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning. arXiv preprint arXiv:2503.04697, 2025
-
[7]
Opencodereasoning: Advancing data distillation for competitive coding
Wasi Uddin Ahmad, Sean Narenthiran, Somshubra Majumdar, Aleksander Ficek, Siddhartha Jain, Jocelyn Huang, Vahid Noroozi, and Boris Ginsburg. Opencodereasoning: Advancing data distillation for competitive coding. arXiv preprint arXiv:2504.01943, 2025
- [8]
- [9]
-
[10]
Alon Albalak, Duy Phung, Nathan Lile, Rafael Rafailov, Kanishk Gandhi, Louis Castricato, Anikait Singh, Chase Blagden, Violet Xiang, Dakota Mahan, and Nick Haber. Big-math: A large-scale, high-quality math dataset for reinforcement learning in language models, 2025
-
[11]
Reasoning on a budget: A survey of adaptive and controllable test-time compute in llms
Mohammad Ali Alomrani, Yingxue Zhang, Derek Li, Qianyi Sun, Soumyasundar Pal, Zhanguang Zhang, Yaochen Hu, Rohan Deepak Ajwani, Antonios Valkanas, Raika Karimi, et al. Reasoning on a budget: A survey of adaptive and controllable test-time compute in llms. arXiv preprint arXiv:2507.02076, 2025
-
[12]
Lower bounds for chain-of-thought reasoning in hard-attention transformers
Alireza Amiri, Xinting Huang, Mark Rofin, and Michael Hahn. Lower bounds for chain-of-thought reasoning in hard-attention transformers. arXiv preprint arXiv:2502.02393, 2025
-
[13]
Concrete Problems in AI Safety
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016
-
[14]
Learning from mistakes makes llm better reasoner
Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, Jian-Guang Lou, and Weizhu Chen. Learning from mistakes makes llm better reasoner. arXiv preprint arXiv:2310.20689, 2023
-
[15]
Phd knowledge not required: A reasoning challenge for large language models
Carolyn Jane Anderson, Joydeep Biswas, Aleksander Boruch-Gruszecki, Federico Cassano, Molly Q Feldman, Arjun Guha, Francesca Lucchetti, and Zixuan Wu. Phd knowledge not required: A reasoning challenge for large language models. arXiv preprint arXiv:2502.01584, 2025
-
[16]
Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023
-
[17]
Critique-out-loud reward models
Zachary Ankner, Mansheej Paul, Brandon Cui, Jonathan Daniel Chang, and Prithviraj Ammanabrolu. Critique-out-loud reward models. In Pluralistic Alignment Workshop at NeurIPS 2024, October 2024. URL https://openreview.net/forum?id=CljYUvIlRW
-
[18]
Thinking fast and slow with deep learning and tree search
Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. Advances in neural information processing systems, 30, December 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/d8e1344e27a5b08cdfd5d027d9b8d6de-Paper.pdf
-
[19]
The claude 3 model family: Opus, sonnet, haiku
AI Anthropic. The claude 3 model family: Opus, sonnet, haiku. Claude-3 Model Card, 1:1, 2024. URL https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf
-
[20]
Roberto Araya. Do chains-of-thoughts of large language models suffer from hallucinations, cognitive biases, or phobias in bayesian reasoning? arXiv preprint arXiv:2503.15268, 2025
-
[21]
Beyond Exponential Decay: Rethinking Error Accumulation in Large Language Models
Mikhail L Arbuzov, Alexey A Shvets, and Sisong Beir. Beyond exponential decay: Rethinking error accumulation in large language models. arXiv preprint arXiv:2505.24187, 2025
-
[22]
Training language models to reason efficiently
Daman Arora and Andrea Zanette. Training language models to reason efficiently. arXiv preprint arXiv:2502.04463, 2025
-
[23]
Early external safety testing of openai’s o3-mini: Insights from the pre-deployment evaluation
Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, and Sergio Segura. Early external safety testing of openai’s o3-mini: Insights from the pre-deployment evaluation. arXiv preprint arXiv:2501.17749, 2025
-
[24]
o3-mini vs deepseek-r1: Which one is safer?
Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, and Sergio Segura. o3-mini vs deepseek-r1: Which one is safer? arXiv preprint arXiv:2501.18438, 2025
-
[25]
Language models can predict their own behavior
Dhananjay Ashok and Jonathan May. Language models can predict their own behavior. arXiv preprint arXiv:2502.13329, 2025
-
[26]
Llemma: An open language model for mathematics
Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen Marcus McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An open language model for mathematics. In The Twelfth International Conference on Learning Representations, January 2024. URL https://openreview.net/forum?id=4WnqRR915j
-
[27]
Alisson Azzolini, Hannah Brandon, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Francesco Ferroni, Rama Govindaraju, et al. Cosmos-reason1: From physical common sense to embodied reasoning. arXiv preprint arXiv:2503.15558, 2025
-
[28]
The lookahead limitation: Why multi-operand addition is hard for llms
Tanja Baeumel, Josef van Genabith, and Simon Ostermann. The lookahead limitation: Why multi-operand addition is hard for llms. arXiv preprint arXiv:2502.19981, 2025
-
[29]
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022
-
[30]
Monitoring reasoning models for misbehavior and the risks of promoting obfuscation
Bowen Baker, Joost Huizinga, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. March 2025. URL https://openai.com/index/chain-of-thought-monitoring/
-
[31]
Vidhisha Balachandran, Jingya Chen, Lingjiao Chen, Shivam Garg, Neel Joshi, Yash Lara, John Langford, Besmira Nushi, Vibhav Vineet, Yue Wu, et al. Inference-time scaling for complex tasks: Where we stand and what lies ahead. arXiv preprint arXiv:2504.00294, 2025
-
[32]
Marthe Ballon, Andres Algaba, and Vincent Ginis. The relationship between reasoning and performance in large language models–o3 (mini) thinks harder, not longer. arXiv preprint arXiv:2502.15631, 2025
-
[33]
Thinking machines: A survey of llm based reasoning strategies
Dibyanayan Bandyopadhyay, Soham Bhattacharjee, and Asif Ekbal. Thinking machines: A survey of llm based reasoning strategies. arXiv preprint arXiv:2503.10814, 2025
-
[34]
Hritik Bansal, Arian Hosseini, Rishabh Agarwal, Vinh Q. Tran, and Mehran Kazemi. Smaller, weaker, yet better: Training LLM reasoners via compute-optimal sampling. In The 4th Workshop on Mathematical Reasoning and AI at NeurIPS’24, January 2025. URL https: //openreview.net/forum?id=HuYSURUxs2
-
[35]
Learning to stop overthinking at test time
Hieu Tran Bao, Nguyen Cong Dat, Nguyen Duc Anh, and Hoang Thanh Tung. Learning to stop overthinking at test time. arXiv preprint arXiv:2502.10954, 2025
-
[36]
Teaching llm to reason: Reinforcement learning from algorithmic problems without code
Keqin Bao, Nuo Chen, Xiaoyuan Li, Binyuan Hui, Bowen Yu, Fuli Feng, Junyang Lin, Xiangnan He, and Dayiheng Liu. Teaching llm to reason: Reinforcement learning from algorithmic problems without code. arXiv preprint arXiv:2507.07498, 2025
-
[37]
Qiming Bao, Alex Yuxuan Peng, Tim Hartill, Neset Tan, Zhenyun Deng, Michael Witbrock, and Jiamou Liu. Multi-step deductive reasoning over natural language: An empirical study on out-of-distribution generalisation. arXiv preprint arXiv:2207.14000, 2022
-
[38]
Qiming Bao, Gael Gendron, Alex Yuxuan Peng, Wanjun Zhong, Neset Tan, Yang Chen, Michael Witbrock, and Jiamou Liu. Assessing and enhancing the robustness of large language models with task structure variations for logical reasoning. arXiv preprint arXiv:2310.09430, 2023
-
[39]
Contrastive learning with logic-driven data augmentation for logical reasoning over text
Qiming Bao, Alex Yuxuan Peng, Zhenyun Deng, Wanjun Zhong, Neset Tan, Nathan Young, Yang Chen, Yonghua Zhu, Michael Witbrock, and Jiamou Liu. Contrastive learning with logic-driven data augmentation for logical reasoning over text. arXiv preprint arXiv:2305.12599, 2023
-
[40]
Abstract Meaning Representation-based logic-driven data augmentation for logical reasoning
Qiming Bao, Alex Peng, Zhenyun Deng, Wanjun Zhong, Gael Gendron, Timothy Pistotti, Neset Tan, Nathan Young, Yang Chen, Yonghua Zhu, Paul Denny, Michael Witbrock, and Jiamou Liu. Abstract Meaning Representation-based logic-driven data augmentation for logical reasoning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Associa- ...
-
[41]
Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.353. URL https://aclanthology.org/2024.findings-acl.353/
-
[42]
Qiming Bao, Juho Leinonen, Alex Yuxuan Peng, Wanjun Zhong, Gaël Gendron, Timothy Pistotti, Alice Huang, Paul Denny, Michael Witbrock, and Jiamou Liu. Exploring iterative enhancement for improving learnersourced multiple-choice question explanations with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pag...
-
[43]
Brian R Bartoldson, Siddarth Venkatraman, James Diffenderfer, Moksh Jain, Tal Ben-Nun, Seanie Lee, Minsu Kim, Johan Obando-Ceron, Yoshua Bengio, and Bhavya Kailkhura. Trajectory balance with asynchrony: Decoupling exploration and learning for fast, scalable llm post-training. arXiv preprint arXiv:2503.18929, 2025
-
[44]
Requirements ambiguity detection and ex- planation with llms: An industrial study
Sarmad Bashir, Alessio Ferrari, Abbas Khan, Per Erik Strandberg, Zulqarnain Haider, Mehrdad Saadatmand, and Markus Bohlin. Requirements ambiguity detection and ex- planation with llms: An industrial study. July 2025
-
[45]
Titans: Learning to memorize at test time
Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. arXiv preprint arXiv:2501.00663, 2024
-
[46]
Superintelligent agents pose catastrophic risks: Can scientist ai offer a safer path?
Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, Sören Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, et al. Superintelligent agents pose catastrophic risks: Can scientist ai offer a safer path? arXiv preprint arXiv:2502.15657, 2025
-
[47]
International ai safety report
Yoshua Bengio, Sören Mindermann, Daniel Privitera, Tamay Besiroglu, Rishi Bommasani, Stephen Casper, Yejin Choi, Philip Fox, Ben Garfinkel, Danielle Goldfarb, et al. International ai safety report. arXiv preprint arXiv:2501.17805, 2025
-
[48]
Leonardo Bertolazzi, Philipp Mondorf, Barbara Plank, and Raffaella Bernardi. The validation gap: A mechanistic analysis of how language models compute arithmetic but fail to validate it. arXiv preprint arXiv:2502.11771, 2025
-
[49]
Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. Graph of thoughts: Solving elaborate problems with large language models. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):17682–17690, March 2024
-
[50]
In: Wooldridge, M.J., Dy, J.G., Natarajan, S. doi: 10.1609/aaai.v38i16.29720. URL https://ojs.aaai.org/index.php/AAAI/article/view/29720
-
[51]
Maciej Besta, Florim Memedi, Zhenyu Zhang, Robert Gerstenberger, Guangyuan Piao, Nils Blach, Piotr Nyczyk, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, et al. Demystifying chains, trees, and graphs of thoughts. arXiv preprint arXiv:2401.14295, 2024
-
[52]
Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, et al. Reasoning language models: A blueprint. arXiv preprint arXiv:2501.11223, 2025
-
[53]
Jinhe Bi, Danqi Yan, Yifan Wang, Wenke Huang, Haokun Chen, Guancheng Wan, Mang Ye, Xun Xiao, Hinrich Schuetze, Volker Tresp, et al. Cot-kinetics: A theoretical modeling assessing lrm reasoning process. arXiv preprint arXiv:2505.13408, 2025
-
[54]
Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024
-
[55]
Zhen Bi, Ningyu Zhang, Yinuo Jiang, Shumin Deng, Guozhou Zheng, and Huajun Chen. When do program-of-thought works for reasoning? In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17691–17699, 2024. URL https://ojs.aaai.org/index.php/AAAI/article/view/29721/31237
-
[56]
Zhenni Bi, Kai Han, Chuanjian Liu, Yehui Tang, and Yunhe Wang. Forest-of-thought: Scaling test-time compute for enhancing llm reasoning. arXiv preprint arXiv:2412.09078, 2024
-
[57]
Edoardo Botta, Yuchen Li, Aashay Mehta, Jordan T Ash, Cyril Zhang, and Andrej Risteski. On the query complexity of verifier-assisted language generation. arXiv preprint arXiv:2502.12123, 2025
-
[58]
David Brandfonbrener, Simon Henniger, Sibi Raja, Tarun Prasad, Chloe Loughridge, Federico Cassano, Sabrina Ruixin Hu, Jianang Yang, William E Byrd, Robert Zinkov, et al. Vermcts: Synthesizing multi-step programs using a verifier, a large language model, and tree search. arXiv preprint arXiv:2402.08147, 2024
-
[59]
Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024
-
[60]
Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, and Russ Webb. Distillation scaling laws. arXiv preprint arXiv:2502.08606, 2025
-
[61]
Ji Young Byun, Young-Jin Park, Navid Azizan, and Rama Chellappa. Test-time-scaling for zero-shot diagnosis with visual-language reasoning. arXiv preprint arXiv:2506.11166, 2025
-
[62]
Ju-Seung Byun, Jiyun Chun, Jihyung Kil, and Andrew Perrault. ARES: Alternating reinforcement learning and supervised fine-tuning for enhanced multi-modal chain-of-thought reasoning through diverse AI feedback. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Proce...
-
[63]
Huanqia Cai, Yijun Yang, and Zhifeng Li. System-2 mathematical reasoning via enriched instruction tuning. arXiv preprint arXiv:2412.16964, 2024
-
[64]
Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297, 2024
-
[65]
Erik Cambria, Lorenzo Malandri, Fabio Mercorio, Navid Nobani, and Andrea Seveso. Xai meets llms: A survey of the relation between explainable ai and large language models. arXiv preprint arXiv:2407.15248, 2024
-
[66]
Lang Cao. GraphReason: Enhancing reasoning capabilities of large language models through a graph-based verification approach. In Bhavana Dalvi Mishra, Greg Durrett, Peter Jansen, Ben Lipkin, Danilo Neves Ribeiro, Lionel Wong, Xi Ye, and Wenting Zhao, editors, Proceedings of the 2nd Workshop on Natural Language Reasoning and Structured Explanations (@ACL...
-
[67]
Zhepeng Cen, Yihang Yao, William Han, Zuxin Liu, and Ding Zhao. Behavior injection: Preparing language models for reinforcement learning. arXiv preprint arXiv:2505.18917, 2025
-
[68]
Linzheng Chai, Jian Yang, Tao Sun, Hongcheng Guo, Jiaheng Liu, Bing Wang, Xiannian Liang, Jiaqi Bai, Tongliang Li, Qiyao Peng, et al. xcot: Cross-lingual instruction tuning for cross-lingual chain-of-thought reasoning. arXiv preprint arXiv:2401.07037, 2024
-
[69]
Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, et al. Mle-bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095, 2024
-
[70]
Hyeong Soo Chang. On the convergence rate of mcts for the optimal value estimation in markov decision processes. IEEE Transactions on Automatic Control, pages 1–6, February 2025
-
[71]
doi: 10.1109/TAC.2025.3538807. URL https://ieeexplore.ieee.org/document/10870057
-
[72]
Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025
-
[73]
Andong Chen, Yuchen Song, Wenxin Zhu, Kehai Chen, Muyun Yang, Tiejun Zhao, et al. Evaluating o1-like llms: Unlocking reasoning for translation through comprehensive analysis. arXiv preprint arXiv:2502.11544, 2025
-
[74]
Beiduo Chen, Yang Janet Liu, Anna Korhonen, and Barbara Plank. Threading the needle: Reweaving chain-of-thought reasoning to explain human label variation. arXiv preprint arXiv:2505.23368, 2025
-
[75]
Guizhen Chen, Weiwen Xu, Hao Zhang, Hou Pong Chan, Chaoqun Liu, Lidong Bing, Deli Zhao, Anh Tuan Luu, and Yu Rong. Finereason: Evaluating and improving llms’ deliberate reasoning through reflective puzzle solving. arXiv preprint arXiv:2502.20238, 2025
-
[76]
Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Step-level value preference optimization for mathematical reasoning. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7889–7903, Miami, Florida, USA, November 2024. Association for Computational Linguistics. d...
-
[77]
Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Alphamath almost zero: Process supervision without process. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, September 2024. URL https://openreview.net/forum?id=VaXnxQ3UKo
-
[78]
Haibin Chen, Kangtao Lv, Chengwei Hu, Yanshi Li, Yujin Yuan, Yancheng He, Xingyao Zhang, Langming Liu, Shilei Liu, Wenbo Su, et al. Chineseecomqa: A scalable e-commerce concept evaluation benchmark for large language models. arXiv preprint arXiv:2502.20196, 2025
-
[79]
Hanjie Chen, Zhouxiang Fang, Yash Singla, and Mark Dredze. Benchmarking large language models on answering and explaining challenging medical questions. arXiv preprint arXiv:2402.18060, 2024
-
[80]
Haolin Chen, Yihao Feng, Zuxin Liu, Weiran Yao, Akshara Prabhakar, Shelby Heinecke, Ricky Ho, Phil Mui, Silvio Savarese, Caiming Xiong, et al. Language models are hidden reasoners: Unlocking latent reasoning capabilities via self-rewarding. arXiv preprint arXiv:2411.04282, 2024