pith. machine review for the scientific record.

arxiv: 2503.09567 · v5 · submitted 2025-03-12 · 💻 cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

Dengyun Peng, Jiannan Guan, Jinhao Liu, Libo Qin, Mengkang Hu, Peng Wang, Qiguang Chen, Te Gao, Wanxiang Che, Yuhang Zhou

Pith reviewed 2026-05-12 08:35 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords long chain-of-thought · reasoning large language models · deep reasoning · extensive exploration · feasible reflection · overthinking · inference-time scaling · reasoning taxonomy

The pith

Long CoT with deep reasoning, exploration, and reflection enables LLMs to solve more complex tasks than short chains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey separates long chain-of-thought reasoning from traditional short chain-of-thought in large language models. It introduces a taxonomy that groups reasoning approaches around three defining traits: deep reasoning, extensive exploration, and feasible reflection. These traits let models manage intricate problems in mathematics and coding while producing more coherent and efficient outputs. The work also examines how Long CoT appears in practice through phenomena such as overthinking and inference-time scaling. It closes by listing open gaps and future paths including multi-modal extensions and efficiency improvements.

Core claim

The survey establishes that Long CoT, distinguished from Short CoT by its incorporation of deep reasoning, extensive exploration, and feasible reflection, equips reasoning large language models to address more complex tasks and deliver more efficient, coherent outcomes.

What carries the argument

A novel taxonomy that organizes reasoning paradigms by contrasting Long CoT against Short CoT, anchored in the three characteristics of deep reasoning, extensive exploration, and feasible reflection.
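As an editorial illustration (not the paper's formalism), the three traits can be read as orthogonal axes of a reasoning trace. The field names and numeric thresholds below are hypothetical, chosen only to make the contrast with Short CoT concrete:

```python
from dataclasses import dataclass

@dataclass
class ReasoningTrace:
    steps: int      # depth: how many reasoning steps the chain takes
    branches: int   # exploration: how many alternative paths are tried
    revisions: int  # reflection: how many self-correction events occur

def classify(trace: ReasoningTrace) -> str:
    """Toy rule: a trace counts as Long CoT only when it shows all three
    traits; the numeric thresholds are invented for illustration."""
    deep = trace.steps >= 10
    exploratory = trace.branches >= 2
    reflective = trace.revisions >= 1
    return "Long CoT" if (deep and exploratory and reflective) else "Short CoT"

print(classify(ReasoningTrace(steps=24, branches=3, revisions=2)))  # Long CoT
print(classify(ReasoningTrace(steps=4, branches=1, revisions=0)))   # Short CoT
```

Treating the traits as jointly required, rather than any one alone, mirrors the survey's framing that Long CoT is defined by their combination.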

If this is right

  • Models equipped with Long CoT can tackle more intricate problems in domains such as mathematics and coding than those restricted to Short CoT.
  • The emergence of Long CoT produces observable effects including overthinking and gains from inference-time scaling.
  • Future systems can build on the taxonomy to integrate multi-modal reasoning and refine knowledge frameworks for greater efficiency.
  • A shared structure for these processes supports clearer progress in developing logical reasoning capabilities for AI.
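The inference-time scaling point can be made concrete with a minimal self-consistency sketch: sample several reasoning chains and majority-vote their final answers. The `sample_answer` stub stands in for a real model call, and its 80% accuracy is an invented parameter, not a figure from the paper:

```python
import random
from collections import Counter

def sample_answer(question: str, rng: random.Random) -> str:
    """Stub for an LLM call: returns the right answer 80% of the time
    (an invented rate), otherwise a random distractor."""
    return "42" if rng.random() < 0.8 else str(rng.randint(0, 99))

def self_consistency(question: str, n_samples: int, seed: int = 0) -> str:
    """Spend an inference-time budget on N sampled chains, then majority-vote."""
    rng = random.Random(seed)
    votes = Counter(sample_answer(question, rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

# A larger sampling budget sharpens the vote without changing model weights.
print(self_consistency("What is 6 * 7?", n_samples=25))
```

The budget-shifting implication above follows directly: `n_samples` is a dial that trades serving cost for reliability on a fixed, possibly smaller, model.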

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The taxonomy could serve as a basis for creating standardized benchmarks that measure depth of exploration separately from final accuracy.
  • Techniques to detect and shorten unnecessary reflection steps might reduce overthinking while preserving the benefits of Long CoT.
  • Inference-time scaling implies that deployment budgets could shift from larger models to longer allowed thinking time on smaller ones.
  • Combining the survey's framework with visual inputs could create unified reasoning systems that handle both text and image-based problems.

Load-bearing premise

The differences between Long CoT and Short CoT are fundamental enough to support a clean taxonomy without substantial overlap or missing categories.

What would settle it

An empirical result in which Short CoT models achieve equal or better performance than Long CoT models on complex tasks without added mechanisms, or a new reasoning approach that blends both forms in a way the taxonomy cannot classify.

read the original abstract

Recent advancements in reasoning with large language models (RLLMs), such as OpenAI-O1 and DeepSeek-R1, have demonstrated their impressive capabilities in complex domains like mathematics and coding. A central factor in their success lies in the application of long chain-of-thought (Long CoT) characteristics, which enhance reasoning abilities and enable the solution of intricate problems. However, despite these developments, a comprehensive survey on Long CoT is still lacking, limiting our understanding of its distinctions from traditional short chain-of-thought (Short CoT) and complicating ongoing debates on issues like "overthinking" and "inference-time scaling." This survey seeks to fill this gap by offering a unified perspective on Long CoT. (1) We first distinguish Long CoT from Short CoT and introduce a novel taxonomy to categorize current reasoning paradigms. (2) Next, we explore the key characteristics of Long CoT: deep reasoning, extensive exploration, and feasible reflection, which enable models to handle more complex tasks and produce more efficient, coherent outcomes compared to the shallower Short CoT. (3) We then investigate key phenomena such as the emergence of Long CoT with these characteristics, including overthinking, and inference-time scaling, offering insights into how these processes manifest in practice. (4) Finally, we identify significant research gaps and highlight promising future directions, including the integration of multi-modal reasoning, efficiency improvements, and enhanced knowledge frameworks. By providing a structured overview, this survey aims to inspire future research and further the development of logical reasoning in artificial intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript is a survey paper on long chain-of-thought (Long CoT) for reasoning large language models (RLLMs). It distinguishes Long CoT from traditional short CoT, introduces a novel taxonomy for current reasoning paradigms, examines key characteristics of Long CoT (deep reasoning, extensive exploration, and feasible reflection), analyzes phenomena including overthinking and inference-time scaling, and identifies research gaps with future directions such as multi-modal integration, efficiency improvements, and enhanced knowledge frameworks.

Significance. If the distinctions and taxonomy are robust, this work provides a timely organizational synthesis for the emerging 'reasoning era' in LLMs, building on models like o1 and R1. The structured overview of characteristics, phenomena, and gaps can serve as a useful reference to guide research on complex task handling and coherence, with the explicit identification of future directions as a particular strength of the synthesis.

major comments (2)
  1. [§3 (Taxonomy)] The central claim that the proposed taxonomy comprehensively categorizes reasoning paradigms without significant overlap or omission is load-bearing for the unified perspective but lacks explicit validation; the manuscript should add a systematic mapping or comparison table against prior CoT/reasoning taxonomies to demonstrate completeness and novelty.
  2. [§4 (Characteristics)] The assertion that Long CoT characteristics enable 'more efficient, coherent outcomes' compared to Short CoT is presented as a key distinction, yet the survey does not quantify or cite specific metrics (e.g., token efficiency or error rates) across representative models to support this over the qualitative description.
minor comments (3)
  1. [Abstract and §1] The phrasing 'feasible reflection' is introduced without an immediate operational definition or example; add a brief clarifying sentence or footnote on first use.
  2. [§5 (Phenomena)] The discussion of 'overthinking' would benefit from a dedicated subsection or table listing observed instances across models (e.g., o1 vs. R1) to improve readability and allow direct comparison.
  3. [References] Several foundational CoT papers (pre-2023) appear underrepresented relative to recent model-specific works; expand the citation list for balance in the literature synthesis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation of minor revision. The comments on validating the taxonomy and strengthening the characteristics section are helpful. We address each point below, indicating revisions where made.

read point-by-point responses
  1. Referee: [§3 (Taxonomy)] The central claim that the proposed taxonomy comprehensively categorizes reasoning paradigms without significant overlap or omission is load-bearing for the unified perspective but lacks explicit validation; the manuscript should add a systematic mapping or comparison table against prior CoT/reasoning taxonomies to demonstrate completeness and novelty.

    Authors: We agree that an explicit validation strengthens the taxonomy's contribution. In the revised manuscript, we have added a new Table 2 that provides a systematic mapping of our taxonomy categories (deep reasoning, extensive exploration, feasible reflection, and related paradigms) against prior CoT and reasoning taxonomies, including those from Wei et al. (2022), Kojima et al. (2022), and recent surveys on LLM reasoning. The table highlights distinctions, such as our coverage of inference-time scaling and reflection in Long CoT models like o1 and R1, while noting areas of overlap and confirming no major omissions in current paradigms. This addition directly addresses the request for demonstrated completeness and novelty. revision: yes

  2. Referee: [§4 (Characteristics)] The assertion that Long CoT characteristics enable 'more efficient, coherent outcomes' compared to Short CoT is presented as a key distinction, yet the survey does not quantify or cite specific metrics (e.g., token efficiency or error rates) across representative models to support this over the qualitative description.

    Authors: The referee is correct that the current version relies on qualitative description for the efficiency and coherence claims. As a survey, we do not conduct new experiments, but we have revised Section 4 to incorporate citations to empirical studies on models such as OpenAI-o1 and DeepSeek-R1. These include reported metrics on benchmark accuracy (e.g., MATH, GSM8K), token usage comparisons showing longer but more effective reasoning paths, and reduced error rates in complex tasks relative to short CoT baselines. This provides concrete support for the distinction while remaining within survey scope. We can further expand the citations if additional references are suggested. revision: partial

Circularity Check

0 steps flagged

No significant circularity; literature synthesis without derivations

full rationale

This survey paper organizes existing literature on Long CoT versus Short CoT in reasoning LLMs, drawing distinctions and proposing a taxonomy based on observed characteristics from models such as OpenAI-O1 and DeepSeek-R1. No mathematical derivations, equations, fitted parameters, or quantitative predictions appear that could reduce to the paper's own inputs by construction. Claims about deep reasoning, exploration, reflection, overthinking, and inference-time scaling are presented as syntheses of prior work rather than self-referential definitions or load-bearing self-citations. The structure remains self-contained as an organizational review with no steps that equate outputs to inputs via ansatz, renaming, or uniqueness theorems imported from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The survey rests on standard domain assumptions from LLM reasoning literature and introduces no new free parameters or invented entities; its main addition is the taxonomy itself.

axioms (1)
  • domain assumption Long CoT is meaningfully distinct from Short CoT and enables superior handling of complex tasks
    Central framing of the survey as stated in the abstract

pith-pipeline@v0.9.0 · 5614 in / 1124 out tokens · 58867 ms · 2026-05-12T08:35:25.655145+00:00 · methodology

discussion (0)


Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Unsupervised Process Reward Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.

  2. CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation

    cs.CL 2026-05 unverdicted novelty 7.0

    CA-SQL achieves 51.72% execution accuracy on the challenging tier of the BIRD benchmark using GPT-4o-mini by scaling exploration breadth according to estimated task difficulty, evolutionary prompt seeding, and candida...

  3. Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

    cs.AI 2026-05 conditional novelty 7.0

    Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

  4. Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis

    cs.CL 2026-04 unverdicted novelty 7.0

    DataPRM is a new process reward model for data analysis agents that detects silent errors via environment interaction and ternary rewards, yielding 7-11% gains on benchmarks and further RL improvements.

  5. ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation

    cs.CL 2026-04 unverdicted novelty 7.0

    ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.

  6. Video-R1: Reinforcing Video Reasoning in MLLMs

    cs.CV 2025-03 conditional novelty 7.0

    Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.

  7. STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes

    cs.CL 2026-05 unverdicted novelty 6.0

    STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage 19-42% with little accuracy loss on math benchmarks.

  8. Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

    cs.CV 2026-05 unverdicted novelty 6.0

    SCOLAR addresses information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens from LLM hidden states, extending acceptable CoT length over 30x and achieving +14.12% gains on b...

  9. RuPLaR : Efficient Latent Compression of LLM Reasoning Chains with Rule-Based Priors From Multi-Step to One-Step

    cs.CL 2026-05 unverdicted novelty 6.0

    RuPLaR replaces multi-step latent CoT with a single-model one-step generator guided by rule-based priors and a joint consistency-plus-alignment loss, delivering 11.1 percent higher accuracy at lower token cost.

  10. AIPO: Learning to Reason from Active Interaction

    cs.CL 2026-05 unverdicted novelty 6.0

    AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...

  11. Training Transformers for KV Cache Compressibility

    cs.LG 2026-05 unverdicted novelty 6.0

    Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.

  12. Training Transformers for KV Cache Compressibility

    cs.LG 2026-05 unverdicted novelty 6.0

    KV compressibility is a property of learned transformer representations that can be improved by training with KV sparsification, leading to better quality-budget tradeoffs in downstream compression for retrieval, QA, ...

  13. Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding

    cs.AI 2026-05 unverdicted novelty 6.0

    CoRD uses collaborative multi-teacher step-wise decoding with perplexity-guided beam search to generate higher-quality Long-CoT data that lets smaller models reach near-teacher performance with less supervision.

  14. VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model

    cs.RO 2026-05 unverdicted novelty 6.0

    VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.

  15. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  16. OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model

    cs.CV 2026-04 unverdicted novelty 6.0

    OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.

  17. SAVOIR: Learning Social Savoir-Faire via Shapley-based Reward Attribution

    cs.AI 2026-04 unverdicted novelty 6.0

    SAVOIR combines prospective expected utility valuation with Shapley values for fair credit assignment in social dialogue RL, achieving SOTA on SOTOPIA where a 7B model matches or exceeds GPT-4o and Claude-3.5-Sonnet.

  18. Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards

    cs.CL 2026-04 unverdicted novelty 6.0

    PDDL planning problems are used to generate about one million precise reasoning steps for training Process Reward Models, and adding this data to existing datasets improves LLM performance on both mathematical and non...

  19. Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale

    cs.CL 2026-04 unverdicted novelty 6.0

    Relax is a new RL training engine with omni-native design and async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.

  20. TSUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context Distillation

    cs.CL 2026-04 unverdicted novelty 6.0

    TSUBASA improves long-horizon personalization in LLMs via dynamic memory evolution for writing and context-distillation self-learning for reading, outperforming Mem0 and Memory-R1 on Qwen-3 benchmarks while reducing t...

  21. An Agentic AI Framework with Large Language Models and Chain-of-Thought for UAV-Assisted Logistics Scheduling with Mobile Edge Computing

    cs.AI 2026-05 unverdicted novelty 5.0

    An agentic AI framework with LLMs generates formulations for coupled UAV product collection and MEC task scheduling, solved by hierarchical PPO that reaches 99.6% collection success and 100% deadline compliance in sim...

  22. MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading

    cs.CL 2026-05 unverdicted novelty 5.0

    MemReread improves agent long-context reasoning by triggering rereading on insufficient final memory to recover discarded indirect facts, outperforming baselines at linear complexity.

  23. SiriusHelper: An LLM Agent-Based Operations Assistant for Big Data Platforms

    cs.DB 2026-04 unverdicted novelty 5.0

    SiriusHelper deploys an LLM agent with intent routing, DeepSearch multi-hop retrieval, and automated SOP distillation to outperform alternatives and reduce ticket volume by 20.8% on Tencent's big data platform.

  24. Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering

    cs.SE 2026-04 unverdicted novelty 5.0

    LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.

  25. Towards Robust Endogenous Reasoning: Unifying Drift Adaptation in Non-Stationary Tuning

    cs.LG 2026-04 unverdicted novelty 5.0

    CPO++ adapts reinforcement fine-tuning of MLLMs to endogenous multi-modal concept drift through counterfactual reasoning and preference optimization, yielding better coherence and cross-domain robustness in safety-cri...

  26. StaRPO: Stability-Augmented Reinforcement Policy Optimization

    cs.AI 2026-04 unverdicted novelty 5.0

    StaRPO improves LLM reasoning by adding autocorrelation function and path efficiency stability metrics to RL policy optimization, yielding higher accuracy and fewer logic errors on reasoning benchmarks.

  27. OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

    cs.CV 2026-04 unverdicted novelty 5.0

    OpenVLThinkerV2 applies a new Gaussian GRPO training objective with response and entropy shaping to outperform prior open-source and proprietary models on 18 visual reasoning benchmarks.

  28. Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Framework

    cs.CR 2026-04 unverdicted novelty 5.0

    A 16-factor structured prompt framework strengthens CoT reasoning in LLMs for security analysis, yielding up to 40% reasoning gains in smaller models and stable accuracy improvements validated by human raters with Coh...

  29. Pragmos: A Process Agentic Modeling System

    cs.SE 2026-04 unverdicted novelty 4.0

    Pragmos is a hybrid interactive system that decomposes process modeling into explainable steps using LLMs augmented by behavioral-relation tools to produce sound and comprehensible models.

  30. A Survey of Context Engineering for Large Language Models

    cs.CL 2025-07 accept novelty 4.0

    The survey organizes Context Engineering into retrieval, processing, management, and integrated systems like RAG and multi-agent setups while identifying an asymmetry where LLMs handle complex inputs well but struggle...

  31. Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

    cs.CL 2026-05 unverdicted novelty 3.0

    EngGPT2MoE-16B-A3B matches or beats other Italian models on most international benchmarks but trails top international models such as GPT-5 nano and Qwen3-8B.

Reference graph

Works this paper leans on

299 extracted references · 299 canonical work pages · cited by 30 Pith papers · 30 internal anchors

  1. [1]

    Medec: A benchmark for medical error detection and correction in clinical notes

    Asma Ben Abacha, Wen-wai Yim, Yujuan Fu, Zhaoyi Sun, Meliha Yetisgen, Fei Xia, and Thomas Lin. Medec: A benchmark for medical error detection and correction in clinical notes. arXiv preprint arXiv:2412.19260, 2024

  2. [2]

    Inference scaling vs reasoning: An empirical analysis of compute-optimal llm problem-solving

    Marwan AbdElhameed and Pavly Halim. Inference scaling vs reasoning: An empirical analysis of compute-optimal llm problem-solving. arXiv preprint arXiv:2412.16260, 2024

  3. [3]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  4. [4]

    Nemotron-4 340b technical report

    Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, et al. Nemotron-4 340b technical report. arXiv preprint arXiv:2406.11704, 2024

  5. [5]

    The unreasonable effectiveness of entropy minimization in llm reasoning

    Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. The unreasonable effectiveness of entropy minimization in llm reasoning. arXiv preprint arXiv:2505.15134, 2025

  6. [6]

    L1: Controlling how long a reasoning model thinks with reinforcement learning

    Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning. arXiv preprint arXiv:2503.04697, 2025

  7. [7]

    Opencodereasoning: Advancing data distillation for competitive coding

    Wasi Uddin Ahmad, Sean Narenthiran, Somshubra Majumdar, Aleksander Ficek, Siddhartha Jain, Jocelyn Huang, Vahid Noroozi, and Boris Ginsburg. Opencodereasoning: Advancing data distillation for competitive coding. arXiv preprint arXiv:2504.01943, 2025

  8. [8]

    Aime 2024

    AI-MO. Aime 2024. https://huggingface.co/datasets/AI-MO/aimo-validation-aime, July 2024

  9. [9]

    Amc 2023

    AI-MO. Amc 2023. https://huggingface.co/datasets/AI-MO/aimo-validation-amc, July 2024

  10. [10]

    Big-math: A large-scale, high-quality math dataset for reinforcement learning in language models

    Alon Albalak, Duy Phung, Nathan Lile, Rafael Rafailov, Kanishk Gandhi, Louis Castricato, Anikait Singh, Chase Blagden, Violet Xiang, Dakota Mahan, and Nick Haber. Big-math: A large-scale, high-quality math dataset for reinforcement learning in language models, 2025

  11. [11]

    Reasoning on a budget: A survey of adaptive and controllable test-time compute in llms

    Mohammad Ali Alomrani, Yingxue Zhang, Derek Li, Qianyi Sun, Soumyasundar Pal, Zhanguang Zhang, Yaochen Hu, Rohan Deepak Ajwani, Antonios Valkanas, Raika Karimi, et al. Reasoning on a budget: A survey of adaptive and controllable test-time compute in llms. arXiv preprint arXiv:2507.02076, 2025

  12. [12]

    Lower bounds for chain-of-thought reasoning in hard-attention transformers

    Alireza Amiri, Xinting Huang, Mark Rofin, and Michael Hahn. Lower bounds for chain-of- thought reasoning in hard-attention transformers. arXiv preprint arXiv:2502.02393, 2025

  13. [13]

    Concrete Problems in AI Safety

    Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016

  14. [14]

    Learning from mistakes makes llm better reasoner

    Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, Jian-Guang Lou, and Weizhu Chen. Learning from mistakes makes llm better reasoner. arXiv preprint arXiv:2310.20689, 2023

  15. [15]

    Phd knowledge not required: A reasoning challenge for large language models

    Carolyn Jane Anderson, Joydeep Biswas, Aleksander Boruch-Gruszecki, Federico Cassano, Molly Q Feldman, Arjun Guha, Francesca Lucchetti, and Zixuan Wu. Phd knowledge not required: A reasoning challenge for large language models. arXiv preprint arXiv:2502.01584, 2025

  16. [16]

    PaLM 2 Technical Report

    Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023

  17. [17]

    Critique-out-loud reward models

    Zachary Ankner, Mansheej Paul, Brandon Cui, Jonathan Daniel Chang, and Prithviraj Ammanabrolu. Critique-out-loud reward models. In Pluralistic Alignment Workshop at NeurIPS 2024, October 2024. URL https://openreview.net/forum?id=CljYUvIlRW

  18. [18]

    Thinking fast and slow with deep learning and tree search

    Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. Advances in Neural Information Processing Systems, 30, December 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/d8e1344e27a5b08cdfd5d027d9b8d6de-Paper.pdf

  19. [19]

    The claude 3 model family: Opus, sonnet, haiku

    AI Anthropic. The claude 3 model family: Opus, sonnet, haiku. Claude-3 Model Card, 1:1, 2024. URL https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

  20. [20]

    Do chains-of-thoughts of large language models suffer from hallucinations, cognitive biases, or phobias in bayesian reasoning?

    Roberto Araya. Do chains-of-thoughts of large language models suffer from hallucinations, cognitive biases, or phobias in bayesian reasoning? arXiv preprint arXiv:2503.15268, 2025

  21. [21]

    Beyond Exponential Decay: Rethinking Error Accumulation in Large Language Models

    Mikhail L Arbuzov, Alexey A Shvets, and Sisong Beir. Beyond exponential decay: Rethinking error accumulation in large language models. arXiv preprint arXiv:2505.24187, 2025

  22. [22]

    Training language models to reason efficiently

    Daman Arora and Andrea Zanette. Training language models to reason efficiently. arXiv preprint arXiv:2502.04463, 2025

  23. [23]

    Early external safety testing of openai’s o3-mini: Insights from the pre-deployment evaluation

    Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, and Sergio Segura. Early external safety testing of openai’s o3-mini: Insights from the pre-deployment evaluation. arXiv preprint arXiv:2501.17749, 2025

  24. [24]

    o3-mini vs deepseek-r1: Which one is safer?

    Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, and Sergio Segura. o3-mini vs deepseek-r1: Which one is safer? arXiv preprint arXiv:2501.18438, 2025

  25. [25]

    Language models can predict their own behavior

    Dhananjay Ashok and Jonathan May. Language models can predict their own behavior. arXiv preprint arXiv:2502.13329, 2025

  26. [26]

    Llemma: An open language model for mathematics

    Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen Marcus McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An open language model for mathematics. In The Twelfth International Conference on Learning Representations, January 2024. URL https://openreview.net/forum?id=4WnqRR915j

  27. [27]

    Cosmos-reason1: From physical common sense to embodied reasoning

    Alisson Azzolini, Hannah Brandon, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Francesco Ferroni, Rama Govindaraju, et al. Cosmos-reason1: From physical common sense to embodied reasoning. arXiv preprint arXiv:2503.15558, 2025

  28. [28]

    The lookahead limitation: Why multi-operand addition is hard for llms

    Tanja Baeumel, Josef van Genabith, and Simon Ostermann. The lookahead limitation: Why multi-operand addition is hard for llms. arXiv preprint arXiv:2502.19981, 2025

  29. [29]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022

  30. [30]

    Monitoring reasoning models for misbehavior and the risks of promoting obfuscation

    Bowen Baker, Joost Huizinga, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. March 2025. URL https://openai.com/index/chain-of-thought-monitoring/

  31. [31]

    Inference-time scaling for complex tasks: Where we stand and what lies ahead

    Vidhisha Balachandran, Jingya Chen, Lingjiao Chen, Shivam Garg, Neel Joshi, Yash Lara, John Langford, Besmira Nushi, Vibhav Vineet, Yue Wu, et al. Inference-time scaling for complex tasks: Where we stand and what lies ahead. arXiv preprint arXiv:2504.00294, 2025

  32. [32]

    The relationship between reasoning and performance in large language models–o3 (mini) thinks harder, not longer

    Marthe Ballon, Andres Algaba, and Vincent Ginis. The relationship between reasoning and performance in large language models–o3 (mini) thinks harder, not longer. arXiv preprint arXiv:2502.15631, 2025

  33. [33]

    Thinking machines: A survey of llm based reasoning strategies

    Dibyanayan Bandyopadhyay, Soham Bhattacharjee, and Asif Ekbal. Thinking machines: A survey of llm based reasoning strategies. arXiv preprint arXiv:2503.10814, 2025

  34. [34]

    Smaller, weaker, yet better: Training LLM reasoners via compute-optimal sampling

    Hritik Bansal, Arian Hosseini, Rishabh Agarwal, Vinh Q. Tran, and Mehran Kazemi. Smaller, weaker, yet better: Training LLM reasoners via compute-optimal sampling. In The 4th Workshop on Mathematical Reasoning and AI at NeurIPS’24, January 2025. URL https://openreview.net/forum?id=HuYSURUxs2

  35. [35]

    Learning to stop overthinking at test time

    Hieu Tran Bao, Nguyen Cong Dat, Nguyen Duc Anh, and Hoang Thanh Tung. Learning to stop overthinking at test time. arXiv preprint arXiv:2502.10954, 2025

  36. [36]

    Teaching llm to reason: Reinforcement learning from algorithmic problems without code

    Keqin Bao, Nuo Chen, Xiaoyuan Li, Binyuan Hui, Bowen Yu, Fuli Feng, Junyang Lin, Xiangnan He, and Dayiheng Liu. Teaching llm to reason: Reinforcement learning from algorithmic problems without code. arXiv preprint arXiv:2507.07498, 2025

  37. [37]

    Multi-step deductive reasoning over natural language: An empirical study on out-of-distribution generalisation

    Qiming Bao, Alex Yuxuan Peng, Tim Hartill, Neset Tan, Zhenyun Deng, Michael Witbrock, and Jiamou Liu. Multi-step deductive reasoning over natural language: An empirical study on out-of-distribution generalisation. arXiv preprint arXiv:2207.14000, 2022

  38. [38]

    Assessing and enhancing the robustness of large language models with task structure variations for logical reasoning

    Qiming Bao, Gael Gendron, Alex Yuxuan Peng, Wanjun Zhong, Neset Tan, Yang Chen, Michael Witbrock, and Jiamou Liu. Assessing and enhancing the robustness of large language models with task structure variations for logical reasoning. arXiv preprint arXiv:2310.09430, 2023

  39. [39]

    Contrastive learning with logic-driven data augmentation for logical reasoning over text

    Qiming Bao, Alex Yuxuan Peng, Zhenyun Deng, Wanjun Zhong, Neset Tan, Nathan Young, Yang Chen, Yonghua Zhu, Michael Witbrock, and Jiamou Liu. Contrastive learning with logic-driven data augmentation for logical reasoning over text. arXiv preprint arXiv:2305.12599, 2023

  40. [40]

    Abstract Meaning Representation-based logic-driven data augmentation for logical reasoning

    Qiming Bao, Alex Peng, Zhenyun Deng, Wanjun Zhong, Gael Gendron, Timothy Pistotti, Neset Tan, Nathan Young, Yang Chen, Yonghua Zhu, Paul Denny, Michael Witbrock, and Jiamou Liu. Abstract Meaning Representation-based logic-driven data augmentation for logical reasoning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.353. URL https://aclanthology.org/2024.findings-acl.353/

  42. [42]

    Exploring iterative enhancement for improving learnersourced multiple-choice question explanations with large language models

    Qiming Bao, Juho Leinonen, Alex Yuxuan Peng, Wanjun Zhong, Gaël Gendron, Timothy Pistotti, Alice Huang, Paul Denny, Michael Witbrock, and Jiamou Liu. Exploring iterative enhancement for improving learnersourced multiple-choice question explanations with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pag...

  43. [43]

    Trajectory balance with asynchrony: Decoupling exploration and learning for fast, scalable llm post-training

    Brian R Bartoldson, Siddarth Venkatraman, James Diffenderfer, Moksh Jain, Tal Ben-Nun, Seanie Lee, Minsu Kim, Johan Obando-Ceron, Yoshua Bengio, and Bhavya Kailkhura. Trajectory balance with asynchrony: Decoupling exploration and learning for fast, scalable llm post-training. arXiv preprint arXiv:2503.18929, 2025

  44. [44]

    Requirements ambiguity detection and explanation with llms: An industrial study

    Sarmad Bashir, Alessio Ferrari, Abbas Khan, Per Erik Strandberg, Zulqarnain Haider, Mehrdad Saadatmand, and Markus Bohlin. Requirements ambiguity detection and explanation with llms: An industrial study. July 2025

  45. [45]

    Titans: Learning to memorize at test time

    Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. arXiv preprint arXiv:2501.00663, 2024

  46. [46]

    Superintelligent agents pose catastrophic risks: Can scientist ai offer a safer path?

    Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, Sören Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, et al. Superintelligent agents pose catastrophic risks: Can scientist ai offer a safer path? arXiv preprint arXiv:2502.15657, 2025

  47. [47]

    International ai safety report

    Yoshua Bengio, Sören Mindermann, Daniel Privitera, Tamay Besiroglu, Rishi Bommasani, Stephen Casper, Yejin Choi, Philip Fox, Ben Garfinkel, Danielle Goldfarb, et al. International ai safety report. arXiv preprint arXiv:2501.17805, 2025

  48. [48]

    The validation gap: A mechanistic analysis of how language models compute arithmetic but fail to validate it

    Leonardo Bertolazzi, Philipp Mondorf, Barbara Plank, and Raffaella Bernardi. The validation gap: A mechanistic analysis of how language models compute arithmetic but fail to validate it. arXiv preprint arXiv:2502.11771, 2025

  49. [49]

    Graph of thoughts: Solving elaborate problems with large language models

    Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. Graph of thoughts: Solving elaborate problems with large language models. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):17682–17690, March 2024. doi: 10.1609/aaai.v38i16.29720. URL https://ojs.aaai.org/index.php/AAAI/article/view/29720

  51. [51]

    Demystifying chains, trees, and graphs of thoughts

    Maciej Besta, Florim Memedi, Zhenyu Zhang, Robert Gerstenberger, Guangyuan Piao, Nils Blach, Piotr Nyczyk, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, et al. Demystifying chains, trees, and graphs of thoughts. arXiv preprint arXiv:2401.14295, 2024

  52. [52]

    Reasoning Language Models: A Blueprint

    Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, et al. Reasoning language models: A blueprint. arXiv preprint arXiv:2501.11223, 2025

  53. [53]

    Cot-kinetics: A theoretical modeling assessing lrm reasoning process

    Jinhe Bi, Danqi Yan, Yifan Wang, Wenke Huang, Haokun Chen, Guancheng Wan, Mang Ye, Xun Xiao, Hinrich Schuetze, Volker Tresp, et al. Cot-kinetics: A theoretical modeling assessing lrm reasoning process. arXiv preprint arXiv:2505.13408, 2025

  54. [54]

    DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024

  55. [55]

    When do program-of-thought works for reasoning?

    Zhen Bi, Ningyu Zhang, Yinuo Jiang, Shumin Deng, Guozhou Zheng, and Huajun Chen. When do program-of-thought works for reasoning? In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17691–17699, 2024. URL https://ojs.aaai.org/index.php/AAAI/article/view/29721/31237

  56. [56]

    Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning

    Zhenni Bi, Kai Han, Chuanjian Liu, Yehui Tang, and Yunhe Wang. Forest-of-thought: Scaling test-time compute for enhancing llm reasoning. arXiv preprint arXiv:2412.09078, 2024

  57. [57]

    On the query complexity of verifier-assisted language generation

    Edoardo Botta, Yuchen Li, Aashay Mehta, Jordan T Ash, Cyril Zhang, and Andrej Risteski. On the query complexity of verifier-assisted language generation. arXiv preprint arXiv:2502.12123, 2025

  58. [58]

    Vermcts: Synthesizing multi-step programs using a verifier, a large language model, and tree search

    David Brandfonbrener, Simon Henniger, Sibi Raja, Tarun Prasad, Chloe Loughridge, Federico Cassano, Sabrina Ruixin Hu, Jianang Yang, William E Byrd, Robert Zinkov, et al. Vermcts: Synthesizing multi-step programs using a verifier, a large language model, and tree search. arXiv preprint arXiv:2402.08147, 2024

  59. [59]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024

  60. [60]

    Distillation scaling laws

    Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, and Russ Webb. Distillation scaling laws. arXiv preprint arXiv:2502.08606, 2025

  61. [61]

    Test-time-scaling for zero-shot diagnosis with visual-language reasoning

    Ji Young Byun, Young-Jin Park, Navid Azizan, and Rama Chellappa. Test-time-scaling for zero-shot diagnosis with visual-language reasoning. arXiv preprint arXiv:2506.11166, 2025

  62. [62]

    ARES: Alternating reinforcement learning and supervised fine-tuning for enhanced multi-modal chain-of-thought reasoning through diverse AI feedback

    Ju-Seung Byun, Jiyun Chun, Jihyung Kil, and Andrew Perrault. ARES: Alternating reinforcement learning and supervised fine-tuning for enhanced multi-modal chain-of-thought reasoning through diverse AI feedback. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing...

  63. [63]

    System-2 mathematical reasoning via enriched instruction tuning

    Huanqia Cai, Yijun Yang, and Zhifeng Li. System-2 mathematical reasoning via enriched instruction tuning. arXiv preprint arXiv:2412.16964, 2024

  64. [64]

    Internlm2 technical report

    Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297, 2024

  65. [65]

    Xai meets llms: A survey of the relation between explainable ai and large language models

    Erik Cambria, Lorenzo Malandri, Fabio Mercorio, Navid Nobani, and Andrea Seveso. Xai meets llms: A survey of the relation between explainable ai and large language models. arXiv preprint arXiv:2407.15248, 2024

  66. [66]

    GraphReason: Enhancing reasoning capabilities of large language models through a graph-based verification approach

    Lang Cao. GraphReason: Enhancing reasoning capabilities of large language models through a graph-based verification approach. In Bhavana Dalvi Mishra, Greg Durrett, Peter Jansen, Ben Lipkin, Danilo Neves Ribeiro, Lionel Wong, Xi Ye, and Wenting Zhao, editors, Proceedings of the 2nd Workshop on Natural Language Reasoning and Structured Explanations (@ACL...

  67. [67]

    Behavior injection: Preparing language models for reinforcement learning

    Zhepeng Cen, Yihang Yao, William Han, Zuxin Liu, and Ding Zhao. Behavior injection: Preparing language models for reinforcement learning. arXiv preprint arXiv:2505.18917, 2025

  68. [68]

    xcot: Cross-lingual instruction tuning for cross-lingual chain-of-thought reasoning

    Linzheng Chai, Jian Yang, Tao Sun, Hongcheng Guo, Jiaheng Liu, Bing Wang, Xiannian Liang, Jiaqi Bai, Tongliang Li, Qiyao Peng, et al. xcot: Cross-lingual instruction tuning for cross-lingual chain-of-thought reasoning. arXiv preprint arXiv:2401.07037, 2024

  69. [69]

    MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

    Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, et al. Mle-bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095, 2024

  70. [70]

    On the convergence rate of mcts for the optimal value estimation in markov decision processes

    Hyeong Soo Chang. On the convergence rate of mcts for the optimal value estimation in markov decision processes. IEEE Transactions on Automatic Control, pages 1–6, February 2025. doi: 10.1109/TAC.2025.3538807. URL https://ieeexplore.ieee.org/document/10870057

  72. [72]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025

  73. [73]

    Evaluating o1-like llms: Unlocking reasoning for translation through comprehensive analysis

    Andong Chen, Yuchen Song, Wenxin Zhu, Kehai Chen, Muyun Yang, Tiejun Zhao, et al. Evaluating o1-like llms: Unlocking reasoning for translation through comprehensive analysis. arXiv preprint arXiv:2502.11544, 2025

  74. [74]

    Threading the needle: Reweaving chain-of-thought reasoning to explain human label variation

    Beiduo Chen, Yang Janet Liu, Anna Korhonen, and Barbara Plank. Threading the needle: Reweaving chain-of-thought reasoning to explain human label variation. arXiv preprint arXiv:2505.23368, 2025

  75. [75]

    Finereason: Evaluating and improving llms’ deliberate reasoning through reflective puzzle solving

    Guizhen Chen, Weiwen Xu, Hao Zhang, Hou Pong Chan, Chaoqun Liu, Lidong Bing, Deli Zhao, Anh Tuan Luu, and Yu Rong. Finereason: Evaluating and improving llms’ deliberate reasoning through reflective puzzle solving. arXiv preprint arXiv:2502.20238, 2025

  76. [76]

    Step-level value preference optimization for mathematical reasoning

    Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Step-level value preference optimization for mathematical reasoning. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7889–7903, Miami, Florida, USA, November 2024. Association for Computational Linguistics. d...

  77. [77]

    Alphamath almost zero: Process supervision without process

    Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Alphamath almost zero: Process supervision without process. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, September 2024. URL https://openreview.net/forum?id=VaXnxQ3UKo

  78. [78]

    ChineseEcomQA: A scalable e-commerce concept evaluation benchmark for large language models

    Haibin Chen, Kangtao Lv, Chengwei Hu, Yanshi Li, Yujin Yuan, Yancheng He, Xingyao Zhang, Langming Liu, Shilei Liu, Wenbo Su, et al. Chineseecomqa: A scalable e-commerce concept evaluation benchmark for large language models. arXiv preprint arXiv:2502.20196, 2025

  79. [79]

    Benchmarking large language models on answering and explaining challenging medical questions

    Hanjie Chen, Zhouxiang Fang, Yash Singla, and Mark Dredze. Benchmarking large language models on answering and explaining challenging medical questions. arXiv preprint arXiv:2402.18060, 2024

  80. [80]

    Language models are hidden reasoners: Unlocking latent reasoning capabilities via self-rewarding

    Haolin Chen, Yihao Feng, Zuxin Liu, Weiran Yao, Akshara Prabhakar, Shelby Heinecke, Ricky Ho, Phil Mui, Silvio Savarese, Caiming Xiong, et al. Language models are hidden reasoners: Unlocking latent reasoning capabilities via self-rewarding. arXiv preprint arXiv:2411.04282, 2024

Showing first 80 references.