pith. machine review for the scientific record.

arxiv: 2503.09567 · v5 · submitted 2025-03-12 · 💻 cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

Dengyun Peng, Jiannan Guan, Jinhao Liu, Libo Qin, Mengkang Hu, Peng Wang, Qiguang Chen, Te Gao, Wanxiang Che, Yuhang Zhou

Pith reviewed 2026-05-12 08:35 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords long chain-of-thought · reasoning large language models · deep reasoning · extensive exploration · feasible reflection · overthinking · inference-time scaling · reasoning taxonomy

The pith

Long CoT with deep reasoning, exploration, and reflection enables LLMs to solve more complex tasks than short chains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey separates long chain-of-thought reasoning from traditional short chain-of-thought in large language models. It introduces a taxonomy that groups reasoning approaches around three defining traits: deep reasoning, extensive exploration, and feasible reflection. These traits let models manage intricate problems in mathematics and coding while producing more coherent and efficient outputs. The work also examines how Long CoT appears in practice through phenomena such as overthinking and inference-time scaling. It closes by listing open gaps and future paths including multi-modal extensions and efficiency improvements.

Core claim

The survey establishes that Long CoT, distinguished from Short CoT by its incorporation of deep reasoning, extensive exploration, and feasible reflection, equips reasoning large language models to address more complex tasks and deliver more efficient, coherent outcomes.

What carries the argument

A novel taxonomy that organizes reasoning paradigms by contrasting Long CoT against Short CoT, anchored in the three characteristics of deep reasoning, extensive exploration, and feasible reflection.
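As an editorial illustration (not the paper's formalism), the three traits can be read as orthogonal axes of a reasoning trace. The field names and numeric thresholds below are hypothetical, chosen only to make the contrast with Short CoT concrete:

```python
from dataclasses import dataclass

@dataclass
class ReasoningTrace:
    steps: int      # depth: how many reasoning steps the chain takes
    branches: int   # exploration: how many alternative paths are tried
    revisions: int  # reflection: how many self-correction events occur

def classify(trace: ReasoningTrace) -> str:
    """Toy rule: a trace counts as Long CoT only when it shows all three
    traits; the numeric thresholds are invented for illustration."""
    deep = trace.steps >= 10
    exploratory = trace.branches >= 2
    reflective = trace.revisions >= 1
    return "Long CoT" if (deep and exploratory and reflective) else "Short CoT"

print(classify(ReasoningTrace(steps=24, branches=3, revisions=2)))  # Long CoT
print(classify(ReasoningTrace(steps=4, branches=1, revisions=0)))   # Short CoT
```

Treating the traits as jointly required, rather than any one alone, mirrors the survey's framing that Long CoT is defined by their combination.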

If this is right

  • Models equipped with Long CoT can tackle more intricate problems in domains such as mathematics and coding than those restricted to Short CoT.
  • The emergence of Long CoT produces observable effects including overthinking and gains from inference-time scaling.
  • Future systems can build on the taxonomy to integrate multi-modal reasoning and refine knowledge frameworks for greater efficiency.
  • A shared structure for these processes supports clearer progress in developing logical reasoning capabilities for AI.
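The inference-time scaling point can be made concrete with a minimal self-consistency sketch: sample several reasoning chains and majority-vote their final answers. The `sample_answer` stub stands in for a real model call, and its 80% accuracy is an invented parameter, not a figure from the paper:

```python
import random
from collections import Counter

def sample_answer(question: str, rng: random.Random) -> str:
    """Stub for an LLM call: returns the right answer 80% of the time
    (an invented rate), otherwise a random distractor."""
    return "42" if rng.random() < 0.8 else str(rng.randint(0, 99))

def self_consistency(question: str, n_samples: int, seed: int = 0) -> str:
    """Spend an inference-time budget on N sampled chains, then majority-vote."""
    rng = random.Random(seed)
    votes = Counter(sample_answer(question, rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

# A larger sampling budget sharpens the vote without changing model weights.
print(self_consistency("What is 6 * 7?", n_samples=25))
```

The budget-shifting implication above follows directly: `n_samples` is a dial that trades serving cost for reliability on a fixed, possibly smaller, model.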

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The taxonomy could serve as a basis for creating standardized benchmarks that measure depth of exploration separately from final accuracy.
  • Techniques to detect and shorten unnecessary reflection steps might reduce overthinking while preserving the benefits of Long CoT.
  • Inference-time scaling implies that deployment budgets could shift from larger models to longer allowed thinking time on smaller ones.
  • Combining the survey's framework with visual inputs could create unified reasoning systems that handle both text and image-based problems.

Load-bearing premise

The differences between Long CoT and Short CoT are fundamental enough to support a clean taxonomy without substantial overlap or missing categories.

What would settle it

An empirical result in which Short CoT models achieve equal or better performance than Long CoT models on complex tasks without added mechanisms, or a new reasoning approach that blends both forms in a way the taxonomy cannot classify.

read the original abstract

Recent advancements in reasoning with large language models (RLLMs), such as OpenAI-O1 and DeepSeek-R1, have demonstrated their impressive capabilities in complex domains like mathematics and coding. A central factor in their success lies in the application of long chain-of-thought (Long CoT) characteristics, which enhance reasoning abilities and enable the solution of intricate problems. However, despite these developments, a comprehensive survey on Long CoT is still lacking, limiting our understanding of its distinctions from traditional short chain-of-thought (Short CoT) and complicating ongoing debates on issues like "overthinking" and "inference-time scaling." This survey seeks to fill this gap by offering a unified perspective on Long CoT. (1) We first distinguish Long CoT from Short CoT and introduce a novel taxonomy to categorize current reasoning paradigms. (2) Next, we explore the key characteristics of Long CoT: deep reasoning, extensive exploration, and feasible reflection, which enable models to handle more complex tasks and produce more efficient, coherent outcomes compared to the shallower Short CoT. (3) We then investigate key phenomena such as the emergence of Long CoT with these characteristics, including overthinking, and inference-time scaling, offering insights into how these processes manifest in practice. (4) Finally, we identify significant research gaps and highlight promising future directions, including the integration of multi-modal reasoning, efficiency improvements, and enhanced knowledge frameworks. By providing a structured overview, this survey aims to inspire future research and further the development of logical reasoning in artificial intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript is a survey paper on long chain-of-thought (Long CoT) for reasoning large language models (RLLMs). It distinguishes Long CoT from traditional short CoT, introduces a novel taxonomy for current reasoning paradigms, examines key characteristics of Long CoT (deep reasoning, extensive exploration, and feasible reflection), analyzes phenomena including overthinking and inference-time scaling, and identifies research gaps with future directions such as multi-modal integration, efficiency improvements, and enhanced knowledge frameworks.

Significance. If the distinctions and taxonomy are robust, this work provides a timely organizational synthesis for the emerging 'reasoning era' in LLMs, building on models like o1 and R1. The structured overview of characteristics, phenomena, and gaps can serve as a useful reference to guide research on complex task handling and coherence, with the explicit identification of future directions as a particular strength of the synthesis.

major comments (2)
  1. [§3 (Taxonomy)] The central claim that the proposed taxonomy comprehensively categorizes reasoning paradigms without significant overlap or omission is load-bearing for the unified perspective but lacks explicit validation; the manuscript should add a systematic mapping or comparison table against prior CoT/reasoning taxonomies to demonstrate completeness and novelty.
  2. [§4 (Characteristics)] The assertion that Long CoT characteristics enable 'more efficient, coherent outcomes' compared to Short CoT is presented as a key distinction, yet the survey does not quantify or cite specific metrics (e.g., token efficiency or error rates) across representative models to support this over the qualitative description.
minor comments (3)
  1. [Abstract and §1] The phrasing 'feasible reflection' is introduced without an immediate operational definition or example; add a brief clarifying sentence or footnote on first use.
  2. [§5 (Phenomena)] The discussion of 'overthinking' would benefit from a dedicated subsection or table listing observed instances across models (e.g., o1 vs. R1) to improve readability and allow direct comparison.
  3. [References] Several foundational CoT papers (pre-2023) appear underrepresented relative to recent model-specific works; expand the citation list for balance in the literature synthesis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation of minor revision. The comments on validating the taxonomy and strengthening the characteristics section are helpful. We address each point below, indicating revisions where made.

read point-by-point responses
  1. Referee: [§3 (Taxonomy)] The central claim that the proposed taxonomy comprehensively categorizes reasoning paradigms without significant overlap or omission is load-bearing for the unified perspective but lacks explicit validation; the manuscript should add a systematic mapping or comparison table against prior CoT/reasoning taxonomies to demonstrate completeness and novelty.

    Authors: We agree that an explicit validation strengthens the taxonomy's contribution. In the revised manuscript, we have added a new Table 2 that provides a systematic mapping of our taxonomy categories (deep reasoning, extensive exploration, feasible reflection, and related paradigms) against prior CoT and reasoning taxonomies, including those from Wei et al. (2022), Kojima et al. (2022), and recent surveys on LLM reasoning. The table highlights distinctions, such as our coverage of inference-time scaling and reflection in Long CoT models like o1 and R1, while noting areas of overlap and confirming no major omissions in current paradigms. This addition directly addresses the request for demonstrated completeness and novelty. revision: yes

  2. Referee: [§4 (Characteristics)] The assertion that Long CoT characteristics enable 'more efficient, coherent outcomes' compared to Short CoT is presented as a key distinction, yet the survey does not quantify or cite specific metrics (e.g., token efficiency or error rates) across representative models to support this over the qualitative description.

    Authors: The referee is correct that the current version relies on qualitative description for the efficiency and coherence claims. As a survey, we do not conduct new experiments, but we have revised Section 4 to incorporate citations to empirical studies on models such as OpenAI-o1 and DeepSeek-R1. These include reported metrics on benchmark accuracy (e.g., MATH, GSM8K), token usage comparisons showing longer but more effective reasoning paths, and reduced error rates in complex tasks relative to short CoT baselines. This provides concrete support for the distinction while remaining within survey scope. We can further expand the citations if additional references are suggested. revision: partial

Circularity Check

0 steps flagged

No significant circularity; literature synthesis without derivations

full rationale

This survey paper organizes existing literature on Long CoT versus Short CoT in reasoning LLMs, drawing distinctions and proposing a taxonomy based on observed characteristics from models such as OpenAI-O1 and DeepSeek-R1. No mathematical derivations, equations, fitted parameters, or quantitative predictions appear that could reduce to the paper's own inputs by construction. Claims about deep reasoning, exploration, reflection, overthinking, and inference-time scaling are presented as syntheses of prior work rather than self-referential definitions or load-bearing self-citations. The structure remains self-contained as an organizational review with no steps that equate outputs to inputs via ansatz, renaming, or uniqueness theorems imported from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The survey rests on standard domain assumptions from LLM reasoning literature and introduces no new free parameters or invented entities; its main addition is the taxonomy itself.

axioms (1)
  • domain assumption Long CoT is meaningfully distinct from Short CoT and enables superior handling of complex tasks
    Central framing of the survey as stated in the abstract

pith-pipeline@v0.9.0 · 5614 in / 1124 out tokens · 58867 ms · 2026-05-12T08:35:25.655145+00:00 · methodology

discussion (0)


Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Unsupervised Process Reward Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.

  2. CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation

    cs.CL 2026-05 unverdicted novelty 7.0

    CA-SQL achieves 51.72% execution accuracy on the challenging tier of the BIRD benchmark using GPT-4o-mini by scaling exploration breadth according to estimated task difficulty, evolutionary prompt seeding, and candida...

  3. Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

    cs.AI 2026-05 conditional novelty 7.0

    Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

  4. Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis

    cs.CL 2026-04 unverdicted novelty 7.0

    DataPRM is a new process reward model for data analysis agents that detects silent errors via environment interaction and ternary rewards, yielding 7-11% gains on benchmarks and further RL improvements.

  5. ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation

    cs.CL 2026-04 unverdicted novelty 7.0

    ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.

  6. Video-R1: Reinforcing Video Reasoning in MLLMs

    cs.CV 2025-03 conditional novelty 7.0

    Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.

  7. STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes

    cs.CL 2026-05 unverdicted novelty 6.0

    STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage 19-42% with little accuracy loss on math benchmarks.

  8. Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

    cs.CV 2026-05 unverdicted novelty 6.0

    SCOLAR addresses information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens from LLM hidden states, extending acceptable CoT length over 30x and achieving +14.12% gains on b...

  9. RuPLaR : Efficient Latent Compression of LLM Reasoning Chains with Rule-Based Priors From Multi-Step to One-Step

    cs.CL 2026-05 unverdicted novelty 6.0

    RuPLaR replaces multi-step latent CoT with a single-model one-step generator guided by rule-based priors and a joint consistency-plus-alignment loss, delivering 11.1 percent higher accuracy at lower token cost.

  10. AIPO: Learning to Reason from Active Interaction

    cs.CL 2026-05 unverdicted novelty 6.0

    AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...

  11. Training Transformers for KV Cache Compressibility

    cs.LG 2026-05 unverdicted novelty 6.0

    Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.

  12. Training Transformers for KV Cache Compressibility

    cs.LG 2026-05 unverdicted novelty 6.0

    KV compressibility is a property of learned transformer representations that can be improved by training with KV sparsification, leading to better quality-budget tradeoffs in downstream compression for retrieval, QA, ...

  13. Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding

    cs.AI 2026-05 unverdicted novelty 6.0

    CoRD uses collaborative multi-teacher step-wise decoding with perplexity-guided beam search to generate higher-quality Long-CoT data that lets smaller models reach near-teacher performance with less supervision.

  14. VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model

    cs.RO 2026-05 unverdicted novelty 6.0

    VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.

  15. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  16. OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model

    cs.CV 2026-04 unverdicted novelty 6.0

    OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.

  17. SAVOIR: Learning Social Savoir-Faire via Shapley-based Reward Attribution

    cs.AI 2026-04 unverdicted novelty 6.0

    SAVOIR combines prospective expected utility valuation with Shapley values for fair credit assignment in social dialogue RL, achieving SOTA on SOTOPIA where a 7B model matches or exceeds GPT-4o and Claude-3.5-Sonnet.

  18. Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards

    cs.CL 2026-04 unverdicted novelty 6.0

    PDDL planning problems are used to generate about one million precise reasoning steps for training Process Reward Models, and adding this data to existing datasets improves LLM performance on both mathematical and non...

  19. Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale

    cs.CL 2026-04 unverdicted novelty 6.0

    Relax is a new RL training engine with omni-native design and async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.

  20. TSUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context Distillation

    cs.CL 2026-04 unverdicted novelty 6.0

    TSUBASA improves long-horizon personalization in LLMs via dynamic memory evolution for writing and context-distillation self-learning for reading, outperforming Mem0 and Memory-R1 on Qwen-3 benchmarks while reducing t...

  21. An Agentic AI Framework with Large Language Models and Chain-of-Thought for UAV-Assisted Logistics Scheduling with Mobile Edge Computing

    cs.AI 2026-05 unverdicted novelty 5.0

    An agentic AI framework with LLMs generates formulations for coupled UAV product collection and MEC task scheduling, solved by hierarchical PPO that reaches 99.6% collection success and 100% deadline compliance in sim...

  22. MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading

    cs.CL 2026-05 unverdicted novelty 5.0

    MemReread improves agent long-context reasoning by triggering rereading on insufficient final memory to recover discarded indirect facts, outperforming baselines at linear complexity.

  23. SiriusHelper: An LLM Agent-Based Operations Assistant for Big Data Platforms

    cs.DB 2026-04 unverdicted novelty 5.0

    SiriusHelper deploys an LLM agent with intent routing, DeepSearch multi-hop retrieval, and automated SOP distillation to outperform alternatives and reduce ticket volume by 20.8% on Tencent's big data platform.

  24. Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering

    cs.SE 2026-04 unverdicted novelty 5.0

    LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.

  25. Towards Robust Endogenous Reasoning: Unifying Drift Adaptation in Non-Stationary Tuning

    cs.LG 2026-04 unverdicted novelty 5.0

    CPO++ adapts reinforcement fine-tuning of MLLMs to endogenous multi-modal concept drift through counterfactual reasoning and preference optimization, yielding better coherence and cross-domain robustness in safety-cri...

  26. StaRPO: Stability-Augmented Reinforcement Policy Optimization

    cs.AI 2026-04 unverdicted novelty 5.0

    StaRPO improves LLM reasoning by adding autocorrelation function and path efficiency stability metrics to RL policy optimization, yielding higher accuracy and fewer logic errors on reasoning benchmarks.

  27. OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

    cs.CV 2026-04 unverdicted novelty 5.0

    OpenVLThinkerV2 applies a new Gaussian GRPO training objective with response and entropy shaping to outperform prior open-source and proprietary models on 18 visual reasoning benchmarks.

  28. Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Framework

    cs.CR 2026-04 unverdicted novelty 5.0

    A 16-factor structured prompt framework strengthens CoT reasoning in LLMs for security analysis, yielding up to 40% reasoning gains in smaller models and stable accuracy improvements validated by human raters with Coh...

  29. Pragmos: A Process Agentic Modeling System

    cs.SE 2026-04 unverdicted novelty 4.0

    Pragmos is a hybrid interactive system that decomposes process modeling into explainable steps using LLMs augmented by behavioral-relation tools to produce sound and comprehensible models.

  30. A Survey of Context Engineering for Large Language Models

    cs.CL 2025-07 accept novelty 4.0

    The survey organizes Context Engineering into retrieval, processing, management, and integrated systems like RAG and multi-agent setups while identifying an asymmetry where LLMs handle complex inputs well but struggle...

  31. Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

    cs.CL 2026-05 unverdicted novelty 3.0

    EngGPT2MoE-16B-A3B matches or beats other Italian models on most international benchmarks but trails top international models such as GPT-5 nano and Qwen3-8B.

Reference graph

Works this paper leans on

299 extracted references · 299 canonical work pages · cited by 30 Pith papers · 30 internal anchors

  1. [1]

    Medec: A benchmark for medical error detection and correction in clinical notes

    Asma Ben Abacha, Wen-wai Yim, Yujuan Fu, Zhaoyi Sun, Meliha Yetisgen, Fei Xia, and Thomas Lin. Medec: A benchmark for medical error detection and correction in clinical notes. arXiv preprint arXiv:2412.19260, 2024

  2. [2]

    Inference scaling vs reasoning: An empirical analysis of compute-optimal llm problem-solving

    Marwan AbdElhameed and Pavly Halim. Inference scaling vs reasoning: An empirical analysis of compute-optimal llm problem-solving. arXiv preprint arXiv:2412.16260, 2024

  3. [3]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  4. [4]

    Nemotron-4 340b technical report

    Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, et al. Nemotron-4 340b technical report. arXiv preprint arXiv:2406.11704, 2024

  5. [5]

    The unreasonable effectiveness of entropy minimization in llm reasoning

    Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. The unreasonable effectiveness of entropy minimization in llm reasoning. arXiv preprint arXiv:2505.15134, 2025

  6. [6]

    L1: Controlling how long a reasoning model thinks with reinforcement learning

    Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning. arXiv preprint arXiv:2503.04697, 2025

  7. [7]

    Opencodereasoning: Advancing data distillation for competitive coding

    Wasi Uddin Ahmad, Sean Narenthiran, Somshubra Majumdar, Aleksander Ficek, Siddhartha Jain, Jocelyn Huang, Vahid Noroozi, and Boris Ginsburg. Opencodereasoning: Advancing data distillation for competitive coding. arXiv preprint arXiv:2504.01943, 2025

  8. [8]

    Aime 2024

    AI-MO. Aime 2024. https://huggingface.co/datasets/AI-MO/aimo-validation-aime, July 2024

  9. [9]

    Amc 2023

    AI-MO. Amc 2023. https://huggingface.co/datasets/AI-MO/aimo-validation-amc, July 2024

  10. [10]

    Big-math: A large-scale, high-quality math dataset for reinforcement learning in language models

    Alon Albalak, Duy Phung, Nathan Lile, Rafael Rafailov, Kanishk Gandhi, Louis Castricato, Anikait Singh, Chase Blagden, Violet Xiang, Dakota Mahan, and Nick Haber. Big-math: A large-scale, high-quality math dataset for reinforcement learning in language models, 2025

  11. [11]

    Reasoning on a budget: A survey of adaptive and controllable test-time compute in llms

    Mohammad Ali Alomrani, Yingxue Zhang, Derek Li, Qianyi Sun, Soumyasundar Pal, Zhanguang Zhang, Yaochen Hu, Rohan Deepak Ajwani, Antonios Valkanas, Raika Karimi, et al. Reasoning on a budget: A survey of adaptive and controllable test-time compute in llms. arXiv preprint arXiv:2507.02076, 2025

  12. [12]

    Lower bounds for chain-of-thought reasoning in hard-attention transformers

    Alireza Amiri, Xinting Huang, Mark Rofin, and Michael Hahn. Lower bounds for chain-of- thought reasoning in hard-attention transformers. arXiv preprint arXiv:2502.02393, 2025

  13. [13]

    Concrete Problems in AI Safety

    Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016

  14. [14]

    Learning from mistakes makes llm better reasoner

    Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, Jian-Guang Lou, and Weizhu Chen. Learning from mistakes makes llm better reasoner. arXiv preprint arXiv:2310.20689, 2023

  15. [15]

    Phd knowledge not required: A reasoning challenge for large language models

    Carolyn Jane Anderson, Joydeep Biswas, Aleksander Boruch-Gruszecki, Federico Cassano, Molly Q Feldman, Arjun Guha, Francesca Lucchetti, and Zixuan Wu. Phd knowledge not required: A reasoning challenge for large language models. arXiv preprint arXiv:2502.01584, 2025

  16. [16]

    PaLM 2 Technical Report

    Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023

  17. [17]

    Critique-out-loud reward models

    Zachary Ankner, Mansheej Paul, Brandon Cui, Jonathan Daniel Chang, and Prithviraj Ammanabrolu. Critique-out-loud reward models. In Pluralistic Alignment Workshop at NeurIPS 2024, October 2024. URL https://openreview.net/forum?id=CljYUvIlRW

  18. [18]

    Thinking fast and slow with deep learning and tree search

    Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. Advances in Neural Information Processing Systems, 30, December 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/d8e1344e27a5b08cdfd5d027d9b8d6de-Paper.pdf

  19. [19]

    The claude 3 model family: Opus, sonnet, haiku

    AI Anthropic. The claude 3 model family: Opus, sonnet, haiku. Claude-3 Model Card, 1:1, 2024. URL https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

  20. [20]

    Do chains-of-thoughts of large language models suffer from hallucinations, cognitive biases, or phobias in bayesian reasoning?

    Roberto Araya. Do chains-of-thoughts of large language models suffer from hallucinations, cognitive biases, or phobias in bayesian reasoning? arXiv preprint arXiv:2503.15268, 2025

  21. [21]

    Beyond Exponential Decay: Rethinking Error Accumulation in Large Language Models

    Mikhail L Arbuzov, Alexey A Shvets, and Sisong Beir. Beyond exponential decay: Rethinking error accumulation in large language models. arXiv preprint arXiv:2505.24187, 2025

  22. [22]

    Training language models to reason efficiently

    Daman Arora and Andrea Zanette. Training language models to reason efficiently. arXiv preprint arXiv:2502.04463, 2025

  23. [23]

    Early external safety testing of openai’s o3-mini: Insights from the pre-deployment evaluation

    Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, and Sergio Segura. Early external safety testing of openai’s o3-mini: Insights from the pre-deployment evaluation. arXiv preprint arXiv:2501.17749, 2025

  24. [24]

    o3-mini vs deepseek-r1: Which one is safer?

    Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, and Sergio Segura. o3-mini vs deepseek-r1: Which one is safer? arXiv preprint arXiv:2501.18438, 2025

  25. [25]

    Language models can predict their own behavior

    Dhananjay Ashok and Jonathan May. Language models can predict their own behavior. arXiv preprint arXiv:2502.13329, 2025

  26. [26]

    Llemma: An open language model for mathematics

    Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen Marcus McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An open language model for mathematics. In The Twelfth International Conference on Learning Representations, January 2024. URL https://openreview.net/forum?id=4WnqRR915j

  27. [27]

    Cosmos-reason1: From physical common sense to embodied reasoning

    Alisson Azzolini, Hannah Brandon, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Francesco Ferroni, Rama Govindaraju, et al. Cosmos-reason1: From physical common sense to embodied reasoning. arXiv preprint arXiv:2503.15558, 2025

  28. [28]

    The lookahead limitation: Why multi-operand addition is hard for llms

    Tanja Baeumel, Josef van Genabith, and Simon Ostermann. The lookahead limitation: Why multi-operand addition is hard for llms. arXiv preprint arXiv:2502.19981, 2025

  29. [29]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022

  30. [30]

    Monitoring reasoning models for misbehavior and the risks of promoting obfuscation

    Bowen Baker, Joost Huizinga, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. March 2025. URL https://openai.com/index/chain-of-thought-monitoring/

  31. [31]

    Inference-time scaling for complex tasks: Where we stand and what lies ahead

    Vidhisha Balachandran, Jingya Chen, Lingjiao Chen, Shivam Garg, Neel Joshi, Yash Lara, John Langford, Besmira Nushi, Vibhav Vineet, Yue Wu, et al. Inference-time scaling for complex tasks: Where we stand and what lies ahead. arXiv preprint arXiv:2504.00294, 2025

  32. [32]

    The relationship between reasoning and performance in large language models–o3 (mini) thinks harder, not longer

    Marthe Ballon, Andres Algaba, and Vincent Ginis. The relationship between reasoning and performance in large language models–o3 (mini) thinks harder, not longer. arXiv preprint arXiv:2502.15631, 2025

  33. [33]

    Thinking machines: A survey of llm based reasoning strategies

    Dibyanayan Bandyopadhyay, Soham Bhattacharjee, and Asif Ekbal. Thinking machines: A survey of llm based reasoning strategies. arXiv preprint arXiv:2503.10814, 2025

  34. [34]

    Smaller, weaker, yet better: Training LLM reasoners via compute-optimal sampling

    Hritik Bansal, Arian Hosseini, Rishabh Agarwal, Vinh Q. Tran, and Mehran Kazemi. Smaller, weaker, yet better: Training LLM reasoners via compute-optimal sampling. In The 4th Workshop on Mathematical Reasoning and AI at NeurIPS’24, January 2025. URL https://openreview.net/forum?id=HuYSURUxs2

  35. [35]

    Learning to stop overthinking at test time

    Hieu Tran Bao, Nguyen Cong Dat, Nguyen Duc Anh, and Hoang Thanh Tung. Learning to stop overthinking at test time. arXiv preprint arXiv:2502.10954, 2025

  36. [36]

    Teaching llm to reason: Reinforcement learning from algorithmic problems without code

    Keqin Bao, Nuo Chen, Xiaoyuan Li, Binyuan Hui, Bowen Yu, Fuli Feng, Junyang Lin, Xiangnan He, and Dayiheng Liu. Teaching llm to reason: Reinforcement learning from algorithmic problems without code. arXiv preprint arXiv:2507.07498, 2025

  37. [37]

    Multi-step deductive reasoning over natural language: An empirical study on out-of-distribution generalisation

    Qiming Bao, Alex Yuxuan Peng, Tim Hartill, Neset Tan, Zhenyun Deng, Michael Witbrock, and Jiamou Liu. Multi-step deductive reasoning over natural language: An empirical study on out-of-distribution generalisation. arXiv preprint arXiv:2207.14000, 2022

  38. [38]

    Assessing and enhancing the robustness of large language models with task structure variations for logical reasoning

    Qiming Bao, Gael Gendron, Alex Yuxuan Peng, Wanjun Zhong, Neset Tan, Yang Chen, Michael Witbrock, and Jiamou Liu. Assessing and enhancing the robustness of large language models with task structure variations for logical reasoning. arXiv preprint arXiv:2310.09430, 2023

  39. [39]

    Contrastive learning with logic-driven data augmentation for logical reasoning over text

    Qiming Bao, Alex Yuxuan Peng, Zhenyun Deng, Wanjun Zhong, Neset Tan, Nathan Young, Yang Chen, Yonghua Zhu, Michael Witbrock, and Jiamou Liu. Contrastive learning with logic-driven data augmentation for logical reasoning over text. arXiv preprint arXiv:2305.12599, 2023

  40. [40]

    Abstract Meaning Representation-based logic-driven data augmentation for logical reasoning

    Qiming Bao, Alex Peng, Zhenyun Deng, Wanjun Zhong, Gael Gendron, Timothy Pistotti, Neset Tan, Nathan Young, Yang Chen, Yonghua Zhu, Paul Denny, Michael Witbrock, and Jiamou Liu. Abstract Meaning Representation-based logic-driven data augmentation for logical reasoning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.353. URL https://aclanthology.org/2024.findings-acl.353/

  42. [42]

    Exploring iterative enhancement for improving learnersourced multiple-choice question explanations with large language models

    Qiming Bao, Juho Leinonen, Alex Yuxuan Peng, Wanjun Zhong, Gaël Gendron, Timothy Pistotti, Alice Huang, Paul Denny, Michael Witbrock, and Jiamou Liu. Exploring iterative enhancement for improving learnersourced multiple-choice question explanations with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pag...

  43. [43]

    Trajectory balance with asynchrony: Decoupling exploration and learning for fast, scalable llm post-training

    Brian R Bartoldson, Siddarth Venkatraman, James Diffenderfer, Moksh Jain, Tal Ben-Nun, Seanie Lee, Minsu Kim, Johan Obando-Ceron, Yoshua Bengio, and Bhavya Kailkhura. Trajectory balance with asynchrony: Decoupling exploration and learning for fast, scalable llm post-training. arXiv preprint arXiv:2503.18929, 2025

  44. [44]

    Requirements ambiguity detection and explanation with llms: An industrial study

    Sarmad Bashir, Alessio Ferrari, Abbas Khan, Per Erik Strandberg, Zulqarnain Haider, Mehrdad Saadatmand, and Markus Bohlin. Requirements ambiguity detection and explanation with llms: An industrial study. July 2025

  45. [45]

    Titans: Learning to memorize at test time

    Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. arXiv preprint arXiv:2501.00663, 2024

  46. [46]

    Superintelligent agents pose catastrophic risks: Can scientist ai offer a safer path?

    Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, Sören Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, et al. Superintelligent agents pose catastrophic risks: Can scientist ai offer a safer path? arXiv preprint arXiv:2502.15657, 2025

  47. [47]

    International ai safety report

    Yoshua Bengio, Sören Mindermann, Daniel Privitera, Tamay Besiroglu, Rishi Bommasani, Stephen Casper, Yejin Choi, Philip Fox, Ben Garfinkel, Danielle Goldfarb, et al. International ai safety report. arXiv preprint arXiv:2501.17805, 2025

  48. [48]

    The validation gap: A mechanistic analysis of how language models compute arithmetic but fail to validate it

    Leonardo Bertolazzi, Philipp Mondorf, Barbara Plank, and Raffaella Bernardi. The validation gap: A mechanistic analysis of how language models compute arithmetic but fail to validate it. arXiv preprint arXiv:2502.11771, 2025

  49. [49]

    Graph of thoughts: Solving elaborate problems with large language models

    Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. Graph of thoughts: Solving elaborate problems with large language models. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):17682–17690, March 2024. doi: 10.1609/aaai.v38i16.29720. URL https://ojs.aaai.org/index.php/AAAI/article/view/29720

  51. [51]

    Demystifying chains, trees, and graphs of thoughts

    Maciej Besta, Florim Memedi, Zhenyu Zhang, Robert Gerstenberger, Guangyuan Piao, Nils Blach, Piotr Nyczyk, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, et al. Demystifying chains, trees, and graphs of thoughts. arXiv preprint arXiv:2401.14295, 2024

  52. [52]

    Reasoning Language Models: A Blueprint

    Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, et al. Reasoning language models: A blueprint. arXiv preprint arXiv:2501.11223, 2025

  53. [53]

    Cot-kinetics: A theoretical modeling assessing lrm reasoning process

    Jinhe Bi, Danqi Yan, Yifan Wang, Wenke Huang, Haokun Chen, Guancheng Wan, Mang Ye, Xun Xiao, Hinrich Schuetze, Volker Tresp, et al. Cot-kinetics: A theoretical modeling assessing lrm reasoning process. arXiv preprint arXiv:2505.13408, 2025

  54. [54]

    DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024

  55. [55]

    When do program-of-thought works for reasoning?

    Zhen Bi, Ningyu Zhang, Yinuo Jiang, Shumin Deng, Guozhou Zheng, and Huajun Chen. When do program-of-thought works for reasoning? In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17691–17699, 2024. URL https://ojs.aaai.org/index.php/AAAI/article/view/29721/31237

  56. [56]

    Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning

    Zhenni Bi, Kai Han, Chuanjian Liu, Yehui Tang, and Yunhe Wang. Forest-of-thought: Scaling test-time compute for enhancing llm reasoning. arXiv preprint arXiv:2412.09078, 2024

  57. [57]

    On the query complexity of verifier-assisted language generation

    Edoardo Botta, Yuchen Li, Aashay Mehta, Jordan T Ash, Cyril Zhang, and Andrej Risteski. On the query complexity of verifier-assisted language generation. arXiv preprint arXiv:2502.12123, 2025

  58. [58]

    Vermcts: Synthesizing multi-step programs using a verifier, a large language model, and tree search

    David Brandfonbrener, Simon Henniger, Sibi Raja, Tarun Prasad, Chloe Loughridge, Federico Cassano, Sabrina Ruixin Hu, Jianang Yang, William E Byrd, Robert Zinkov, et al. Vermcts: Synthesizing multi-step programs using a verifier, a large language model, and tree search. arXiv preprint arXiv:2402.08147, 2024

  59. [59]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024

  60. [60]

    Distillation scaling laws

    Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, and Russ Webb. Distillation scaling laws. arXiv preprint arXiv:2502.08606, 2025

  61. [61]

    Test-time-scaling for zero-shot diagnosis with visual-language reasoning

    Ji Young Byun, Young-Jin Park, Navid Azizan, and Rama Chellappa. Test-time-scaling for zero-shot diagnosis with visual-language reasoning. arXiv preprint arXiv:2506.11166, 2025

  62. [62]

    ARES: Alternating reinforcement learning and supervised fine-tuning for enhanced multi-modal chain-of-thought reasoning through diverse AI feedback

    Ju-Seung Byun, Jiyun Chun, Jihyung Kil, and Andrew Perrault. ARES: Alternating reinforcement learning and supervised fine-tuning for enhanced multi-modal chain-of-thought reasoning through diverse AI feedback. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing...

  63. [63]

    System-2 mathematical reasoning via enriched instruction tuning

    Huanqia Cai, Yijun Yang, and Zhifeng Li. System-2 mathematical reasoning via enriched instruction tuning. arXiv preprint arXiv:2412.16964, 2024

  64. [64]

    Internlm2 technical report

    Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297, 2024

  65. [65]

    Xai meets llms: A survey of the relation between explainable ai and large language models

    Erik Cambria, Lorenzo Malandri, Fabio Mercorio, Navid Nobani, and Andrea Seveso. Xai meets llms: A survey of the relation between explainable ai and large language models. arXiv preprint arXiv:2407.15248, 2024

  66. [66]

    GraphReason: Enhancing reasoning capabilities of large language models through a graph-based verification approach

    Lang Cao. GraphReason: Enhancing reasoning capabilities of large language models through a graph-based verification approach. In Bhavana Dalvi Mishra, Greg Durrett, Peter Jansen, Ben Lipkin, Danilo Neves Ribeiro, Lionel Wong, Xi Ye, and Wenting Zhao, editors, Proceedings of the 2nd Workshop on Natural Language Reasoning and Structured Explanations (@ACL...

  67. [67]

    Behavior injection: Preparing language models for reinforcement learning

    Zhepeng Cen, Yihang Yao, William Han, Zuxin Liu, and Ding Zhao. Behavior injection: Preparing language models for reinforcement learning. arXiv preprint arXiv:2505.18917, 2025

  68. [68]

    xcot: Cross-lingual instruction tuning for cross-lingual chain-of-thought reasoning

    Linzheng Chai, Jian Yang, Tao Sun, Hongcheng Guo, Jiaheng Liu, Bing Wang, Xiannian Liang, Jiaqi Bai, Tongliang Li, Qiyao Peng, et al. xcot: Cross-lingual instruction tuning for cross-lingual chain-of-thought reasoning. arXiv preprint arXiv:2401.07037, 2024

  69. [69]

    MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

    Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, et al. Mle-bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095, 2024

  70. [70]

    On the convergence rate of mcts for the optimal value estimation in markov decision processes

    Hyeong Soo Chang. On the convergence rate of mcts for the optimal value estimation in markov decision processes. IEEE Transactions on Automatic Control, pages 1–6, February 2025. doi: 10.1109/TAC.2025.3538807. URL https://ieeexplore.ieee.org/document/10870057

  72. [72]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025

  73. [73]

    Evaluating o1-like llms: Unlocking reasoning for translation through comprehensive analysis

    Andong Chen, Yuchen Song, Wenxin Zhu, Kehai Chen, Muyun Yang, Tiejun Zhao, et al. Evaluating o1-like llms: Unlocking reasoning for translation through comprehensive analysis. arXiv preprint arXiv:2502.11544, 2025

  74. [74]

    Threading the needle: Reweaving chain-of-thought reasoning to explain human label variation

    Beiduo Chen, Yang Janet Liu, Anna Korhonen, and Barbara Plank. Threading the needle: Reweaving chain-of-thought reasoning to explain human label variation. arXiv preprint arXiv:2505.23368, 2025

  75. [75]

    Finereason: Evaluating and improving llms’ deliberate reasoning through reflective puzzle solving

    Guizhen Chen, Weiwen Xu, Hao Zhang, Hou Pong Chan, Chaoqun Liu, Lidong Bing, Deli Zhao, Anh Tuan Luu, and Yu Rong. Finereason: Evaluating and improving llms’ deliberate reasoning through reflective puzzle solving. arXiv preprint arXiv:2502.20238, 2025

  76. [76]

    Step-level value preference optimization for mathematical reasoning

    Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Step-level value preference optimization for mathematical reasoning. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7889–7903, Miami, Florida, USA, November 2024. Association for Computational Linguistics. d...

  77. [77]

    Alphamath almost zero: Process supervision without process

    Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Alphamath almost zero: Process supervision without process. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, September 2024. URL https://openreview.net/forum?id=VaXnxQ3UKo

  78. [78]

    ChineseEcomQA: A scalable e-commerce concept evaluation benchmark for large language models

    Haibin Chen, Kangtao Lv, Chengwei Hu, Yanshi Li, Yujin Yuan, Yancheng He, Xingyao Zhang, Langming Liu, Shilei Liu, Wenbo Su, et al. Chineseecomqa: A scalable e-commerce concept evaluation benchmark for large language models. arXiv preprint arXiv:2502.20196, 2025

  79. [79]

    Benchmarking large language models on answering and explaining challenging medical questions

    Hanjie Chen, Zhouxiang Fang, Yash Singla, and Mark Dredze. Benchmarking large language models on answering and explaining challenging medical questions. arXiv preprint arXiv:2402.18060, 2024

  80. [80]

    Language models are hidden reasoners: Unlocking latent reasoning capabilities via self-rewarding

    Haolin Chen, Yihao Feng, Zuxin Liu, Weiran Yao, Akshara Prabhakar, Shelby Heinecke, Ricky Ho, Phil Mui, Silvio Savarese, Caiming Xiong, et al. Language models are hidden reasoners: Unlocking latent reasoning capabilities via self-rewarding. arXiv preprint arXiv:2411.04282, 2024

Showing first 80 references.