arxiv: 2502.03373 · v1 · pith:5777KZB3 · new · submitted 2025-02-05 · 💻 cs.CL · cs.LG

Demystifying Long Chain-of-Thought Reasoning in LLMs

Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, Xiang Yue

Pith reviewed 2026-05-19 01:26 UTC · model grok-4.3

keywords chain-of-thought reasoning · reinforcement learning · large language models · verifiable rewards · web data filtering · out-of-distribution generalization · error correction · training compute

The pith

Scaling verifiable rewards from filtered noisy web data is what drives long chain-of-thought reasoning to emerge in large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the training conditions that allow large language models to produce long chains of thought during reinforcement learning. Experiments show that supervised fine-tuning is not strictly necessary but simplifies training and improves efficiency, and that reasoning length does not grow reliably with more compute alone, so reward shaping is needed to stabilize it. The central result is that scaling verifiable reward signals works even when the signals come from noisy web-scraped solutions that have been filtered, and that this approach shows strong potential particularly on out-of-distribution problems such as STEM tasks. Readers care because the findings give practical guidance for choosing data and reward designs that make long CoT reasoning emerge more reliably.

Core claim

The authors establish that scaling verifiable reward signals is critical for reinforcement learning to produce long CoT trajectories; they demonstrate that noisy web-extracted solutions, when passed through filtering mechanisms, supply effective training signals and improve performance especially on out-of-distribution tasks such as STEM reasoning.

What carries the argument

Verifiable reward signals obtained from filtered noisy web-extracted solutions, which stabilize CoT length growth and incentivize error correction during RL.
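To make "verifiable reward" concrete, here is a minimal sketch assuming a rule-based check: the final answer is pulled from a `\boxed{...}` span in the model output and compared with a reference string. The function names `extract_boxed` and `verifiable_reward`, the boxed-answer convention, and the whitespace normalization are illustrative assumptions, not the paper's implementation.

```python
import re
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    r"""Return the content of the last \boxed{...} span, or None.

    Assumption: answers contain no nested braces.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def verifiable_reward(response: str, reference: str) -> float:
    """Binary reward: 1.0 if the extracted answer equals the reference, else 0.0."""
    answer = extract_boxed(response)
    if answer is None:
        # No parseable final answer, so nothing can be verified.
        return 0.0

    def normalize(s: str) -> str:
        # Drop all whitespace so "1 / 2" and "1/2" compare equal.
        return "".join(s.split())

    return 1.0 if normalize(answer) == normalize(reference) else 0.0
```

A checker like this only scores the final answer, so it says nothing about whether the intermediate reasoning was sound.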

If this is right

SFT can be omitted and long CoT can still be reached with sufficient RL compute, though including it simplifies training and improves efficiency.
Reward shaping must be used to keep CoT length from stalling as training scale increases.
Noisy but filtered web data can substitute for cleaner sources on out-of-distribution reasoning tasks.
Error-correction skills already exist in base models and become usable on hard problems once RL compute is large enough.
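The reward-shaping point can be illustrated with a hypothetical length-aware reward. In this sketch, the reward is interpolated along a cosine curve between a value at length 0 and a value at the length cap, with separate endpoints for correct and wrong answers and a flat penalty once the cap is hit. The functional form, the name `shaped_reward`, and all default constants are assumptions chosen for illustration; this summary does not specify the paper's formulation.

```python
import math
from typing import Tuple


def shaped_reward(
    correct: bool,
    length: int,
    max_len: int,
    r_correct: Tuple[float, float] = (1.0, 1.0),   # (reward at length 0, reward at max_len)
    r_wrong: Tuple[float, float] = (-10.0, 0.0),   # wrong answers are penalized less when longer
    r_exceed: float = -10.0,                       # flat penalty for hitting the length cap
) -> float:
    """Hypothetical length-aware reward using cosine interpolation."""
    if length >= max_len:
        return r_exceed
    r_start, r_end = r_correct if correct else r_wrong
    # Weight moves smoothly from 1 at length 0 to 0 at length == max_len.
    w = 0.5 * (1.0 + math.cos(math.pi * length / max_len))
    return r_end + (r_start - r_end) * w
```

With these placeholder defaults, a wrong answer at half the cap scores about -5, and any output that reaches the cap scores -10 regardless of correctness, which discourages runaway length.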

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This route may lower the cost of building reasoning models by reducing dependence on hand-curated datasets.
The same filtering-plus-RL pattern could be tested on non-STEM domains to see how far the OOD benefit extends.
If the filters are made more transparent, researchers could measure exactly which data properties most help long CoT appear.

Load-bearing premise

The filtering mechanisms must turn noisy web-extracted data into reliable training signals without introducing biases that block long CoT emergence or hurt generalization to new tasks.

What would settle it

Train an identical model using the same RL setup but with unfiltered web data and check whether CoT length fails to grow and OOD accuracy on STEM problems stays flat.
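One way to read "CoT length fails to grow" quantitatively is to fit a trend line to mean response length over training steps for each run. The sketch below computes an ordinary least-squares slope; the function name `length_slope` and the idea of comparing slopes between the filtered and unfiltered runs are assumptions for illustration, not a procedure taken from the paper.

```python
from typing import Sequence


def length_slope(steps: Sequence[float], mean_lengths: Sequence[float]) -> float:
    """Ordinary least-squares slope of mean CoT length against training step."""
    n = len(steps)
    if n < 2 or n != len(mean_lengths):
        raise ValueError("need at least two paired (step, length) points")
    mean_x = sum(steps) / n
    mean_y = sum(mean_lengths) / n
    sxx = sum((x - mean_x) ** 2 for x in steps)
    if sxx == 0:
        raise ValueError("steps must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(steps, mean_lengths))
    return sxy / sxx
```

A slope near zero for the unfiltered run alongside a clearly positive slope for the filtered run would match the predicted outcome, although a single slope hides instabilities such as a late collapse in length.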

Original abstract

Scaling inference compute enhances reasoning in large language models (LLMs), with long chains-of-thought (CoTs) enabling strategies like backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the mechanics of long CoT reasoning, identifying the key factors that enable models to generate long CoT trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we present four main findings: (1) While SFT is not strictly necessary, it simplifies training and improves efficiency; (2) Reasoning capabilities tend to emerge with increased training compute, but their development is not guaranteed, making reward shaping crucial for stabilizing CoT length growth; (3) Scaling verifiable reward signals is critical for RL. We find that leveraging noisy, web-extracted solutions with filtering mechanisms shows strong potential, particularly for out-of-distribution (OOD) tasks such as STEM reasoning; and (4) Core abilities like error correction are inherently present in base models, but incentivizing these skills effectively for complex tasks via RL demands significant compute, and measuring their emergence requires a nuanced approach. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs. Our code is available at: https://github.com/eddycmu/demystify-long-cot.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper runs useful ablations on SFT and RL for long CoT emergence and surfaces four practical observations, though the noisy-data filtering claim needs more scrutiny on bias.

The core takeaway is that this work tests concrete factors behind long chain-of-thought in RL-trained LLMs and reports four observations that could guide training choices. SFT is not required but speeds things up. Compute alone does not guarantee longer traces, so reward shaping matters for stabilizing length growth. Noisy web solutions plus filtering can scale verifiable rewards and help on OOD STEM tasks. Base models already hold some error-correction ability, but eliciting it on hard problems still demands substantial training compute. They release code, which helps reproducibility.

These points build directly on prior RL-for-reasoning papers by adding targeted tests on data sources and reward design rather than introducing a new algorithm. The experiments focus on verifiable rewards, which is a sensible choice for STEM domains.

The weakest part is the third finding. The abstract presents filtering of noisy web data as effective for OOD generalization, yet it does not spell out the filter rules or show that they preserve examples requiring genuine backtracking instead of just keeping surface-similar problems. That leaves room for the selection-bias concern raised in the stress-test note. If the full paper includes ablations on filter variants or checks that hard reasoning traces survive, the claim strengthens; otherwise it stays suggestive.

Readers working on scaling reasoning models will find the heuristics worth testing. The questions are timely, the setup is described enough to replicate with the code, and the work shows clear engagement with the literature. It deserves peer review rather than a desk reject.

Referee Report

1 major / 2 minor

Summary. The paper investigates the emergence of long chain-of-thought (CoT) reasoning in LLMs using SFT and RL experiments. It reports four findings: SFT is not strictly necessary but aids efficiency; reasoning emerges with increased compute but requires reward shaping to stabilize CoT length; scaling verifiable rewards via noisy web-extracted solutions with filtering shows promise especially for OOD STEM tasks; and core skills like error correction exist in base models but need substantial compute and nuanced measurement to incentivize for complex tasks. Code is released for reproducibility.

Significance. If the experimental results hold, the work offers actionable insights into RL design choices for long CoT, particularly the viability of filtered noisy data sources for verifiable rewards and OOD generalization. The public code release strengthens the contribution by enabling direct replication and extension of the training setups.

major comments (1)

Finding (3): The assertion that noisy web-extracted solutions combined with filtering mechanisms yield strong potential for OOD tasks such as STEM reasoning is load-bearing for the central claim about scaling verifiable rewards. The manuscript does not appear to include explicit validation that the (unspecified) filters avoid selection bias toward problems whose surface features align with the base model's pretraining distribution or that they preserve examples requiring genuine backtracking and error correction; without such checks, observed OOD gains could reflect easier subset selection rather than improved reasoning incentives.

minor comments (2)

The abstract and findings sections would benefit from explicit quantitative definitions and metrics for 'long CoT' (e.g., token length thresholds, backtracking frequency) and for measuring emergence/stabilization of reasoning capabilities.
Ensure all experimental details—RL reward formulations, exact filtering criteria, data exclusion rules, hyperparameter sweeps, and statistical reporting (including error bars or significance tests)—are fully specified in the main text or appendix to support the reproducibility claim.
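The kind of quantitative definition requested in the first minor comment could look like the sketch below, which reports a token count, a count of backtracking marker phrases, and a thresholded "long CoT" flag. The marker list, the whitespace tokenization, and the default threshold are all hypothetical choices made for illustration; none of them comes from the paper.

```python
import re
from dataclasses import dataclass

# Hypothetical marker phrases; a real study would need a validated list.
BACKTRACK_MARKERS = ("wait", "let me check", "alternatively", "on second thought")


@dataclass
class CoTStats:
    n_tokens: int      # whitespace-delimited tokens, a rough proxy for model tokens
    n_backtracks: int  # total occurrences of marker phrases
    is_long: bool      # True when n_tokens >= long_threshold


def cot_stats(text: str, long_threshold: int = 1000) -> CoTStats:
    """Compute crude length and backtracking statistics for one response."""
    lowered = text.lower()
    n_tokens = len(lowered.split())
    n_backtracks = sum(
        len(re.findall(r"\b" + re.escape(marker) + r"\b", lowered))
        for marker in BACKTRACK_MARKERS
    )
    return CoTStats(n_tokens, n_backtracks, n_tokens >= long_threshold)
```

Keyword counts are a blunt instrument: a model can backtrack without using any of these phrases, or use them without changing course, which is in line with the abstract's remark that measuring emergence requires a nuanced approach.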

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the constructive feedback on our paper 'Demystifying Long Chain-of-Thought Reasoning in LLMs'. We address the referee's major comment below and are prepared to revise the manuscript accordingly to strengthen the presentation of our results on scaling verifiable rewards.

Point-by-point responses

Referee: Finding (3): The assertion that noisy web-extracted solutions combined with filtering mechanisms yield strong potential for OOD tasks such as STEM reasoning is load-bearing for the central claim about scaling verifiable rewards. The manuscript does not appear to include explicit validation that the (unspecified) filters avoid selection bias toward problems whose surface features align with the base model's pretraining distribution or that they preserve examples requiring genuine backtracking and error correction; without such checks, observed OOD gains could reflect easier subset selection rather than improved reasoning incentives.

Authors: We thank the referee for this insightful comment. Our filtering mechanisms primarily consist of removing solutions that are incomplete, contain obvious errors, or fall below a minimum length threshold, as detailed in Section 4.2 of the manuscript. While we did not explicitly quantify selection bias or the preservation of backtracking examples in the original submission, our OOD gains are supported by comparisons to baselines using unfiltered data, where performance was significantly lower. To strengthen the claim, we will add a new analysis in the revised manuscript, including a comparison of problem difficulty distributions (measured by the number of required reasoning steps) before and after filtering, and an ablation study training on size-matched random subsets. This will help demonstrate that the improvements stem from better reasoning incentives rather than subset selection.

revision: yes
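The filtering rules and the size-matched control described in this response can be sketched as follows. This is a minimal illustration under stated assumptions: each solution is a dict with hypothetical `text` and `final_answer` fields, a missing final answer stands in for "incomplete", and the "obvious errors" check is left out because the response does not say how it would be detected.

```python
import random
from typing import Dict, List, Tuple


def keep_solution(sol: Dict, min_tokens: int = 20) -> bool:
    """Keep a solution only if it is long enough and has a final answer."""
    if len(sol["text"].split()) < min_tokens:
        return False  # below the minimum length threshold
    if sol.get("final_answer") is None:
        return False  # treated here as an incomplete solution
    return True


def filtered_and_matched(
    pool: List[Dict], seed: int = 0, min_tokens: int = 20
) -> Tuple[List[Dict], List[Dict]]:
    """Return (filtered set, random control set of the same size)."""
    filtered = [s for s in pool if keep_solution(s, min_tokens)]
    rng = random.Random(seed)  # seeded so the control subset is reproducible
    control = rng.sample(pool, k=len(filtered))
    return filtered, control
```

Training one model on `filtered` and another on `control` holds dataset size fixed, so a gap between them can be attributed to which examples were kept and not to how many.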

Circularity Check

0 steps flagged

No circularity: empirical findings from SFT/RL experiments

full rationale

The paper presents four findings based on direct experimental outcomes from supervised fine-tuning and reinforcement learning runs on LLMs. Claims about long CoT emergence, reward scaling with filtered web data, and error correction are tied to observed training dynamics and OOD performance metrics rather than any closed-loop derivation, self-referential definition, or parameter fit renamed as prediction. No equations, uniqueness theorems, or load-bearing self-citations appear in the reported chain; results are externally falsifiable via the released code and benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical study relying on standard assumptions in RL and LLM training rather than introducing new theoretical axioms or entities; no free parameters or invented entities are explicitly described in the abstract.

axioms (1)

domain assumption: Reward shaping is crucial for stabilizing CoT length growth during RL training.
Invoked as necessary to ensure reasoning capabilities develop with increased training compute.

pith-pipeline@v0.9.0 · 5801 in / 1146 out tokens · 55073 ms · 2026-05-19T01:26:16.742704+00:00 · methodology

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States
cs.LG 2026-05 unverdicted novelty 7.0

POISE trains a lightweight probe on the actor's internal states to predict expected rewards for RLVR, matching DAPO performance on math benchmarks with lower compute by avoiding extra rollouts or critic models.
Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States
cs.LG 2026-05 unverdicted novelty 7.0

POISE estimates value baselines for RL in LLMs from the actor's internal states via a lightweight probe and cross-rollout construction, matching DAPO performance with lower compute on math reasoning benchmarks.
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
cs.AI 2026-05 conditional novelty 7.0

Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
cs.SE 2025-02 unverdicted novelty 7.0

SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.
Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
cs.CV 2026-05 unverdicted novelty 6.0

SCOLAR addresses information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens from LLM hidden states, extending acceptable CoT length over 30x and achieving +14.12% gains on b...
Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
cs.CV 2026-05 unverdicted novelty 6.0

SCOLAR fixes information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens via a detransformer, extending acceptable CoT length over 30x and delivering +14.12% gains on reasoni...
Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.
Hint Tuning: Less Data Makes Better Reasoners
cs.CL 2026-05 unverdicted novelty 6.0

Hint Tuning uses an instruct model as a difficulty probe to create 1K multi-level hint examples that train reasoning models to calibrate chain-of-thought length, cutting tokens by 31.5% on average across 4B-32B models...
VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model
cs.RO 2026-05 unverdicted novelty 6.0

VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.
PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs
cs.AI 2026-05 unverdicted novelty 6.0

PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.
Towards Generalizable Reasoning: Group Causal Counterfactual Policy Optimization for LLM Reasoning
cs.LG 2026-02 unverdicted novelty 6.0

Group Causal Counterfactual Policy Optimization trains LLMs on generalizable reasoning by defining episodic rewards for counterfactual robustness and transferability then optimizing the policy with token-level advantages.
Unlocking Zero-Shot Geospatial Reasoning via Indirect Rewards
cs.CV 2025-09 unverdicted novelty 6.0

Geo-R1 uses indirect proxy rewards from cross-view alignment with geolocation metadata to drive reinforcement learning, enabling zero-shot geospatial reasoning that transfers across 25+ tasks and sometimes exceeds sup...
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
cs.AI 2025-08 unverdicted novelty 6.0

CoT reasoning is a brittle mirage governed by distribution discrepancy between training and test data, demonstrated via controlled experiments in the new DataAlchemy environment.
LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning
cs.CL 2025-06 unverdicted novelty 6.0

LongWriter-Zero applies RL from a base model with specialized rewards for length, quality, and structure to outperform SFT baselines and larger models on long-writing benchmarks.
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
cs.LG 2025-03 unverdicted novelty 6.0

A simple PPO-based RL training pipeline on base models scales reasoning performance and response length, outperforming prior work on math and science benchmarks with one-tenth the training steps.
StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
cs.CL 2026-05 unverdicted novelty 5.0

StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
cs.CV 2025-11 unverdicted novelty 5.0

Video generation models demonstrate competitive multimodal reasoning on a new benchmark, matching or exceeding VLMs on visual puzzles and achieving 92% on MATH and 69.2% on MMMU.
Self-Aligned Reward: Towards Effective and Efficient Reasoners
cs.LG 2025-09 unverdicted novelty 5.0

Self-aligned reward uses relative perplexity differences to encourage concise, query-specific reasoning in LLMs, yielding 4% accuracy gains and 30% lower inference cost when added to PPO or GRPO.
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
cs.CV 2025-07 unverdicted novelty 5.0

ThinkAct introduces reinforced visual latent planning in a dual VLA system to enable better long-horizon reasoning and adaptation for embodied tasks.
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
cs.CL 2025-03 accept novelty 5.0

A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.
Phi-4-reasoning Technical Report
cs.AI 2025-04 unverdicted novelty 4.0

A 14B reasoning model trained via supervised fine-tuning on selected prompts and o3-mini traces, plus outcome RL, outperforms larger open models like DeepSeek-R1-Distill-Llama-70B on math, coding, planning and related...
From System 1 to System 2: A Survey of Reasoning Large Language Models
cs.AI 2025-02 accept novelty 3.0

The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
