Demystifying Long Chain-of-Thought Reasoning in LLMs
Pith reviewed 2026-05-19 01:26 UTC · model grok-4.3
The pith
Scaling verifiable rewards, including rewards derived from filtered noisy web data, is critical for long chain-of-thought reasoning to emerge under RL in large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors find that scaling verifiable reward signals is critical for reinforcement learning to produce long CoT trajectories; they report that noisy web-extracted solutions, once passed through filtering mechanisms, show strong potential as training signals, particularly on out-of-distribution tasks such as STEM reasoning.
What carries the argument
Verifiable reward signals obtained from filtered, noisy web-extracted solutions, combined with reward shaping that stabilizes CoT length growth during RL.
If this is right
- SFT is not strictly necessary for reaching long CoT, but including it simplifies training and improves efficiency.
- Reward shaping must be used to keep CoT length from stalling as training scale increases.
- Noisy but filtered web data can substitute for cleaner sources on out-of-distribution reasoning tasks.
- Error-correction skills already exist in base models and become usable on hard problems once RL compute is large enough.
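The interplay between a verifiable reward and reward shaping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's reward formulation: the `\boxed{...}` answer convention, the exact-string match, and the names `extract_final_answer`, `verifiable_reward`, `shaped_reward`, `max_tokens`, and `length_weight` are all hypothetical choices made for the example.

```python
import re
from typing import Optional


def extract_final_answer(response: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} span, or None if absent."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def verifiable_reward(response: str, reference: str) -> float:
    """1.0 if the extracted final answer equals the reference string, else 0.0."""
    answer = extract_final_answer(response)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0


def shaped_reward(response: str, reference: str, n_tokens: int,
                  max_tokens: int = 4096, length_weight: float = 0.5) -> float:
    """Hypothetical length-aware shaping on top of the 0/1 verifiable reward."""
    if n_tokens >= max_tokens:
        return -1.0  # rollout hit the length limit: treat as a failure
    frac = n_tokens / max_tokens
    if verifiable_reward(response, reference) == 1.0:
        return 1.0 - length_weight * frac  # correct: shorter scores higher
    return -1.0 + length_weight * frac  # incorrect: longer attempts penalized less
```

The design choice here is that length only modulates the reward within each correctness class, so a correct answer always scores above an incorrect one; whether this particular shape stabilizes CoT length in practice is an empirical question the sketch does not answer.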
Where Pith is reading between the lines
- This route may lower the cost of building reasoning models by reducing dependence on hand-curated datasets.
- The same filtering-plus-RL pattern could be tested on non-STEM domains to see how far the OOD benefit extends.
- If the filters are made more transparent, researchers could measure exactly which data properties most help long CoT appear.
Load-bearing premise
The filtering mechanisms must turn noisy web-extracted data into reliable training signals without introducing biases that block long CoT emergence or hurt generalization to new tasks.
What would settle it
Train an identical model using the same RL setup but with unfiltered web data and check whether CoT length fails to grow and OOD accuracy on STEM problems stays flat.
Original abstract
Scaling inference compute enhances reasoning in large language models (LLMs), with long chains-of-thought (CoTs) enabling strategies like backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the mechanics of long CoT reasoning, identifying the key factors that enable models to generate long CoT trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we present four main findings: (1) While SFT is not strictly necessary, it simplifies training and improves efficiency; (2) Reasoning capabilities tend to emerge with increased training compute, but their development is not guaranteed, making reward shaping crucial for stabilizing CoT length growth; (3) Scaling verifiable reward signals is critical for RL. We find that leveraging noisy, web-extracted solutions with filtering mechanisms shows strong potential, particularly for out-of-distribution (OOD) tasks such as STEM reasoning; and (4) Core abilities like error correction are inherently present in base models, but incentivizing these skills effectively for complex tasks via RL demands significant compute, and measuring their emergence requires a nuanced approach. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs. Our code is available at: https://github.com/eddycmu/demystify-long-cot.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates the emergence of long chain-of-thought (CoT) reasoning in LLMs using SFT and RL experiments. It reports four findings: SFT is not strictly necessary but aids efficiency; reasoning emerges with increased compute but requires reward shaping to stabilize CoT length; scaling verifiable rewards via noisy web-extracted solutions with filtering shows promise especially for OOD STEM tasks; and core skills like error correction exist in base models but need substantial compute and nuanced measurement to incentivize for complex tasks. Code is released for reproducibility.
Significance. If the experimental results hold, the work offers actionable insights into RL design choices for long CoT, particularly the viability of filtered noisy data sources for verifiable rewards and OOD generalization. The public code release strengthens the contribution by enabling direct replication and extension of the training setups.
major comments (1)
- Finding (3): The assertion that noisy web-extracted solutions combined with filtering mechanisms yield strong potential for OOD tasks such as STEM reasoning is load-bearing for the central claim about scaling verifiable rewards. The manuscript does not appear to include explicit validation that the (unspecified) filters avoid selection bias toward problems whose surface features align with the base model's pretraining distribution or that they preserve examples requiring genuine backtracking and error correction; without such checks, observed OOD gains could reflect easier subset selection rather than improved reasoning incentives.
minor comments (2)
- The abstract and findings sections would benefit from explicit quantitative definitions and metrics for 'long CoT' (e.g., token length thresholds, backtracking frequency) and for measuring emergence/stabilization of reasoning capabilities.
- Ensure all experimental details—RL reward formulations, exact filtering criteria, data exclusion rules, hyperparameter sweeps, and statistical reporting (including error bars or significance tests)—are fully specified in the main text or appendix to support the reproducibility claim.
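As an illustration of the kind of quantitative definition the first minor comment asks for, the sketch below computes a length statistic and a crude backtracking frequency over a batch of responses. It is only a sketch under stated assumptions: whitespace splitting stands in for a real tokenizer, and the marker list, the `long_threshold` parameter, and the function names are hypothetical and not taken from the paper. Keyword counting is a rough proxy; the abstract itself notes that measuring the emergence of such behaviors requires a nuanced approach.

```python
import re
from statistics import mean

# Hypothetical surface markers of backtracking; a real study would validate these.
BACKTRACK_MARKERS = ("wait", "let me re-check", "on second thought", "alternatively")


def cot_length(response: str) -> int:
    """Whitespace token count, used here as a stand-in for tokenizer length."""
    return len(response.split())


def backtrack_count(response: str) -> int:
    """Count case-insensitive whole-phrase occurrences of the marker phrases."""
    text = response.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(marker) + r"\b", text))
        for marker in BACKTRACK_MARKERS
    )


def summarize(responses, long_threshold: int = 1000) -> dict:
    """Aggregate length and backtracking statistics over a non-empty batch."""
    lengths = [cot_length(r) for r in responses]
    return {
        "mean_length": mean(lengths),
        "frac_long": sum(n >= long_threshold for n in lengths) / len(lengths),
        "backtracks_per_response": mean(backtrack_count(r) for r in responses),
    }
```

Tracking these numbers per training step would make claims about CoT length "emerging" or "stabilizing" checkable against an explicit threshold.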
Simulated Author's Rebuttal
Thank you for the constructive feedback on our paper 'Demystifying Long Chain-of-Thought Reasoning in LLMs'. We address the referee's major comment below and are prepared to revise the manuscript accordingly to strengthen the presentation of our results on scaling verifiable rewards.
Point-by-point responses
Referee: Finding (3): The assertion that noisy web-extracted solutions combined with filtering mechanisms yield strong potential for OOD tasks such as STEM reasoning is load-bearing for the central claim about scaling verifiable rewards. The manuscript does not appear to include explicit validation that the (unspecified) filters avoid selection bias toward problems whose surface features align with the base model's pretraining distribution or that they preserve examples requiring genuine backtracking and error correction; without such checks, observed OOD gains could reflect easier subset selection rather than improved reasoning incentives.
Authors: We thank the referee for this insightful comment. Our filtering mechanisms primarily consist of removing solutions that are incomplete, contain obvious errors, or fall below a minimum length threshold, as detailed in Section 4.2 of the manuscript. While we did not explicitly quantify selection bias or the preservation of backtracking examples in the original submission, our OOD gains are supported by comparisons to baselines using unfiltered data, where performance was significantly lower. To strengthen the claim, we will add a new analysis in the revised manuscript, including a comparison of problem difficulty distributions (measured by the number of required reasoning steps) before and after filtering, and an ablation study training on size-matched random subsets. This will help demonstrate that the improvements stem from better reasoning incentives rather than subset selection.
revision: yes
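The filtering and ablation described in this response could be set up roughly as below. This is a hedged sketch, not the authors' pipeline: the `Example` record, its `reasoning_steps` field, and the `is_complete` and `has_obvious_error` predicates are hypothetical placeholders that a real implementation would have to define.

```python
import random
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    problem: str
    solution: str
    reasoning_steps: int  # hypothetical difficulty proxy


def filter_examples(examples: List[Example], min_tokens: int,
                    is_complete: Callable[[str], bool],
                    has_obvious_error: Callable[[str], bool]) -> List[Example]:
    """Keep solutions that are long enough, complete, and free of obvious errors."""
    return [
        ex for ex in examples
        if len(ex.solution.split()) >= min_tokens
        and is_complete(ex.solution)
        and not has_obvious_error(ex.solution)
    ]


def size_matched_random_subset(examples: List[Example], size: int,
                               seed: int = 0) -> List[Example]:
    """Control set: uniform sample without replacement, same size as the filtered set."""
    return random.Random(seed).sample(examples, size)


def step_histogram(examples: List[Example]) -> Counter:
    """Distribution of the difficulty proxy, for before/after comparison."""
    return Counter(ex.reasoning_steps for ex in examples)
```

Training one model on `filter_examples(...)` and another on `size_matched_random_subset(...)` of equal size, then comparing `step_histogram` for the two sets, is one way to separate the effect of filtering from the effect of subset selection.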
Circularity Check
No circularity: empirical findings from SFT/RL experiments
Full rationale
The paper presents four findings based on direct experimental outcomes from supervised fine-tuning and reinforcement learning runs on LLMs. Claims about long CoT emergence, reward scaling with filtered web data, and error correction are tied to observed training dynamics and OOD performance metrics rather than any closed-loop derivation, self-referential definition, or parameter fit renamed as prediction. No equations, uniqueness theorems, or load-bearing self-citations appear in the reported chain; results are externally falsifiable via the released code and benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Reward shaping is crucial for stabilizing CoT length growth during RL training.
Forward citations
Cited by 22 Pith papers
- Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States
  POISE trains a lightweight probe on the actor's internal states to predict expected rewards for RLVR, matching DAPO performance on math benchmarks with lower compute by avoiding extra rollouts or critic models.
- Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States
  POISE estimates value baselines for RL in LLMs from the actor's internal states via a lightweight probe and cross-rollout construction, matching DAPO performance with lower compute on math reasoning benchmarks.
- Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
  Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
- SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
  SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.
- Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
  SCOLAR addresses information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens from LLM hidden states, extending acceptable CoT length over 30x and achieving +14.12% gains on b...
- Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
  SCOLAR fixes information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens via a detransformer, extending acceptable CoT length over 30x and delivering +14.12% gains on reasoni...
- Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning
  BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.
- Hint Tuning: Less Data Makes Better Reasoners
  Hint Tuning uses an instruct model as a difficulty probe to create 1K multi-level hint examples that train reasoning models to calibrate chain-of-thought length, cutting tokens by 31.5% on average across 4B-32B models...
- VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model
  VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.
- PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs
  PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.
- Towards Generalizable Reasoning: Group Causal Counterfactual Policy Optimization for LLM Reasoning
  Group Causal Counterfactual Policy Optimization trains LLMs on generalizable reasoning by defining episodic rewards for counterfactual robustness and transferability then optimizing the policy with token-level advantages.
- Unlocking Zero-Shot Geospatial Reasoning via Indirect Rewards
  Geo-R1 uses indirect proxy rewards from cross-view alignment with geolocation metadata to drive reinforcement learning, enabling zero-shot geospatial reasoning that transfers across 25+ tasks and sometimes exceeds sup...
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
  CoT reasoning is a brittle mirage governed by distribution discrepancy between training and test data, demonstrated via controlled experiments in the new DataAlchemy environment.
- LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning
  LongWriter-Zero applies RL from a base model with specialized rewards for length, quality, and structure to outperform SFT baselines and larger models on long-writing benchmarks.
- Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
  A simple PPO-based RL training pipeline on base models scales reasoning performance and response length, outperforming prior work on math and science benchmarks with one-tenth the training steps.
- StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
  StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.
- Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
  Video generation models demonstrate competitive multimodal reasoning on a new benchmark, matching or exceeding VLMs on visual puzzles and achieving 92% on MATH and 69.2% on MMMU.
- Self-Aligned Reward: Towards Effective and Efficient Reasoners
  Self-aligned reward uses relative perplexity differences to encourage concise, query-specific reasoning in LLMs, yielding 4% accuracy gains and 30% lower inference cost when added to PPO or GRPO.
- ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
  ThinkAct introduces reinforced visual latent planning in a dual VLA system to enable better long-horizon reasoning and adaptation for embodied tasks.
- Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
  A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.
- Phi-4-reasoning Technical Report
  A 14B reasoning model trained via supervised fine-tuning on selected prompts and o3-mini traces, plus outcome RL, outperforms larger open models like DeepSeek-R1-Distill-Llama-70B on math, coding, planning and related...
- From System 1 to System 2: A Survey of Reasoning Large Language Models
  The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.