POISE trains a lightweight probe on the actor's internal states to predict expected rewards for RLVR, matching DAPO performance on math benchmarks with lower compute by avoiding extra rollouts or critic models.
hub Canonical reference
Demystifying Long Chain-of-Thought Reasoning in LLMs
Canonical reference. 86% of citing Pith papers cite this work as background.
abstract
Scaling inference compute enhances reasoning in large language models (LLMs), with long chains-of-thought (CoTs) enabling strategies like backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the mechanics of long CoT reasoning, identifying the key factors that enable models to generate long CoT trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we present four main findings: (1) While SFT is not strictly necessary, it simplifies training and improves efficiency; (2) Reasoning capabilities tend to emerge with increased training compute, but their development is not guaranteed, making reward shaping crucial for stabilizing CoT length growth; (3) Scaling verifiable reward signals is critical for RL. We find that leveraging noisy, web-extracted solutions with filtering mechanisms shows strong potential, particularly for out-of-distribution (OOD) tasks such as STEM reasoning; and (4) Core abilities like error correction are inherently present in base models, but incentivizing these skills effectively for complex tasks via RL demands significant compute, and measuring their emergence requires a nuanced approach. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs. Our code is available at: https://github.com/eddycmu/demystify-long-cot.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
Chain-of-Zoom factorizes extreme super-resolution into an autoregressive sequence of intermediate scales using a reused backbone model plus GRPO-tuned multi-scale VLM prompts.
SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.
SEAR trains one LLM via adversarial process rewards to explore harmful reasoning paths but flip to safe outputs, reducing over-refusal while preserving safety.
Introduces a hierarchical latent selection model showing SFT supplies raw module materials in compound traces while RL decomposes them to identify atomic modules and enable recombination for new reasoning configurations.
Post-training quantization increases overthinking errors in reasoning models; a logit penalty on curated overthinking markers reduces CoT length 12-23% without accuracy loss.
CIE-Scorer detects unfaithful CoT by tracing compact sentence-level circuits, building internal-external reasoning graphs, and scoring their discrepancy with Fused Gromov-Wasserstein distance, reporting SOTA results on FaithCoT-Bench with reduced circuit cost.
CLORE augments correct on-policy rollouts by deleting repetitive and irrelevant segments then optimizes with auxiliary DPO to improve accuracy-efficiency trade-off on math benchmarks.
LC-ERD frames LLM self-alignment as latent structure mining via a Variational Logic Potential and Multi-Agent Value Decomposition to provide granular, logic-consistent supervision.
SCOLAR fixes information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens via a detransformer, extending acceptable CoT length over 30x and delivering +14.12% gains on reasoning benchmarks.
BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.
Hint Tuning reduces token usage 24-66% (31.5% avg) in reasoning models via 1K self-annotated samples aligned to an instruct model's capabilities while keeping benchmark accuracy.
VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.
PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.
rePIRL learns effective process reward models for LLM reasoning via a dual policy-PRM update process inspired by inverse RL, unifying online and offline methods with reported gains over prior approaches on math and coding datasets.
Group Causal Counterfactual Policy Optimization trains LLMs on generalizable reasoning by defining episodic rewards for counterfactual robustness and transferability then optimizing the policy with token-level advantages.
Geo-R1 uses indirect proxy rewards from cross-view alignment with geolocation metadata to drive reinforcement learning, enabling zero-shot geospatial reasoning that transfers across 25+ tasks and sometimes exceeds supervised specialists.
CoT reasoning is a brittle mirage governed by distribution discrepancy between training and test data, demonstrated via controlled experiments in the new DataAlchemy environment.
LongWriter-Zero applies RL from a base model with specialized rewards for length, quality, and structure to outperform SFT baselines and larger models on long-writing benchmarks.
ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.
A simple PPO-based RL training pipeline on base models scales reasoning performance and response length, outperforming prior work on math and science benchmarks with one-tenth the training steps.
Reasoning in large output spaces proceeds via shortlisting then fine-grained reasoning; this characterization enables a mechanistic distillation strategy that outperforms standard distillation.
TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.
citing papers explorer
-
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
-
Detecting Unfaithful Chain-of-Thought via Circuit-Guided Internal-External Discrepancy
CIE-Scorer detects unfaithful CoT by tracing compact sentence-level circuits, building internal-external reasoning graphs, and scoring their discrepancy with Fused Gromov-Wasserstein distance, reporting SOTA results on FaithCoT-Bench with reduced circuit cost.
-
CLORE: Content-Level Optimization for Reasoning Efficiency
CLORE augments correct on-policy rollouts by deleting repetitive and irrelevant segments then optimizes with auxiliary DPO to improve accuracy-efficiency trade-off on math benchmarks.
-
LC-ERD: Mining Latent Logic for Self-Evolving Reasoning via Consistency-Regulated Reward Decomposition
LC-ERD frames LLM self-alignment as latent structure mining via a Variational Logic Potential and Multi-Agent Value Decomposition to provide granular, logic-consistent supervision.
-
Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning
BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.
-
PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs
PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.
-
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
CoT reasoning is a brittle mirage governed by distribution discrepancy between training and test data, demonstrated via controlled experiments in the new DataAlchemy environment.
-
SLAT: Segment-Level Adaptive Trimming for Efficient CoT Reasoning
SLAT applies segment-level adaptive trimming in RL to reduce CoT reasoning length by 50% while maintaining competitive accuracy on benchmarks.
-
Phi-4-reasoning Technical Report
A 14B reasoning model trained via supervised fine-tuning on selected prompts and o3-mini traces, plus outcome RL, outperforms larger open models like DeepSeek-R1-Distill-Llama-70B on math, coding, planning and related benchmarks.
-
From System 1 to System 2: A Survey of Reasoning Large Language Models
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.