LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning outputs than base models on math benchmarks.
hub Mixed citations
Training language models to reason efficiently
Mixed citation behavior. Most common role is background (67%).
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
UniR is a composable reasoning module trained with verifiable rewards and added to frozen LLMs via logit summation, enabling modular composition and weak-to-strong generalization across tasks and model sizes.
LCPO trains L1 reasoning models to adhere to prompt-specified CoT lengths, supporting accuracy-compute trade-offs and yielding short reasoning models that outperform larger baselines at matched lengths.
DyCon dynamically controls reasoning depth in LRMs by modeling evolving difficulty from step-level embeddings, reducing redundant steps across multiple benchmarks.
ThoughtFold applies introspective redundancy detection within correct CoT trajectories to create sub-trajectory spectra, then uses masked preference optimization to penalize redundant explorations, yielding 56% token reduction on DeepSeek-R1-Distill-Qwen-7B while preserving accuracy.
CLORE augments correct on-policy rollouts by deleting repetitive and irrelevant segments then optimizes with auxiliary DPO to improve accuracy-efficiency trade-off on math benchmarks.
SR²AM achieves competitive Pass@1 accuracy on diverse tasks with 25.8-95.3% fewer reasoning tokens than much larger models by using self-regulated simulative planning trained via supervised learning and RL.
BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.
ICR creates a virtual shorter distribution from shortest correct on-policy responses to regularize RL post-training toward concise yet accurate reasoning, improving the accuracy-length Pareto frontier on math and knowledge benchmarks.
A large model generates a compact reasoning signal that a small model uses to solve tasks, reducing the large model's output tokens by up to 60% on benchmarks like AIME and GPQA.
Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.
rePIRL learns effective process reward models for LLM reasoning via a dual policy-PRM update process inspired by inverse RL, unifying online and offline methods with reported gains over prior approaches on math and coding datasets.
Group Causal Counterfactual Policy Optimization trains LLMs on generalizable reasoning by defining episodic rewards for counterfactual robustness and transferability then optimizing the policy with token-level advantages.
GDPO decouples per-reward normalization in multi-reward RL to avoid advantage collapse and improve convergence over GRPO on tool-calling, math, and coding tasks.
TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.
SLAT applies segment-level adaptive trimming in RL to reduce CoT reasoning length by 50% while maintaining competitive accuracy on benchmarks.
TRACE aggregates answer consistency and confidence trajectory over multiple reasoning steps to decide when to halt inference, reducing token usage by 25-30% while keeping accuracy within 1-2% of full reasoning.
SWE-AGILE introduces a Dynamic Reasoning Context with sliding windows of detailed steps and compressed Reasoning Digests to enable efficient long-horizon reasoning in software engineering agents, claiming new benchmark results on SWE-Bench-Verified for 7B-8B models.
Self-aligned reward uses relative perplexity differences to encourage concise, query-specific reasoning in LLMs, yielding 4% accuracy gains and 30% lower inference cost when added to PPO or GRPO.
LCPO reduces average LRM output length by over 50% across benchmarks via targeted preference optimization while preserving reasoning performance.
A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
citing papers explorer
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.