Training language models to reason efficiently

Daman Arora, Andrea Zanette · 2025 · arXiv 2502.04463

Mixed citation behavior. Most common role is background (67%).

28 Pith papers citing it

Background 67% of classified citations

citation-role summary

background: 6 · baseline: 2 · method: 1

citation-polarity summary

background: 6 · baseline: 2 · use method: 1

representative citing papers

LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

LEAD uses online adaptive mechanisms, including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts, to achieve higher accuracy-efficiency scores than base models with substantially shorter reasoning outputs on math benchmarks.

Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

cs.AI · 2026-05-07 · conditional · novelty 7.0

Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs

cs.AI · 2025-05-25 · unverdicted · novelty 7.0

UniR is a composable reasoning module trained with verifiable rewards and added to frozen LLMs via logit summation, enabling modular composition and weak-to-strong generalization across tasks and model sizes.

L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

cs.CL · 2025-03-06 · unverdicted · novelty 7.0

LCPO trains L1 reasoning models to adhere to prompt-specified CoT lengths, supporting accuracy-compute trade-offs and yielding short reasoning models that outperform larger baselines at matched lengths.

DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling

cs.AI · 2026-06-05 · unverdicted · novelty 6.0

DyCon dynamically controls reasoning depth in large reasoning models (LRMs) by modeling evolving difficulty from step-level embeddings, reducing redundant steps across multiple benchmarks.

ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

cs.AI · 2026-06-02 · unverdicted · novelty 6.0

ThoughtFold applies introspective redundancy detection within correct CoT trajectories to create sub-trajectory spectra, then uses masked preference optimization to penalize redundant explorations, yielding 56% token reduction on DeepSeek-R1-Distill-Qwen-7B while preserving accuracy.

CLORE: Content-Level Optimization for Reasoning Efficiency

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

CLORE augments correct on-policy rollouts by deleting repetitive and irrelevant segments, then optimizes with an auxiliary DPO objective to improve the accuracy-efficiency trade-off on math benchmarks.

Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

SR²AM achieves competitive Pass@1 accuracy on diverse tasks with 25.8-95.3% fewer reasoning tokens than much larger models by using self-regulated simulative planning trained via supervised learning and RL.

Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.

Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

ICR creates a virtual shorter distribution from shortest correct on-policy responses to regularize RL post-training toward concise yet accurate reasoning, improving the accuracy-length Pareto frontier on math and knowledge benchmarks.

When Less is Enough: Efficient Inference via Collaborative Reasoning

cs.LG · 2026-05-01 · conditional · novelty 6.0

A large model generates a compact reasoning signal that a small model uses to solve tasks, reducing the large model's output tokens by up to 60% on benchmarks like AIME and GPQA.

How Much Thinking is Enough? Quantifying and Understanding Redundancy in LLM Reasoning

cs.AI · 2026-04-21 · conditional · novelty 6.0

Across four frontier reasoning models, 61–93% of correct chain-of-thought steps are redundant, and this over-thinking is provably optimal under any length-agnostic outcome reward.

One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

LightThinker++: From Reasoning Compression to Memory Management

cs.CL · 2026-04-04 · unverdicted · novelty 6.0

LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.

rePIRL: Learn PRM with Inverse RL for LLM Reasoning

cs.LG · 2026-02-08 · unverdicted · novelty 6.0

rePIRL learns effective process reward models for LLM reasoning via a dual policy-PRM update process inspired by inverse RL, unifying online and offline methods with reported gains over prior approaches on math and coding datasets.

Towards Generalizable Reasoning: Group Causal Counterfactual Policy Optimization for LLM Reasoning

cs.LG · 2026-02-06 · unverdicted · novelty 6.0

Group Causal Counterfactual Policy Optimization trains LLMs for generalizable reasoning by defining episodic rewards for counterfactual robustness and transferability, then optimizing the policy with token-level advantages.

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

cs.CL · 2026-01-08 · unverdicted · novelty 6.0

GDPO decouples per-reward normalization in multi-reward RL to avoid advantage collapse and improve convergence over GRPO on tool-calling, math, and coding tasks.

Trust Region On-Policy Distillation

cs.LG · 2026-05-31 · unverdicted · novelty 5.0

TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.

SLAT: Segment-Level Adaptive Trimming for Efficient CoT Reasoning

cs.AI · 2026-05-29 · unverdicted · novelty 5.0

SLAT applies segment-level adaptive trimming in RL to reduce CoT reasoning length by 50% while maintaining competitive accuracy on benchmarks.

Efficient Test-Time Scaling via Temporal Reasoning Aggregation

cs.AI · 2026-04-19 · unverdicted · novelty 5.0

TRACE aggregates answer consistency and confidence trajectory over multiple reasoning steps to decide when to halt inference, reducing token usage by 25-30% while keeping accuracy within 1-2% of full reasoning.

SWE-AGILE: A Software Agent Framework for Efficiently Managing Dynamic Reasoning Context

cs.AI · 2026-04-13 · unverdicted · novelty 5.0

SWE-AGILE introduces a Dynamic Reasoning Context with sliding windows of detailed steps and compressed Reasoning Digests to enable efficient long-horizon reasoning in software engineering agents, claiming new benchmark results on SWE-Bench-Verified for 7B-8B models.

Self-Aligned Reward: Towards Effective and Efficient Reasoners

cs.LG · 2025-09-05 · unverdicted · novelty 5.0

Self-aligned reward uses relative perplexity differences to encourage concise, query-specific reasoning in LLMs, yielding 4% accuracy gains and 30% lower inference cost when added to PPO or GRPO.

Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization

cs.AI · 2025-08-13 · unverdicted · novelty 5.0

LCPO reduces average LRM output length by over 50% across benchmarks via targeted preference optimization while preserving reasoning performance.

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

cs.CL · 2025-03-20 · accept · novelty 5.0

A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.

citing papers explorer

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

cs.AI · 2025-03-12 · unverdicted · none · ref 22

The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.