LEAD uses online adaptive mechanisms, including Potential-Scaled Instability and symmetric efficiency rewards computed from correct rollouts, to achieve higher accuracy-efficiency scores with substantially shorter reasoning outputs than base models on math benchmarks.
ShorterBetter: Guiding Reasoning Models to Find Optimal Inference Length for Efficient Reasoning
4 Pith papers cite this work.
4 representative citing papers
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage 19-42% with little accuracy loss on math benchmarks.
Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training
ICR creates a virtual shorter distribution from shortest correct on-policy responses to regularize RL post-training toward concise yet accurate reasoning, improving the accuracy-length Pareto frontier on math and knowledge benchmarks.