hub

L1: Controlling how long a reasoning model thinks with reinforcement learning

L1: Controlling how long a reasoning model thinks with reinforcement learning , author= · 2025 · arXiv 2503.04697

11 Pith papers cite this work. Polarity classification is still indexing.

11 Pith papers citing it

read on arXiv browse 11 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

other 1

citation-polarity summary

unclear 1

representative citing papers

Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

RDPO applies magnitude-aware quantile normalization and Mahalanobis whitening to decorrelate heterogeneous rewards in multi-objective RL, improving instruction following and writing quality on LongCat-Flash post-training while staying competitive on reasoning and coding.

LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning outputs than base models on math benchmarks.

Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

cs.AI · 2026-05-07 · conditional · novelty 7.0

Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

AI Achieves a Perfect LSAT Score

cs.AI · 2026-04-11 · unverdicted · novelty 7.0

Language models achieve a perfect LSAT score, with experiments showing that internal thinking phases and a fine-tuned process reward model are key to high performance on logical reasoning questions.

STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage 19-42% with little accuracy loss on math benchmarks.

Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.

Shorter, but Still Trustworthy? An Empirical Study of Chain-of-Thought Compression

cs.CL · 2026-04-05 · unverdicted · novelty 6.0

CoT compression frequently introduces trustworthiness regressions with method-specific degradation profiles; a proposed normalized efficiency score and alignment-aware DPO variant reduce length by 19.3% with smaller trustworthiness loss.

Reasoning Compression with Mixed-Policy Distillation

cs.AI · 2026-05-09 · unverdicted · novelty 5.0

Mixed-Policy Distillation transfers concise reasoning behavior from larger to smaller LLMs by having the teacher compress student-generated trajectories, cutting token usage up to 27% while raising benchmark scores.

NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning

cs.LG · 2026-05-06 · unverdicted · novelty 5.0

Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

cs.CL · 2025-03-20 · accept · novelty 5.0

A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

cs.AI · 2025-03-12 · unverdicted · novelty 5.0

The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

citing papers explorer

Showing 11 of 11 citing papers.

Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization cs.LG · 2026-05-13 · unverdicted · none · ref 5
RDPO applies magnitude-aware quantile normalization and Mahalanobis whitening to decorrelate heterogeneous rewards in multi-objective RL, improving instruction following and writing quality on LongCat-Flash post-training while staying competitive on reasoning and coding.
LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models cs.LG · 2026-05-10 · unverdicted · none · ref 14
LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning outputs than base models on math benchmarks.
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost cs.AI · 2026-05-07 · conditional · none · ref 66
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
AI Achieves a Perfect LSAT Score cs.AI · 2026-04-11 · unverdicted · none · ref 21
Language models achieve a perfect LSAT score, with experiments showing that internal thinking phases and a fine-tuned process reward model are key to high performance on logical reasoning questions.
STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes cs.CL · 2026-05-13 · unverdicted · none · ref 10
STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage 19-42% with little accuracy loss on math benchmarks.
Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning cs.AI · 2026-05-12 · unverdicted · none · ref 1
BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.
Shorter, but Still Trustworthy? An Empirical Study of Chain-of-Thought Compression cs.CL · 2026-04-05 · unverdicted · none · ref 1
CoT compression frequently introduces trustworthiness regressions with method-specific degradation profiles; a proposed normalized efficiency score and alignment-aware DPO variant reduce length by 19.3% with smaller trustworthiness loss.
Reasoning Compression with Mixed-Policy Distillation cs.AI · 2026-05-09 · unverdicted · none · ref 13
Mixed-Policy Distillation transfers concise reasoning behavior from larger to smaller LLMs by having the teacher compress student-generated trajectories, cutting token usage up to 27% while raising benchmark scores.
NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning cs.LG · 2026-05-06 · unverdicted · none · ref 95
Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models cs.CL · 2025-03-20 · accept · none · ref 2
A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models cs.AI · 2025-03-12 · unverdicted · none · ref 6
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

L1: Controlling how long a reasoning model thinks with reinforcement learning

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer