hub Mixed citations

L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

Pranjal Aggarwal, Sean Welleck · 2025 · cs.CL · arXiv 2503.04697

Mixed citation behavior. Most common role is background (62%).

38 Pith papers citing it

Background 62% of classified citations

open full Pith review browse 38 citing papers arXiv PDF

abstract

Reasoning language models have shown an uncanny ability to improve performance at test-time by ``thinking longer''-that is, by generating longer chain-of-thought sequences and hence using more compute. However, the length of their chain-of-thought reasoning is not controllable, making it impossible to allocate test-time compute to achieve a desired level of performance. We introduce Length Controlled Policy Optimization (LCPO), a simple reinforcement learning method that optimizes for accuracy and adherence to user-specified length constraints. We use LCPO to train L1, a reasoning language model that produces outputs satisfying a length constraint given in its prompt. L1's length control allows for smoothly trading off computational cost and accuracy on a wide range of tasks, and outperforms the state-of-the-art S1 method for length control. Furthermore, we uncover an unexpected short chain-of-thought capability in models trained with LCPO. Specifically, using LCPO we derive Short Reasoning Models (SRMs), that exhibit similar reasoning patterns as full-length reasoning models, but can generate CoT lengths comparable to non-reasoning models. They demonstrate significant performance gains, for instance, our 1.5B L1 model surpasses GPT-4o at equal reasoning lengths. Overall, LCPO enables precise control over reasoning length, allowing for fine-grained allocation of test-time compute and accuracy. We release code and models at https://www.cmu-l3.github.io/l1

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5 baseline 1 method 1 other 1

citation-polarity summary

background 5 baseline 1 unclear 1 use method 1

representative citing papers

Rethinking the Role of Positional Encoding: Sliding-Window Transformers without PE Remain Turing Complete

cs.LG · 2026-06-01 · unverdicted · novelty 8.0

Sliding-window transformers without positional encodings are Turing complete because the sliding window breaks permutation symmetry and suffices to simulate Post machines via a constant-size histogram state.

Selective Latent Thinking: Adaptive Compression of LLM Reasoning Chains

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

SLT selectively compresses reasoning spans via anticipation and gating, trained in three stages including RL, yielding 22.7% higher accuracy than uniform latent baselines at similar compression and 58.4% shorter chains with 2.8% accuracy drop vs explicit CoT on math benchmarks.

Harmony in Diversity: Multi-domain Contrastive Policy Optimization for Large Reasoning Models

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

MCPO applies contrastive learning to GRPO-style RL by treating cross-domain correct rollouts as positives and incorrect ones as negatives to improve multi-domain reasoning performance in LRMs.

Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

RDPO applies magnitude-aware quantile normalization and Mahalanobis whitening to decorrelate heterogeneous rewards in multi-objective RL, improving instruction following and writing quality on LongCat-Flash post-training while staying competitive on reasoning and coding.

LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning outputs than base models on math benchmarks.

Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

cs.AI · 2026-05-07 · conditional · novelty 7.0

Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

AI Achieves a Perfect LSAT Score

cs.AI · 2026-04-11 · unverdicted · novelty 7.0

Language models achieve a perfect LSAT score, with experiments showing that internal thinking phases and a fine-tuned process reward model are key to high performance on logical reasoning questions.

TiCo: Time-Controllable Spoken Dialogue Model

cs.CL · 2026-03-23 · unverdicted · novelty 7.0

TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.

TIME: Temporally Intelligent Meta-reasoning Engine for Context-Triggered Explicit Reasoning

cs.LG · 2026-01-08 · unverdicted · novelty 7.0

TIME trains LLMs to trigger compact, context-triggered reasoning via time tags and tick events, improving TIMEBench scores while cutting explicit reasoning tokens by an order of magnitude.

Finding the Time to Think: Learning Planning Budgets in Real-Time RL

cs.LG · 2026-06-24 · unverdicted · novelty 6.0

Trains a gating policy to select state-dependent planning budgets in variable-delay real-time RL, outperforming fixed-budget and heuristic baselines across Pac-Man, Tetris, Snake, Speed Hex, and Speed Go.

ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

cs.AI · 2026-06-02 · unverdicted · novelty 6.0

ThoughtFold applies introspective redundancy detection within correct CoT trajectories to create sub-trajectory spectra, then uses masked preference optimization to penalize redundant explorations, yielding 56% token reduction on DeepSeek-R1-Distill-Qwen-7B while preserving accuracy.

Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

RL-trained lightweight controller using answer statistics improves trade-offs among correctness, latency, and total samples in adaptive sampling for LLM test-time scaling.

Adaptive Latent Agentic Reasoning

cs.CL · 2026-06-01 · unverdicted · novelty 6.0

ALAR trains LLM agents to perform most reasoning in a latent space supervised by actions and escalates to explicit CoT only when needed, cutting tokens by up to 84.6% while preserving accuracy on search and tool-use benchmarks.

CLORE: Content-Level Optimization for Reasoning Efficiency

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

CLORE augments correct on-policy rollouts by deleting repetitive and irrelevant segments then optimizes with auxiliary DPO to improve accuracy-efficiency trade-off on math benchmarks.

Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

SR²AM achieves competitive Pass@1 accuracy on diverse tasks with 25.8-95.3% fewer reasoning tokens than much larger models by using self-regulated simulative planning trained via supervised learning and RL.

Mem-$\pi$: Adaptive Memory through Learning When and What to Generate

cs.CL · 2026-05-20 · unverdicted · novelty 6.0

Mem-π is a framework using a dedicated model and decision-content decoupled RL to generate context-specific guidance on demand for LLM agents, outperforming retrieval baselines by over 30% on web navigation.

Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

CES applies conditional bidirectional entropy control on top of DAPO to improve accuracy and shorten responses on mathematical benchmarks for 7B and 1.5B LLMs.

STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage 19-42% with little accuracy loss on math benchmarks.

Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.

Shorter, but Still Trustworthy? An Empirical Study of Chain-of-Thought Compression

cs.CL · 2026-04-05 · unverdicted · novelty 6.0

CoT compression frequently introduces trustworthiness regressions with method-specific degradation profiles; a proposed normalized efficiency score and alignment-aware DPO variant reduce length by 19.3% with smaller trustworthiness loss.

CRISP: Compressed Reasoning via Iterative Self-Policy Distillation

cs.LG · 2026-03-05 · conditional · novelty 6.0

CRISP achieves 57-59% token reduction on MATH-500 with 9-16 point accuracy gains on Qwen3 models via iterative self-distillation of concise reasoning behavior.

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

cs.CL · 2026-01-08 · unverdicted · novelty 6.0

GDPO decouples per-reward normalization in multi-reward RL to avoid advantage collapse and improve convergence over GRPO on tool-calling, math, and coding tasks.

UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning

cs.AI · 2025-03-27 · accept · novelty 6.0

UI-R1 shows rule-based RL with GRPO on 136 GUI tasks improves a 3B MLLM's action prediction accuracy by 6-22% over its base model and matches larger SFT-trained models.

CAT: Confidence-Adaptive Thinking for Efficient Reasoning of Large Reasoning Models

cs.CL · 2026-07-01 · unverdicted · novelty 5.0

CAT uses intrinsic confidence signals in preference optimization to adapt reasoning length in LRMs, outperforming uniform compression baselines on accuracy across benchmarks.

citing papers explorer

Showing 1 of 1 citing paper after filters.

A Survey of Reinforcement Learning for Large Reasoning Models cs.CL · 2025-09-10 · accept · none · ref 6 · internal anchor
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.

L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer