s1: Simple test-time scaling

Emmanuel Candès; Hannaneh Hajishirzi; Li Fei-Fei; Luke Zettlemoyer; Niklas Muennighoff; Percy Liang; Tatsunori Hashimoto; Weijia Shi; Xiang Lisa Li; Zitong Yang

arxiv: 2501.19393 · v3 · submitted 2025-01-31 · 💻 cs.CL · cs.AI · cs.LG

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto

Pith reviewed 2026-05-11 16:58 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords test-time scaling · reasoning traces · budget forcing · mathematical reasoning · supervised finetuning · o1 replication · competition math

The pith

A 32B model finetuned on 1,000 reasoning traces and equipped with budget forcing exceeds o1-preview on competition math by up to 27%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks the simplest way to achieve test-time scaling for stronger language model reasoning. It assembles s1K, a dataset of 1,000 questions paired with reasoning traces chosen for difficulty, diversity, and quality. Budget forcing is developed to control inference compute by either halting the model's thinking or appending 'Wait' to extend it and prompt self-correction. After supervised finetuning of Qwen2.5-32B-Instruct on s1K and applying budget forcing, the resulting s1-32B model surpasses o1-preview on MATH and AIME24. Further increases in test-time budget via repeated forcing raise performance from 50% to 57% on AIME24.

Core claim

Supervised finetuning of Qwen2.5-32B-Instruct on s1K, a dataset of 1,000 curated reasoning traces, combined with budget forcing during generation produces s1-32B, which exceeds o1-preview on competition math questions (MATH and AIME24) by up to 27%. Scaling the test-time budget with budget forcing then extrapolates beyond what s1-32B achieves without test-time intervention, raising AIME24 accuracy from 50% to 57%.

What carries the argument

Budget forcing: a decoding-time method that either forcefully terminates the model's thinking process or, when the model tries to end its thinking, appends 'Wait' (possibly several times), so as to allocate more test-time compute and encourage the model to double-check its reasoning steps.
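
The control loop behind budget forcing can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `END_OF_THINKING` delimiter, the `next_token` callable, and the toy model are all hypothetical stand-ins, and a real system would wrap an actual decoder.

```python
from typing import Callable, List

# Hypothetical end-of-thinking delimiter; the real token depends on the model.
END_OF_THINKING = "<end_of_thinking>"

# Hypothetical model interface: maps the context so far to the next token.
NextTokenFn = Callable[[List[str]], str]


def budget_forced_thinking(
    next_token: NextTokenFn,
    prompt: List[str],
    max_thinking_tokens: int,
    num_waits: int,
) -> List[str]:
    """Return a thinking trace produced under budget forcing."""
    trace: List[str] = []
    waits_left = num_waits
    # Leaving this loop because the budget ran out is the forced termination.
    while len(trace) < max_thinking_tokens:
        tok = next_token(prompt + trace)
        if tok == END_OF_THINKING:
            if waits_left == 0:
                break  # no extensions left: let the model stop thinking
            # Suppress the stop and append "Wait" to extend the thinking.
            trace.append("Wait")
            waits_left -= 1
        else:
            trace.append(tok)
    return trace


def toy_next_token(context: List[str]) -> str:
    """Toy stand-in for a model: tries to stop after three 'step' tokens."""
    recent = 0
    for tok in reversed(context):
        if tok != "step":
            break
        recent += 1
    return END_OF_THINKING if recent >= 3 else "step"


if __name__ == "__main__":
    print(budget_forced_thinking(toy_next_token, ["question"], 100, 2))
```

With the toy model, `num_waits=0` yields three thinking tokens, each extra 'Wait' buys another round of three, and a small `max_thinking_tokens` cuts the trace short regardless of what the model wants.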

Load-bearing premise

That appending 'Wait' genuinely improves the substance and accuracy of the reasoning chain rather than simply lengthening outputs without adding corrective value.

What would settle it

Run the finetuned model on AIME24 or MATH problems while allowing unconstrained longer generation without any 'Wait' appends and compare accuracy to the budget-forced versions; if the accuracy gains disappear, the claim that budget forcing adds value beyond length fails.
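
One way to score such a control is a paired, per-problem comparison. The sketch below assumes each condition has been reduced to a list of correct/incorrect flags over the same problems; the function name, the dictionary keys, and the placeholder flags are illustrative and are not results from the paper.

```python
from typing import Dict, Sequence


def compare_conditions(
    wait_forced: Sequence[bool],
    length_only: Sequence[bool],
) -> Dict[str, float]:
    """Compare per-problem correctness of two runs over the same problems."""
    if len(wait_forced) != len(length_only):
        raise ValueError("both runs must cover the same problems")
    n = len(wait_forced)
    # Problems that only one of the two conditions gets right.
    only_wait = sum(w and not l for w, l in zip(wait_forced, length_only))
    only_length = sum(l and not w for w, l in zip(wait_forced, length_only))
    return {
        "acc_wait_forced": sum(wait_forced) / n,
        "acc_length_only": sum(length_only) / n,
        "only_wait_forced_correct": only_wait,
        "only_length_only_correct": only_length,
    }


if __name__ == "__main__":
    # Placeholder flags for four imaginary problems, purely for illustration.
    print(compare_conditions([True, True, False, True], [True, False, False, True]))
```

If the two accuracies match and the "only one condition correct" counts are balanced, length alone would explain the gains; a consistent surplus for the 'Wait' condition would point to the intervention itself.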

Original abstract

Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

A minimal fine-tune on 1k curated traces plus the 'Wait' append trick produces competitive test-time scaling on math benchmarks, but the mechanism still needs tighter controls to separate it from plain length extension.

The paper's core result is that supervised fine-tuning Qwen2.5-32B-Instruct on a 1,000-example set of hard, diverse, high-quality reasoning traces, then applying budget forcing at inference, lets the resulting s1-32B beat o1-preview by up to 27% on MATH and AIME24, with further gains when more 'Wait' tokens are forced. That is the headline worth knowing: a simple, fully open recipe can match or exceed the closed model's reported math performance without any secret sauce beyond data selection and a continuation prompt.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces s1, a simple approach to test-time scaling for language models. It involves curating a 1,000-example dataset (s1K) of questions with high-quality reasoning traces selected for difficulty, diversity, and quality, validated via ablations. A technique called budget forcing is proposed to control test-time compute by either terminating the model's reasoning or appending the word 'Wait' to encourage further thinking and error correction. The Qwen2.5-32B-Instruct model is fine-tuned on s1K and combined with budget forcing to create s1-32B, which reportedly surpasses OpenAI's o1-preview on MATH and AIME24 benchmarks by up to 27%. Additionally, increasing the budget via more 'Wait' appends allows performance extrapolation on AIME24 from 50% to 57%. The work emphasizes simplicity and releases all components openly.

Significance. If the results hold under rigorous controls, this work is significant for providing an accessible, fully open-source demonstration of test-time scaling that achieves competitive or superior performance to closed models like o1-preview using only a small curated dataset and a lightweight inference intervention. Strengths include the emphasis on reproducibility (open model, data, code), ablations validating dataset criteria, and the empirical extrapolation result showing continued gains with increased test-time budget. It offers a practical baseline for the community studying reasoning in LLMs.

major comments (2)

[Budget Forcing section] The central claim that budget forcing enables effective test-time scaling rests on the assertion that appending 'Wait' 'often fixes incorrect reasoning steps.' However, the experiments lack a direct control comparing this mechanism to simply extending generation length via other means (e.g., raising the max token limit, using a generic 'continue' prompt, or sampling additional tokens without the specific intervention). Without isolating this, the reported gains (such as the 50% to 57% AIME24 extrapolation and the 27% lift over o1-preview) could be attributable to increased output length rather than a qualitative change in reasoning distribution. This is load-bearing for the paper's simplicity and effectiveness argument.

[Main Results section] The claim that s1-32B exceeds o1-preview by up to 27% on MATH and AIME24 requires more detail on the exact evaluation protocol for the o1-preview baseline, including the number of samples or test-time compute budget allocated to it (given o1's undisclosed internal scaling). Additionally, reporting variance across multiple runs or statistical tests for the improvements would help assess robustness, especially since the abstract-only view leaves ambiguity on whether gains are consistent or sensitive to specific choices.
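
To make the robustness concern concrete, the AIME24 percentages can be translated into problem counts. The sketch below assumes the standard 30-problem AIME24 set and that the reported percentages are rounded from whole-problem counts; the Wilson score interval is a rough, unpaired uncertainty estimate chosen here for illustration, not an analysis from the paper.

```python
import math
from typing import Tuple

# Assumption: AIME24 is the standard 30-problem set (AIME I and II, 15 each).
N_PROBLEMS = 30


def solved_count(accuracy_percent: float, n: int = N_PROBLEMS) -> int:
    """Whole number of solved problems closest to a reported percentage."""
    return round(accuracy_percent / 100 * n)


def wilson_interval(k: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion k/n (z = 1.96 for about 95%)."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    for pct in (50, 57):
        k = solved_count(pct)
        lo, hi = wilson_interval(k, N_PROBLEMS)
        print(f"{pct}% -> {k}/{N_PROBLEMS} solved, interval ({lo:.3f}, {hi:.3f})")
```

Under these assumptions the 50% to 57% gain corresponds to two additional problems (15 to 17 of 30), and 17/30 lies inside the interval around 15/30, which is why repeated runs or paired tests would help.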

minor comments (2)

[Abstract] The phrase 'up to 27%' is used without specifying the exact benchmark and condition achieving the maximum; adding this precision would improve clarity.

[Dataset Curation section] While ablations on difficulty, diversity, and quality are mentioned, expanding the main text or appendix with quantitative metrics used to measure each criterion (e.g., diversity scores) would enhance reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment point by point below and outline the changes we will make to strengthen the manuscript.

Referee: [Budget Forcing section] The central claim that budget forcing enables effective test-time scaling rests on the assertion that appending 'Wait' 'often fixes incorrect reasoning steps.' However, the experiments lack a direct control comparing this mechanism to simply extending generation length via other means (e.g., raising the max token limit, using a generic 'continue' prompt, or sampling additional tokens without the specific intervention). Without isolating this, the reported gains (such as the 50% to 57% AIME24 extrapolation and the 27% lift over o1-preview) could be attributable to increased output length rather than a qualitative change in reasoning distribution. This is load-bearing for the paper's simplicity and effectiveness argument.

Authors: We agree that a direct comparison isolating the 'Wait' intervention from generic increases in output length is necessary to support the central claim. While our results show consistent gains from budget forcing, we acknowledge that the current experiments do not fully rule out length as the sole factor. In the revised manuscript, we will add new ablation experiments comparing budget forcing to (1) simply raising the maximum token limit without any prompt and (2) using a generic continuation prompt such as 'continue'. These results will be reported in an expanded Budget Forcing section, including quantitative comparisons on AIME24 and MATH to demonstrate whether the specific mechanism produces qualitatively different reasoning behavior beyond length alone.
revision: yes

Referee: [Main Results section] The claim that s1-32B exceeds o1-preview by up to 27% on MATH and AIME24 requires more detail on the exact evaluation protocol for the o1-preview baseline, including the number of samples or test-time compute budget allocated to it (given o1's undisclosed internal scaling). Additionally, reporting variance across multiple runs or statistical tests for the improvements would help assess robustness, especially since the abstract-only view leaves ambiguity on whether gains are consistent or sensitive to specific choices.

Authors: We appreciate the request for greater transparency. For the o1-preview baseline, we accessed the model via the OpenAI API using its default configuration and standard test-time scaling behavior. Because o1-preview's internal scaling mechanisms are proprietary and undisclosed, we cannot report the precise compute budget allocated by OpenAI. In the revision, we will expand the Main Results section to fully document our evaluation protocol for both models, including API parameters (temperature, max tokens where applicable), the exact number of problems evaluated from each benchmark, and any sampling details. We will also explicitly note that s1-32B results are from single runs due to computational cost and will report performance consistency across varying budget-forcing levels as an indirect robustness check. Statistical tests will be added where multiple runs become feasible.
revision: partial

Circularity Check

0 steps flagged

No significant circularity in empirical claims

The paper reports an empirical pipeline: curation of s1K via ablations on difficulty/diversity/quality criteria, SFT of Qwen2.5-32B-Instruct, and application of budget forcing (append 'Wait' or terminate) whose effects are measured on MATH/AIME24. All performance numbers (e.g., 27% lift, 50% to 57% extrapolation) are observed outcomes, not quantities derived from the method definition or fitted parameters renamed as predictions. No equations, uniqueness theorems, or self-citations function as load-bearing premises that reduce the central result to its inputs by construction. The work is self-contained against external benchmarks with released data and code.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The method relies on empirical choices for dataset curation and the budget forcing heuristic, with no new theoretical entities.

free parameters (2)

s1K dataset size = 1,000
A deliberately small set, curated for difficulty, diversity, and quality.
Number of 'Wait' appends = multiple
Used to lengthen the thinking process.

axioms (1)

domain assumption: The Qwen2.5-32B-Instruct base model is suitable for SFT on reasoning traces.
Assumed without detailed justification in the abstract.

pith-pipeline@v0.9.0 · 5577 in / 1318 out tokens · 38457 ms · 2026-05-11T16:58:08.198671+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
cs.CL 2026-05 unverdicted novelty 8.0

REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits
cs.LG 2026-05 unverdicted novelty 8.0

Shared token budgets between visible chain-of-thought and answers create a coupling tax that makes non-thinking competitive on math benchmarks, with a truncation decomposition predicting the crossover and split budget...
GIANTS: Generative Insight Anticipation from Scientific Literature
cs.CL 2026-04 unverdicted novelty 8.0

GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.
RLCracker: Evaluating the Worst-Case Vulnerability of LLM Watermarks with Adaptive RL Attacks
cs.CR 2025-09 conditional novelty 8.0

RLCracker is a reinforcement learning attack that erases LLM watermarks at 98.5% success rate with minimal data and generalizes across ten schemes and multiple model sizes.
PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
cs.CL 2026-05 unverdicted novelty 7.0

PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic onlin...
Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation
cs.LG 2026-05 unverdicted novelty 7.0

Decoupling prefix source from token-level KL direction in autoregressive sequence KL yields four objectives unifying SFT, DAgger, offline RL and OPD, with KL mixing and entropy-gated curriculum improving math reasonin...
Multi-Rollout On-Policy Distillation via Peer Successes and Failures
cs.LG 2026-05 unverdicted novelty 7.0

MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.
DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
cs.LG 2026-05 unverdicted novelty 7.0

DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
cs.AI 2026-05 unverdicted novelty 7.0

RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
cs.AI 2026-05 unverdicted novelty 7.0

RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference
cs.SE 2026-05 unverdicted novelty 7.0

POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.
How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum
cs.LG 2026-04 unverdicted novelty 7.0

A single-parameter Tsallis loss continuum unifies SFT and RLVR, derives time-to-escape bounds for cold start, and yields GARL and PAFT estimators that improve performance on QA reasoning tasks.
Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning
cs.LG 2026-04 unverdicted novelty 7.0

A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and common...
Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning
cs.CL 2026-04 unverdicted novelty 7.0

CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.
Towards Unconstrained Human-Object Interaction
cs.CV 2026-04 unverdicted novelty 7.0

Introduces the U-HOI task and shows MLLMs plus a language-to-graph pipeline can handle human-object interactions without any predefined vocabulary at training or inference time.
AI Achieves a Perfect LSAT Score
cs.AI 2026-04 unverdicted novelty 7.0

Language models achieve a perfect LSAT score, with experiments showing that internal thinking phases and a fine-tuned process reward model are key to high performance on logical reasoning questions.
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
cs.LG 2026-01 unverdicted novelty 7.0

A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...
Reliable LLM-Based Edge-Cloud-Expert Cascades for Telecom Knowledge Systems
eess.SP 2025-12 accept novelty 7.0

An edge-cloud-expert LLM cascade for telecom knowledge systems minimizes processing cost subject to misalignment-risk bounds via multiple hypothesis testing on knowledge and confidence scores.
PIAST: Rapid Prompting with In-context Augmentation for Scarce Training data
cs.CL 2025-12 conditional novelty 7.0

PIAST iteratively optimizes few-shot examples in prompts via Monte Carlo Shapley value estimation, outperforming prior automatic prompting methods and setting new SOTA on classification, simplification, and GSM8K with...
User-Assistant Bias in LLMs
cs.CL 2025-08 unverdicted novelty 7.0

LLMs show strong user bias in role-tagged contexts that is amplified by preference alignment and can be reduced or controlled through targeted fine-tuning and DPO.
Bayesian Social Deduction with Graph-Informed Language Models
cs.AI 2025-06 unverdicted novelty 7.0

Hybrid Bayesian-graph LLM agent reaches competitive performance against large models and achieves 67% win rate against humans in controlled Avalon play, outperforming baselines and human teammates.
Learning a Continue-Thinking Token for Enhanced Test-Time Scaling
cs.CL 2025-06 unverdicted novelty 7.0

A learned continue-thinking token, trained via RL on its embedding alone, improves math benchmark accuracy more than fixed-token budget forcing in a frozen language model.
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
cs.CV 2025-05 unverdicted novelty 7.0

DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.
L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning
cs.CL 2025-03 unverdicted novelty 7.0

LCPO trains L1 reasoning models to adhere to prompt-specified CoT lengths, supporting accuracy-compute trade-offs and yielding short reasoning models that outperform larger baselines at matched lengths.
PathCal: State-Aware Reflection-Marker Calibration for Efficient Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

PathCal calibrates reasoning paths by type-aware soft rebalancing of reflection-marker logits at uncertain states, yielding better efficiency-performance trade-offs on six benchmarks.
Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

CES applies conditional bidirectional entropy control on top of DAPO to improve accuracy and shorten responses on mathematical benchmarks for 7B and 1.5B LLMs.
Are Rationales Necessary and Sufficient? Tuning LLMs for Explainable Misinformation Detection
cs.CL 2026-05 unverdicted novelty 6.0

LONSREX introduces a metric-based pipeline to identify necessary and sufficient rationales when creating training data for fine-tuning LLMs on explainable misinformation detection, addressing limitations of naive labe...
Self-Supervised On-Policy Distillation for Reasoning Language Models
cs.LG 2026-05 unverdicted novelty 6.0

SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIM...
LLMs Know When They Know, but Do Not Act on It: A Metacognitive Harness for Test-time Scaling
cs.LG 2026-05 conditional novelty 6.0

A metacognitive harness uses LLMs' pre- and post-solution self-monitoring signals to control test-time reasoning, raising pooled accuracy from 48.3% to 56.9% on text, code, and multimodal benchmarks.
Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting
cs.CL 2026-05 unverdicted novelty 6.0

Covariance-weighted GRPO with Gaussian-kernel reweighting tames extreme tokens to stabilize training and boost reasoning performance over standard GRPO.
Evaluating the False Trust Engendered by LLM Explanations
cs.HC 2026-05 unverdicted novelty 6.0

A user study finds that LLM reasoning traces and post-hoc explanations create false trust by increasing acceptance of incorrect answers, whereas contrastive dual explanations improve users' ability to detect errors.
AIPO: Learning to Reason from Active Interaction
cs.CL 2026-05 unverdicted novelty 6.0

AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...
AIPO: Learning to Reason from Active Interaction
cs.CL 2026-05 unverdicted novelty 6.0

AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.
Revisiting Transformer Layer Parameterization Through Causal Energy Minimization
cs.LG 2026-05 unverdicted novelty 6.0

CEM recasts Transformer layers as energy minimization steps, enabling constrained parameterizations like weight sharing and low-rank interactions that match standard baselines in 100M-scale language modeling.
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
cs.AI 2026-05 unverdicted novelty 6.0

RL training compute for logical reasoning follows a power law with horizon depth whose exponent rises with logical expressiveness, yielding better downstream transfer when models train on richer logics.
Gradient Extrapolation-Based Policy Optimization
cs.LG 2026-05 unverdicted novelty 6.0

GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering u...
VIDA: A dataset for Visually Dependent Ambiguity in Multimodal Machine Translation
cs.CL 2026-05 unverdicted novelty 6.0

VIDA provides 2,500 visually-dependent ambiguous MT instances and LLM-judge metrics; chain-of-thought SFT improves disambiguation accuracy over standard SFT, especially out-of-distribution.
VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model
cs.RO 2026-05 unverdicted novelty 6.0

VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.
How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum
cs.LG 2026-04 unverdicted novelty 6.0

Tsallis q-loss continuum enables faster cold-start escape in reasoning model training via probability-based gradient amplification, with practical Monte Carlo estimators GARL and PAFT.
Process Supervision via Verbal Critique Improves Reasoning in Large Language Models
cs.CL 2026-04 unverdicted novelty 6.0

Verbal Process Supervision uses structured critiques from stronger models in an iterative loop to improve LLM reasoning, reaching 94.9% on GPQA Diamond and large gains on AIME 2025.
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
cs.AI 2026-04 unverdicted novelty 6.0

HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
Understanding the Mechanism of Altruism in Large Language Models
econ.GN 2026-04 unverdicted novelty 6.0

A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.
Reasoning Structure Matters for Safety Alignment of Reasoning Models
cs.AI 2026-04 unverdicted novelty 6.0

Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.
Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data
cs.LG 2026-04 unverdicted novelty 6.0

A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.
Characterizing Model-Native Skills
cs.AI 2026-04 conditional novelty 6.0

Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...
When More Thinking Hurts: Overthinking in LLM Test-Time Compute Scaling
cs.AI 2026-04 unverdicted novelty 6.0

Extended reasoning in LLMs exhibits overthinking and diminishing returns, with optimal thinking length varying by problem difficulty, allowing significant compute savings by stopping at moderate budgets.
On the Step Length Confounding in LLM Reasoning Data Selection
cs.CL 2026-04 unverdicted novelty 6.0

Average log probability selection for LLM reasoning datasets is confounded by step length because longer steps dilute low-probability first tokens; ASLEC-DROP and ASLEC-CASL remove this bias.
Procedural Knowledge at Scale Improves Reasoning
cs.CL 2026-04 unverdicted novelty 6.0

Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks b...
Ensemble-Based Uncertainty Estimation for Code Correctness Estimation
cs.SE 2026-03 unverdicted novelty 6.0

Ensemble Semantic Entropy improves correlation with code correctness over single-model methods and powers a cascading scaling system that cuts FLOPs by 64.9% while preserving performance on LiveCodeBench.
TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning
cs.LG 2026-03 unverdicted novelty 6.0

Terminator learns to predict optimal early-exit points in chain-of-thought reasoning by training on the first positions where the model emits its final answer, yielding 14-55% shorter outputs with no accuracy loss.
rePIRL: Learn PRM with Inverse RL for LLM Reasoning
cs.LG 2026-02 unverdicted novelty 6.0

rePIRL learns effective process reward models for LLM reasoning via a dual policy-PRM update process inspired by inverse RL, unifying online and offline methods with reported gains over prior approaches on math and co...
Diffusion-State Policy Optimization for Masked Diffusion Language Models
cs.CL 2026-02 unverdicted novelty 6.0

DiSPO optimizes intermediate decisions in masked diffusion LMs by branching at selected masked states, resampling tokens, scoring completions, and updating only new tokens using a derived policy-gradient estimator tha...
Diffusion-State Policy Optimization for Masked Diffusion Language Models
cs.CL 2026-02 unverdicted novelty 6.0

DiSPO is a plug-in credit-assignment method for masked diffusion LMs that optimizes intermediate filling decisions via branched completions from rollout-cached logits.
Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment
cs.CL 2026-01 conditional novelty 6.0

Rank-Surprisal Ratio (RSR) correlates strongly (average Spearman 0.86) with post-distillation reasoning gains across five student models and trajectories from eleven teachers, outperforming existing selection metrics.
Training-Trajectory-Aware Token Selection
cs.CL 2026-01 unverdicted novelty 6.0

Training-Trajectory-Aware Token Selection (T3S) reconstructs the token-level training objective to overcome a performance bottleneck in continual distillation of reasoning capabilities from large to small language models.
Asynchronous Reasoning: Training-Free Interactive Thinking LLMs
cs.LG 2025-12 unverdicted novelty 6.0

Using properties of positional embeddings, reasoning LLMs can be made to think, listen, and generate outputs asynchronously without any additional training, cutting time to first token to under 5 seconds.
Can Aha Moments Be Fake? Towards Quantifying Decorative and True Thinking in Chain-of-Thought
cs.LG 2025-10 unverdicted novelty 6.0

LLMs interleave true causal reasoning steps with decorative ones in CoT, with only ~2.3% of steps having high causal impact on AIME for Qwen-2.5, and a steering direction can force internal use of specific steps.
RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling
cs.CV 2025-10 unverdicted novelty 6.0

RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.
Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
cs.AI 2025-10 unverdicted novelty 6.0

A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.
The Signal is in the Steps: Local Scoring for Reasoning Data Selection
cs.LG 2025-10 unverdicted novelty 6.0

LALP scores local reasoning steps rather than full trajectories to improve selection of training data from diverse teacher models for distilling long-form reasoning.