s1: Simple test-time scaling
Pith reviewed 2026-05-11 16:58 UTC · model grok-4.3
The pith
A 32B model finetuned on 1,000 reasoning traces and equipped with budget forcing exceeds o1-preview on competition math by up to 27%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Supervised finetuning of Qwen2.5-32B-Instruct on the s1K dataset of 1,000 curated reasoning traces, combined with budget forcing during generation, produces s1-32B, which exceeds o1-preview on competition math questions (MATH and AIME24) by up to 27%. Scaling the test-time budget with budget forcing further extrapolates beyond s1-32B's own performance without test-time intervention, from 50% to 57% on AIME24.
What carries the argument
Budget forcing: a decoding-time method that either forcefully terminates the model's thinking process or lengthens it by appending 'Wait' to the generation whenever the model tries to end, thereby allocating more test-time compute and encouraging the model to double-check its reasoning steps.
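The control flow of budget forcing as described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `StepFn` interface, the `budget_force` function, the whitespace-level "tokens", and the `toy_step` stand-in model are all assumptions introduced here for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical interface (an assumption, not the paper's API): given the context
# so far and a token allowance, the model returns newly generated thinking tokens
# and a flag saying whether it tried to end its thinking on its own.
StepFn = Callable[[List[str], int], Tuple[List[str], bool]]


def budget_force(step: StepFn, prompt: List[str], max_tokens: int,
                 num_waits: int, wait_token: str = "Wait") -> List[str]:
    """Return thinking tokens, capped at max_tokens, extended by up to num_waits 'Wait's."""
    thinking: List[str] = []
    waits_left = num_waits
    while len(thinking) < max_tokens:
        remaining = max_tokens - len(thinking)
        new_tokens, wants_stop = step(prompt + thinking, remaining)
        thinking.extend(new_tokens[:remaining])
        if not wants_stop:
            # The model used up its allowance: thinking is forcefully terminated.
            break
        if waits_left == 0 or len(thinking) + 1 >= max_tokens:
            # No 'Wait' appends left, or no room to continue after one: accept the stop.
            break
        # The model tried to end: suppress the stop and append 'Wait' instead.
        thinking.append(wait_token)
        waits_left -= 1
    return thinking


def toy_step(context: List[str], allowance: int) -> Tuple[List[str], bool]:
    """Stand-in model: emits two 'step' tokens, then tries to stop."""
    out = ["step", "step"][:allowance]
    return out, len(out) == 2


print(budget_force(toy_step, ["Q"], max_tokens=100, num_waits=2))
```

With a real model, `step` would wrap a call to the inference engine that stops at the end-of-thinking delimiter; the final answer would then be generated from the returned thinking trace.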
Load-bearing premise
That appending 'Wait' genuinely improves the substance and accuracy of the reasoning chain rather than simply lengthening outputs without adding corrective value.
What would settle it
Run the finetuned model on AIME24 or MATH problems while allowing unconstrained longer generation without any 'Wait' appends and compare accuracy to the budget-forced versions; if the accuracy gains disappear, the claim that budget forcing adds value beyond length fails.
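One way to make this comparison concrete is a paired test on per-problem correctness, since both conditions are run on the same problems. The sketch below uses an exact two-sided sign test on the discordant problems; the function name `paired_sign_test` and the correctness vectors are hypothetical placeholders, not results from the paper.

```python
from math import comb
from typing import Sequence, Tuple


def paired_sign_test(a: Sequence[bool], b: Sequence[bool]) -> Tuple[int, int, float]:
    """Exact two-sided sign test on problems where exactly one condition is correct.

    Returns (a_only, b_only, p_value), where a_only counts problems solved only
    under condition a and b_only counts problems solved only under condition b.
    """
    if len(a) != len(b):
        raise ValueError("both conditions must cover the same problems")
    a_only = sum(1 for x, y in zip(a, b) if x and not y)
    b_only = sum(1 for x, y in zip(a, b) if y and not x)
    n = a_only + b_only
    if n == 0:
        return a_only, b_only, 1.0
    k = min(a_only, b_only)
    # Probability of a split at least this lopsided under a fair coin, doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)


# Hypothetical placeholder data for 30 problems (NOT measured results):
# budget forcing solves 17, unconstrained longer generation solves 15.
budget_forced = [True] * 17 + [False] * 13
unconstrained = [True] * 14 + [False] * 3 + [True] + [False] * 12

print(paired_sign_test(budget_forced, unconstrained))
```

In this made-up example only four problems differ between the conditions, so the test has very little power; a real comparison would likely need more problems or repeated sampling per problem.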
Original abstract
Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces s1, a simple approach to test-time scaling for language models. It involves curating a 1,000-example dataset (s1K) of questions with high-quality reasoning traces selected for difficulty, diversity, and quality, validated via ablations. A technique called budget forcing is proposed to control test-time compute by either terminating the model's reasoning or appending the word 'Wait' to encourage further thinking and error correction. The Qwen2.5-32B-Instruct model is fine-tuned on s1K and combined with budget forcing to create s1-32B, which reportedly surpasses OpenAI's o1-preview on MATH and AIME24 benchmarks by up to 27%. Additionally, increasing the budget via more 'Wait' appends allows performance extrapolation on AIME24 from 50% to 57%. The work emphasizes simplicity and releases all components openly.
Significance. If the results hold under rigorous controls, this work is significant for providing an accessible, fully open-source demonstration of test-time scaling that achieves competitive or superior performance to closed models like o1-preview using only a small curated dataset and a lightweight inference intervention. Strengths include the emphasis on reproducibility (open model, data, code), ablations validating dataset criteria, and the empirical extrapolation result showing continued gains with increased test-time budget. It offers a practical baseline for the community studying reasoning in LLMs.
major comments (2)
- [Budget Forcing section] The central claim that budget forcing enables effective test-time scaling rests on the assertion that appending 'Wait' 'often fixes incorrect reasoning steps.' However, the experiments lack a direct control comparing this mechanism to simply extending generation length via other means (e.g., raising the max token limit, using a generic 'continue' prompt, or sampling additional tokens without the specific intervention). Without isolating this, the reported gains (such as the 50% to 57% AIME24 extrapolation and the 27% lift over o1-preview) could be attributable to increased output length rather than a qualitative change in reasoning distribution. This is load-bearing for the paper's simplicity and effectiveness argument.
- [Main Results section] The claim that s1-32B exceeds o1-preview by up to 27% on MATH and AIME24 requires more detail on the exact evaluation protocol for the o1-preview baseline, including the number of samples or test-time compute budget allocated to it (given o1's undisclosed internal scaling). Additionally, reporting variance across multiple runs or statistical tests for the improvements would help assess robustness, especially since the abstract-only view leaves ambiguity on whether gains are consistent or sensitive to specific choices.
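To illustrate why variance reporting matters at this benchmark size, the sketch below computes 95% Wilson score intervals for the two AIME24 accuracies. It assumes AIME24 comprises 30 problems and that the reported 50% and 57% correspond to 15 and 17 correct answers; both are assumptions made here, not figures stated in the abstract.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    p = correct / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half, center + half


# Assumed counts: 15/30 without test-time intervention, 17/30 with budget forcing.
baseline = wilson_interval(15, 30)
forced = wilson_interval(17, 30)

print(f"baseline: {baseline[0]:.3f} to {baseline[1]:.3f}")
print(f"forced:   {forced[0]:.3f} to {forced[1]:.3f}")
print("intervals overlap:", forced[0] < baseline[1])
```

Under these assumptions a two-problem difference sits well inside single-run sampling noise, which is consistent with the referee's request for multiple runs or statistical tests.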
minor comments (2)
- [Abstract] The phrase 'up to 27%' is used without specifying the exact benchmark and condition achieving the maximum; adding this precision would improve clarity.
- [Dataset Curation section] While ablations on difficulty, diversity, and quality are mentioned, expanding the main text or appendix with quantitative metrics used to measure each criterion (e.g., diversity scores) would enhance reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. We address each major comment point by point below and outline the changes we will make to strengthen the manuscript.
- Referee: [Budget Forcing section] The central claim that budget forcing enables effective test-time scaling rests on the assertion that appending 'Wait' 'often fixes incorrect reasoning steps.' However, the experiments lack a direct control comparing this mechanism to simply extending generation length via other means (e.g., raising the max token limit, using a generic 'continue' prompt, or sampling additional tokens without the specific intervention). Without isolating this, the reported gains (such as the 50% to 57% AIME24 extrapolation and the 27% lift over o1-preview) could be attributable to increased output length rather than a qualitative change in reasoning distribution. This is load-bearing for the paper's simplicity and effectiveness argument.
Authors: We agree that a direct comparison isolating the 'Wait' intervention from generic increases in output length is necessary to support the central claim. While our results show consistent gains from budget forcing, we acknowledge that the current experiments do not fully rule out length as the sole factor. In the revised manuscript, we will add new ablation experiments comparing budget forcing to (1) simply raising the maximum token limit without any prompt and (2) using a generic continuation prompt such as 'continue'. These results will be reported in an expanded Budget Forcing section, including quantitative comparisons on AIME24 and MATH to demonstrate whether the specific mechanism produces qualitatively different reasoning behavior beyond length alone.
Revision: yes
- Referee: [Main Results section] The claim that s1-32B exceeds o1-preview by up to 27% on MATH and AIME24 requires more detail on the exact evaluation protocol for the o1-preview baseline, including the number of samples or test-time compute budget allocated to it (given o1's undisclosed internal scaling). Additionally, reporting variance across multiple runs or statistical tests for the improvements would help assess robustness, especially since the abstract-only view leaves ambiguity on whether gains are consistent or sensitive to specific choices.
Authors: We appreciate the request for greater transparency. For the o1-preview baseline, we accessed the model via the OpenAI API using its default configuration and standard test-time scaling behavior. Because o1-preview's internal scaling mechanisms are proprietary and undisclosed, we cannot report the precise compute budget allocated by OpenAI. In the revision, we will expand the Main Results section to fully document our evaluation protocol for both models, including API parameters (temperature, max tokens where applicable), the exact number of problems evaluated from each benchmark, and any sampling details. We will also explicitly note that s1-32B results are from single runs due to computational cost and will report performance consistency across varying budget-forcing levels as an indirect robustness check. Statistical tests will be added where multiple runs become feasible.
Revision: partial
Circularity Check
No significant circularity in empirical claims
The paper reports an empirical pipeline: curation of s1K via ablations on difficulty/diversity/quality criteria, SFT of Qwen2.5-32B-Instruct, and application of budget forcing (append 'Wait' or terminate) whose effects are measured on MATH/AIME24. All performance numbers (e.g., 27% lift, 50% to 57% extrapolation) are observed outcomes, not quantities derived from the method definition or fitted parameters renamed as predictions. No equations, uniqueness theorems, or self-citations function as load-bearing premises that reduce the central result to its inputs by construction. The work is self-contained against external benchmarks with released data and code.
Axiom & Free-Parameter Ledger
free parameters (2)
- s1K dataset size = 1000
- Number of 'Wait' appends = multiple
axioms (1)
- domain assumption: The Qwen2.5-32B-Instruct base model is suitable for SFT on reasoning traces.