s1: Simple test-time scaling
Pith reviewed 2026-05-11 16:58 UTC · model grok-4.3
The pith
A 32B model finetuned on 1,000 reasoning traces and equipped with budget forcing exceeds o1-preview on competition math by up to 27%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Supervised finetuning of Qwen2.5-32B-Instruct on s1K, a dataset of 1,000 curated reasoning traces, combined with budget forcing during generation, produces s1-32B, which exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Scaling the test-time budget with budget forcing further extrapolates beyond the model's own performance without test-time intervention, from 50% to 57% on AIME24.
What carries the argument
Budget forcing, a method that either forcefully terminates the model's thinking process or lengthens it by appending 'Wait' multiple times when the model tries to stop, in order to control test-time compute and encourage double-checking of reasoning steps.
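The mechanism described above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate`, `END_OF_THINKING`, and `toy_model` are hypothetical names introduced here, and a real system would operate on model token IDs rather than strings.

```python
from typing import Callable, List

END_OF_THINKING = "<end_think>"  # hypothetical delimiter, not taken from the paper


def budget_forced_thinking(
    generate: Callable[[str, int], List[str]],
    prompt: str,
    max_tokens: int,
    num_waits: int = 0,
) -> List[str]:
    """Return thinking tokens under a budget-forcing policy (illustrative only).

    `generate(context, limit)` is a hypothetical stand-in for a model call: it
    returns at most `limit` tokens and stops after emitting END_OF_THINKING.
    """
    tokens: List[str] = []
    waits_left = num_waits
    while len(tokens) < max_tokens:
        context = prompt + " " + " ".join(tokens)
        chunk = generate(context, max_tokens - len(tokens))
        if not chunk:
            break
        if chunk[-1] == END_OF_THINKING:
            tokens.extend(chunk[:-1])
            if waits_left == 0:
                break  # no lengthening requested: let the model stop
            # Lengthen: drop the end marker and append "Wait" instead.
            tokens.append("Wait")
            waits_left -= 1
        else:
            # The token limit cut the chunk short before the model chose to stop.
            tokens.extend(chunk)
    # Forced termination: the thinking trace never exceeds the budget.
    return tokens[:max_tokens]


def toy_model(context: str, limit: int) -> List[str]:
    """Hypothetical stub: emits two tokens, then tries to stop thinking."""
    return ["step", "step", END_OF_THINKING][:limit]


if __name__ == "__main__":
    print(budget_forced_thinking(toy_model, "Q", max_tokens=100, num_waits=2))
    print(budget_forced_thinking(toy_model, "Q", max_tokens=3, num_waits=5))
```

The two knobs map onto the two halves of the description: `max_tokens` caps the thinking trace, while `num_waits` controls how many times an attempted stop is replaced with 'Wait'.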
Load-bearing premise
That appending 'Wait' genuinely improves the substance and accuracy of the reasoning chain rather than simply lengthening outputs without adding corrective value.
What would settle it
Run the finetuned model on AIME24 or MATH problems while allowing unconstrained longer generation without any 'Wait' appends and compare accuracy to the budget-forced versions; if the accuracy gains disappear, the claim that budget forcing adds value beyond length fails.
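One way to score such a comparison is a paired analysis over the same problems. The sketch below assumes per-problem correctness flags are available for both conditions; the function name and the choice of an exact two-sided sign (McNemar-style) test are illustrative assumptions made here, not something the paper or this review specifies.

```python
from math import comb
from typing import Dict, Sequence


def paired_comparison(forced: Sequence[bool], control: Sequence[bool]) -> Dict[str, float]:
    """Compare two conditions evaluated on the same problems (illustrative).

    forced[i]  : was problem i solved with budget forcing ('Wait' appends)?
    control[i] : was problem i solved with unconstrained longer generation?
    """
    if len(forced) != len(control) or not forced:
        raise ValueError("need two non-empty result lists of equal length")
    n = len(forced)
    only_forced = sum(f and not c for f, c in zip(forced, control))
    only_control = sum(c and not f for f, c in zip(forced, control))
    discordant = only_forced + only_control
    # Exact two-sided sign test on the discordant problems: under the null
    # hypothesis each discordant problem favours either condition with p = 0.5.
    k = min(only_forced, only_control)
    tail = sum(comb(discordant, i) for i in range(k + 1)) / 2 ** discordant
    return {
        "accuracy_forced": sum(forced) / n,
        "accuracy_control": sum(control) / n,
        "only_forced": only_forced,
        "only_control": only_control,
        "p_value": min(1.0, 2 * tail),
    }


if __name__ == "__main__":
    # Made-up flags purely to show the call shape; not results from the paper.
    print(paired_comparison([True, True, True, False], [True, False, False, False]))
```

Pairing matters because both conditions see identical problems, so only the problems where the two conditions disagree carry information about which one is better.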
Original abstract
Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces s1, a simple approach to test-time scaling for language models. It involves curating a 1,000-example dataset (s1K) of questions with high-quality reasoning traces selected for difficulty, diversity, and quality, validated via ablations. A technique called budget forcing is proposed to control test-time compute by either terminating the model's reasoning or appending the word 'Wait' to encourage further thinking and error correction. The Qwen2.5-32B-Instruct model is fine-tuned on s1K and combined with budget forcing to create s1-32B, which reportedly surpasses OpenAI's o1-preview on MATH and AIME24 benchmarks by up to 27%. Additionally, increasing the budget via more 'Wait' appends allows performance extrapolation on AIME24 from 50% to 57%. The work emphasizes simplicity and releases all components openly.
Significance. If the results hold under rigorous controls, this work is significant for providing an accessible, fully open-source demonstration of test-time scaling that achieves competitive or superior performance to closed models like o1-preview using only a small curated dataset and a lightweight inference intervention. Strengths include the emphasis on reproducibility (open model, data, code), ablations validating dataset criteria, and the empirical extrapolation result showing continued gains with increased test-time budget. It offers a practical baseline for the community studying reasoning in LLMs.
major comments (2)
- [Budget Forcing section] The central claim that budget forcing enables effective test-time scaling rests on the assertion that appending 'Wait' 'often fixes incorrect reasoning steps.' However, the experiments lack a direct control comparing this mechanism to simply extending generation length via other means (e.g., raising the max token limit, using a generic 'continue' prompt, or sampling additional tokens without the specific intervention). Without isolating this, the reported gains (such as the 50% to 57% AIME24 extrapolation and the 27% lift over o1-preview) could be attributable to increased output length rather than a qualitative change in reasoning distribution. This is load-bearing for the paper's simplicity and effectiveness argument.
- [Main Results section] The claim that s1-32B exceeds o1-preview by up to 27% on MATH and AIME24 requires more detail on the exact evaluation protocol for the o1-preview baseline, including the number of samples or test-time compute budget allocated to it (given o1's undisclosed internal scaling). Additionally, reporting variance across multiple runs or statistical tests for the improvements would help assess robustness, especially since the abstract-only view leaves ambiguity on whether gains are consistent or sensitive to specific choices.
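To make the robustness concern concrete, the sketch below computes Wilson score intervals for the two AIME24 accuracies. It assumes AIME24 contains 30 problems, so that 50% corresponds to 15/30 and 57% to roughly 17/30; that mapping is an assumption made here for illustration, and the function name is hypothetical.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Assumed counts: 15/30 (50%) without intervention, 17/30 (~57%) with budget forcing.
    for label, correct in (("no intervention", 15), ("budget forcing", 17)):
        low, high = wilson_interval(correct, 30)
        print(f"{label}: {correct}/30, 95% interval ({low:.3f}, {high:.3f})")
```

Under these assumptions each interval spans several tens of percentage points and the two overlap, which is why reporting variance across runs or a paired test would sharpen the extrapolation claim. The intervals treat each condition separately and ignore that the same problems are used in both, so they are only a rough guide.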
minor comments (2)
- [Abstract] The phrase 'up to 27%' is used without specifying the exact benchmark and condition achieving the maximum; adding this precision would improve clarity.
- [Dataset Curation section] While ablations on difficulty, diversity, and quality are mentioned, expanding the main text or appendix with quantitative metrics used to measure each criterion (e.g., diversity scores) would enhance reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. We address each major comment point by point below and outline the changes we will make to strengthen the manuscript.
Point-by-point responses
- Referee: [Budget Forcing section] The central claim that budget forcing enables effective test-time scaling rests on the assertion that appending 'Wait' 'often fixes incorrect reasoning steps.' However, the experiments lack a direct control comparing this mechanism to simply extending generation length via other means (e.g., raising the max token limit, using a generic 'continue' prompt, or sampling additional tokens without the specific intervention). Without isolating this, the reported gains (such as the 50% to 57% AIME24 extrapolation and the 27% lift over o1-preview) could be attributable to increased output length rather than a qualitative change in reasoning distribution. This is load-bearing for the paper's simplicity and effectiveness argument.
Authors: We agree that a direct comparison isolating the 'Wait' intervention from generic increases in output length is necessary to support the central claim. While our results show consistent gains from budget forcing, we acknowledge that the current experiments do not fully rule out length as the sole factor. In the revised manuscript, we will add new ablation experiments comparing budget forcing to (1) simply raising the maximum token limit without any prompt and (2) using a generic continuation prompt such as 'continue'. These results will be reported in an expanded Budget Forcing section, including quantitative comparisons on AIME24 and MATH to demonstrate whether the specific mechanism produces qualitatively different reasoning behavior beyond length alone.
Revision: yes
- Referee: [Main Results section] The claim that s1-32B exceeds o1-preview by up to 27% on MATH and AIME24 requires more detail on the exact evaluation protocol for the o1-preview baseline, including the number of samples or test-time compute budget allocated to it (given o1's undisclosed internal scaling). Additionally, reporting variance across multiple runs or statistical tests for the improvements would help assess robustness, especially since the abstract-only view leaves ambiguity on whether gains are consistent or sensitive to specific choices.
Authors: We appreciate the request for greater transparency. For the o1-preview baseline, we accessed the model via the OpenAI API using its default configuration and standard test-time scaling behavior. Because o1-preview's internal scaling mechanisms are proprietary and undisclosed, we cannot report the precise compute budget allocated by OpenAI. In the revision, we will expand the Main Results section to fully document our evaluation protocol for both models, including API parameters (temperature, max tokens where applicable), the exact number of problems evaluated from each benchmark, and any sampling details. We will also explicitly note that s1-32B results are from single runs due to computational cost and will report performance consistency across varying budget-forcing levels as an indirect robustness check. Statistical tests will be added where multiple runs become feasible.
Revision: partial
Circularity Check
No significant circularity in empirical claims
Full rationale
The paper reports an empirical pipeline: curation of s1K via ablations on difficulty/diversity/quality criteria, SFT of Qwen2.5-32B-Instruct, and application of budget forcing (append 'Wait' or terminate) whose effects are measured on MATH/AIME24. All performance numbers (e.g., 27% lift, 50% to 57% extrapolation) are observed outcomes, not quantities derived from the method definition or fitted parameters renamed as predictions. No equations, uniqueness theorems, or self-citations function as load-bearing premises that reduce the central result to its inputs by construction. The work is self-contained against external benchmarks with released data and code.
Axiom & Free-Parameter Ledger
free parameters (2)
- s1K dataset size = 1000
- Number of 'Wait' appends = multiple
axioms (1)
- domain assumption: The Qwen2.5-32B-Instruct base model is suitable for SFT on reasoning traces.
Forward citations
Cited by 44 Pith papers
- REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
- The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits
Shared token budgets between visible chain-of-thought and answers create a coupling tax that makes non-thinking competitive on math benchmarks, with a truncation decomposition predicting the crossover and split budget...
- GIANTS: Generative Insight Anticipation from Scientific Literature
GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.
- Multi-Rollout On-Policy Distillation via Peer Successes and Failures
MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.
- DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
- Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
- Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
- POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference
POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.
- How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum
A single-parameter Tsallis loss continuum unifies SFT and RLVR, derives time-to-escape bounds for cold start, and yields GARL and PAFT estimators that improve performance on QA reasoning tasks.
- Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning
A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and common...
- Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning
CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.
- Towards Unconstrained Human-Object Interaction
Introduces the U-HOI task and shows MLLMs plus a language-to-graph pipeline can handle human-object interactions without any predefined vocabulary at training or inference time.
- AI Achieves a Perfect LSAT Score
Language models achieve a perfect LSAT score, with experiments showing that internal thinking phases and a fine-tuned process reward model are key to high performance on logical reasoning questions.
- Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...
- DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.
- Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting
Covariance-weighted GRPO with Gaussian-kernel reweighting tames extreme tokens to stabilize training and boost reasoning performance over standard GRPO.
- Evaluating the False Trust engendered by LLM Explanations
A user study finds that LLM reasoning traces and post-hoc explanations create false trust by increasing acceptance of incorrect answers, whereas contrastive dual explanations improve users' ability to detect errors.
- AIPO: Learning to Reason from Active Interaction
AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...
- Revisiting Transformer Layer Parameterization Through Causal Energy Minimization
CEM recasts Transformer layers as energy minimization steps, enabling constrained parameterizations like weight sharing and low-rank interactions that match standard baselines in 100M-scale language modeling.
- Gradient Extrapolation-Based Policy Optimization
GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering u...
- A Multimodal Dataset for Visually Grounded Ambiguity in Machine Translation
VIDA provides 2,500 visually-dependent ambiguous MT instances and LLM-judge metrics; chain-of-thought SFT improves disambiguation accuracy over standard SFT, especially out-of-distribution.
- VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model
VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.
- How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum
Tsallis q-loss continuum enables faster cold-start escape in reasoning model training via probability-based gradient amplification, with practical Monte Carlo estimators GARL and PAFT.
- Process Supervision via Verbal Critique Improves Reasoning in Large Language Models
Verbal Process Supervision uses structured critiques from stronger models in an iterative loop to improve LLM reasoning, reaching 94.9% on GPQA Diamond and large gains on AIME 2025.
- HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
- Understanding the Mechanism of Altruism in Large Language Models
A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.
- Reasoning Structure Matters for Safety Alignment of Reasoning Models
Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.
- Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data
A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.
- Characterizing Model-Native Skills
Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...
- When More Thinking Hurts: Overthinking in LLM Test-Time Compute Scaling
Extended reasoning in LLMs exhibits overthinking and diminishing returns, with optimal thinking length varying by problem difficulty, allowing significant compute savings by stopping at moderate budgets.
- On the Step Length Confounding in LLM Reasoning Data Selection
Average log probability selection for LLM reasoning datasets is confounded by step length because longer steps dilute low-probability first tokens; ASLEC-DROP and ASLEC-CASL remove this bias.
- Procedural Knowledge at Scale Improves Reasoning
Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks b...
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
- Dream 7B: Diffusion Large Language Models
Dream 7B is a 7B diffusion LLM that refines sequences in parallel via denoising and outperforms prior diffusion models on general, mathematical, and coding benchmarks with added flexibility in generation order and qua...
- ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
ReTool uses outcome-driven RL to train 32B LLMs to dynamically use code tools during reasoning, reaching 72.5% accuracy on AIME and surpassing o1-preview.
- Towards an AI co-scientist
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
- Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation
Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.
- Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
- BitCal-TTS: Bit-Calibrated Test-Time Scaling for Quantized Reasoning Models
BitCal-TTS raises exact-match accuracy by 3.7 points (7B) and 2.8 points (14B) on small GSM8K shards for 4-bit Qwen2.5 models while cutting premature-stop rates and retaining token savings versus fixed-budget decoding.
- ReMedi: Reasoner for Medical Clinical Prediction
ReMedi boosts LLM performance on EHR clinical predictions by up to 19.9% F1 through ground-truth-guided rationale regeneration and fine-tuning.
- Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
- Adaptive Negative Reinforcement for LLM Reasoning: Dynamically Balancing Correction and Diversity in RLVR
Adaptive scheduling of penalties over training time plus confidence-based weighting of mistakes improves LLM performance on math reasoning benchmarks compared to fixed-penalty negative reinforcement.
- Adam's Law: Textual Frequency Law on Large Language Models
Frequent sentence-level text improves LLM prompting and fine-tuning performance across math, translation, commonsense, and tool-use tasks via a proposed frequency law and curriculum ordering.
- Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models