pith. machine review for the scientific record.

arxiv: 2501.19393 · v3 · submitted 2025-01-31 · 💻 cs.CL · cs.AI · cs.LG

s1: Simple test-time scaling

Emmanuel Candès, Hannaneh Hajishirzi, Li Fei-Fei, Luke Zettlemoyer, Niklas Muennighoff, Percy Liang, Tatsunori Hashimoto, Weijia Shi, Xiang Lisa Li, Zitong Yang

Pith reviewed 2026-05-11 16:58 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.LG
keywords test-time scaling · reasoning traces · budget forcing · mathematical reasoning · supervised finetuning · o1 replication · competition math

The pith

A 32B model finetuned on 1,000 reasoning traces and equipped with budget forcing exceeds o1-preview on competition math by up to 27%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks the simplest way to achieve test-time scaling for stronger language model reasoning. It assembles s1K, a dataset of 1,000 questions paired with reasoning traces chosen for difficulty, diversity, and quality. Budget forcing is developed to control inference compute by either halting the model's thinking or appending 'Wait' to extend it and prompt self-correction. After supervised finetuning of Qwen2.5-32B-Instruct on s1K and applying budget forcing, the resulting s1-32B model surpasses o1-preview on MATH and AIME24. Further increases in test-time budget via repeated forcing raise performance from 50% to 57% on AIME24.

Core claim

Supervised finetuning of Qwen2.5-32B-Instruct on s1K, a dataset of 1,000 curated reasoning traces, combined with budget forcing at generation time, produces s1-32B, which exceeds o1-preview on competition math questions (MATH and AIME24) by up to 27%. Scaling the test-time budget with budget forcing extrapolates beyond the model's own performance without test-time intervention, from 50% to 57% on AIME24.

What carries the argument

Budget forcing: a method that either forcefully terminates the model's thinking process or, when the model tries to end its generation, appends 'Wait' (possibly several times), in order to allocate more test-time compute and encourage the model to double-check its reasoning steps.
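The control loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `step` callback standing in for the model, the `</think>` delimiter, and the `min_tokens`, `max_tokens`, and `max_waits` parameters are all hypothetical names introduced here.

```python
from typing import Callable, List

END_OF_THINKING = "</think>"  # hypothetical delimiter, not taken from the paper


def budget_force(
    step: Callable[[List[str]], str],
    prompt: List[str],
    min_tokens: int,
    max_tokens: int,
    max_waits: int = 2,
) -> List[str]:
    """Return a thinking trace whose length is steered by budget forcing.

    `step(context)` is an assumed interface: it returns the next token
    given every token produced so far.
    """
    thinking: List[str] = []
    waits_used = 0
    while len(thinking) < max_tokens:
        token = step(prompt + thinking)
        if token == END_OF_THINKING:
            # The model tries to stop. If it is under the minimum budget and
            # has waits left, suppress the stop and append "Wait" instead.
            if len(thinking) < min_tokens and waits_used < max_waits:
                thinking.append("Wait")
                waits_used += 1
                continue
            break
        thinking.append(token)
    # Either the model stopped on its own or the cap was reached:
    # close the thinking phase explicitly.
    thinking.append(END_OF_THINKING)
    return thinking


def toy_step(context: List[str]) -> str:
    """Toy stand-in for a model: solves once, rechecks after each 'Wait'."""
    last = context[-1]
    if last == "Wait":
        return "recheck"
    if last in ("solve", "recheck"):
        return END_OF_THINKING
    return "solve"


extended = budget_force(toy_step, ["Q"], min_tokens=4, max_tokens=10)
unforced = budget_force(toy_step, ["Q"], min_tokens=0, max_tokens=10)
truncated = budget_force(lambda ctx: "x", ["Q"], min_tokens=0, max_tokens=3)
```

With the toy model, `extended` contains two forced 'Wait' continuations, `unforced` stops as soon as the model wants to, and `truncated` shows the hard cap cutting off a model that never stops by itself. Whether a real model uses the extra tokens to correct itself is exactly the load-bearing premise discussed next.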

Load-bearing premise

That appending 'Wait' genuinely improves the substance and accuracy of the reasoning chain rather than simply lengthening outputs without adding corrective value.

What would settle it

Run the finetuned model on AIME24 or MATH problems while allowing unconstrained longer generation without any 'Wait' appends and compare accuracy to the budget-forced versions; if the accuracy gains disappear, the claim that budget forcing adds value beyond length fails.
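One way to score such a control is a paired comparison on the same problems, since both conditions answer identical questions. The sketch below assumes per-problem correctness flags are available for each condition and applies an exact two-sided McNemar test to the problems where the two conditions disagree. The function name, the result keys, and the ten-problem toy data are made up for illustration; they are not results from the paper.

```python
from math import comb
from typing import Dict, Sequence


def paired_comparison(
    forced: Sequence[bool], length_only: Sequence[bool]
) -> Dict[str, float]:
    """Compare two conditions evaluated on the same problems.

    forced[i] / length_only[i] say whether problem i was solved with budget
    forcing or with unconstrained longer generation, respectively.
    """
    if len(forced) != len(length_only) or len(forced) == 0:
        raise ValueError("need two non-empty result lists of equal length")
    n = len(forced)
    # Discordant pairs: problems solved under exactly one condition.
    only_forced = sum(1 for f, l in zip(forced, length_only) if f and not l)
    only_length = sum(1 for f, l in zip(forced, length_only) if l and not f)
    discordant = only_forced + only_length
    # Exact McNemar test: under the null hypothesis each discordant pair is
    # equally likely to favour either condition (binomial with p = 0.5).
    k = min(only_forced, only_length)
    tail = sum(comb(discordant, i) for i in range(k + 1)) / 2 ** discordant
    return {
        "acc_forced": sum(forced) / n,
        "acc_length_only": sum(length_only) / n,
        "only_forced": only_forced,
        "only_length_only": only_length,
        "p_value": min(1.0, 2 * tail),
    }


# Toy data for ten problems (illustrative only).
forced = [True] * 6 + [False] * 4
length_only = [True] * 4 + [False] * 5 + [True]
result = paired_comparison(forced, length_only)
```

In the toy data the two conditions differ on only three problems, so the test cannot separate them. On a benchmark as small as AIME24 the same thing can easily happen, which is worth keeping in mind when reading a difference of a few percentage points.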

Original abstract

Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces s1, a simple approach to test-time scaling for language models. It involves curating a 1,000-example dataset (s1K) of questions with high-quality reasoning traces selected for difficulty, diversity, and quality, validated via ablations. A technique called budget forcing is proposed to control test-time compute by either terminating the model's reasoning or appending the word 'Wait' to encourage further thinking and error correction. The Qwen2.5-32B-Instruct model is fine-tuned on s1K and combined with budget forcing to create s1-32B, which reportedly surpasses OpenAI's o1-preview on MATH and AIME24 benchmarks by up to 27%. Additionally, increasing the budget via more 'Wait' appends allows performance extrapolation on AIME24 from 50% to 57%. The work emphasizes simplicity and releases all components openly.

Significance. If the results hold under rigorous controls, this work is significant for providing an accessible, fully open-source demonstration of test-time scaling that achieves competitive or superior performance to closed models like o1-preview using only a small curated dataset and a lightweight inference intervention. Strengths include the emphasis on reproducibility (open model, data, code), ablations validating dataset criteria, and the empirical extrapolation result showing continued gains with increased test-time budget. It offers a practical baseline for the community studying reasoning in LLMs.

major comments (2)
  1. [Budget Forcing section] Budget Forcing section: The central claim that budget forcing enables effective test-time scaling rests on the assertion that appending 'Wait' 'often fixes incorrect reasoning steps.' However, the experiments lack a direct control comparing this mechanism to simply extending generation length via other means (e.g., raising the max token limit, using a generic 'continue' prompt, or sampling additional tokens without the specific intervention). Without isolating this, the reported gains (such as the 50% to 57% AIME24 extrapolation and the 27% lift over o1-preview) could be attributable to increased output length rather than a qualitative change in reasoning distribution. This is load-bearing for the paper's simplicity and effectiveness argument.
  2. [Main Results section] Main Results section: The claim that s1-32B exceeds o1-preview by up to 27% on MATH and AIME24 requires more detail on the exact evaluation protocol for the o1-preview baseline, including the number of samples or test-time compute budget allocated to it (given o1's undisclosed internal scaling). Additionally, reporting variance across multiple runs or statistical tests for the improvements would help assess robustness, especially since the abstract-only view leaves ambiguity on whether gains are consistent or sensitive to specific choices.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'up to 27%' is used without specifying the exact benchmark and condition achieving the maximum; adding this precision would improve clarity.
  2. [Dataset Curation section] Dataset Curation section: While ablations on difficulty, diversity, and quality are mentioned, expanding the main text or appendix with quantitative metrics used to measure each criterion (e.g., diversity scores) would enhance reproducibility.
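As an illustration of the kind of quantitative diversity metric the second minor comment asks for, one option is the normalized Shannon entropy of the domain labels attached to the selected questions. This is a hypothetical metric chosen for the sketch; nothing above states which measure, if any, the authors use, and the domain labels below are invented.

```python
from collections import Counter
from math import log
from typing import Hashable, Sequence


def normalized_entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy of the label distribution, scaled to [0, 1].

    Returns 1.0 when every observed label is equally frequent and 0.0 when
    at most one distinct label is present.
    """
    counts = Counter(labels)
    if len(counts) <= 1:
        return 0.0
    n = len(labels)
    entropy = -sum((c / n) * log(c / n) for c in counts.values())
    # Divide by the maximum entropy attainable with this many labels.
    return entropy / log(len(counts))


# Invented domain labels for two candidate subsets.
balanced = ["algebra", "geometry", "probability", "number theory"]
skewed = ["algebra"] * 9 + ["geometry"]

balanced_score = normalized_entropy(balanced)
skewed_score = normalized_entropy(skewed)
```

Because the score is normalized by the number of labels actually observed, a subset drawn evenly from only two domains also scores 1.0. A reported metric would therefore need to state the number of domains alongside the score.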

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment point by point below and outline the changes we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Budget Forcing section] Budget Forcing section: The central claim that budget forcing enables effective test-time scaling rests on the assertion that appending 'Wait' 'often fixes incorrect reasoning steps.' However, the experiments lack a direct control comparing this mechanism to simply extending generation length via other means (e.g., raising the max token limit, using a generic 'continue' prompt, or sampling additional tokens without the specific intervention). Without isolating this, the reported gains (such as the 50% to 57% AIME24 extrapolation and the 27% lift over o1-preview) could be attributable to increased output length rather than a qualitative change in reasoning distribution. This is load-bearing for the paper's simplicity and effectiveness argument.

    Authors: We agree that a direct comparison isolating the 'Wait' intervention from generic increases in output length is necessary to support the central claim. While our results show consistent gains from budget forcing, we acknowledge that the current experiments do not fully rule out length as the sole factor. In the revised manuscript, we will add new ablation experiments comparing budget forcing to (1) simply raising the maximum token limit without any prompt and (2) using a generic continuation prompt such as 'continue'. These results will be reported in an expanded Budget Forcing section, including quantitative comparisons on AIME24 and MATH to demonstrate whether the specific mechanism produces qualitatively different reasoning behavior beyond length alone. revision: yes

  2. Referee: [Main Results section] Main Results section: The claim that s1-32B exceeds o1-preview by up to 27% on MATH and AIME24 requires more detail on the exact evaluation protocol for the o1-preview baseline, including the number of samples or test-time compute budget allocated to it (given o1's undisclosed internal scaling). Additionally, reporting variance across multiple runs or statistical tests for the improvements would help assess robustness, especially since the abstract-only view leaves ambiguity on whether gains are consistent or sensitive to specific choices.

    Authors: We appreciate the request for greater transparency. For the o1-preview baseline, we accessed the model via the OpenAI API using its default configuration and standard test-time scaling behavior. Because o1-preview's internal scaling mechanisms are proprietary and undisclosed, we cannot report the precise compute budget allocated by OpenAI. In the revision, we will expand the Main Results section to fully document our evaluation protocol for both models, including API parameters (temperature, max tokens where applicable), the exact number of problems evaluated from each benchmark, and any sampling details. We will also explicitly note that s1-32B results are from single runs due to computational cost and will report performance consistency across varying budget-forcing levels as an indirect robustness check. Statistical tests will be added where multiple runs become feasible. revision: partial
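To put the single-run caveat in perspective, a back-of-the-envelope check on the AIME24 figures helps. The sketch below assumes AIME24 comprises 30 problems, so that 50% and 57% correspond to 15 and 17 correct answers; that mapping is an assumption made here, not a number stated above. It computes 95% Wilson score intervals for both proportions.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


N_PROBLEMS = 30  # assumed size of AIME24

# 15/30 = 50% (no test-time intervention), 17/30 ~ 57% (with budget forcing).
baseline_low, baseline_high = wilson_interval(15, N_PROBLEMS)
forced_low, forced_high = wilson_interval(17, N_PROBLEMS)

# True when the two intervals share some range of plausible accuracies.
intervals_overlap = forced_low < baseline_high
```

Under these assumptions the two intervals overlap substantially, which is why the referee's request for variance across runs matters: a difference of two problems out of thirty is hard to distinguish from sampling noise in a single run.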

Circularity Check

0 steps flagged

No significant circularity in empirical claims

Full rationale

The paper reports an empirical pipeline: curation of s1K via ablations on difficulty/diversity/quality criteria, SFT of Qwen2.5-32B-Instruct, and application of budget forcing (append 'Wait' or terminate) whose effects are measured on MATH/AIME24. All performance numbers (e.g., 27% lift, 50% to 57% extrapolation) are observed outcomes, not quantities derived from the method definition or fitted parameters renamed as predictions. No equations, uniqueness theorems, or self-citations function as load-bearing premises that reduce the central result to its inputs by construction. The work is self-contained against external benchmarks with released data and code.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The method relies on empirical choices for dataset curation and the budget forcing heuristic, with no new theoretical entities.

free parameters (2)
  • s1K dataset size = 1000
    Chosen as a small dataset, curated for difficulty, diversity, and quality.
  • Number of 'Wait' appends = multiple
    Used to lengthen the thinking process.
axioms (1)
  • domain assumption: The Qwen2.5-32B-Instruct base model is suitable for SFT on reasoning traces.
    Assumed without detailed justification in the abstract.

pith-pipeline@v0.9.0 · 5577 in / 1318 out tokens · 38457 ms · 2026-05-11T16:58:08.198671+00:00 · methodology

Forward citations

Cited by 44 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

    cs.CL 2026-05 unverdicted novelty 8.0

    REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...

  2. The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits

    cs.LG 2026-05 unverdicted novelty 8.0

    Shared token budgets between visible chain-of-thought and answers create a coupling tax that makes non-thinking competitive on math benchmarks, with a truncation decomposition predicting the crossover and split budget...

  3. GIANTS: Generative Insight Anticipation from Scientific Literature

    cs.CL 2026-04 unverdicted novelty 8.0

    GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.

  4. Multi-Rollout On-Policy Distillation via Peer Successes and Failures

    cs.LG 2026-05 unverdicted novelty 7.0

    MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.

  5. DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards

    cs.LG 2026-05 unverdicted novelty 7.0

    DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.

  6. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.

  7. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.

  8. POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference

    cs.SE 2026-05 unverdicted novelty 7.0

    POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.

  9. How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

    cs.LG 2026-04 unverdicted novelty 7.0

    A single-parameter Tsallis loss continuum unifies SFT and RLVR, derives time-to-escape bounds for cold start, and yields GARL and PAFT estimators that improve performance on QA reasoning tasks.

  10. Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning

    cs.LG 2026-04 unverdicted novelty 7.0

    A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and common...

  11. Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning

    cs.CL 2026-04 unverdicted novelty 7.0

    CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.

  12. Towards Unconstrained Human-Object Interaction

    cs.CV 2026-04 unverdicted novelty 7.0

    Introduces the U-HOI task and shows MLLMs plus a language-to-graph pipeline can handle human-object interactions without any predefined vocabulary at training or inference time.

  13. AI Achieves a Perfect LSAT Score

    cs.AI 2026-04 unverdicted novelty 7.0

    Language models achieve a perfect LSAT score, with experiments showing that internal thinking phases and a fine-tuned process reward model are key to high performance on logical reasoning questions.

  14. Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    cs.LG 2026-01 unverdicted novelty 7.0

    A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...

  15. DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    cs.CV 2025-05 unverdicted novelty 7.0

    DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.

  16. Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting

    cs.CL 2026-05 unverdicted novelty 6.0

    Covariance-weighted GRPO with Gaussian-kernel reweighting tames extreme tokens to stabilize training and boost reasoning performance over standard GRPO.

  17. Evaluating the False Trust engendered by LLM Explanations

    cs.HC 2026-05 unverdicted novelty 6.0

    A user study finds that LLM reasoning traces and post-hoc explanations create false trust by increasing acceptance of incorrect answers, whereas contrastive dual explanations improve users' ability to detect errors.

  18. AIPO: Learning to Reason from Active Interaction

    cs.CL 2026-05 unverdicted novelty 6.0

    AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...

  19. Revisiting Transformer Layer Parameterization Through Causal Energy Minimization

    cs.LG 2026-05 unverdicted novelty 6.0

    CEM recasts Transformer layers as energy minimization steps, enabling constrained parameterizations like weight sharing and low-rank interactions that match standard baselines in 100M-scale language modeling.

  20. Gradient Extrapolation-Based Policy Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering u...

  21. A Multimodal Dataset for Visually Grounded Ambiguity in Machine Translation

    cs.CL 2026-05 unverdicted novelty 6.0

    VIDA provides 2,500 visually-dependent ambiguous MT instances and LLM-judge metrics; chain-of-thought SFT improves disambiguation accuracy over standard SFT, especially out-of-distribution.

  22. VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model

    cs.RO 2026-05 unverdicted novelty 6.0

    VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.

  23. How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

    cs.LG 2026-04 unverdicted novelty 6.0

    Tsallis q-loss continuum enables faster cold-start escape in reasoning model training via probability-based gradient amplification, with practical Monte Carlo estimators GARL and PAFT.

  24. Process Supervision via Verbal Critique Improves Reasoning in Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    Verbal Process Supervision uses structured critiques from stronger models in an iterative loop to improve LLM reasoning, reaching 94.9% on GPQA Diamond and large gains on AIME 2025.

  25. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  26. Understanding the Mechanism of Altruism in Large Language Models

    econ.GN 2026-04 unverdicted novelty 6.0

    A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.

  27. Reasoning Structure Matters for Safety Alignment of Reasoning Models

    cs.AI 2026-04 unverdicted novelty 6.0

    Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.

  28. Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data

    cs.LG 2026-04 unverdicted novelty 6.0

    A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.

  29. Characterizing Model-Native Skills

    cs.AI 2026-04 conditional novelty 6.0

    Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...

  30. When More Thinking Hurts: Overthinking in LLM Test-Time Compute Scaling

    cs.AI 2026-04 unverdicted novelty 6.0

    Extended reasoning in LLMs exhibits overthinking and diminishing returns, with optimal thinking length varying by problem difficulty, allowing significant compute savings by stopping at moderate budgets.

  31. On the Step Length Confounding in LLM Reasoning Data Selection

    cs.CL 2026-04 unverdicted novelty 6.0

    Average log probability selection for LLM reasoning datasets is confounded by step length because longer steps dilute low-probability first tokens; ASLEC-DROP and ASLEC-CASL remove this bias.

  32. Procedural Knowledge at Scale Improves Reasoning

    cs.CL 2026-04 unverdicted novelty 6.0

    Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks b...

  33. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  34. Dream 7B: Diffusion Large Language Models

    cs.CL 2025-08 unverdicted novelty 6.0

    Dream 7B is a 7B diffusion LLM that refines sequences in parallel via denoising and outperforms prior diffusion models on general, mathematical, and coding benchmarks with added flexibility in generation order and qua...

  35. ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

    cs.CL 2025-04 unverdicted novelty 6.0

    ReTool uses outcome-driven RL to train 32B LLMs to dynamically use code tools during reasoning, reaching 72.5% accuracy on AIME and surpassing o1-preview.

  36. Towards an AI co-scientist

    cs.AI 2025-02 unverdicted novelty 6.0

    A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

  37. Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation

    cs.AI 2026-05 unverdicted novelty 5.0

    Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.

  38. Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes

    cs.AI 2026-05 unverdicted novelty 5.0

    Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.

  39. BitCal-TTS: Bit-Calibrated Test-Time Scaling for Quantized Reasoning Models

    cs.AI 2026-05 unverdicted novelty 5.0

    BitCal-TTS raises exact-match accuracy by 3.7 points (7B) and 2.8 points (14B) on small GSM8K shards for 4-bit Qwen2.5 models while cutting premature-stop rates and retaining token savings versus fixed-budget decoding.

  40. ReMedi: Reasoner for Medical Clinical Prediction

    cs.CL 2026-05 unverdicted novelty 5.0

    ReMedi boosts LLM performance on EHR clinical predictions by up to 19.9% F1 through ground-truth-guided rationale regeneration and fine-tuning.

  41. Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    cs.CL 2025-03 unverdicted novelty 5.0

    Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.

  42. Adaptive Negative Reinforcement for LLM Reasoning: Dynamically Balancing Correction and Diversity in RLVR

    cs.LG 2026-05 unverdicted novelty 4.0

    Adaptive scheduling of penalties over training time plus confidence-based weighting of mistakes improves LLM performance on math reasoning benchmarks compared to fixed-penalty negative reinforcement.

  43. Adam's Law: Textual Frequency Law on Large Language Models

    cs.CL 2026-04 unverdicted novelty 3.0

    Frequent sentence-level text improves LLM prompting and fine-tuning performance across math, translation, commonsense, and tool-use tasks via a proposed frequency law and curriculum ordering.

  44. Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models

    cs.CL 2025-08