arxiv: 2506.04178 · v2 · submitted 2025-06-04 · 💻 cs.LG

OpenThoughts: Data Recipes for Reasoning Models

Aaron Gokaslan, Achal Dave, Aditya Grover, Alexandros G. Dimakis, Alon Albalak, Ashima Suvarna, Benjamin Feuer, Blake Wulfe, Caroline Choi, Charlie Cheng-Jie Ji, Chinmay Hegde, Eric Frankel, Etash Guha, Georgios Smyrnis, Greg Durrett, Hritik Bansal, Jean Mercat, Jeffrey Li, Jenia Jitsev, John Yang, Jon Saad-Falcon, Kai-Wei Chang, Kartik Sharma, Kushal Arora, Liangyu Chen, Ludwig Schmidt, Maheswaran Sathiamoorthy, Marianna Nezhurina, Mike A. Merrill, Mohit Bansal, Negin Raoof, Niklas Muennighoff, Reinhard Heckel, Ryan Marten, Saadia Gabriel, Sachin Grover, Sarah Pratt, Sedrick Keh, Sewoong Oh, Shiye Su, Shreyas Pimpalgaonkar, Tatsunori Hashimoto, Trung Vu, Vaishaal Shankar, Vivek Ramanujan, Wanjia Zhao, Yejin Choi, Yichuan Deng, Zaid Khan, Zayne Sprague

Pith reviewed 2026-05-12 04:51 UTC · model grok-4.3

classification 💻 cs.LG

keywords: reasoning models, data recipes, open datasets, AIME, LiveCodeBench, GPQA, distillation, machine learning

The pith

Open data generation recipes train a 7B model to 53 percent on AIME 2025 and 54 percent on GPQA Diamond.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to replace proprietary datasets with public ones for training reasoning models in math, code, and science. An earlier dataset, OpenThoughts2-1M, allowed a 32B model trained only on public reasoning data to match DeepSeek-R1-Distill-32B on benchmarks such as AIME and LiveCodeBench. The authors then ran more than one thousand controlled experiments to refine every stage of the data pipeline, including question generation, solution creation, and filtering. Scaling the improved process to 1.2 million examples with QwQ-32B as the teacher produced a 7B model that reaches the scores above, double-digit gains over DeepSeek-R1-Distill-Qwen-7B. A sympathetic reader would care because the work supplies concrete, replicable steps for building capable reasoning systems without secret data.

Core claim

Through systematic investigation of each step in the data generation pipeline with over 1,000 controlled experiments, the authors create the OpenThoughts3 dataset of 1.2 million examples. When this dataset is used to train a 7B model with QwQ-32B as teacher, the resulting OpenThoughts3-7B model reaches 53 percent on AIME 2025, 51 percent on LiveCodeBench covering June 2024 to January 2025, and 54 percent on GPQA Diamond, for respective gains of 15.3, 17.2, and 20.5 percentage points over DeepSeek-R1-Distill-Qwen-7B.
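
As a quick arithmetic check on the figures above, the reported scores and gains imply approximate baseline scores for DeepSeek-R1-Distill-Qwen-7B. The sketch below performs only that subtraction. The scores are quoted here as whole percentages, so the implied baselines are approximate and the paper's own tables remain authoritative; the dictionary and variable names are illustrative.

```python
# Reported OpenThoughts3-7B scores (percent, rounded as quoted in the text)
# and reported gains (percentage points) over DeepSeek-R1-Distill-Qwen-7B.
reported = {
    "AIME 2025": (53.0, 15.3),
    "LiveCodeBench 06/24-01/25": (51.0, 17.2),
    "GPQA Diamond": (54.0, 20.5),
}

# Implied baseline = reported score - reported gain. This is approximate
# because the scores above are rounded.
implied_baseline = {
    name: round(score - gain, 1) for name, (score, gain) in reported.items()
}

for name, value in implied_baseline.items():
    print(f"{name}: ~{value}%")
```

Run as written, the subtraction gives roughly 37.7, 33.8, and 33.5 percent for the three benchmarks.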

What carries the argument

The data generation pipeline, whose individual stages are isolated and improved through ablation experiments to produce higher-quality reasoning traces.

Load-bearing premise

The large benchmark gains come mainly from the data recipes and pipeline choices rather than from the specific teacher model, base model architecture, or unstated training hyperparameters.

What would settle it

Train an identical 7B model on the OpenThoughts3 data but with a weaker teacher model and measure whether the AIME, LiveCodeBench, and GPQA scores fall back close to the prior open baseline.

original abstract

Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. After initial explorations, our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improve our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as teacher yields our OpenThoughts3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond - improvements of 15.3, 17.2, and 20.5 percentage points compared to the DeepSeek-R1-Distill-Qwen-7B. All of our datasets and models are available on https://openthoughts.ai.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents the OpenThoughts project for developing open-source datasets and pipelines to train reasoning models on math, code, and science tasks. It describes OpenThoughts2-1M yielding OpenThinker2-32B (matching DeepSeek-R1-Distill-32B), followed by systematic investigation of the data generation pipeline via 1,000+ controlled experiments to produce OpenThoughts3. Scaling to 1.2M examples and distilling from QwQ-32B produces OpenThoughts3-7B, which reports SOTA results of 53% on AIME 2025, 51% on LiveCodeBench (06/24-01/25), and 54% on GPQA Diamond—gains of 15.3, 17.2, and 20.5 pp over DeepSeek-R1-Distill-Qwen-7B—with full public release of datasets and models.

Significance. If the performance gains hold after isolating the data recipes from teacher-model effects, this work offers substantial value by releasing large-scale, high-quality open reasoning datasets and models. The scale of controlled pipeline experiments and public artifacts directly address the lack of transparency in proprietary reasoning systems, enabling community replication and extension.

major comments (3)

[Abstract and results] Abstract and results section: The headline claim attributes the 15–20 pp gains of OpenThoughts3-7B primarily to the data recipes and pipeline choices, yet the final model is distilled from QwQ-32B while the DeepSeek-R1-Distill-Qwen-7B baseline uses a different distillation source. No ablation is reported that holds the teacher model fixed while varying only the data pipeline, leaving the contribution of the recipes entangled with teacher strength and unstated training details.
[Experiments] Experiments section: The manuscript states that 1,000+ controlled experiments were used to refine the pipeline, but provides insufficient detail on the exact controls (e.g., fixed random seeds, identical base models, statistical significance testing, or handling of confounds such as prompt formatting and filtering thresholds). This weakens the ability to attribute improvements specifically to the reported recipe changes.
[Results] Results and evaluation: The reported benchmark scores for OpenThoughts3-7B are compared only to DeepSeek-R1-Distill-Qwen-7B; additional baselines using the same teacher (QwQ-32B) or identical base model with prior data recipes are absent, making it hard to confirm that the gains reflect broad reasoning improvements rather than teacher-specific distillation effects.
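
The confound raised in the first and third comments can be made concrete with a small sketch. Each training run is treated as a (teacher, data recipe) pair, and the code looks for pairs of runs that differ in exactly one of the two factors. The run labels are placeholders chosen for illustration, and the function name is hypothetical; nothing here reproduces the paper's actual experiment list.

```python
from itertools import combinations

# Each run is a (teacher, data_recipe) pair. Labels are placeholders.
headline_runs = [
    ("QwQ-32B", "OpenThoughts3 recipe"),      # OpenThoughts3-7B
    ("baseline teacher", "baseline recipe"),  # DeepSeek-R1-Distill-Qwen-7B
]

def single_factor_contrasts(runs):
    """Return pairs of runs that differ in exactly one of the two factors."""
    contrasts = []
    for a, b in combinations(runs, 2):
        differing = sum(x != y for x, y in zip(a, b))
        if differing == 1:
            contrasts.append((a, b))
    return contrasts

# The headline comparison changes teacher and recipe together,
# so it yields no clean single-factor contrast.
print(single_factor_contrasts(headline_runs))

# Adding a same-teacher run with a prior recipe isolates the recipe effect.
with_control = headline_runs + [("QwQ-32B", "prior recipe")]
print(single_factor_contrasts(with_control))
```

The first call returns an empty list, which is the referee's point: the headline comparison alone cannot separate the recipe from the teacher. The second call returns the one contrast that a same-teacher control would supply.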

minor comments (2)

[Abstract] The abstract and introduction could more explicitly state the base model used for OpenThoughts3-7B and any differences in training hyperparameters relative to the cited DeepSeek baseline.
[Tables] Tables reporting benchmark results would benefit from inclusion of standard deviations or multiple runs to convey variability, and from clearer labeling of which teacher model was used for each compared entry.
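
To give a sense of why run-to-run variability matters on small benchmarks, the sketch below computes the binomial standard error of a single-run accuracy on a benchmark of n questions. It assumes independent questions and a single evaluation run, which is a simplification and not a description of the paper's evaluation protocol. The question counts (30 for one year of AIME, 198 for GPQA Diamond) are the commonly cited benchmark sizes, and the function name is illustrative.

```python
import math

def binomial_standard_error(accuracy: float, n_questions: int) -> float:
    """Standard error of an accuracy estimated from n independent questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)

# Reported OpenThoughts3-7B accuracies paired with benchmark sizes.
benchmarks = {
    "AIME 2025": (0.53, 30),
    "GPQA Diamond": (0.54, 198),
}

for name, (acc, n) in benchmarks.items():
    se = binomial_standard_error(acc, n)
    print(f"{name}: {acc:.0%} +/- {100 * se:.1f} pp (one standard error)")
```

Under these assumptions a single AIME run carries an uncertainty of roughly 9 percentage points, against roughly 3.5 for GPQA Diamond, which is why repeated runs or reported deviations help most on the smallest benchmarks.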

Simulated Authors' Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback, which helps clarify the scope of our contributions. We address each major comment below, providing clarifications on our experimental design and noting where revisions will be made to improve transparency. We maintain that the systematic pipeline optimizations offer valuable public insights, while honestly acknowledging limitations in isolating all variables.

point-by-point responses

Referee: [Abstract and results] Abstract and results section: The headline claim attributes the 15–20 pp gains of OpenThoughts3-7B primarily to the data recipes and pipeline choices, yet the final model is distilled from QwQ-32B while the DeepSeek-R1-Distill-Qwen-7B baseline uses a different distillation source. No ablation is reported that holds the teacher model fixed while varying only the data pipeline, leaving the contribution of the recipes entangled with teacher strength and unstated training details.

Authors: We acknowledge that the teacher model (QwQ-32B) differs from the one underlying the DeepSeek-R1-Distill-Qwen-7B baseline, and that this entanglement prevents a pure isolation of data recipe effects. Our 1,000+ controlled experiments were conducted with fixed teachers within each ablation to isolate pipeline components such as data filtering, formatting, and example selection. The OpenThoughts2-1M results previously demonstrated that public data recipes can match proprietary distillation performance at the 32B scale. For the 7B model, we selected QwQ-32B as a strong, fully open teacher to maximize performance while keeping the data pipeline public. We will revise the abstract and results to more precisely attribute the gains to the combination of our optimized data recipe and distillation from QwQ-32B, and add an explicit limitations paragraph discussing teacher effects.

revision: partial

Referee: [Experiments] Experiments section: The manuscript states that 1,000+ controlled experiments were used to refine the pipeline, but provides insufficient detail on the exact controls (e.g., fixed random seeds, identical base models, statistical significance testing, or handling of confounds such as prompt formatting and filtering thresholds). This weakens the ability to attribute improvements specifically to the reported recipe changes.

Authors: We agree that additional methodological details are needed for full reproducibility and attribution. The controlled experiments held the base model, teacher, and evaluation setup fixed while varying one pipeline factor at a time (e.g., filtering threshold or prompt template). In the revised manuscript, we will expand the Experiments section with specifics on random seeds, base models used across ablations, statistical testing procedures (including confidence intervals where applicable), and explicit controls for prompt formatting and filtering thresholds.

revision: yes

Referee: [Results] Results and evaluation: The reported benchmark scores for OpenThoughts3-7B are compared only to DeepSeek-R1-Distill-Qwen-7B; additional baselines using the same teacher (QwQ-32B) or identical base model with prior data recipes are absent, making it hard to confirm that the gains reflect broad reasoning improvements rather than teacher-specific distillation effects.

Authors: The primary comparison to DeepSeek-R1-Distill-Qwen-7B follows standard practice for reporting competitive open models. We did not train an additional full-scale model using the prior OpenThoughts2 recipe with QwQ-32B due to the substantial compute required for 1.2M-example distillation. Incremental gains from OpenThoughts2 to OpenThoughts3 were validated at smaller scales through the controlled experiments. We will add a clarifying paragraph in the Results section noting the absence of same-teacher prior-recipe baselines and discussing the potential for teacher-specific effects, while emphasizing that all data and code are released to enable such follow-up work by the community.

revision: partial
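
The one-factor-at-a-time design described in the response to the Experiments comment can be sketched as a small configuration generator. Every factor name and value below is a hypothetical placeholder rather than a setting from the paper; the point is only that each ablation differs from a fixed baseline in exactly one pipeline factor.

```python
from typing import Any

# Hypothetical default pipeline configuration; the factor names and values
# are placeholders, not the paper's settings.
baseline_config: dict[str, Any] = {
    "filter_threshold": 0.5,
    "prompt_template": "template_a",
    "answers_per_question": 1,
}

# Alternative values to try for each factor (also placeholders).
alternatives: dict[str, list[Any]] = {
    "filter_threshold": [0.3, 0.7],
    "prompt_template": ["template_b"],
    "answers_per_question": [4],
}

def one_factor_at_a_time(baseline, alternatives):
    """Yield (factor, config) pairs that differ from the baseline in one factor."""
    for factor, values in alternatives.items():
        for value in values:
            config = dict(baseline)  # copy so the baseline is never mutated
            config[factor] = value
            yield factor, config

ablations = list(one_factor_at_a_time(baseline_config, alternatives))
for factor, config in ablations:
    print(factor, config)
```

With these placeholder values the generator yields four ablation configurations, each of which would be trained and evaluated against the same baseline under the same teacher, base model, and evaluation setup.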

standing simulated objections not resolved

A complete same-teacher ablation (training both prior and new data recipes with QwQ-32B at full 1.2M scale) is not feasible within the revision timeline due to computational cost.

Circularity Check

0 steps flagged

No circularity: purely empirical pipeline with released artifacts

full rationale

The paper reports results from 1000+ controlled experiments on data generation steps, scaling to 1.2M examples, and training OpenThoughts3-7B on public data distilled from QwQ-32B. All claims are direct benchmark measurements (AIME, LiveCodeBench, GPQA) with datasets and models released. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear; the work is self-contained empirical reporting without any reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical ML paper with no mathematical axioms or invented physical entities. No free parameters central to the claim beyond standard training hyperparameters.

Forward citations

Cited by 39 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
cs.CL 2026-05 unverdicted novelty 8.0

Soohak is a new 439-problem mathematician-authored benchmark showing frontier LLMs reach only 30% on research math and fail to exceed 50% on refusing ill-posed questions.
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
cs.LG 2026-04 unverdicted novelty 8.0

Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
GIANTS: Generative Insight Anticipation from Scientific Literature
cs.CL 2026-04 unverdicted novelty 8.0

GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.
Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights
cs.CL 2026-05 unverdicted novelty 7.0

TFlow enables multi-agent LLMs to collaborate via transient low-rank LoRA perturbations derived from sender activations, yielding up to 8.5 accuracy gains and 83% token reduction versus text-based baselines on Qwen3-4...
Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning
cs.AI 2026-05 unverdicted novelty 7.0

EGRSD and CL-EGRSD advance the accuracy-length frontier in LLM reasoning by entropy-guided weighting of token-level distillation signals from the teacher.
TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment
cs.AI 2026-05 unverdicted novelty 7.0

TRACE improves math reasoning by distilling only on annotator-marked critical spans with forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere, gaining 2.76 points over GRPO while preservi...
Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation
cs.CL 2026-05 unverdicted novelty 7.0

Persistent 'Rock Tokens' in on-policy distillation resist teacher corrections, consume large gradient norms, yet add negligible value to reasoning, allowing targeted bypassing to streamline alignment.
Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization
cs.LG 2026-05 unverdicted novelty 7.0

PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.
When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning
cs.CL 2026-05 unverdicted novelty 7.0

SxS Interleaved Reasoning learns when to disclose partial reasoning during generation and improves accuracy versus content-latency trade-offs on math and science benchmarks.
MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate
cs.CL 2026-05 unverdicted novelty 7.0

MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.
Shorthand for Thought: Compressing LLM Reasoning via Entropy-Guided Supertokens
cs.CL 2026-04 unverdicted novelty 7.0

Entropy-guided supertokens from BPE on reasoning traces compress LLM outputs by 8.1% on average across models and math benchmarks with no accuracy loss while exposing strategy differences between correct and incorrect traces.
Super Apriel: One Checkpoint, Many Speeds
cs.LG 2026-04 unverdicted novelty 7.0

A single 15B supernet checkpoint supports runtime switching between attention mixer placements for multiple decode speed presets while retaining 77-96% quality relative to the teacher model.
SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions
cs.AI 2026-04 unverdicted novelty 7.0

SUPERNOVA adapts instruction-tuning data for RLVR and achieves up to 52.8% relative gains on general reasoning benchmarks like BBEH through targeted task selection and mixing.
ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads
cs.LG 2026-04 unverdicted novelty 7.0

ALTO accelerates LoRA tuning up to 13.8x by monitoring loss trajectories for early stopping, using fused grouped GEMM with rank-local adapter parallelism, and combining intra- and inter-task scheduling for heterogeneo...
How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data
cs.CL 2026-03 conditional novelty 7.0

TESSY creates stylistically consistent synthetic data via teacher-student token interleaving, yielding 11.25% and 6.68% gains on code benchmarks where pure teacher data causes 3.25% and 10.02% drops.
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
cs.LG 2026-01 unverdicted novelty 7.0

A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...
Engagement Process: Rethinking the Temporal Interface of Action and Observation
cs.AI 2026-05 unverdicted novelty 6.0

Engagement Process decouples actions and observations into separate time-based event streams within a POMDP structure to explicitly model timing mismatches, deliberation latency, and multi-rate interactions.
Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

ATESD makes teacher exposure to reference reasoning a learnable control variable via a Beta-policy optimized on future student improvement, yielding gains of up to +2.33 points over fixed-exposure self-distillation on...
CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics
cs.CL 2026-05 unverdicted novelty 6.0

CLR-voyance reformulates inpatient reasoning as POMDP with clinician-validated outcome rubrics, yielding an 8B model that outperforms larger frontier models on the authors' new benchmark.
AIPO: Learning to Reason from Active Interaction
cs.CL 2026-05 unverdicted novelty 6.0

AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...
Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
cs.CL 2026-05 unverdicted novelty 6.0

MELT decouples reasoning depth from memory in looped LLMs by sharing a single gated KV cache per layer and using two-phase chunk-wise distillation from Ouro, delivering constant memory use while matching or beating st...
Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and ...
When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

SxS Interleaved Reasoning learns disclosure timing via entailment-aligned trajectories and SFT+RL training, improving accuracy-content-latency trade-offs on math and science benchmarks.
Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding
cs.AI 2026-05 unverdicted novelty 6.0

CoRD uses collaborative multi-teacher step-wise decoding with perplexity-guided beam search to generate higher-quality Long-CoT data that lets smaller models reach near-teacher performance with less supervision.
When Less is Enough: Efficient Inference via Collaborative Reasoning
cs.LG 2026-05 conditional novelty 6.0

A large model generates a compact reasoning signal that a small model uses to solve tasks, reducing the large model's output tokens by up to 60% on benchmarks like AIME and GPQA.
When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
cs.CL 2026-05 unverdicted novelty 6.0

LLM accuracy on controlled procedural arithmetic drops from 61% at 5 steps to 20% at 95 steps, with failures including skipped steps, premature answers, and hallucinated operations.
GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling
cs.CL 2026-04 unverdicted novelty 6.0

GSQ applies a Gumbel-Softmax relaxation to learn discrete grid assignments in scalar quantization, closing most of the accuracy gap to vector methods like QTIP on Llama-3.1 models at 2-3 bits while using only symmetri...
Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts
cs.LG 2026-04 unverdicted novelty 6.0

BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
Characterizing Model-Native Skills
cs.AI 2026-04 conditional novelty 6.0

Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
cs.LG 2026-04 unverdicted novelty 6.0

On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
cs.LG 2026-04 unverdicted novelty 6.0

Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
MEMENTO: Teaching LLMs to Manage Their Own Context
cs.AI 2026-04 unverdicted novelty 6.0

MEMENTO trains LLMs to segment reasoning into blocks, generate mementos as dense summaries, and reason forward using only mementos and KV states, cutting peak KV cache by ~2.5x while preserving benchmark accuracy.
Procedural Knowledge at Scale Improves Reasoning
cs.CL 2026-04 unverdicted novelty 6.0

Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks b...
On-Policy Distillation with Best-of-N Teacher Rollout Selection
cs.CV 2026-05 unverdicted novelty 5.0

BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.
On-Policy Distillation with Best-of-N Teacher Rollout Selection
cs.CV 2026-05 unverdicted novelty 5.0

BRTS improves on-policy distillation by selecting the highest-quality teacher trajectory from a small pool of samples based on correctness and alignment with the student, yielding gains on AIME and AMC math benchmarks.
On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR
cs.LG 2026-05 unverdicted novelty 5.0

RLVR exhibits implicit reward overfitting to training data and optimizes heavy-tailed singular spectra with rank-1 focus on reasoning capability.
DMax: Aggressive Parallel Decoding for dLLMs
cs.LG 2026-04 unverdicted novelty 5.0

DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
cs.AI 2025-03 unverdicted novelty 5.0

The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models
cs.CL 2025-08
