hub

OpenThoughts: Data Recipes for Reasoning Models

Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal · 2025 · cs.LG · arXiv 2506.04178

35 Pith papers cite this work. Polarity classification is still indexing.

35 Pith papers citing it

open full Pith review browse 35 citing papers arXiv PDF

abstract

Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. After initial explorations, our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improve our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as teacher yields our OpenThoughts3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond - improvements of 15.3, 17.2, and 20.5 percentage points compared to the DeepSeek-R1-Distill-Qwen-7B. All of our datasets and models are available on https://openthoughts.ai.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2

citation-polarity summary

background 2

claims ledger

abstract Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. After initial explorations, our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning
background future directions, including the integration of multi-modal reasoning, efficiency improvements, and enhanced knowledge frameworks. By providing a structured overview, this survey aims to inspire future research and further the development of reasoning large language models 1. 1 Introduction In recent years, as shown in Figure 1, the emergence of reasoning large language models (RLLMs) such as OpenAI o1 [ 307] and DeepSeek R1 [ 227] has sparked a growing body of research into Long Chain-of-Though

co-cited works

representative citing papers

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

cs.CL · 2026-05-09 · unverdicted · novelty 8.0

Soohak is a new 439-problem mathematician-authored benchmark showing frontier LLMs reach only 30% on research math and fail to exceed 50% on refusing ill-posed questions.

GIANTS: Generative Insight Anticipation from Scientific Literature

cs.CL · 2026-04-10 · unverdicted · novelty 8.0

GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.

Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

TFlow enables multi-agent LLMs to collaborate via transient low-rank LoRA perturbations derived from sender activations, yielding up to 8.5 accuracy gains and 83% token reduction versus text-based baselines on Qwen3-4B models.

Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

EGRSD and CL-EGRSD advance the accuracy-length frontier in LLM reasoning by entropy-guided weighting of token-level distillation signals from the teacher.

TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

TRACE improves math reasoning by distilling only on annotator-marked critical spans with forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere, gaining 2.76 points over GRPO while preserving OOD performance.

Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation

cs.CL · 2026-05-10 · unverdicted · novelty 7.0

Persistent 'Rock Tokens' in on-policy distillation resist teacher corrections, consume large gradient norms, yet add negligible value to reasoning, allowing targeted bypassing to streamline alignment.

Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.

When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning

cs.CL · 2026-05-05 · unverdicted · novelty 7.0 · 2 refs

SxS Interleaved Reasoning learns when to disclose partial reasoning during generation and improves accuracy versus content-latency trade-offs on math and science benchmarks.

MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.

Shorthand for Thought: Compressing LLM Reasoning via Entropy-Guided Supertokens

cs.CL · 2026-04-29 · unverdicted · novelty 7.0

Entropy-guided supertokens from BPE on reasoning traces compress LLM outputs by 8.1% on average across models and math benchmarks with no accuracy loss while exposing strategy differences between correct and incorrect traces.

Super Apriel: One Checkpoint, Many Speeds

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

A single 15B supernet checkpoint supports runtime switching between attention mixer placements for multiple decode speed presets while retaining 77-96% quality relative to the teacher model.

SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

cs.AI · 2026-04-09 · unverdicted · novelty 7.0

SUPERNOVA adapts instruction-tuning data for RLVR and achieves up to 52.8% relative gains on general reasoning benchmarks like BBEH through targeted task selection and mixing.

ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads

cs.LG · 2026-04-07 · unverdicted · novelty 7.0

ALTO accelerates LoRA tuning up to 13.8x by monitoring loss trajectories for early stopping, using fused grouped GEMM with rank-local adapter parallelism, and combining intra- and inter-task scheduling for heterogeneous workloads without quality loss.

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

cs.LG · 2026-01-26 · unverdicted · novelty 7.0

A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better token efficiency.

Engagement Process: Rethinking the Temporal Interface of Action and Observation

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

Engagement Process decouples actions and observations into separate time-based event streams within a POMDP structure to explicitly model timing mismatches, deliberation latency, and multi-rate interactions.

Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

ATESD makes teacher exposure to reference reasoning a learnable control variable via a Beta-policy optimized on future student improvement, yielding gains of up to +2.33 points over fixed-exposure self-distillation on AIME and HMMT math benchmarks.

CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics

cs.CL · 2026-05-10 · unverdicted · novelty 6.0

CLR-voyance reformulates inpatient reasoning as POMDP with clinician-validated outcome rubrics, yielding an 8B model that outperforms larger frontier models on the authors' new benchmark.

AIPO: : Learning to Reason from Active Interaction

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, then drops the agents at inference.

Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

MELT decouples reasoning depth from memory in looped LLMs by sharing a single gated KV cache per layer and using two-phase chunk-wise distillation from Ouro, delivering constant memory use while matching or beating standard LLM performance.

Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning

cs.CL · 2026-05-07 · unverdicted · novelty 6.0

A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and optimizing for pass@k during SFT before stable RLVR.

Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding

cs.AI · 2026-05-04 · unverdicted · novelty 6.0

CoRD uses collaborative multi-teacher step-wise decoding with perplexity-guided beam search to generate higher-quality Long-CoT data that lets smaller models reach near-teacher performance with less supervision.

When Less is Enough: Efficient Inference via Collaborative Reasoning

cs.LG · 2026-05-01 · conditional · novelty 6.0

A large model generates a compact reasoning signal that a small model uses to solve tasks, reducing the large model's output tokens by up to 60% on benchmarks like AIME and GPQA.

When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

cs.CL · 2026-05-01 · unverdicted · novelty 6.0

LLM accuracy on controlled procedural arithmetic drops from 61% at 5 steps to 20% at 95 steps, with failures including skipped steps, premature answers, and hallucinated operations.

GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling

cs.CL · 2026-04-20 · unverdicted · novelty 6.0

GSQ applies a Gumbel-Softmax relaxation to learn discrete grid assignments in scalar quantization, closing most of the accuracy gap to vector methods like QTIP on Llama-3.1 models at 2-3 bits while using only symmetric scalar grids.

citing papers explorer

Showing 35 of 35 citing papers.

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs cs.CL · 2026-05-09 · unverdicted · none · ref 17 · internal anchor
Soohak is a new 439-problem mathematician-authored benchmark showing frontier LLMs reach only 30% on research math and fail to exceed 50% on refusing ill-posed questions.
GIANTS: Generative Insight Anticipation from Scientific Literature cs.CL · 2026-04-10 · unverdicted · none · ref 6 · internal anchor
GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.
Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights cs.CL · 2026-05-13 · unverdicted · none · ref 56 · internal anchor
TFlow enables multi-agent LLMs to collaborate via transient low-rank LoRA perturbations derived from sender activations, yielding up to 8.5 accuracy gains and 83% token reduction versus text-based baselines on Qwen3-4B models.
Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning cs.AI · 2026-05-13 · unverdicted · none · ref 3 · internal anchor
EGRSD and CL-EGRSD advance the accuracy-length frontier in LLM reasoning by entropy-guided weighting of token-level distillation signals from the teacher.
TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment cs.AI · 2026-05-11 · unverdicted · none · ref 3 · internal anchor
TRACE improves math reasoning by distilling only on annotator-marked critical spans with forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere, gaining 2.76 points over GRPO while preserving OOD performance.
Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation cs.CL · 2026-05-10 · unverdicted · none · ref 10 · internal anchor
Persistent 'Rock Tokens' in on-policy distillation resist teacher corrections, consume large gradient norms, yet add negligible value to reasoning, allowing targeted bypassing to streamline alignment.
Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization cs.LG · 2026-05-06 · unverdicted · none · ref 6 · internal anchor
PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.
When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning cs.CL · 2026-05-05 · unverdicted · none · ref 2 · 2 links · internal anchor
SxS Interleaved Reasoning learns when to disclose partial reasoning during generation and improves accuracy versus content-latency trade-offs on math and science benchmarks.
MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate cs.CL · 2026-05-02 · unverdicted · none · ref 14 · internal anchor
MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.
Shorthand for Thought: Compressing LLM Reasoning via Entropy-Guided Supertokens cs.CL · 2026-04-29 · unverdicted · none · ref 8 · internal anchor
Entropy-guided supertokens from BPE on reasoning traces compress LLM outputs by 8.1% on average across models and math benchmarks with no accuracy loss while exposing strategy differences between correct and incorrect traces.
Super Apriel: One Checkpoint, Many Speeds cs.LG · 2026-04-21 · unverdicted · none · ref 20 · internal anchor
A single 15B supernet checkpoint supports runtime switching between attention mixer placements for multiple decode speed presets while retaining 77-96% quality relative to the teacher model.
SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions cs.AI · 2026-04-09 · unverdicted · none · ref 7 · internal anchor
SUPERNOVA adapts instruction-tuning data for RLVR and achieves up to 52.8% relative gains on general reasoning benchmarks like BBEH through targeted task selection and mixing.
ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads cs.LG · 2026-04-07 · unverdicted · none · ref 59 · internal anchor
ALTO accelerates LoRA tuning up to 13.8x by monitoring loss trajectories for early stopping, using fused grouped GEMM with rank-local adapter parallelism, and combining intra- and inter-task scheduling for heterogeneous workloads without quality loss.
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models cs.LG · 2026-01-26 · unverdicted · none · ref 3 · internal anchor
A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better token efficiency.
Engagement Process: Rethinking the Temporal Interface of Action and Observation cs.AI · 2026-05-12 · unverdicted · none · ref 45 · internal anchor
Engagement Process decouples actions and observations into separate time-based event streams within a POMDP structure to explicitly model timing mismatches, deliberation latency, and multi-rate interactions.
Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning cs.AI · 2026-05-12 · unverdicted · none · ref 8 · internal anchor
ATESD makes teacher exposure to reference reasoning a learnable control variable via a Beta-policy optimized on future student improvement, yielding gains of up to +2.33 points over fixed-exposure self-distillation on AIME and HMMT math benchmarks.
CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics cs.CL · 2026-05-10 · unverdicted · none · ref 12 · internal anchor
CLR-voyance reformulates inpatient reasoning as POMDP with clinician-validated outcome rubrics, yielding an 8B model that outperforms larger frontier models on the authors' new benchmark.
AIPO: : Learning to Reason from Active Interaction cs.CL · 2026-05-08 · unverdicted · none · ref 19 · internal anchor
AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, then drops the agents at inference.
Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models cs.CL · 2026-05-08 · unverdicted · none · ref 34 · internal anchor
MELT decouples reasoning depth from memory in looped LLMs by sharing a single gated KV cache per layer and using two-phase chunk-wise distillation from Ouro, delivering constant memory use while matching or beating standard LLM performance.
Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning cs.CL · 2026-05-07 · unverdicted · none · ref 14 · internal anchor
A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and optimizing for pass@k during SFT before stable RLVR.
Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding cs.AI · 2026-05-04 · unverdicted · none · ref 23 · internal anchor
CoRD uses collaborative multi-teacher step-wise decoding with perplexity-guided beam search to generate higher-quality Long-CoT data that lets smaller models reach near-teacher performance with less supervision.
When Less is Enough: Efficient Inference via Collaborative Reasoning cs.LG · 2026-05-01 · conditional · none · ref 12 · internal anchor
A large model generates a compact reasoning signal that a small model uses to solve tasks, reducing the large model's output tokens by up to 60% on benchmarks like AIME and GPQA.
When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models cs.CL · 2026-05-01 · unverdicted · none · ref 48 · internal anchor
LLM accuracy on controlled procedural arithmetic drops from 61% at 5 steps to 20% at 95 steps, with failures including skipped steps, premature answers, and hallucinated operations.
GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling cs.CL · 2026-04-20 · unverdicted · none · ref 16 · internal anchor
GSQ applies a Gumbel-Softmax relaxation to learn discrete grid assignments in scalar quantization, closing most of the accuracy gap to vector methods like QTIP on Llama-3.1 models at 2-3 bits while using only symmetric scalar grids.
Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts cs.LG · 2026-04-20 · unverdicted · none · ref 10 · internal anchor
BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
Characterizing Model-Native Skills cs.AI · 2026-04-19 · conditional · none · ref 88 · internal anchor
Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe cs.LG · 2026-04-14 · unverdicted · none · ref 6 · internal anchor
On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation cs.LG · 2026-04-14 · unverdicted · none · ref 7 · 2 links · internal anchor
Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
MEMENTO: Teaching LLMs to Manage Their Own Context cs.AI · 2026-04-10 · unverdicted · none · ref 7 · internal anchor
MEMENTO trains LLMs to segment reasoning into blocks, generate mementos as dense summaries, and reason forward using only mementos and KV states, cutting peak KV cache by ~2.5x while preserving benchmark accuracy.
Procedural Knowledge at Scale Improves Reasoning cs.CL · 2026-04-01 · unverdicted · none · ref 7 · internal anchor
Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks by up to 19.2%.
On-Policy Distillation with Best-of-N Teacher Rollout Selection cs.CV · 2026-05-10 · unverdicted · none · ref 14 · 2 links · internal anchor
BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.
On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR cs.LG · 2026-05-07 · unverdicted · none · ref 10 · internal anchor
RLVR exhibits implicit reward overfitting to training data and optimizes heavy-tailed singular spectra with rank-1 focus on reasoning capability.
DMax: Aggressive Parallel Decoding for dLLMs cs.LG · 2026-04-09 · unverdicted · none · ref 26 · internal anchor
DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly 2-3x on benchmarks while preserving accuracy.
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models cs.AI · 2025-03-12 · unverdicted · none · ref 227 · internal anchor
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models cs.CL · 2025-08-21 · unreviewed · ref 7 · internal anchor

OpenThoughts: Data Recipes for Reasoning Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer