OpenThoughts: Data Recipes for Reasoning Models

Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal · 2025 · cs.LG · arXiv 2506.04178

Mixed citation behavior. Most common role is background (50%).

57 Pith papers citing it

abstract

Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. After initial explorations, our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improve our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as teacher yields our OpenThoughts3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond - improvements of 15.3, 17.2, and 20.5 percentage points compared to the DeepSeek-R1-Distill-Qwen-7B. All of our datasets and models are available on https://openthoughts.ai.

citation-role summary

background 10 · dataset 7 · baseline 1

citation-polarity summary

background 9 · use dataset 7 · baseline 1 · unclear 1

claims ledger

background future directions, including the integration of multi-modal reasoning, efficiency improvements, and enhanced knowledge frameworks. By providing a structured overview, this survey aims to inspire future research and further the development of reasoning large language models. 1 Introduction. In recent years, as shown in Figure 1, the emergence of reasoning large language models (RLLMs) such as OpenAI o1 [307] and DeepSeek R1 [227] has sparked a growing body of research into Long Chain-of-Though

representative citing papers

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

cs.CL · 2026-05-09 · unverdicted · novelty 8.0 · 2 refs

Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.

GIANTS: Generative Insight Anticipation from Scientific Literature

cs.CL · 2026-04-10 · unverdicted · novelty 8.0

GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.

EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation

cs.AI · 2026-05-22 · unverdicted · novelty 7.0

EDGE-OPD adds guided rollouts and evidence masking to on-policy self-distillation, enabling successful learning of target identities where standard OPSD and RLSD fail.

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.

Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation

cs.LG · 2026-05-16 · unverdicted · novelty 7.0

Decoupling prefix source from token-level KL direction in autoregressive sequence KL yields four objectives unifying SFT, DAgger, offline RL and OPD, with KL mixing and entropy-gated curriculum improving math reasoning accuracy and shortening responses.

Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

TFlow enables multi-agent LLMs to collaborate via transient low-rank LoRA perturbations derived from sender activations, yielding up to 8.5 accuracy gains and 83% token reduction versus text-based baselines on Qwen3-4B models.

Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

EGRSD and CL-EGRSD advance the accuracy-length frontier in LLM reasoning by entropy-guided weighting of token-level distillation signals from the teacher.

TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

TRACE improves math reasoning by distilling only on annotator-marked critical spans with forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere, gaining 2.76 points over GRPO while preserving OOD performance.

Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.

When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning

cs.CL · 2026-05-05 · unverdicted · novelty 7.0 · 2 refs

SxS Interleaved Reasoning learns when to disclose partial reasoning during generation and improves the trade-off between accuracy and content latency on math and science benchmarks.

MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.

Shorthand for Thought: Compressing LLM Reasoning via Entropy-Guided Supertokens

cs.CL · 2026-04-29 · unverdicted · novelty 7.0

Entropy-guided supertokens from BPE on reasoning traces compress LLM outputs by 8.1% on average across models and math benchmarks with no accuracy loss while exposing strategy differences between correct and incorrect traces.

Super Apriel: One Checkpoint, Many Speeds

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

A single 15B supernet checkpoint supports runtime switching between attention mixer placements for multiple decode speed presets while retaining 77-96% quality relative to the teacher model.

DMax: Aggressive Parallel Decoding for dLLMs

cs.LG · 2026-04-09 · conditional · novelty 7.0 · 2 refs

DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.

ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads

cs.LG · 2026-04-07 · unverdicted · novelty 7.0

ALTO accelerates LoRA tuning up to 13.8x by monitoring loss trajectories for early stopping, using fused grouped GEMM with rank-local adapter parallelism, and combining intra- and inter-task scheduling for heterogeneous workloads without quality loss.

How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data

cs.CL · 2026-03-23 · conditional · novelty 7.0

TESSY creates stylistically consistent synthetic data via teacher-student token interleaving, yielding 11.25% and 6.68% gains on code benchmarks where pure teacher data causes 3.25% and 10.02% drops.

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

cs.LG · 2026-01-26 · unverdicted · novelty 7.0

A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better token efficiency.

Scaling Latent Reasoning via Looped Language Models

cs.CL · 2025-10-29 · unverdicted · novelty 7.0

Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.

MIST: A Co-Design Framework for Heterogeneous, Multi-Stage LLM Inference

cs.AR · 2025-04-14 · unverdicted · novelty 7.0

MIST is a new simulator for heterogeneous multi-stage LLM inference that combines hardware traces with analytical models to explore configuration trade-offs in hybrid CPU-accelerator systems.

When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

Position-Weighted On-Policy Self-Distillation (PW-OPSD) weights later tokens more heavily after a diagnostic shows position predicts teacher reliability better than entropy, yielding +1.0 and +1.1 Avg@12 gains on AIME 2024/2025.

Introspective X Training: Feedback Conditioning Improves Scaling Across all LLM Training Stages

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

Introspective Training annotates data with natural-language feedback from a thinking reward model and conditions all LLM training stages on that feedback, bending scaling curves for up to 2.8x compute efficiency gains and superior math/code performance.

Self-Supervised On-Policy Distillation for Reasoning Language Models

cs.LG · 2026-05-17 · unverdicted · novelty 6.0

SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.

PreFT: Prefill-only finetuning for efficient inference

cs.LG · 2026-05-14 · accept · novelty 6.0

Prefill-only adaptation of LLMs yields 1.9x higher throughput for 512 adapters on Llama 3.1 70B with near-parity performance on RL tasks and recoverable loss on SFT.

CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics

cs.CL · 2026-05-10 · unverdicted · novelty 6.0

CLR-voyance reformulates inpatient reasoning as a POMDP with clinician-validated outcome rubrics, yielding an 8B model that outperforms larger frontier models on the authors' new benchmark.

citing papers explorer

Showing 11 of 11 citing papers after filters.

EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation

cs.AI · 2026-05-22 · unverdicted · none · ref 3 · internal anchor

EDGE-OPD adds guided rollouts and evidence masking to on-policy self-distillation, enabling successful learning of target identities where standard OPSD and RLSD fail.

Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning

cs.AI · 2026-05-13 · unverdicted · none · ref 3 · internal anchor

EGRSD and CL-EGRSD advance the accuracy-length frontier in LLM reasoning by entropy-guided weighting of token-level distillation signals from the teacher.

TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment

cs.AI · 2026-05-11 · unverdicted · none · ref 3 · internal anchor

TRACE improves math reasoning by distilling only on annotator-marked critical spans with forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere, gaining 2.76 points over GRPO while preserving OOD performance.

Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding

cs.AI · 2026-05-04 · unverdicted · none · ref 23 · internal anchor

CoRD uses collaborative multi-teacher step-wise decoding with perplexity-guided beam search to generate higher-quality Long-CoT data that lets smaller models reach near-teacher performance with less supervision.

Characterizing Model-Native Skills

cs.AI · 2026-04-19 · conditional · none · ref 88 · internal anchor

Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.

MEMENTO: Teaching LLMs to Manage Their Own Context

cs.AI · 2026-04-10 · unverdicted · none · ref 7 · internal anchor

MEMENTO trains LLMs to segment reasoning into blocks, generate mementos as dense summaries, and reason forward using only mementos and KV states, cutting peak KV cache by ~2.5x while preserving benchmark accuracy.

Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

cs.AI · 2025-10-05 · unverdicted · none · ref 95 · internal anchor

A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

cs.AI · 2025-03-12 · unverdicted · none · ref 227 · internal anchor

The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

Engagement Process: Rethinking the Temporal Interface of Action and Observation

cs.AI · 2026-05-12 · unreviewed · ref 45 · internal anchor

Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

cs.AI · 2026-05-12 · unreviewed · ref 8 · internal anchor

SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

cs.AI · 2026-04-09 · unreviewed · ref 7 · internal anchor
