arXiv: 2503.14476 · v2 · submitted 2025-03-18 · cs.LG · cs.CL

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Bole Ma, Chengyi Wang, Chi Zhang, Gaohong Liu, Guangming Sheng, Haibin Lin, Hang Zhu, Hao Zhou, Hongli Yu, Jiangjie Chen, Jiaze Chen, Jingjing Liu, Jinhua Zhu, Lingjun Liu, Lin Yan, Mingxuan Wang, Mofan Zhang, Mu Qiao, Qiying Yu, Ruofei Zhu, Tiantian Fan, Wang Zhang, Weinan Dai, Wei-Ying Ma, Xiangpeng Wei, Xiaochen Zuo, Xin Liu, Ya-Qin Zhang, Yonghui Wu, Yufeng Yuan, Yuxuan Song, Yuxuan Tong, Yu Yue, Zheng Zhang, Zhiqi Lin

keywords: open-source, large-scale, reasoning, system, training, algorithm, DAPO

Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (as in the OpenAI o1 blog and the DeepSeek R1 technical report), so the community still struggles to reproduce their RL training results. We propose the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm, and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using the Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.
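
The abstract names the two ideas behind the acronym without defining them. As a rough illustration only, the sketch below shows what a decoupled clip (separate lower and upper clipping bounds on the importance ratio in a PPO-style surrogate) and a dynamic sampling filter (dropping prompts whose sampled responses all receive the same reward, and therefore carry no group-relative signal) might look like. The function names and the default values of `eps_low` and `eps_high` are assumptions made for this example, not taken from this page or from the released code.

```python
from typing import List


def decoupled_clip_term(ratio: float, advantage: float,
                        eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """Per-token clipped surrogate with independent lower and upper bounds.

    eps_low and eps_high are illustrative defaults (an assumption of this
    sketch); setting them equal recovers the usual symmetric PPO clip.
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic choice between the unclipped and the clipped objective.
    return min(ratio * advantage, clipped * advantage)


def keep_prompt(rewards: List[float]) -> bool:
    """Dynamic sampling filter (hypothetical helper).

    A prompt is kept only if its sampled responses do not all share the
    same reward; otherwise every group-relative advantage would be zero.
    """
    return max(rewards) != min(rewards)


def group_advantages(rewards: List[float]) -> List[float]:
    """Group-relative advantages: rewards standardised within one prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]


if __name__ == "__main__":
    groups = {"easy": [1.0, 1.0, 1.0], "mixed": [1.0, 0.0, 0.0, 1.0]}
    for name, rewards in groups.items():
        if keep_prompt(rewards):
            print(name, group_advantages(rewards))
        else:
            print(name, "filtered out: no learning signal")
```

With a larger upper bound than lower bound, a token with a positive advantage may have its probability ratio grow further before the objective is clipped, while decreases stay bounded as in the symmetric case.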

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

    cs.LG 2026-05 conditional novelty 8.0

    ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...

  2. EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

    cs.CV 2026-04 unverdicted novelty 8.0

    EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

  3. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 8.0

    Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.

  4. Learning Agentic Policy from Action Guidance

    cs.CL 2026-05 unverdicted novelty 7.0

    ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

  5. Towards Order Fairness: Mitigating LLMs Order Sensitivity through Dual Group Advantage Optimization

    cs.LG 2026-05 unverdicted novelty 7.0

    DGAO uses reinforcement learning to optimize LLMs for both accuracy and order stability by balancing intra-group accuracy advantages and inter-group stability advantages.

  6. StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

    cs.SE 2026-05 unverdicted novelty 7.0

    StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.

  7. GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    GEAR reshapes GRPO trajectory advantages using divergence signals from a ground-truth-conditioned teacher to create adaptive token- and segment-level credit regions.

  8. AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive

    cs.AI 2026-05 unverdicted novelty 7.0

    AutoLLMResearch trains agents via a multi-fidelity environment and MDP pipeline to extrapolate configuration principles from inexpensive to costly LLM experiments.

  9. Breaking Winner-Takes-All: Cooperative Policy Optimization Improves Diverse LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    GCPO shifts RLVR from rollout competition to team cooperation by assigning advantages via marginal contributions to a determinant-based coverage volume over semantic embeddings, yielding higher accuracy and solution d...

  10. Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR

    cs.LG 2026-05 unverdicted novelty 7.0

    RLRT augments GRPO by reinforcing tokens on correct student rollouts that the teacher would not have predicted, outperforming standard self-distillation and exploration baselines on Qwen3 models.

  11. Relative Score Policy Optimization for Diffusion Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.

  12. LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning ...

  13. Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning

    cs.CV 2026-05 unverdicted novelty 7.0

    RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multi...

  14. SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    Multimodal AI models for physics reasoning lose performance when information shifts from text to images, and RLVR training gains often come from non-visual textual or distributional cues rather than actual visual evidence.

  15. SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    SeePhys Pro benchmark reveals multimodal models degrade on physics reasoning as information transfers from text to images, with blind training improvements often stemming from textual cues rather than visual evidence.

  16. Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

    cs.AI 2026-05 unverdicted novelty 7.0

    An exploration-aware RL framework lets LLM agents adaptively explore only under high uncertainty via variational rewards and action grouping, yielding consistent gains on text and GUI agent benchmarks.

  17. CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization

    cs.LG 2026-05 unverdicted novelty 7.0

    CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.

  18. BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preservin...

  19. The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits

    cs.LG 2026-05 unverdicted novelty 7.0

    The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...

  20. DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards

    cs.LG 2026-05 unverdicted novelty 7.0

    DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.

  21. KL for a KL: On-Policy Distillation with Control Variate Baseline

    cs.LG 2026-05 unverdicted novelty 7.0

    vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...

  22. Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning

    cs.CL 2026-05 unverdicted novelty 7.0

    Attention entropy splits RL training tokens into stable anchors and volatile explorers, and entropy-aware reweighting improves held-out reasoning performance.

  23. Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

    cs.LG 2026-05 unverdicted novelty 7.0

    POISE trains a lightweight probe on the actor's internal states to predict expected rewards for RLVR, matching DAPO performance on math benchmarks with lower compute by avoiding extra rollouts or critic models.

  24. Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

    cs.LG 2026-05 unverdicted novelty 7.0

    POISE estimates value baselines for RL in LLMs from the actor's internal states via a lightweight probe and cross-rollout construction, matching DAPO performance with lower compute on math reasoning benchmarks.

  25. Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance

    cs.CL 2026-05 unverdicted novelty 7.0

    Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.

  26. Rubric-based On-policy Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance an...

  27. Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

    cs.LG 2026-05 unverdicted novelty 7.0

    The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on ...

  28. Teaching Language Models to Think in Code

    cs.CL 2026-05 unverdicted novelty 7.0

    ThinC trains small models to reason primarily in code rather than natural language, outperforming tool-integrated baselines and even larger models on competition math benchmarks.

  29. Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs

    cs.CL 2026-05 unverdicted novelty 7.0

    RL on binary rewards boosts LLM factual recall by ~27% relative across models by redistributing probability mass to latent correct answers rather than acquiring new knowledge.

  30. Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR

    cs.LG 2026-05 unverdicted novelty 7.0

    HORA adaptively allocates rollouts using hit utility to improve Pass@K over compute-matched GRPO on math reasoning benchmarks while preserving Pass@1.

  31. Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

    cs.CL 2026-05 unverdicted novelty 7.0

    POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...

  32. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.

  33. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.

  34. Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...

  35. PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

    cs.LG 2026-05 unverdicted novelty 7.0

    PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...

  36. Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL

    cs.LG 2026-05 conditional novelty 7.0

    A one-parameter early-termination gate based on mean pairwise prefix edit distance reduces wall-clock time by 10.7% and raises held-out success by 2.5 pp in GRPO on ALFWorld by cutting zero-advantage batch dilution.

  37. Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization

    cs.LG 2026-05 unverdicted novelty 7.0

    PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.

  38. Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers

    cs.LG 2026-05 unverdicted novelty 7.0

    SIOP enables turn-level credit assignment in LLM agents via semantic clustering of final answers as latent outcomes, improving performance on reasoning benchmarks without verifiers.

  39. SCOUT: Active Information Foraging for Long-Text Understanding with Decoupled Epistemic States

    cs.CL 2026-05 unverdicted novelty 7.0

    SCOUT achieves state-of-the-art long-text understanding with up to 8x lower token use by actively foraging for sparse query-relevant information and updating a compact provenance-grounded epistemic state.

  40. GRPO-TTA: Test-Time Visual Tuning for Vision-Language Models via GRPO-Driven Reinforcement Learning

    cs.CV 2026-05 unverdicted novelty 7.0

    GRPO-TTA applies GRPO to test-time visual tuning of vision-language models via group-wise policy optimization on unlabeled class candidates, outperforming prior TTA methods especially under natural distribution shifts.

  41. Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent

    cs.LG 2026-05 unverdicted novelty 7.0

    Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating c...

  42. MIRL: Mutual Information-Guided Reinforcement Learning for Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    MIRL uses mutual information to guide trajectory selection and provide separate rewards for visual perception in RLVR for VLMs, achieving 70.22% average accuracy with 25% fewer full trajectories.

  43. Faithful Mobile GUI Agents with Guided Advantage Estimator

    cs.AI 2026-05 unverdicted novelty 7.0

    Faithful-Agent raises Trap SR in GUI agents from 13.88% to 80.21% via faithfulness-oriented SFT and GuAE-enhanced RFT with consistency rewards while retaining general performance.

  44. Structure Liberates: How Constrained Sensemaking Produces More Novel Research Output

    cs.CL 2026-05 unverdicted novelty 7.0

    LLMs trained on reconstructing known research ideation trajectories from citations outperform those inferring novel directions, yielding better novelty, diversity, and downstream research artifact quality.

  45. ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.

  46. Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis

    cs.CL 2026-04 unverdicted novelty 7.0

    DataPRM is a new process reward model for data analysis agents that detects silent errors via environment interaction and ternary rewards, yielding 7-11% gains on benchmarks and further RL improvements.

  47. Objective Shaping with Hard Negatives: Windowed Partial AUC Optimization for RL-based LLM Recommenders

    cs.IR 2026-04 unverdicted novelty 7.0

    Beam-search negatives induce partial AUC optimization in GRPO for LLM recommenders; Windowed Partial AUC and TAWin improve Top-K alignment on four datasets.

  48. Near-Future Policy Optimization

    cs.LG 2026-04 unverdicted novelty 7.0

    NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating co...

  49. EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training

    cs.LG 2026-04 unverdicted novelty 7.0

    EVPO adaptively switches between critic-based and batch-mean advantage estimation using batch-level explained variance to provably achieve no greater variance than the better of PPO or GRPO at every step.

  50. ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation

    cs.CL 2026-04 unverdicted novelty 7.0

    ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.

  51. Fine-Tuning Small Reasoning Models for Quantum Field Theory

    cs.LG 2026-04 unverdicted novelty 7.0

    Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.

  52. Neural Garbage Collection: Learning to Forget while Learning to Reason

    cs.LG 2026-04 conditional novelty 7.0

    Language models learn to evict KV cache entries end-to-end via reinforcement learning from outcome reward alone, achieving 2-3x cache compression while maintaining accuracy on Countdown, AMC, and AIME tasks.

  53. Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction

    cs.LG 2026-04 unverdicted novelty 7.0

    EqLen is a sample-construction framework that builds equal-length paired segments via dual-track generation and masking for stable group-relative RL in sequences, reframing the length problem as a comparison-unit issu...

  54. S-GRPO: Unified Post-Training for Large Vision-Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.

  55. SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees

    cs.LG 2026-04 unverdicted novelty 7.0

    SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.

  56. KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance

    cs.AI 2026-04 unverdicted novelty 7.0

    KnowRL decomposes RL guidance into atomic knowledge points and uses Constrained Subset Search to build minimal-sufficient subsets, yielding 70.08 average accuracy without hints and 74.16 with them on 1.5B-scale models...

  57. SCORP: Scene-Consistent Multi-agent Diffusion Planning with Stable Online Reinforcement Post-Training for Cooperative Driving

    cs.RO 2026-04 unverdicted novelty 7.0

    SCORP delivers 10-28% gains in safety and 2-7% in efficiency metrics on WOMD by using dual-path scene conditioning in diffusion planning plus variance-gated group-relative policy optimization for closed-loop stability.

  58. Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration

    cs.LG 2026-04 unverdicted novelty 7.0

    NExt accelerates RLVR training for LLMs by nonlinearly extrapolating low-rank parameter trajectories extracted from LoRA runs.

  59. Retrieval as Generation: A Unified Framework with Self-Triggered Information Planning

    cs.CL 2026-04 unverdicted novelty 7.0

    GRIP integrates retrieval into autoregressive generation through self-triggered control tokens for dynamic query planning, outperforming RAG baselines on QA benchmarks with fewer parameters than GPT-4o.

  60. MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    MMR-AD is a new benchmark dataset showing that current generalist MLLMs lag industrial needs for anomaly detection, with Anomaly-R1 delivering better results through reasoning and RL.