
arxiv: 2503.14476 · v2 · submitted 2025-03-18 · 💻 cs.LG · cs.CL

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Pith reviewed 2026-05-22 23:30 UTC · model grok-4.3

keywords LLM reinforcement learning · policy optimization · reasoning models · open-source system · AIME benchmark · DAPO algorithm · large-scale training · reproducibility

The pith

The DAPO algorithm, built on four techniques, lets an open-source RL system reach 50 points on AIME 2024 using Qwen2.5-32B.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Decoupled Clip and Dynamic sAmpling Policy Optimization algorithm to make large-scale reinforcement learning for language models reproducible. It pairs this method with fully open-sourced training code on the verl framework and a curated dataset. The resulting system attains 50 points on the AIME 2024 benchmark starting from the Qwen2.5-32B base model. Four specific techniques, including decoupled clipping and dynamic sampling, are presented as the elements that turn large-scale LLM RL into a practical success. By releasing these components the authors aim to let others replicate and extend the results instead of relying on withheld details from closed systems.

Core claim

The DAPO algorithm, built around decoupled clipping and dynamic sampling together with two additional techniques, combined with open-sourced code and dataset, produces a large-scale RL system that reaches 50 points on AIME 2024 when applied to the Qwen2.5-32B base model.

What carries the argument

The DAPO algorithm and its four techniques: decoupled clipping, dynamic sampling, and two supporting methods, which together stabilize and improve policy optimization at LLM scale.
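
To make the decoupled-clipping idea concrete, here is a minimal Python sketch of a per-token clipped surrogate in which the lower and upper clip bounds are set independently. The function name, the argument names, and the default bounds 0.2 and 0.28 are assumptions chosen for illustration; the summary above does not specify them, and this is not the released verl implementation.

```python
def decoupled_clip_term(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token clipped surrogate with separate lower and upper clip ranges.

    ratio:     new-policy probability divided by old-policy probability
    advantage: estimated advantage for this token
    eps_low, eps_high: illustrative defaults (assumed, not taken from the page)
    """
    # Clip the ratio into [1 - eps_low, 1 + eps_high]; the two bounds are
    # independent, unlike the single epsilon of the standard PPO clip.
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic (minimum) choice between unclipped and clipped objectives.
    return min(ratio * advantage, clipped_ratio * advantage)


# Symmetric clipping caps a ratio of 1.25 at 1.2 for a positive advantage,
# while the wider upper bound lets the same ratio through unclipped.
symmetric = decoupled_clip_term(1.25, 1.0, eps_low=0.2, eps_high=0.2)
decoupled = decoupled_clip_term(1.25, 1.0, eps_low=0.2, eps_high=0.28)
```

In this sketch the only change from a standard clipped objective is that the upper bound can be raised without loosening the lower one, which is one plausible reading of "decoupled".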

If this is right

  • Community members can now reproduce the reported AIME performance without access to proprietary details.
  • The open-sourced system lowers the barrier for experimenting with reinforcement learning on other base models.
  • Future work can isolate the contribution of each of the four techniques by ablating them within the released framework.
  • Training runs become more transparent, allowing direct comparison of implementation choices across different labs.
  • The combination of algorithm, code, and data supports scaling studies that were previously blocked by secrecy.
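
The ablation idea in the list above can be sketched as a leave-one-out plan: one run with all four techniques, one run per technique with only that technique disabled, and one baseline with none. This is a hypothetical planning helper, not part of the released code. The page names only two of the four techniques, so the other two appear under placeholder names.

```python
# Placeholder names: only the first two techniques are named on this page.
TECHNIQUES = (
    "decoupled_clipping",
    "dynamic_sampling",
    "supporting_technique_a",  # hypothetical placeholder
    "supporting_technique_b",  # hypothetical placeholder
)


def leave_one_out_plan(techniques=TECHNIQUES):
    """Return (run_name, config) pairs for a leave-one-out ablation.

    Base model and dataset are assumed to be held fixed across all runs,
    so only the technique toggles differ between configurations.
    """
    full = {name: True for name in techniques}
    plan = [("full", full)]
    for name in techniques:
        config = dict(full)
        config[name] = False  # disable exactly one technique
        plan.append((f"without_{name}", config))
    plan.append(("baseline", {name: False for name in techniques}))
    return plan


for run_name, config in leave_one_out_plan():
    enabled = [name for name, on in config.items() if on]
    print(run_name, "->", enabled)
```

Six runs is the minimum that isolates each technique's marginal effect; a full factorial design over four toggles would need sixteen.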

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same four techniques might transfer to models larger than 32B if the open-sourced code is adapted.
  • Dataset curation effects could be measured separately by swapping in new data while holding the algorithm fixed.
  • Closed models that currently lead on reasoning benchmarks could face pressure once the open system is widely used.
  • Extending the dynamic sampling component to other policy-gradient methods outside LLM RL is a direct next test.
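
The dynamic sampling idea can be illustrated with a small sketch, assuming group-based advantages of the GRPO kind, where each prompt is sampled several times and rewards are normalized within the group. Under that assumption, a group whose rewards are all identical gives every sample zero advantage and so contributes no gradient, and a filter can drop such groups before the update. The helper names and the binary rewards are hypothetical, and the released code may differ.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """Normalize rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All responses scored the same: no learning signal from this group.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


def keep_informative_groups(groups):
    """Keep only prompts whose sampled responses received differing rewards.

    groups: dict mapping a prompt id to the list of rewards of its samples.
    """
    return {pid: rs for pid, rs in groups.items() if len(set(rs)) > 1}


# Hypothetical rollout results: 1 = correct answer, 0 = incorrect answer.
rollouts = {
    "prompt_a": [1, 1, 1, 1],  # always solved: filtered out
    "prompt_b": [0, 0, 0, 0],  # never solved: filtered out
    "prompt_c": [1, 0, 0, 1],  # mixed outcomes: kept
}
kept = keep_informative_groups(rollouts)
```

The sketch shows only the filter. Refilling the batch by sampling further prompts until enough informative groups are collected is left out.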

Load-bearing premise

The reported performance gains come primarily from the four techniques rather than from the base model choice or dataset curation choices.

What would settle it

Running the released code and dataset on Qwen2.5-32B and obtaining substantially less than 50 points on AIME 2024 would falsify the claim that the techniques make large-scale LLM RL successful.

Original abstract

Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in OpenAI o1 blog and DeepSeek R1 technical report), thus the community still struggles to reproduce their RL training results. We propose the $\textbf{D}$ecoupled Clip and $\textbf{D}$ynamic s$\textbf{A}$mpling $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{DAPO}$) algorithm, and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm for large-scale LLM reinforcement learning. It claims to achieve 50 points on AIME 2024 using the Qwen2.5-32B base model and fully open-sources a state-of-the-art RL system, including training code built on the verl framework, a curated dataset, and details on four key techniques to address reproducibility issues in prior closed systems.

Significance. If the performance result holds and is attributable to the proposed techniques, the work would be significant for providing the first fully open-source large-scale LLM RL system with concrete benchmark results, directly addressing the opacity of systems like OpenAI o1 and DeepSeek R1. The open-sourcing of code, dataset, and techniques is a clear strength that enables community verification and extension.

major comments (2)
  1. [Abstract] The claim that the four key techniques 'make large-scale LLM RL a success' is load-bearing for the central contribution but is not supported by ablations that hold the dataset and Qwen2.5-32B base model fixed while comparing DAPO only against a standard PPO/GRPO baseline. Without such controls, the attribution of the 50-point AIME 2024 result to the algorithmic changes (rather than data curation or base-model capabilities) cannot be verified.
  2. [Abstract] The manuscript asserts a concrete benchmark score of 50 on AIME 2024 but provides no derivation, ablation data, or error analysis in the presented text to connect the result to the four techniques; this leaves the performance claim unverified against the stated methods.
minor comments (1)
  1. The exact evaluation protocol for the AIME 2024 score (e.g., pass@1, average over multiple samples, or strict correctness) should be stated explicitly to allow precise replication and comparison.
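
To show why the evaluation protocol matters, the following sketch scores the same hypothetical sampling results under two common conventions: mean accuracy over n samples per problem, and pass@k using the standard unbiased estimator 1 - C(n-c, k)/C(n, k). The data and function names are invented for illustration, and nothing here indicates which protocol the paper actually used.

```python
from math import comb


def avg_at_n(samples):
    """Fraction of correct samples for one problem (mean single-sample accuracy)."""
    return sum(samples) / len(samples)


def pass_at_k(samples, k):
    """Unbiased pass@k estimate for one problem; requires k <= len(samples)."""
    n = len(samples)
    c = sum(samples)  # number of correct samples
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results, metric):
    """Average a per-problem metric over all problems, on a 0-100 scale."""
    return 100.0 * sum(metric(s) for s in results) / len(results)


# Hypothetical results: three problems, four samples each (True = correct).
results = [
    [True, False, False, False],
    [False, False, False, False],
    [True, True, True, True],
]
mean_accuracy = benchmark_score(results, avg_at_n)
pass_at_4 = benchmark_score(results, lambda s: pass_at_k(s, 4))
```

On this toy data the two protocols give roughly 41.7 and 66.7, so a bare score such as "50 points" cannot be compared across papers without knowing which one was used.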

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. The comments correctly identify areas where the manuscript's claims could be more precisely supported by evidence. We address each point below and will revise accordingly.

Point-by-point responses
  1. Referee: [Abstract] The claim that the four key techniques 'make large-scale LLM RL a success' is load-bearing for the central contribution but is not supported by ablations that hold the dataset and Qwen2.5-32B base model fixed while comparing DAPO only against a standard PPO/GRPO baseline. Without such controls, the attribution of the 50-point AIME 2024 result to the algorithmic changes (rather than data curation or base-model capabilities) cannot be verified.

    Authors: We agree that the abstract phrasing attributes success to the four techniques without the precise controlled ablations described. The manuscript presents DAPO as the core algorithmic contribution within an open-sourced system, with the 50-point result obtained using those techniques on the stated base model and dataset. However, the referee is correct that direct attribution requires ablations holding data and base model fixed against a standard baseline. We will add such controlled experiments (or clarify their absence if resource-constrained) in a revised version or appendix to strengthen this claim.

    Revision: yes

  2. Referee: [Abstract] The manuscript asserts a concrete benchmark score of 50 on AIME 2024 but provides no derivation, ablation data, or error analysis in the presented text to connect the result to the four techniques; this leaves the performance claim unverified against the stated methods.

    Authors: The 50-point AIME 2024 score is the end-to-end result of the fully described DAPO system. The abstract summarizes this outcome, while the body details the four techniques and training setup. The referee correctly notes the absence of explicit derivation, ablation tables, or error analysis directly linking the score to each technique within the presented text. We will revise the abstract and add cross-references or a concise summary table in the main text to better connect the result to the methods, drawing from any available internal logs or additional analysis.

    Revision: yes

Circularity Check

0 steps flagged

No circularity in claimed derivation or results

Full rationale

The manuscript presents an empirical system and benchmark result (50 on AIME 2024) using an external, independently verifiable test set. No equations, fitted parameters, or self-citations are shown that reduce the performance claim to a definition or input by construction. The four techniques are asserted to drive success, but the benchmark itself is not derived from or equivalent to those techniques; it remains an independent external measure. The paper is therefore self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical systems contribution; no mathematical axioms, free parameters fitted to the target result, or invented entities are described in the abstract.

pith-pipeline@v0.9.0 · 5848 in / 1073 out tokens · 36992 ms · 2026-05-22T23:30:09.744924+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts

    cs.LG 2026-05 conditional novelty 8.0

    DualKV is a new FlashAttention variant that shares prompt KV across multiple rollouts in RL training, delivering 1.63-3.82x speedups on 8B-30B models while remaining mathematically identical to standard attention.

  2. ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

    cs.LG 2026-05 conditional novelty 8.0

    ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...

  3. EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

    cs.CV 2026-04 unverdicted novelty 8.0

    EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

  4. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 8.0

    Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.

  5. DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

    cs.CL 2025-04 conditional novelty 8.0

    DeepMath-103K is a new 103K-problem mathematical dataset with high difficulty, rigorous decontamination, and verifiable answers to support RL training of language-model reasoning.

  6. Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval

    cs.CV 2026-05 unverdicted novelty 7.0

    ToolMerge decomposes queries into LLM-planned tool calls merged by boolean operators for long-video keyframe retrieval and introduces the M2M benchmark, showing competitive results with 5% gains on caption retrieval.

  7. DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection

    cs.CV 2026-05 unverdicted novelty 7.0

    A reinforcement-learned vision-language agent adaptively selects and fuses monocular depth experts per sample for better performance across camera geometries.

  8. Learnability-Informed Fine-Tuning of Diffusion Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    LIFT is a learnability-informed SFT algorithm for diffusion LMs that aligns token difficulty with diffusion time steps, yielding up to 3x gains on AIME'24 and AIME'25 over standard SFT baselines.

  9. MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks

    cs.CV 2026-05 unverdicted novelty 7.0

    MAVEN pipeline generates multi-scale spatio-temporal event descriptions from videos using agentic adaptation and refinement, then produces training data that lets a fine-tuned 8B model outperform Gemini baselines on p...

  10. ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

    cs.CV 2026-05 unverdicted novelty 7.0

    ParaVT is a parallel video tool-calling RL framework that resolves the Tool Prior Paradox via PARA-GRPO, delivering +7.9% average gains on six long-video benchmarks and raising format compliance from 0.13 to 0.64.

  11. CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

    cs.CL 2026-05 unverdicted novelty 7.0

    CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens witho...

  12. CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

    cs.LG 2026-05 conditional novelty 7.0

    CEPO sharpens token credit in RLVR by requiring tokens to be favored by the correct answer and disfavored by wrong answers drawn from rejected rollouts, delivering accuracy gains on five multimodal math benchmarks.

  13. Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

    cs.LG 2026-05 conditional novelty 7.0

    Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.

  14. Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation

    cs.AI 2026-05 unverdicted novelty 7.0

    PPR-GDE is a new RL approach that integrates pairwise preference rewards with group-based diversity enhancement in a unified objective to improve both alignment quality and expressive diversity in open-ended generatio...

  15. Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era

    cs.LG 2026-05 unverdicted novelty 7.0

    Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable ...

  16. Weak-to-Strong Elicitation via Mismatched Wrong Drafts

    cs.CL 2026-05 conditional novelty 7.0

    Mismatched wrong drafts from a 1.5B math model injected into GRPO training of a 7B model yield higher pass rates on MATH-500 and AIME than on-policy baselines or matched variants.

  17. DISA: Offline Importance Sampling for Distribution-Matching LLM-RL

    cs.LG 2026-05 unverdicted novelty 7.0

    DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more stra...

  18. Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking

    cs.LG 2026-05 unverdicted novelty 7.0

    PCM uses success-failure action variance to probabilistically select and mask chunks for gradient updates in GRPO, matching standard success rates with 2.38x wall-clock speedup and 60% lower memory on LIBERO benchmarks.

  19. AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    AstraFlow decouples RL components into autonomous dataflow services to natively support multi-policy agentic LLM training, elastic scaling, and cross-region execution with 2.7x speedup on math, code, search, and Agent...

  20. Learning from Language Feedback via Variational Policy Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming f...

  21. ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

    cs.AI 2026-05 conditional novelty 7.0

    ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.

  22. ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and tha...

  23. AIS: Adaptive Importance Sampling for Quantized RL

    stat.ML 2026-05 unverdicted novelty 7.0

    AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.

  24. Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    cs.MM 2026-05 unverdicted novelty 7.0

    Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.

  25. Learning Agentic Policy from Action Guidance

    cs.CL 2026-05 unverdicted novelty 7.0

    ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

  26. Towards Order Fairness: Mitigating LLMs Order Sensitivity through Dual Group Advantage Optimization

    cs.LG 2026-05 unverdicted novelty 7.0

    DGAO uses reinforcement learning to optimize LLMs for both accuracy and order stability by balancing intra-group accuracy advantages and inter-group stability advantages.

  27. StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

    cs.SE 2026-05 unverdicted novelty 7.0

    StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.

  28. GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    GEAR reshapes GRPO trajectory advantages using divergence signals from a ground-truth-conditioned teacher to create adaptive token- and segment-level credit regions.

  29. CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG

    cs.AI 2026-05 unverdicted novelty 7.0

    CuSearch reallocates rollout budget in RLVR toward deeper-search trajectories as a proxy for retrieval supervision density, yielding up to 11.8 exact-match gains over uniform GRPO sampling on ZeroSearch.

  30. AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration - Learning from Cheap, Optimizing Expensive

    cs.AI 2026-05 unverdicted novelty 7.0

    AutoLLMResearch trains agents via a multi-fidelity environment and MDP pipeline to extrapolate configuration principles from inexpensive to costly LLM experiments.

  31. Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    GCPO shifts RLVR from rollout competition to team cooperation by assigning advantages via marginal contributions to a determinant-based coverage volume over semantic embeddings, yielding higher accuracy and solution d...

  32. Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    GCPO uses team-level credit assignment via determinant volume over reward-weighted semantic embeddings to promote non-redundant correct reasoning paths, improving both accuracy and diversity in LLM training.

  33. Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR

    cs.LG 2026-05 unverdicted novelty 7.0

    RLRT augments GRPO by reinforcing tokens on correct student rollouts that the teacher would not have predicted, outperforming standard self-distillation and exploration baselines on Qwen3 models.

  34. Relative Score Policy Optimization for Diffusion Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.

  35. LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning ...

  36. Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning

    cs.CV 2026-05 unverdicted novelty 7.0

    RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multi...

  37. SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    SeePhys Pro benchmark reveals multimodal models degrade on physics reasoning as information transfers from text to images, with blind training improvements often stemming from textual cues rather than visual evidence.

  38. SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    Multimodal AI models for physics reasoning lose performance when information shifts from text to images, and RLVR training gains often come from non-visual textual or distributional cues rather than actual visual evidence.

  39. Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

    cs.AI 2026-05 unverdicted novelty 7.0

    An exploration-aware RL framework lets LLM agents adaptively explore only under high uncertainty via variational rewards and action grouping, yielding consistent gains on text and GUI agent benchmarks.

  40. CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization

    cs.LG 2026-05 unverdicted novelty 7.0

    CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.

  41. BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preservin...

  42. The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits

    cs.LG 2026-05 unverdicted novelty 7.0

    The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...

  43. DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards

    cs.LG 2026-05 unverdicted novelty 7.0

    DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.

  44. KL for a KL: On-Policy Distillation with Control Variate Baseline

    cs.LG 2026-05 unverdicted novelty 7.0

    vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...

  45. Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning

    cs.CL 2026-05 unverdicted novelty 7.0

    Attention entropy splits RL training tokens into stable anchors and volatile explorers, and entropy-aware reweighting improves held-out reasoning performance.

  46. Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

    cs.LG 2026-05 unverdicted novelty 7.0

    POISE trains a lightweight probe on the actor's internal states to predict expected rewards for RLVR, matching DAPO performance on math benchmarks with lower compute by avoiding extra rollouts or critic models.

  47. Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

    cs.LG 2026-05 unverdicted novelty 7.0

    POISE estimates value baselines for RL in LLMs from the actor's internal states via a lightweight probe and cross-rollout construction, matching DAPO performance with lower compute on math reasoning benchmarks.

  48. Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance

    cs.CL 2026-05 unverdicted novelty 7.0

    Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.

  49. Rubric-based On-policy Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance an...

  50. Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

    cs.LG 2026-05 unverdicted novelty 7.0

    The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on ...

  51. Teaching Language Models to Think in Code

    cs.CL 2026-05 unverdicted novelty 7.0

    ThinC trains small models to reason primarily in code rather than natural language, outperforming tool-integrated baselines and even larger models on competition math benchmarks.

  52. Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs

    cs.CL 2026-05 unverdicted novelty 7.0

    RL on binary rewards boosts LLM factual recall by ~27% relative across models by redistributing probability mass to latent correct answers rather than acquiring new knowledge.

  53. Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR

    cs.LG 2026-05 unverdicted novelty 7.0

    HORA adaptively allocates rollouts using hit utility to improve Pass@K over compute-matched GRPO on math reasoning benchmarks while preserving Pass@1.

  54. Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

    cs.CL 2026-05 unverdicted novelty 7.0

    POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...

  55. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.

  56. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.

  57. Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...

  58. PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

    cs.LG 2026-05 unverdicted novelty 7.0

    PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...

  59. Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL

    cs.LG 2026-05 conditional novelty 7.0

    A one-parameter early-termination gate based on mean pairwise prefix edit distance reduces wall-clock time by 10.7% and raises held-out success by 2.5 pp in GRPO on ALFWorld by cutting zero-advantage batch dilution.

  60. Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization

    cs.LG 2026-05 unverdicted novelty 7.0

    PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · cited by 414 Pith papers · 10 internal anchors

  1. [1]

    Learning to reason with llms, 2024

    OpenAI. Learning to reason with llms, 2024

  2. [2]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  3. [3]

    GPT-4 Technical Report

    OpenAI. GPT4 technical report. arXiv preprint arXiv:2303.08774, 2023

  4. [4]

    Claude 3.5 sonnet, 2024

    Anthropic. Claude 3.5 sonnet, 2024

  5. [5]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  6. [6]

    Palm: Scaling language modeling with pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023

  7. [7]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

  8. [8]

    Grok 3 beta — the age of reasoning agents, 2024

    XAI. Grok 3 beta — the age of reasoning agents, 2024

  9. [9]

    Gemini 2.0 flash thinking, 2024

    Google DeepMind. Gemini 2.0 flash thinking, 2024

  10. [10]

    Qwq-32b: Embracing the power of reinforcement learning, 2024

    Qwen. Qwq-32b: Embracing the power of reinforcement learning, 2024

  11. [11]

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025

  12. [12]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

  13. [13]

    An empirical study on eliciting and improving r1-like reasoning models

    Zhipeng Chen, Yingqian Min, Beichen Zhang, Jie Chen, Jinhao Jiang, Daixuan Cheng, Wayne Xin Zhao, Zheng Liu, Xu Miao, Yang Lu, et al. An empirical study on eliciting and improving r1-like reasoning models. arXiv preprint arXiv:2503.04548, 2025

  14. [14]

    Open-reasoner-zero: An open source approach to scaling reinforcement learning on the base model

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, and Heung-Yeung Shum Xiangyu Zhang. Open-reasoner-zero: An open source approach to scaling reinforcement learning on the base model. https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero, 2025

  15. [15]

    REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

    Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models. arXiv preprint arXiv:2501.03262, 2025

  16. [16]

    Process Reinforcement through Implicit Rewards

    Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025

  17. [17]

    Token-supervised value models for enhancing mathematical reasoning capabilities of large language models

    Jung Hyun Lee, June Yong Yang, Byeongho Heo, Dongyoon Han, and Kang Min Yoo. Token-supervised value models for enhancing mathematical reasoning capabilities of large language models. arXiv preprint arXiv:2407.12863, 2024

  18. [18]

    Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment

    Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment. arXiv preprint arXiv:2410.01679, 2024

  19. [19]

    What’s behind ppo’s collapse in long-cot? value optimization holds the secret

    Yufeng Yuan, Yu Yue, Ruofei Zhu, Tiantian Fan, and Lin Yan. What’s behind ppo’s collapse in long-cot? value optimization holds the secret. arXiv preprint arXiv:2503.01491, 2025

  20. [20]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024

  21. [21]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  22. [22]

    High-dimensional continuous control using generalized advantage estimation, 2018

    John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation, 2018

  23. [23]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

  24. [24]

    Concrete problems in ai safety, 2016

    Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety, 2016

  25. [25]

    Reinforcement learning with a corrupted reward channel, 2017

    Tom Everitt, Victoria Krakovna, Laurent Orseau, Marcus Hutter, and Shane Legg. Reinforcement learning with a corrupted reward channel, 2017

  26. [26]

    Specification gaming: the flip side of ai ingenuity, 2020

    Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, Ramana Kumar, Zac Kenton, Jan Leike, and Shane Legg. Specification gaming: the flip side of ai ingenuity, 2020

  27. [27]

    Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective, 2021

    Tom Everitt, Marcus Hutter, Ramana Kumar, and Victoria Krakovna. Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective, 2021

  28. [28]

    Scaling laws for reward model overoptimization, 2022

    Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization, 2022

  29. [29]

    Reward hacking in reinforcement learning

    Lilian Weng. Reward hacking in reinforcement learning. lilianweng.github.io, Nov 2024

  30. [30]

    Generative language modeling for automated theorem proving, 2020

    Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving, 2020

  31. [31]

    Solving olympiad geometry without human demonstrations

    Trieu H Trinh, Yuhuai Wu, Quoc V Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature, 625(7995):476–482, 2024

  32. [32]

    Alphageometry: An olympiad-level ai system for geometry, 2024

    Trieu Trinh and Thang Luong. Alphageometry: An olympiad-level ai system for geometry, 2024

  33. [33]

    Ai achieves silver-medal standard solving international mathematical olympiad problems, 2024

    AlphaProof and AlphaGeometry Teams. Ai achieves silver-medal standard solving international mathematical olympiad problems, 2024

  34. [34]

    Coderl: Mastering code generation through pretrained models and deep reinforcement learning

    Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems, 35:21314–21328, 2022

  35. [35]

    Reflexion: Language agents with verbal reinforcement learning, 2023

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023

  36. [36]

    Teaching large language models to self-debug, 2023

    Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug, 2023

  37. [37]

    Rlef: Grounding code llms in execution feedback with reinforcement learning, 2025

    Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Quentin Carbonneaux, Taco Cohen, and Gabriel Synnaeve. Rlef: Grounding code llms in execution feedback with reinforcement learning, 2025

  38. [38]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, YK Li, Y Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  39. [39]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. Appendix A Dataset Transformation Here is an example of data transformation, where we demonstrate the original problem, the transformed problem, and the model outputs during the transformation process (the original is in...

  40. [40]

    First, note that the answer consists of an integer part and a square root term

    We need to transform this into an integer form. First, note that the answer consists of an integer part and a square root term. According to the given rule, we replace the integer and square root components with undetermined coefficients. The original problem uses x and y as variables, so we choose other letters as undetermined coefficients. We express the...

  41. [41]

    Let B be the set of residents who own a set of golf clubs

    Let’s define the following variables: Let A be the set of residents who own a diamond ring. Let B be the set of residents who own a set of golf clubs. Let C be the set of residents who own a garden spade. Let D be the set of residents who own a bag of candy hearts. Since all 900 residents own a bag of candy hearts, |D| = 900. Let ai denote the number of r...