DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Pith reviewed 2026-05-22 23:30 UTC · model grok-4.3
The pith
The DAPO algorithm and its four techniques let an open-source RL system reach 50 points on AIME 2024 with the Qwen2.5-32B base model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The DAPO algorithm, built around decoupled clipping and dynamic sampling together with two additional techniques, combined with open-sourced code and dataset, produces a large-scale RL system that reaches 50 points on AIME 2024 when applied to the Qwen2.5-32B base model.
What carries the argument
The DAPO algorithm itself: four techniques (decoupled clipping, dynamic sampling, and two supporting methods) that stabilize and improve policy optimization at LLM scale.
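The page does not spell out the clipping rule. The following is a minimal sketch, assuming "decoupled clipping" means that the PPO-style clipped surrogate uses separate lower and upper bounds on the importance ratio instead of one symmetric epsilon. The function name `clipped_surrogate` and the default values of `eps_low` and `eps_high` are illustrative assumptions, not taken from the text above.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token PPO-style surrogate with independent clip bounds.

    ratio     : new-policy probability divided by old-policy probability
    advantage : advantage estimate for the token
    eps_low   : how far the ratio may fall below 1 (assumed value)
    eps_high  : how far the ratio may rise above 1 (assumed value)
    """
    # Clamp the ratio into [1 - eps_low, 1 + eps_high].
    clipped_ratio = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    # Pessimistic (lower) of the unclipped and clipped terms, as in PPO.
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Same ratio and advantage, symmetric bound versus a wider upper bound.
    print(clipped_surrogate(1.25, 1.0, eps_low=0.2, eps_high=0.2))
    print(clipped_surrogate(1.25, 1.0, eps_low=0.2, eps_high=0.28))
```

With a symmetric bound, a positive-advantage token whose ratio exceeds the upper limit stops contributing gradient; widening only the upper bound leaves more room to raise the probability of such tokens while keeping the lower bound unchanged.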
If this is right
- Community members can now reproduce the reported AIME performance without access to proprietary details.
- The open-sourced system lowers the barrier for experimenting with reinforcement learning on other base models.
- Future work can isolate the contribution of each of the four techniques by ablating them within the released framework.
- Training runs become more transparent, allowing direct comparison of implementation choices across different labs.
- The combination of algorithm, code, and data supports scaling studies that were previously blocked by secrecy.
Where Pith is reading between the lines
- The same four techniques might transfer to models larger than 32B if the open-sourced code is adapted.
- Dataset curation effects could be measured separately by swapping in new data while holding the algorithm fixed.
- Closed models that currently lead on reasoning benchmarks could face pressure once the open system is widely used.
- Extending the dynamic sampling component to other policy-gradient methods outside LLM RL is a direct next test.
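To make the dynamic sampling idea concrete, here is a minimal sketch under stated assumptions: rewards are per-rollout scalars grouped by prompt, advantages are normalized within each group (as in group-relative methods such as GRPO), and "dynamic sampling" is taken to mean discarding prompt groups whose rollouts all received the same reward, since those groups yield zero advantage. The function names and the use of the population standard deviation are illustrative choices, not details from this page.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """Normalize rewards within one prompt's group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All rollouts scored the same: no learning signal from this group.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


def keep_informative_groups(groups):
    """Drop prompt groups whose rewards are all identical.

    groups: dict mapping a prompt id to the list of rollout rewards.
    """
    return {pid: rs for pid, rs in groups.items() if len(set(rs)) > 1}


if __name__ == "__main__":
    groups = {
        "all_correct": [1, 1, 1, 1],
        "all_wrong": [0, 0, 0, 0],
        "mixed": [1, 0, 0, 1],
    }
    kept = keep_informative_groups(groups)
    for pid, rewards in kept.items():
        print(pid, group_advantages(rewards))
```

In a training loop, one would keep sampling new prompts until the batch is filled with groups that survive this filter. Because the filter depends only on per-group rewards, it could in principle be placed in front of other group-based policy-gradient updates.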
Load-bearing premise
The reported performance gains come primarily from the four techniques rather than from the base model choice or dataset curation choices.
What would settle it
Running the released code and dataset on Qwen2.5-32B and obtaining substantially less than 50 points on AIME 2024 would falsify the claim that the techniques make large-scale LLM RL successful.
Original abstract
Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in OpenAI o1 blog and DeepSeek R1 technical report), thus the community still struggles to reproduce their RL training results. We propose the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm, and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm for large-scale LLM reinforcement learning. It claims to achieve 50 points on AIME 2024 using the Qwen2.5-32B base model and fully open-sources a state-of-the-art RL system, including training code built on the verl framework, a curated dataset, and details on four key techniques to address reproducibility issues in prior closed systems.
Significance. If the performance result holds and is attributable to the proposed techniques, the work would be significant for providing the first fully open-source large-scale LLM RL system with concrete benchmark results, directly addressing the opacity of systems like OpenAI o1 and DeepSeek R1. The open-sourcing of code, dataset, and techniques is a clear strength that enables community verification and extension.
major comments (2)
- [Abstract] Abstract: the claim that the four key techniques 'make large-scale LLM RL a success' is load-bearing for the central contribution but is not supported by ablations that hold the dataset and Qwen2.5-32B base model fixed while comparing DAPO only against a standard PPO/GRPO baseline. Without such controls, the attribution of the 50-point AIME 2024 result to the algorithmic changes (rather than data curation or base-model capabilities) cannot be verified.
- [Abstract] Abstract: the manuscript asserts a concrete benchmark score of 50 on AIME 2024 but provides no derivation, ablation data, or error analysis in the presented text to connect the result to the four techniques; this leaves the performance claim unverified against the stated methods.
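The controlled comparison requested above could be laid out as a leave-one-out ablation grid. The sketch below is only an illustration of the experimental design: the technique keys `technique_3` and `technique_4` are placeholders because this page does not name the two supporting methods, and the `dataset` value is a hypothetical label for the released data.

```python
# Placeholder identifiers; the last two stand in for the unnamed techniques.
TECHNIQUES = ("decoupled_clip", "dynamic_sampling", "technique_3", "technique_4")


def leave_one_out_arms(fixed):
    """Build ablation arms that share the same base model and dataset.

    Returns a baseline (all techniques off), the full system (all on),
    and one arm per technique with only that technique switched off.
    """
    def arm(name, enabled):
        flags = {t: (t in enabled) for t in TECHNIQUES}
        return {"name": name, **fixed, **flags}

    arms = [arm("baseline", set()), arm("full", set(TECHNIQUES))]
    for t in TECHNIQUES:
        arms.append(arm(f"full_minus_{t}", set(TECHNIQUES) - {t}))
    return arms


if __name__ == "__main__":
    fixed = {"base_model": "Qwen2.5-32B", "dataset": "released-dataset"}
    for a in leave_one_out_arms(fixed):
        print(a)
```

Holding `base_model` and `dataset` constant across all six arms is what lets a score difference between arms be read as an effect of the algorithmic change rather than of data curation or base-model capability.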
minor comments (1)
- The exact evaluation protocol for the AIME 2024 score (e.g., pass@1, average over multiple samples, or strict correctness) should be stated explicitly to allow precise replication and comparison.
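The protocol matters because the same set of samples can yield very different headline numbers. As a sketch, the code below contrasts mean accuracy over n samples with the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. Which of these (if either) underlies the reported AIME 2024 score is not stated on this page; the function names are illustrative.

```python
from math import comb


def mean_accuracy(correct_flags):
    """Fraction of sampled answers that are correct (average over samples)."""
    return sum(correct_flags) / len(correct_flags)


def pass_at_k(n, c, k):
    """Unbiased estimate of the chance that at least one of k samples is correct.

    n : total samples drawn for the problem
    c : number of those samples that are correct
    k : sample budget being evaluated (k <= n)
    """
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    n, c = 32, 8  # hypothetical problem: 8 of 32 samples correct
    flags = [1] * c + [0] * (n - c)
    print("mean accuracy:", mean_accuracy(flags))
    print("pass@1:", pass_at_k(n, c, 1))
    print("pass@32:", pass_at_k(n, c, 32))
```

For a single problem, mean accuracy and pass@1 coincide, while pass@k at large k can approach 1 even when most samples are wrong, which is why a bare score is hard to compare across reports.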
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. The comments correctly identify areas where the manuscript's claims could be more precisely supported by evidence. We address each point below and will revise accordingly.
Point-by-point responses
- Referee: [Abstract] Abstract: the claim that the four key techniques 'make large-scale LLM RL a success' is load-bearing for the central contribution but is not supported by ablations that hold the dataset and Qwen2.5-32B base model fixed while comparing DAPO only against a standard PPO/GRPO baseline. Without such controls, the attribution of the 50-point AIME 2024 result to the algorithmic changes (rather than data curation or base-model capabilities) cannot be verified.
  Authors: We agree that the abstract phrasing attributes success to the four techniques without the precise controlled ablations described. The manuscript presents DAPO as the core algorithmic contribution within an open-sourced system, with the 50-point result obtained using those techniques on the stated base model and dataset. However, the referee is correct that direct attribution requires ablations holding data and base model fixed against a standard baseline. We will add such controlled experiments (or clarify their absence if resource-constrained) in a revised version or appendix to strengthen this claim.
  Revision: yes
- Referee: [Abstract] Abstract: the manuscript asserts a concrete benchmark score of 50 on AIME 2024 but provides no derivation, ablation data, or error analysis in the presented text to connect the result to the four techniques; this leaves the performance claim unverified against the stated methods.
  Authors: The 50-point AIME 2024 score is the end-to-end result of the fully described DAPO system. The abstract summarizes this outcome, while the body details the four techniques and training setup. The referee correctly notes the absence of explicit derivation, ablation tables, or error analysis directly linking the score to each technique within the presented text. We will revise the abstract and add cross-references or a concise summary table in the main text to better connect the result to the methods, drawing from any available internal logs or additional analysis.
  Revision: yes
Circularity Check
No circularity in claimed derivation or results
Full rationale
The manuscript presents an empirical system and benchmark result (50 on AIME 2024) using an external, independently verifiable test set. No equations, fitted parameters, or self-citations are shown that reduce the performance claim to a definition or input by construction. The four techniques are asserted to drive success, but the benchmark itself is not derived from or equivalent to those techniques; it remains an independent external measure. The paper is therefore self-contained against external benchmarks with no load-bearing circular steps.
Forward citations
Cited by 60 Pith papers
-
DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts
DualKV is a new FlashAttention variant that shares prompt KV across multiple rollouts in RL training, delivering 1.63-3.82x speedups on 8B-30B models while remaining mathematically identical to standard attention.
-
ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning
ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...
-
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
-
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
-
DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning
DeepMath-103K is a new 103K-problem mathematical dataset with high difficulty, rigorous decontamination, and verifiable answers to support RL training of language-model reasoning.
-
Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval
ToolMerge decomposes queries into LLM-planned tool calls merged by boolean operators for long-video keyframe retrieval and introduces the M2M benchmark, showing competitive results with 5% gains on caption retrieval.
-
DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection
A reinforcement-learned vision-language agent adaptively selects and fuses monocular depth experts per sample for better performance across camera geometries.
-
Learnability-Informed Fine-Tuning of Diffusion Language Models
LIFT is a learnability-informed SFT algorithm for diffusion LMs that aligns token difficulty with diffusion time steps, yielding up to 3x gains on AIME'24 and AIME'25 over standard SFT baselines.
-
MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks
MAVEN pipeline generates multi-scale spatio-temporal event descriptions from videos using agentic adaptation and refinement, then produces training data that lets a fine-tuned 8B model outperform Gemini baselines on p...
-
ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
ParaVT is a parallel video tool-calling RL framework that resolves the Tool Prior Paradox via PARA-GRPO, delivering +7.9% average gains on six long-video benchmarks and raising format compliance from 0.13 to 0.64.
-
CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning
CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens witho...
-
CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization
CEPO sharpens token credit in RLVR by requiring tokens to be favored by the correct answer and disfavored by wrong answers drawn from rejected rollouts, delivering accuracy gains on five multimodal math benchmarks.
-
Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR
Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.
-
Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation
PPR-GDE is a new RL approach that integrates pairwise preference rewards with group-based diversity enhancement in a unified objective to improve both alignment quality and expressive diversity in open-ended generatio...
-
Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era
Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable ...
-
Weak-to-Strong Elicitation via Mismatched Wrong Drafts
Mismatched wrong drafts from a 1.5B math model injected into GRPO training of a 7B model yield higher pass rates on MATH-500 and AIME than on-policy baselines or matched variants.
-
DISA: Offline Importance Sampling for Distribution-Matching LLM-RL
DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more stra...
-
Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking
PCM uses success-failure action variance to probabilistically select and mask chunks for gradient updates in GRPO, matching standard success rates with 2.38x wall-clock speedup and 60% lower memory on LIBERO benchmarks.
-
AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs
AstraFlow decouples RL components into autonomous dataflow services to natively support multi-policy agentic LLM training, elastic scaling, and cross-region execution with 2.7x speedup on math, code, search, and Agent...
-
Learning from Language Feedback via Variational Policy Distillation
VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming f...
-
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
-
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and tha...
-
AIS: Adaptive Importance Sampling for Quantized RL
AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.
-
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
-
Learning Agentic Policy from Action Guidance
ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
-
Towards Order Fairness: Mitigating LLMs Order Sensitivity through Dual Group Advantage Optimization
DGAO uses reinforcement learning to optimize LLMs for both accuracy and order stability by balancing intra-group accuracy advantages and inter-group stability advantages.
-
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
-
GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
GEAR reshapes GRPO trajectory advantages using divergence signals from a ground-truth-conditioned teacher to create adaptive token- and segment-level credit regions.
-
CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG
CuSearch reallocates rollout budget in RLVR toward deeper-search trajectories as a proxy for retrieval supervision density, yielding up to 11.8 exact-match gains over uniform GRPO sampling on ZeroSearch.
-
AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration - Learning from Cheap, Optimizing Expensive
AutoLLMResearch trains agents via a multi-fidelity environment and MDP pipeline to extrapolate configuration principles from inexpensive to costly LLM experiments.
-
Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning
GCPO shifts RLVR from rollout competition to team cooperation by assigning advantages via marginal contributions to a determinant-based coverage volume over semantic embeddings, yielding higher accuracy and solution d...
-
Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning
GCPO uses team-level credit assignment via determinant volume over reward-weighted semantic embeddings to promote non-redundant correct reasoning paths, improving both accuracy and diversity in LLM training.
-
Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR
RLRT augments GRPO by reinforcing tokens on correct student rollouts that the teacher would not have predicted, outperforming standard self-distillation and exploration baselines on Qwen3 models.
-
Relative Score Policy Optimization for Diffusion Language Models
RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.
-
LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models
LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning ...
-
Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning
RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multi...
-
SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning
SeePhys Pro benchmark reveals multimodal models degrade on physics reasoning as information transfers from text to images, with blind training improvements often stemming from textual cues rather than visual evidence.
-
SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning
Multimodal AI models for physics reasoning lose performance when information shifts from text to images, and RLVR training gains often come from non-visual textual or distributional cues rather than actual visual evidence.
-
Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
An exploration-aware RL framework lets LLM agents adaptively explore only under high uncertainty via variational rewards and action grouping, yielding consistent gains on text and GUI agent benchmarks.
-
CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization
CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.
-
BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning
BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preservin...
-
The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits
The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...
-
DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
-
KL for a KL: On-Policy Distillation with Control Variate Baseline
vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...
-
Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning
Attention entropy splits RL training tokens into stable anchors and volatile explorers, and entropy-aware reweighting improves held-out reasoning performance.
-
Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States
POISE trains a lightweight probe on the actor's internal states to predict expected rewards for RLVR, matching DAPO performance on math benchmarks with lower compute by avoiding extra rollouts or critic models.
-
Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States
POISE estimates value baselines for RL in LLMs from the actor's internal states via a lightweight probe and cross-rollout construction, matching DAPO performance with lower compute on math reasoning benchmarks.
-
Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance
Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.
-
Rubric-based On-policy Distillation
Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance an...
-
Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on ...
-
Teaching Language Models to Think in Code
ThinC trains small models to reason primarily in code rather than natural language, outperforming tool-integrated baselines and even larger models on competition math benchmarks.
-
Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs
RL on binary rewards boosts LLM factual recall by ~27% relative across models by redistributing probability mass to latent correct answers rather than acquiring new knowledge.
-
Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR
HORA adaptively allocates rollouts using hit utility to improve Pass@K over compute-matched GRPO on math reasoning benchmarks while preserving Pass@1.
-
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
-
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...
-
PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...
-
Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL
A one-parameter early-termination gate based on mean pairwise prefix edit distance reduces wall-clock time by 10.7% and raises held-out success by 2.5 pp in GRPO on ALFWorld by cutting zero-advantage batch dilution.
-
Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization
PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.
Reference graph
Works this paper leans on
- [1]
-
[2]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
OpenAI. GPT4 technical report.arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [4]
-
[5]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advancesin neural information processing systems, 33:1877–1901, 2020
work page 1901
-
[6]
Palm: Scaling language modeling with pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023
work page 2023
-
[7]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[8]
Grok 3 beta — the age of reasoning agents, 2024
XAI. Grok 3 beta — the age of reasoning agents, 2024
work page 2024
- [9]
-
[10]
Qwq-32b: Embracing the power of reinforcement learning, 2024
Qwen. Qwq-32b: Embracing the power of reinforcement learning, 2024
work page 2024
-
[11]
Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms.arXiv preprint arXiv:2501.12599, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[12]
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report.arXiv preprint arXiv:2412.15115, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[13]
Zhipeng Chen, Yingqian Min, Beichen Zhang, Jie Chen, Jinhao Jiang, Daixuan Cheng, Wayne Xin Zhao, Zheng Liu, Xu Miao, Yang Lu, et al. An empirical study on eliciting and improving r1-like reasoning models.arXiv preprint arXiv:2503.04548, 2025
-
[14]
Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, and Heung-Yeung Shum Xiangyu Zhang. Open-reasoner- zero: An open source approach to scaling reinforcement learning on the base model.https://github.com/ Open-Reasoner-Zero/Open-Reasoner-Zero, 2025
work page 2025
-
[15]
REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization
Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models.arXiv preprint arXiv:2501.03262, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
Process Reinforcement through Implicit Rewards
Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards.arXiv preprint arXiv:2502.01456, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[17]
Jung Hyun Lee, June Yong Yang, Byeongho Heo, Dongyoon Han, and Kang Min Yoo. Token-supervised value models for enhancing mathematical reasoning capabilities of large language models. arXiv preprint arXiv:2407.12863, 2024
-
[18]
Available: https://arxiv.org/abs/2410.01679
Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment.arXiv preprint arXiv:2410.01679, 2024
-
[19]
Yufeng Yuan, Yu Yue, Ruofei Zhu, Tiantian Fan, and Lin Yan. What’s behind ppo’s collapse in long-cot? value optimization holds the secret.arXiv preprint arXiv:2503.01491, 2025
-
[20]
HybridFlow: A Flexible and Efficient RLHF Framework
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv:2409.19256, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[21]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. 13
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[22]
High-dimensional continuous control using generalized advantage estimation, 2018
John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation, 2018
work page 2018
-
[23]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...
work page 2022
-
[24]
Concrete problems in ai safety, 2016
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety, 2016
work page 2016
-
[25]
Reinforcement learning with a corrupted reward channel, 2017
Tom Everitt, Victoria Krakovna, Laurent Orseau, Marcus Hutter, and Shane Legg. Reinforcement learning with a corrupted reward channel, 2017
work page 2017
-
[26]
Specification gaming: the flip side of ai ingenuity, 2020
Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, Ramana Kumar, Zac Kenton, Jan Leike, and Shane Legg. Specification gaming: the flip side of ai ingenuity, 2020
work page 2020
-
[27]
Tom Everitt, Marcus Hutter, Ramana Kumar, and Victoria Krakovna. Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective, 2021
-
[28]
Scaling laws for reward model overoptimization, 2022
Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization, 2022
-
[29]
Reward hacking in reinforcement learning. lilianweng.github.io, Nov 2024
Lilian Weng. Reward hacking in reinforcement learning. lilianweng.github.io, Nov 2024
-
[30]
Generative language modeling for automated theorem proving, 2020
Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving, 2020
-
[31]
Solving olympiad geometry without human demonstrations
Trieu H Trinh, Yuhuai Wu, Quoc V Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature, 625(7995):476–482, 2024
-
[32]
AlphaGeometry: An olympiad-level AI system for geometry, 2024
Trieu Trinh and Thang Luong. AlphaGeometry: An olympiad-level AI system for geometry, 2024
-
[33]
AI achieves silver-medal standard solving International Mathematical Olympiad problems, 2024
AlphaProof and AlphaGeometry Teams. AI achieves silver-medal standard solving International Mathematical Olympiad problems, 2024
-
[34]
Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. CodeRL: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems, 35:21314–21328, 2022
-
[35]
Reflexion: Language agents with verbal reinforcement learning, 2023
Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023
-
[36]
Teaching large language models to self-debug, 2023
Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug, 2023
-
[37]
RLEF: Grounding code LLMs in execution feedback with reinforcement learning, 2025
Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Quentin Carbonneaux, Taco Cohen, and Gabriel Synnaeve. RLEF: Grounding code LLMs in execution feedback with reinforcement learning, 2025
-
[38]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, YK Li, Y Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024
-
[39]
Decoupled weight decay regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019
Appendix A Dataset Transformation
Here is an example of data transformation, where we demonstrate the original problem, the transformed problem, and the model outputs during the transformation process (the original is in...
We need to transform this into an integer form. First, note that the answer consists of an integer part and a square root term. According to the given rule, we replace the integer and square root components with undetermined coefficients. The original problem uses x and y as variables, so we choose other letters as undetermined coefficients. We express the...
Let’s define the following variables: Let A be the set of residents who own a diamond ring. Let B be the set of residents who own a set of golf clubs. Let C be the set of residents who own a garden spade. Let D be the set of residents who own a bag of candy hearts. Since all 900 residents own a bag of candy hearts, |D| = 900. Let a_i denote the number of r...