DAPO: An open-source LLM reinforcement learning system at scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, YuYue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, W · 2025

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

S-trace adds sparse eligibility traces to RLVR that mask low-entropy tokens, outperforming GRPO by 0.49-3.16% in pass@16 on Qwen3 models while improving sample and token efficiency.

SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning

cs.LG · 2026-04-26 · conditional · novelty 6.0

Correcting DeepSpeed optimizer and OpenRLHF loss bugs reveals SFT-then-RL outperforms mixed-policy methods by 3.8-22.2 points on math benchmarks.

citing papers explorer

Showing 2 of 2 citing papers.

Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR
cs.LG · 2026-05-07 · unverdicted · none · ref 24
SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning
cs.LG · 2026-04-26 · conditional · none · ref 44
