DCPO: Dynamic Clipping Policy Optimization
6 Pith papers cite this work.
6 representative citing papers
- Revisiting DAgger in the Era of LLM-Agents. DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems (interpolation sketch after this list).
- Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization. OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their contributions to the change in policy entropy, yielding more stable optimization and better results on math benchmarks (sketch after this list).
- Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance. Span-level Wasserstein distances between the hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math and code tasks (sketch after this list).
- Policy Improvement Reinforcement Learning. PIRL maximizes cumulative policy improvement across iterations instead of surrogate rewards and is provably aligned with final performance; PIPO implements it via retrospective verification for stable closed-loop optimization (telescoping derivation after this list).
- MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models. MCPO fixes vanishing training signals and shrinking weights in GRPO by using a hinge-KL regularizer on mastered prompts and prioritizing majority-correct prompts, yielding higher pass@1 and pass@k on math tasks (hinge-KL sketch after this list).
- Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation. Intra-group objectives in sparse-reward RL must preserve token-gradient exchangeability so that gradients on weak-credit tokens can cancel, which stabilizes training (toy cancellation check after this list).
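Turn-level policy interpolation (first entry) can be pictured with a minimal DAgger sketch. The `env`, `learner`, and `expert` objects and their `reset`/`step`/`act` interface are hypothetical stand-ins, not the paper's implementation; this only illustrates the classic per-turn expert/learner mixture with expert-label aggregation.

```python
import random

def dagger_rollout(env, learner, expert, beta, dataset):
    """One DAgger episode with turn-level policy interpolation.

    At every turn the executed action comes from the expert with
    probability beta and from the learner otherwise, but the recorded
    label is always the expert's action (standard DAgger aggregation).
    Assumes env.step(action) returns (next_obs, done).
    """
    obs, done = env.reset(), False
    while not done:
        expert_action = expert.act(obs)
        action = expert_action if random.random() < beta else learner.act(obs)
        dataset.append((obs, expert_action))  # aggregate expert labels
        obs, done = env.step(action)
    return dataset
```

In standard DAgger, `beta` is annealed toward zero across iterations and the learner is retrained on the aggregated dataset after each batch of rollouts.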
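For OPEFO (second entry), the summary does not specify the entropy-contribution estimator, so the sketch below is an assumption-laden stand-in: it approximates a token's contribution to the entropy change of an update by the centered product of its log-probability and advantage, and attenuates the loss on tokens that push entropy down hardest. All names are illustrative.

```python
import torch

def entropy_aware_token_weights(logp, adv, eps=1e-8):
    """Per-token rescaling factors for a policy-gradient loss.

    Assumption (not from the paper): a token's entropy-change contribution
    is approximated by (logp - mean(logp)) * (adv - mean(adv)); tokens with
    large positive contributions drive entropy down and are attenuated.
    logp, adv: 1-D tensors of chosen-token log-probs and advantages.
    """
    contrib = (logp - logp.mean()) * (adv - adv.mean())
    # Map contributions to [0, 1] attenuation: neutral tokens keep weight ~1,
    # strong entropy-reducing tokens are scaled toward 0.
    scale = contrib.clamp(min=0.0) / (contrib.abs().max() + eps)
    return 1.0 - scale

# usage sketch: loss = -(entropy_aware_token_weights(logp, adv).detach() * adv * logp).mean()
```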
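The span-level Wasserstein idea (third entry) can be sketched with SciPy's 1-D Wasserstein distance on a scalar projection of the hidden states. The span construction, projection, and exact reweighting rule are assumptions here; only the overall shape (distance per span, normalized into advantage weights) follows the summary.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def span_weights(correct_h, incorrect_h, spans):
    """Weight each span by how far apart the hidden-state distributions of
    correct vs. incorrect rollouts are (a larger gap suggests the span is
    where reasoning diverges, so it deserves more credit).

    correct_h, incorrect_h: arrays of shape (n_rollouts, seq_len), e.g. a
    scalar projection (such as mean activation) of each token's hidden state.
    spans: list of (start, end) token-index pairs.
    """
    dists = []
    for start, end in spans:
        # Pool the projected hidden states within the span across rollouts
        # and compare the two empirical distributions.
        dists.append(wasserstein_distance(correct_h[:, start:end].ravel(),
                                          incorrect_h[:, start:end].ravel()))
    dists = np.asarray(dists)
    return dists / (dists.sum() + 1e-8)  # normalized reweighting factors
```

The normalized weights would then scale the group-relative advantages of tokens inside each span before the GRPO update.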
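The "provably aligned with final performance" claim for PIRL (fourth entry) has a one-line telescoping intuition, consistent with the summary (the paper's formal statement may be broader). Writing $J(\pi_t)$ for the true performance of the policy after iteration $t$:

$$\sum_{t=0}^{T-1}\bigl[J(\pi_{t+1}) - J(\pi_t)\bigr] \;=\; J(\pi_T) - J(\pi_0),$$

so maximizing cumulative per-iteration improvement is, up to the constant $J(\pi_0)$, the same as maximizing final performance.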
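A hinge-KL regularizer as described for MCPO (fifth entry) might look like the sketch below: penalize drift from the reference policy only once the per-position KL exceeds a margin, so mastered prompts anchor the policy without their gradient signal vanishing. The margin value and tensor layout are assumptions; the paper's exact formulation may differ.

```python
import torch

def hinge_kl_penalty(logp_new, logp_ref, margin=0.05):
    """Hinge-KL regularizer on mastered prompts.

    logp_new, logp_ref: log-probabilities over the vocabulary,
    shape (batch, seq, vocab). Exact per-position KL(ref || new),
    hinged at `margin` so small drift is free.
    """
    kl = (logp_ref.exp() * (logp_ref - logp_new)).sum(-1)  # per-position KL
    return torch.clamp(kl - margin, min=0.0).mean()
```

This term would be added to the GRPO loss only on prompts the model has already mastered, alongside the summary's majority-correct prompt prioritization.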
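The cancellation behavior behind the sixth entry can be checked numerically: with a group-mean baseline (as in GRPO), advantages sum to zero within a group, so the contributions of a token whose log-probability gradient is identical across all rollouts cancel exactly. The numbers below are illustrative, and this is only a toy instance of the condition, not the paper's formal statement.

```python
import numpy as np

rewards = np.array([1.0, 0.0, 0.0, 1.0])  # sequence-level rewards in one group
adv = rewards - rewards.mean()            # group-mean baseline: advantages sum to zero
g_shared = np.array([0.3, -0.7])          # log-prob gradient of a token identical in
                                          # every rollout (a weak-credit token)
total = sum(a * g_shared for a in adv)    # summed per-rollout contributions
print(total)                              # -> [0. 0.]: the updates cancel exactly
```

An objective that treats rollouts within the group asymmetrically (e.g. through clipping or reweighting) breaks this symmetry and gives such shared tokens a spurious net update, which is the instability the design conditions guard against.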