DCPO: Dynamic Clipping Policy Optimization

Shihui Yang, Chengfeng Dou, Peidong Guo, Kai Lu, Qiang Ju, Fei Deng, Rihui Xin · 2025 · arXiv 2509.02333

10 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · other: 1

citation-polarity summary

background: 2 · unclear: 1

representative citing papers

Revisiting DAgger in the Era of LLM-Agents

cs.LG · 2026-05-13 · conditional · novelty 6.0

DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.

Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.

Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance

cs.CL · 2026-04-25 · unverdicted · novelty 6.0

Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math and code tasks.

SSPO: Subsentence-level Policy Optimization

cs.CL · 2025-11-06 · unverdicted · novelty 6.0

SSPO computes policy importance ratios at the subsentence level with entropy-adjusted clipping bounds, yielding higher average scores than GRPO and GSPO on math reasoning benchmarks with Qwen models.

Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs

cs.AI · 2026-05-27 · unverdicted · novelty 5.0

Sample difficulty in RLVR shows non-monotonic effects on LLM reasoning, with easy/medium problems strengthening computation and reasoning features while hard problems often yield weak or harmful signals.

MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models

cs.AI · 2026-04-18 · unverdicted · novelty 5.0

MCPO fixes vanishing training signals and shrinking weights in GRPO by using a hinge-KL regularizer on mastered prompts and prioritizing majority-correct prompts, yielding higher pass@1 and pass@k on math tasks.

Baichuan-M4: A Clinical-Grade Medical Agent System for Continuous Care

cs.AI · 2026-06-08 · unverdicted · novelty 3.0

The paper describes Baichuan-M4, a coordinated medical agent system that reports leading scores across static knowledge, dynamic consultation, long-context memory, retrieval, OCR, and multimodal tasks with a 3.3% hallucination rate.

Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation

cs.LG · 2026-05-20

Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation

cs.LG · 2026-04-04

Policy Improvement Reinforcement Learning

cs.LG · 2026-04-01

citing papers explorer

Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance

cs.CL · 2026-04-25 · unverdicted · none · ref 28

Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math and code tasks.
