DCPO: Dynamic Clipping Policy Optimization
10 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- Revisiting DAgger in the Era of LLM-Agents
  DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
- Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization
  OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.
- Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance
  Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math and code tasks.
- SSPO: Subsentence-level Policy Optimization
  SSPO computes policy importance ratios at the subsentence level with entropy-adjusted clipping bounds, yielding higher average scores than GRPO and GSPO on math reasoning benchmarks with Qwen models.
- Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs
  Sample difficulty in RLVR shows non-monotonic effects on LLM reasoning, with easy/medium problems strengthening computation and reasoning features while hard problems often yield weak or harmful signals.
- MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models
  MCPO fixes vanishing training signals and shrinking weights in GRPO by using a hinge-KL regularizer on mastered prompts and prioritizing majority-correct prompts, yielding higher pass@1 and pass@k on math tasks.
- Baichuan-M4: A Clinical-Grade Medical Agent System for Continuous Care
  The paper describes Baichuan-M4, a coordinated medical agent system that reports leading scores across static knowledge, dynamic consultation, long-context memory, retrieval, OCR, and multimodal tasks with a 3.3% hallucination rate.
- Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation
- Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation
- Policy Improvement Reinforcement Learning