arXiv: 2605.15155 · v1 · pith:CBEEOBLGnew · submitted 2026-05-14 · 💻 cs.LG · cs.AI · cs.CL

Self-Distilled Agentic Reinforcement Learning

classification 💻 cs.LG cs.AI cs.CL
keywords opsd learning reinforcement sdar teacher across agentic agents

Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level guidance from a teacher branch augmented with privileged context. However, transferring OPSD to multi-turn agents proves problematic: compounding multi-turn instability destabilizes supervision, while skill-conditioned privileged guidance requires asymmetric treatment, because negative teacher rejections may arise from imperfect skill retrieval or utilization. We introduce SDAR (Self-Distilled Agentic Reinforcement Learning), which treats OPSD as a gated auxiliary objective while keeping RL as the primary optimization backbone. SDAR maps detached token-level signals into a sigmoid gate, strengthening distillation on teacher-endorsed positive-gap tokens and softly attenuating negative teacher rejections. Across the Qwen2.5 and Qwen3 families on ALFWorld, WebShop, and Search-QA, SDAR substantially improves over GRPO (+9.4% on ALFWorld, +7.0% on Search-QA, +10.2% on WebShop-Acc), avoids the instability of naive GRPO+OPSD, and consistently outperforms hybrid RL-OPSD baselines across model scales.
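
The gating idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not define the token-level signal, so the sketch assumes it is the gap between teacher and student log-probabilities of the sampled token. The names `gated_distill_weights` and `combined_loss`, the `temperature` and `distill_coef` parameters, and the mean reduction are all hypothetical. Plain Python has no autograd, so "detached" here only means the gate is treated as a constant weight.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_distill_weights(teacher_logp, student_logp, temperature=1.0):
    """Map per-token log-prob gaps to gates in (0, 1).

    Assumption: gap = teacher_logp - student_logp on the sampled token.
    A positive gap (teacher endorses the token) gives a gate above 0.5;
    a negative gap (teacher rejects it) is softly attenuated toward 0.
    """
    return [
        sigmoid((t - s) / temperature)
        for t, s in zip(teacher_logp, student_logp)
    ]


def combined_loss(rl_loss, distill_terms, gates, distill_coef=0.1):
    """RL loss plus a gated auxiliary distillation term.

    distill_terms: per-token distillation losses (e.g. a per-token KL
    estimate), supplied by the caller. The gates act as fixed weights.
    """
    gated = [g * d for g, d in zip(gates, distill_terms)]
    aux = sum(gated) / len(gated) if gated else 0.0
    return rl_loss + distill_coef * aux


if __name__ == "__main__":
    teacher = [-0.1, -2.0]   # hypothetical teacher log-probs
    student = [-1.0, -0.5]   # hypothetical student log-probs
    gates = gated_distill_weights(teacher, student)
    print(gates)
    print(combined_loss(1.0, [0.4, 0.6], gates))
```

In this sketch the RL loss passes through unchanged and the distillation term only adds a weighted correction, which mirrors the abstract's framing of RL as the primary objective and OPSD as a gated auxiliary.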

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ATOD: Annealed Turn-aware On-policy Distillation for Multi-turn Autonomous Agents

    cs.AI 2026-06 unverdicted novelty 6.0

    ATOD anneals from on-policy distillation to RL with turn-level reweighting to improve multi-turn agent success rates on ALFWorld, WebShop, and Search-QA.

  2. Policy and World Modeling Co-Training for Language Agents

    cs.LG 2026-06 unverdicted novelty 6.0

    PaW co-trains policy and world modeling on standard RL rollouts using action-entropy data selection, noise-tolerant loss, and reward-adaptive balancing, yielding consistent gains on three agent benchmarks.

  3. OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning

    cs.CL 2026-06 unverdicted novelty 4.0

    OPID distills episode- and step-level skills from completed on-policy trajectories, routes them via a critical-first mechanism, and combines the resulting log-probability shift advantage with outcome advantage for polic...