arxiv: 2606.29502 · v1 · pith:Q5FI6A25new · submitted 2026-06-28 · 💻 cs.AI · cs.CL

UCOB: Learning to Utilize and Evolve Agentic Skills via Credit-Aware On-Policy Bidirectional Self-Distillation

Pith reviewed 2026-06-30 07:12 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords agentic reinforcement learning · skill memory · self-distillation · credit assignment · on-policy learning · ALFWorld · WebShop · return-to-go comparison

The pith

UCOB treats skill and no-skill prompts as paired on-policy views and lets the higher-return view teach the other within the same task and state.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Retrieved skills can raise performance in one state while lowering it in another, which breaks the usual assumption that a skill prompt is always a reliable teacher. UCOB therefore runs both versions from the same anchor state, compares their return-to-go, and designates the better one as the local teacher for an on-policy update. The resulting credit signal is used both to internalize useful skill-conditioned behavior and to revise the skill memory so that future retrievals favor skills that actually help at that task and state. Experiments on ALFWorld, WebShop, and Search-QA report consistent gains over skill-free RL, memory baselines, and prior self-distillation methods across model sizes.
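
The comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the record layout `(anchor_state, view, return_to_go)`, the function names, and the `margin` parameter are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def returns_to_go(rewards, gamma=1.0):
    """Return-to-go G_t = r_t + gamma * G_{t+1} for every step of one rollout."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]


def select_local_teachers(records, margin=0.0):
    """Pick the higher-return view per anchor state.

    records: iterable of (anchor_state, view, return_to_go) tuples, with
    view in {"skill", "no_skill"} (hypothetical layout).
    Returns {anchor_state: (teacher_view, credit)}, where credit is the mean
    skill return-to-go minus the mean no-skill return-to-go at that anchor.
    Anchors missing one view, or with |credit| <= margin, are skipped.
    """
    grouped = defaultdict(lambda: {"skill": [], "no_skill": []})
    for state, view, g in records:
        grouped[state][view].append(g)

    teachers = {}
    for state, views in grouped.items():
        if not views["skill"] or not views["no_skill"]:
            continue  # no paired comparison is possible at this anchor
        credit = mean(views["skill"]) - mean(views["no_skill"])
        if abs(credit) <= margin:
            continue  # the views agree, so there is no teacher here
        teachers[state] = ("skill" if credit > 0 else "no_skill", credit)
    return teachers


records = [
    ("kitchen", "skill", 1.0), ("kitchen", "skill", 0.0),
    ("kitchen", "no_skill", 0.0), ("kitchen", "no_skill", 0.0),
    ("hallway", "skill", 0.0), ("hallway", "no_skill", 1.0),
]
teachers = select_local_teachers(records)
```

In this toy data the skill view teaches at `kitchen` and the no-skill view teaches at `hallway`, which is the bidirectional case the summary describes.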

Core claim

UCOB replaces the privileged-teacher assumption with credit-aware on-policy bidirectional self-distillation: skill-conditioned and no-skill prompts are treated as two context views of the same policy; their return-to-go values are compared inside the identical task and anchor state; the higher-return view supplies the distillation target; and the credit difference simultaneously updates the policy, evolves the skill memory, and guides utility-aware retrieval and reflection.
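
One way to write the claim down is given here. The notation is introduced for illustration and is not taken from the paper: $c$ indexes the context view, $\mathcal{G}_c(s)$ is the set of rollouts under view $c$ that pass through anchor state $s$, $G_\tau(s)$ is the return-to-go of rollout $\tau$ from $s$, and $\operatorname{sg}[\cdot]$ is a stop-gradient. The weighting of the distillation term by $|\Delta(s)|$ and the direction of the KL divergence are assumptions, since the abstract does not specify the loss.

```latex
\[
\hat{V}_c(s) = \frac{1}{|\mathcal{G}_c(s)|} \sum_{\tau \in \mathcal{G}_c(s)} G_\tau(s),
\qquad
\Delta(s) = \hat{V}_{\text{skill}}(s) - \hat{V}_{\text{no-skill}}(s)
\]
\[
c^{\star}(s) =
\begin{cases}
\text{skill}, & \Delta(s) > 0 \\
\text{no-skill}, & \Delta(s) < 0
\end{cases}
\qquad
\mathcal{L}_{\text{distill}}(s) = |\Delta(s)| \,
\mathrm{KL}\!\left(
  \operatorname{sg}\!\left[\pi_\theta(\cdot \mid s, c^{\star}(s))\right]
  \,\middle\|\,
  \pi_\theta(\cdot \mid s, \bar{c}^{\star}(s))
\right)
\]
```

Here $\bar{c}^{\star}(s)$ denotes the view that was not selected, so the gradient flows only into the lower-return view at that anchor.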

What carries the argument

Credit-aware on-policy bidirectional self-distillation that selects the local teacher by direct return-to-go comparison between paired skill and no-skill rollouts from the same state.

If this is right

  • Skill memory evolves to store only locally useful entries rather than globally high-reward ones.
  • Retrieval becomes conditioned on predicted credit rather than surface similarity alone.
  • Reflection self-training receives a grounded target derived from the same credit comparison.
  • Performance scales with model size while retaining the same credit mechanism.
  • The method yields measured gains on ALFWorld and WebShop that exceed prior skill-memory and self-distillation baselines.
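
The first two consequences can be made concrete with a small sketch. Everything here is hypothetical: the `SkillEntry` fields, the moving-average update, the pruning rule, and the blend weight `beta` are illustrative choices, not the paper's memory or retrieval design.

```python
from dataclasses import dataclass


@dataclass
class SkillEntry:
    text: str
    utility: float = 0.0  # running estimate of credit seen when this skill was used
    uses: int = 0


def update_utility(entry, credit, lr=0.3):
    """Exponential moving average of observed credit (skill minus no-skill return-to-go)."""
    entry.utility += lr * (credit - entry.utility)
    entry.uses += 1


def prune(memory, min_uses=3, floor=0.0):
    """Drop entries that were tried at least min_uses times and still do not help."""
    return [e for e in memory if e.uses < min_uses or e.utility > floor]


def retrieve(memory, similarity, k=1, beta=0.5):
    """Rank by a blend of surface similarity and estimated utility."""
    def score(e):
        return (1 - beta) * similarity(e) + beta * e.utility
    return sorted(memory, key=score, reverse=True)[:k]


a = SkillEntry("open the fridge first")      # similar on the surface, but harmful
b = SkillEntry("check the countertop first")  # less similar, but helpful
for _ in range(3):
    update_utility(a, credit=-1.0)
    update_utility(b, credit=+1.0)

sims = {a.text: 0.9, b.text: 0.6}
memory = [a, b]
by_similarity = retrieve(memory, lambda e: sims[e.text], beta=0.0)
by_credit = retrieve(memory, lambda e: sims[e.text], beta=0.5)
kept = prune(memory)
```

With `beta=0.0` the surface-similar but harmful entry wins; with `beta=0.5` the entry with positive credit is retrieved instead, and pruning keeps only that entry.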

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same paired-view comparison could be applied to any retrieval-augmented generation setting where retrieved context sometimes harms performance.
  • On-policy credit signals may reduce the sample complexity of distillation in other partially observable RL domains.
  • Skill libraries could be pruned aggressively once credit-aware updates are in place, lowering memory and retrieval cost.
  • The approach suggests testing whether explicit credit comparison stabilizes other bidirectional distillation schemes that currently rely on privileged teachers.

Load-bearing premise

That the higher return-to-go view observed in the same task and anchor state supplies an unbiased teacher signal that can safely drive both policy updates and skill-memory changes without selection bias or instability.
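
A small simulation shows one reason the premise needs care. It is a toy setup, not an experiment from the paper: both views are given the same true value, returns are Gaussian noise around it, and the teacher is whichever view has the higher sample mean. Selecting the maximum of two noisy estimates is biased upward, and the bias shrinks as more rollouts are averaged per anchor. The sketch covers only this noise-driven selection effect, not differences in the trajectories each view induces.

```python
import random


def selected_view_estimate(true_value, noise_std, n_rollouts, rng):
    """Both views share one true value; return the higher of the two sample means."""
    def sample_mean():
        draws = [rng.gauss(true_value, noise_std) for _ in range(n_rollouts)]
        return sum(draws) / len(draws)
    return max(sample_mean(), sample_mean())


def average_selected_estimate(n_rollouts, trials=2000, seed=0):
    """Average value attributed to the selected teacher across many anchors."""
    rng = random.Random(seed)
    total = sum(
        selected_view_estimate(0.5, 1.0, n_rollouts, rng) for _ in range(trials)
    )
    return total / trials


few = average_selected_estimate(n_rollouts=2)    # few rollouts per view
many = average_selected_estimate(n_rollouts=64)  # many rollouts per view
print(f"true value 0.5 | selected-view estimate: few={few:.3f}, many={many:.3f}")
```

Both averages sit above the true value of 0.5, and the gap is larger with two rollouts per view than with sixty-four, so an apparent credit gap at a sparsely visited anchor can come from noise alone.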

What would settle it

A controlled run in which the skill with higher immediate return-to-go is shown to produce lower final task success rates than a fixed or random skill policy when the same credit signal is used for distillation and memory updates.

Figures

Figures reproduced from arXiv: 2606.29502 by Chengdong Xu, Dongbin Zhao, Dong Li, Linjing Li, Qichao Zhang, Songjun Tu, Xiangyuan Lan, Yaocheng Zhang, Yiwen Ma.

Figure 1: Overview and empirical summary of UCOB. … often remains fixed from the skill-conditioned view to the no-skill view. Our diagnostics in Section 2 challenge this fixed-teacher view: the skill-conditioned branch is not consistently better than the no-skill branch, and making skill-conditioned rollouts on-policy does not remove this ambiguity. When skill and no-skill views disagree at the same state, which view …
Figure 2: Unified schematic of the observation-study protocols. (a) Fixed-direction skill/no-skill …
Figure 3: SDAR evaluation w/ and w/o skills during training on Qwen3-1.7B. Skill-conditioned teachers are unreliable and not self-correcting. We first revisit SDAR (Lu et al., 2026a), an asymmetric self-distillation setup where rollouts use the no-skill prompt while a skill-conditioned prompt serves as the privileged teacher. This design assumes that, for states induced by the no-skill rollout, the skill-conditioned …
Figure 4: Dual-rollout SDAR evaluation and training on WebShop. Rollouts with skills mitigate exposure mismatch but not teacher ambiguity. A natural remedy is to place skills into the rollout itself, so the skill-conditioned view is optimized through environment interaction rather than only serving as a teacher outside the rollout path. We therefore test a dual-rollout fixed-direction variant: each training batch s…
Figure 5: State-group and trajectory-step diagnostics for two-view rollouts. Teacher direction is locally value-dependent. To decide whether skill-induced behavior should be trusted at a state, we compare skill-conditioned and no-skill views within the same task and anchor state. Following GiGPO (Feng et al., 2025), we group rollouts by anchor state and estimate each view by average return-to-go. For a rollout rec…
Figure 6: Overview of UCOB: dual-level skill retrieval, mixed skill/no-skill rollouts, credit-aware …
Figure 7: Ablation study under the Qwen3-1.7B backbone on ALFW…
Figure 8: Mechanism and memory-evolution analysis of UCOB with Q…
Figure 9: Skill/no-skill eval-view success with Qwen3-1.7B, averaged over ALFWorld and WebShop. CBSD routing and two-view evaluation
Figure 10: Cost analysis for UCOB. Localized training cost
Figure 11: Case study of local teacher selection in UCOB.
Original abstract

Skill memories can improve agentic reinforcement learning by reusing past experience as textual guidance, but retrieved skills are not oracular: they may help in one state while misleading the same policy in another. This makes the common privileged-teacher assumption fragile, namely that a skill-conditioned prompt can be treated as a fixed teacher for the no-skill prompt. We introduce UCOB, a framework for learning to utilize and evolve agentic skills via credit-aware on-policy bidirectional self-distillation. UCOB treats skill-conditioned and no-skill prompts as two on-policy context views of the same model, compares their return-to-go within the same task and anchor state, and uses the higher-return view as the local teacher. This local credit signal internalizes useful skill-conditioned behavior, corrects misleading skill usage, and guides task/state skill memory updates, utility-aware retrieval, and reflection self-training. Experiments on agentic tasks, including ALFWorld, WebShop, and Search-QA, show that UCOB outperforms skill-free RL, skill-memory baselines, and self-distillation methods across model scales, with up to 23.5 and 18.0 point gains over SOTA baselines on ALFWorld and WebShop. Ablations and analyses further validate its core mechanisms and efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces UCOB, a framework for learning to utilize and evolve agentic skills via credit-aware on-policy bidirectional self-distillation. It treats skill-conditioned and no-skill prompts as two on-policy context views of the same model, compares their return-to-go within the same task and anchor state, and uses the higher-return view as the local teacher. This local credit signal is used to internalize useful skill-conditioned behavior, correct misleading skill usage, and guide task/state skill memory updates, utility-aware retrieval, and reflection self-training. Experiments on agentic tasks including ALFWorld, WebShop, and Search-QA show that UCOB outperforms skill-free RL, skill-memory baselines, and self-distillation methods across model scales, with up to 23.5 and 18.0 point gains over SOTA baselines on ALFWorld and WebShop.

Significance. If the local credit signal is unbiased, the approach could advance agentic RL by providing a mechanism to handle imperfect retrieved skills without a privileged-teacher assumption, enabling better utilization and evolution of skill memories. The reported performance gains are large and consistent across tasks and scales; if reproducible with the claimed ablations, this would represent a meaningful empirical contribution to self-distilling agent policies.

major comments (1)
  1. [Abstract (mechanism description)] The core mechanism (abstract) treats the higher return-to-go view as an unbiased local teacher for on-policy updates and skill-memory evolution. However, the skill prompt alters the policy from the anchor state onward, so the two views induce different trajectory distributions; returns therefore conflate skill utility with differential exploration. No derivation is given showing the comparison remains an unbiased estimator of the value of skill usage conditional on the anchor, nor any analysis of resulting effects on policy-gradient variance or convergence. This assumption is load-bearing for all claimed gains and skill-evolution claims.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for highlighting the load-bearing assumption in the core mechanism. We address the concern point-by-point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract (mechanism description)] The core mechanism (abstract) treats the higher return-to-go view as an unbiased local teacher for on-policy updates and skill-memory evolution. However, the skill prompt alters the policy from the anchor state onward, so the two views induce different trajectory distributions; returns therefore conflate skill utility with differential exploration. No derivation is given showing the comparison remains an unbiased estimator of the value of skill usage conditional on the anchor, nor any analysis of resulting effects on policy-gradient variance or convergence. This assumption is load-bearing for all claimed gains and skill-evolution claims.

    Authors: We agree that the manuscript provides no formal derivation establishing that the return-to-go comparison is an unbiased estimator of skill utility conditional on the anchor state. Because the skill prompt changes the policy distribution from the anchor onward, the two views generate different trajectory distributions, and the return comparison necessarily mixes skill utility with differences in exploration. The current text relies on the empirical utility of the resulting local credit signal for bidirectional self-distillation and memory updates, supported by the reported ablations and gains, but does not analyze bias, policy-gradient variance, or convergence properties. In the revision we will (1) explicitly acknowledge this limitation in the mechanism description, (2) add a discussion section examining the implications for bias and variance under the on-policy bidirectional setup, and (3) include additional analysis or controlled experiments that quantify the practical impact of differential exploration on the credit signal. These changes will make the justification for the approach more transparent while preserving the empirical contributions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; on-policy self-distillation is standard RL structure, not a definitional reduction.

Full rationale

The paper defines UCOB as comparing return-to-go between two on-policy context views (skill-conditioned vs. no-skill) of the same model from a shared anchor state, then using the higher-return view as local teacher for updates and skill-memory evolution. This is an explicit design choice for bidirectional self-distillation in agentic RL; the resulting credit signal and policy updates are not shown to equal the input data or a fitted parameter by construction. No equations, self-citations, or uniqueness theorems are invoked in the abstract or description that would make the claimed gains tautological. Experiments on ALFWorld, WebShop, and Search-QA serve as external benchmarks. The derivation chain remains self-contained as an algorithmic proposal rather than a closed logical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review: without access to the equations, hyperparameters, or full methods section, fitted values and background axioms cannot be enumerated.

axioms (1)
  • Domain assumption: the view with the higher return-to-go, of the two prompts, is a reliable local teacher.
    This is the core mechanism as stated in the abstract.



Reference graph

Works this paper leans on

46 extracted references · 42 canonical work pages · 40 internal anchors

  1. [2] ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. arXiv preprint arXiv:2010.03768.
  2. [3] Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv preprint arXiv:2303.11366.
  3. [4] ReAct: Synergizing Reasoning and Acting in Language Models. arXiv preprint arXiv:2210.03629.
  4. [5] Policy Distillation. arXiv preprint arXiv:1511.06295.
  5. [6] Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
  6. [7] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv preprint arXiv:2005.11401.
  7. [8] Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv preprint arXiv:2305.16291.
  8. [9] ExpeL: LLM Agents Are Experiential Learners. arXiv preprint arXiv:2308.10144.
  9. [10] A Survey on the Memory Mechanism of Large Language Model based Agents. arXiv preprint arXiv:2404.13501.
  10. [11] Group-in-Group Policy Optimization for LLM Agent Training. arXiv preprint arXiv:2505.10978.
  11. [12] RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning. arXiv preprint arXiv:2504.20073.
  12. [13] Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Reward Design. arXiv preprint arXiv:2505.11821.
  13. [14] GAGPO: Generalized Advantage Grouped Policy Optimization. arXiv preprint arXiv:2605.13217.
  14. [15] Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531.
  15. [16] Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning. arXiv preprint arXiv:2503.09516.
  16. [17] Self-Distilled Agentic Reinforcement Learning. arXiv preprint arXiv:2605.15155.
  17. [18] Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents. arXiv preprint arXiv:2604.10674.
  18. [19] Self-Distilled RLVR. arXiv preprint arXiv:2604.03128.
  19. [20] SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning. arXiv preprint arXiv:2602.08234.
  20. [21] Skill-Pro: Learning Reusable Skills from Experience via Non-Parametric PPO for LLM Agents. arXiv preprint arXiv:2602.01869.
  21. [22] Dynamic Dual-Granularity Skill Bank for Agentic RL. arXiv preprint arXiv:2603.28716.
  22. [23] RetroAgent: From Solving to Evolving via Retrospective Dual Intrinsic Feedback. arXiv preprint arXiv:2603.08561.
  23. [24] Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning. arXiv preprint arXiv:2605.06130.
  24. [25] Skill-R1: Agent Skill Evolution via Reinforcement Learning. arXiv preprint arXiv:2605.09359.
  25. [26] SkillOS: Learning Skill Curation for Self-Evolving Agents. arXiv preprint arXiv:2605.06614.
  26. [27] What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents. arXiv preprint arXiv:2605.19447.
  27. [28] Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes. arXiv preprint arXiv:2603.25562.
  28. [29] A Survey of On-Policy Distillation for Large Language Models. arXiv preprint arXiv:2604.00626.
  29. [30] Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe. arXiv preprint arXiv:2604.13016.
  30. [31] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models. arXiv preprint arXiv:2601.18734.
  31. [32] ROSD: Reflective On-Policy Self-Distillation for Language Model Reasoning across Domains. arXiv preprint arXiv:2605.28014.
  32. [33] Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR. arXiv preprint arXiv:2605.10781.
  33. [34] Yaocheng Zhang, Yuanheng Zhu, Wenyue Chong, Songjun Tu, Qichao Zhang, Jiajun Chai, Xiaohan Wang, Wei Lin, Guojun Yin, Dongbin Zhao. 2026.
  34. [35] SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization. arXiv preprint arXiv:2604.02268.
  35. [36] SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training. arXiv preprint arXiv:2606.02355.
  36. [37] Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115.
  37. [38] Qwen3 Technical Report. arXiv preprint arXiv:2505.09388.
  38. [39] SKILLC: Learning Autonomous Skill Internalization in LLM Agents via Contrastive Credit Assignment. arXiv preprint arXiv:2605.27899.
  39. [40] Co-Evolving Skill Generation and Policy Optimization. arXiv preprint arXiv:2606.08755.
  40. [41] StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning. arXiv preprint arXiv:2605.27140.
  41. [42] Are Full Rollouts Necessary for On-Policy Distillation? arXiv preprint arXiv:2605.31490.
  42. [43] WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. Advances in Neural Information Processing Systems.
  43. [44] Trust Region Policy Optimization. arXiv preprint arXiv:1502.05477.
  44. [45] Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning. arXiv preprint arXiv:1910.00177.
  45. [46] Approximately Optimal Approximate Reinforcement Learning. Proceedings of the Nineteenth International Conference on Machine Learning.
  46. [47] Mirror Descent Policy Optimization. International Conference on Learning Representations.