Pith Number

pith:VFLFREZP

pith:2026:VFLFREZPN6POFUG3NPVHGIOBY4
not attested · not anchored · not stored · refs pending

The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

Ge Liu, Hongyu Lu, Siqi Zhu, Weiye Shi, Xuyan Ye

On-policy distillation fails in LLMs due to distribution mismatch, biased gradients, and privileged information aggregation, but targeted fixes restore effectiveness.

arxiv:2605.11182 v2 · 2026-05-11 · cs.AI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{VFLFREZPN6POFUG3NPVHGIOBY4}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

We identify three failure mechanisms: (1) distribution mismatch between teacher and student caused by conditioning on student-generated prefixes, (2) optimization instability from biased TopK reverse-KL gradients, and (3) an OPSD-specific limitation where the student learns a policy free of privileged information (PI) that aggregates PI-conditioned teachers, which is insufficient when PI is instance-specific. We further show that stop-gradient TopK objectives, RLVR-adapted teachers, and SFT-stabilized students mitigate these failures.
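For orientation, the reverse-KL objective named in mechanism (2) can be written in its generic textbook form. This is a sketch under assumptions, not the paper's exact loss: the symbols $\pi_\theta$ (student), $\pi_T$ (teacher), $s$ (a student-generated prefix), $V$ (vocabulary), and $S_K(s)$ (a top-K token subset) are notation introduced here, and this record does not say how $S_K(s)$ is chosen or where the stop-gradient is applied.

```latex
% Full reverse KL between student and teacher at prefix s
D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid s)\,\|\,\pi_T(\cdot \mid s)\big)
  = \sum_{v \in V} \pi_\theta(v \mid s)\,
    \log\frac{\pi_\theta(v \mid s)}{\pi_T(v \mid s)}

% Top-K truncation: the same summand, restricted to a K-token subset
\widehat{D}_K(s)
  = \sum_{v \in S_K(s)} \pi_\theta(v \mid s)\,
    \log\frac{\pi_\theta(v \mid s)}{\pi_T(v \mid s)},
  \qquad S_K(s) \subseteq V,\quad |S_K(s)| = K
```

Because the truncated sum drops the probability mass outside $S_K(s)$, it generally differs from the full divergence. That is one plausible reading of how truncation could bias the gradient, offered here as an assumption and not as the paper's derivation.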

C2 · weakest assumption

The tested settings (mathematical reasoning trajectories and system-prompt/alignment PI) are representative enough that the three failure mechanisms and proposed fixes will apply to other LLM tasks, model scales, and data distributions without additional confounding factors.

C3 · one-line summary

On-policy distillation for LLMs is sensitive to teacher choice and loss design, while self-distillation fails on instance-specific information but succeeds on shared rules, with stop-gradient TopK, adapted teachers, and SFT stabilization as mitigations.

Receipt and verification
First computed: 2026-05-26T01:03:33.409214Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

a95658932f6f9ee2d0db6bea7321c1c71826e3bb411c55a1db19227701c45158

Aliases

arxiv: 2605.11182 · arxiv_version: 2605.11182v2 · doi: 10.48550/arxiv.2605.11182 · pith_short_12: VFLFREZPN6PO · pith_short_16: VFLFREZPN6POFUG3 · pith_short_8: VFLFREZP
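The identifiers on this page appear to be related: for this record, the 26-character Pith Number equals the unpadded base32 encoding of the first 16 bytes of the canonical hash, and each `pith_short_N` alias is its first N characters. The sketch below checks that observation in Python. It is an inference from the values shown here, not a documented rule of the `pith-number/v1.0` schema, and the helper name `base32_prefix` is hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "a95658932f6f9ee2d0db6bea7321c1c71826e3bb411c55a1db19227701c45158"
PITH_NUMBER = "VFLFREZPN6POFUG3NPVHGIOBY4"


def base32_prefix(hex_digest: str, n_bytes: int = 16) -> str:
    """Base32-encode the first n_bytes of a hex digest and strip '=' padding.

    Hypothetical helper: the 16-byte truncation is an assumption inferred
    from this one record, not taken from a published specification.
    """
    raw = bytes.fromhex(hex_digest)[:n_bytes]
    return base64.b32encode(raw).decode("ascii").rstrip("=")


derived = base32_prefix(CANONICAL_HASH)

# The short aliases listed above are plain prefixes of the full number.
aliases = {n: PITH_NUMBER[:n] for n in (8, 12, 16)}

print(derived == PITH_NUMBER)
print(aliases)
```

If the relationship holds for other records too, a client could recompute the identifier from the hash alone; that generalization is untested here.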
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/VFLFREZPN6POFUG3NPVHGIOBY4 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: a95658932f6f9ee2d0db6bea7321c1c71826e3bb411c55a1db19227701c45158
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "de0bcebca2a87255dcfef7f5533ad79685999585cf31a57ff359212454eb5959",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
    "primary_cat": "cs.AI",
    "submitted_at": "2026-05-11T19:44:59Z",
    "title_canon_sha256": "b248cb51e667023b57534f1c43b260f0bb7b7d2661aab3993870fca78c420d80"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.11182",
    "kind": "arxiv",
    "version": 2
  }
}
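The canonicalization step in the shell one-liner under "Verify this Pith Number yourself" can also be run offline. The sketch below mirrors that step in Python: serialize with sorted keys and compact separators, encode as UTF-8, then hash with SHA-256. The function name `canonical_sha256` is hypothetical, and the sketch assumes the JSON printed above is the complete canonical record; whether its digest equals the canonical hash has not been checked here, so compare the printed value yourself.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """SHA-256 of the record serialized with sorted keys and no whitespace.

    Mirrors the json.dumps arguments used in the page's shell one-liner.
    """
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Record copied from the "Canonical record JSON" section (assumed complete).
record = {
    "metadata": {
        "abstract_canon_sha256": "de0bcebca2a87255dcfef7f5533ad79685999585cf31a57ff359212454eb5959",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
        "primary_cat": "cs.AI",
        "submitted_at": "2026-05-11T19:44:59Z",
        "title_canon_sha256": "b248cb51e667023b57534f1c43b260f0bb7b7d2661aab3993870fca78c420d80",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.11182", "kind": "arxiv", "version": 2},
}

# Same content with top-level keys in a different order: the digest must not change.
reordered = {
    "source": record["source"],
    "schema_version": record["schema_version"],
    "metadata": record["metadata"],
}

digest = canonical_sha256(record)
print(digest)  # compare against the canonical hash shown on this page
print(digest == canonical_sha256(reordered))
```

Sorting keys before hashing is what makes the digest independent of how a mirror happens to order the fields, which is the property a recomputable record needs.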