{"work":{"id":"6509a633-b810-4a8d-9de9-8a9b59a768d0","openalex_id":null,"doi":null,"arxiv_id":"2604.14084","raw_key":null,"title":"TIP: Token Importance in On-Policy Distillation","authors":null,"authors_text":null,"year":2026,"venue":"cs.LG","abstract":"On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher--student divergence, where the student is overconfident and wrong.\n  Empirically, student entropy is a strong first-order proxy: retaining $50\\%$ of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to $47\\%$. But entropy alone misses a second important region. When we isolate low-entropy, high-divergence tokens, training on fewer than $10\\%$ of all tokens nearly matches full-token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy-only rules.\n  We organize these findings with TIP (Token Importance in on-Policy distillation), a two-axis taxonomy over student entropy and teacher--student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type-aware token selection rules that combine uncertainty and disagreement. We validate this picture across three teacher--student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning, where Q3-only training on $<$$20\\%$ of tokens surpasses full-token OPD. Our experiments are implemented by extending the OPD repository https://github.com/HJSang/OPSD_OnPolicyDistillation, which supports memory-efficient distillation of larger models under limited GPU budgets.","external_url":"https://arxiv.org/abs/2604.14084","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-06-30T12:44:39.944438+00:00","pith_arxiv_id":"2604.14084","created_at":"2026-05-11T03:05:53.691611+00:00","updated_at":"2026-06-30T12:44:39.944438+00:00","title_quality_ok":true,"display_title":"TIP: Token Importance in On-Policy Distillation","render_title":"TIP: Token Importance in On-Policy Distillation"},"hub":{"state":{"work_id":"6509a633-b810-4a8d-9de9-8a9b59a768d0","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":12,"external_cited_by_count":null,"distinct_field_count":3,"first_pith_cited_at":"2026-05-08T07:52:15+00:00","last_pith_cited_at":"2026-06-29T17:55:53+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-30T18:20:19.641161+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":4},{"context_role":"method","n":1}],"polarity_counts":[{"context_polarity":"background","n":4},{"context_polarity":"use_method","n":1}],"runs":{},"summary":{},"graph":{},"authors":[]}}