{"total":15,"items":[{"citing_arxiv_id":"2606.07000","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization","primary_cat":"cs.AI","submitted_at":"2026-06-05T07:43:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PTD-PO supplies step-wise token-distribution supervision to student policies via in-context privileged hints derived from spatial attention and intermediate reasoning, while keeping the student in an answer-free context and using Top-K Jensen-Shannon divergence for stable alignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06021","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OPRD: On-Policy Representation Distillation","primary_cat":"cs.LG","submitted_at":"2026-06-04T11:13:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OPRD performs distillation in hidden-state space on on-policy data for deterministic gradients and better math benchmark performance, plus OPRD-Bridge for cross-architecture transfer via low-rank projectors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09887","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SocraticPO: Policy Optimization via Interactive Guidance","primary_cat":"cs.LG","submitted_at":"2026-06-03T09:08:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SocraticPO adds Socratic-style teacher guidance and reward decay to RL rollouts for LLMs, improving performance on scientific reasoning benchmarks over baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02684","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation","primary_cat":"cs.LG","submitted_at":"2026-06-01T17:58:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FiRe-OPD introduces a two-stage filter-then-soft-reweight procedure for trajectory- and token-level supervision in on-policy distillation, claiming gains over prior token-level methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00172","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CAST: Non-Privileged Clipped Asymmetric Self-Teaching with Advantage Flipping for GRPO","primary_cat":"cs.AI","submitted_at":"2026-05-29T13:21:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CAST adds non-privileged self-teacher scoring and bidirectional advantage flipping to GRPO so that zero-variance groups still produce verifier-signed token gradients.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21605","ref_index":12,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation","primary_cat":"cs.CV","submitted_at":"2026-05-20T18:12:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GenEvolve introduces a self-evolving agent framework for image generation using tool-orchestrated trajectories and Visual Experience Distillation to achieve claimed SOTA results on benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19436","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization","primary_cat":"cs.LG","submitted_at":"2026-05-19T06:46:19+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CEPO sharpens token credit in RLVR by requiring tokens to be favored by the correct answer and disfavored by wrong answers drawn from rejected rollouts, delivering accuracy gains on five multimodal math benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18141","ref_index":7,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Brief Overview: On-Policy Self-Distillation In Large Language Models","primary_cat":"cs.HC","submitted_at":"2026-05-18T09:47:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"This overview paper explains the conceptual foundations and design principles of On-Policy Self-Distillation for large language models from a beginner's perspective.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15155","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Self-Distilled Agentic Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-14T17:51:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SDAR gates on-policy self-distillation signals into RL training to stabilize and improve multi-turn LLM agent performance on ALFWorld, WebShop, and Search-QA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13643","ref_index":11,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation","primary_cat":"cs.CL","submitted_at":"2026-05-13T15:05:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Local teachability collapse occurs in later trajectory segments during strong-to-weak OPD; a margin-based release rule using top-K teacher advantage and BIC change-point detection on sentence segments outperforms full-trajectory supervision on five in-domain benchmarks and preserves out-of-domain pe","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12652","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multi-Rollout On-Policy Distillation via Peer Successes and Failures","primary_cat":"cs.LG","submitted_at":"2026-05-12T18:57:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MOPD improves on-policy distillation by using peer successes and failures from multiple rollouts to construct more informative teacher signals, yielding consistent gains over baselines on reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11458","ref_index":6,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-12T03:15:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ATESD introduces a Beta-policy controller that adapts teacher exposure ratio during LLM self-distillation training and reports gains over fixed-exposure baselines on math benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"sequence prediction with recurrent neural networks. InAdvances in Neural Information Processing Systems, pages 1171-1179, 2015. [4] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. InICML, 2009. [5] David Chen, Omar Khattab, and Matei Zaharia. Soda: Semi on-policy black-box distillation for large language models.arXiv preprint, 2026. [6] Ken Ding. Hdpo: Hybrid distillation policy optimization via privileged self-distillation.arXiv preprint arXiv:2603.23871, 2026. [7] Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. InICLR, 2024. [8] Etash Guha et al. Openthoughts: Data recipes for reasoning models.arXiv preprint arXiv:2506.04178, 2025."},{"citing_arxiv_id":"2605.09725","ref_index":7,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On-Policy Distillation with Best-of-N Teacher Rollout Selection","primary_cat":"cs.CV","submitted_at":"2026-05-10T19:49:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05040","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization","primary_cat":"cs.LG","submitted_at":"2026-05-06T15:31:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"distillation: the teacher should define useful support and inductive bias, but it should not be treated as a final target to be copied uniformly. Proposition 1(Optimal reward-tilted policy).For each fixed input x, the optimizer of Eq. (3) is the reward-tilted teacher distribution π⋆(y|x) = πteach(y|x) exp(r(x, y)/β) Z(x) , Z(x) = X y πteach(y|x) exp(r(x, y)/β).(4) Equivalently, the latent reward can be written in terms of the optimal policy and the teacher up to an x-dependent additive constant: r(x, y) =βlog π⋆(y|x) πteach(y|x) +βlogZ(x).(5) The derivation is deferred to Appendix D.1. Thus, the desired policy does not merely copy the teacher. Instead, it takes the teacher distribution as a reference measure and reweights it by the"},{"citing_arxiv_id":"2604.13016","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe","primary_cat":"cs.LG","submitted_at":"2026-04-14T17:54:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The most lightweight variant evaluates only the token sampled by the student, and is also the most common implementation in prior on-policy distillation work [Lu and Lab, 2025, Xiao et al., 2026, Yang et al., 2026b]. Givenˆ𝑦𝑡 ∼𝑝 𝑡, the per-token loss isℓsample 𝑡 ≜log𝑝 𝑡 (ˆ𝑦𝑡) −log𝑞 𝑡 (ˆ𝑦𝑡), aggregated as: Lsample OPD (𝜃)=𝔼 𝑥∼D 𝑥 ,ˆ𝑦∼𝜋𝜃 (· |𝑥) \" 𝑇∑︁ 𝑡=1 ℓsample 𝑡 # .(3) Since 𝔼ˆ𝑦𝑡∼𝑝 𝑡 [ℓsample 𝑡 ]=𝐷 KL(𝑝 𝑡 ∥𝑞𝑡), each ℓsample 𝑡 is an unbiased single-sample estimator of the token- level reverse KL. Full-Vocabulary OPD.At the other extreme, one computes the divergence over the entire vocabulary at each prefix: Lfull OPD(𝜃)=𝔼 𝑥∼D 𝑥 ,ˆ𝑦∼𝜋𝜃 (· |𝑥) \" 𝑇∑︁ 𝑡=1 𝐷KL(𝑝 𝑡 ∥𝑞𝑡) # .(4) This yields denser gradients compared to sampled-token OPD, at the cost of𝑂(𝐵𝑇 𝑀) memory for"}],"limit":50,"offset":0}