{"total":12,"items":[{"citing_arxiv_id":"2606.01476","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification","primary_cat":"cs.LG","submitted_at":"2026-05-31T22:31:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22263","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning","primary_cat":"cs.LG","submitted_at":"2026-05-21T10:07:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17862","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"$\\boldsymbol{f}$-OPD: Stabilizing Long-Horizon On-Policy Distillation with Freshness-Aware Control","primary_cat":"cs.LG","submitted_at":"2026-05-18T05:14:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"f-OPD decomposes on-policy distillation drift into rollout and supervision components, then applies a sample-level freshness score to adaptively limit stale data influence and stabilize long-horizon agent 
training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13643","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation","primary_cat":"cs.CL","submitted_at":"2026-05-13T15:05:30+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12652","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multi-Rollout On-Policy Distillation via Peer Successes and Failures","primary_cat":"cs.LG","submitted_at":"2026-05-12T18:57:44+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12483","ref_index":23,"ref_count":4,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training","primary_cat":"cs.LG","submitted_at":"2026-05-12T17:57:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Sparse rewards on capable teachers for exploration followed by dense distillation to students outperforms direct sparse reward application like GRPO on the deployment model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07725","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SOD: Step-wise On-policy Distillation for Small Language Model 
Agents","primary_cat":"cs.CL","submitted_at":"2026-05-08T13:30:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"arXiv preprint arXiv:2602.02488, 2026. [54] Yinjie Wang, Ling Yang, Ye Tian, Ke Shen, and Mengdi Wang. Co-evolving llm coder and unit tester via reinforcement learning. arXiv preprint arXiv:2506.03136, 2025. [55] Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models. arXiv preprint arXiv:2602.12275, 2026. [56] Tianzhu Ye, Li Dong, Zewen Chi, Xun Wu, Shaohan Huang, and Furu Wei. Black-box on-policy distillation of large language models. arXiv preprint arXiv:2511.10643, 2025. [57] Wenhong Zhu, Ruobing Xie, Rui Wang, and Pengfei Liu. Hybrid policy distillation for llms. arXiv preprint arXiv:2604.20244, 2026. 
[58] Binbin Zheng, Xing Ma, Yiheng Liang, Jingqing Ruan, Xiaoliang Fu, Kepeng Lin, Benchang"},{"citing_arxiv_id":"2605.07711","ref_index":17,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation","primary_cat":"cs.CL","submitted_at":"2026-05-08T13:16:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SimCT enlarges the supervision space in cross-tokenizer on-policy distillation using short jointly tokenizable multi-token continuations, producing consistent gains over shared-token baselines on math and code benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"own generations [5, 6, 7], so early generation errors can compound and move the student into contexts that were rarely supervised during training [8, 9, 10, 11]. On-policy distillation (OPD) addresses this train-test mismatch by querying the teacher on prefixes sampled from the student's policy, following the broader principle of on-policy imitation learning [12, 13, 14, 15, 16]. This makes OPD a natural paradigm for LLM distillation [17, 18, 19, 20, 21], but it also exposes a hidden requirement for cross-model supervision: at each student-generated prefix, teacher and student predictions must be defined over comparable prediction units. 
This requirement is non-trivial in realistic LLM distillation, where teacher and student models come from different families and use different tokenizers [22]."},{"citing_arxiv_id":"2605.07396","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rubric-based On-policy Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-08T07:52:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"We compare ROPD with SFT (with static teacher outputs), T-Judge (directly employing the teacher as a judge to provide scores), and representative black-box distillation methods OVD [20] and GAD [21]. White-box Setting. Using Qwen3-30B-A3B [5] as the open-weight teacher, we compare ROPD with advanced logit-based methods OPD [6, 7] (hereafter LOPD) and ExOPD [22]. All experiments are conducted in non-thinking mode. Crucially, ROPD only accesses teacher text, intentionally ignoring available logit information to demonstrate its black-box robustness. Data. Training is conducted on DAPO-Math-17K [4] for math, and RaR-Science/Medical-20K [23] for science and medical tracks. 
For fair comparison, all methods share the same training samples within each domain."},{"citing_arxiv_id":"2604.13010","ref_index":45,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation","primary_cat":"cs.LG","submitted_at":"2026-04-14T17:44:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[43, 44], while on-policy distillation (OPD) samples trajectories from the student and aligns it with the teacher's token-level distribution on these student-generated rollouts [8, 9], achieving faster and more effective distillation than offline counterparts [18, 10]. Recent OPD work has explored reward extrapolation to push the student beyond the teacher's own capability [10], black-box variants that do not require access to teacher logits [45], self-distillation paradigms that leverage the model's own in-context capabilities [11, 13, 12], controllable multi-budget reasoning via on-policy exploration [46], and privileged information distillation that transfers training-time knowledge to inference-time policies [47]. 
Concurrent work further validates OPD across diverse post-training tasks [14]."},{"citing_arxiv_id":"2604.08527","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-09T17:58:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07941","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning","primary_cat":"cs.CL","submitted_at":"2026-04-09T08:00:37+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"\"Distilling Reasoning Capabilities into Smaller Language Models\". In: Findings of the Association for Computational Linguistics: ACL 2023. 2023, pp. 7059-7073. [62] R. Agarwal et al. \"On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes\". In: The Twelfth International Conference on Learning Representations. 2024. url: https://openreview.net/forum?id=3zKtaqxLhW. [63] T. Ye et al. \"Black-Box On-Policy Distillation of Large Language Models\". In: arXiv preprint arXiv:2511.10643 (2025). [64] T. Ye et al. 
\"On-Policy Context Distillation for Language Models\". In: arXiv preprint arXiv:2602.12275 (2026). [65] S. Zhao et al. \"Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models\". In: arXiv preprint arXiv:2601."}],"limit":50,"offset":0}