{"total":12,"items":[{"citing_arxiv_id":"2606.30626","ref_index":46,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DOPD: Dual On-policy Distillation","primary_cat":"cs.AI","submitted_at":"2026-06-29T17:55:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DOPD is an advantage-aware dual distillation method that dynamically assigns token supervision from either privileged teacher or student to transfer capability while mitigating non-replicable information asymmetry in on-policy distillation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30518","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Regime-Aware Peer Specialization for Robust RAG under Heterogeneous Knowledge Conflicts","primary_cat":"cs.CL","submitted_at":"2026-06-29T16:25:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RAPS-DA improves RAG robustness to heterogeneous knowledge conflicts by training regime-specific peer specialists with hard routing and a dual-layer token selector for focused supervision.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28725","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DriftGuard: Safety-Aware Multi-Monitor Detection and Selective Adaptation for Evolving Toxicity Moderation","primary_cat":"cs.CL","submitted_at":"2026-06-27T04:10:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DriftGuard introduces multi-monitor safety-aware drift detection paired with hard-mix selective adaptation, reporting toxic recall gains to 0.8777 on Civil Comments and 0.8523 on DynaHate under temporal and 
cross-dataset shifts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27814","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ATOD: Annealed Turn-aware On-policy Distillation for Multi-turn Autonomous Agents","primary_cat":"cs.AI","submitted_at":"2026-06-26T07:56:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ATOD anneals from on-policy distillation to RL with turn-level reweighting to improve multi-turn agent success rates on ALFWorld, WebShop, and Search-QA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26844","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-26T10:56:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Token teachability, based on local compatibility of teacher and student distributions, predicts on-policy distillation gains better than raw KL disagreement and enables TA-OPD to match or exceed full-token performance with 5% tokens across Qwen models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25381","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Not only where, But when: Temporal Scheduling for RLVR","primary_cat":"cs.LG","submitted_at":"2026-05-25T03:10:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Temporal scheduling of credit allocation criteria over RLVR training, using trajectory percentiles to target heterogeneous behaviors, yields more stable policy 
entropy and better reasoning benchmark results than static allocation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11739","ref_index":23,"ref_count":3,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation","primary_cat":"cs.CL","submitted_at":"2026-05-12T08:19:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"On-policy distillation gains efficiency from early foresight in module allocation and update directions, which the proposed EffOPD method exploits for 3x faster training with comparable performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10194","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment","primary_cat":"cs.AI","submitted_at":"2026-05-11T08:45:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TRACE improves math reasoning by distilling only on annotator-marked critical spans with forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere, gaining 2.76 points over GRPO while preserving OOD performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"with a bootstrap interval over problem IDs. A VG is the unweighted mean over the five benchmark means. 
Method MATH-500 AIME 24 AIME 25 AMC 23 GPQA-D A VG Qwen3-8B base (ref.) 96.80 [95.48,98.00] 76.25 [62.50,87.92] 67.50 [54.17,80.42] 95.94 [92.50,98.75] 58.27 [52.21,64.20] 78.95 [75.04,82.86] GRPO 97.30 [96.03,98.45] 77.08 [62.71,89.38] 68.96 [54.58,82.08] 96.56 [92.19,99.38] 53.85 [48.04,59.66] 78.75 [74.67,82.83] SDPO 95.13 [93.65,96.48] 52.50 [38.33,66.67] 41.25 [26.04,56.67] 91.41 [84.69,96.25] 39.90 [34.85,44.95] 64.04 [59.58,68.50] SRPO 96.48 [95.13,97.65] 69.17 [54.58,82.08] 54.17 [39.58,67.50] 93.12 [87.50,97.50] 37.31 [32.32,42.42] 70.05 [65.87,74.23] RLSD 96.68 [95.35,97.90] 75.42 [61.67,87.08] 68.54 [55.21,81.46] 97."},{"citing_arxiv_id":"2605.08737","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs","primary_cat":"cs.LG","submitted_at":"2026-05-09T06:48:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"flat across λ, while parse validity sharply changes at the predicted boundary. The cliff diagnostic is rubric-independent, whereas the parity claim uses a Gemini-graded rubric and inherits that evaluator's exposure. 1 Introduction On-policy distillation (OPD) trains a student LLM against a teacher's per-token log-probabilities on the student's own rollouts [2, 13]; its reward-extrapolation variant [42] sharpens the on-policy target by a coefficient λ > 1 and can lift the student past the teacher in domain. 
But the same extrapolation step that produces the lift, past a threshold λ⋆, instead replaces format-preserving training with a sharp contract collapse on structured-output tasks [11, 38]. We derive that threshold in closed form and calibrate it on Amazon product-review listwise ranking."},{"citing_arxiv_id":"2605.07725","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SOD: Step-wise On-policy Distillation for Small Language Model Agents","primary_cat":"cs.CL","submitted_at":"2026-05-08T13:30:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[39] Jiaqi Wang, Wenhao Zhang, Weijie Shi, Yaliang Li, and James Cheng. Tcod: Exploring temporal curriculum in on-policy distillation for multi-turn autonomous agents. arXiv preprint arXiv:2604.24005, 2026. [40] Ijun Jang, Jewon Yeom, Juan Yeo, Hyunggu Lim, and Taesup Kim. Stable on-policy distillation through adaptive target reformulation. arXiv preprint arXiv:2601.07155, 2026. [41] Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, and Alborz Geramifard. Tip: Token importance in on-policy distillation. arXiv preprint arXiv:2604.14084, 2026. [42] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 
Deep reinforcement learning from human preferences. Advances in neural information processing"},{"citing_arxiv_id":"2605.07711","ref_index":39,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation","primary_cat":"cs.CL","submitted_at":"2026-05-08T13:16:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SimCT enlarges the supervision space in cross-tokenizer on-policy distillation using short jointly tokenizable multi-token continuations, producing consistent gains over shared-token baselines on math and code benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[37] Yeongmin Kim, Donghyeok Shin, Mina Kang, Byeonghu Na, and Il-Chul Moon. Distillation of large language models via concrete score matching. arXiv preprint, arXiv:2509.25837, 2025. [38] Haiduo Huang, Jiangcheng Song, Yadong Zhang, and Pengju Ren. SelecTKD: Selective token-weighted knowledge distillation for LLMs. arXiv preprint, arXiv:2510.24021, 2025. [39] Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, and Alborz Geramifard. TIP: Token importance in on-policy distillation. arXiv preprint, arXiv:2604.14084, 2026. [40] Anh Duc Le, Tu Vu, Nam Le Hai, Nguyen Thi Ngoc Diep, Linh Ngo Van, Trung Le, and Thien Huu Nguyen. 
CoT2Align: Cross-chain of thought distillation via optimal transport alignment for language models with different tokenizers."},{"citing_arxiv_id":"2605.07396","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Rubric-based On-policy Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-08T07:52:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Architecturally, the teacher's independence from the training loop enables offline execution, significantly lowering GPU memory overhead and accelerating training process (Figure 3). Optimization-wise, ROPD exhibits superior robustness to model divergence: while logit-based OPD typically requires the teacher and student to share similar reasoning patterns [17], ROPD's high-level semantic guidance ensures stable convergence even across models with markedly disparate reasoning trajectories (Table 3). In summary, this work offers a complementary perspective to the prevailing logit-centric distillation landscape. Through ROPD, a simple framework requiring minimal hyperparameter, we demonstrate that high-level semantic rubrics can serve as an efficient and robust alternative to fine-grained"}],"limit":50,"offset":0}