{"total":12,"items":[{"citing_arxiv_id":"2606.00437","ref_index":97,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing","primary_cat":"cs.LG","submitted_at":"2026-05-30T00:05:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EST-PRM stress-tests five PRM models on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench using three label-preserving transformations and reports model-specific vulnerability patterns.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22263","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning","primary_cat":"cs.LG","submitted_at":"2026-05-21T10:07:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18141","ref_index":25,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Brief Overview: On-Policy Self-Distillation In Large Language Models","primary_cat":"cs.HC","submitted_at":"2026-05-18T09:47:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"This overview paper explains the conceptual foundations and design principles of On-Policy Self-Distillation for large language models from a beginner's 
perspective.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12741","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-12T20:46:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout per prompt.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11609","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information","primary_cat":"cs.LG","submitted_at":"2026-05-12T06:40:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"contrastive mutual information. arXiv preprint arXiv:2604.10660, 2026. [11] Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in neural information processing systems, 35:3843-3857, 2022. 
[12] Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, and Tat-Seng Chua. Unifying group-relative and self-distillation policy optimization via sample routing. arXiv preprint arXiv:2604.02288, 2026. [13] Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, et al."},{"citing_arxiv_id":"2605.10781","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR","primary_cat":"cs.LG","submitted_at":"2026-05-11T16:16:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RLRT augments GRPO by reinforcing tokens on correct student rollouts that the teacher would not have predicted, outperforming standard self-distillation and exploration baselines on Qwen3 models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"points the student toward solutions it could not previously reach on its own, and distillation transfers that corrective signal token by token. On already-successful trajectories, the same mechanism inverts its role. Even when the student already reached the correct answer, distilling toward the teacher overwrites the student's choices with the teacher's, a problem recently identified as optimization ambiguity in self-distillation [12]. Rather than being corrected, the student is forced to imitate a path it had already solved its own way, undermining the independent reasoning that produced the success. This observation motivates us to reverse the direction of self-distillation on correct rollouts. 
Consider the tokens where the student's choice differs most sharply from what the teacher would have predicted."},{"citing_arxiv_id":"2605.10194","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment","primary_cat":"cs.AI","submitted_at":"2026-05-11T08:45:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TRACE improves math reasoning by distilling only on annotator-marked critical spans with forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere, gaining 2.76 points over GRPO while preserving OOD performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The pointwise logit pressure of Lemma 8 is what holds universally under Euclidean SGD. B.6 Proof of Proposition 3 Proof. For reference: Yang et al. [2026, Prop. 1] show the privileged-information-specific deviation $\\delta_t(\\theta; c) := -\\sum_v \\big(\\pi_T(v \\mid x, c, y_{<t}) - \\bar{\\pi}_T(v \\mid x, y_{<t})\\big) \\nabla_\\theta \\log \\pi_S(v \\mid x, y_{<t})$ (8) has $\\mathbb{E}_c[\\delta_t] = 0$ and per-position privileged variance $V_t := \\sum_v \\mathrm{Var}_c[\\pi_T(v \\mid x, c, y_{<t})] \\ge 0$, (9) with $V_t = 0$ iff $\\pi_T$ is independent of $c$ at $t$. Step 1: Per-token bound on $\\mathbb{E}_c[\\|\\delta_t\\|^2]$. Let $a_v := \\pi_T(v \\mid x, c, y_{<t}) - \\bar{\\pi}_T(v \\mid x, y_{<t})$, so $\\sum_v a_v = 0$ and $\\delta_t = -\\sum_v a_v \\nabla_\\theta \\log \\pi_S(v)$. 
The score-operator bound (3) applied to $a$ gives $\\|\\delta_t\\|^2 = \\big\\|\\sum_v a_v \\nabla_\\theta \\log \\pi_S(v)\\big\\|^2 \\le C_s^2 \\sum_v a_v^2$, and taking expectation over $c$, $\\mathbb{E}_c\\big[\\|\\delta_t(\\theta; c)\\|^2\\big] \\le C_s^2 \\sum_v \\mathrm{Var}_c[\\pi_T(v \\mid x, c, y_{<t})]$ =C 2"},{"citing_arxiv_id":"2605.07725","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SOD: Step-wise On-policy Distillation for Small Language Model Agents","primary_cat":"cs.CL","submitted_at":"2026-05-08T13:30:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"language models: Phenomenology, mechanism, and recipe. arXiv preprint arXiv:2604.13016, 2026. [30] Shicheng Xu, Liang Pang, Yunchang Zhu, Jia Gu, Zihao Wei, Jingcheng Deng, Feiyang Pan, Huawei Shen, and Xueqi Cheng. Rlkd: Distilling llms' reasoning via reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 34151-34159, 2026. [31] Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, and Tat-Seng Chua. Unifying group-relative and self-distillation policy optimization via sample routing. arXiv preprint arXiv:2604.02288, 2026. 
[32] Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan."},{"citing_arxiv_id":"2605.07711","ref_index":70,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation","primary_cat":"cs.CL","submitted_at":"2026-05-08T13:16:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SimCT enlarges the supervision space in cross-tokenizer on-policy distillation using short jointly tokenizable multi-token continuations, producing consistent gains over shared-token baselines on math and code benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06094","ref_index":21,"ref_count":4,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VISD: Enhancing Video Reasoning via Structured Self-Distillation","primary_cat":"cs.CV","submitted_at":"2026-05-07T12:13:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VISD proposes structured self-distillation with a multi-dimensional judge model and direction-magnitude decoupling to improve token-level credit assignment and convergence speed in VideoLLM reasoning training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"First, the supervision lacks diagnostic specificity, making it difficult to distinguish between different types of reasoning errors, such as logical inconsistency versus grounding failure [28, 32]. Second, the interaction between self-distillation and reinforcement learning is often unstable, as auxiliary signals may override or conflict with reward-driven optimization [21, 24]. 
In this work, we propose VISD, a structured self-distillation framework that integrates reinforcement learning with structured privileged information. The central idea is to elevate self-distillation from a generic auxiliary signal to a structured supervision space that captures the compositional nature of reasoning errors in video tasks. Concretely, VISD introduces a video-aware judge model that evaluates"},{"citing_arxiv_id":"2605.05040","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization","primary_cat":"cs.LG","submitted_at":"2026-05-06T15:31:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"This margin induces, under a Bradley-Terry preference model, the probability that the teacher-generated response is preferred over the student-generated response: $P_\\theta(y^+ \\succ y^- \\mid x) = \\sigma\\big(m_\\theta(x, y^+, y^-)\\big)$. (9) Here $\\sigma(z) := 1/(1 + \\exp(-z))$ denotes the logistic sigmoid function. We maximize the corresponding pairwise log-likelihood, $\\max_\\theta \\mathbb{E}_{(x,c) \\sim D} \\mathbb{E}_{y^+ \\sim \\pi_{\\text{teach}}(\\cdot \\mid x), y^- \\sim \\pi_\\theta(\\cdot \\mid x)} \\big[\\log \\sigma\\big(m_\\theta(x, y^+, y^-)\\big)\\big]$. (10) Equivalently, the implementation minimizes the negative log-likelihood in Eq. (10). Here $y^+$ is generated by the better-conditioned teacher and $y^-$ is generated by the current student. Maximizing Eq. (10) increases the relative probability that the student assigns to teacher-generated responses over its own current responses. 
This online construction makes the preference signal adaptive: the teacher"},{"citing_arxiv_id":"2604.13016","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe","primary_cat":"cs.LG","submitted_at":"2026-04-14T17:54:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}