{"total":17,"items":[{"citing_arxiv_id":"2606.01080","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ThinkSwitch: Context Distillation with LoRA and Weight Interpolation for Specific-Purpose Reasoning Tasks","primary_cat":"cs.LG","submitted_at":"2026-05-31T07:57:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ThinkSwitch uses iterative self-distillation with QLoRA and spherical weight interpolation to raise both instruct and thinking checkpoint accuracy on small AIME and PubMedQA sets using only 15 human prompts per domain.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22263","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning","primary_cat":"cs.LG","submitted_at":"2026-05-21T10:07:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20258","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"It Takes Two: Complementary Self-Distillation for Contextual Integrity in LLMs","primary_cat":"cs.LG","submitted_at":"2026-05-18T13:57:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"SELFCI uses complementary self-distillation with two reverse KL divergences to align LLMs to contextual integrity while preserving utility, outperforming RL 
baselines like GRPO in agentic settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18226","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Context Memorization for Efficient Long Context Generation","primary_cat":"cs.CL","submitted_at":"2026-05-18T11:12:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Attention-state memory externalizes long prefixes into a lightweight lookup table of precomputed attention states, yielding higher accuracy than standard in-context learning at fixed memory budgets and lower latency than full attention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17497","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Self-Supervised On-Policy Distillation for Reasoning Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-17T15:14:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15604","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VSPO: Vector-Steered Policy Optimization for Behavioral Control","primary_cat":"cs.LG","submitted_at":"2026-05-15T04:31:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VSPO samples rollouts at varying steering intensities to improve behavioral control in LLMs 
while preserving task accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11613","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-12T06:43:17+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Advances in Neural Information Processing Systems, 35:15476-15488, 2022. 11 [31] Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-rewarding language models. arXiv preprint arXiv:2401.10020, 2024. [32] Vladimir Vapnik and Akshay Vashist. A new learning paradigm: Learning using privileged information. Neural networks, 22(5-6):544-557, 2009. [33] Charlie Snell, Dan Klein, and Ruiqi Zhong. Learning by distilling context. arXiv preprint arXiv:2209.15189, 2022. [34] Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning. 
In Findings of the Association for Computational Linguistics: ACL"},{"citing_arxiv_id":"2605.10889","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why","primary_cat":"cs.LG","submitted_at":"2026-05-11T17:33:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08873","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization","primary_cat":"cs.LG","submitted_at":"2026-05-09T10:51:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07307","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts","primary_cat":"cs.CL","submitted_at":"2026-05-08T06:15:50+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Reasoning language models extract answers from sparse, order-shuffled chain-of-thought traces with little accuracy 
loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20733","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Near-Future Policy Optimization","primary_cat":"cs.LG","submitted_at":"2026-04-22T16:20:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating convergence.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"on the current prompts, yielding guidance that is stronger than historical replay while remaining much closer to the current policy than any external teacher. 5.2 Self-Distillation and Self-Taught Self-distillation and self-Taught methods explore how a model can learn from a stronger version of itself. Context distillation showed that the same model can serve as both teacher and student when given privileged context [22], ReST and STaR bootstrap reasoning traces from the model's own successful generations [7, 34], and recent on-policy distillation methods provide token-level guidance from an internal or external teacher on the student's own rollout distribution [1, 10, 14, 21, 38]. 
NPO shares the intuition that a model can benefit from a stronger self, but the source of that strength"},{"citing_arxiv_id":"2604.07941","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning","primary_cat":"cs.CL","submitted_at":"2026-04-09T08:00:37+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"and statistics. JMLR Workshop and Conference Proceedings. 2011, pp. 627-635. [56] A. M. Lamb et al. \"Professor Forcing: A New Algorithm for Training Recurrent Networks\". In: Advances in neural information processing systems 29 (2016). [57] G. Hinton, O. Vinyals, and J. Dean. \"Distilling the Knowledge in a Neural Network\". In: arXiv preprint arXiv:1503.02531 (2015). [58] C. Snell, D. Klein, and R. Zhong. \"Learning by Distilling Context\". In: arXiv preprint arXiv:2209.15189 (2022). [59] Y. Huang et al. \"In-context Learning Distillation: Transferring Few-shot Learning Ability of Pre-trained Language Models\". In: arXiv preprint arXiv:2212.10670 (2022). [60] C.-Y. Hsieh et al. \"Distilling Step-by-Step! 
Outperforming Larger Language Models with Less Training"},{"citing_arxiv_id":"2604.07894","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TSUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context Distillation","primary_cat":"cs.CL","submitted_at":"2026-04-09T07:04:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TSUBASA improves long-horizon personalization in LLMs via dynamic memory evolution for writing and context-distillation self-learning for reading, outperforming Mem0 and Memory-R1 on Qwen-3 benchmarks while reducing token use.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09571","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Tuning Qwen2.5-VL to Improve Its Web Interaction Skills","primary_cat":"cs.HC","submitted_at":"2026-02-20T13:35:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Two-stage fine-tuning of Qwen2.5-VL-32B improves success rates on single-click web tasks from 86% to 94%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.10162","ref_index":297,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models","primary_cat":"cs.AI","submitted_at":"2024-06-14T16:26:20+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward 
function.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.13208","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions","primary_cat":"cs.CR","submitted_at":"2024-04-19T22:55:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Training LLMs on data that enforces priority levels for instructions makes models robust to prompt injection attacks, including unseen ones, with little loss on standard tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2210.11610","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large Language Models Can Self-Improve","primary_cat":"cs.CL","submitted_at":"2022-10-20T21:53:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A 540B-parameter LLM improves reasoning performance on GSM8K, DROP, OpenBookQA, and ANLI-A3 by fine-tuning on self-generated high-confidence CoT solutions from unlabeled data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}