{"total":15,"items":[{"citing_arxiv_id":"2606.19236","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability","primary_cat":"cs.LG","submitted_at":"2026-06-17T16:13:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"STARE applies surprisal-guided token-level advantage reweighting plus a target-entropy gate to stabilize entropy in GRPO RL for LLMs, yielding stable training and 4-8% gains on AIME24/25 over baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18810","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards","primary_cat":"cs.LG","submitted_at":"2026-06-17T08:26:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SC-GRPO improves RL with verifiable rewards by multiplying GRPO gradients with self-induced per-token KL divergence, outperforming GRPO by 8.1% and DAPO by 5.9% on math, code, and agent benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18216","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients","primary_cat":"cs.CL","submitted_at":"2026-06-16T17:46:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 
31-benchmark suite with largest gains at the 0.8B scale.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.13657","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation","primary_cat":"cs.LG","submitted_at":"2026-06-11T17:54:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"On-policy distillation produces coordinate-sparse, FFN-heavy updates that are full-rank but spectrally concentrated away from principal singular subspaces and near-zero source weights.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06021","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OPRD: On-Policy Representation Distillation","primary_cat":"cs.LG","submitted_at":"2026-06-04T11:13:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OPRD performs distillation in hidden-state space on on-policy data for deterministic gradients and better math benchmark performance, plus OPRD-Bridge for cross-architecture transfer via low-rank projectors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00869","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Enhancing LLM Metacognition via Cognitive Pairwise Training","primary_cat":"cs.LG","submitted_at":"2026-05-30T19:53:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CPT is introduced as a pairwise reasoning-trace comparison stage that improves the reasoning-metacognition trade-off over standard SFT+RL pipelines across model 
scales.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11775","ref_index":10,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control","primary_cat":"cs.LG","submitted_at":"2026-05-12T08:47:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Entropy polarity is a signed token-level quantity derived from a first-order approximation of entropy change that predicts whether RL updates expand or contract policy entropy in LLM fine-tuning, revealing an asymmetry between high- and low-probability tokens.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"After a short warmup with neutral weights, we record the warmup slope and entropy as the reference rate $s_{\\mathrm{ref}} < 0$ and reference level $h_{\\mathrm{ref}}$, respectively. By rescaling the current slope against $s_{\\mathrm{ref}}$, we instantiate a normalized progress metric $p_k$ that measures how much the current entropy decay has recovered from the warmup contraction rate: $p_k = \\mathrm{clip}\\left(\\frac{s_k - s_{\\mathrm{ref}}}{-s_{\\mathrm{ref}} + \\varepsilon}, 0, 1\\right)$. (10) Smaller $p_k$ indicates ongoing entropy collapse, while larger $p_k$ indicates recovery. 
We then map $p_k$ to polarity weights through a quadratic rule with reciprocal coupling: $\\omega_{\\mathrm{neg}}(k) = \\omega_{\\min} + (\\omega_{\\max} - \\omega_{\\min}) p_k^2$, $\\omega_{\\mathrm{pos}}(k) = \\frac{1}{\\omega_{\\mathrm{neg}}(k)}$. (11) When $p_k \\approx 0$, the controller protects exploration by suppressing entropy-contracting updates and strengthening entropy-expanding ones; as $p_k$ increases, this bias relaxes toward neutral weighting"},{"citing_arxiv_id":"2605.11739","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation","primary_cat":"cs.CL","submitted_at":"2026-05-12T08:19:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"On-policy distillation gains efficiency from early foresight in module allocation and update directions, which the proposed EffOPD method exploits for 3x faster training with comparable performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11636","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-12T06:58:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[21] Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025. 
[22] Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models. arXiv preprint arXiv:2505.24864, 2025. [23] Bingxiang He, Zekai Qu, Zeyuan Liu, Yinghao Chen, Yuxin Zuo, Cheng Qian, Kaiyan Zhang, Weize Chen, Chaojun Xiao, Ganqu Cui, et al. Justrl: Scaling a 1.5 b llm with a simple rl recipe. arXiv preprint arXiv:2512.16649, 2025. [24] Wenhao Yu, Zhenwen Liang, Chengsong Huang, Kishan Panaganti, Tianqing Fang, Haitao Mi, and Dong Yu. Guided self-evolving llms with minimal human supervision."},{"citing_arxiv_id":"2605.09725","ref_index":16,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On-Policy Distillation with Best-of-N Teacher Rollout Selection","primary_cat":"cs.CV","submitted_at":"2026-05-10T19:49:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06326","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-07T14:23:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and optimizing for pass@k during SFT before stable 
RLVR.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7376-7399, 2025. [14] Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, et al. Openthoughts: Data recipes for reasoning models. arXiv preprint arXiv:2506.04178, 2025. [15] Bingxiang He, Zekai Qu, Zeyuan Liu, Yinghao Chen, Yuxin Zuo, Cheng Qian, Kaiyan Zhang, Weize Chen, Chaojun Xiao, Ganqu Cui, et al. Justrl: Scaling a 1.5 b llm with a simple rl recipe. arXiv preprint arXiv:2512.16649, 2025. [16] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica."},{"citing_arxiv_id":"2604.18936","ref_index":210,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fine-Tuning Small Reasoning Models for Quantum Field Theory","primary_cat":"cs.LG","submitted_at":"2026-04-21T00:21:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Scaling Relationship on Learning Mathematical Reasoning with Large Language Models. 2023. eprint: arXiv:2308.01825. [208] Edward J. Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. 2021. arXiv:2106.09685 [cs.CL]. url: https://arxiv.org/abs/2106.09685. [209] Shangshang Wang et al. Tina: Tiny Reasoning Models via LoRA. 2025. arXiv:2504.15777 [cs.CL]. url: https://arxiv.org/abs/2504.15777. [210] John Schulman and Thinking Machines Lab. \"LoRA Without Regret\". 
In: Thinking Machines Lab: Connectionism (2025). https://thinkingmachines.ai/blog/lora/. doi: 10.64434/tml.20250929. [211] Tianzhe Chu et al. SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training. 2025. arXiv:2501.17161 [cs.AI]. url: https://arxiv.org/abs/2501."},{"citing_arxiv_id":"2604.13016","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe","primary_cat":"cs.LG","submitted_at":"2026-04-14T17:54:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Overlap-Token Advantage. To measure distributional agreement within the overlap tokens, we define $A_t(v) \\triangleq \\bar{p}_t(v) \\left(\\log \\bar{q}_t(v) - \\log \\bar{p}_t(v)\\right)$, where $\\bar{p}_t, \\bar{q}_t$ are the renormalized student and teacher distributions over $S_t^{(p)} \\cap S_t^{(q)}$. The metric averages this quantity: $M_{\\mathrm{adv}} \\triangleq \\mathbb{E}_t \\left[ \\frac{1}{|S_t^{(p)} \\cap S_t^{(q)}|} \\sum_{v \\in S_t^{(p)} \\cap S_t^{(q)}} A_t(v) \\right]$. (7) A value close to zero indicates high-quality alignment where the student places mass on teacher-preferred tokens with appropriate confidence. Conversely, a large negative value indicates that within the intersection, the student is overconfident compared to the teacher (high $p_t$ but lower $q_t$). 
Entropy and Entropy Gap. To monitor the distributional properties of the policies, we track the"},{"citing_arxiv_id":"2602.15620","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens","primary_cat":"cs.CL","submitted_at":"2026-02-17T14:46:48+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.07389","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training","primary_cat":"cs.LG","submitted_at":"2026-01-12T10:14:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SFT and RL cannot be decoupled in LLM post-training because each step increases the loss or lowers the reward of the prior step under KL and PL analyses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}