{"total":17,"items":[{"citing_arxiv_id":"2605.29303","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models","primary_cat":"cs.AI","submitted_at":"2026-05-28T03:36:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"EKSFT masks high-entropy or high-KL tokens in low-data SFT to preserve pre-trained distribution and improve downstream RL performance on math reasoning tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26184","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training","primary_cat":"cs.LG","submitted_at":"2026-05-25T07:52:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GAC derives adaptive mixing weights for SFT-RL hybrid post-training from online gradient variance and signal disagreement estimates, improving benchmark performance over fixed schedules with under 1% overhead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22567","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance","primary_cat":"cs.CL","submitted_at":"2026-05-21T14:47:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LANG combines language-adaptive hint guidance, progressive decay, and difficulty-tailored learning horizons in RL to boost non-English reasoning performance while preserving language 
consistency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13230","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Teacher-Guided Policy Optimization for On-Policy Reasoning Distillation under Large Policy Divergence","primary_cat":"cs.LG","submitted_at":"2026-05-13T09:20:03+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12004","ref_index":79,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning Agentic Policy from Action Guidance","primary_cat":"cs.CL","submitted_at":"2026-05-12T11:54:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The landscape of agentic reinforcement learning for llms: A survey. arXiv preprint arXiv:2509.02547, 2025. [78] Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, and Jingren Zhou. On-policy rl meets off-policy experts: Harmonizing supervised fine-tuning and reinforcement learning via dynamic weighting. arXiv preprint arXiv:2508.11408, 2026. [79] Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026. [80] Haizhong Zheng, Jiawei Zhao, and Beidi Chen. 
Prosperity before collapse: How far can off-policy rl reach with stale data on llms? arXiv preprint arXiv:2510."},{"citing_arxiv_id":"2605.08401","ref_index":77,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AIPO: Learning to Reason from Active Interaction","primary_cat":"cs.CL","submitted_at":"2026-05-08T19:06:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"We propose to expand the reasoning boundary via active interaction. To overcome this limitation, recent studies seek to enhance model performance and expand capability boundaries by leveraging guidance from stronger external expert models such as expert trajectories [13] or critiques [39], primarily through supervised fine-tuning [6, 80] or offline reinforcement learning [77, 78] with expert demonstrations, as illustrated in Figure 1 (B) [45]. However, these methods typically depend on complete expert trajectories, which are costly to sample, information-sparse, and often redundant for training. 
Moreover, full-trajectory learning provides only coarse-grained supervision, offering limited fine-grained guidance for identifying and resolving intermediate reasoning"},{"citing_arxiv_id":"2605.06326","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-07T14:23:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and optimizing for pass@k during SFT before stable RLVR.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"lizes, and tool calls more often provide useful intermediate evidence. Correspondingly, both pass@1 and pass@8 improve, indicating that the model moves beyond exploiting the superficial signal of tool-call frequency and begins to learn the substance of TIR, adapting tool use to its reasoning prior. This resembles the degradation-recovery dynamics observed in long-CoT SFT [46, 24] (stage 2). Finally, after TIR behavior has been sufficiently internalized and useful supervision diminishes, the following SFT gradually overfits the teacher-side noise (e.g., rollout length), leading the performance to saturate or even decline (stage 3). 
5.2 Identifying RL-Ready SFT Checkpoints The learning pattern observed in the student model informs our choice of the SFT endpoint."},{"citing_arxiv_id":"2604.19945","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Visual Reasoning through Tool-supervised Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2026-04-21T19:48:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ToolsRL trains MLLMs via a tool-specific then accuracy-focused RL curriculum to master visual tools for complex reasoning tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18530","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-04-20T17:26:00+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13010","ref_index":38,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation","primary_cat":"cs.LG","submitted_at":"2026-04-14T17:44:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"reason under off-policy guidance.arXiv preprint arXiv:2504.14945, 2025. 
[37] Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Yanhao Li, Bin Cui, and Wentao Zhang. Learning what reinforcement learning can't: Interleaved online fine-tuning for hardest questions. arXiv preprint arXiv:2506.07527, 2026. [38] Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, and Jingren Zhou. On-policy RL meets off-policy experts: Harmonizing supervised fine-tuning and reinforcement learning via dynamic weighting. arXiv preprint arXiv:2508.11408, 2025. [39] Liang Chen, Xueting Han, Li Shen, Jing Bai, and Kam-Fai Wong. Beyond two-stage training: Cooperative SFT"},{"citing_arxiv_id":"2603.22267","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TiCo: Time-Controllable Spoken Dialogue Model","primary_cat":"cs.CL","submitted_at":"2026-03-23T17:51:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.11321","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings","primary_cat":"cs.LG","submitted_at":"2026-03-11T21:33:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HAPO adds a hindsight-anchored SSI operator with Thompson gating to GRPO-style RLVR, achieving asymptotic consistency that recovers unbiased on-policy gradients as the policy 
improves.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.11470","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning","primary_cat":"cs.LG","submitted_at":"2025-12-12T11:13:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Sequential SFT followed by RL, guided by the Plasticity-Ceiling Framework, achieves higher performance ceilings in LLM mathematical reasoning than synchronized methods by optimizing data scale and transition timing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.17652","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TeamPath: Building MultiModal Pathology Experts with Reasoning AI Copilots","primary_cat":"q-bio.QM","submitted_at":"2025-11-20T13:29:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TeamPath introduces a reinforcement-learning-powered multimodal AI copilot for pathology that generates reasoned diagnoses and integrates image and transcriptomic data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.10606","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models","primary_cat":"cs.CV","submitted_at":"2025-10-12T13:42:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ViSurf unifies SFT and RLVR for LVLMs in one training stage by injecting ground-truth labels into rollouts and applying novel reward controls, 
outperforming standalone and two-stage baselines on diverse benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.23352","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dynamic-TreeRPO: Breaking the Independent Trajectory Bottleneck with Structured Sampling","primary_cat":"cs.CV","submitted_at":"2025-09-27T14:59:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Dynamic-TreeRPO replaces independent trajectory sampling with a tree-structured search using dynamic noise intensities and integrates SFT into RL via a weighted Progress Reward Model to achieve better semantic consistency and efficiency in text-to-image generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.02547","ref_index":65,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Landscape of Agentic Reinforcement Learning for LLMs: A Survey","primary_cat":"cs.AI","submitted_at":"2025-09-02T17:46:26+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"CHORD [63] 2025 Weighted sum of GRPO's and Supervised Fine-Tuning losses Yes Yes Reframe Supervised Fine-Tuning as a dynamically weighted auxiliary objective within the on-policy RL process Group-based reward PAPO [64] 2025 Surrogate of GRPO's Yes Yes Encourage learning to perceive while learning to reason through the Implicit Perception Loss Group-based reward Pass@k Training [65] 2025 Same as GRPO's Yes Yes 
Pass@k metric as the reward to continually train a model Group-based reward This group-relative approach is highly sample-efficient and reduces computational overhead. Consequently, a series of novel algorithms derived from the GRPO framework have been subsequently proposed (see Table 2), aiming to substantially enhance both the sample efficiency and asymptotic performance of reinforcement"}],"limit":50,"offset":0}