{"total":12,"items":[{"citing_arxiv_id":"2605.22567","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance","primary_cat":"cs.CL","submitted_at":"2026-05-21T14:47:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LANG combines language-adaptive hint guidance, progressive decay, and difficulty-tailored learning horizons in RL to boost non-English reasoning performance while preserving language consistency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12004","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning Agentic Policy from Action Guidance","primary_cat":"cs.CL","submitted_at":"2026-05-12T11:54:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Classical RLfD methods often use demonstration trajectories to bootstrap exploration in sparse-reward settings, for example by retaining them in the replay buffer and combining RL updates with auxiliary imitation losses [21, 46, 58]. Following a similar intuition, several recent LLM studies incorporate off-policy expert trajectories into online RL to mitigate sparse-reward and hard-exploration challenges [18, 32, 37, 78]. Specifically, LUFFY [66] incorporates off-policy expert trajectories into online RL through mixed-policy optimization, using regularized importance shaping to avoid rigid imitation. 
Guide [42] utilizes adaptive hint-guided off-policy trajectories into online RL, reweighting them to improve exploration while training a policy that no longer relies on hints at"},{"citing_arxiv_id":"2605.08401","ref_index":17,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AIPO: Learning to Reason from Active Interaction","primary_cat":"cs.CL","submitted_at":"2026-05-08T19:06:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, Kalyan Vasuden Alwala, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, and et al. The llama 3 herd of models. CoRR, abs/2407.21783, 2024. 1, 4.1 [17] Yuqian Fu, Tinghong Chen, Jiajun Chai, Xihuai Wang, Songjun Tu, Guojun Yin, Wei Lin, Qichao Zhang, Yuanheng Zhu, and Dongbin Zhao. SRFT: A single-stage method with supervised and reinforcement fine-tuning for reasoning. CoRR, abs/2506.19767, 2025. 
1, 1, 2, 5, D.1 [18] Hongcheng Gao, Yue Liu, Yufei He, Longxu Dou, Chao Du, Zhijie Deng, Bryan Hooi, Min Lin,"},{"citing_arxiv_id":"2605.00610","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors","primary_cat":"cs.LG","submitted_at":"2026-05-01T12:20:44+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DoTS decouples SFT and RLVR training then synthesizes their task vectors at inference time to match integrated training results at ~3% compute cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20733","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Near-Future Policy Optimization","primary_cat":"cs.LG","submitted_at":"2026-04-22T16:20:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating convergence.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"A natural response is to enrich the learning signal by mixing in auxiliary trajectories from other sources, moving from pure on-policy updates to a mixed-policy regime. Recent work has explored this direction along two lines: importing stronger traces from outside the current policy through off-policy demonstrations [25], expert prefixes [9], or interleaved supervised corrections [5, 16]; or reusing successful trajectories produced during training itself, as in experience replay and restart-style methods [35, 36]. Yet both lines face a common tension. 
External trajectories carry rich signal but diverge in reasoning patterns from the current policy, making them difficult to internalize. Replayed trajectories stay close to the on-policy"},{"citing_arxiv_id":"2604.14054","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"$\\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data","primary_cat":"cs.LG","submitted_at":"2026-04-15T16:34:39+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"16149, 2025. [3] Lili Chen, Mihir Prabhudesai, Katerina Fragkiadaki, Hao Liu, and Deepak Pathak. Self-questioning language models. arXiv preprint arXiv:2508.03682, 2025. [4] Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024. [5] Yuqian Fu, Tinghong Chen, Jiajun Chai, Xihuai Wang, Songjun Tu, Guojun Yin, Wei Lin, Qichao Zhang, Yuanheng Zhu, and Dongbin Zhao. Srft: A single-stage method with supervised and reinforcement fine-tuning for reasoning. arXiv preprint arXiv:2506.19767, 2025. 
[6] Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, and Yi Wu."},{"citing_arxiv_id":"2603.11321","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings","primary_cat":"cs.LG","submitted_at":"2026-03-11T21:33:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HAPO adds a hindsight-anchored SSI operator with Thompson gating to GRPO-style RLVR, achieving asymptotic consistency that recovers unbiased on-policy gradients as the policy improves.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.11470","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning","primary_cat":"cs.LG","submitted_at":"2025-12-12T11:13:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Sequential SFT followed by RL, guided by the Plasticity-Ceiling Framework, achieves higher performance ceilings in LLM mathematical reasoning than synchronized methods by optimizing data scale and transition timing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.08827","ref_index":148,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey of Reinforcement Learning for Large Reasoning Models","primary_cat":"cs.CL","submitted_at":"2025-09-10T17:59:43+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since 
DeepSeek-R1.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.13755","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration","primary_cat":"cs.LG","submitted_at":"2025-08-19T11:51:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DARS adaptively increases rollouts on hard problems in RLVR to improve Pass@K, and when paired with batch scaling for breadth, achieves gains in both Pass@K and Pass@1 by treating depth and breadth as complementary exploration dimensions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.07809","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EvoCoT: Overcoming the Exploration Bottleneck in Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2025-08-11T09:49:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EvoCoT uses self-generated and verified CoT trajectories in a two-stage curriculum to let LLMs learn from initially unsolved hard problems in RLVR settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.09567","ref_index":193,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models","primary_cat":"cs.AI","submitted_at":"2025-03-12T17:35:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and 
reflection, and discussing emergence phenomena and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Generation (RAG) techniques enhance LLMs by integrating dynamic knowledge retrieval and document refinement [418, 811, 221, 322, 827, 1103, 1100, 592, 438]. Research has combined RAG with reasoning modules to improve performance on complex tasks [726, 329, 474, 861, 88, 1060, 616]. O1 Embedder [919] optimizes multi-task retrieval and reasoning through synthetic data training. Furthermore, Stream of Search (SoS) [193], and CoRAG [786] boost search accuracy and addresses unresolved issues by incorporating more natural reflection and exploration in RAG. (2) Model Knowledge Injection: An alternative approach involves integrating additional knowledge during SFT or RL [496, 1031, 124, 1132]. Specifically, HuatuoGPT-o1 [83] utilize the R1-like paradigm to train LLMs by model-judged reward RL, which significantly improves the medical knowledge during"}],"limit":50,"offset":0}