{"total":12,"items":[{"citing_arxiv_id":"2605.19425","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR","primary_cat":"cs.LG","submitted_at":"2026-05-19T06:23:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Dynamic Gradient Gating monitors lm_head gradient norms to safely reuse rollout batches in RLVR, achieving up to 2.93x sample efficiency and 2.14x wall-clock speedup across math, ALFWorld, WebShop, and QA tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11922","ref_index":78,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning","primary_cat":"cs.SE","submitted_at":"2026-05-12T10:36:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07353","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Confidence-Aware Alignment Makes Reasoning LLMs More Reliable","primary_cat":"cs.AI","submitted_at":"2026-05-08T07:08:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both 
reliability and speed on reasoning benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"As illustrated in Figure 1, our framework operates in two interconnected phases: (i) Confidence-Aware Preference Optimization, which aligns model uncertainty with step-wise correctness through iterative DPO, and (ii) Confidence-aware Thought (CaT) Inference, which leverages this calibrated uncertainty to dynamically navigate and prune the reasoning tree. 3.1 Motivation and Problem Formulation Recent progress [27, 44, 59] in LRMs have highlighted a critical tension: sampling multiple reasoning paths boosts performance via diversity, but often introduces plausible yet hallucinated steps. Existing paradigms primarily rely on compute-intensive external verifiers or large-scale sampling, which introduce substantial inference overhead and provide limited insight into the model's intrinsic"},{"citing_arxiv_id":"2605.04811","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Tree-based Credit Assignment for Multi-Agent Memory System","primary_cat":"cs.MA","submitted_at":"2026-05-06T12:02:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TreeMem assigns credit to agents in multi-agent memory systems by expanding outputs into a tree and using Monte Carlo averaging of final rewards to optimize each agent's policy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27859","ref_index":44,"ref_count":3,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking Agentic Reinforcement Learning In Large Language Models","primary_cat":"cs.AI","submitted_at":"2026-04-30T13:43:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper reviews conceptual foundations, 
methodological innovations, effective designs, critical challenges, and future directions for LLM-based Agentic Reinforcement Learning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"outlier tokens exhibiting extreme importance ratios, is effectively mitigated by GMPO [124]. By shifting the optimization objective from the arithmetic to the geometric mean of token-level rewards, GMPO offers a plug-and-play stabilization mechanism that is inherently robust to reward outliers. Complementing this, TreePO [44] revolutionizes the rollout phase by conceptualizing generation as a tree-structured search; its dynamic sampling policy utilizes local uncertainty to spawn additional branches, addressing the exploration limitations of costly on-policy rollouts. Further extending the RL paradigm beyond text, PAPO [96] confronts the suboptimality of standard RLVR in multimodal contexts."},{"citing_arxiv_id":"2604.18292","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence","primary_cat":"cs.AI","submitted_at":"2026-04-20T14:01:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"23383, 2025. doi: 10.48550/ARXIV.2503.23383. URL https://doi.org/10.48550/arXiv.2503.23383. [52] Yixia Li, Hongru Wang, Jiahao Qiu, Zhenfei Yin, Dongdong Zhang, Cheng Qian, Zeping Li, Pony Ma, Guanhua Chen, Heng Ji, and Mengdi Wang. 
From word to world: Can large language models be implicit text-based world models?, 2025. URL https://arxiv.org/abs/2512.18832. [53] Yizhi Li, Qingshui Gu, Zhoufutu Wen, Ziniu Li, Tianshun Xing, Shuyue Guo, Tianyu Zheng, Xin Zhou, Xingwei Qu, Wangchunshu Zhou, et al. Treepo: Bridging the gap of policy optimization and efficacy and inference efficiency with heuristic tree-based modeling. arXiv preprint arXiv:2508.17445, 2025. [54] Yuetai Li, Huseyin A Inan, Xiang Yue, Wei-Ning Chen, Lukas Wutschitz, Janardhan Kulkarni, Radha Poovendran,"},{"citing_arxiv_id":"2604.14564","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MARS$^2$: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation","primary_cat":"cs.AI","submitted_at":"2026-04-16T02:52:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MARS² integrates multi-agent collaboration with tree-structured search in RL to boost code generation by increasing exploratory diversity and using path-level group advantages for credit assignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02913","ref_index":70,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-04-08T00:53:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning 
LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.21619","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On the Overscaling Curse of Parallel Thinking: System Efficacy Contradicts Sample Efficiency","primary_cat":"cs.LG","submitted_at":"2026-01-29T12:22:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Parallel thinking in LLMs suffers from overscaling where fixed global budgets waste samples; LanBo predicts per-sample budgets from latent states to raise utilization without hurting accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.00413","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Tree Training: Accelerating Agentic LLMs Training via Shared Prefix Reuse","primary_cat":"cs.LG","submitted_at":"2025-11-01T05:56:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Tree Training serializes tree trajectories via DFS and uses redundancy-free partitioning to compute weighted per-token losses exactly once per token, achieving up to 6.2x training speedup on dense and MoE models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.08827","ref_index":291,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey of Reinforcement Learning for Large Reasoning Models","primary_cat":"cs.CL","submitted_at":"2025-09-10T17:59:43+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since 
DeepSeek-R1.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.02547","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Landscape of Agentic Reinforcement Learning for LLMs: A Survey","primary_cat":"cs.AI","submitted_at":"2025-09-02T17:46:26+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"function Yes Yes Apply a clipped bias directly to advantage function Group-based reward TreePo [60] 2025 Same as GRPO's Yes Yes Self-guided policy rollout for reducing the compute burden Group-based reward EDGE-GRPO [61] 2025 Same as GRPO's Yes Yes Entropy-driven advantage and guided error correction to mitigate advantage collapse Group-based reward DARS [62] 2025 Same as GRPO's Yes No Reallocate compute from medium-difficulty to the hardest problems via multi-stage rollout sampling Group-based reward CHORD [63] 2025 Weighted sum of GRPO's and Supervised Fine-Tuning losses Yes Yes Reframe Supervised Fine-Tuning as a dynamically weighted auxiliary objective within the on-policy RL process Group-based reward"}],"limit":50,"offset":0}