{"total":13,"items":[{"citing_arxiv_id":"2605.22240","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Unlocking Proactivity in Task-Oriented Dialogue","primary_cat":"cs.AI","submitted_at":"2026-05-21T09:46:25+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17877","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization","primary_cat":"cs.AI","submitted_at":"2026-05-18T05:39:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PAIR combines a hidden-state probe with an attention correction to deliver robust step-level rewards for GRPO-based optimization of multi-turn LLM agents, achieving high AUROC on contaminated trajectories at low cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10674","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Step Rejection Fine-Tuning: A Practical Distillation Recipe","primary_cat":"cs.LG","submitted_at":"2026-05-11T14:55:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Step Rejection Fine-Tuning masks loss on erroneous steps identified by a critic LLM in unresolved trajectories, raising SWE-bench Verified resolution rate by 3.7% to 32.2% versus 2.4% for trajectory-level rejection.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"9%), as the model internalizes errors present in the failed attempts. 
However, by applying critic-guided masking, we not only mitigate this degradation but achieve a performance gain, outperforming the RFT baseline (32.2% vs 30.9%). This improvement is statistically significant; a bootstrap analysis confirms a gain of 1.3% with a 95% confidence interval of [0.4, 2.3] (refer to Section B for detailed statistical analysis). 5 Limitations and Future Work Our approach relies on the accuracy of the critic. Mislabeling valid steps as harmful can reduce the effective training data, while failing to identify subtle errors can allow them to propagate into the student model. We leave a thorough study of different critics and labeling approaches to future work."},{"citing_arxiv_id":"2605.08334","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SalesSim: Benchmarking and Aligning Multimodal Language Models as Retail User Simulators","primary_cat":"cs.CL","submitted_at":"2026-05-08T17:59:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SalesSim benchmarks MLLMs as retail user simulators, finds gaps in persona adherence and over-persuasion, and introduces UserGRPO RL to raise decision alignment by 13.8%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27955","ref_index":88,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GUI Agents with Reinforcement Learning: Toward Digital Inhabitants","primary_cat":"cs.AI","submitted_at":"2026-04-30T14:51:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a future 
roadmap.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07645","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent","primary_cat":"cs.AI","submitted_at":"2026-04-08T23:11:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05529","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ActivityEditor: Learning to Synthesize Physically Valid Human Mobility","primary_cat":"cs.AI","submitted_at":"2026-04-07T07:28:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ActivityEditor introduces a dual-LLM-agent system with reinforcement learning that produces statistically faithful and physically valid human mobility trajectories in zero-shot cross-regional settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02869","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration","primary_cat":"cs.AI","submitted_at":"2026-04-03T08:36:03+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Iterative Reward Calibration with MT-GRPO and GTPO enables effective multi-turn RL for tool-calling agents, raising Tau-Bench success from 63.8% to 66.7% for a 
4B model and from 58.0% to 69.5% for a 30B model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.06475","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Generalizable Reasoning: Group Causal Counterfactual Policy Optimization for LLM Reasoning","primary_cat":"cs.LG","submitted_at":"2026-02-06T08:03:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Group Causal Counterfactual Policy Optimization trains LLMs on generalizable reasoning by defining episodic rewards for counterfactual robustness and transferability then optimizing the policy with token-level advantages.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.12538","ref_index":235,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agentic Reasoning for Large Language Models","primary_cat":"cs.AI","submitted_at":"2026-01-18T18:58:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Recent studies [234, 207, 235, 205, 236, 27, 237, 206] leverages reinforcement learning (RL) during model post-training to go beyond imitation and achieve mastery in tool-integrated reasoning. With the integration of RL, models refine their tool-use strategies through outcome-driven rewards, learning when, how, and which tools to invoke via trial and error [205, 238, 206, 239]. 
For instance, SWE-RL [235] optimizes code-editing policies on large-scale software evolution data, improving not only software issue resolution but also general reasoning skills. ReSearch [205] embeds search operations into multi-hop reasoning chains, enabling adaptive retrieval during complex QA. ReTool integrates real-time code execution into reasoning rollouts, leading to optimal performance on advanced math reasoning benchmarks."},{"citing_arxiv_id":"2511.00413","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Tree Training: Accelerating Agentic LLMs Training via Shared Prefix Reuse","primary_cat":"cs.LG","submitted_at":"2025-11-01T05:56:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Tree Training serializes tree trajectories via DFS and uses redundancy-free partitioning to compute weighted per-token losses exactly once per token, achieving up to 6.2x training speedup on dense and MoE models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.19225","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs","primary_cat":"cs.DC","submitted_at":"2025-10-22T04:19:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RLBoost harvests preemptible GPUs for RL rollout via a hybrid architecture with adaptive offload, pull-based transfer, and token-level migration, delivering 1.51x-1.97x throughput and 28-49% better cost efficiency than on-demand-only setups.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.02547","ref_index":169,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Landscape of 
Agentic Reinforcement Learning for LLMs: A Survey","primary_cat":"cs.AI","submitted_at":"2025-09-02T17:46:26+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"itself, this approach moves beyond simple self-correction and toward a state of continuous self-improvement in the learning process, representing a crucial step toward agents that can not only solve problems but also autonomously enhance their fundamental capacity to learn from experience. 3.5. Reasoning Reasoning in large language models can be broadly categorized into fast reasoning and slow reasoning, following the dual-process cognitive theory [169, 24]. Fast reasoning corresponds to rapid, heuristic-driven inference with minimal intermediate steps, while slow reasoning emphasizes deliberate, structured, and multi-step reasoning. Understanding the trade-offs between these two paradigms is crucial for designing models that balance efficiency and accuracy in complex problem-solving. Fast Reasoning: Intuitive and Efficient Inference Fast reasoning models operate in a manner analogous"}],"limit":50,"offset":0}