{"total":56,"items":[{"citing_arxiv_id":"2605.23590","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents","primary_cat":"cs.AI","submitted_at":"2026-05-22T12:59:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Co-ReAct adds step-level rubric guidance to ReAct agents via a GRPO-trained generator using list-wise ranking rewards, yielding consistent gains on DeepResearchBench and SQA-CS-V2.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22905","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EVE-Agent: Evidence-Verifiable Self-Evolving Agents","primary_cat":"cs.AI","submitted_at":"2026-05-21T17:47:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EVE-Agent adds an evidence verifier to the proposer-solver loop that rewards spans by marginal accuracy gain, producing self-generated but inspectable training examples for search agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22511","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-21T14:00:57+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17946","ref_index":62,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical 
Domain","primary_cat":"cs.AI","submitted_at":"2026-05-18T07:03:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practical agents, and oracle knowledge.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13534","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging","primary_cat":"cs.AI","submitted_at":"2026-05-13T13:46:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MultiSearch uses parallel multi-query retrieval plus explicit merging inside a reinforcement-learning loop to improve retrieval-augmented reasoning, outperforming baselines on seven QA benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12975","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation","primary_cat":"cs.AI","submitted_at":"2026-05-13T04:14:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11611","ref_index":14,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CuSearch: Curriculum Rollout Sampling via 
Search Depth for Agentic RAG","primary_cat":"cs.AI","submitted_at":"2026-05-12T06:42:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CuSearch reallocates rollout budget in RLVR toward deeper-search trajectories as a proxy for retrieval supervision density, yielding up to 11.8 exact-match gains over uniform GRPO sampling on ZeroSearch.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Once advanced, the phase never decreases, which prevents oscillation and enforces a monotonic progression of the search-depth curriculum. C.3.2 Concrete Examples When Smax = 5, the target allocations and priorities for each phase are as follows. At phase 0, target lens= [0,K, 0, 0, 0, 0],priorities= [6, 1, 2, 3, 4, 5]. (13) At phase 1, target lens= [0, 0,K, 0, 0, 0],priorities= [6, 5, 1, 2, 3, 4]. (14) At phase 2, target lens= [0, 0, 0,K, 0, 0],priorities= [6, 5, 4, 1, 2, 3]. (15) At phase 3, target lens= [0, 0, 0, 0,K, 0],priorities= [6, 5, 4, 3, 1, 2]. (16) At phase 4, target lens= [0, 0, 0, 0, 0,K],priorities= [6, 5, 4, 3, 2, 1]. 
(17) Note that at phase 4, the target allocation and priority ordering coincide with SDGA-Auto, as the curriculum has reached the deepest available search depth."},{"citing_arxiv_id":"2605.09931","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-11T03:28:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PruneTIR prunes erroneous tool-call trajectories during LLM inference via three trigger-based components to raise Pass@1 accuracy and efficiency while shortening context.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09038","ref_index":27,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks","primary_cat":"cs.AI","submitted_at":"2026-05-09T16:23:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SearchSkill improves exact match scores and retrieval efficiency on open-domain QA by conditioning LLM actions on skills from an evolving SkillBank updated from failure patterns via two-stage SFT.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"augmented generation (RAG) augments generation with external passages for knowledge-intensive prediction [15]. Tool-use methods such as Toolformer [25] and ReAct [35] further enable models to interleave reasoning with external actions. 
More recent search-centered systems train this capability more directly: Search-R1 [11] optimizes long-horizon search behavior with reinforcement learning, while ZeroSearch [27] incentivizes search capability without relying on live search during training. These advances have substantially improved LLM-retriever interaction, but they also expose a practical bottleneck.Challenge 1: most existing methods teach the model to search, but devote much less modeling capacity to how to formulate high-quality search queries.In multi-hop"},{"citing_arxiv_id":"2605.08401","ref_index":59,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AIPO: Learning to Reason from Active Interaction","primary_cat":"cs.CL","submitted_at":"2026-05-08T19:06:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"πθold (τi,t|τi,<t) , and πθold denotes the policy from the previous iteration. Directly applying Equation (1) to external tokens τϵ may introduce off-policy bias due to distributional mismatch between the policy model and external collaborators, potentially destabilizing training [54, 78]. Conventional approaches often avoid this issue by excluding external tokens 4 from the policy loss [59, 41]. In contrast, AIPO explicitly incorporates external tokens into policy optimization, enabling the policy model to acquire useful knowledge and reasoning patterns from collaborators. 
Amending the Importance Sampling Coefficient for External Tokens.To mitigate off-policy errors, we introduce a modified importance sampling coefficient ˜ρfor external tokens."},{"citing_arxiv_id":"2605.07725","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SOD: Step-wise On-policy Distillation for Small Language Model Agents","primary_cat":"cs.CL","submitted_at":"2026-05-08T13:30:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"arXiv preprint arXiv:2505.13820, 2025. [17] Xuefeng Li, Haoyang Zou, and Pengfei Liu. Torl: Scaling tool-integrated rl.arXiv preprint arXiv:2503.23383, 2025. 11 [18] Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning.arXiv preprint arXiv:2503.09516, 2025. [19] Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning.arXiv preprint arXiv:2503.05592, 2025. 
[20] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al."},{"citing_arxiv_id":"2605.06285","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG","primary_cat":"cs.CL","submitted_at":"2026-05-07T13:56:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LatentRAG performs agentic RAG by generating latent tokens for thoughts and subqueries in one forward pass, matching explicit methods' accuracy on seven benchmarks while reducing latency by ~90%.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"[76] as the corpus for retrieval. More details of the datasets can be found in Appendix C. 7 Baselines.We compare LatentRAG against a diverse set of baselines covering direct inference (Direct Infer), traditional single-step RAG (Naive RAG [ 10]), prompt-based agentic RAG (Iter- RetGen [77], Search-o1 [18]), and training-based agentic RAG (R1-Searcher [45], ZeroSearch [78], DeepRAG [24], Search-R1 [19], AutoRefine [35]). Implementation details.Following previous works [ 19, 35], we adopt Qwen2.5-7B [79] as the default LLM for all methods. For training-based baselines, we utilize their published model weights to ensure the faithful reproduction of their reported performance. 
Training trajectories are constructed"},{"citing_arxiv_id":"2605.01248","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"$S^3$-R1: Learning to Retrieve and Answer Step-by-Step with Synthetic Data","primary_cat":"cs.LG","submitted_at":"2026-05-02T05:01:05+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00072","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"XekRung Technical Report","primary_cat":"cs.CR","submitted_at":"2026-04-30T11:50:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"XekRung achieves state-of-the-art performance on cybersecurity benchmarks among same-scale models via tailored data synthesis and multi-stage training while retaining strong general capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20659","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning","primary_cat":"cs.LG","submitted_at":"2026-04-22T15:08:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GRPO-VPS improves GRPO by using segment-wise conditional probabilities of the correct answer to supply process-level feedback, yielding up to 2.6-point accuracy gains and 13.7% shorter reasoning on math tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19264","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DR-MMSearchAgent: Deepening 
Reasoning in Multimodal Search Agents","primary_cat":"cs.CV","submitted_at":"2026-04-21T09:28:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18292","ref_index":87,"ref_count":4,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence","primary_cat":"cs.AI","submitted_at":"2026-04-20T14:01:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.","context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"thought reasoning with self-evolving rubrics.CoRR, abs/2602.10885, 2026. doi: 10.48550/ARXIV.2602.10885. URLhttps://doi.org/10.48550/arXiv.2602.10885. [86] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634-8652, 2023. [87] Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning.CoRR, abs/2503.05592, 2025. doi: 10.48550/ARXIV.2503.05592. URLhttps://doi.org/10.48550/arXiv.2503.05592. 
[88] Xiaoshuai Song, Haofei Chang, Guanting Dong, Yutao Zhu, Zhicheng Dou, and Ji-Rong Wen."},{"citing_arxiv_id":"2604.18235","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Negative Advantages Is a Double-Edged Sword: Calibrating advantages in GRPO for Search Agents","primary_cat":"cs.CL","submitted_at":"2026-04-20T13:21:19+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17337","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AutoSearch: Adaptive Search Depth for Efficient Agentic RAG via Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-04-19T09:05:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AutoSearch applies RL with a self-answering reward to adaptively determine minimal sufficient search depth in agentic RAG, reducing over-searching while maintaining answer quality on complex questions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15148","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning","primary_cat":"cs.AI","submitted_at":"2026-04-16T15:22:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"IG-Search computes step-level information gain rewards from policy probabilities to improve credit assignment in RL training for search-augmented QA, yielding 1.6-point gains over trajectory-level baselines on multi-hop 
tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14054","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"$\\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data","primary_cat":"cs.LG","submitted_at":"2026-04-15T16:34:39+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"engines to perform multi-turn retrieval and analysis for complex questions, emerging as a promising paradigm for information acquisition. Recent work has leveraged RL to further enhance both reason- ing and access to up-to-date knowledge, enabling LLMs to tackle complex tasks more effectively [28, 9]. Some agentic RL works, including Search-R1 [ 13], R1-Searcher [ 30], DeepResearcher [49], and ZeroSearch [31], further enhance question-answering capabilities but remain constrained by limited training data. To scale agentic RL, some pipelines [ 37, 17, 6] employ offline question- 3 synthesis strategies, yet they do not explicitly couple task generation with the evolving capability of the solver. 
In contrast, self-play enables search agents to jointly generate and solve tasks without"},{"citing_arxiv_id":"2604.12890","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Towards Long-horizon Agentic Multimodal Search","primary_cat":"cs.CV","submitted_at":"2026-04-14T15:40:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp and MMSearch-Plus.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"effectiveness of our end-to-end approach, scalable framework design, and data synthesis technique in advancing multimodal deep search agents. 11 References [1] Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Za- mani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning.arXiv preprint arXiv:2503.09516, 2025. [2] Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning.arXiv preprint arXiv:2503.05592, 2025. [3] Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. 
Search-o1: Agentic search-enhanced large reasoning models."},{"citing_arxiv_id":"2604.09455","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning","primary_cat":"cs.AI","submitted_at":"2026-04-10T16:14:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"E3-TIR integrates expert prefixes, guided branches, and self-exploration via mix policy optimization to deliver 6% better tool-use performance with under 10% of the usual synthetic data and 1.46x ROI.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08990","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ActFER: Agentic Facial Expression Recognition via Active Tool-Augmented Visual Reasoning","primary_cat":"cs.CV","submitted_at":"2026-04-10T05:53:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ActFER reformulates facial expression recognition as active tool-augmented visual reasoning with a custom reinforcement learning algorithm UC-GRPO that outperforms passive MLLM baselines on AU prediction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07927","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"EigentSearch-Q+: Enhancing Deep Research Agents with Structured Reasoning Tools","primary_cat":"cs.AI","submitted_at":"2026-04-09T07:47:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Structured query and evidence tools added to an AI research agent improve benchmark accuracy by 0.6 to 3.8 percentage 
points.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04017","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces","primary_cat":"cs.CL","submitted_at":"2026-04-05T08:29:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"and analysis:we benchmark 12 MLLMs and open-source agents, conduct single-tool ablations, and analyze milestone hit rates and an error taxonomy. 2 Related Work Agentic Multimodal Tool Use.Recent advances in autonomous web agents have demonstrated the potential of agentic reasoning with external tools, par- ticularly in open-domain information seeking and synthesis [26,34,42,52,56]. Extending such agents to multimodal settings further complicates reasoning, as models must integrate visual cues with textual knowledge and external verifica- tion. 
Prior work explores multimodal chain-of-thought prompting and structured visual reasoning [31,57], as well as grounding reasoning in external knowledge via multimodal Retrieval-Augmented Generation (RAG) [5,44,47]."},{"citing_arxiv_id":"2604.03675","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search","primary_cat":"cs.AI","submitted_at":"2026-04-04T10:23:46+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02794","ref_index":48,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CharTool: Tool-Integrated Visual Reasoning for Chart Understanding","primary_cat":"cs.AI","submitted_at":"2026-04-03T07:02:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.01496","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents","primary_cat":"cs.SE","submitted_at":"2026-04-02T00:11:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A two-stage SFT pipeline distills execution-free then execution-based trajectories from a 480B model into smaller Qwen2.5-Coder agents, yielding 62.2% resolution on SWE-bench Verified and 44.1% zero-shot on the multilingual 
version.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.01348","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Procedural Knowledge at Scale Improves Reasoning","primary_cat":"cs.CL","submitted_at":"2026-04-01T20:01:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks by up to 19.2%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04949","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning to Retrieve from Agent Trajectories","primary_cat":"cs.IR","submitted_at":"2026-03-30T17:59:02+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Retrievers trained on agent trajectories via the LRAT framework improve evidence recall, task success, and efficiency in agentic search benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.21440","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"KG-Hopper: Empowering Compact Open LLMs with Knowledge Graph Reasoning via Reinforcement Learning","primary_cat":"cs.CL","submitted_at":"2026-03-22T23:07:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"KG-Hopper uses RL to embed full multi-hop KG traversal and backtracking into a single LLM inference round, enabling a 7B model to outperform larger multi-step systems and compete with GPT-3.5/GPT-4o-mini on eight 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.16876","ref_index":105,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multi-Modal Multi-Agent Reinforcement Learning for Radiology Report Generation","primary_cat":"cs.CV","submitted_at":"2026-02-17T12:48:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MARL-Rad trains region-specific and global agents with reinforcement learning on clinical rewards to produce more accurate radiology reports than prior methods on MIMIC-CXR and IU X-ray datasets.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"InEmpirical Methods in Nat- ural Language Processing, pages 1500-1519. Association for Computational Linguistics, 2020. 2, 4 [104] Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-Searcher: Incentivizing the search capabil- ity in LLMs via reinforcement learning.arXiv preprint arXiv:2503.05592, 2025. 2 [105] Yi Su, Dian Yu, Linfeng Song, Juntao Li, Haitao Mi, Zhaopeng Tu, Min Zhang, and Dong Yu. Crossing the re- ward bridge: Expanding rl with verifiable rewards across diverse domains.arXiv preprint arXiv:2503.23829, 2025. 2 [106] Iustin S ˆırbu, Iulia-Renata Sˆırbu, Jasmina Bogojeska, and Traian Rebedea. 
GIT-CXR: End-to-end transformer for chest X-ray report generation."},{"citing_arxiv_id":"2601.21468","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning","primary_cat":"cs.AI","submitted_at":"2026-01-29T09:47:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MemOCR renders structured memory as images with adaptive visual density to improve long-horizon reasoning under tight context budgets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.11793","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling","primary_cat":"cs.CL","submitted_at":"2025-11-14T18:52:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MiroThinker shows that scaling agent-environment interactions via reinforcement learning lets a 72B open-source model reach up to 81.9% on GAIA and approach commercial performance on research benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.05271","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DeepEyesV2: Toward Agentic Multimodal Model","primary_cat":"cs.CV","submitted_at":"2025-11-07T14:31:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.02805","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning","primary_cat":"cs.CL","submitted_at":"2025-11-04T18:27:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.00066","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Sharpness-Guided Group Relative Policy Optimization via Probability Shaping","primary_cat":"cs.LG","submitted_at":"2025-10-29T08:07:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"GRPO-SG is a sharpness-guided token-weighted variant of GRPO that downweights high-gradient tokens to stabilize optimization and improve generalization in reinforcement learning with verifiable rewards.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.22977","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Reasoning Trap: How Enhancing LLM Reasoning Amplifies Tool Hallucination","primary_cat":"cs.LG","submitted_at":"2025-10-27T03:58:29+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Strengthening LLM reasoning through RL, SFT, or chain-of-thought prompting increases tool hallucination rates on SimpleToolHalluBench, with a 
reliability-capability trade-off observed across mitigation attempts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.00861","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs","primary_cat":"cs.CL","submitted_at":"2025-10-01T13:10:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ERL trains LLMs to erase faulty reasoning steps and regenerate them in place, yielding gains of up to 8.48% EM on multi-hop QA benchmarks like HotpotQA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.00568","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards","primary_cat":"cs.CL","submitted_at":"2025-10-01T06:44:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ReSeek adds self-correction via a JUDGE action and a dense instructive reward (correctness plus utility) to RL training of search agents, yielding higher success and faithfulness on a new contamination-resistant benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.02547","ref_index":278,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Landscape of Agentic Reinforcement Learning for LLMs: A Survey","primary_cat":"cs.AI","submitted_at":"2025-09-02T17:46:26+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and 
compiles open environments from over 500 papers.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Atom-Searcher [272] External Qwen2.5-7B-Instruct GitHub; MiroMind Open Deep Research [273] External - Website; SimpleDeepResearcher [274] External QwQ-32B GitHub; AWorld [275] External Qwen3-32B GitHub; SFR-DeepResearch [276] External QwQ-32B, Qwen3-8B, GPT-oss-20b -; ZeroSearch [277] Internal Qwen2.5-3B/7B-Base/Instruct GitHub; SSRL [278] Internal Qwen2.5, Llama-3.2/Llama-3.1, Qwen3 GitHub; Closed Source Methods: OpenAI Deep Research [111] External OpenAI Models Website; Perplexity's DeepResearch [261] External - Website; Google Gemini's DeepResearch [279] External Gemini Website; Kimi-Researcher [112] External Kimi K2 Website; Grok AI DeepSearch [280] External Grok3 Website"},{"citing_arxiv_id":"2508.05748","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent","primary_cat":"cs.IR","submitted_at":"2025-08-07T18:03:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WebWatcher introduces a vision-language deep research agent trained on synthetic multimodal trajectories and RL that outperforms baselines on VQA benchmarks, along with a new BrowseComp-VL evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.00414","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training","primary_cat":"cs.AI","submitted_at":"2025-08-01T08:11:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Cognitive Kernel-Pro provides 
an open-source agent framework with curated training data across web, file, code, and reasoning domains plus test-time reflection and voting, achieving SOTA results on GAIA among free agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.02592","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WebSailor: Navigating Super-human Reasoning for Web Agent","primary_cat":"cs.CL","submitted_at":"2025-07-03T12:59:07+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WebSailor trains open-source web agents to match proprietary performance on complex information-seeking tasks by generating high-uncertainty scenarios and using a new RL method called DUPO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.22095","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation","primary_cat":"cs.CL","submitted_at":"2025-05-28T08:17:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MoRE enables MLLMs to dynamically coordinate heterogeneous retrieval experts via Step-GRPO training, yielding over 7% average gains on open-domain QA benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.17086","ref_index":69,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Advancing Multi-Agent RAG Systems with Minimalist Reinforcement Learning","primary_cat":"cs.CL","submitted_at":"2025-05-20T18:33:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Mujica-MyGo decomposes multi-turn RAG 
interactions via multi-agent workflows and applies minimalist policy gradient optimization to improve performance on QA benchmarks while avoiding long-context problems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.04588","ref_index":35,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ZeroSearch: Incentivize the Search Capability of LLMs without Searching","primary_cat":"cs.CL","submitted_at":"2025-05-07T17:30:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ZeroSearch uses supervised fine-tuning to create a simulated retrieval module and curriculum-based RL rollouts that degrade document quality to train LLMs on search capabilities without real search API calls.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.21776","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WebThinker: Empowering Large Reasoning Models with Deep Research Capability","primary_cat":"cs.CL","submitted_at":"2025-04-30T16:25:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WebThinker equips large reasoning models with autonomous web exploration and interleaved reasoning-drafting via a Deep Web Explorer and RL-based DPO training, yielding gains on GPQA, GAIA, and report-generation benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.01441","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Agentic Reasoning and Tool Integration for LLMs via Reinforcement 
Learning","primary_cat":"cs.AI","submitted_at":"2025-04-28T10:42:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ARTIST couples agentic reasoning with outcome-based reinforcement learning to let LLMs autonomously invoke tools in multi-turn chains, reporting up to 22% gains on math and function-calling benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}