{"total":14,"items":[{"citing_arxiv_id":"2605.22138","ref_index":104,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Efficient Agentic Reasoning Through Self-Regulated Simulative Planning","primary_cat":"cs.AI","submitted_at":"2026-05-21T08:11:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SR²AM achieves competitive Pass@1 accuracy on diverse tasks with 25.8-95.3% fewer reasoning tokens than much larger models by using self-regulated simulative planning trained via supervised learning and RL.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"objective derivation is provided in Appendix D.2. 3.4 Training Data and Hyperparameters We build our training dataset from open-source math, science, tabular, and web reasoning datasets. For v0.1, we sample from Guru [17] and multi-hop QA datasets [116, 36, 98, 105], yielding 4,845 supervised examples after construction and filtering. For v1.0, we additionally incorporate MegaScience [22] and several web reasoning datasets [104, 95, 87, 25], yielding 10,787 supervised examples. For RL, we perform difficulty-based filtering [17, 90], retaining questions with intermediate Pass@𝐾 rates to ensure informative gradient signals. SR²AM-v0.1-8B is trained from Qwen3-8B [79]; SR²AM-v1.0-30B from Qwen3-30B-A3B-Thinking-2507 [78]. 
Full dataset composition, filtering protocol, and training hyperparameters are provided in Appendix E."},{"citing_arxiv_id":"2605.13034","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence","primary_cat":"cs.CV","submitted_at":"2026-05-13T05:39:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ViDR treats source figures as retrievable and verifiable evidence objects in multimodal deep research reports and introduces MMR Bench+ to measure improvements in visual integration and verifiability.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"an open ended question, these systems can plan multistep searches, browse the web, collect evidence, and produce structured reports. Commercial systems such as OpenAI Deep Research [18], Gemini Deep Research [1], and Tongyi DeepResearch [23] show the practical value of this paradigm, while academic systems such as STORM [20], WebThinker [14], WebDancer [25], and DualGraph [21] improve different parts of the research loop. However, the final artifacts produced by most deep research agents remain largely text centered. This is a poor match for many research tasks. In scientific papers, technical reports, policy briefings, and market analysis, figures are often not decorative supplements. 
They are evidence: a model"},{"citing_arxiv_id":"2605.09287","ref_index":42,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-05-10T03:21:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PiCA uses pivot-based potential rewards derived from historical sub-queries to supply trajectory-aware step guidance in agentic RL, delivering 15% gains on QA benchmarks for 3B/7B models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The code is available at https://github.com/novdream/PiCA. 1 Introduction Large Language Model (LLM)-based search agents [11, 40, 36] have recently redefined the paradigm for addressing long-horizon, knowledge-intensive tasks, such as multi-hop question answering [3, 28] and open-domain information seeking [50, 49]. For example, popular search agents, such as WebDancer [42], WebLeaper [33] and MiroThinker [34], can autonomously refine queries, summarize retrieved information from external environments through search tools (e.g., online API, local corpus). A primary bottleneck in these long-horizon tasks lies in incorrect credit assignment [18, 32, 48], specifically the misattribution of rewards to less important steps."},{"citing_arxiv_id":"2605.01489","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-02T15:26:45+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Results of proprietary agent baselines are reported by Tang et al. [34]. 
*Note: In TRQA experiments, we exclude TRQA from the training data of SciResearcher. Agent Framework LLM Backbone HLE-Gold SuperGPQA-Hard TRQA* Vanilla LLMs - Qwen3-32B 5.37 31.52 37.79 - Kimi-K2 6.71 48.91 38.37 - Deepseek V3.1 13.42 66.30 43.60 - Gemini-2.5 Pro 18.79 65.22 45.93 Agent Systems AutoGen [46] GPT-4.1 7.38 29.35 51.74 SciMaster [3] GPT-4.1 9.45 19.78 47.67 Biomni [13] GPT-4.1 10.74 43.48 41.09 OpenAI Deep Research [27] o4-mini 22.82 39.13 - Cognitive Kernel-Pro [9] Qwen3-8B 8.05 22.83 34.88 Qwen3-32B 10.74 38.04 46.51 SciResearcher-8B-SFT 12.75 31.52 47.67 SciResearcher-8B-RL 19.46 35.87 49.42 -pass@3 31.54 51.09 60.47 Table 2: Training data composition."},{"citing_arxiv_id":"2605.00043","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SiriusHelper: An LLM Agent-Based Operations Assistant for Big Data Platforms","primary_cat":"cs.DB","submitted_at":"2026-04-29T06:18:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SiriusHelper deploys an LLM agent with intent routing, DeepSearch multi-hop retrieval, and automated SOP distillation to outperform alternatives and reduce ticket volume by 20.8% on Tencent's big data platform.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18292","ref_index":105,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence","primary_cat":"cs.AI","submitted_at":"2026-04-20T14:01:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 
benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"22648. URL https://doi.org/10.48550/arXiv.2505.22648. [104] Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, and Fei Huang. Webwalker: Benchmarking llms in web traversal. CoRR, abs/2501.07572, 2025. doi: 10.48550/ARXIV.2501.07572. URL https://doi.org/10.48550/arXiv.2501.07572. [105] Yifan Wu, Yiran Peng, Yiyu Chen, Jianhao Ruan, Zijie Zhuang, Cheng Yang, Jiayi Zhang, Man Chen, Yenchi Tseng, Zhaoyang Yu, Liang Chen, Yuyao Zhai, Bang Liu, Chenglin Wu, and Yuyu Luo. Autowebworld: Synthesizing infinite verifiable web environments via finite state machines, 2026. URL https://arxiv.org/abs/2602.14296. [106] Zijian Wu, Xiangyan Liu, Xinyuan Zhang, Lingjun Chen, Fanqing Meng, Lingxiao Du, Yiran Zhao, Fanshi Zhang,"},{"citing_arxiv_id":"2604.14518","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mind DeepResearch Technical Report","primary_cat":"cs.AI","submitted_at":"2026-04-16T01:20:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.","context_count":1,"top_context_role":"dataset","top_context_polarity":"background","context_text":"adjudicating conflicting evidence across heterogeneous sources-a critical requirement in realistic research scenarios. Evaluation and Reward Design.Classic evaluation metrics for deep research agents are the RACE rubric score comprising Comprehensiveness, Insight, Instruction Following and Readability as shown in DeepResearch Bench [7]. 
Several other multi-dimensional rubric frameworks are also proposed including WritingBench [42], ResearchRubrics [27], and DEER [11], that enable reliable LLM-as-a-Judge evaluation along axes such as factual accuracy, structural coherence, and theme alignment. These rubrics in turn supply reward signals for RL algorithms including GRPO [26], GSPO [52], and DAPO [49]. Nevertheless, as indicated by FINDER [51], current models still exhibit critical gaps"},{"citing_arxiv_id":"2604.04017","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces","primary_cat":"cs.CL","submitted_at":"2026-04-05T08:29:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"and analysis: we benchmark 12 MLLMs and open-source agents, conduct single-tool ablations, and analyze milestone hit rates and an error taxonomy. 2 Related Work Agentic Multimodal Tool Use. Recent advances in autonomous web agents have demonstrated the potential of agentic reasoning with external tools, particularly in open-domain information seeking and synthesis [26,34,42,52,56]. Extending such agents to multimodal settings further complicates reasoning, as models must integrate visual cues with textual knowledge and external verification. 
Prior work explores multimodal chain-of-thought prompting and structured visual reasoning [31,57], as well as grounding reasoning in external knowledge via multimodal Retrieval-Augmented Generation (RAG) [5,44,47]."},{"citing_arxiv_id":"2604.03679","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LightThinker++: From Reasoning Compression to Memory Management","primary_cat":"cs.CL","submitted_at":"2026-04-04T10:46:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"learning signal allows the model to maintain context hygiene and reasoning fidelity across extended interaction horizons. 6 Experiments: Long-Horizon Agentic Reasoning 6.1 Experimental Settings Dataset Construction and Filtering. The base query pool is curated from a diversified ensemble of sources, including HotpotQA [40], MuSiQue [41], WebDancer [42], WebShaper [43], and WebWalkerQA-Silver [44]. To ensure the necessity of multi-hop reasoning and high-order planning, we perform heuristic filtering on HotpotQA and MuSiQue by selecting only those instances where Qwen3-30B-A3B-Instruct-2507 fails to yield direct solutions. 
Regarding the WebWalkerQA-Silver corpus, we adopted a language-specific selection policy: the English subset was fully incorporated to maintain linguistic diversity, while the Chinese subset was"},{"citing_arxiv_id":"2603.04751","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evaluating the Search Agent in a Parallel World","primary_cat":"cs.AI","submitted_at":"2026-03-05T02:56:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Mind-ParaWorld creates parallel worlds with atomic facts to evaluate search agents on future scenarios, showing they synthesize evidence well but struggle with collection, coverage, sufficiency judgment, and stopping decisions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.11793","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling","primary_cat":"cs.CL","submitted_at":"2025-11-14T18:52:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MiroThinker shows that scaling agent-environment interactions via reinforcement learning lets a 72B open-source model reach up to 81.9% on GAIA and approach commercial performance on research benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.02805","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement 
Learning","primary_cat":"cs.CL","submitted_at":"2025-11-04T18:27:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.02547","ref_index":114,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Landscape of Agentic Reinforcement Learning for LLMs: A Survey","primary_cat":"cs.AI","submitted_at":"2025-09-02T17:46:26+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"state transitions, establishes token-efficiency arguments for feasibility under finite budgets, and proposes Advantage Shaping Policy Optimization (ASPO) to stably guide agentic tool use. Today, such tool-integrated reasoning is no longer a niche capability but a baseline feature of advanced agentic models. Mature commercial and open-source systems, such as OpenAI's DeepResearch and o3 [111], Kimi K2 [112], Qwen QwQ-32B [113], Zhipu GLM Z1 [114], Microsoft rStar2-Agent [115] and Meituan LongCat [116], routinely incorporate these RL-honed strategies, underscoring the centrality of outcome-driven optimization in tool-augmented intelligence. 
Prospective: Long-horizon TIR. While tool-integrated RL has proven effective for optimizing actions within a single reasoning loop, the primary frontier lies in extending this capability to robust, long-horizon"},{"citing_arxiv_id":"2507.02592","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WebSailor: Navigating Super-human Reasoning for Web Agent","primary_cat":"cs.CL","submitted_at":"2025-07-03T12:59:07+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WebSailor trains open-source web agents to match proprietary performance on complex information-seeking tasks by generating high-uncertainty scenarios and using a new RL method called DUPO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}