pith. machine review for the scientific record.

arxiv: 2604.07720 · v2 · submitted 2026-04-09 · 💻 cs.AI

Recognition: no theorem link

Towards Knowledgeable Deep Research: Framework and Benchmark

Bingbing Xu, Chunmao Zhang, Fei Wang, Fenghui Zhang, Jiafeng Guo, Jin Zhang, Long Bai, Tat-Seng Chua, Wei Li, Wenxuan Liu, Xiaolong Jin, Xueqi Cheng, Xuhui Jiang, Yuxin Zuo, Zhuo Chen, Zixuan Li

Pith reviewed 2026-05-10 18:12 UTC · model grok-4.3

classification 💻 cs.AI
keywords deep research agents · structured knowledge · Hybrid Knowledge Analysis · KDR-Bench · multimodal reports · LLM agents · knowledge analysis

The pith

A new multi-agent framework lets deep research agents combine tables, figures, and text to generate more accurate reports than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines Knowledgeable Deep Research as the task of producing reports that draw on both unstructured web text and structured resources such as tables and figures. It presents the Hybrid Knowledge Analysis framework, which uses separate agents for coding and vision-language processing to extract insights from data and weave them into coherent multimodal outputs. A supporting benchmark, KDR-Bench, supplies 41 expert questions and over a thousand tables across nine domains to measure performance. Experiments indicate that the new framework beats most existing deep-research agents on standard and knowledge-focused scores and exceeds a strong Gemini baseline on vision-related measures. If correct, this shifts agent design from text-only search toward quantitative, data-grounded analysis.

Core claim

The authors establish that the Hybrid Knowledge Analysis framework, built around a Structured Knowledge Analyzer that employs both coding models and vision-language models, enables deep research agents to integrate structured and unstructured knowledge into coherent multimodal reports, scoring above most prior agents on KDR-Bench's general-purpose and knowledge-centric metrics and above the Gemini deep-research agent on its vision-enhanced metrics.

What carries the argument

The Structured Knowledge Analyzer inside the Hybrid Knowledge Analysis multi-agent architecture, which converts tables and figures into insights using coding and vision-language models.
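The paper does not reproduce the analyzer's code above; as a hedged illustration only, here is a minimal pure-Python sketch of the kind of table-to-insight step the Structured Knowledge Analyzer delegates to a coding model (the vision-language half, which interprets figures, is omitted). The table, column names, and grouping below are invented for illustration and are not the paper's data.

```python
# Illustrative sketch (not the paper's implementation): a minimal
# "structured knowledge analyzer" step that turns a raw table into a
# computed, comparative insight. In HKA this computation would be
# written and executed by a coding model; here it is hard-coded to
# show the shape of the idea. All data below is hypothetical.
import csv
import io
import statistics

TABLE = """country,group,art_sales_growth_pct
US,developed,2.1
DE,developed,1.4
JP,developed,0.9
IN,developing,6.3
BR,developing,4.8
NG,developing,5.5
"""

def analyze_table(raw_csv: str) -> dict:
    """Group table rows and derive a comparative insight."""
    rows = list(csv.DictReader(io.StringIO(raw_csv)))
    by_group: dict[str, list[float]] = {}
    for row in rows:
        by_group.setdefault(row["group"], []).append(
            float(row["art_sales_growth_pct"]))
    means = {g: statistics.mean(v) for g, v in by_group.items()}
    leader = max(means, key=means.get)
    return {
        "means": means,
        "insight": (f"{leader} economies show higher mean growth "
                    f"({means[leader]:.1f}%) in this table."),
    }

result = analyze_table(TABLE)
print(result["insight"])
```

The point of the design is that the insight is computed from the table rather than paraphrased from text, which is what separates the claimed quantitative grounding from text-only summarization.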

If this is right

  • Deep research agents can now perform quantitative computations directly from tables rather than relying solely on textual summaries.
  • Multimodal reports that incorporate figures and tables achieve higher scores on vision-enhanced evaluation metrics.
  • Systematic benchmarks like KDR-Bench allow comparison of agents across general-purpose, knowledge-centric, and vision-enhanced dimensions.
  • Future deep-research systems can treat structured knowledge as a core input rather than an optional add-on.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation of structured and unstructured analysis in the framework could extend to other agent tasks that require both numerical reasoning and narrative synthesis.
  • If the benchmark's tables were replaced with live database connections, the same architecture might support ongoing monitoring or forecasting applications.
  • Adoption in specialized fields such as economics or biology could reveal whether the coding-plus-vision approach scales to domain-specific data formats.

Load-bearing premise

That the 41 expert-level questions and 1,252 tables in KDR-Bench across nine domains provide a representative test of an agent's ability to perform deep, structure-aware knowledge analysis.

What would settle it

A follow-up evaluation on a fresh collection of expert questions that include tables outside the original nine domains, where HKA fails to maintain its reported advantage on knowledge-centric or vision-enhanced metrics.

Figures

Figures reproduced from arXiv: 2604.07720 by Bingbing Xu, Chunmao Zhang, Fei Wang, Fenghui Zhang, Jiafeng Guo, Jin Zhang, Long Bai, Tat-Seng Chua, Wei Li, Wenxuan Liu, Xiaolong Jin, Xueqi Cheng, Xuhui Jiang, Yuxin Zuo, Zhuo Chen, Zixuan Li.

Figure 1
Figure 1. Workflow illustration: identifying the search intent of a query, composing subtask results into a final report, and applying a final refinement (original caption and in-figure text garbled in extraction). view at source ↗
Figure 2
Figure 2. The construction procedure of KDR-Bench: (a) dataset construction process; (b) evaluation framework. view at source ↗
Figure 3
Figure 3. The statistics on tables in KDR-Bench. view at source ↗
Figure 4
Figure 4. Case study for HKA: the Structured Knowledge Analyzer generates a figure and corresponding insights about the differences between developed and developing countries; the Unstructured Knowledge Analyzer further searches for the impact of Brexit on artist mobility, the UK art market, EU exhibitions, and international fairs; the Writer inserts the figure into the final report and integrates the insights into the text. view at source ↗
read the original abstract

Deep Research (DR) requires LLM agents to autonomously perform multi-step information seeking, processing, and reasoning to generate comprehensive reports. In contrast to existing studies that mainly focus on unstructured web content, a more challenging DR task should additionally utilize structured knowledge to provide a solid data foundation, facilitate quantitative computation, and lead to in-depth analyses. In this paper, we refer to this novel task as Knowledgeable Deep Research (KDR), which requires DR agents to generate reports with both structured and unstructured knowledge. Furthermore, we propose the Hybrid Knowledge Analysis framework (HKA), a multi-agent architecture that reasons over both kinds of knowledge and integrates the texts, figures, and tables into coherent multimodal reports. The key design is the Structured Knowledge Analyzer, which utilizes both coding and vision-language models to produce figures, tables, and corresponding insights. To support systematic evaluation, we construct KDR-Bench, which covers 9 domains, includes 41 expert-level questions, and incorporates a large number of structured knowledge resources (e.g., 1,252 tables). We further annotate the main conclusions and key points for each question and propose three categories of evaluation metrics including general-purpose, knowledge-centric, and vision-enhanced ones. Experimental results demonstrate that HKA consistently outperforms most existing DR agents on general-purpose and knowledge-centric metrics, and even surpasses the Gemini DR agent on vision-enhanced metrics, highlighting its effectiveness in deep, structure-aware knowledge analysis. Finally, we hope this work can serve as a new foundation for structured knowledge analysis in DR agents and facilitate future multimodal DR studies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Knowledgeable Deep Research (KDR) as a task requiring LLM agents to generate reports that integrate both structured (tables, figures) and unstructured knowledge. It proposes the Hybrid Knowledge Analysis (HKA) multi-agent framework, centered on a Structured Knowledge Analyzer that combines coding and vision-language models to produce and interpret multimodal outputs. To evaluate this, the authors construct KDR-Bench covering 9 domains with 41 expert-level questions and 1,252 tables, annotate key conclusions, and define three metric categories (general-purpose, knowledge-centric, vision-enhanced). Experimental results claim that HKA consistently outperforms most existing DR agents on the first two metric types and surpasses the Gemini DR agent on vision-enhanced metrics.

Significance. If the performance claims hold under rigorous evaluation, the work provides a concrete benchmark and framework for incorporating structured knowledge into deep research agents, filling a gap between unstructured web-based DR and quantitative, multimodal analysis. The explicit construction of KDR-Bench with annotated conclusions and multimodal resources could serve as a reusable testbed for future studies, particularly if accompanied by reproducible code or detailed per-question results.

major comments (3)
  1. [Abstract / Experimental results] Abstract and experimental results section: the central claim of consistent outperformance across metric categories and domains rests on results from only 41 questions; no per-domain breakdowns, statistical significance tests, inter-question variance, or error analysis are supplied, leaving open the possibility that apparent wins are driven by a small subset of favorable items rather than robust superiority of HKA.
  2. [KDR-Bench construction] KDR-Bench description: the weakest assumption is that 41 expert-level questions plus 1,252 tables across nine domains constitute a representative test of deep, structure-aware knowledge analysis; without evidence of question diversity, domain balance, or coverage of edge cases (e.g., conflicting tables or complex figure-table interactions), the benchmark scale risks under-supporting the generalization claims.
  3. [HKA framework / Evaluation metrics] Framework and evaluation sections: the Structured Knowledge Analyzer is described as using both coding and vision-language models, yet no details are given on how outputs are fused, how baselines (including Gemini) were prompted or configured for fair comparison, or how vision-enhanced metrics were computed, making the reported superiority difficult to reproduce or verify.
minor comments (2)
  1. [Introduction] The abstract and introduction could more explicitly distinguish KDR from prior DR benchmarks (e.g., by citing specific limitations of unstructured-only approaches).
  2. [Evaluation metrics] Notation for the three metric categories is introduced without a summary table; adding one would improve clarity when comparing HKA to baselines.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, providing honest responses and indicating planned revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract / Experimental results] Abstract and experimental results section: the central claim of consistent outperformance across metric categories and domains rests on results from only 41 questions; no per-domain breakdowns, statistical significance tests, inter-question variance, or error analysis are supplied, leaving open the possibility that apparent wins are driven by a small subset of favorable items rather than robust superiority of HKA.

    Authors: We acknowledge that the evaluation relies on 41 expert-level questions and that the current results section lacks the requested breakdowns and statistical rigor. In the revised manuscript, we will add per-domain performance tables, report inter-question variance and standard deviations, include statistical significance tests (e.g., paired Wilcoxon tests against baselines), and provide an error analysis section that examines cases of strong and weak performance. These additions will allow readers to assess whether outperformance is robust or subset-driven. revision: yes

  2. Referee: [KDR-Bench construction] KDR-Bench description: the weakest assumption is that 41 expert-level questions plus 1,252 tables across nine domains constitute a representative test of deep, structure-aware knowledge analysis; without evidence of question diversity, domain balance, or coverage of edge cases (e.g., conflicting tables or complex figure-table interactions), the benchmark scale risks under-supporting the generalization claims.

    Authors: We agree that explicit evidence of representativeness strengthens the benchmark. The 41 questions were curated by domain experts for breadth across the nine domains, and the 1,252 tables include varied structures. In revision, we will expand the KDR-Bench section with: (i) quantitative domain-balance statistics, (ii) question-selection criteria and diversity metrics, and (iii) concrete examples of included edge cases such as conflicting tables and figure-table interactions. While we cannot immediately scale the question count without new expert annotations, we will release the full benchmark and annotation guidelines to support community extensions. revision: partial

  3. Referee: [HKA framework / Evaluation metrics] Framework and evaluation sections: the Structured Knowledge Analyzer is described as using both coding and vision-language models, yet no details are given on how outputs are fused, how baselines (including Gemini) were prompted or configured for fair comparison, or how vision-enhanced metrics were computed, making the reported superiority difficult to reproduce or verify.

    Authors: We apologize for the missing implementation details that impede reproducibility. The revised manuscript will include: (1) a precise description of the fusion process between coding-model outputs (tables, executed computations) and vision-language model interpretations; (2) the exact prompting templates, temperature settings, and configuration details used for all baselines including the Gemini DR agent; and (3) the full computation procedure for vision-enhanced metrics, including how visual elements are matched to annotated conclusions. We will also release the complete code, prompts, and evaluation scripts. revision: yes
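The significance testing promised in response 1 can be sketched without external dependencies. The per-question scores below are hypothetical, and a paired permutation test stands in for the paired Wilcoxon test the authors mention; both ask whether the observed per-question advantage could plausibly be sign-flipping noise, which matters when the benchmark has only 41 items.

```python
# Hedged sketch of the paired significance testing proposed in the
# rebuttal. A paired permutation test (used here as a stdlib-only
# stand-in for the Wilcoxon signed-rank test): randomly flip the sign
# of each per-question score difference and count how often the
# permuted mean difference matches or exceeds the observed one.
import random

def paired_permutation_test(a, b, n_resamples=10_000, seed=0):
    """One-sided p-value for mean(a) > mean(b) over paired scores."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = sum(diffs) / len(diffs)
    hits = 0
    for _ in range(n_resamples):
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if permuted / len(diffs) >= observed:
            hits += 1
    return hits / n_resamples

# Hypothetical per-question scores for two agents (not the paper's data).
hka =      [0.72, 0.65, 0.80, 0.58, 0.91, 0.77, 0.69, 0.84]
baseline = [0.61, 0.66, 0.71, 0.50, 0.85, 0.70, 0.64, 0.79]
print(f"p = {paired_permutation_test(hka, baseline):.3f}")
```

With 41 questions, reporting this kind of per-comparison p-value alongside per-domain variance would directly address the referee's concern that wins could be driven by a small subset of favorable items.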

Circularity Check

0 steps flagged

No significant circularity; the empirical claims rest on a new benchmark and external baselines

full rationale

The paper introduces the KDR task, proposes the HKA multi-agent framework, constructs the KDR-Bench dataset (41 questions, 1,252 tables across 9 domains), defines three metric categories, and reports empirical outperformance against external DR agents including Gemini. No equations, derivations, fitted parameters, or self-referential definitions appear in the abstract or described content. The central experimental claim does not reduce by construction to any input; it depends on comparisons with independent baselines on a newly created testbed. No load-bearing self-citations, uniqueness theorems, or ansatz smuggling are present. This is a standard systems/benchmark paper whose validity can be assessed externally via the benchmark itself rather than internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented physical entities are identifiable from the abstract; the contribution consists of a new task definition, architectural design, and benchmark rather than fitted constants or postulated mechanisms.

pith-pipeline@v0.9.0 · 5631 in / 1201 out tokens · 62041 ms · 2026-05-10T18:12:28.679741+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

45 extracted references · 30 canonical work pages · 10 internal anchors

  1. [1]

    Perplexity AI. 2025. Introducing Perplexity Deep Research. https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research. Accessed: 2025-12

  2. [2]

    Anthropic. 2025. Claude Haiku 4.5 (Thinking). https://www.anthropic.com/claude. Accessed via Anthropic API, model version: claude-haiku-4-5-20251001-thinking

  3. [3]

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, et al. 2023. Qwen Technical Report. arXiv preprint arXiv:2309.16609 (2023)

  4. [4]

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  5. [5]

    Islem Bouzenia and Michael Pradel. 2025. Understanding Software Engineering Agents: A Study of Thought-Action-Result Trajectories. arXiv:2506.18824 [cs.SE] https://arxiv.org/abs/2506.18824

  6. [6]

    Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, Sahel Sharifymoghaddam, Yanxi Li, Haoran Hong, Xinyu Shi, Xuye Liu, Nandan Thakur, Crystina Zhang, Luyu Gao, Wenhu Chen, and Jimmy Lin. 2025. BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent.a...

  7. [7]

    DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, et al. 2024. DeepSeek-V3 Technical Report. arXiv:2412.19437 [cs.CL] https://arxiv.org/abs/2412.19437

  8. [8]

    Guanting Dong, Licheng Bao, Zhongyuan Wang, Kangzhi Zhao, Xiaoxi Li, Jiajie Jin, Jinghan Yang, Hangyu Mao, Fuzheng Zhang, Kun Gai, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, and Zhicheng Dou. 2025. Agentic Entropy-Balanced Policy Optimization. arXiv:2510.14545 [cs.LG] https://arxiv.org/abs/2510.14545

  9. [9]

    Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. 2024. The Faiss library. (2024). arXiv:2401.08281 [cs.LG]

  10. [10]

    Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. 2025. DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents. arXiv:2506.11763 [cs.CL] https://arxiv.org/abs/2506.11763

  11. [11]

    Google. 2024. Gemini 2.0 Flash. https://gemini.google.com. Accessed: 12/2024

  12. [12]

    Google AI. 2025. Gemini Deep Research Agent Documentation. https://ai.google.dev/gemini-api/docs/deep-research. Official documentation for Gemini Deep Research agent, accessed December 2025

  13. [13]

    Xu Huang, Junwu Chen, Yuxing Fei, Zhuohan Li, Philippe Schwaller, and Gerbrand Ceder. 2025. CASCADE: Cumulative Agentic Skill Creation through Autonomous Development and Evolution. arXiv:2512.23880 [cs.AI] https://arxiv.org/abs/2512.23880

  14. [14]

    Rolf Jagerman, Honglei Zhuang, Zhen Qin, Xuanhui Wang, and Michael Bendersky. 2023. Query Expansion by Prompting Large Language Models. arXiv:2305.03653 [cs.IR] https://arxiv.org/abs/2305.03653

  15. [15]

    Jiajie Jin, Yuyao Zhang, Yimeng Xu, Hongjin Qian, Yutao Zhu, and Zhicheng Dou. 2025. FinSight: Towards Real-World Financial Deep Research. arXiv:2510.16844 [cs.CL] https://arxiv.org/abs/2510.16844

  16. [16]

    langchain-ai. 2025. Open Deep Research. https://github.com/langchain-ai/open_deep_research. Open-source deep research agent built on LangGraph, accessed December 2025

  17. [17]

    Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang, Ming Yan, Pengjun Xie, Fei Huang, and Jingren Zhou. 2025. WebSailor: Navigating Super-human Reasoning for Web Agent. arXiv:2507.02592 [cs.CL] https://arxiv.org/abs/2507.02592

  18. [18]

    Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yutao Zhu, Yongkang Wu, Ji-Rong Wen, and Zhicheng Dou. 2025. WebThinker: Empowering large reasoning models with deep research capability. arXiv preprint arXiv:2504.21776 (2025)

  19. [19]

    Yuan Liang, Jiaxian Li, Yuqing Wang, Piaohong Wang, Motong Tian, Pai Liu, Shuofei Qiao, Runnan Fang, He Zhu, Ge Zhang, Minghao Liu, Yuchen Eleanor Jiang, Ningyu Zhang, and Wangchunshu Zhou. 2025. Towards Personalized Deep Research: Benchmarks and Evaluations. arXiv:2509.25106 [cs.CL] https://arxiv.org/abs/2509.25106

  20. [20]

    Fan Liu, Zherui Yang, Cancheng Liu, Tianrui Song, Xiaofeng Gao, and Hao Liu

  21. [21]

    MM-Agent: LLM as Agents for Real-world Mathematical Modeling Problem. arXiv:2505.14148 [cs.AI] https://arxiv.org/abs/2505.14148

  22. [22]

    MiniMax-AI. 2025. MiniMax M2. https://github.com/MiniMax-AI/MiniMax-M2. Open-source model for coding and agentic workflows released by MiniMax_AI; accessed Oct 2025

  23. [23]

    OpenAI. 2024. text-embedding-3 family embedding models. https://platform.openai.com/docs/api-reference/embeddings. Accessed: Month Day, Year

  24. [24]

    OpenAI. 2025. Deep research System Card. https://cdn.openai.com/deep-research-system-card.pdf. Accessed: 2025-12

  25. [25]

    Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Daron Anderson, Tung Nguyen, Mobeen Mahmood... Humanity's Last Exam

  26. [26]

    Akshara Prabhakar, Roshan Ram, Zixiang Chen, Silvio Savarese, Frank Wang, Caiming Xiong, Huan Wang, and Weiran Yao. 2025. Enterprise Deep Research: Steerable Multi-Agent Deep Research for Enterprise Analytics. arXiv preprint arXiv:2510.17797 (2025)

  27. [27]

    Serper.dev. 2025. Serper: The World’s Fastest & Cheapest Google Search API. https://serper.dev/

  28. [28]

    Zhengliang Shi, Yiqun Chen, Haitao Li, Weiwei Sun, Shiyu Ni, Yougang Lyu, Run-Ze Fan, Bowen Jin, Yixuan Weng, Minjun Zhu, Qiujie Xie, Xinyu Guo, Qu Yang, Jiayi Wu, Jujia Zhao, Xiaqiang Tang, Xinbei Ma, Cunxiang Wang, Jiaxin Mao, Qingyao Ai, Jen-Tse Huang, Wenxuan Wang, Yue Zhang, Yiming Yang, Zhaopeng Tu, and Zhaochun Ren. 2025. Deep Research: A Systemati...

  29. [29]

    Aaditya Singh et al. 2026. OpenAI GPT-5 System Card. arXiv:2601.03267 [cs.CL] https://arxiv.org/abs/2601.03267

  30. [30]

    GLM Team, Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, Kedong Wang, Lucen Zhong, Mingdao Liu, Rui Lu, Shulin Cao, Xiaohan Zhang, Xuancheng Huang, Yao Wei, Yean Cheng, Yifan An, Yilin Niu, Yuanhao Wen, Yushi Bai, Zhengxiao Du, Zihan Wang, Zilin Zhu, Bohan Zhang, Bosi Wen, Bowen Wu...

  31. [31]

    Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, et al. 2025. Tongyi DeepResearch Technical Report. arXiv preprint arXiv:2510.24701 (2025)

  32. [32]

    Tencent. 2024. Hunyuan 2.0. https://hunyuan.tencent.com. Accessed via Hunyuan API, model version: Hunyuan2.0

  33. [33]

    thinkdepthai. 2025. Deep_Research: ThinkDepth.ai Deep Research. https://github.com/thinkdepthai/Deep_Research. GitHub repository, accessed on 2025-12-27

  34. [34]

    UncleCode. 2024. Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper

  35. [35]

    Haiyuan Wan, Chen Yang, Junchi Yu, Meiqi Tu, Jiaxuan Lu, Di Yu, Jianbao Cao, Ben Gao, Jiaqing Xie, Aoran Wang, Wenlong Zhang, Philip Torr, and Dongzhan Zhou. 2025. DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks. arXiv:2509.01396 [cs.AI] https://arxiv.org/abs/2509.01396

  36. [36]

    Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese

  37. [37]

    BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents. arXiv:2504.12516 [cs.CL] https://arxiv.org/abs/2504.12516

  38. [38]

    Junde Wu, Jiayuan Zhu, Yuyuan Liu, Min Xu, and Yueming Jin. 2025. Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools. arXiv:2502.04644 [cs.AI] https://arxiv.org/abs/2502.04644

  39. [39]

    xAI. 2025. Grok 4 Model Card. https://data.x.ai/2025-08-20-grok-4-model-card.pdf. Official Grok 4 model card, August 2025

  40. [40]

    Renjun Xu and Jingwen Peng. 2025. A Comprehensive Survey of Deep Research: Systems, Methodologies, and Applications. arXiv:2506.12594 [cs.AI] https://arxiv.org/abs/2506.12594

  41. [41]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  42. [42]

    Yi Yao, He Zhu, Piaohong Wang, Jincheng Ren, Xinlong Yang, Qianben Chen, Xiaowan Li, Dingfeng Shi, Jiaxian Li, Qiexiang Wang, Sinuo Wang, Xinpeng Liu, Jiaqi Wu, Minghao Liu, and Wangchunshu Zhou. 2026. O-Researcher: An Open Ended Deep Research Model via Multi-Agent Distillation and Agentic RL. arXiv:2601.03743 [cs.CL] https://arxiv.org/abs/2601.03743

  43. [43]

    Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhongzhi Li, Xiangyuan Xue, Yijiang Li, Yifan Zhou, Yang Chen, Chen Zhang, Yutao Fan, Zihu Wang, Songtao Huang, Francisco Piedrahita-Velez, Yue Liao, Hongru Wang, Mengyue Yang, Heng Ji, Jun Wang, Shuicheng Yan, Philip Torr, and Lei Bai. 2025. The Landscape of Agentic R...

  44. [44]

    Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. 2025. DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments. arXiv preprint arXiv:2504.03160 (2025)

  45. [45]

    Peilin Zhou, Bruce Leon, Xiang Ying, Can Zhang, Yifan Shao, Qichen Ye, Dading Chong, Zhiling Jin, Chenxuan Xie, Meng Cao, Yuxin Gu, Sixin Hong, Jing Ren, Jian Chen, Chao Liu, and Yining Hua. 2025. BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese. arXiv:2504.19314 [cs.CL] https://arxiv.org/abs/2504.19314