pith. machine review for the scientific record. sign in

arxiv: 2604.25256 · v1 · submitted 2026-04-28 · 💻 cs.AI

Recognition: unknown

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

Authors on Pith no claims yet

Pith reviewed 2026-05-07 16:32 UTC · model grok-4.3

classification 💻 cs.AI
keywords AI agentsscientific literature discoverybenchmarkautonomous researchDeep ResearchWide ResearchLLM evaluation
0
0 comments X

The pith

Current AI agents reach only 9 percent accuracy on tasks that require finding and understanding specific scientific papers through open-ended search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AutoResearchBench to test whether AI agents can perform autonomous scientific literature discovery. It splits the problem into Deep Research, which tracks one target paper via progressive multi-step probing that demands concept comprehension, and Wide Research, which gathers every paper meeting given conditions when the total number is unknown in advance. These tasks differ from prior web-browsing benchmarks by staying research-oriented, literature-focused, and open-ended, exposing that even the strongest models score only 9.39 percent on Deep Research and 9.31 percent IoU on Wide Research. A sympathetic reader cares because finding the right papers is a foundational step in any autonomous research pipeline, so persistent failure here limits how far agents can advance real science without human guidance.

Core claim

AutoResearchBench consists of Deep Research tasks that require tracking a specific target paper through multi-step probing and Wide Research tasks that require collecting all papers satisfying stated conditions. Even the strongest current LLMs achieve only 9.39 percent accuracy on Deep Research and 9.31 percent IoU on Wide Research, while many baselines fall below 5 percent, despite high performance on general agentic web-browsing benchmarks such as BrowseComp. The benchmark is distinguished by requiring in-depth scientific concept comprehension, fine-grained use of detailed paper information, and deliberate reasoning over an unknown number of relevant papers.

What carries the argument

The AutoResearchBench suite of two task types: Deep Research (progressive multi-step probing to locate one target paper) and Wide Research (comprehensive collection of all papers meeting open conditions).

If this is right

  • Agents must develop stronger in-depth scientific concept comprehension to succeed on these tasks.
  • Open-ended collection of unknown numbers of papers requires more deliberate, uncertainty-aware search strategies than current general web agents use.
  • Future autonomous research systems will need specialized training or architectures beyond those sufficient for BrowseComp-style benchmarks.
  • Releasing the dataset and evaluation pipeline allows direct measurement of progress on research-oriented literature discovery.
  • Performance gaps highlight that literature-focused, research-oriented benchmarks are necessary to evaluate true autonomy beyond general web navigation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the benchmark tasks prove representative, then progress toward fully autonomous research will require new agent designs that integrate deeper semantic understanding of scientific text rather than surface-level retrieval.
  • Similar benchmarks could be built for later research stages such as hypothesis generation or experimental design to create an end-to-end autonomy evaluation suite.
  • The low scores suggest that current scaling trends alone may not close the gap without explicit training on scientific literature structures and uncertainty handling.
  • The open-ended nature of Wide Research could serve as a test bed for studying how agents manage search under incomplete information, with direct relevance to other domains like legal or medical document discovery.

Load-bearing premise

That success or failure on the constructed Deep and Wide Research tasks will translate to success or failure at the real-world skill of autonomous scientific literature discovery.

What would settle it

A controlled study in which the same AI agents are tested on AutoResearchBench tasks and then on actual researcher-chosen literature searches for new problems, checking whether high benchmark scores predict high real-world retrieval quality and completeness.

Figures

Figures reproduced from arXiv: 2604.25256 by Chen Yue, Haiyu Xu, Hao Li, Hongjin Qian, Jianlyu Chen, Jin-Ge Yao, Jingying Shao, Kun Luo, Lei Xiong, Qian Yu, Wenbo Zhang, Xiaan Du, Xi Yang, Yesheng Liu, Yuyang Wang, Zheng Liu, Zhicheng Dou, Ziyi Xia.

Figure 1
Figure 1. Figure 1: All Flagship models Struggle on AutoReasearchBench. Two representative (Deep Research and Wide Research) instances showing multi-hop trajectory reasoning, fine-grained detail verification, and complex constraint decomposition (e.g., iterative web search and full-text reading) to derive a verifiable unique target paper or an exhaustive literature set. stems from both complexity and open-endedness. On the co… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the benchmark construction pipeline. The construction pipeline comprises 2 view at source ↗
Figure 3
Figure 3. Figure 3: Category distribution of two tasks across major computer science domains. view at source ↗
Figure 4
Figure 4. Figure 4: Wide Search IoU bucket analysis and prediction coverage of Gemini-3.1-pro (100 cases). (a) Distribution of IoU scores across different performance. (b) Scatter plot comparing the number of ground truth papers versus predicted papers. 3.2 Main Results view at source ↗
Figure 5
Figure 5. Figure 5: Test time scaling experiment. Research point to a recall bottleneck, where repeated runs tend to reproduce similar omissions rather than uncover complementary evidence. We also observe different scaling behaviors across models. kimi-k2.5 benefits more from larger k on Deep Search, while Gemini-3.1-pro remains strongest on Wide Search. Overall, additional inference-time compute is most useful when failures … view at source ↗
Figure 6
Figure 6. Figure 6: The topic distribution comparison of two tasks. view at source ↗
Figure 7
Figure 7. Figure 7: Statistics of the answers per query. (a) Distribution of Deep Search Tasks. (b) Distribution view at source ↗
Figure 8
Figure 8. Figure 8: Representative rejection cases from the DeepSearch verification pipeline. view at source ↗
Figure 9
Figure 9. Figure 9: The Verification cases of WideSearch Task. view at source ↗
Figure 10
Figure 10. Figure 10: Statistics of the answer supplementation process. (a) Distribution of supplemented view at source ↗
Figure 11
Figure 11. Figure 11: Error-type distribution (as a percentage of manually labelled errors) for three agents on view at source ↗
Figure 12
Figure 12. Figure 12: System prompt-1 of evaluation pipeline. 25 view at source ↗
Figure 13
Figure 13. Figure 13: System prompt-2 of evaluation pipeline. 26 view at source ↗
Figure 14
Figure 14. Figure 14: The trajectory-1 of Opus in Deep Researh task. We omit detailed model responses and view at source ↗
Figure 15
Figure 15. Figure 15: The trajectory-2 of Opus in Deep Researh task. We omit detailed model responses and view at source ↗
read the original abstract

Autonomous scientific research is significantly advanced thanks to the development of AI agents. One key step in this process is finding the right scientific literature, whether to explore existing knowledge for a research problem, or to acquire evidence for verifying assumptions and supporting claims. To assess AI agents' capability in driving this process, we present AutoResearchBench, a dedicated benchmark for autonomous scientific literature discovery. AutoResearchBench consists of two complementary task types: (1) Deep Research, which requires tracking down a specific target paper through a progressive, multi-step probing process, and (2) Wide Research, which requires comprehensively collecting a set of papers satisfying given conditions. Compared to previous benchmarks on agentic web browsing, AutoResearchBench is distinguished along three dimensions: it is research-oriented, calling for in-depth comprehension of scientific concepts; literature-focused, demanding fine-grained utilization of detailed information; and open-ended, involving an unknown number of qualified papers and thus requiring deliberate reasoning and search throughout. These properties make AutoResearchBench uniquely suited for evaluating autonomous research capabilities, and extraordinarily challenging. Even the most powerful LLMs, despite having largely conquered general agentic web-browsing benchmarks such as BrowseComp, achieve only 9.39% accuracy on Deep Research and 9.31% IoU on Wide Research, while many other strong baselines fall below 5%. We publicly release the dataset and evaluation pipeline to facilitate future research in this direction. We publicly release the dataset, evaluation pipeline, and code at https://github.com/CherYou/AutoResearchBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AutoResearchBench, a new benchmark for AI agents performing autonomous scientific literature discovery. It defines two tasks: Deep Research, which requires agents to locate a specific target paper via progressive multi-step probing, and Wide Research, which requires exhaustive collection of all papers meeting open-ended conditions. The authors evaluate multiple LLM-based agents and baselines on these tasks, reporting low scores (9.39% accuracy on Deep Research and 9.31% IoU on Wide Research) even for frontier models that perform well on general web-browsing benchmarks such as BrowseComp. They position the benchmark as uniquely challenging because it demands in-depth scientific concept comprehension, fine-grained use of literature details, and deliberate reasoning over an unknown number of relevant papers. The dataset, evaluation pipeline, and code are released publicly.

Significance. If the tasks are shown to genuinely require progressive concept comprehension and exhaustive open-ended collection rather than surface-level keyword matching, the benchmark would be a valuable addition to the field by exposing a clear capability gap in current agents for research-oriented workflows. The public release of the dataset and code is a concrete strength that enables follow-on work. However, the headline performance gap with BrowseComp is only informative about research skills to the extent that the task construction prevents trivial solutions.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Benchmark Construction): The claim that the tasks are 'research-oriented, calling for in-depth comprehension of scientific concepts' and 'open-ended, involving an unknown number of qualified papers' is not supported by any explicit protocol for generating targets, conditions, or query seeds. Without this protocol or the generation code, it is impossible to rule out that agents could succeed via repeated title/abstract keyword searches on arXiv/Google Scholar without reading full texts or synthesizing concepts, which would make the 9.39% / 9.31% scores uninformative about the intended research capabilities.
  2. [§5] §5 (Experiments and Results): The performance tables report aggregate accuracy and IoU but provide no error analysis, no breakdown by failure mode (e.g., search termination vs. incorrect paper selection), and no qualitative trajectory examples. This omission prevents readers from determining whether the low scores primarily reflect failures of comprehension, of long-horizon planning, or of tool use, weakening the diagnostic value of the benchmark.
minor comments (2)
  1. [Abstract] Abstract: The final paragraph contains two nearly identical sentences about public release; the repetition should be consolidated.
  2. [§3] The paper would be strengthened by adding a short appendix or supplementary note that lists the exact arXiv categories, date ranges, and filtering criteria used to source the underlying papers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. The comments have identified important areas where additional clarity and analysis will strengthen the presentation of AutoResearchBench. We respond to each major comment below and describe the revisions we will incorporate.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Benchmark Construction): The claim that the tasks are 'research-oriented, calling for in-depth comprehension of scientific concepts' and 'open-ended, involving an unknown number of qualified papers' is not supported by any explicit protocol for generating targets, conditions, or query seeds. Without this protocol or the generation code, it is impossible to rule out that agents could succeed via repeated title/abstract keyword searches on arXiv/Google Scholar without reading full texts or synthesizing concepts, which would make the 9.39% / 9.31% scores uninformative about the intended research capabilities.

    Authors: We appreciate the referee raising this point about transparency in task construction. The manuscript states that the dataset, evaluation pipeline, and code are released at https://github.com/CherYou/AutoResearchBench, and the repository does contain the scripts used to generate the tasks. To make this fully self-contained and address concerns about potential trivial solutions, we will revise §3 to include an explicit description of the generation protocol. This will detail the criteria for selecting target papers and conditions (e.g., requirements for multi-hop concept linking and exhaustive coverage that cannot be satisfied by title/abstract keyword matching alone), along with examples of query seeds and verification steps. We will also add a pointer to the specific generation code in the text. These changes will allow readers to directly assess why surface-level searches are insufficient for high performance. revision: yes

  2. Referee: [§5] §5 (Experiments and Results): The performance tables report aggregate accuracy and IoU but provide no error analysis, no breakdown by failure mode (e.g., search termination vs. incorrect paper selection), and no qualitative trajectory examples. This omission prevents readers from determining whether the low scores primarily reflect failures of comprehension, of long-horizon planning, or of tool use, weakening the diagnostic value of the benchmark.

    Authors: We agree that the current results section would benefit from greater diagnostic detail to help readers interpret the sources of the observed performance gaps. In the revised manuscript, we will add a dedicated error analysis subsection to §5. This will include a quantitative breakdown of failure modes (e.g., premature termination of search, selection of incorrect papers, and tool invocation errors) derived from the logged agent trajectories. We will also incorporate 3–4 qualitative examples of representative trajectories for both Deep Research and Wide Research tasks, illustrating specific points where agents succeeded or failed in concept comprehension, planning, or tool use. These additions will improve the benchmark's utility for diagnosing capability limitations in current agents. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark definition with no derivation chain or self-referential reductions

full rationale

The paper introduces AutoResearchBench as a new evaluation suite consisting of Deep Research and Wide Research tasks, then reports direct empirical accuracy and IoU metrics from running various LLM agents on the released dataset. No equations, fitted parameters, predictions, or first-principles derivations are present. Task construction and evaluation protocols are defined explicitly in the paper itself without reducing to prior self-citations or renaming known results. The reported performance gaps (e.g., 9.39% Deep Research accuracy) are measured outcomes rather than quantities forced by construction or self-citation chains. The work is therefore self-contained as an empirical benchmark contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The benchmark rests on the domain assumption that scientific literature discovery requires progressive probing and comprehensive condition-based collection, with no free parameters or invented entities introduced.

axioms (1)
  • domain assumption Scientific literature discovery requires in-depth comprehension of concepts and handling of open-ended searches with unknown result counts.
    This premise underpins the design of both Deep and Wide Research tasks as described in the abstract.

pith-pipeline@v0.9.0 · 5633 in / 1213 out tokens · 63052 ms · 2026-05-07T16:32:42.991081+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

59 extracted references · 28 canonical work pages · 9 internal anchors

  1. [1]

    The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

    Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scien- tist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024

  2. [3]

    Towards an AI co-scientist

    Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an ai co-scientist. arXiv preprint arXiv:2502.18864, 2025

  3. [4]

    2025 , doi =

    Jiabin Tang, Lianghao Xia, Zhonghang Li, and Chao Huang. Ai-researcher: Autonomous scientific innovation.arXiv preprint arXiv:2505.18705, 2025

  4. [5]

    Agent laboratory: Using llm agents as research assistants.Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5977–6043, 2025

    Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using llm agents as research assistants.Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5977–6043, 2025

  5. [6]

    Paper2code: Automating code generation from scientific papers in machine learning.arXiv preprint arXiv:2504.17192, 2025

    Minju Seo, Jinheon Baek, Seongyun Lee, and Sung Ju Hwang. Paper2code: Automating code generation from scientific papers in machine learning.arXiv preprint arXiv:2504.17192, 2025

  6. [7]

    In NeurIPS 2025 AI for Science Workshop

    Tengyue Xu, Zhuoyang Qian, Gaoge Liu, Li Ling, Zhentao Zhang, Biao Wu, Shuo Zhang, Ke Lu, Wei Shi, Ziqi Wang, et al. Idea2story: An automated pipeline for transforming research concepts into complete scientific narratives.arXiv preprint arXiv:2601.20833, 2026

  7. [8]

    https: //arxiv.org/abs/2512.07921

    Zongwei Li, Zhonghang Li, Zirui Guo, Xubin Ren, and Chao Huang. Deepcode: Open agentic coding.arXiv preprint arXiv:2512.07921, 2025

  8. [9]

    Toward Autonomous Long-Horizon Engineering for ML Research

    Guoxin Chen, Jie Chen, Lei Chen, Jiale Zhao, Fanzhe Meng, Wayne Xin Zhao, Ruihua Song, Cheng Chen, Ji-Rong Wen, and Kai Jia. Toward autonomous long-horizon engineering for ml research.arXiv preprint arXiv:2604.13018, 2026

  9. [10]

    International Conference on Learning Representations (ICLR) , year =

    Yixuan Weng, Minjun Zhu, Qiujie Xie, Qiyao Sun, Zhen Lin, Sifan Liu, and Yue Zhang. Deepscientist: Advancing frontier-pushing scientific findings progressively.arXiv preprint arXiv:2509.26603, 2025

  10. [11]

    Deepxiv-sdk: An agentic data interface for scientific papers

    Hongjin Qian, Ziyi Xia, Ze Liu, Jianlv Chen, Kun Luo, Minghao Qin, Chaofan Li, Lei Xiong, Sen Wang, Zhengyang Liang, et al. Deepxiv-sdk: An agentic data interface for scientific papers. arXiv preprint arXiv:2603.00084, 2026

  11. [12]

    Litsearch: A retrieval benchmark for scientific literature search

    Anirudh Ajith, Mengzhou Xia, Alexis Chevalier, Tanya Goyal, Danqi Chen, and Tianyu Gao. Litsearch: A retrieval benchmark for scientific literature search. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15068–15083, 2024

  12. [13]

    RPC-Bench: A Fine-grained Benchmark for Research Paper Comprehension

    Yelin Chen, Fanjin Zhang, Suping Sun, Yunhe Pang, Yuanchun Wang, Jian Song, Xiaoyan Li, Lei Hou, Shu Zhao, Jie Tang, et al. Rpc-bench: A fine-grained benchmark for research paper comprehension.arXiv preprint arXiv:2601.14289, 2026

  13. [14]

    WideSearch: Benchmarking agentic broad info-seeking.arXiv preprint arXiv:2508.07999, 2025

    Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, et al. Widesearch: Benchmarking agentic broad info-seeking.arXiv preprint arXiv:2508.07999, 2025

  14. [15]

    Gaia: a benchmark for general ai assistants

    Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. InThe Twelfth International Conference on Learning Representations, 2023

  15. [16]

    BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

    Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents.arXiv preprint arXiv:2504.12516, 2025. 11

  16. [17]

    Browsecomp-plus: A more fair and transparent evaluation benchmark of deep-research agent.arXiv:2508.06600, 2025

    Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, et al. Browsecomp-plus: A more fair and transparent evaluation benchmark of deep-research agent.arXiv preprint arXiv:2508.06600, 2025

  17. [18]

    Webdancer: Towards autonomous information seeking agency, 2025

    Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Webdancer: Towards autonomous information seeking agency, 2025

  18. [19]

    Websailor: Navigating super-human reasoning for web agent.arXiv preprint arXiv:2507.02592, 2025

    Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, et al. Websailor: Navigating super-human reasoning for web agent.arXiv preprint arXiv:2507.02592, 2025

  19. [20]

    Infoflow: Reinforcing search agent via reward density optimization.arXiv preprint arXiv:2510.26575, 2025

    Kun Luo, Hongjin Qian, Zheng Liu, Ziyi Xia, Shitao Xiao, Siqi Bao, Jun Zhao, and Kang Liu. Infoflow: Reinforcing search agent via reward density optimization.arXiv preprint arXiv:2510.26575, 2025

  20. [21]

    Search-o1: Agentic Search-Enhanced Large Reasoning Models

    Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models.CoRR, abs/2501.05366, 2025

  21. [22]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, and Jiawei Han. Search- r1: Training llms to reason and leverage search engines with reinforcement learning.CoRR, abs/2503.09516, 2025

  22. [23]

    Smith, and Matt Gardner

    Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers.arXiv preprint arXiv:2105.03011, 2021

  23. [24]

    Sage: Benchmarking and improving retrieval for deep research agents.ArXiv, abs/2602.05975,

    Tiansheng Hu, Yilun Zhao, Canyu Zhang, Arman Cohan, and Chen Zhao. Sage: Benchmarking and improving retrieval for deep research agents.arXiv preprint arXiv:2602.05975, 2026

  24. [25]

    Skarlinski, Sam Cox, Jon M

    Michael D Skarlinski, Sam Cox, Jon M Laurent, James D Braza, Michaela Hinks, Michael J Hammerling, Manvitha Ponnapati, Samuel G Rodriques, and Andrew D White. Language agents achieve superhuman synthesis of scientific knowledge.arXiv preprint arXiv:2409.13740, 2024

  25. [26]

    Paperarena: An evaluation benchmark for tool-augmented agentic reasoning on scientific literature.arXiv preprint arXiv:2510.10909, 2025

    Daoyu Wang, Mingyue Cheng, Shuo Yu, Zirui Liu, Ze Guo, Xin Li, and Qi Liu. Paperarena: An evaluation benchmark for tool-augmented agentic reasoning on scientific literature.arXiv preprint arXiv:2510.10909, 2025

  26. [27]

    Pasa: An llm agent for comprehensive academic paper search

    Yichen He, Guanhua Huang, Peiyuan Feng, Yuan Lin, Yuchen Zhang, Hang Li, et al. Pasa: An llm agent for comprehensive academic paper search. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11663–11679, 2025

  27. [28]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations (ICLR), 2023

  28. [29]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026

  29. [30]

    Deepseek-v3.2: Pushing the frontier of open large language models, 2025

    DeepSeek-AI. Deepseek-v3.2: Pushing the frontier of open large language models, 2025

  30. [31]

    Minimax m2.5: Built for real-world productivity

    MiniMax. Minimax m2.5: Built for real-world productivity. https://www.minimax.io/ news/minimax-m25, 2026

  31. [32]

    Kimi K2.5: Visual Agentic Intelligence

    Kimi Team et al. Kimi k2.5: Visual agentic intelligence.https://arxiv.org/abs/2602.02276, 2026

  32. [33]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  33. [34]

    Seed 2.0 official launch

    ByteDance Seed Team. Seed 2.0 official launch. https://seed.bytedance.com/en/blog/ seed-2-0-official-launch, 2026. 12

  34. [35]

    Gemini 3 flash

    Google DeepMind. Gemini 3 flash. https://deepmind.google/models/gemini/flash/, 2026

  35. [36]

    Gemini 3.1 pro

    Google DeepMind. Gemini 3.1 pro. https://deepmind.google/models/gemini/pro/, 2026

  36. [37]

    Introducing GPT-5.4

    OpenAI. Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4/ , 2026

  37. [38]

    Introducing Claude Sonnet 4.6

    Anthropic. Introducing Claude Sonnet 4.6. https://www.anthropic.com/news/ claude-sonnet-4-6, Feb 2026

  38. [39]

    Introducing Claude Opus 4.6

    Anthropic. Introducing Claude Opus 4.6. https://www.anthropic.com/news/ claude-opus-4-6, 2026

  39. [40]

    Alphaxiv: The ai-native platform for scientific discovery

    Alphaxiv. Alphaxiv: The ai-native platform for scientific discovery. https://www.alphaxiv. org/, 2026

  40. [41]

    Introducing deep research, 2025

    OpenAI. Introducing deep research, 2025

  41. [42]

    Google AI studio.https://aistudio.google.com/, 2026

    Google. Google AI studio.https://aistudio.google.com/, 2026

  42. [43]

    Webthinker: Empowering large reasoning models with deep research capability,

    Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yutao Zhu, Yongkang Wu, Ji-Rong Wen, and Zhicheng Dou. Webthinker: Empowering large reasoning models with deep research capability.arXiv preprint arXiv:2504.21776, 2025

  43. [44]

    Try deep research and our new experimental model in gemini, your ai assistant

    Dave Citron. Try deep research and our new experimental model in gemini, your ai assistant. https://blog.google/products/gemini/google-gemini-deep-research/, 2024

  44. [45]

    Tongyi deepresearch technical report.arXiv preprint arXiv:2510.24701,

    Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, et al. Tongyi deepresearch technical report.arXiv preprint arXiv:2510.24701, 2025

  45. [46]

    Pasa: An llm agent for comprehensive academic paper search, 2025

    Yichen He, Guanhua Huang, Peiyuan Feng, Yuan Lin, Yuchen Zhang, Hang Li, and Weinan E. Pasa: An llm agent for comprehensive academic paper search, 2025

  46. [47]

    Spar: Scholar paper retrieval with llm-based agents for enhanced academic search, 2025

    Xiaofeng Shi, Yuduo Li, Qian Kou, Longbin Yu, Jinxin Xie, and Hua Zhou. Spar: Scholar paper retrieval with llm-based agents for enhanced academic search, 2025

  47. [48]

    In- fodeepseek: Benchmarking agentic information seeking for retrieval-augmented generation

    Yunjia Xi, Jianghao Lin, Menghui Zhu, Yongzhao Xiao, Zhuoying Ou, Jiaqi Liu, Tong Wan, Bo Chen, Weiwen Liu, Yasheng Wang, Ruiming Tang, Weinan Zhang, and Yong Yu. In- fodeepseek: Benchmarking agentic information seeking for retrieval-augmented generation. arXiv preprint arXiv:2505.15872, 2025

  48. [49]

    Deepwidesearch: Benchmarking depth and width in agentic information seeking, 2025

    Tian Lan, Bin Zhu, Qianghuai Jia, Junyang Ren, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo, and Kaifu Zhang. Deepwidesearch: Benchmarking depth and width in agentic information seeking, 2025

  49. [50]

    Gisa: A benchmark for general information seeking assistant.CoRR, abs/2602.08543, 2026

    Yutao Zhu, Xingshuo Zhang, Maosen Zhang, Jiajie Jin, Liancheng Zhang, Xiaoshuai Song, Kangzhi Zhao, Wencong Zeng, Ruiming Tang, Han Li, Ji-Rong Wen, and Zhicheng Dou. Gisa: A benchmark for general information seeking assistant.CoRR, abs/2602.08543, 2026

  50. [51]

    Table-as-search: Formulate long-horizon agentic information seeking as table completion, 2026

    Tian Lan, Felix Henry, Bin Zhu, Qianghuai Jia, Junyang Ren, Qihang Pu, Haijun Li, Longyue Wang, Zhao Xu, and Weihua Luo. Table-as-search: Formulate long-horizon agentic information seeking as table completion, 2026

  51. [52]

    2026” upper bound, we invoke an LLM to calibrate the limit to month-level precision (e.g., “before February 2026

    Tiansheng Hu, Yilun Zhao, Canyu Zhang, Arman Cohan, and Chen Zhao. Sage: Benchmarking and improving retrieval for deep research agents, 2026. 13 A Ethic Statement This benchmark is constructed from publicly accessible scientific papers in arXiv/DeepXiv and does not involve private personal data or direct interaction with human subjects. Human annotators a...

  52. [53]

    Locate the sub-question that is the most independent and bottom-level

  53. [54]

    Use the search tool to solve that sub-question and to clarify the user intent

  54. [55]

    Plug the answer of the sub-question back into the original problem to form a new, more direct search query

  55. [56]

    Deal with the new sub-question in the next round

  56. [57]

    name": "search

    When the Multi-Hop query is clarified and resolved to single-hop query, search one last time for final candidates. 1.3. Candidate Paper Evaluation - Evaluate the paper list provided in the latest user message within the `<tool_response>` tag. This list contains the results from your most recent search action. Identify which papers are useful and relevant ...

  57. [58]

    **Thinking**: Your reasoning process here for intent understanding and planning

  58. [59]

    Then output a list of selected paper IDs from that list within `<candidates>...</candidates>` tags

    **Candidate Selection**: Your brief analysis of searched paper from the latest tool response and decide what paper to add as candidates. Then output a list of selected paper IDs from that list within `<candidates>...</candidates>` tags. If you mistakenly keep the previous IDs, it will lead to errors in your results

  59. [60]

    query\": \

    **Action**: EITHER a tool call in <tool_call>...</tool_call> tags OR the finish signal <answer>Done</answer>. SYSTEM PROMPT-2 Figure 13: System prompt-2 of evaluation pipeline. 26 Opus inference case: Correct Trajectory-1 I'm searching for a research work where the authors conducted a comparative evaluation of architectural variants for a computational ma...