Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training

Can Qin; Ce Zhang; Chun Chen; Dong Yu; Haitao Mi; Hongming Zhang; Jiaqi Chen; Jingchen Ni; Jun-Yu Ma; Rui Wang

arXiv: 2508.00414 · v3 · submitted 2025-08-01 · cs.AI · cs.CL

Tianqing Fang, Zhisong Zhang, Xiaoyang Wang, Rui Wang, Can Qin, Yuxuan Wan, Jun-Yu Ma, Ce Zhang, Jiaqi Chen, Xiyun Li, Yonglin Wang, Jingchen Ni, Tianshi Zheng, Chun Chen, Wenhao Yu, Zhenwen Liang, Hongming Zhang, Haitao Mi, Dong Yu

Pith reviewed 2026-05-19 01:47 UTC · model grok-4.3

keywords: AI agents, open-source framework, Agent Foundation Models, GAIA benchmark, data curation, test-time reflection, multi-agent systems, autonomous agents

The pith

Cognitive Kernel-Pro reports that an open-source 8B model sets a new performance standard on GAIA among open-source and free agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces Cognitive Kernel-Pro, a fully open-source framework for building and training AI agents that handle web tasks, coding, file operations, and general reasoning. The authors describe how to curate high-quality training queries, trajectories, and answers, plus methods for reflection and voting at test time to improve results. If these techniques work as described, they would make advanced agent capabilities available to anyone without paid tools or APIs, broadening participation in agent research. The 8B model trained with this approach outperforms earlier open systems on the GAIA benchmark.

Core claim

The Cognitive Kernel-Pro framework systematically constructs training data across four domains and applies test-time reflection and voting to produce an 8B-parameter model that achieves state-of-the-art results among open-source agents on GAIA, surpassing systems like WebDancer and WebSailor.

What carries the argument

The multi-module agent framework with curated trajectory data for Agent Foundation Models and test-time reflection and voting strategies.

If this is right

Open and free agent systems can achieve competitive or superior performance to prior leading approaches on complex benchmarks.
Training data curation focused on verifiable answers in web, file, code, and reasoning domains drives agent capability gains.
Reflection and voting at inference time enhance agent robustness and accuracy.
Releasing the full framework and code supports reproducible research in agent foundation models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Other researchers could build upon the released code to train larger models or test on additional benchmarks.
The modular structure suggests potential for adapting the framework to new domains or integrating with different base models.
Success here may encourage more open development of agent systems, reducing reliance on closed APIs.

Load-bearing premise

The GAIA benchmark results genuinely measure general agent capabilities and are not influenced by unstated differences in tool access or data compared to previous open systems.

What would settle it

A controlled comparison where the Cognitive Kernel-Pro code is run with the same tool set and data constraints as WebDancer or WebSailor to verify if the performance advantage holds.
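Once two systems have been run on the same GAIA tasks with the same tools, their per-task outcomes can be compared with a paired test. The sketch below uses an exact McNemar test on the tasks where the two systems disagree. It is an illustration only: the paper does not report this test, and `mcnemar_exact` and the outcome lists are hypothetical placeholders, not results.

```python
from math import comb

def mcnemar_exact(a_correct, b_correct):
    """Exact two-sided McNemar test on paired per-task outcomes.

    a_correct, b_correct: lists of booleans, one entry per task, in the same order.
    Returns (tasks only A solved, tasks only B solved, p-value).
    """
    if len(a_correct) != len(b_correct):
        raise ValueError("both systems must be scored on the same tasks")
    only_a = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    only_b = sum(1 for a, b in zip(a_correct, b_correct) if b and not a)
    n = only_a + only_b          # discordant pairs carry all the evidence
    if n == 0:
        return only_a, only_b, 1.0
    k = min(only_a, only_b)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)

if __name__ == "__main__":
    # Placeholder outcomes for eight tasks (not real benchmark data).
    system_a = [True, True, True, True, True, False, True, False]
    system_b = [False, False, False, False, False, True, True, False]
    print(mcnemar_exact(system_a, system_b))
```

A paired test is used here because both systems answer the same questions, so only the tasks where they disagree say anything about which one is better.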

Figures

Figures reproduced from arXiv: 2508.00414 by Can Qin, Ce Zhang, Chun Chen, Dong Yu, Haitao Mi, Hongming Zhang, Jiaqi Chen, Jingchen Ni, Jun-Yu Ma, Rui Wang, Tianqing Fang, Tianshi Zheng, Wenhao Yu, Xiaoyang Wang, Xiyun Li, Yonglin Wang, Yuxuan Wan, Zhenwen Liang, Zhisong Zhang.

**Figure 2.** Technical roadmap showcasing prior innovations from Tencent AI Lab (Cognitive Ker...

**Figure 3.** Overview of the Cognitive Kernel-Pro Agent Framework. The left panel illustrates the...

**Figure 4.** Illustration of information aggregation in the creation of URLQA.

Original abstract

General AI Agents are increasingly recognized as foundational frameworks for the next generation of artificial intelligence, enabling complex reasoning, web interaction, coding, and autonomous research capabilities. However, current agent systems are either closed-source or heavily reliant on a variety of paid APIs and proprietary tools, limiting accessibility and reproducibility for the research community. In this work, we present Cognitive Kernel-Pro, a fully open-source and (to the maximum extent) free multi-module agent framework designed to democratize the development and evaluation of advanced AI agents. Within Cognitive Kernel-Pro, we systematically investigate the curation of high-quality training data for Agent Foundation Models, focusing on the construction of queries, trajectories, and verifiable answers across four key domains: web, file, code, and general reasoning. Furthermore, we explore novel strategies for agent test-time reflection and voting to enhance agent robustness and performance. We evaluate Cognitive Kernel-Pro on GAIA, achieving state-of-the-art results among open-source and free agents. Notably, our 8B-parameter open-source model surpasses previous leading systems such as WebDancer and WebSailor, establishing a new performance standard for accessible, high-capability AI agents. Code is available at https://github.com/Tencent/CognitiveKernel-Pro

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Cognitive Kernel-Pro gives researchers a usable open-source agent stack with domain-specific data curation and test-time reflection, but the GAIA gains for the 8B model need tighter checks on baseline equivalence and tool parity.

The main point is that this paper ships a fully open multi-module agent framework plus concrete recipes for building training data across web, file, code, and reasoning, along with simple reflection and voting tricks at test time. Their 8B model is reported to beat earlier open systems like WebDancer and WebSailor on GAIA, and the code is on GitHub. That combination is the real deliverable here: something people can actually download and run without paid APIs or closed tools.

Referee Report

2 major / 2 minor

Summary. The paper introduces Cognitive Kernel-Pro, a fully open-source multi-module agent framework for web interaction, coding, file handling, and reasoning tasks. It details systematic curation of queries, trajectories, and verifiable answers across web/file/code/reasoning domains for training Agent Foundation Models, plus test-time reflection and voting mechanisms. The central empirical claim is that an 8B-parameter open-source model achieves SOTA performance on GAIA among open-source/free agents, outperforming prior systems such as WebDancer and WebSailor.

Significance. If the GAIA gains are shown to arise specifically from the described data curation pipeline and reflection/voting strategies under controlled conditions, the work would meaningfully advance accessible agent research by releasing a complete open framework and training methodology. The explicit release of code at https://github.com/Tencent/CognitiveKernel-Pro and emphasis on free/open components are concrete strengths that support reproducibility.

major comments (2)

[Experiments / Evaluation] Experiments section (presumably §4 or §5): the claim that the 8B model 'surpasses previous leading systems such as WebDancer and WebSailor' on GAIA is presented without tabulated baseline scores, statistical significance tests, or explicit confirmation that identical tool modules, API access, and environment configurations were used for the cited comparators. This equivalence is load-bearing for attributing gains to the proposed curation and test-time methods rather than implementation differences.
[Data Curation] Data curation subsection: while the construction of queries/trajectories/verifiable answers across four domains is described at a high level, the manuscript does not report quantitative metrics on trajectory quality (e.g., success rate of generated trajectories, inter-annotator agreement, or filtering criteria), making it difficult to assess whether the performance edge stems from superior data or from unstated advantages in scale or curation resources.

minor comments (2)

[Abstract / Introduction] The abstract and introduction repeatedly use 'fully open-source and free' without clarifying which components (e.g., any external APIs or models) remain non-free; a short clarifying paragraph would improve precision.
[Test-time Reflection and Voting] Notation for the reflection and voting modules is introduced without a compact algorithmic pseudocode or diagram; adding one would aid readability of the test-time enhancement section.
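The reflection and voting modules named in the last minor comment can be pictured with a compact sketch. This is a minimal illustration under stated assumptions, not the paper's algorithm: `run_agent` and `reflect_ok` are hypothetical callables, reflection is modeled as a yes/no self-check that triggers a retry, and voting is modeled as a plain majority vote over normalized final answers.

```python
from collections import Counter

def solve_with_reflection(task, run_agent, reflect_ok, max_retries=2):
    """Run the agent once, then retry while the self-check rejects the answer."""
    answer = run_agent(task)
    for _ in range(max_retries):
        if reflect_ok(task, answer):
            break
        answer = run_agent(task)  # assumed: a fresh attempt after a failed check
    return answer

def vote(answers):
    """Majority vote over normalized answers; ties go to the earliest answer."""
    normalized = [a.strip().lower() for a in answers if a is not None]
    if not normalized:
        return None
    counts = Counter(normalized)
    best = max(counts.values())
    for a in normalized:
        if counts[a] == best:
            return a

def solve(task, run_agent, reflect_ok, n_samples=3, max_retries=2):
    """Sample several reflected answers and return the voted one."""
    answers = [
        solve_with_reflection(task, run_agent, reflect_ok, max_retries)
        for _ in range(n_samples)
    ]
    return vote(answers)

if __name__ == "__main__":
    # Stub agent that replays canned answers; a real agent would call a model.
    canned = iter(["", "Paris", "paris ", "Lyon"])
    run_agent = lambda task: next(canned)
    reflect_ok = lambda task, ans: bool(ans.strip())  # reject empty answers
    print(solve("capital of France?", run_agent, reflect_ok))
```

In this toy run the empty first answer fails the self-check and is retried, and the vote then picks the answer that two of the three samples agree on.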

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for their thorough and constructive feedback. We address each major comment below and have revised the manuscript to enhance clarity, reproducibility, and the strength of our empirical claims.

Referee: [Experiments / Evaluation] Experiments section (presumably §4 or §5): the claim that the 8B model 'surpasses previous leading systems such as WebDancer and WebSailor' on GAIA is presented without tabulated baseline scores, statistical significance tests, or explicit confirmation that identical tool modules, API access, and environment configurations were used for the cited comparators. This equivalence is load-bearing for attributing gains to the proposed curation and test-time methods rather than implementation differences.

Authors: We thank the referee for this important observation on experimental rigor. The manuscript references performance numbers from the original WebDancer and WebSailor papers and includes a results table in Section 5, but we agree the presentation of baselines and setup equivalence could be strengthened. In the revision, we have expanded the comparison table to explicitly list all cited baseline scores alongside our results, added a dedicated paragraph clarifying that our evaluations follow the public GAIA protocol with fully open-source tool modules, and noted any unavoidable differences arising from API evolution or proprietary components in prior work. Where multiple evaluation runs were performed, we now report means and standard deviations; single-run results are flagged as such. These updates better support attribution of gains to our curation pipeline and test-time reflection/voting. revision: partial
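The reporting convention described in this response (mean and standard deviation for repeated runs, a flag for single runs) could look roughly like the following. This is a sketch under assumptions: `summarize_runs` is a hypothetical helper, and the scores in the example are placeholders, not numbers from the paper.

```python
from statistics import mean, stdev

def summarize_runs(scores):
    """Summarize benchmark scores from one or more evaluation runs.

    Returns the run count, the mean, the sample standard deviation
    (None when only one run exists), and a single-run flag.
    """
    if not scores:
        raise ValueError("at least one run is required")
    n = len(scores)
    return {
        "n_runs": n,
        "mean": mean(scores),
        "std": stdev(scores) if n > 1 else None,  # stdev needs >= 2 points
        "single_run": n == 1,
    }

if __name__ == "__main__":
    print(summarize_runs([0.30, 0.32, 0.34]))  # placeholder scores
    print(summarize_runs([0.31]))              # flagged as a single run
```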
Referee: [Data Curation] Data curation subsection: while the construction of queries/trajectories/verifiable answers across four domains is described at a high level, the manuscript does not report quantitative metrics on trajectory quality (e.g., success rate of generated trajectories, inter-annotator agreement, or filtering criteria), making it difficult to assess whether the performance edge stems from superior data or from unstated advantages in scale or curation resources.

Authors: We agree that quantitative quality metrics would improve transparency and allow readers to better evaluate the data curation pipeline. In the revised manuscript we have added a dedicated paragraph and accompanying table in the Data Curation section that reports trajectory success rates after automated and manual filtering (approximately 82% overall across domains), inter-annotator agreement on a sampled subset (Cohen’s kappa = 0.76), and explicit filtering criteria including answer verifiability, trajectory length bounds, and error-type rejection rules. These additions directly address the concern and help substantiate that performance improvements derive from the described curation process. revision: yes
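To make the quantities in this response concrete, here is a minimal sketch of a trajectory filter and of Cohen's kappa, computed as (p_o - p_e) / (1 - p_e) from observed and chance agreement. Everything named here is an assumption for illustration: the `Trajectory` fields, the error tags, the length bounds, and the use of exact match as the verifiability check are hypothetical and are not taken from the paper.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Trajectory:
    steps: list             # agent actions, one entry per step
    answer: str             # final answer produced by the trajectory
    reference: str          # verifiable ground-truth answer
    error_tags: tuple = ()  # error types observed during the run

REJECTED_ERRORS = {"tool_crash", "timeout"}  # hypothetical rejection rules

def keep(t, min_steps=1, max_steps=30):
    """Keep a trajectory only if it passes all three filters."""
    verifiable = t.answer.strip().lower() == t.reference.strip().lower()
    in_bounds = min_steps <= len(t.steps) <= max_steps
    clean = not (set(t.error_tags) & REJECTED_ERRORS)
    return verifiable and in_bounds and clean

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n  # observed agreement
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / (n * n)  # chance agreement
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

if __name__ == "__main__":
    good = Trajectory(steps=["search", "read"], answer="42", reference="42")
    bad = Trajectory(steps=["search"], answer="42", reference="42",
                     error_tags=("timeout",))
    print(keep(good), keep(bad))
    print(cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0]))
```

The success rate after filtering is then the fraction of trajectories for which `keep` returns true.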

Circularity Check

0 steps flagged

No circularity in empirical framework and benchmark reporting

The paper describes an open-source agent framework, data curation process across web/file/code/reasoning domains, test-time reflection/voting strategies, and reports GAIA benchmark results for an 8B model. No equations, fitted parameters renamed as predictions, or self-referential derivations appear in the provided text. Performance claims rest on external benchmark comparisons rather than reducing to the paper's own inputs by construction, rendering the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied systems and empirical paper focused on framework implementation and benchmark evaluation; no mathematical derivations, fitted parameters, or postulated entities are described in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArrowOfTime.lean arrow_from_z unclear

We introduce two inference-time optimization processes—reflection and voting—designed to enable the agent to evaluate and refine its own trajectories

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning
cs.AI · 2026-05 · unverdicted · novelty 6.0

SciResearcher automates creation of diverse scientific reasoning tasks from academic evidence to train an 8B model that sets new SOTA at 19.46% on HLE-Bio/Chem-Gold and gains 13-15% on SuperGPQA-Hard-Biology and TRQA-...

Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
cs.AI · 2026-04 · unverdicted · novelty 6.0

LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and a...

DynaWeb: Model-Based Reinforcement Learning of Web Agents
cs.CL · 2026-01 · unverdicted · novelty 6.0

DynaWeb introduces a model-based RL framework that trains web agents via imagined rollouts in a learned web world model interleaved with real expert trajectories, yielding consistent gains on WebArena and WebVoyager b...

MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling
cs.CL · 2025-11 · unverdicted · novelty 6.0

MiroThinker shows that scaling agent-environment interactions via reinforcement learning lets a 72B open-source model reach up to 81.9% on GAIA and approach commercial performance on research benchmarks.

Mind DeepResearch Technical Report
cs.AI · 2026-04 · unverdicted · novelty 5.0

MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · cited by 5 Pith papers · 3 internal anchors

[1]

Tapeagents: a holistic framework for agent development and optimization

Dzmitry Bahdanau, Nicolas Gontier, Gabriel Huang, Ehsan Kamalloo, Rafael Pardinas, Alex Piché, Torsten Scholak, Oleh Shliazhko, Jordan Prince Tremblay, Karam Ghanem, Soham Parikh, Mitul Tiwari, and Quaizar Vohra. Tapeagents: a holistic framework for agent development and optimization. arXiv preprint arXiv:2412.08445,

work page arXiv
[2]

Edward Beeching, Shengyi Costa Huang, Albert Jiang, Jia Li, Benjamin Lipkin, Zihan Qina, Kashif Rasul, Ziju Shen, Roman Soletskyi, and Lewis Tunstall

Published: 2024-12-11, Accessed: 2025-07-25. Edward Beeching, Shengyi Costa Huang, Albert Jiang, Jia Li, Benjamin Lipkin, Zihan Qina, Kashif Rasul, Ziju Shen, Roman Soletskyi, and Lewis Tunstall. Numinamath 7b tir. https://huggingface.co/AI-MO/NuminaMath-7B-TIR,

work page 2024
[3]

Webevolver: Enhancing web agent self-improvement with coevolving world model

Tianqing Fang, Hongming Zhang, Zhisong Zhang, Kaixin Ma, Wenhao Yu, Haitao Mi, and Dong Yu. Webevolver: Enhancing web agent self-improvement with coevolving world model. arXiv preprint arXiv:2504.21024,

work page arXiv
[4]

Scaling Synthetic Data Creation with 1,000,000,000 Personas

Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. Scaling synthetic data creation with 1,000,000,000 personas. arXiv preprint arXiv:2406.20094,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Webvoyager: Building an end-to-end web agent with large multimodal models

Accessed: 2025-07-25. Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Lo...

work page doi:10.18653/v1/2024.acl-long.371 2025
[6]

Search-o1: Agentic Search-Enhanced Large Reasoning Models

Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models. CoRR, abs/2501.05366, 2025b. doi: 10.48550/ARXIV.2501.05366. URL https://arxiv.org/abs/2501.05366. Accessed: 2025-07-26. Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yutao Zhu, Yongka...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.05366 2025
[7]

doi: 10.18653/v1/2021.findings-acl.131

Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-acl.131. URL https://aclanthology.org/2021.findings-acl.131. Hanmeng Liu, Zhiyang Teng, Leyang Cui, Chaoli Zhang, Qiji Zhou, and Yue Zhang. LogiCoT: Logical chain-of-thought instruction tuning. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Com...

work page doi:10.18653/v1/2021.findings-acl.131 2021
[8]

doi: 10.18653/v1/2023.findings-emnlp.191

Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.191. URL https://aclanthology.org/2023.findings-emnlp.191/. Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for general AI assistants. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna...

work page doi:10.18653/v1/2023.findings-emn 2023
[9]

Moonshot AI

URL https://manus.im/. Moonshot AI. Kimi-k2. https://github.com/MoonshotAI/Kimi-K2, 2025a. Published: 2025-07-11, Accessed: 2025-07-25. Moonshot AI. Kimi-researcher: End-to-end rl training for emerging agentic capabilities. https://moonshotai.github.io, 2025b. Published: 2025-06-20, Accessed: 2025-07-25. OpenAI. Introducing deep research. Technical repor...

work page 2025
[10]

Aymeric Roucher, Albert Villanova del Moral, Thomas Wolf, Leandro von Werra, and Erik Kaunismäki

Published: 2025-02-14, Accessed: 2025-07-25. Aymeric Roucher, Albert Villanova del Moral, Thomas Wolf, Leandro von Werra, and Erik Kaunismäki. ‘smolagents’: a smol library to build great agentic systems. https://github.com/huggingface/smolagents,

work page 2025
[11]

R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

URL https://arxiv.org/abs/2503.05592. Jiabin Tang, Tianyu Fan, and Chao Huang. Autoagent: A fully-automated and zero-code framework for llm agents. arXiv preprint arXiv:2502.05957,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Webshaper: Agentically data synthesizing via information-seeking formalization. arXiv preprint arXiv:2507.15061, 2025

Published: 2025-02-18, Accessed: 2025-07-25. Zhengwei Tao, Jialong Wu, Wenbiao Yin, Junkai Zhang, Baixuan Li, Haiyang Shen, Kuan Li, Liwen Zhang, Xinyu Wang, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Webshaper: Agentically data synthesizing via information-seeking formalization. https://arxiv.org/abs/2507.15061,

work page arXiv 2025
[13]

Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou

Published: 2025-07-20, Accessed: 2025-07-25. Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Webdancer: Towards autonomous information seeking agency, 2025a. URL https://arxiv.org/abs/2505.22648. Jialong Wu, Wenbiao Yin, Yong Jiang, Zhengli...

work page doi:10.48550/arxiv.2501.07572 2025
[14]

URL https://doi.org/10.48550/arXiv.2409.10277

doi: 10.48550/ARXIV.2409.10277. URL https://doi.org/10.48550/arXiv.2409.10277. Zhisong Zhang, Tianqing Fang, Kaixin Ma, Wenhao Yu, Hongming Zhang, Haitao Mi, and Dong Yu. Enhancing web agents with explicit rollback mechanisms. arXiv preprint arXiv:2504.11788,

work page doi:10.48550/arxiv.2409.10277
[15]

OAgents: An empirical study of building effective agents. arXiv preprint arXiv:2506.15741, 2025

He Zhu, Tianrui Qin, King Zhu, Heyuan Huang, Yeyi Guan, Jinxiang Xia, Yi Yao, Hanhao Li, Ningning Wang, Pai Liu, Tianhao Peng, Xin Gui, Xiaowan Li, Yuhui Liu, Yuchen Eleanor Jiang, Jun Wang, Changwang Zhang, Xiangru Tang, Ge Zhang, Jian Yang, Minghao Liu, Xitong Gao, Jiaheng Liu, and Wangchunshu Zhou. Oagents: An empirical study of building effective agen...

work page arXiv
[16]

Docbench: A benchmark for evaluating llm-based document reading systems,

URL https://arxiv.org/abs/2407.10701. A Technical Details of Cognitive Kernel-Pro Framework. Code-based Action and Tool-using. Both the main agent and the sub-agents employ a similar multi-step workflow for their problem-solving process. We utilize code-based actions: all actions, including sub-agent and tool invocations, are defined as...

work page arXiv
[17]

Instructions to download files (specify desired output path if needed). Returns: dict: A dictionary with the following structure: ‘output’ (str): The well-formatted answer, strictly following any specified output format; ‘log’(str): Additional notes, such as steps taken, issues encountered, or relevant context. Notes: - If the ‘task‘ specifies an output f...

work page 2000
