Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

Bei Liu; Chong Luo; Gongrui Zhang; Guanting Dong; Hongjin Qian; Jiajie Jin; Kai Qiu; Lijuan Wang; Linjie Li; Qi Dai

arxiv: 2606.11926 · v1 · pith:Z3SZXXLMnew · submitted 2026-06-10 · 💻 cs.CL · cs.AI


Jiajie Jin, Yuyang Hu, Kai Qiu, Qi Dai, Chong Luo, Guanting Dong, Xiaoxi Li, Tong Zhao, Xiaolong Ma, Gongrui Zhang, Zhirong Wu, Bei Liu, Zhengyuan Yang, Linjie Li, Lijuan Wang, Hongjin Qian, Yutao Zhu, Zhicheng Dou


Pith reviewed 2026-06-27 09:32 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords autonomous research · hypothesis tree refinement · AI agents · research automation · cumulative learning · model optimization · scientific discovery


The pith

Arbor maintains a persistent Hypothesis Tree that lets an AI coordinator accumulate and refine research insights across many iterations instead of restarting each time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Arbor as a framework that runs the scientific loop of exploration, experimentation, and abstraction autonomously over long horizons. It pairs a long-lived coordinator that manages global strategy with short-lived executors that test individual hypotheses, all organized through Hypothesis Tree Refinement. The tree stores hypotheses, artifacts, evidence, and distilled lessons so that verified improvements and reusable insights propagate forward rather than being lost. Evaluation on six concrete tasks in model training, harness engineering, and data synthesis shows Arbor reaching the best held-out result on every task and more than 2.5 times the average relative gain of Codex and Claude Code under identical interfaces and budgets. On MLE-Bench Lite the system reaches 86.36 percent Any Medal with GPT-5.5.

Core claim

Arbor is a general framework for autonomous research that combines a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights across time. The coordinator manages global research strategy over the tree, while executors implement and test individual hypotheses in isolated worktrees. As results return, Arbor updates the tree, propagates reusable lessons, refines the search frontier, and admits verified improvements. This design turns autonomous research from a sequence of local attempts into a cumulative process in which strategy, execution, and evidence are carried across time.

What carries the argument

Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights so the coordinator can propagate lessons and refine the search frontier across iterations.
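
To make the idea of a persistent tree concrete, here is a minimal sketch of what such a structure could look like. The abstract does not specify the paper's data model, so the class name `HypothesisNode`, its fields, and the rule that lessons are copied to every ancestor are all illustrative assumptions, not Arbor's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class HypothesisNode:
    """One node of a hypothetical hypothesis tree; all field names are assumptions."""
    statement: str
    score: Optional[float] = None       # held-out metric, None until tested
    artifact: Optional[str] = None      # e.g. a worktree path or commit id
    insights: List[str] = field(default_factory=list)
    parent: Optional["HypothesisNode"] = field(default=None, repr=False)
    children: List["HypothesisNode"] = field(default_factory=list, repr=False)

    def add_child(self, statement: str) -> "HypothesisNode":
        """Create an untested refinement of this hypothesis."""
        child = HypothesisNode(statement=statement, parent=self)
        self.children.append(child)
        return child

    def record(self, artifact: str, score: float, insight: str) -> None:
        """Attach evidence to this node and copy the lesson to every ancestor."""
        self.artifact, self.score = artifact, score
        node: Optional["HypothesisNode"] = self
        while node is not None:
            node.insights.append(insight)
            node = node.parent

    def frontier(self) -> List["HypothesisNode"]:
        """Return all untested nodes in this subtree (the open search frontier)."""
        open_nodes = [self] if self.score is None else []
        for child in self.children:
            open_nodes.extend(child.frontier())
        return open_nodes


# Hypothetical usage: one failed refinement leaves a lesson at the root.
root = HypothesisNode("baseline training recipe", score=0.70)
lr = root.add_child("raise the learning rate")
aug = root.add_child("add data augmentation")
lr.record(artifact="worktree-lr", score=0.68, insight="higher LR destabilizes training")
```

In this sketch a failed hypothesis still contributes: its lesson is visible at the root, so a coordinator choosing the next node from `root.frontier()` can read it before expanding the sibling branch.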

If this is right

Research agents can improve an initial artifact through iterative experimentation without step-level human supervision.
Lessons from failed or successful hypotheses become reusable across later attempts rather than being discarded.
The same task interface and resource budget produce substantially higher held-out performance than prior agent baselines.
The framework scales to multiple domains including model training, harness engineering, and data synthesis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the tree structure continues to grow without becoming intractable, the approach could support research campaigns lasting hundreds of iterations.
The coordinator-executor split may allow future versions to run many executors in parallel while the coordinator maintains a single coherent research plan.
Distilled insights stored in the tree could be extracted and reused by other agents or even by human researchers.
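
The parallel-executor extension can be sketched as follows, assuming executors are independent and return a single held-out score. The function names `run_executor` and `coordinate`, the placeholder scores, and the admit-only-if-better rule are hypothetical; the paper's actual scheduling and admission logic is not described in the abstract.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Placeholder scores standing in for real held-out evaluations (hypothetical values).
FAKE_SCORES: Dict[str, float] = {
    "tune learning rate": 0.71,
    "add warmup schedule": 0.74,
    "increase batch size": 0.69,
}


def run_executor(hypothesis: str) -> Dict[str, object]:
    """Stand-in executor: a real one would edit code in an isolated worktree and evaluate it."""
    return {"hypothesis": hypothesis, "score": FAKE_SCORES[hypothesis]}


def coordinate(frontier: List[str], incumbent: float, max_workers: int = 3):
    """Dispatch every frontier hypothesis in parallel, then admit the best result
    only if it beats the incumbent score."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_executor, frontier))  # map preserves input order
    best = max(results, key=lambda r: r["score"])
    admitted = best if best["score"] > incumbent else None
    return results, admitted


results, admitted = coordinate(list(FAKE_SCORES), incumbent=0.72)
```

Only the coordinator decides what is admitted, so the executors can run concurrently without sharing state, which is the property the split between long-lived and short-lived roles would rely on.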

Load-bearing premise

The six chosen tasks plus MLE-Bench Lite are representative of general autonomous research and the reported gains arise from the Hypothesis Tree Refinement mechanism rather than from differences in prompting, model access, or task-specific engineering.

What would settle it

Running the same six tasks with an otherwise identical Arbor variant that replaces the Hypothesis Tree with a flat chronological log and measuring whether the 2.5x relative gain disappears.
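
One way to picture this ablation is as two interchangeable memory back ends behind the same interface, so that everything else in the agent stays fixed. The sketch below is an assumption about how such a comparison could be wired up; `FlatLogMemory`, `TreeMemory`, and their methods are invented for illustration and do not come from the paper.

```python
from typing import Dict, List, Optional


class FlatLogMemory:
    """Ablation arm: lessons are kept in arrival order, and only the most recent are shown."""

    def __init__(self, window: int = 2) -> None:
        self.window = window
        self.entries: List[str] = []

    def write(self, hypothesis: str, parent: Optional[str], lesson: str) -> None:
        self.entries.append(lesson)  # the parent link is deliberately ignored

    def context_for(self, parent: Optional[str]) -> List[str]:
        return self.entries[-self.window:]


class TreeMemory:
    """Tree arm: lessons are retrieved along the path from the root to the parent node."""

    def __init__(self) -> None:
        self.parent: Dict[str, Optional[str]] = {}
        self.lesson: Dict[str, str] = {}

    def write(self, hypothesis: str, parent: Optional[str], lesson: str) -> None:
        self.parent[hypothesis] = parent
        self.lesson[hypothesis] = lesson

    def context_for(self, parent: Optional[str]) -> List[str]:
        path: List[str] = []
        node = parent
        while node is not None:
            path.append(self.lesson[node])
            node = self.parent[node]
        return path[::-1]  # root first


# Hypothetical history: two unrelated branches, h1 -> h2 and h3 -> h4.
events = [("h1", None, "L1"), ("h2", "h1", "L2"), ("h3", None, "L3"), ("h4", "h3", "L4")]
flat, tree = FlatLogMemory(window=2), TreeMemory()
for hypothesis, parent, lesson in events:
    flat.write(hypothesis, parent, lesson)
    tree.write(hypothesis, parent, lesson)

flat_context = flat.context_for("h2")
tree_context = tree.context_for("h2")
```

When the agent is about to refine `h2`, the tree arm surfaces the lessons from its own branch while the flat log surfaces the most recent, unrelated ones. Whether that difference accounts for the reported gain is exactly what the ablation would measure.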


Scientific progress depends on a repeated loop of exploration, experimentation, and abstraction. Researchers test candidate directions, interpret the evidence, and carry the resulting lessons into later attempts. We study how an AI agent can run this loop autonomously over long horizons. We introduce Arbor, a general framework for autonomous research that combines a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights across time. The coordinator manages global research strategy over the tree, while executors implement and test individual hypotheses in isolated worktrees. As results return, Arbor updates the tree, propagates reusable lessons, refines the search frontier, and admits verified improvements. This design turns autonomous research from a sequence of local attempts into a cumulative process in which strategy, execution, and evidence are carried across time. We evaluate Arbor under Autonomous Optimization (AO), an operational setting where an agent improves an initial research artifact through iterative experimentation without step-level human supervision. Across six real research tasks in model training, harness engineering, and data synthesis, Arbor achieves the best held-out result on all six tasks, attaining more than 2.5x the average relative held-out gain of Codex and Claude Code under the same task interface and resource budget. On MLE-Bench Lite, Arbor reaches 86.36% Any Medal with GPT-5.5, the strongest result in our comparison.
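
The two headline numbers can be unpacked with a short calculation. The abstract does not define "relative held-out gain", so the formula below, improvement over the initial artifact divided by the initial score and averaged over tasks, is an assumption, and the task scores are made-up placeholders. The Any Medal check assumes MLE-Bench Lite contains 22 competitions, which the abstract does not state.

```python
from typing import List, Tuple


def relative_gain(initial: float, final: float) -> float:
    """Improvement over the starting artifact as a fraction of the starting score
    (assumed definition; higher scores are taken to be better)."""
    return (final - initial) / abs(initial)


def mean_relative_gain(runs: List[Tuple[float, float]]) -> float:
    """Average the per-task relative gains over (initial, final) score pairs."""
    return sum(relative_gain(i, f) for i, f in runs) / len(runs)


# Placeholder (initial, final) held-out scores on two hypothetical tasks.
system_runs = [(0.50, 0.65), (0.40, 0.50)]
baseline_runs = [(0.50, 0.55), (0.40, 0.44)]

gain_ratio = mean_relative_gain(system_runs) / mean_relative_gain(baseline_runs)

# If the Lite split has 22 competitions (assumption), 19 medals give the reported rate.
any_medal_percent = round(100 * 19 / 22, 2)
```

With these placeholder numbers the system's mean relative gain is 0.275 against 0.1 for the baseline, a ratio of 2.75. The example also shows that a ratio of averages depends heavily on how small the baseline gains are, which is one reason per-task figures matter.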

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The claimed gains look like they could come from running on GPT-5.5 while baselines used older models, so the HTR contribution is not yet isolated.


The paper introduces Arbor, which keeps a persistent hypothesis tree across sessions, splits work between a long-lived coordinator and short-lived executors, and uses that tree to carry evidence and lessons forward. That structure is the main new piece relative to single-shot agent setups.

It does a reasonable job laying out the loop of exploration, testing, and abstraction that real research follows, and the six tasks plus MLE-Bench Lite give a concrete test bed. The idea of separating strategy from execution in isolated worktrees is practical and worth testing.

The soft spot is the comparison. Arbor is reported with GPT-5.5 and reaches 86.36% Any Medal on MLE-Bench Lite, while the baselines, Codex and Claude Code, are said to run under the same task interface and resource budget. The abstract does not confirm that the baselines also used GPT-5.5, equivalent prompting, or the same executor code. If the tree is mostly an organizer around a stronger model, the 2.5x relative gain cannot be attributed to HTR. No statistical tests or variance figures are mentioned either.

The evaluation tasks are real research steps, which is better than toy benchmarks, but without controls for model strength the central result stays provisional. The framework itself is coherent on paper and does not rely on circular math.

This is worth a serious referee if the authors add a controlled ablation on model and prompt. People building long-horizon agents would get value from the design even if the numbers need tightening. I would not cite it yet.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces Arbor, a framework for autonomous research that integrates a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR) to maintain a persistent tree linking hypotheses, artifacts, evidence, and distilled insights across iterations. It evaluates the system in an Autonomous Optimization setting on six real research tasks (model training, harness engineering, data synthesis) plus MLE-Bench Lite, claiming that Arbor achieves the best held-out result on all six tasks, more than 2.5x the average relative held-out gain of Codex and Claude Code under identical task interface and resource budget, and 86.36% Any Medal on MLE-Bench Lite when using GPT-5.5.

Significance. If the reported gains prove robust and attributable to the HTR mechanism rather than model or implementation differences, the work would offer a concrete approach to cumulative long-horizon autonomous research, moving beyond isolated attempts to a process that propagates lessons across time; the persistent tree structure is a clear conceptual contribution.

major comments (2)

[Abstract] The central performance claims (best result on all six tasks, >2.5x average relative held-out gain, 86.36% Any Medal) are presented without any description of experimental controls, statistical significance testing, or confirmation that the baselines (Codex, Claude Code) used the identical base model GPT-5.5 rather than weaker models; this directly affects whether gains can be attributed to HTR versus model access.
[Abstract] In the abstract and evaluation description, the claim that results arise from the Hypothesis Tree Refinement mechanism (persistent linking of hypotheses/artifacts/evidence) rather than task-specific engineering or prompting differences is not supported by any ablation or controlled comparison that isolates the tree component while holding model, prompt template, and executor fixed.
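
As an illustration of the kind of significance check the first comment asks for, the sketch below runs an exact paired sign-flip permutation test over per-task score differences. The test is a standard technique chosen here for illustration, not something the paper reports, and the six differences are hypothetical placeholders.

```python
from itertools import product
from typing import Sequence


def sign_flip_p_value(diffs: Sequence[float]) -> float:
    """One-sided exact paired permutation test.

    Under the null hypothesis that the two systems are interchangeable, each
    per-task difference is equally likely to be positive or negative. The
    p-value is the fraction of all sign assignments whose total is at least
    as large as the observed total.
    """
    observed = sum(diffs)
    at_least_as_large = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if sum(s * d for s, d in zip(signs, diffs)) >= observed - 1e-12:
            at_least_as_large += 1
    return at_least_as_large / total


# Hypothetical per-task differences (system minus baseline), all favoring the system.
task_diffs = [0.05, 0.02, 0.10, 0.01, 0.03, 0.04]
p_value = sign_flip_p_value(task_diffs)
```

With six tasks there are only 64 sign assignments, so winning on every task against a single baseline gives a one-sided p-value of 1/64, about 0.016, at best. This bounds how much evidence six tasks can supply without repeated runs or variance estimates.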

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater clarity in the abstract regarding experimental controls and the attribution of gains to the HTR mechanism. We address each major comment below.


Referee: [Abstract] The central performance claims (best result on all six tasks, >2.5x average relative held-out gain, 86.36% Any Medal) are presented without any description of experimental controls, statistical significance testing, or confirmation that the baselines (Codex, Claude Code) used the identical base model GPT-5.5 rather than weaker models; this directly affects whether gains can be attributed to HTR versus model access.

Authors: We agree that the abstract should explicitly describe the experimental controls to support attribution. The manuscript already states that comparisons used the same task interface and resource budget, with the MLE-Bench Lite result reported using GPT-5.5 for Arbor. We will revise the abstract to add a concise description of these controls and to note that statistical significance testing across multiple runs was not performed due to computational cost. This revision will make the shared setup clearer without altering the reported numbers.

revision: yes

Referee: [Abstract] In the abstract and evaluation description, the claim that results arise from the Hypothesis Tree Refinement mechanism (persistent linking of hypotheses/artifacts/evidence) rather than task-specific engineering or prompting differences is not supported by any ablation or controlled comparison that isolates the tree component while holding model, prompt template, and executor fixed.

Authors: The referee correctly observes that the manuscript contains no ablation that removes only the HTR component while holding the model, prompt templates, and executor implementation fixed. The reported comparisons evaluate the complete Arbor system against other agent frameworks under matched model and interface conditions, but do not isolate the persistent tree. We will revise the evaluation description to more explicitly articulate the intended contribution of HTR and to acknowledge this limitation. A dedicated ablation would require additional controlled experiments that are outside the scope of the current results.

revision: partial

Circularity Check

0 steps flagged

No circularity; empirical framework evaluation with no derivations or self-referential reductions.


The paper introduces the Arbor framework and Hypothesis Tree Refinement for autonomous research, then reports empirical results on six tasks and MLE-Bench Lite. No equations, fitted parameters, or mathematical derivations are present. Performance claims rest on held-out comparisons under a stated task interface and budget, without any step that reduces by construction to inputs, self-citations, or ansatzes. The central attribution to HTR is an empirical claim open to experimental scrutiny rather than a definitional or fitted tautology. This is a standard non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities; the ledger is therefore empty.

pith-pipeline@v0.9.1-grok · 5846 in / 1157 out tokens · 18674 ms · 2026-06-27T09:32:49.678744+00:00 · methodology



2025
[97]

2026 , eprint=

AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning , author=. 2026 , eprint=

2026
[107]

2026 , eprint=

SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent , author=. 2026 , eprint=

2026
[109]

CoRR , volume =

Jiafeng Liang and Hao Li and Chang Li and Jiaqi Zhou and Shixin Jiang and Zekun Wang and Changkai Ji and Zhihao Zhu and Runxuan Liu and Tao Ren and Jinlan Fu and See. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2512.23343 , eprinttype =. 2512.23343 , timestamp =

work page doi:10.48550/arxiv.2512.23343 2025
[110]

The Thirteenth International Conference on Learning Representations,

Jiayi Zhang and Jinyu Xiang and Zhaoyang Yu and Fengwei Teng and Xionghui Chen and Jiaqi Chen and Mingchen Zhuge and Xin Cheng and Sirui Hong and Jinlin Wang and Bingnan Zheng and Bang Liu and Yuyu Luo and Chenglin Wu , title =. The Thirteenth International Conference on Learning Representations,. 2025 , url =

2025
[111]

2025 , howpublished =

2025
[112]

Xu and Xiangru Tang and Mingchen Zhuge and Jiayi Pan and Yueqi Song and Bowen Li and Jaskirat Singh and Hoang H

Xingyao Wang and Boxuan Li and Yufan Song and Frank F. Xu and Xiangru Tang and Mingchen Zhuge and Jiayi Pan and Yueqi Song and Bowen Li and Jaskirat Singh and Hoang H. Tran and Fuqiang Li and Ren Ma and Mingzhang Zheng and Bill Qian and Yanjun Shao and Niklas Muennighoff and Yizhe Zhang and Binyuan Hui and Junyang Lin and et al. , title =. The Thirteenth ...

2025
[46] [50]

The Twelfth International Conference on Learning Representations,

Yujia Qin and Shihao Liang and Yining Ye and Kunlun Zhu and Lan Yan and Yaxi Lu and Yankai Lin and Xin Cong and Xiangru Tang and Bill Qian and Sihan Zhao and Lauren Hong and Runchu Tian and Ruobing Xie and Jie Zhou and Mark Gerstein and Dahai Li and Zhiyuan Liu and Maosong Sun , title =. The Twelfth International Conference on Learning Representations,. 2...

2024

[47] [51]

HuggingGPT: Solving

Yongliang Shen and Kaitao Song and Xu Tan and Dongsheng Li and Weiming Lu and Yueting Zhuang , editor =. HuggingGPT: Solving. Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023 , year =

2023

[48] [52]

Reflexion: language agents with verbal reinforcement learning , booktitle =

Noah Shinn and Federico Cassano and Ashwin Gopinath and Karthik Narasimhan and Shunyu Yao , editor =. Reflexion: language agents with verbal reinforcement learning , booktitle =. 2023 , url =

2023

[49] [57]

MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation , booktitle =

Qian Huang and Jian Vora and Percy Liang and Jure Leskovec , editor =. MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation , booktitle =. 2024 , url =

2024

[50] [58]

CoRR , volume =

Reiichiro Nakano and Jacob Hilton and Suchir Balaji and Jeff Wu and Long Ouyang and Christina Kim and Christopher Hesse and Shantanu Jain and Vineet Kosaraju and William Saunders and Xu Jiang and Karl Cobbe and Tyna Eloundou and Gretchen Krueger and Kevin Button and Matthew Knight and Benjamin Chess and John Schulman , title =. CoRR , volume =. 2021 , url...

Pith/arXiv arXiv 2021

[51] [63]

arXiv preprint arXiv:2503.18102 , year =

Samuel Schmidgall and Michael Moor , title =. arXiv preprint arXiv:2503.18102 , year =

arXiv

[52] [64]

arXiv preprint arXiv:2505.18705 , year =

Jiabin Tang and Lianghao Xia and Zhonghang Li and Chao Huang , title =. arXiv preprint arXiv:2505.18705 , year =

arXiv

[53] [65]

bioRxiv , year =

Kexin Huang and Serena Zhang and Hanchen Wang and Yuanhao Qu and Yingzhou Lu and Yusuf Roohani and Ryan Li and Lin Qiu and Junze Zhang and Yin Di and others , title =. bioRxiv , year =

[54] [67]

2026 , howpublished =

Jiaqi Liu and Peng Xia and Siwei Han and Shi Qiu and Letian Zhang and Guiming Chen and Haoqin Tu and Xinyu Yang and Jiawei Zhou and Hongtu Zhu and Yun Li and Jiaheng Zhang and Yuyin Zhou and Zeyu Zheng and Cihang Xie and Mingyu Ding and Huaxiu Yao , title =. 2026 , howpublished =

2026

[55] [68]

2026 , howpublished =

Andrej Karpathy , title =. 2026 , howpublished =

2026

[56] [79]

arXiv preprint arXiv:2503.21248 , year =

Yujie Liu and Zonglin Yang and Tong Xie and Jinjie Ni and Ben Gao and Yuqiang Li and Shixiang Tang and Wanli Ouyang and Erik Cambria and Dongzhan Zhou , title =. arXiv preprint arXiv:2503.21248 , year =

Pith/arXiv arXiv

[57] [81]

Siegel and Sayash Kapoor and Nitya Nadgir and Benedikt Stroebl and Arvind Narayanan , title =

Zachary S. Siegel and Sayash Kapoor and Nitya Nadgir and Benedikt Stroebl and Arvind Narayanan , title =. arXiv preprint arXiv:2409.11363 , year =

Pith/arXiv arXiv

[58] [82]

arXiv preprint arXiv:2407.01725 , year =

Bodhisattwa Prasad Majumder and Harshit Surana and Dhruv Agarwal and Bhavana Dalvi Mishra and Abhijeetsingh Meena and Aryan Prakhar and Tirth Vora and Tushar Khot and Ashish Sabharwal and Peter Clark , title =. arXiv preprint arXiv:2407.01725 , year =

arXiv

[59] [84]

arXiv preprint arXiv:2505.19955 , year =

Hui Chen and Miao Xiong and Yujie Lu and Wei Han and Ailin Deng and Yang He and Jiaying Wu and Kai Wang and Yibo Wang and Shen Li and Jiani Yu and Bryan Hooi , title =. arXiv preprint arXiv:2505.19955 , year =

arXiv

[60] [85]

Browne and Edward Powley and Daniel Whitehouse and Simon M

Cameron B. Browne and Edward Powley and Daniel Whitehouse and Simon M. Lucas and Peter I. Cowling and Philipp Rohlfshagen and Stephen Tavener and Diego Perez and Spyridon Samothrakis and Simon Colton , title =. IEEE Transactions on Computational Intelligence and AI in Games , volume =

[61] [91]

2026 , eprint=

DataMaster: Data-Centric Autonomous AI Research , author=. 2026 , eprint=

2026

[62] [93]

2025 , month = mar, organization =

Automated Researchers Can Subtly Sandbag , author =. 2025 , month = mar, organization =

2025

[63] [97]

2026 , eprint=

AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning , author=. 2026 , eprint=

2026

[64] [107]

2026 , eprint=

SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent , author=. 2026 , eprint=

2026

[65] [109]

CoRR , volume =

Jiafeng Liang and Hao Li and Chang Li and Jiaqi Zhou and Shixin Jiang and Zekun Wang and Changkai Ji and Zhihao Zhu and Runxuan Liu and Tao Ren and Jinlan Fu and See. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2512.23343 , eprinttype =. 2512.23343 , timestamp =

work page doi:10.48550/arxiv.2512.23343 2025

[66] [110]

The Thirteenth International Conference on Learning Representations,

Jiayi Zhang and Jinyu Xiang and Zhaoyang Yu and Fengwei Teng and Xionghui Chen and Jiaqi Chen and Mingchen Zhuge and Xin Cheng and Sirui Hong and Jinlin Wang and Bingnan Zheng and Bang Liu and Yuyu Luo and Chenglin Wu , title =. The Thirteenth International Conference on Learning Representations,. 2025 , url =

2025

[67] [111]

2025 , howpublished =

2025

[68] [112]

Xu and Xiangru Tang and Mingchen Zhuge and Jiayi Pan and Yueqi Song and Bowen Li and Jaskirat Singh and Hoang H

Xingyao Wang and Boxuan Li and Yufan Song and Frank F. Xu and Xiangru Tang and Mingchen Zhuge and Jiayi Pan and Yueqi Song and Bowen Li and Jaskirat Singh and Hoang H. Tran and Fuqiang Li and Ren Ma and Mingzhang Zheng and Bill Qian and Yanjun Shao and Niklas Muennighoff and Yizhe Zhang and Binyuan Hui and Junyang Lin and et al. , title =. The Thirteenth ...

2025

[69] [113]

Claude Code

Anthropic . Claude Code . https://github.com/anthropics/claude-code, 2025. Agentic coding tool for terminal, IDE, and GitHub workflows. Accessed: 2026-06-02

2025

[70] [114]

Scimaster: Towards general-purpose scientific AI agents, part i

Jingyi Chai, Shuo Tang, Rui Ye, Yuwen Du, Xinyu Zhu, Mengcheng Zhou, Yanfeng Wang, Weinan E, Yuzhi Zhang, Linfeng Zhang, and Siheng Chen. Scimaster: Towards general-purpose scientific AI agents, part i. x-master as foundation: Can we lead on humanity's last exam? CoRR, abs/2507.05241, 2025. doi:10.48550/ARXIV.2507.05241. https://doi.org/10.48550/arXiv.2507.05241

work page doi:10.48550/arxiv.2507.05241 2025

[71] [115]

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander Madry. Mle-bench: Evaluating machine learning agents on machine learning engineering. CoRR, abs/2410.07095, 2024. doi:10.48550/ARXIV.2410.07095. https://doi.org/10.48550/arXiv.2410.07095

doi:10.48550/arxiv.2410.07095 2024

[72] [116]

Toward Autonomous Long-Horizon Engineering for ML Research

Guoxin Chen, Jie Chen, Lei Chen, Jiale Zhao, Fanzhe Meng, Wayne Xin Zhao, Ruihua Song, Cheng Chen, Ji-Rong Wen, and Kai Jia. Toward autonomous long-horizon engineering for ML research. CoRR, abs/2604.13018, 2026a. doi:10.48550/ARXIV.2604.13018. https://doi.org/10.48550/arXiv.2604.13018

doi:10.48550/arxiv.2604.13018 2026

[73] [117]

MARS: Modular Agent with Reflective Search for Automated AI Research

Jiefeng Chen, Bhavana Dalvi Mishra, Jaehyun Nam, Rui Meng, Tomas Pfister, and Jinsung Yoon. MARS: modular agent with reflective search for automated AI research. CoRR, abs/2602.02660, 2026b. doi:10.48550/ARXIV.2602.02660. https://doi.org/10.48550/arXiv.2602.02660

doi:10.48550/arxiv.2602.02660 2026

[74] [118]

Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery

Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, Vishal Dey, Mingyi Xue, Frazier N. Baker, Benjamin Burns, Daniel Adu-Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu Su, and Huan Sun. Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery. C...

doi:10.48550/arxiv.2410.05080 2024

[75] [119]

Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

Yizhe Chi, Deyao Hong, Dapeng Jiang, Tianwei Luo, Kaisen Yang, Boshi Zhang, Zhe Cao, Xiaoyan Fan, Bingxiang He, Han Hao, Weiyang Jin, Dianqiao Lei, Qingle Liu, Houde Qian, Bowen Wang, Situ Wang, Youjie Zheng, Yifan Zhou, Calvin Xiao, Eren Cai, and Qinhuai Na. Frontier-eng: Benchmarking self-evolving agents on real-world engineering tasks with generative o...

doi:10.48550/arxiv.2604.12290 2026

[76] [120]

Datamaster: Data-centric autonomous AI research

Yaxin Du, Xiyuan Yang, Zhifan Zhou, Wanxu Liu, Zixing Lei, Zimeng Chen, Fenyi Liu, Haotian Wu, Yuzhu Cai, Zexi Liu, Xinyu Zhu, WenHao Wang, Linfeng Zhang, Chen Qian, and Siheng Chen. Datamaster: Data-centric autonomous AI research, 2026. https://arxiv.org/abs/2605.10906

2026

[77] [121]

Automated researchers can subtly sandbag

Johannes Gasteiger, Akbir Khan, Sam Bowman, Vladimir Mikulik, Ethan Perez, and Fabien Roger. Automated researchers can subtly sandbag, March 2025. https://alignment.anthropic.com/2025/automated-researchers-sandbag/

2025

[78] [122]

Memory in the Age of AI Agents

Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu Zhang, Zhongxiang Sun, Yutao Zhu, Hao Sun, Boci Peng, Zhenrong Cheng, Xuanbo Fan, Jiaxin Guo, Xinlei Yu, Zhenhong Zhou, Zewen Hu, Jiahao Huo, Junhao Wang, Yuwei Niu, Yu Wang, Zhe...

doi:10.48550/arxiv.2512.13564 2025

[79] [123]

Agentfugue: Agent scaling for long-horizon tasks through collective reasoning

Yuyang Hu, Hongjin Qian, Shuting Wang, Jiongnan Liu, Tong Zhao, Xiaoxi Li, Zheng Liu, and Zhicheng Dou. Agentfugue: Agent scaling for long-horizon tasks through collective reasoning, 2026a. https://arxiv.org/abs/2605.24486

2026

[80] [124]

Sam: State-adaptive memory for long-horizon reasoning agent

Yuyang Hu, Hongjin Qian, Shuting Wang, Jiongnan Liu, Ziliang Zhao, Jiejun Tan, Zheng Liu, and Zhicheng Dou. Sam: State-adaptive memory for long-horizon reasoning agent, 2026b. https://arxiv.org/abs/2605.24468

2026