Toward Generalist Autonomous Research via Hypothesis-Tree Refinement
Pith reviewed 2026-06-27 09:32 UTC · model grok-4.3
The pith
Arbor maintains a persistent Hypothesis Tree that lets an AI coordinator accumulate and refine research insights across many iterations instead of restarting each time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Arbor is a general framework for autonomous research that combines a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights across time. The coordinator manages global research strategy over the tree, while executors implement and test individual hypotheses in isolated worktrees. As results return, Arbor updates the tree, propagates reusable lessons, refines the search frontier, and admits verified improvements. This design turns autonomous research from a sequence of local attempts into a cumulative process in which strategy, execution, and evidence are carried across time.
What carries the argument
Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights so the coordinator can propagate lessons and refine the search frontier across iterations.
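To make the idea of a persistent tree concrete, here is a minimal sketch of what such a structure could look like. It is not Arbor's implementation: the class name `HypothesisNode`, its fields, the `record` method, and the `frontier` helper are all assumptions chosen for illustration, and the rule that a lesson is copied to every ancestor is only one simple reading of "propagate lessons".

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class HypothesisNode:
    """One node of a toy hypothesis tree; every field name here is an assumption."""

    hypothesis: str
    parent: Optional["HypothesisNode"] = field(default=None, repr=False)
    children: list["HypothesisNode"] = field(default_factory=list)
    artifacts: list[str] = field(default_factory=list)   # e.g. commit ids or file paths
    evidence: list[float] = field(default_factory=list)  # e.g. held-out scores
    insights: list[str] = field(default_factory=list)    # distilled lessons

    def add_child(self, hypothesis: str) -> "HypothesisNode":
        """Attach a refined hypothesis beneath this node."""
        child = HypothesisNode(hypothesis, parent=self)
        self.children.append(child)
        return child

    def record(self, artifact: str, score: float, insight: str) -> None:
        """Store a result on this node and copy the lesson to every ancestor."""
        self.artifacts.append(artifact)
        self.evidence.append(score)
        node = self
        while node is not None:
            if insight not in node.insights:
                node.insights.append(insight)
            node = node.parent


def frontier(root: HypothesisNode) -> list[HypothesisNode]:
    """Return the untested leaves in depth-first order."""
    if not root.children:
        return [] if root.evidence else [root]
    found: list[HypothesisNode] = []
    for child in root.children:
        found.extend(frontier(child))
    return found
```

In this sketch a coordinator would pick its next hypothesis from `frontier(root)` and read `root.insights` to see every lesson recorded so far, which is the kind of carry-over a flat per-attempt history does not give for free.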
If this is right
- Research agents can improve an initial artifact through iterative experimentation without step-level human supervision.
- Lessons from failed or successful hypotheses become reusable across later attempts rather than being discarded.
- Under the same task interface and resource budget, Arbor attains substantially higher held-out performance than prior agent baselines.
- The framework scales to multiple domains including model training, harness engineering, and data synthesis.
Where Pith is reading between the lines
- If the tree structure continues to grow without becoming intractable, the approach could support research campaigns lasting hundreds of iterations.
- The coordinator-executor split may allow future versions to run many executors in parallel while the coordinator maintains a single coherent research plan.
- Distilled insights stored in the tree could be extracted and reused by other agents or even by human researchers.
Load-bearing premise
The six chosen tasks plus MLE-Bench Lite are representative of general autonomous research and the reported gains arise from the Hypothesis Tree Refinement mechanism rather than from differences in prompting, model access, or task-specific engineering.
What would settle it
Running the same six tasks with an otherwise identical Arbor variant that replaces the Hypothesis Tree with a flat chronological log and measuring whether the 2.5x relative gain disappears.
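The comparison proposed here comes down to one number per variant: the average relative held-out gain. The sketch below shows how that ratio could be computed. The definition of relative gain as `(final - initial) / |initial|` with higher scores being better is an assumption, since the page does not give the paper's exact formula, and the function names are hypothetical.

```python
from statistics import mean


def relative_gain(initial: float, final: float) -> float:
    """Relative improvement over the starting artifact (assumes higher is better)."""
    if initial == 0:
        raise ValueError("relative gain is undefined for a zero initial score")
    return (final - initial) / abs(initial)


def mean_relative_gain(initial: list[float], final: list[float]) -> float:
    """Average the per-task relative gains."""
    if len(initial) != len(final):
        raise ValueError("need one initial and one final score per task")
    return mean(relative_gain(i, f) for i, f in zip(initial, final))


def gain_ratio(initial: list[float], tree_final: list[float], flat_final: list[float]) -> float:
    """Ratio of the tree variant's mean gain to the flat-log variant's mean gain."""
    return mean_relative_gain(initial, tree_final) / mean_relative_gain(initial, flat_final)
```

If the flat-log variant were run on the same six tasks, a `gain_ratio` close to 1 would suggest the tree adds little, while a ratio that stays well above 1 would support the attribution to HTR. The ratio is undefined when the flat-log variant shows zero mean gain, so that case needs separate handling.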
Original abstract
Scientific progress depends on a repeated loop of exploration, experimentation, and abstraction. Researchers test candidate directions, interpret the evidence, and carry the resulting lessons into later attempts. We study how an AI agent can run this loop autonomously over long horizons. We introduce Arbor, a general framework for autonomous research that combines a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights across time. The coordinator manages global research strategy over the tree, while executors implement and test individual hypotheses in isolated worktrees. As results return, Arbor updates the tree, propagates reusable lessons, refines the search frontier, and admits verified improvements. This design turns autonomous research from a sequence of local attempts into a cumulative process in which strategy, execution, and evidence are carried across time. We evaluate Arbor under Autonomous Optimization (AO), an operational setting where an agent improves an initial research artifact through iterative experimentation without step-level human supervision. Across six real research tasks in model training, harness engineering, and data synthesis, Arbor achieves the best held-out result on all six tasks, attaining more than 2.5x the average relative held-out gain of Codex and Claude Code under the same task interface and resource budget. On MLE-Bench Lite, Arbor reaches 86.36% Any Medal with GPT-5.5, the strongest result in our comparison.
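The loop described in the abstract (propose, test, keep what is verified, reuse what was learned) can be illustrated with a deliberately tiny example. This is a toy sketch under stated assumptions, not Arbor's algorithm: the artifact is a single number, each hypothesis is a step `delta`, the function `run_ao_loop` is hypothetical, and the "lesson" is the crude rule that a direction which failed once is pruned afterwards.

```python
from typing import Callable


def run_ao_loop(
    initial: float,
    proposals: list[float],
    score: Callable[[float], float],
) -> tuple[float, list[tuple[float, str]]]:
    """Toy coordinator loop: admit only improvements, prune directions that failed."""
    best, best_score = initial, score(initial)
    failed_signs: set[int] = set()  # distilled lessons carried across iterations
    log: list[tuple[float, str]] = []
    for delta in proposals:
        sign = 1 if delta > 0 else -1
        if sign in failed_signs:
            # A stored lesson rules this attempt out before any compute is spent.
            log.append((delta, "pruned"))
            continue
        candidate = best + delta  # the executor works on a copy of the artifact
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
            log.append((delta, "admitted"))
        else:
            failed_signs.add(sign)
            log.append((delta, "rejected"))
    return best, log
```

For example, `run_ao_loop(0.0, [2.0, -1.0, -0.5, 0.5], lambda x: -(x - 3) ** 2)` admits the first step, rejects the second, prunes the third without evaluating it, and admits the fourth. A real system would need a far more careful notion of when a lesson transfers, since an overly broad pruning rule can also discard good candidates.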
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Arbor, a framework for autonomous research that integrates a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR) to maintain a persistent tree linking hypotheses, artifacts, evidence, and distilled insights across iterations. It evaluates the system in an Autonomous Optimization setting on six real research tasks (model training, harness engineering, data synthesis) plus MLE-Bench Lite, claiming that Arbor achieves the best held-out result on all six tasks, more than 2.5x the average relative held-out gain of Codex and Claude Code under identical task interface and resource budget, and 86.36% Any Medal on MLE-Bench Lite when using GPT-5.5.
Significance. If the reported gains prove robust and attributable to the HTR mechanism rather than model or implementation differences, the work would offer a concrete approach to cumulative long-horizon autonomous research, moving beyond isolated attempts to a process that propagates lessons across time; the persistent tree structure is a clear conceptual contribution.
Major comments (2)
- [Abstract] Abstract: the central performance claims (best result on all six tasks, >2.5x average relative held-out gain, 86.36% Any Medal) are presented without any description of experimental controls, statistical significance testing, or confirmation that the baselines (Codex, Claude Code) used the identical base model GPT-5.5 rather than weaker models; this directly affects whether gains can be attributed to HTR versus model access.
- [Abstract] Abstract and evaluation description: the claim that results arise from the Hypothesis Tree Refinement mechanism (persistent linking of hypotheses/artifacts/evidence) rather than task-specific engineering or prompting differences is not supported by any ablation or controlled comparison that isolates the tree component while holding model, prompt template, and executor fixed.
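As a rough illustration of the significance testing the first comment asks for, the sketch below computes an exact one-sided sign test on per-task wins against a single baseline. It assumes the tasks are independent, that ties are excluded, and that under the null hypothesis each system is equally likely to win a task. The function name is hypothetical, and the test says nothing about the size of the gains or about run-to-run variance.

```python
from math import comb


def sign_test_p_value(wins: int, n_tasks: int) -> float:
    """Probability of at least `wins` wins in `n_tasks` fair coin flips."""
    if not 0 <= wins <= n_tasks:
        raise ValueError("wins must lie between 0 and n_tasks")
    tail = sum(comb(n_tasks, k) for k in range(wins, n_tasks + 1))
    return tail / 2 ** n_tasks
```

Winning all six tasks against one baseline gives `sign_test_p_value(6, 6) == 0.015625`, while five wins out of six would give `0.109375`. With only six tasks the test has little resolution, which is one reason repeated runs per task would strengthen the claim.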
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for greater clarity in the abstract regarding experimental controls and the attribution of gains to the HTR mechanism. We address each major comment below.
Point-by-point responses
- Referee: [Abstract] Abstract: the central performance claims (best result on all six tasks, >2.5x average relative held-out gain, 86.36% Any Medal) are presented without any description of experimental controls, statistical significance testing, or confirmation that the baselines (Codex, Claude Code) used the identical base model GPT-5.5 rather than weaker models; this directly affects whether gains can be attributed to HTR versus model access.
  Authors: We agree that the abstract should explicitly describe the experimental controls to support attribution. The manuscript already states that comparisons used the same task interface and resource budget, with the MLE-Bench Lite result reported using GPT-5.5 for Arbor. We will revise the abstract to add a concise description of these controls and to note that statistical significance testing across multiple runs was not performed due to computational cost. This revision will make the shared setup clearer without altering the reported numbers.
  Revision: yes
- Referee: [Abstract] Abstract and evaluation description: the claim that results arise from the Hypothesis Tree Refinement mechanism (persistent linking of hypotheses/artifacts/evidence) rather than task-specific engineering or prompting differences is not supported by any ablation or controlled comparison that isolates the tree component while holding model, prompt template, and executor fixed.
  Authors: The referee correctly observes that the manuscript contains no ablation that removes only the HTR component while holding the model, prompt templates, and executor implementation fixed. The reported comparisons evaluate the complete Arbor system against other agent frameworks under matched model and interface conditions, but do not isolate the persistent tree. We will revise the evaluation description to more explicitly articulate the intended contribution of HTR and to acknowledge this limitation. A dedicated ablation would require additional controlled experiments that are outside the scope of the current results.
  Revision: partial
Circularity Check
No circularity; empirical framework evaluation with no derivations or self-referential reductions.
Full rationale
The paper introduces the Arbor framework and Hypothesis Tree Refinement for autonomous research, then reports empirical results on six tasks and MLE-Bench Lite. No equations, fitted parameters, or mathematical derivations are present. Performance claims rest on held-out comparisons under a stated task interface and budget, without any step that reduces by construction to inputs, self-citations, or ansatzes. The central attribution to HTR is an empirical claim open to experimental scrutiny rather than a definitional or fitted tautology. This is a standard non-circular empirical paper.