AIBuildAI-2: A Knowledge-Enhanced Agent for Automatically Building AI Models

Li Zhang; Peijia Qin; Pengtao Xie; Qi Cao; Ruiyi Zhang

arxiv: 2605.27873 · v1 · pith:YBIREH2Rnew · submitted 2026-05-27 · 💻 cs.AI

Ruiyi Zhang, Peijia Qin, Qi Cao, Li Zhang, Pengtao Xie

Pith reviewed 2026-06-29 12:30 UTC · model grok-4.3

keywords AI agents · automated machine learning · knowledge enhanced agents · model building · external knowledge systems · hierarchical knowledge · MLE-Bench

The pith

AIBuildAI-2 equips an AI agent with an external hierarchical knowledge system to automatically build high-performing AI models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AIBuildAI-2 as a way to overcome the limits of large language models' static knowledge when building AI models. It adds an external system that stores AI development knowledge in categories, with instructions at the top level and detailed documents below. The agent pulls only the needed parts for the current task and updates the system with lessons from each completed project. This setup lets the agent make decisions based on verifiable external information rather than just what the model was trained on. The result is better performance on automated machine learning tasks, as shown by top rankings against both other agents and human teams.

Core claim

AIBuildAI-2 achieves state-of-the-art results by using a hierarchical knowledge system that organizes AI development knowledge into high-level instructions and low-level documents, allowing the agent to dynamically load relevant context and evolve the system from its own experience, leading to a 70.7% medal rate on MLE-Bench and top placement in human competitions.

What carries the argument

The hierarchical knowledge system, which stores high-level knowledge instructions over topical categories and low-level knowledge documents, enabling dynamic retrieval of only the context relevant to the current state and task.
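The two-level lookup described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Category` class, the `load_context` function, the word-overlap score, and the sample knowledge entries are all hypothetical, and the source does not say how relevance is actually scored.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """One topical category: a high-level instruction plus low-level documents."""
    name: str
    instruction: str
    documents: dict = field(default_factory=dict)  # title -> document text


def overlap(query, text):
    """Toy relevance score (an assumption): number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def load_context(categories, query, top_categories=1, top_docs=1):
    """Pick the best-matching categories first, then only their best documents."""
    ranked = sorted(categories, key=lambda c: overlap(query, c.instruction), reverse=True)
    context = []
    for category in ranked[:top_categories]:
        context.append(category.instruction)
        docs = sorted(category.documents.values(),
                      key=lambda text: overlap(query, text), reverse=True)
        context.extend(docs[:top_docs])
    return context


# Invented sample entries, for illustration only.
store = [
    Category("tabular",
             "tabular data: prefer gradient boosting with cross validation",
             {"gbdt": "gradient boosting tuning notes for tabular data",
              "leakage": "avoid target leakage when encoding categorical features"}),
    Category("vision",
             "image classification: fine tune a pretrained backbone",
             {"aug": "image augmentation notes"}),
]
context = load_context(store, "tabular heart disease prediction with gradient boosting")
```

Only the matching category's instruction and its single best document end up in `context`; everything else stays out of the prompt, which is the point of loading by hierarchy rather than dumping the whole store.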

If this is right

The agent produces model designs grounded in concrete expertise rather than internal parameters alone.
It ranks first on MLE-Bench with a 70.7% medal rate.
It places in the top 6.6% among 4,370 human-expert teams in a heart disease prediction competition.
The knowledge system evolves by distilling completed runs into structured takeaways added back to the system.
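To make the competition figure concrete, the short calculation below lists which leaderboard positions would be reported as "top 6.6%" of 4,370 teams. It assumes the percentage is rank divided by team count, rounded to one decimal place; the source does not state the exact rank or the rounding rule.

```python
TEAMS = 4370          # from the text
REPORTED_TENTHS = 66  # 6.6%, expressed in tenths of a percent


def percent_tenths(rank, teams):
    """rank/teams as a percentage in tenths, rounded half up, integer-only."""
    return (2 * 1000 * rank + teams) // (2 * teams)


# Ranks whose rounded percentile matches the reported 6.6%.
consistent_ranks = [r for r in range(1, TEAMS + 1)
                    if percent_tenths(r, TEAMS) == REPORTED_TENTHS]
print(consistent_ranks)  # → [287, 288, 289, 290]
```

Under that assumption the agent finished somewhere around place 287 to 290.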

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could allow domain scientists to build custom AI models without deep engineering skills.
Similar external knowledge mechanisms might improve agents in other areas like code generation or scientific simulation.
Over time the evolving knowledge could create a compounding advantage as more tasks are completed.
Retrieval of specific documents might reduce hallucinations in design choices compared to pure LLM prompting.

Load-bearing premise

The external knowledge system supplies concrete, externally verifiable expertise that the agent can dynamically retrieve and apply to produce measurably better model designs and implementations than an LLM relying only on its internal parameters.

What would settle it

Running AIBuildAI-2 without access to the external knowledge system on the same MLE-Bench tasks and finding no improvement in medal rate or competition ranking.
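One way such an ablation could be scored is a paired comparison over the same tasks, for example an exact McNemar test on per-task medal outcomes. The sketch below is an assumption about how a reader might analyze the runs, not something the paper reports; the function name and the toy outcome vectors are hypothetical placeholders, not results.

```python
from math import comb


def mcnemar_exact(with_knowledge, without_knowledge):
    """Exact two-sided McNemar test on paired binary outcomes (1 = medal).

    Returns (b, c, p_value), where b counts tasks that medal only with the
    knowledge system and c counts tasks that medal only without it.
    """
    pairs = list(zip(with_knowledge, without_knowledge))
    b = sum(1 for w, o in pairs if w and not o)
    c = sum(1 for w, o in pairs if o and not w)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs: no evidence either way
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy placeholder outcomes for six tasks (NOT data from the paper).
toy_with = [1, 1, 1, 1, 0, 1]
toy_without = [0, 0, 1, 0, 0, 1]
b, c, p_value = mcnemar_exact(toy_with, toy_without)
```

A large `b` with a small `c` and a small p-value would support the load-bearing premise; `b` close to `c` would be the "no improvement" outcome described above.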

Original abstract

AI models underpin data-centric applications from image and text processing to scientific discovery in biology, physics, and chemistry. Yet developing them remains heavily manual, requiring practitioners to design architectures, build training pipelines, and iteratively refine solutions, making it challenging for natural scientists without specialized AI engineering expertise to build the high-performing models their research demands. To reduce this burden and broaden access to AI for scientific discovery, agents that automatically build AI models have been proposed. However, the performance of these agents is largely limited by the parametric knowledge of their underlying large language models, which is static, often outdated, and sparse on practical AI model engineering know-how. To address this limitation, we introduce AIBuildAI-2, a knowledge-enhanced agent with an external, evolving knowledge system for automatically building AI models. The knowledge system of AIBuildAI-2 is hierarchical, organizing curated AI development knowledge into high-level knowledge instructions over topical categories and low-level knowledge documents under each category, from which the agent dynamically loads only the context relevant to its current state and the AI task being solved, grounding each design and implementation decision in concrete, externally verifiable expertise. The system is initialized by collecting and cleaning AI-development-related documents from the web and organizing them into the corresponding categories, and continually evolves from the agent's own experience by distilling each completed run on an AI task into structured takeaways that are written back into the knowledge system. AIBuildAI-2 achieves state-of-the-art results, ranking first on MLE-Bench with a 70.7% medal rate and placing in the top 6.6% among 4,370 human-expert teams in a heart disease prediction competition.
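The write-back loop the abstract describes might look roughly like this. It is a sketch under stated assumptions: `Takeaway`, `KnowledgeStore`, and `write_back` are hypothetical names, the distillation step itself (presumably done by a language model) is represented by a ready-made `Takeaway`, and the duplicate check is an addition for illustration that the abstract does not mention.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Takeaway:
    """A structured lesson distilled from one completed run (hypothetical schema)."""
    category: str
    task: str
    lesson: str


class KnowledgeStore:
    def __init__(self):
        self.documents = {}  # category -> list of stored lessons
        self._seen = set()   # fingerprints of lessons already stored

    def write_back(self, takeaway):
        """Store the takeaway under its category; skip near-verbatim repeats."""
        normalized = " ".join(takeaway.lesson.lower().split())
        key = hashlib.sha256(f"{takeaway.category}|{normalized}".encode()).hexdigest()
        if key in self._seen:
            return False
        self._seen.add(key)
        self.documents.setdefault(takeaway.category, []).append(
            f"[{takeaway.task}] {takeaway.lesson}")
        return True


# Invented example lessons, for illustration only.
store = KnowledgeStore()
first = store.write_back(Takeaway("tabular", "task-a", "Stratified folds stabilised validation."))
repeat = store.write_back(Takeaway("tabular", "task-b", "stratified  folds stabilised validation."))
```

Here the second call returns `False` because the lesson differs only in case and spacing, so the category does not accumulate duplicates as more runs complete.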

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper's new piece is a hierarchical self-evolving external knowledge system for an AI model-building agent, but the SOTA claims on MLE-Bench and the competition rest on unshown ablations and method details.

The main thing to know is that AIBuildAI-2 tries to fix the static knowledge problem in LLM-based agents by adding an external hierarchical knowledge base: high-level instructions over categories and low-level documents underneath, pulled from the web at start and then updated by distilling the agent's own completed runs back into it. The agent is supposed to load only the relevant slice for the current task. That setup is presented as the key difference from earlier agents.

The work does a straightforward job of naming the practical gap (LLMs often lack up-to-date, concrete AI engineering details) and shows the agent reaching first place on MLE-Bench with a 70.7% medal rate, plus a top-6.6% finish against thousands of human teams in one competition. Those numbers are the kind of real-world signal that gets attention.

The soft spot is exactly what the stress-test note flags: the abstract states the knowledge system exists and drives the results, but supplies none of the needed mechanics or evidence. No description of the retrieval process, no ablation removing the external knowledge to measure its contribution, no error bars or statistical checks on the benchmark scores. Without those links, the performance numbers sit unconnected to the claimed mechanism. The competition result is harder to interpret for the same reason.

This is aimed at people building or evaluating agents for automated model engineering and AutoML tooling. A reader already working in that area could pick up the knowledge-organization idea even if the evaluation stays thin.

It is worth sending to peer review so referees can check whether the full methods and experiments actually tie the knowledge system to the gains.

Referee Report

3 major / 0 minor

Summary. The paper introduces AIBuildAI-2, a knowledge-enhanced agent equipped with a hierarchical external knowledge system (high-level instructions over topical categories and low-level documents) that is initialized from web-sourced AI-development documents and evolves by distilling agent experience into structured takeaways. The agent dynamically retrieves relevant context to ground design and implementation decisions. The central empirical claim is that this yields state-of-the-art performance: first place on MLE-Bench (70.7% medal rate) and top 6.6% among 4,370 human-expert teams in a heart-disease prediction competition.

Significance. If the external knowledge system can be shown to supply concrete, verifiable expertise that causally improves model-building outcomes beyond base LLM prompting, the work would address a genuine bottleneck in applying AI to scientific domains and could meaningfully broaden access for non-AI experts. The reported competition rankings, if substantiated with proper controls, would constitute a strong empirical result.

major comments (3)

[Abstract] The performance claims (70.7% medal rate on MLE-Bench; top 6.6% ranking) are stated without any description of experimental protocol, baselines, statistical tests, error bars, or ablation studies isolating the contribution of the external knowledge system versus the base LLM or other agent components. This absence directly undermines evaluation of the central hypothesis that the knowledge system supplies externally verifiable expertise driving the measured gains.
[Abstract] The initialization ('collecting and cleaning AI-development-related documents from the web') and evolution ('distilling each completed run into structured takeaways') mechanisms are described only at a high level; no details are given on curation criteria, deduplication, relevance scoring for dynamic retrieval, or how the hierarchical structure prevents context overload or hallucinated retrieval.
[Abstract] No evidence is supplied that the knowledge documents contain concrete, externally verifiable expertise (e.g., specific hyperparameter heuristics, architecture patterns, or implementation pitfalls) rather than generic or already-internalized LLM knowledge; without such grounding or controlled comparison, the weakest assumption of the work remains untested.
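The error bars asked for in the first comment can be approximated even for a single benchmark pass with a binomial confidence interval. The sketch below uses the Wilson score interval. The counts are assumptions: 53 medals out of 75 tasks is one pair that rounds to the reported 70.7%, but the source gives only the percentage, and a medal rate averaged over several seeds would need a different treatment.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# ASSUMED counts (not from the source): 53/75 rounds to 70.7%.
low, high = wilson_interval(53, 75)
print(f"medal rate 95% interval: {low:.3f} to {high:.3f}")
```

With a task count of this size the interval spans roughly twenty percentage points, which is why the referee's request for multiple runs and statistical checks matters when comparing agents whose medal rates are close.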

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that strengthen the presentation of our evaluation and knowledge system.

Referee: [Abstract] The performance claims (70.7% medal rate on MLE-Bench; top 6.6% ranking) are stated without any description of experimental protocol, baselines, statistical tests, error bars, or ablation studies isolating the contribution of the external knowledge system versus the base LLM or other agent components. This absence directly undermines evaluation of the central hypothesis that the knowledge system supplies externally verifiable expertise driving the measured gains.

Authors: We agree that the abstract's brevity omits key evaluation details. The full manuscript contains a dedicated Experiments section describing the MLE-Bench protocol, baselines (including vanilla LLM agents and prior agent systems), statistical tests, error bars from multiple runs, and ablations isolating the knowledge component. We will revise the abstract to add a concise clause referencing the evaluation protocol and directing readers to the Experiments section for baselines, ablations, and statistical details.

revision: yes

Referee: [Abstract] The initialization ('collecting and cleaning AI-development-related documents from the web') and evolution ('distilling each completed run into structured takeaways') mechanisms are described only at a high level; no details are given on curation criteria, deduplication, relevance scoring for dynamic retrieval, or how the hierarchical structure prevents context overload or hallucinated retrieval.

Authors: The abstract intentionally summarizes at a high level. The full paper provides the requested details in Section 3: curation criteria and deduplication during initialization (3.1), relevance scoring and dynamic retrieval (3.4), and hierarchical organization to limit context length and reduce hallucination risk (3.2). We will revise the abstract to include a brief parenthetical reference to these mechanisms and their implementation sections.

revision: yes

Referee: [Abstract] No evidence is supplied that the knowledge documents contain concrete, externally verifiable expertise (e.g., specific hyperparameter heuristics, architecture patterns, or implementation pitfalls) rather than generic or already-internalized LLM knowledge; without such grounding or controlled comparison, the weakest assumption of the work remains untested.

Authors: We accept that the abstract alone does not demonstrate this. The manuscript already includes concrete examples of knowledge documents (e.g., specific hyperparameter schedules and architecture pitfalls for tabular and image tasks) in Section 4 and the appendix, together with ablation results showing gains over base LLM prompting. In revision we will add an explicit subsection with side-by-side excerpts contrasting retrieved knowledge against typical LLM-internal knowledge and will strengthen the controlled comparisons already present in the Experiments section.

revision: yes

Circularity Check

0 steps flagged

No circularity: external knowledge system presented as independent input

The abstract describes an external hierarchical knowledge system initialized from web documents and updated from agent experience, with dynamic retrieval of relevant context to ground decisions. No equations, fitted parameters, self-referential definitions, or self-citation chains are present that would reduce the SOTA performance claims (70.7% medal rate, top 6.6% ranking) to the inputs by construction. The mechanism is framed as supplying externally verifiable expertise distinct from the base LLM's parametric knowledge, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities.

Methods

Problem formulation. We formalize automated AI model development as the task of constructing a runnable AI solution from a task description and a dataset. The inp...

2019
[62]

Hugging Face: The ai community building the future.https:// huggingface.co

Hugging Face. Hugging Face: The ai community building the future.https:// huggingface.co. Accessed: 2026-05-20

2026
[63]

Scikit-learn: Machine learning in Python.Journal of Machine Learn- ing Research, 12:2825–2830, 2011

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Van- derplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine learning in Python.Journal of Machine Learn- ing Rese...

2011
[64]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Y acine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State- of-the-a...

work page doi:10.18653/v1/2020.emnlp-demos.6 2020
[65]

GitHub: Build software better, together.https://github.com

GitHub. GitHub: Build software better, together.https://github.com. Accessed: 2026-05-20

2026
[66]

arXiv.org: Open-access archive for scholarly articles.https://arxiv.org

arXiv. arXiv.org: Open-access archive for scholarly articles.https://arxiv.org. Ac- cessed: 2026-05-20

2026
[67]

Andrei Z. Broder. On the resemblance and containment of documents.Proceedings of the Compression and Complexity of Sequences, pages 21–29, 1997. doi:10.1109/SEQUEN. 1997.666900

work page doi:10.1109/sequen 1997
[68]

Claude opus 4.7 system card.https://anthropic.com/ claude-opus-4-7-system-card, 2026

Anthropic. Claude opus 4.7 system card.https://anthropic.com/ claude-opus-4-7-system-card, 2026. Accessed: 2026-04-27. 12 Zhanget al.| AIBuildAI-2

2026

[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Lourdes Agapito, Tamara Berg, Jana Kosecka, and Lihi Zelnik-Manor, editors, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[2] Daniel W. Otter, Julian R. Medina, and Jugal K. Kalita. A survey of the usages of deep learning for natural language processing. IEEE Transactions on Neural Networks and Learning Systems, 32(2):604–624, 2020.

[3] Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Abubakr Babiker, Nathanael Schärli, Aakanksha Chowdhery, Philip Mansfield, Dina Demner-Fushman, Blaise Agüera y Arcas, Dale Webster, Greg S. Corrad...

[4] doi:10.1038/s41586-023-06291-2

[5] Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R. Pfohl, Heather Cole-Lewis, Darlene Neal, Qazi Mamunur Rashid, Mike Schaekermann, Amy Wang, Dev Dash, Jonathan H. Chen, Nigam H. Shah, Sami Lachgar, Philip Andrew Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska, Blaise Agüera y Arca... doi:10.1038/s41591-024-03423-7, 2025.

[6] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A. A. Kohl, Andrew J. Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman... doi:10.1038/s41586-021-03819-2, 2021.

[7] Hanchen Wang, Tianfan Fu, Yuanqi Du, Wenhao Gao, Kexin Huang, Ziming Liu, Payal Chandak, Shengchao Liu, Peter Van Katwyk, Andreea Deac, Anima Anandkumar, Karianne Bergen, Carla P. Gomes, Shirley Ho, Pushmeet Kohli, Joan Lasenby, Jure Leskovec, Tie-Yan Liu, Arjun Manrai, Debora Marks, Bharath Ramsundar, Le Song, Jimeng Sun, Jian Tang, Petar Veličković... doi:10.1038/s41586-023-06221-2, 2023.

[8] Jonathan M. Stokes, Kevin Yang, Kyle Swanson, Wengong Jin, Andres Cubillos-Ruiz, Nina M. Donghia, Craig R. MacNair, Shawn French, Lindsey A. Carfrae, Zohar Bloom-Ackermann, Victoria M. Tran, Anush Chiappino-Pepe, Ahmed H. Badran, Ian W. Andrews, Emma J. Chory, George M. Church, Eric D. Brown, Tommi S. Jaakkola, Regina Barzilay, and James J. Collins. A d... doi:10.1016/j.cell.2020.01.021, 2020.

[9] Felix Wong, Erica J. Zheng, Jacqueline A. Valeri, Nina M. Donghia, Melis N. Anahtar, Satotaka Omori, Alicia Li, Andres Cubillos-Ruiz, Aarti Krishnan, Wengong Jin, Abigail L. Manson, Jens Friedrichs, Ralf Helbig, Behnoush Hajian, Dawid K. Fiejtek, Florence F. Wagner, Holly H. Soutter, Ashlee M. Earl, Jonathan M. Stokes, Lars D. Renner, and James J. Co... doi:10.1038/s41586-023-06887-8, 2023.

[10] Alan M. Turing. Computing machinery and intelligence, pages 23–65. Springer, 2007.

[11] Michael I. Jordan and Tom M. Mitchell. Machine learning: Trends, perspectives, and prospects. Science, 349(6245):255–260, 2015.

[12] Zoubin Ghahramani. Probabilistic machine learning and artificial intelligence. Nature, 521(7553):452–459, 2015.

[13] Jacob Biamonte, Peter Wittek, Nicola Pancotti, Patrick Rebentrost, Nathan Wiebe, and Seth Lloyd. Quantum machine learning. Nature, 549(7671):195–202, 2017.

[14] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(2), 2012.

[15] D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. Hidden technical debt in machine learning systems. In Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, editors, Proceedings of the International Conference on Neu... 2015.

[16] Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. Automated machine learning: methods, systems, challenges. Springer, 2019.

[17] Abdulaziz Aldoseri, Khalifa N. Al-Khalifa, and Abdel Magid Hamouda. Re-thinking data strategy and integration for artificial intelligence: concepts, opportunities, and challenges. Applied Sciences, 13(12):7082, 2023.

[18] Yuzhe Yang, Haoran Zhang, Judy W. Gichoya, Dina Katabi, and Marzyeh Ghassemi. The limits of fair medical imaging AI in real-world generalization. Nature Medicine, 30(10):2838–2848, 2024.

[19] Edan Toledo, Karen Hambardzumyan, Martin Josifoski, Rishi Hazra, Nicolas Baldwin, Alexis Audran-Reiss, Michael Kuchnik, Despoina Magka, Minqi Jiang, Alisia Maria Lupidi, Andrei Lupu, Roberta Raileanu, Tatiana Shavrina, Kelvin Niu, Jean-Christophe Gagnon-Audet, Michael Shvartsman, Shagun Sodhani, Alexander H. Miller, Abhishek Charnalia, Derek Dunfield, Car... AI research agents for machine learning: Search, exploration, and generalization in MLE-bench. 2025.

[20] Shangheng Du, Xiangchao Yan, Dengyang Jiang, Jiakang Yuan, Yusong Hu, Xin Li, Liang He, Bo Zhang, and Lei Bai. AutoMLGen: Navigating fine-grained optimization for coding agents. arXiv preprint arXiv:2510.08511, 2025.

[21] Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu. AIDE: AI-driven exploration in the space of code. arXiv preprint arXiv:2502.13138, 2025.

[22] Xu Yang, Xiao Yang, Shikai Fang, Yifei Zhang, Jian Wang, Bowen Xian, Qizheng Li, Jingyuan Li, Minrui Xu, Yuante Li, Haoran Pan, Yuge Zhang, Weiqing Liu, Yelong Shen, Weizhu Chen, and Jiang Bian. R&D-Agent: An LLM-Agent framework towards autonomous data science. arXiv preprint arXiv:2505.14738, 2025.

[23] Ruiyi Zhang, Peijia Qin, Qi Cao, Li Zhang, and Pengtao Xie. AIBuildAI: An AI agent for automatically building AI models. arXiv preprint arXiv:2604.14455, 2026.

[24] Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Aleksander Madry, and Lilian Weng. MLE-bench: Evaluating machine learning agents on machine learning engineering,

[25] International Conference on Learning Representations (ICLR)

[26] Kaggle. Kaggle: Your machine learning and data science community. https://www.kaggle.com. Accessed: 2026-05-20.

[27] OpenAI. MLE-bench leaderboard (commit c5631ba). https://github.com/openai/mle-bench/tree/c5631ba61ceeb0573235a6ce209db435327a1e84, 2026. Accessed: 2026-03-18.

[28] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 2020.

[29] Aditi Singh, Abul Ehtesham, Saket Kumar, Tala Talaei Khoei, and Athanasios V. Vasilakos. Agentic retrieval-augmented generation: A survey on agentic RAG. arXiv preprint arXiv:2501.09136, 2025.

[30] Jiefeng Chen, Bhavana Dalvi Mishra, Jaehyun Nam, Rui Meng, Tomas Pfister, and Jinsung Yoon. MARS: Modular agent with reflective search for automated AI research. arXiv preprint arXiv:2602.02660, 2026.

[31] Annan Li, Chufan Wu, Zengle Ge, Yee Hin Chong, Zhinan Hou, Lizhe Cao, Cheng Ju, Jianmin Wu, Huaiming Li, Haobo Zhang, Shenghao Feng, Mo Zhao, Fengzhi Qiu, Rui Yang, Mengmeng Zhang, Wenyi Zhu, Yingying Sun, Quan Sun, Shunhao Yan, Danyu Liu, Dawei Yin, and Dou Shen. The FM agent. arXiv preprint arXiv:2510.26144, 2025.

[32] Zexi Liu, Yuzhu Cai, Xinyu Zhu, Yujie Zheng, Runkun Chen, Ying Wen, Yanfeng Wang, Weinan E, and Siheng Chen. ML-Master: Towards AI-for-AI via integration of exploration and reasoning. arXiv preprint arXiv:2506.16499, 2025.

[33] Alireza Nadafian, Alireza Mohammadshahi, and Majid Yazdani. KAPSO: A knowledge-grounded framework for autonomous program synthesis and optimization. arXiv preprint arXiv:2601.21526, 2026.

[34] InternAgent Team, Bo Zhang, Shiyang Feng, Xiangchao Yan, Jiakang Yuan, Runmin Ma, Yusong Hu, Zhiyin Yu, Xiaohan He, Songtao Huang, et al. InternAgent: When agent becomes the scientist – building closed-loop system from hypothesis to verification. arXiv preprint arXiv:2505.16938, 2025.

[35] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. CatBoost: Unbiased boosting with categorical features. Advances in Neural Information Processing Systems, 31, 2018.

[36] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system, 2016.

[37] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree, 2017.

[38] Kaggle. Predicting heart disease (Playground Series S6E2). https://www.kaggle.com/competitions/playground-series-s6e2/overview, 2026. Accessed: 2026-04-29.

[39] Robert Detrano, Andras Janosi, Walter Steinbrunn, Matthias Pfisterer, Johann-Jakob Schmid, Sarbjit Sandhu, Kern H. Guppy, Stella Lee, and Victor Froelicher. International application of a new probability algorithm for the diagnosis of coronary artery disease. The American Journal of Cardiology, 64(5):304–310, 1989. doi:10.1016/0002-9149(89)90524-9

[40] Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew, and Alex Ksikes. Ensemble selection from libraries of models. Proceedings of the Twenty-first International Conference on Machine Learning, page 18, 2004. doi:10.1145/1015330.1015432

[41] Jackson W. Burns, Akshat Shirish Zalte, Charlles R. A. Abreu, Jochen Sieg, Christian Feldmann, Miriam Mathea, and William H. Green. Deep learning foundation models from classical molecular descriptors. arXiv preprint arXiv:2506.15792, 2025.

[42] David Rogers and Mathew Hahn. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5):742–754, 2010. doi:10.1021/ci100050t

[43] Greg Landrum and The RDKit Contributors. RDKit: Open-source cheminformatics software. https://www.rdkit.org, 2024. Accessed: 2026-04-29.

[44] Hirotomo Moriwaki, Yu-Shi Tian, Norihito Kawashita, and Tatsuya Takagi. Mordred: A molecular descriptor calculator. Journal of Cheminformatics, 10(1):4, 2018. doi:10.1186/s13321-018-0258-y

[45] Dieter Kraft. A software package for sequential quadratic programming. Tech. Rep. DFVLR-FB 88-28, DFVLR, Institut für Dynamik der Flugsysteme, Oberpfaffenhofen, Germany, 1988.

[46] Kexin Huang, Tianfan Fu, Wenhao Gao, Yue Zhao, Yusuf Roohani, Jure Leskovec, Connor W. Coley, Cao Xiao, Jimeng Sun, and Marinka Zitnik. Therapeutics data commons: Machine learning datasets and tasks for drug discovery and development. Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2021.

[47] OpenADMET. OpenADMET ExpansionRx blind challenge. https://huggingface.co/spaces/openadmet/OpenADMET-ExpansionRx-Challenge, 2026. Accessed: 2026-04-29.

[48] David Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31–36, 1988. doi:10.1021/ci00057a005

[49] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1263–1272, 2017.

[50] Kevin Yang, Kyle Swanson, Wengong Jin, Connor Coley, Philipp Eiden, Hua Gao, Angel Guzman-Perez, Timothy Hopper, Brian Kelley, Miriam Mathea, Andrew Palmer, Volker Settels, Tommi Jaakkola, Klavs Jensen, and Regina Barzilay. Analyzing learned molecular representations for property prediction. Journal of Chemical Information and Modeling, 59(8):3370–3388, 2019. doi:10.1021/acs.jcim.9b00237

[51] Zhaoping Xiong, Dingyan Wang, Xiaohong Liu, Feisheng Zhong, Xiaozhe Wan, Xutong Li, Zhaojun Li, Xiaomin Luo, Kaixian Chen, Hualiang Jiang, and Mingyue Zheng. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. Journal of Medicinal Chemistry, 63(16):8749–8760, 2020. doi:10.1021/acs.jmedchem.9b00959

[52] Joseph L. Durant, Burton A. Leland, Douglas R. Henry, and James G. Nourse. Reoptimization of MDL keys for use in drug discovery. Journal of Chemical Information and Computer Sciences, 42(6):1273–1280, 2002. doi:10.1021/ci010132r

[53] Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang, Rui Ye, Yuzhi Zhang, Linfeng Zhang, Weinan E, Siheng Chen, and Yanfeng Wang. Toward ultra-long-horizon agentic science: Cognitive accumulation for machine learning engineering. arXiv preprint arXiv:2601.10402, 2026.

[54] Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. International Conference on Learning Representations, 2025.

[55] J. Rosser and Jakob Nicolaus Foerster. AgentBreeder: Mitigating the AI safety risks of multi-agent scaffolds via self-improvement. Advances in Neural Information Processing Systems, 2025.

Methods. Problem formulation. We formalize automated AI model development as the task of constructing a runnable AI solution from a task description and a dataset. The inp...

[56] Rafael Martí, Mauricio G. C. Resende, and Celso C. Ribeiro. Multi-start methods for combinatorial optimization. European Journal of Operational Research, 226(1):1–8, 2013.

[57] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. International Conference on Learning Representations, 2023.

[58] Agent Skills. Agent skills: An open standard for extending AI agent capabilities. https://agentskills.io/home, 2025. Accessed: 2026-05-20.

[59] Anthropic. Introducing Agent Skills. https://claude.com/blog/skills, 2025. Accessed: 2026-04-27.

[60] OpenAI. OpenAI skills. https://openai.com/academy/skills/, 2025. Accessed: 2026-05-20.

[61] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 2019.

[62] Hugging Face. Hugging Face: The AI community building the future. https://huggingface.co. Accessed: 2026-05-20.

[63] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[64] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-a... doi:10.18653/v1/2020.emnlp-demos.6, 2020.

[65] GitHub. GitHub: Build software better, together. https://github.com. Accessed: 2026-05-20.

[66] arXiv. arXiv.org: Open-access archive for scholarly articles. https://arxiv.org. Accessed: 2026-05-20.

[67] Andrei Z. Broder. On the resemblance and containment of documents. Proceedings of the Compression and Complexity of Sequences, pages 21–29, 1997. doi:10.1109/SEQUEN.1997.666900

[68] Anthropic. Claude Opus 4.7 system card. https://anthropic.com/claude-opus-4-7-system-card, 2026. Accessed: 2026-04-27.