MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
Pith reviewed 2026-05-12 01:09 UTC · model grok-4.3
The pith
AI agents cannot reliably invent ML methods that beat human designs on generalization and scaling tests.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MLS-Bench contains 140 tasks across 12 domains. Each task requires an agent to improve one targeted component of an ML system or algorithm and to demonstrate that the improvement generalizes across controlled settings and scales. Current agents remain far from reliably surpassing human-designed methods. Engineering-style tuning is easier for them than genuine method invention. The bottleneck is not only in proposing new methods, but also in the scientific insight needed to plan, validate, and scale claims about them. More search, compute, or context alone does not remove this bottleneck.
What carries the argument
MLS-Bench, the benchmark of 140 tasks across 12 domains that each require an agent to propose and validate an improvement to an ML component with explicit tests for generalization and scalability.
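The per-task success criterion described above (improve a component, then show the improvement holds on held-out settings and at larger scales) can be sketched as a minimal check. This is illustrative only; `TaskResult` and `passes_task` are hypothetical names, not the benchmark's actual API.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One evaluation setting for a submitted method (hypothetical schema)."""
    setting: str           # e.g. a held-out dataset or a larger model size
    agent_score: float     # metric for the agent's proposed method
    baseline_score: float  # metric for the human-designed baseline


def passes_task(dev: TaskResult,
                heldout: list[TaskResult],
                scaled: list[TaskResult]) -> bool:
    """A task counts as solved only if the improvement survives every
    held-out setting AND every larger-scale regime, not just the single
    development configuration."""
    def beats(r: TaskResult) -> bool:
        return r.agent_score > r.baseline_score
    return beats(dev) and all(map(beats, heldout)) and all(map(beats, scaled))
```

Under this sketch, a method that only wins on the development configuration fails as soon as any held-out or scaled setting favors the baseline.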
If this is right
- Agents perform better on engineering adjustments than on creating original methods.
- Providing more test-time compute, adaptive allocation, or extra context does not overcome the insight limitation.
- Planning, validating, and scaling claims remain harder than idea generation for current systems.
Where Pith is reading between the lines
- Future agent designs may need built-in mechanisms for scientific validation to advance beyond tuning.
- The benchmark setup could be extended to new domains to check whether the insight gap persists across fields.
- Human oversight might stay necessary for the insight step while agents handle execution and testing.
Load-bearing premise
The 140 tasks and 12 domains capture the essential skills for inventing generalizable and scalable ML methods without missing major aspects of actual research.
What would settle it
An agent that proposes and validates improvements outperforming human baselines on a majority of the tasks while passing controlled generalization and scaling checks.
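The settlement bar above reduces to a one-line check once each task has been scored pass/fail. A minimal sketch, assuming outcomes are already booleans; `settles_the_question` is an illustrative name:

```python
def settles_the_question(outcomes: dict[str, bool]) -> bool:
    """outcomes maps task id -> whether the agent's validated method beat
    the human baseline while passing the controlled generalization and
    scaling checks. The bar is a strict majority of all tasks."""
    return sum(outcomes.values()) > len(outcomes) / 2
```

For the full benchmark this would mean validated wins on more than 70 of the 140 tasks.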
Original abstract
Modern AI progress has been driven by ML methods that are generalizable across settings and scalable to larger regimes. As large language models demonstrate advanced capabilities in reasoning, coding, and engineering tasks, it is increasingly important to understand whether they can discover such methods rather than only apply existing ones. We introduce MLS-Bench, a benchmark for evaluating whether AI systems can invent generalizable and scalable ML methods. MLS-Bench contains 140 tasks across 12 domains, each requiring an agent to improve one targeted component of an ML system or algorithm and demonstrate that the improvement generalizes across controlled settings and scales. We find that current agents remain far from reliably surpassing human-designed methods, and that engineering-style tuning is easier for them than genuine method invention. We further study the effects of test-time scaling, adaptive compute allocation, and context provision on agents' discovery performance, together with case studies of their behavior. Our analyses suggest that the bottleneck is not only in proposing new methods, but also in the scientific insight needed to plan, validate, and scale claims about them. More search, compute, or context alone does not remove this bottleneck. We build and maintain a community platform for cumulative and comparable iteration, and release the data and code at https://mls-bench.com.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MLS-Bench, a benchmark with 140 tasks across 12 domains for evaluating whether AI agents can invent generalizable and scalable ML methods. Each task requires an agent to improve a targeted component of an ML system or algorithm while demonstrating that the improvement generalizes across controlled settings and scales. The authors report that current agents remain far from reliably surpassing human-designed methods, that engineering-style tuning is easier than genuine method invention, and that the bottleneck lies in scientific insight for planning, validation, and scaling. Analyses examine test-time scaling, adaptive compute allocation, and context provision, concluding that more search, compute, or context alone does not remove the bottleneck. Data, code, and a community platform are released.
Significance. If the tasks are constructed such that they require planning, validation, and scaling insight beyond hyperparameter search or prompt engineering, MLS-Bench would offer a valuable, reproducible resource for measuring progress on AI-driven method discovery. The explicit release of data and code, along with the community platform, is a clear strength that enables cumulative iteration and supports the empirical claims about scaling effects.
Major comments (1)
- Abstract and task construction: the central finding that 'more search, compute, or context alone does not remove this bottleneck', and that tuning is easier than invention, rests entirely on the 140 tasks genuinely requiring scientific insight rather than admitting solutions via hyperparameter tuning, local modifications, or prompt engineering without new theoretical justification or controlled scaling experiments. The manuscript should provide concrete examples or analyses demonstrating how the task definitions penalize pure tuning approaches.
Simulated Author's Rebuttal
We thank the referee for their constructive review and for recognizing the potential value of MLS-Bench as a reproducible resource. We address the single major comment below and will revise the manuscript to strengthen the exposition of task construction.
Point-by-point responses
Referee: Abstract and task construction: the central finding that 'more search, compute, or context alone does not remove this bottleneck', and that tuning is easier than invention, rests entirely on the 140 tasks genuinely requiring scientific insight rather than admitting solutions via hyperparameter tuning, local modifications, or prompt engineering without new theoretical justification or controlled scaling experiments. The manuscript should provide concrete examples or analyses demonstrating how the task definitions penalize pure tuning approaches.
Authors: We agree that the central claims rest on the tasks demanding more than hyperparameter tuning or prompt engineering. Each task in MLS-Bench requires an agent to improve a targeted component while also designing and reporting controlled experiments that establish generalization across held-out settings and scaling behavior to larger regimes; these requirements are stated in the task templates and evaluation rubrics. Pure tuning approaches typically succeed on a single training configuration but fail the generalization and scaling criteria, as shown in our agent failure analyses. That said, we acknowledge the manuscript would be clearer with explicit illustrations. In the revision we will add a dedicated subsection (likely in Section 3 or 4) containing 3–4 concrete task examples, the precise success criteria, and side-by-side results showing where tuning-only baselines plateau while insight-driven solutions continue to improve. We will also include a short quantitative comparison of agent success rates on tuning versus invention-oriented subtasks.
Revision: yes
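The failure mode the rebuttal describes, tuning gains that hold at the development scale but evaporate under the scaling criteria, can be sketched as a simple check. This is a hypothetical illustration; `survives_scaling` and the advantage-delta representation are assumptions, not the benchmark's actual rubric:

```python
def survives_scaling(agent_deltas: list[float]) -> bool:
    """agent_deltas: the agent method's improvement over the human baseline
    at successively larger scales (development scale first).
    A tuning-only method often shows a positive delta at the dev scale that
    shrinks to zero or below at larger scales; the rubric sketched here
    requires the advantage to persist at every scale."""
    return all(delta > 0 for delta in agent_deltas)
```

Under this check, a tuning-only trajectory such as `[0.8, 0.1, -0.2]` fails, while a persistent advantage such as `[0.5, 0.6, 0.7]` passes.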
Circularity Check
No significant circularity in the MLS-Bench benchmark
Full rationale
The paper introduces MLS-Bench as an empirical benchmark of 140 tasks across 12 domains for assessing AI agents on inventing generalizable ML methods. It reports experimental findings that current agents fall short of human-designed methods and that tuning is easier than invention. There are no mathematical derivations, fitted parameters, or predictions that could reduce to their own inputs by construction. The work releases data and code externally and makes no load-bearing self-citations for any theoretical claim. All central assertions rest on direct task evaluations rather than self-referential definitions or renamed known results. This is a standard benchmark release with no internal derivation chain to inspect for circularity.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: the selected 140 tasks across 12 domains are representative of generalizable ML method invention.