AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

Alex Pentland; Dongfu Jiang; Hang Hua; Hao Zhang; Jiaxin Pei; Jiefeng Chen; Jinsung Yoon; Junda Chen; Lichi Li; Mengdi Wang

arxiv: 2606.05080 · v1 · pith:JE6GW6NGnew · submitted 2026-06-03 · 💻 cs.AI · cs.LG

AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

Zhangchen Xu , Junda Chen , Yue Huang , Dongfu Jiang , Jiefeng Chen , Hang Hua , Zijian Wu , Zheyuan Liu

show 11 more authors

Zexue He Lichi Li Shizhe Diao Jiaxin Pei Jinsung Yoon Hao Zhang Mengdi Wang Radha Poovendran Misha Sra Alex Pentland Zichen Chen

This is my paper

Pith reviewed 2026-06-28 06:34 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords long-horizon agentsAutoLab benchmarkiterative optimizationfrontier modelsempirical feedbackautonomous researchwall-clock budgetclosed-loop agents

0 comments

The pith

Frontier models succeed at long-horizon research tasks mainly through repeated benchmarking and feedback incorporation rather than strong initial attempts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current benchmarks test agents on single turns or short trajectories, but real scientific and engineering work requires sustained iteration: propose changes, run experiments, measure results, and refine over extended time. AutoLab introduces 36 expert-curated tasks across system optimization, puzzles, model development, and CUDA kernels, each starting from a correct but suboptimal baseline that agents must improve under a strict wall-clock budget. Evaluation of 17 frontier models shows the strongest predictor of progress is an agent's willingness to keep benchmarking, editing, and using empirical feedback rather than the quality of its first output. Most models terminate early or stall with little gain, while one exhibits stronger persistence. This matters because building agents that can autonomously drive research requires exactly this capacity for long-horizon closed-loop optimization.

Core claim

Evaluating 17 state-of-the-art models on AutoLab reveals that the dominant predictor of success is not the quality of an agent's initial attempt, but its persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. While claude-opus-4.6 exhibits strong long-horizon optimization capabilities, most frontier models, including several proprietary ones, either terminate prematurely or exhaust their budgets with minimal progress.

What carries the argument

AutoLab benchmark consisting of 36 tasks with deliberately suboptimal starting code and strict wall-clock time budgets that enforce closed-loop iterative improvement across four domains.

If this is right

Agent success on these tasks requires time awareness to avoid early termination within the budget.
Incorporating empirical feedback through repeated benchmarking and editing drives measurable improvement.
Most current frontier models lack the persistence needed to make sustained progress on long-horizon tasks.
Designing agents around continued iteration rather than one-shot generation becomes a central requirement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agents could be augmented with explicit budget-tracking modules that force continued iteration even when early results look promising.
The benchmark could be applied to test whether hybrid human-AI loops or external memory systems increase persistence on the same tasks.
If persistence proves decisive here, similar closed-loop setups may be needed to evaluate agents on other extended research workflows such as theorem proving or experimental design.

Load-bearing premise

The 36 expert-curated tasks and the strict wall-clock budget setup accurately represent the challenges of real long-horizon research and engineering without introducing harness-specific biases or unrealistic constraints.

What would settle it

If agents that produce high-quality initial attempts but perform few iterations achieve higher final scores than agents that iterate many times but start weaker, across the full set of 36 tasks, the claim that persistence is the dominant factor would be falsified.

read the original abstract

Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single-turn responses or short-horizon agent trajectories, failing to capture the challenges of sustained iterative improvement over extended time horizons. To address this gap, we introduce AutoLab, a new benchmark for ultra long-horizon closed-loop optimization. AutoLab consists of 36 realistic, expert-curated tasks spanning four diverse domains: system optimization, puzzle & challenge, model development, and CUDA kernel optimization. Each task begins with a correct but deliberately suboptimal baseline and challenges agents to improve it within a strict wall-clock budget. Evaluating 17 state-of-the-art models reveals the dominant predictor of success is not the quality of an agent's initial attempt, but its persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. While claude-opus-4.6 exhibits strong long-horizon optimization capabilities, most frontier models, including several proprietary ones, either terminate prematurely or exhaust their budgets with minimal progress. These results underscore the importance of time awareness and persistent iteration in autonomous agents. We open-source the full benchmark, evaluation harness, and task artifacts, to accelerate research toward truly capable long-horizon agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AutoLab adds a long-horizon benchmark where persistence in iteration matters more than first-try quality, but the abstract leaves task construction and controls opaque.

read the letter

The core point is that this benchmark finds most frontier models fail at sustained improvement on long tasks because they stop iterating or run out of time, while persistence predicts success better than the quality of the opening attempt.

The work introduces 36 expert-curated tasks across system tuning, puzzles, model training, and CUDA kernels. Each starts from a working but weak baseline and runs under a fixed wall-clock limit. Testing 17 models and releasing the harness and artifacts is concrete and helpful for anyone measuring agent iteration. The split result—Claude-opus-4.6 keeps going and improves, while others do not—matches what people see when agents are left to run without hand-holding.

The soft spot is the lack of detail on how tasks were selected or whether the time budget and feedback loop were tuned in ways that reward volume of edits over smart first moves. The abstract gives no variance numbers, no breakdown by domain, and no check on whether strong initial attempts are simply cut short by the harness. That makes the dominance claim hard to separate from the evaluation setup itself.

The paper is aimed at groups building or benchmarking autonomous research agents. Readers who need a new testbed for multi-step optimization will find the released tasks useful even if they treat the headline result as preliminary.

Send it to review. The benchmark construction and open release give it enough substance to justify referee time, provided the methods section gets expanded.

Referee Report

3 major / 1 minor

Summary. The paper introduces AutoLab, a benchmark consisting of 36 expert-curated tasks across four domains (system optimization, puzzle & challenge, model development, and CUDA kernel optimization). Each task starts from a correct but deliberately suboptimal baseline and requires agents to improve performance within a strict wall-clock budget. Evaluation of 17 frontier models finds that the dominant predictor of success is not the quality of the initial attempt but the agent's persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. Claude-opus-4.6 shows strong capabilities while most other models terminate prematurely or make minimal progress; the benchmark, harness, and artifacts are open-sourced.

Significance. If the empirical findings hold after detailed verification, the work provides a new benchmark targeting a clear gap in existing single-turn or short-horizon evaluations and identifies persistence and time awareness as critical for long-horizon autonomous agents. The open-sourcing of the full benchmark, evaluation harness, and task artifacts is a concrete strength that enables reproducibility and community follow-up.

major comments (3)

[Abstract] Abstract: the central claim that persistence is the dominant predictor of success on the 36 tasks is presented without any description of the statistical method, metrics, or controls used to establish dominance (e.g., regression of success on initial quality vs. iteration count). This is load-bearing for the claim.
[Abstract] Abstract: no information is supplied on task selection criteria, inter-task variance in the necessity of iteration, or controls that separate model capability from persistence behavior. Without these, it is impossible to rule out that the reported dominance is an artifact of the harness (starting from suboptimal baselines + fixed wall-clock budget).
[Abstract] Abstract: the evaluation of 17 models reports qualitative outcomes (e.g., "most frontier models either terminate prematurely or exhaust their budgets with minimal progress") but supplies no quantitative breakdown, error analysis, or per-domain statistics that would allow assessment of the persistence result.

minor comments (1)

[Abstract] The model name "claude-opus-4.6" should be clarified with the precise version or API identifier used.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments on the abstract. These points help us improve the clarity of our central claims. We will revise the abstract to include more details on the statistical methods, task selection, and quantitative results as outlined below.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that persistence is the dominant predictor of success on the 36 tasks is presented without any description of the statistical method, metrics, or controls used to establish dominance (e.g., regression of success on initial quality vs. iteration count). This is load-bearing for the claim.

Authors: In the full manuscript (Section 4.2), we describe the regression analysis used: we fit a linear model with task success as the outcome, using initial baseline performance and iteration count as predictors, with controls for model identity and domain. Iteration count emerges as the dominant factor based on standardized coefficients and partial R-squared values. We will add a short clause to the abstract referencing this analysis. revision: yes
Referee: [Abstract] Abstract: no information is supplied on task selection criteria, inter-task variance in the necessity of iteration, or controls that separate model capability from persistence behavior. Without these, it is impossible to rule out that the reported dominance is an artifact of the harness (starting from suboptimal baselines + fixed wall-clock budget).

Authors: Task selection is detailed in Section 3, involving expert curation to ensure tasks benefit from iteration, with variance quantified by the range of iterations needed across tasks in pilot experiments. We include controls such as comparing to agents limited to single attempts. We will summarize these criteria and controls concisely in the revised abstract. revision: yes
Referee: [Abstract] Abstract: the evaluation of 17 models reports qualitative outcomes (e.g., "most frontier models either terminate prematurely or exhaust their budgets with minimal progress") but supplies no quantitative breakdown, error analysis, or per-domain statistics that would allow assessment of the persistence result.

Authors: The manuscript includes quantitative data in Section 5 and associated tables/figures: success rates per model (e.g., Claude at 33% vs. others <10%), average iterations, per-domain breakdowns, and error analysis categorizing failures into premature termination (dominant for most models) vs. other issues. We will incorporate key quantitative metrics into the abstract to bolster the qualitative description. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with external model evaluations

full rationale

The paper introduces AutoLab as an empirical benchmark consisting of 36 expert-curated tasks and reports results from running 17 frontier models under fixed wall-clock constraints. The central claim (persistence as dominant predictor) is an observed statistical pattern across those runs, not a derivation, fitted parameter, or self-citation chain. No equations, ansatzes, or uniqueness theorems are invoked; all performance numbers derive from direct execution of external models on released artifacts. This is a standard self-contained benchmark paper whose findings are falsifiable by re-running the open-sourced harness.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark introduction with no mathematical derivations, fitted parameters, or new postulated entities.

pith-pipeline@v0.9.1-grok · 5828 in / 1011 out tokens · 21336 ms · 2026-06-28T06:34:19.210790+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?
cs.CL 2026-06 unverdicted novelty 7.0

NatureBench evaluates ten frontier AI coding agents on 90 tasks from Nature papers under web-search-disabled conditions and finds the strongest agent surpasses published SOTA on only 17.8% of tasks, succeeding mainly ...

Reference graph

Works this paper leans on

60 extracted references · 14 canonical work pages · cited by 1 Pith paper · 8 internal anchors

[1]

arXiv preprint arXiv:2512.15699 , year=

FrontierCS: Evolving Challenges for Evolving Intelligence , author=. arXiv preprint arXiv:2512.15699 , year=

work page arXiv
[2]

Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization , author=. arXiv preprint arXiv:2604.12290 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[3]

2026 , howpublished =

pi-mono: AI Agent Toolkit , author =. 2026 , howpublished =

2026
[4]

2026 , month = feb, howpublished =

System Card: Claude Opus 4.6 , author =. 2026 , month = feb, howpublished =

2026
[5]

2026 , month = mar, howpublished =

GPT-5.4 Thinking System Card , author =. 2026 , month = mar, howpublished =

2026
[6]

2026 , month = feb, howpublished =

Grok 4.20 , author =. 2026 , month = feb, howpublished =

2026
[7]

2026 , month = feb, howpublished =

Gemini 3.1 Pro Model Card , author =. 2026 , month = feb, howpublished =

2026
[8]

2025 , month = jan, howpublished =

Kimi K2.5 , author =. 2025 , month = jan, howpublished =

2025
[9]

2026 , month = feb, howpublished =

Kimi K2.6 , author =. 2026 , month = feb, howpublished =

2026
[10]

2026 , month = feb, howpublished =

MiniMax M2.5: Built for Real-World Productivity , author =. 2026 , month = feb, howpublished =

2026
[11]

2026 , month = mar, howpublished =

MiniMax M2.7: Early Echoes of Self-Evolution , author =. 2026 , month = mar, howpublished =

2026
[12]

2026 , month = feb, howpublished =

Qwen3.5: Towards Native Multimodal Agents , author =. 2026 , month = feb, howpublished =

2026
[13]

2026 , month = apr, howpublished =

Qwen3.6-Plus: Towards Real World Agents , author =. 2026 , month = apr, howpublished =

2026
[14]

2026 , month = apr, howpublished =

Hy3 preview: The First Step in Rebuilding the Hy model , author =. 2026 , month = apr, howpublished =

2026
[15]

2026 , month = mar, howpublished =

Xiaomi MiMo-V2-Pro , author =. 2026 , month = mar, howpublished =

2026
[16]

2026 , month = apr, howpublished =

Xiaomi MiMo-V2.5-Pro , author =. 2026 , month = apr, howpublished =

2026
[17]

2026 , month = apr, howpublished =

Xiaomi MiMo-V2.5 , author =. 2026 , month = apr, howpublished =

2026
[18]

2026 , month = apr, howpublished =

DeepSeek V4 Preview Release , author =. 2026 , month = apr, howpublished =

2026
[19]

GLM-5: from Vibe Coding to Agentic Engineering

Glm-5: from vibe coding to agentic engineering , author=. arXiv preprint arXiv:2602.15763 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[20]

International Conference on Learning Representations , year =

Measuring Massive Multitask Language Understanding , author =. International Conference on Learning Representations , year =
[21]

2021 , eprint =

Training Verifiers to Solve Math Word Problems , author =. 2021 , eprint =

2021
[22]

Measuring Mathematical Problem Solving With the

Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob , booktitle =. Measuring Mathematical Problem Solving With the. 2021 , url =

2021
[23]

International Conference on Learning Representations , volume=

Livecodebench: Holistic and contamination free evaluation of large language models for code , author=. International Conference on Learning Representations , volume=
[24]

International Conference on Learning Representations , volume=

Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions , author=. International Conference on Learning Representations , volume=
[25]

2021 , eprint =

Evaluating Large Language Models Trained on Code , author =. 2021 , eprint =

2021
[26]

2021 , eprint =

Program Synthesis with Large Language Models , author =. 2021 , eprint =

2021
[27]

Transactions on Machine Learning Research , year =

Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models , author =. Transactions on Machine Learning Research , year =
[28]

Transactions on Machine Learning Research , year =

Holistic Evaluation of Language Models , author =. Transactions on Machine Learning Research , year =
[29]

and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik , booktitle =

Jimenez, Carlos E. and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik , booktitle =. 2024 , url =

2024
[30]

2023 , eprint =

Mialon, Gr. 2023 , eprint =

2023
[31]

2024 , url =

Wang, Yubo and Ma, Xueguang and Zhang, Ge and Ni, Yuansheng and Chandra, Abhranil and Guo, Shiguang and Ren, Weiming and Arulraj, Aaran and He, Xuan and Jiang, Ziyan and Li, Tianle and Ku, Max and Wang, Kai and Zhuang, Alex and Fan, Rongqi and Yue, Xiang and Chen, Wenhu , booktitle =. 2024 , url =

2024
[32]

2025 , url =

White, Colin and Dooley, Samuel and Roberts, Manley and Pal, Arka and Feuer, Ben and Jain, Siddhartha and Shwartz-Ziv, Ravid and Jain, Neel and Saifullah, Khalid and Dey, Sreemanti and Shubh-Agrawal and Sandha, Sandeep Singh and Naidu, Siddartha and Hegde, Chinmay and LeCun, Yann and Goldstein, Tom and Neiswanger, Willie and Goldblum, Micah , booktitle =....

2025
[33]

2024 , url =

Liu, Xiao and Yu, Hao and Zhang, Hanchen and Xu, Yifan and Lei, Xuanyu and Lai, Hanyu and Gu, Yu and Ding, Hangliang and Men, Kaiwen and Yang, Kejuan and Zhang, Shudan and Deng, Xiang and Zeng, Aohan and Du, Zhengxiao and Zhang, Chenhui and Shen, Sheng and Zhang, Tianjun and Su, Yu and Sun, Huan and Huang, Minlie and Dong, Yuxiao and Tang, Jie , booktitle...

2024
[34]

Zhou, Shuyan and Xu, Frank F. and Zhu, Hao and Zhou, Xuhui and Lo, Robert and Sridhar, Abishek and Cheng, Xianyi and Ou, Tianyue and Bisk, Yonatan and Fried, Daniel and Alon, Uri and Neubig, Graham , booktitle =. 2024 , url =

2024
[35]

2024 , url =

Xie, Tianbao and Zhang, Danyang and Chen, Jixuan and Li, Xiaochuan and Zhao, Siheng and Cao, Ruisheng and Hua, Toh Jing and Cheng, Zhoujun and Shin, Dongchan and Lei, Fangyu and Liu, Yitao and Xu, Yiheng and Zhou, Shuyan and Savarese, Silvio and Xiong, Caiming and Zhong, Victor and Yu, Tao , booktitle =. 2024 , url =

2024
[36]

Preprint, arXiv:2407.18901

Trivedi, Harsh and Khot, Tushar and Hartmann, Mareike and Manku, Ruskin and Dong, Vinty and Li, Edward and Gupta, Shashank and Sabharwal, Ashish and Balasubramanian, Niranjan , year =. 2407.18901 , archivePrefix =

work page arXiv
[37]

2025 , url =

Yao, Shunyu and Shinn, Noah and Razavi, Pedram and Narasimhan, Karthik , booktitle =. 2025 , url =

2025
[38]

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

Xu, Frank F. and Song, Yufan and Li, Boxuan and Tang, Yuxuan and Jain, Kritanjali and Bao, Mengxue and Wang, Zora Z. and Zhou, Xuhui and Guo, Zhitong and Cao, Murong and Yang, Mingyang and Lu, Hao Yang and Martin, Amaad and Su, Zhe and Maben, Leander and Mehta, Raj and Chi, Wayne and Jang, Lawrence and Xie, Yiqing and Zhou, Shuyan and Neubig, Graham , yea...

work page internal anchor Pith review Pith/arXiv arXiv
[39]

2026 , eprint =

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces , author =. 2026 , eprint =

2026
[40]

2024 , url =

Huang, Qian and Vora, Jian and Liang, Percy and Leskovec, Jure , booktitle =. 2024 , url =

2024
[41]

2024 , eprint =

Chan, Jun Shern and Chowdhury, Neil and Jaffe, Oliver and Aung, James and Sherburn, Dane and Mays, Evan and Starace, Giulio and Liu, Kevin and Maksin, Leon and Patwardhan, Tejal and Weng, Lilian and M. 2024 , eprint =

2024
[42]

Re-bench: Evaluating frontier ai r&d capabilities of language model agents against human experts.arXiv preprint arXiv:2411.15114, 2024

Wijk, Hjalmar and Lin, Tao and Becker, Joel and Jawhar, Sami and Parikh, Neev and Broadley, Thomas and Chan, Lawrence and Chen, Michael and Clymer, Josh and Dhyani, Jai and Ericheva, Elena and Garcia, Katharyn and Goodrich, Brian and Jurkovic, Nikola and Kinniment, Megan and Lajko, Aron and Nix, Seraphina and Sato, Lucas and Saunders, William and Taran, M...

work page arXiv
[43]

2025 , url =

Nathani, Deepak and Madaan, Lovish and Roberts, Nicholas and Bashlykov, Nikolay and Menon, Ajay and Moens, Vincent and Budhiraja, Amar and Magka, Despoina and Vorotilov, Vladislav and Chaurasia, Gaurav and Hupkes, Dieuwke and Cabral, Ricardo Silveira and Shavrina, Tatiana and Foerster, Jakob and Bachrach, Yoram and Wang, William Yang and Raileanu, Roberta...

2025
[44]

2602.06855 , archivePrefix =

Lupidi, Alisia and Gauri, Bhavul and Foster, Thomas Simon and Al Omari, Bassel and Magka, Despoina and Pepe, Alberto and Audran-Reiss, Alexis and Aghamelu, Muna and Baldwin, Nicolas and Cipolina-Kun, Lucia and Gagnon-Audet, Jean-Christophe and Leow, Chee Hau and Lefdal, Sandra and Mossalam, Hossam and Moudgil, Abhinav and Nazir, Saba and Tewolde, Emanuel ...

work page arXiv
[45]

and Burns, Benjamin and Adu-Ampratwum, Daniel and Huang, Xuhui and Ning, Xia and Gao, Song and Su, Yu and Sun, Huan , booktitle =

Chen, Ziru and Chen, Shijie and Ning, Yuting and Zhang, Qianheng and Wang, Boshi and Yu, Botao and Li, Yifei and Liao, Zeyi and Wei, Chen and Lu, Zitong and Dey, Vishal and Xue, Mingyi and Baker, Frazier N. and Burns, Benjamin and Adu-Ampratwum, Daniel and Huang, Xuhui and Ning, Xia and Gao, Song and Su, Yu and Sun, Huan , booktitle =. 2025 , url =

2025
[46]

ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

Sun, Qiushi and Liu, Zhoumianze and Ma, Chang and Ding, Zichen and Xu, Fangzhi and Yin, Zhangyue and Zhao, Haiteng and Wu, Zhenyu and Cheng, Kanzhi and Liu, Zhaoyang and Wang, Jianing and Li, Qintong and Tang, Xiangru and Xie, Tianbao and Feng, Xiaochong and Li, Xiang and Kao, Ben and Wang, Wenhai and Qi, Biqing and Kong, Lingpeng and Wu, Zhiyong , year =...

work page internal anchor Pith review Pith/arXiv arXiv
[47]

and Wettig, Alexander and Lieret, Kilian and Yao, Shunyu and Narasimhan, Karthik , booktitle =

Yang, John and Jimenez, Carlos E. and Wettig, Alexander and Lieret, Kilian and Yao, Shunyu and Narasimhan, Karthik , booktitle =. 2024 , url =

2024
[48]

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

Wang, Xingyao and others , year =. 2407.16741 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv
[49]

Gauthier, Paul and Aider-AI Contributors , year =. Aider:
[50]

Lu, Chris and Lu, Cong and Lange, Robert Tjarko and Foerster, Jakob and Clune, Jeff and Ha, David , year =. The. 2408.06292 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv
[51]

and Cox, Sam and Schilter, Oliver and Baldassari, Carlo and White, Andrew D

Bran, Andres M. and Cox, Sam and Schilter, Oliver and Baldassari, Carlo and White, Andrew D. and Schwaller, Philippe , journal =. 2024 , url =

2024
[52]

Nature , year =

Autonomous Chemical Research with Large Language Models , author =. Nature , year =
[53]

Novikov, Alexander and Vu, Ngan and Eisenberger, Marvin and Dupont, Emilien and Huang, Po-Sen and Wagner, Adam Zsolt and Shirobokov, Sergey and Kozlovskii, Borislav and Ruiz, Francisco J. R. and Mehrabian, Abbas and Kumar, M. Pawan and See, Abigail and Chaudhuri, Swarat and Holland, George and Davies, Alex and Nowozin, Sebastian and Kohli, Pushmeet and Ba...

work page internal anchor Pith review Pith/arXiv arXiv
[54]

Autoresearch:

Karpathy, Andrej , year =. Autoresearch:
[55]

2025 , url =

Peng, Yun and Wan, Jun and Li, Yichen and Ren, Xiaoxue , journal =. 2025 , url =

2025
[56]

2025 , url =

Ouyang, Andy and others , booktitle =. 2025 , url =

2025
[57]

Training Software Engineering Agents and Verifiers with SWE-Gym

Pan, Jiayi and Wang, Xingyao and Neubig, Graham and Jaitly, Navdeep and Ji, Heng and Suhr, Alane and Zhang, Yizhe , year =. Training Software Engineering Agents and Verifiers with. 2412.21139 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv
[58]

2504.07164 , archivePrefix =

Jain, Naman and Singh, Jaskirat and Shetty, Manish and Zheng, Liang and Sen, Koushik and Stoica, Ion , year =. 2504.07164 , archivePrefix =

work page arXiv
[59]

International Conference on Machine Learning , year =

Starace, Giulio and Jaffe, Oliver and Sherburn, Dane and Aung, James and Chan, Jun Shern and Maksin, Leon and Dias, Rachel and Mays, Evan and Kinsella, Benjamin and Thompson, Wyatt and Ahmad, Johannes and Wang, Tina and Patwardhan, Tejal and Shah, Kevin and M. International Conference on Machine Learning , year =
[60]

Posttrainbench: Can llm agents automate llm post-training? 2026

Rank, Ben and Bhatnagar, Hardik and Prabhu, Ameya and Eisenberg, Shira and Nguyen, Karina and Bethge, Matthias and Andriushchenko, Maksym , year =. 2603.08640 , archivePrefix =

work page arXiv

[1] [1]

arXiv preprint arXiv:2512.15699 , year=

FrontierCS: Evolving Challenges for Evolving Intelligence , author=. arXiv preprint arXiv:2512.15699 , year=

work page arXiv

[2] [2]

Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization , author=. arXiv preprint arXiv:2604.12290 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

2026 , howpublished =

pi-mono: AI Agent Toolkit , author =. 2026 , howpublished =

2026

[4] [4]

2026 , month = feb, howpublished =

System Card: Claude Opus 4.6 , author =. 2026 , month = feb, howpublished =

2026

[5] [5]

2026 , month = mar, howpublished =

GPT-5.4 Thinking System Card , author =. 2026 , month = mar, howpublished =

2026

[6] [6]

2026 , month = feb, howpublished =

Grok 4.20 , author =. 2026 , month = feb, howpublished =

2026

[7] [7]

2026 , month = feb, howpublished =

Gemini 3.1 Pro Model Card , author =. 2026 , month = feb, howpublished =

2026

[8] [8]

2025 , month = jan, howpublished =

Kimi K2.5 , author =. 2025 , month = jan, howpublished =

2025

[9] [9]

2026 , month = feb, howpublished =

Kimi K2.6 , author =. 2026 , month = feb, howpublished =

2026

[10] [10]

2026 , month = feb, howpublished =

MiniMax M2.5: Built for Real-World Productivity , author =. 2026 , month = feb, howpublished =

2026

[11] [11]

2026 , month = mar, howpublished =

MiniMax M2.7: Early Echoes of Self-Evolution , author =. 2026 , month = mar, howpublished =

2026

[12] [12]

2026 , month = feb, howpublished =

Qwen3.5: Towards Native Multimodal Agents , author =. 2026 , month = feb, howpublished =

2026

[13] [13]

2026 , month = apr, howpublished =

Qwen3.6-Plus: Towards Real World Agents , author =. 2026 , month = apr, howpublished =

2026

[14] [14]

2026 , month = apr, howpublished =

Hy3 preview: The First Step in Rebuilding the Hy model , author =. 2026 , month = apr, howpublished =

2026

[15] [15]

2026 , month = mar, howpublished =

Xiaomi MiMo-V2-Pro , author =. 2026 , month = mar, howpublished =

2026

[16] [16]

2026 , month = apr, howpublished =

Xiaomi MiMo-V2.5-Pro , author =. 2026 , month = apr, howpublished =

2026

[17] [17]

2026 , month = apr, howpublished =

Xiaomi MiMo-V2.5 , author =. 2026 , month = apr, howpublished =

2026

[18] [18]

2026 , month = apr, howpublished =

DeepSeek V4 Preview Release , author =. 2026 , month = apr, howpublished =

2026

[19] [19]

GLM-5: from Vibe Coding to Agentic Engineering

Glm-5: from vibe coding to agentic engineering , author=. arXiv preprint arXiv:2602.15763 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

International Conference on Learning Representations , year =

Measuring Massive Multitask Language Understanding , author =. International Conference on Learning Representations , year =

[21] [21]

2021 , eprint =

Training Verifiers to Solve Math Word Problems , author =. 2021 , eprint =

2021

[22] [22]

Measuring Mathematical Problem Solving With the

Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob , booktitle =. Measuring Mathematical Problem Solving With the. 2021 , url =

2021

[23] [23]

International Conference on Learning Representations , volume=

Livecodebench: Holistic and contamination free evaluation of large language models for code , author=. International Conference on Learning Representations , volume=

[24] [24]

International Conference on Learning Representations , volume=

Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions , author=. International Conference on Learning Representations , volume=

[25] [25]

2021 , eprint =

Evaluating Large Language Models Trained on Code , author =. 2021 , eprint =

2021

[26] [26]

2021 , eprint =

Program Synthesis with Large Language Models , author =. 2021 , eprint =

2021

[27] [27]

Transactions on Machine Learning Research , year =

Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models , author =. Transactions on Machine Learning Research , year =

[28] [28]

Transactions on Machine Learning Research , year =

Holistic Evaluation of Language Models , author =. Transactions on Machine Learning Research , year =

[29] [29]

and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik , booktitle =

Jimenez, Carlos E. and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik , booktitle =. 2024 , url =

2024

[30] [30]

2023 , eprint =

Mialon, Gr. 2023 , eprint =

2023

[31] [31]

2024 , url =

Wang, Yubo and Ma, Xueguang and Zhang, Ge and Ni, Yuansheng and Chandra, Abhranil and Guo, Shiguang and Ren, Weiming and Arulraj, Aaran and He, Xuan and Jiang, Ziyan and Li, Tianle and Ku, Max and Wang, Kai and Zhuang, Alex and Fan, Rongqi and Yue, Xiang and Chen, Wenhu , booktitle =. 2024 , url =

2024

[32] [32]

2025 , url =

White, Colin and Dooley, Samuel and Roberts, Manley and Pal, Arka and Feuer, Ben and Jain, Siddhartha and Shwartz-Ziv, Ravid and Jain, Neel and Saifullah, Khalid and Dey, Sreemanti and Shubh-Agrawal and Sandha, Sandeep Singh and Naidu, Siddartha and Hegde, Chinmay and LeCun, Yann and Goldstein, Tom and Neiswanger, Willie and Goldblum, Micah , booktitle =....

2025

[33] [33]

2024 , url =

Liu, Xiao and Yu, Hao and Zhang, Hanchen and Xu, Yifan and Lei, Xuanyu and Lai, Hanyu and Gu, Yu and Ding, Hangliang and Men, Kaiwen and Yang, Kejuan and Zhang, Shudan and Deng, Xiang and Zeng, Aohan and Du, Zhengxiao and Zhang, Chenhui and Shen, Sheng and Zhang, Tianjun and Su, Yu and Sun, Huan and Huang, Minlie and Dong, Yuxiao and Tang, Jie , booktitle...

2024

[34] [34]

Zhou, Shuyan and Xu, Frank F. and Zhu, Hao and Zhou, Xuhui and Lo, Robert and Sridhar, Abishek and Cheng, Xianyi and Ou, Tianyue and Bisk, Yonatan and Fried, Daniel and Alon, Uri and Neubig, Graham , booktitle =. 2024 , url =

2024

[35] [35]

2024 , url =

Xie, Tianbao and Zhang, Danyang and Chen, Jixuan and Li, Xiaochuan and Zhao, Siheng and Cao, Ruisheng and Hua, Toh Jing and Cheng, Zhoujun and Shin, Dongchan and Lei, Fangyu and Liu, Yitao and Xu, Yiheng and Zhou, Shuyan and Savarese, Silvio and Xiong, Caiming and Zhong, Victor and Yu, Tao , booktitle =. 2024 , url =

2024

[36] [36]

Preprint, arXiv:2407.18901

Trivedi, Harsh and Khot, Tushar and Hartmann, Mareike and Manku, Ruskin and Dong, Vinty and Li, Edward and Gupta, Shashank and Sabharwal, Ashish and Balasubramanian, Niranjan , year =. 2407.18901 , archivePrefix =

work page arXiv

[37] [37]

2025 , url =

Yao, Shunyu and Shinn, Noah and Razavi, Pedram and Narasimhan, Karthik , booktitle =. 2025 , url =

2025

[38] [38]

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

Xu, Frank F. and Song, Yufan and Li, Boxuan and Tang, Yuxuan and Jain, Kritanjali and Bao, Mengxue and Wang, Zora Z. and Zhou, Xuhui and Guo, Zhitong and Cao, Murong and Yang, Mingyang and Lu, Hao Yang and Martin, Amaad and Su, Zhe and Maben, Leander and Mehta, Raj and Chi, Wayne and Jang, Lawrence and Xie, Yiqing and Zhou, Shuyan and Neubig, Graham , yea...

work page internal anchor Pith review Pith/arXiv arXiv

[39] [39]

2026 , eprint =

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces , author =. 2026 , eprint =

2026

[40] [40]

2024 , url =

Huang, Qian and Vora, Jian and Liang, Percy and Leskovec, Jure , booktitle =. 2024 , url =

2024

[41] [41]

2024 , eprint =

Chan, Jun Shern and Chowdhury, Neil and Jaffe, Oliver and Aung, James and Sherburn, Dane and Mays, Evan and Starace, Giulio and Liu, Kevin and Maksin, Leon and Patwardhan, Tejal and Weng, Lilian and M. 2024 , eprint =

2024

[42] [42]

Re-bench: Evaluating frontier ai r&d capabilities of language model agents against human experts.arXiv preprint arXiv:2411.15114, 2024

Wijk, Hjalmar and Lin, Tao and Becker, Joel and Jawhar, Sami and Parikh, Neev and Broadley, Thomas and Chan, Lawrence and Chen, Michael and Clymer, Josh and Dhyani, Jai and Ericheva, Elena and Garcia, Katharyn and Goodrich, Brian and Jurkovic, Nikola and Kinniment, Megan and Lajko, Aron and Nix, Seraphina and Sato, Lucas and Saunders, William and Taran, M...

work page arXiv

[43] [43]

2025 , url =

Nathani, Deepak and Madaan, Lovish and Roberts, Nicholas and Bashlykov, Nikolay and Menon, Ajay and Moens, Vincent and Budhiraja, Amar and Magka, Despoina and Vorotilov, Vladislav and Chaurasia, Gaurav and Hupkes, Dieuwke and Cabral, Ricardo Silveira and Shavrina, Tatiana and Foerster, Jakob and Bachrach, Yoram and Wang, William Yang and Raileanu, Roberta...

2025

[44] [44]

2602.06855 , archivePrefix =

Lupidi, Alisia and Gauri, Bhavul and Foster, Thomas Simon and Al Omari, Bassel and Magka, Despoina and Pepe, Alberto and Audran-Reiss, Alexis and Aghamelu, Muna and Baldwin, Nicolas and Cipolina-Kun, Lucia and Gagnon-Audet, Jean-Christophe and Leow, Chee Hau and Lefdal, Sandra and Mossalam, Hossam and Moudgil, Abhinav and Nazir, Saba and Tewolde, Emanuel ...

work page arXiv

[45] [45]

and Burns, Benjamin and Adu-Ampratwum, Daniel and Huang, Xuhui and Ning, Xia and Gao, Song and Su, Yu and Sun, Huan , booktitle =

Chen, Ziru and Chen, Shijie and Ning, Yuting and Zhang, Qianheng and Wang, Boshi and Yu, Botao and Li, Yifei and Liao, Zeyi and Wei, Chen and Lu, Zitong and Dey, Vishal and Xue, Mingyi and Baker, Frazier N. and Burns, Benjamin and Adu-Ampratwum, Daniel and Huang, Xuhui and Ning, Xia and Gao, Song and Su, Yu and Sun, Huan , booktitle =. 2025 , url =

2025

[46] [46]

ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

Sun, Qiushi and Liu, Zhoumianze and Ma, Chang and Ding, Zichen and Xu, Fangzhi and Yin, Zhangyue and Zhao, Haiteng and Wu, Zhenyu and Cheng, Kanzhi and Liu, Zhaoyang and Wang, Jianing and Li, Qintong and Tang, Xiangru and Xie, Tianbao and Feng, Xiaochong and Li, Xiang and Kao, Ben and Wang, Wenhai and Qi, Biqing and Kong, Lingpeng and Wu, Zhiyong , year =...

work page internal anchor Pith review Pith/arXiv arXiv

[47] [47]

and Wettig, Alexander and Lieret, Kilian and Yao, Shunyu and Narasimhan, Karthik , booktitle =

Yang, John and Jimenez, Carlos E. and Wettig, Alexander and Lieret, Kilian and Yao, Shunyu and Narasimhan, Karthik , booktitle =. 2024 , url =

2024

[48] [48]

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

Wang, Xingyao and others , year =. 2407.16741 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv

[49] [49]

Gauthier, Paul and Aider-AI Contributors , year =. Aider:

[50] [50]

Lu, Chris and Lu, Cong and Lange, Robert Tjarko and Foerster, Jakob and Clune, Jeff and Ha, David , year =. The. 2408.06292 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv

[51] [51]

and Cox, Sam and Schilter, Oliver and Baldassari, Carlo and White, Andrew D

Bran, Andres M. and Cox, Sam and Schilter, Oliver and Baldassari, Carlo and White, Andrew D. and Schwaller, Philippe , journal =. 2024 , url =

2024

[52] [52]

Nature , year =

Autonomous Chemical Research with Large Language Models , author =. Nature , year =

[53] [53]

Novikov, Alexander and Vu, Ngan and Eisenberger, Marvin and Dupont, Emilien and Huang, Po-Sen and Wagner, Adam Zsolt and Shirobokov, Sergey and Kozlovskii, Borislav and Ruiz, Francisco J. R. and Mehrabian, Abbas and Kumar, M. Pawan and See, Abigail and Chaudhuri, Swarat and Holland, George and Davies, Alex and Nowozin, Sebastian and Kohli, Pushmeet and Ba...

work page internal anchor Pith review Pith/arXiv arXiv

[54] [54]

Autoresearch:

Karpathy, Andrej , year =. Autoresearch:

[55] [55]

2025 , url =

Peng, Yun and Wan, Jun and Li, Yichen and Ren, Xiaoxue , journal =. 2025 , url =

2025

[56] [56]

2025 , url =

Ouyang, Andy and others , booktitle =. 2025 , url =

2025

[57] [57]

Training Software Engineering Agents and Verifiers with SWE-Gym

Pan, Jiayi and Wang, Xingyao and Neubig, Graham and Jaitly, Navdeep and Ji, Heng and Suhr, Alane and Zhang, Yizhe , year =. Training Software Engineering Agents and Verifiers with. 2412.21139 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv

[58] [58]

2504.07164 , archivePrefix =

Jain, Naman and Singh, Jaskirat and Shetty, Manish and Zheng, Liang and Sen, Koushik and Stoica, Ion , year =. 2504.07164 , archivePrefix =

work page arXiv

[59] [59]

International Conference on Machine Learning , year =

Starace, Giulio and Jaffe, Oliver and Sherburn, Dane and Aung, James and Chan, Jun Shern and Maksin, Leon and Dias, Rachel and Mays, Evan and Kinsella, Benjamin and Thompson, Wyatt and Ahmad, Johannes and Wang, Tina and Patwardhan, Tejal and Shah, Kevin and M. International Conference on Machine Learning , year =

[60] [60]

Posttrainbench: Can llm agents automate llm post-training? 2026

Rank, Ben and Bhatnagar, Hardik and Prabhu, Ameya and Eisenberg, Shira and Nguyen, Karina and Bethge, Matthias and Andriushchenko, Maksym , year =. 2603.08640 , archivePrefix =

work page arXiv