Recognition: no theorem link
AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive
Pith reviewed 2026-05-13 01:12 UTC · model grok-4.3
The pith
AutoLLMResearch trains agents to learn LLM configuration principles from cheap low-fidelity experiments and extrapolate them to expensive high-fidelity settings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AutoLLMResearch formulates configuration search as a long-horizon Markov Decision Process inside a multi-fidelity environment and supplies a training pipeline that rewards agents for learning cross-fidelity extrapolation rules, enabling them to identify promising LLM setups after exposure to cheap proxies rather than repeated expensive trials.
What carries the argument
LLMConfig-Gym, a multi-fidelity environment spanning four LLM experiment tasks and supported by over one million GPU hours of verifiable outcomes, which supplies the structured interaction data needed for agents to practice and internalize extrapolation reasoning.
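As a concrete picture of what such a multi-fidelity environment could expose to an agent, the sketch below models precomputed, verifiable outcomes indexed by configuration and fidelity. All class, field, and method names here are assumptions for illustration, not LLMConfig-Gym's actual API.

```python
# Hypothetical sketch of a multi-fidelity configuration environment in the
# spirit of LLMConfig-Gym; names and signatures are assumptions, not the
# paper's actual interface.
from dataclasses import dataclass, field

@dataclass
class Observation:
    config: dict     # e.g. {"learning_rate": 2e-3, "batch_size": 128}
    fidelity: int    # 0 = cheapest proxy; higher levels cost more GPU hours
    score: float     # verified experiment outcome (e.g. eval loss or accuracy)

@dataclass
class MultiFidelityEnv:
    outcomes: dict   # precomputed table: (sorted config items, fidelity) -> score
    budget: dict     # remaining evaluations allowed per fidelity level
    history: list = field(default_factory=list)

    def step(self, config: dict, fidelity: int) -> Observation:
        """Run one experiment at the requested fidelity and log it."""
        assert self.budget[fidelity] > 0, "fidelity budget exhausted"
        self.budget[fidelity] -= 1
        key = (tuple(sorted(config.items())), fidelity)
        obs = Observation(config, fidelity, self.outcomes[key])
        self.history.append(obs)
        return obs
```

Under this reading, an agent spends most of its budget at fidelity 0 and commits only a handful of calls at the expensive top fidelity, which is exactly the behavior the training pipeline is meant to reward.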
If this is right
- Agents achieve higher success rates than diverse baselines on held-out configuration tasks.
- Performance generalizes across different LLM experiment types within the same multi-fidelity setup.
- The learned decision process produces interpretable traces of the reasoning steps that link low-fidelity observations to high-fidelity recommendations.
- The overall pipeline supplies a reusable template for automating other high-cost experimental configuration problems.
Where Pith is reading between the lines
- If the learned extrapolation rules prove stable, the same training pattern could be reused for configuration tasks in other domains whose cost curves allow cheap proxy measurements, such as certain simulation or hardware-tuning problems.
- Interpretability of the agent's reasoning might surface previously unnoticed regularities in how small changes in architecture or hyperparameters affect large-model behavior.
- The approach implicitly assumes that configuration landscapes share enough low-dimensional structure across fidelities; testing the agents on models whose scale or training regime differs markedly from the training distribution would expose where that assumption breaks.
Load-bearing premise
The multi-fidelity experimental environment captures the structure of the LLM configuration landscape in a way that permits reliable cross-fidelity extrapolation from cheap to expensive settings.
What would settle it
Deploy the trained agents on a fresh collection of high-fidelity LLM experiments withheld from training and measure whether they reach target performance metrics using substantially fewer expensive evaluations than random search or other strong baselines.
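That settling experiment reduces to a simple sample-efficiency metric: expensive evaluations consumed before a target score is reached, compared against a random-search baseline. A minimal sketch, with hypothetical names and illustrative data, not the paper's evaluation harness:

```python
import random

def evals_to_target(propose, outcomes, target, max_evals):
    """Count expensive evaluations until some config reaches the target.

    `propose` is an iterator of candidate configs; `outcomes` maps each
    config to its high-fidelity score (higher is better). Returns the
    number of expensive evaluations spent, or None if the target is not
    reached within `max_evals` (or the proposer runs out of candidates).
    """
    for n, cfg in enumerate(propose, start=1):
        if n > max_evals:
            return None
        if outcomes[cfg] >= target:
            return n
    return None

def random_search(space, seed=0):
    """Baseline proposer: sample configurations uniformly at random."""
    rng = random.Random(seed)
    while True:
        yield rng.choice(space)
```

A trained agent "wins" on this metric if its proposer reaches the target with markedly fewer expensive evaluations than `random_search` on the same held-out outcomes table.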
read the original abstract
Effectively configuring scalable large language model (LLM) experiments, spanning architecture design, hyperparameter tuning, and beyond, is crucial for advancing LLM research, as poor configuration choices can waste substantial computational resources and prevent models from realizing their full potential. Prior automated methods are designed for low-cost settings where repeated trial and error is feasible, but scalable LLM experiments are too expensive for such extensive iteration. To our knowledge, no work has addressed the automation of high-cost LLM experiment configurations, leaving this problem labor-intensive and dependent on expert intuition. Motivated by this gap, we propose AutoLLMResearch, an agentic framework that mimics how human researchers learn generalizable principles from low-fidelity experiments and extrapolate to efficiently identify promising configurations in expensive LLM settings. The core challenge is how to enable an agent to learn, through interaction with a multi-fidelity experimental environment that captures the structure of the LLM configuration landscape. To achieve this, we propose a systematic framework with two key components: 1) LLMConfig-Gym, a multi-fidelity environment encompassing four critical LLM experiment tasks, supported by over one million GPU hours of verifiable experiment outcomes; 2) A structured training pipeline that formulates configuration research as a long-horizon Markov Decision Process and accordingly incentivizes cross-fidelity extrapolation reasoning. Extensive evaluation against diverse strong baselines on held-out experiments demonstrates the effectiveness, generalization, and interpretability of our framework, supporting its potential as a practical and general solution for scalable real-world LLM experiment automation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AutoLLMResearch, an agentic framework for automating high-cost LLM experiment configuration. It introduces LLMConfig-Gym, a multi-fidelity environment spanning four tasks and backed by over 1M GPU hours of verifiable outcomes, together with a training pipeline that casts configuration search as a long-horizon MDP to incentivize cross-fidelity extrapolation reasoning from cheap to expensive regimes. The central claim is that extensive evaluation on held-out experiments demonstrates effectiveness, generalization, and interpretability relative to strong baselines.
Significance. If the cross-fidelity extrapolation result holds, the work would be significant for scalable LLM research automation, where exhaustive search is infeasible. The release of a large-scale, verifiable multi-fidelity dataset constitutes a concrete strength that could support reproducibility and follow-on studies.
major comments (2)
- [Abstract / LLMConfig-Gym] Abstract and LLMConfig-Gym description: the extrapolation claim rests on the assumption that low-fidelity proxies preserve the relative ordering and local structure of the high-fidelity configuration landscape, yet no fidelity definitions (e.g., model-size or token reductions), correlation statistics, or ablation results showing that low-fidelity rankings predict high-fidelity ones are supplied. Without these, it is impossible to rule out that the learned policy overfits to cheap artifacts.
- [Evaluation] Evaluation section: the abstract states that held-out experiments demonstrate effectiveness against diverse baselines, but the manuscript provides neither the precise baseline implementations, quantitative tables of gains, nor controls (e.g., high-fidelity-only training) that would confirm the gains arise from cross-fidelity reasoning rather than post-hoc selection or environment-specific tuning.
minor comments (1)
- [Training pipeline] Clarify the exact state, action, and reward definitions used in the long-horizon MDP formulation of the training pipeline.
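For orientation, one plausible reading of the requested state, action, and reward definitions is sketched below; every choice here is an assumption by this review, not the paper's actual formulation.

```python
# One plausible reading of the long-horizon MDP; all definitions below are
# assumptions for illustration, not the paper's formulation.
from dataclasses import dataclass

# State: the interaction history so far -- (config, fidelity, score) triples
# plus the remaining evaluation budget at each fidelity level.

@dataclass(frozen=True)
class Action:
    config: tuple   # hashable configuration, e.g. (("lr", 2e-3), ("bs", 128))
    fidelity: int   # which fidelity level to spend one evaluation on

def reward(best_high_fidelity_score: float, terminal: bool) -> float:
    """Sparse terminal reward: the best high-fidelity score found.

    Intermediate steps earn zero, so cheap low-fidelity probes pay off only
    insofar as they improve the final expensive recommendation.
    """
    return best_high_fidelity_score if terminal else 0.0
```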
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and agree that the manuscript requires additional details to fully support the cross-fidelity extrapolation claims and evaluation rigor. Revisions will be made accordingly.
read point-by-point responses
-
Referee: [Abstract / LLMConfig-Gym] Abstract and LLMConfig-Gym description: the extrapolation claim rests on the assumption that low-fidelity proxies preserve the relative ordering and local structure of the high-fidelity configuration landscape, yet no fidelity definitions (e.g., model-size or token reductions), correlation statistics, or ablation results showing that low-fidelity rankings predict high-fidelity ones are supplied. Without these, it is impossible to rule out that the learned policy overfits to cheap artifacts.
Authors: We acknowledge that the current manuscript does not include explicit fidelity definitions, correlation statistics, or ablations to demonstrate preservation of landscape structure across fidelities. In the revised version, we will add a dedicated subsection under LLMConfig-Gym that defines the fidelity levels for each of the four tasks (including specific model-size reductions, token limits, and other proxy parameters), reports Spearman and Pearson correlations between low- and high-fidelity outcomes across sampled configurations, and presents ablation studies comparing the full multi-fidelity policy against low-fidelity-only training to show that gains arise from learned extrapolation rather than overfitting to cheap artifacts. revision: yes
-
Referee: [Evaluation] Evaluation section: the abstract states that held-out experiments demonstrate effectiveness against diverse baselines, but the manuscript provides neither the precise baseline implementations, quantitative tables of gains, nor controls (e.g., high-fidelity-only training) that would confirm the gains arise from cross-fidelity reasoning rather than post-hoc selection or environment-specific tuning.
Authors: We agree that the evaluation section lacks sufficient transparency and controls. We will expand it to provide precise implementation details and hyperparameters for all baselines, include comprehensive quantitative tables with performance metrics, relative gains, and statistical significance tests on held-out experiments, and add a high-fidelity-only training control to isolate the benefit of the cross-fidelity MDP pipeline. These additions will clarify that observed improvements stem from extrapolation reasoning rather than other factors. revision: yes
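The fidelity-agreement statistics promised in the responses above can be sketched in a few lines: Spearman's rho between low- and high-fidelity scores for the same sampled configurations. The implementation below is a generic, self-contained rank correlation, not code from the paper.

```python
def ranks(xs):
    """Rank values (1 = smallest), averaging ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank across the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(low, high):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rl, rh = ranks(low), ranks(high)
    n = len(rl)
    ml, mh = sum(rl) / n, sum(rh) / n
    cov = sum((a - ml) * (b - mh) for a, b in zip(rl, rh))
    vl = sum((a - ml) ** 2 for a in rl) ** 0.5
    vh = sum((b - mh) ** 2 for b in rh) ** 0.5
    return cov / (vl * vh)
```

A rho near 1 over sampled configurations would support the claim that low-fidelity rankings predict high-fidelity ones; a weak or negative rho would indicate the policy may be fitting cheap artifacts.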
Circularity Check
No circularity: performance claims rest on empirical held-out evaluation of independently run experiments
full rationale
The paper constructs LLMConfig-Gym from >1M GPU hours of real experiment outcomes across four tasks, formulates configuration search as a long-horizon MDP, trains an agent to learn cross-fidelity extrapolation, and reports effectiveness via comparison to baselines on held-out experiments. No derivation step reduces by construction to its own inputs: the environment data are independent empirical measurements, the agent's policy is learned from interaction, and success is measured on unseen configurations rather than fitted parameters or self-referential definitions. The central claim is therefore self-contained against the constructed benchmark and does not rely on self-citation chains, ansatz smuggling, or renaming of known results. The cross-fidelity assumption is a validity claim, not a circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Low-fidelity experiments capture transferable structure of the LLM configuration landscape
invented entities (1)
- LLMConfig-Gym: no independent evidence
Reference graph
Works this paper leans on
-
[1]
Droste Effect. 2026. URL https://en.wikipedia.org/wiki/Droste_effect
work page 2026
-
[2]
Towards learning universal hyperparameter optimizers with transformers
Yutian Chen, Xingyou Song, Chansoo Lee, Zi Wang, Richard Zhang, David Dohan, Kazuya Kawakami, Greg Kochanski, Arnaud Doucet, Marc'Aurelio Ranzato, Sagi Perel, and Nando de Freitas. Towards learning universal hyperparameter optimizers with transformers. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Informa...
work page 2022
-
[3]
Meta-learning acquisition functions for transfer learning in bayesian optimization
Michael Volpp, Lukas P. Fröhlich, Kirsten Fischer, Andreas Doerr, Stefan Falkner, Frank Hutter, and Christian Daniel. Meta-learning acquisition functions for transfer learning in bayesian optimization. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=ryeYpJSKwr
work page 2020
-
[4]
End-to-end meta-bayesian optimisation with transformer neural processes
Alexandre Maraval, Matthieu Zimmer, Antoine Grosnit, and Haitham Bou Ammar. End-to-end meta-bayesian optimisation with transformer neural processes. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 11246–11260. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/2561721d0ca69bab22b749cfc4f48f6c-Paper-Conference.pdf
work page 2023
-
[6]
Few-shot bayesian optimization with deep kernel surrogates
Martin Wistuba and Josif Grabocka. Few-shot bayesian optimization with deep kernel surrogates. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=bJxgv5C3sYc
work page 2021
-
[7]
Large language models to enhance bayesian optimization
Tennison Liu, Nicolás Astorga, Nabeel Seedat, and Mihaela van der Schaar. Large language models to enhance bayesian optimization. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=OOxotBmGol
work page 2024
-
[8]
Using large language models for hyperparameter optimization
Michael Zhang, Nishkrit Desai, Juhan Bae, Jonathan Lorraine, and Jimmy Ba. Using large language models for hyperparameter optimization. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023. URL https://openreview.net/forum?id=FUdZ6HEOre
work page 2023
-
[9]
Sequential large language model-based hyper-parameter optimization
Kanan Mahammadli and Seyda Ertekin. Sequential large language model-based hyper-parameter optimization, 2025. URL https://arxiv.org/abs/2410.20302
-
[10]
AgentHPO: Large language model agent for hyper-parameter optimization
Siyi Liu, Chen Gao, and Yong Li. AgentHPO: Large language model agent for hyper-parameter optimization. In Beidi Chen, Shijia Liu, Mert Pilanci, Weijie Su, Jeremias Sulam, Yuxiang Wang, and Zhihui Zhu, editors, Conference on Parsimony and Learning, volume 280 of Proceedings of Machine Learning Research, pages 1146–1169. PMLR, 24–27 Mar 2025. URL https://pro...
work page 2025
-
[11]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361
work page 2020
-
[12]
Hw-gpt-bench: Hardware-aware architecture benchmark for language models
Rhea Sanjay Sukthanker, Arber Zela, Benedikt Staffler, Aaron Klein, Lennart Purucker, Jörg K. H. Franke, and Frank Hutter. Hw-gpt-bench: Hardware-aware architecture benchmark for language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Processing Systems, volume 37, pag...
-
[13]
Tuning large neural networks via zero-shot hyperparameter transfer
Ge Yang, Edward Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tuning large neural networks via zero-shot hyperparameter transfer. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pag...
work page 2021
-
[14]
Data mixing laws: Optimizing data mixtures by predicting language modeling performance
Jiasheng Ye, Peiju Liu, Tianxiang Sun, Jun Zhan, Yunhua Zhou, and Xipeng Qiu. Data mixing laws: Optimizing data mixtures by predicting language modeling performance. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=jjCB27TMK3
work page 2025
-
[15]
An empirical analysis of compute-optimal large language model training
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Thomas Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karén Simonyan, Erich Elsen, Oriol Vinyals, Jack Rae, and Laurent S...
work page 2022
-
[16]
Stay tuned: An empirical study of the impact of hyperparameters on llm tuning in real-world applications
Alon Halfon, Shai Gretz, Ofir Arviv, Artem Spector, Orith Toledo-Ronen, Yoav Katz, Liat Ein-Dor, Michal Shmueli-Scheuer, and Noam Slonim. Stay tuned: An empirical study of the impact of hyperparameters on llm tuning in real-world applications, 2024. URL https://arxiv.org/abs/2407.18990
-
[17]
autoresearch
Andrej Karpathy. autoresearch. https://github.com/karpathy/autoresearch, 2026
work page 2026
-
[18]
OpenAI. Introducing deep research. https://openai.com/index/introducing-deep-research/, 2025
work page 2025
-
[19]
Aide: AI-driven exploration in the space of code, 2025
Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu. Aide: AI-driven exploration in the space of code, 2025. URL https://arxiv.org/abs/2502.13138
-
[20]
Optuna: A Next-generation Hyperparameter Optimization Framework
Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '19, page 2623–2631, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450362016. doi...
-
[21]
Scikit-optimize: Sequential model-based optimization in python
Scikit-Optimize. Scikit-optimize: Sequential model-based optimization in python. URL https://scikit-optimize.github.io/
-
[22]
Speculations concerning the first ultraintelligent machine
Irving John Good. Speculations concerning the first ultraintelligent machine. Volume 6 of Advances in Computers, pages 31–88. Elsevier, 1966. doi: https://doi.org/10.1016/S0065-2458(08)60418-0. URL https://www.sciencedirect.com/science/article/pii/S0065245808604180
work page 1966
-
[23]
Goedel machines: Self-referential universal problem solvers making provably optimal self-improvements
Juergen Schmidhuber. Goedel machines: Self-referential universal problem solvers making provably optimal self-improvements, 2006. URL https://arxiv.org/abs/cs/0309048
-
[24]
AI with recursive self-improvement
Mingchen Zhuge, Ailing Zeng, Deyao Zhu, Sherry Yang, Vikas Chandra, and Jürgen Schmidhuber. AI with recursive self-improvement. In ICLR 2026 Workshop Proposals, 2026. URL https://openreview.net/forum?id=OsPQ6zTQXV
work page 2026
-
[25]
Posttrainbench: Can llm agents automate llm post-training?
Ben Rank, Hardik Bhatnagar, Ameya Prabhu, Shira Eisenberg, Karina Nguyen, Matthias Bethge, and Maksym Andriushchenko. Posttrainbench: Can llm agents automate llm post-training?,
- [26]
-
[27]
Darwin gödel machine: Open-ended evolution of self-improving agents
Jenny Zhang, Shengran Hu, Cong Lu, Robert Tjarko Lange, and Jeff Clune. Darwin gödel machine: Open-ended evolution of self-improving agents. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=pUpzQZTvGY
work page 2026
-
[28]
Huxley-Gödel machine: Human-level coding agent development by an approximation of the optimal self-improving machine
Wenyi Wang, Piotr Piękos, Li Nanbo, Firas Laakom, Yimeng Chen, Mateusz Ostaszewski, Mingchen Zhuge, and Jürgen Schmidhuber. Huxley-Gödel machine: Human-level coding agent development by an approximation of the optimal self-improving machine. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum...
work page 2026
-
[29]
OpenAI. Learning to reason with LLMs. 2024. URL https://openai.com/index/learning-to-reason-with-llms
work page 2024
-
[30]
Deepseek-r1 incentivizes reasoning in llms through reinforcement learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, September 2025. ISSN 1476-4687. doi: 10.1038/s41586-025-09422-z. URL http://dx.doi.org/10.1038/s41586-025-09422-z
-
[32]
Kimi k1.5: Scaling Reinforcement Learning with LLMs
Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, et al. Kimi k1.5: Scaling reinforcement learning with llms, 2025. URL https://arxiv.org/abs/2501.12599
work page 2025
-
[33]
Deepseekmath: Pushing the limits of mathematical reasoning in open language models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300
work page 2024
-
[34]
difflib — Helpers for computing deltas
Python Software Foundation. difflib — Helpers for computing deltas. 2026. URL https://docs.python.org/3/library/difflib.html
work page 2026
-
[35]
Openai o3 and o4-mini system card
OpenAI. Openai o3 and o4-mini system card. https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf, 2025
work page 2025
-
[36]
Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URL https://arxiv.org/abs/2507.06261
work page 2025
-
[37]
Openai gpt-5 system card, 2025
Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, et al. Openai gpt-5 system card, 2025. URL https://arxiv.org/abs/2601.03267
work page 2025
-
[38]
Qwen3 technical report
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, et al. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388
work page 2025
-
[39]
LlamaFactory: Unified efficient fine-tuning of 100+ language models
Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, and Zheyan Luo. LlamaFactory: Unified efficient fine-tuning of 100+ language models. In Yixin Cao, Yang Feng, and Deyi Xiong, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 400–410, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-demos.38. URL https://aclanthology.org/2024.acl-demos.38/
-
[41]
Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale
Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, and Yuxiong He. Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale. In Proceedings of the International Conference on High Performance Computing, Networking, Stor...
work page 2022
-
[42]
Hybridflow: A flexible and efficient rlhf framework
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys '25, page 1279–1297, New York, NY, USA, 2025. Association for Computing Machinery. ISBN 9798400711961. doi: 1...
-
[43]
Sglang: Efficient execution of structured language model programs
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. Sglang: Efficient execution of structured language model programs. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural I...
-
[44]
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery, 2024. URL https://arxiv.org/abs/2408.06292
work page 2024
-
[45]
Self-driving laboratories for chemistry and materials science
Gary Tom, Stefan P. Schmid, Sterling G. Baird, Yang Cao, Kourosh Darvish, Han Hao, Stanley Lo, Sergio Pablo-García, Ella M. Rajaonson, Marta Skreta, Naruki Yoshikawa, Samantha Corapi, Gun Deniz Akkoc, Felix Strieth-Kalthoff, Martin Seifrid, and Alán Aspuru-Guzik. Self-driving laboratories for chemistry and materials science. Chemical Reviews, 124(16):963...
-
[46]
LLMs for Bayesian optimization in scientific domains: Are we there yet?
Rushil Gupta, Jason Hartford, and Bang Liu. LLMs for Bayesian optimization in scientific domains: Are we there yet? In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025, pages 15482–15510, Suzhou, China, November 2025. Association for Computational ...
-
[47]
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The pile: An 800gb dataset of diverse text for language modeling, 2020. URL https://arxiv.org/abs/2101.00027
work page 2020
-
[48]
Predictable scale: Part i, step law – optimal hyperparameter scaling law in large language model pretraining
Houyi Li, Wenzhen Zheng, Qiufeng Wang, Hanshan Zhang, Zili Wang, Shijie Xuyang, Yuantao Fan, Zhenyu Ding, Haoying Wang, Ning Ding, Shuigeng Zhou, Xiangyu Zhang, and Daxin Jiang. Predictable scale: Part i, step law – optimal hyperparameter scaling law in large language model pretraining, 2025. URL https://arxiv.org/abs/2503.04715
-
[49]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/abs/2110.14168
work page 2021
-
[50]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...
work page 2025
-
[51]
Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-pro: A more robust and challenging multi-task language und...
work page 2024
-
[52]
ADMIRE-bayesopt: Accelerated data MIxture RE-weighting for language models with bayesian optimization
Xu Ouyang, Shengzhuang Chen, Michael Arthur Leopold Pearce, Thomas Hartvigsen, and Jonathan Richard Schwarz. ADMIRE-bayesopt: Accelerated data MIxture RE-weighting for language models with bayesian optimization. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=0Euvm9zDpu
work page 2025
-
[53]
Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James Validad Miranda, Alisa Liu, Nouha Dziri, Xinxi Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Christopher Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannan...
work page 2025