Recognition: no theorem link
AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive
Pith reviewed 2026-05-13 01:12 UTC · model grok-4.3
The pith
AutoLLMResearch trains agents to learn LLM configuration principles from cheap low-fidelity experiments and extrapolate them to expensive high-fidelity settings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AutoLLMResearch formulates configuration search as a long-horizon Markov Decision Process inside a multi-fidelity environment and supplies a training pipeline that rewards agents for learning cross-fidelity extrapolation rules, enabling them to identify promising LLM setups after exposure to cheap proxies rather than repeated expensive trials.
What carries the argument
LLMConfig-Gym, a multi-fidelity environment spanning four LLM experiment tasks and supported by over one million GPU hours of verifiable outcomes, which supplies the structured interaction data needed for agents to practice and internalize extrapolation reasoning.
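As a concrete picture of what such a multi-fidelity environment could expose to an agent, the sketch below models precomputed, verifiable outcomes indexed by configuration and fidelity. All class, field, and method names here are assumptions for illustration, not LLMConfig-Gym's actual API.

```python
# Hypothetical sketch of a multi-fidelity configuration environment in the
# spirit of LLMConfig-Gym; names and signatures are assumptions, not the
# paper's actual interface.
from dataclasses import dataclass, field

@dataclass
class Observation:
    config: dict     # e.g. {"learning_rate": 2e-3, "batch_size": 128}
    fidelity: int    # 0 = cheapest proxy; higher levels cost more GPU hours
    score: float     # verified experiment outcome (e.g. eval loss or accuracy)

@dataclass
class MultiFidelityEnv:
    outcomes: dict   # precomputed table: (sorted config items, fidelity) -> score
    budget: dict     # remaining evaluations allowed per fidelity level
    history: list = field(default_factory=list)

    def step(self, config: dict, fidelity: int) -> Observation:
        """Run one experiment at the requested fidelity and log it."""
        assert self.budget[fidelity] > 0, "fidelity budget exhausted"
        self.budget[fidelity] -= 1
        key = (tuple(sorted(config.items())), fidelity)
        obs = Observation(config, fidelity, self.outcomes[key])
        self.history.append(obs)
        return obs
```

Under this reading, an agent spends most of its budget at fidelity 0 and commits only a handful of calls at the expensive top fidelity, which is exactly the behavior the training pipeline is meant to reward.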
If this is right
- Agents achieve higher success rates than diverse baselines on held-out configuration tasks.
- Performance generalizes across different LLM experiment types within the same multi-fidelity setup.
- The learned decision process produces interpretable traces of the reasoning steps that link low-fidelity observations to high-fidelity recommendations.
- The overall pipeline supplies a reusable template for automating other high-cost experimental configuration problems.
Where Pith is reading between the lines
- If the learned extrapolation rules prove stable, the same training pattern could be reused for configuration tasks in other domains whose cost curves allow cheap proxy measurements, such as certain simulation or hardware-tuning problems.
- Interpretability of the agent's reasoning might surface previously unnoticed regularities in how small changes in architecture or hyperparameters affect large-model behavior.
- The approach implicitly assumes that configuration landscapes share enough low-dimensional structure across fidelities; testing the agents on models whose scale or training regime differs markedly from the training distribution would expose where that assumption breaks.
Load-bearing premise
The multi-fidelity experimental environment captures the structure of the LLM configuration landscape in a way that permits reliable cross-fidelity extrapolation from cheap to expensive settings.
What would settle it
Deploy the trained agents on a fresh collection of high-fidelity LLM experiments withheld from training and measure whether they reach target performance metrics using substantially fewer expensive evaluations than random search or other strong baselines.
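That settling experiment reduces to a simple sample-efficiency metric: expensive evaluations consumed before a target score is reached, compared against a random-search baseline. A minimal sketch, with hypothetical names and illustrative data, not the paper's evaluation harness:

```python
import random

def evals_to_target(propose, outcomes, target, max_evals):
    """Count expensive evaluations until some config reaches the target.

    `propose` is an iterator of candidate configs; `outcomes` maps each
    config to its high-fidelity score (higher is better). Returns the
    number of expensive evaluations spent, or None if the target is not
    reached within `max_evals` (or the proposer runs out of candidates).
    """
    for n, cfg in enumerate(propose, start=1):
        if n > max_evals:
            return None
        if outcomes[cfg] >= target:
            return n
    return None

def random_search(space, seed=0):
    """Baseline proposer: sample configurations uniformly at random."""
    rng = random.Random(seed)
    while True:
        yield rng.choice(space)
```

A trained agent "wins" on this metric if its proposer reaches the target with markedly fewer expensive evaluations than `random_search` on the same held-out outcomes table.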
read the original abstract
Effectively configuring scalable large language model (LLM) experiments, spanning architecture design, hyperparameter tuning, and beyond, is crucial for advancing LLM research, as poor configuration choices can waste substantial computational resources and prevent models from realizing their full potential. Prior automated methods are designed for low-cost settings where repeated trial and error is feasible, but scalable LLM experiments are too expensive for such extensive iteration. To our knowledge, no work has addressed the automation of high-cost LLM experiment configurations, leaving this problem labor-intensive and dependent on expert intuition. Motivated by this gap, we propose AutoLLMResearch, an agentic framework that mimics how human researchers learn generalizable principles from low-fidelity experiments and extrapolate to efficiently identify promising configurations in expensive LLM settings. The core challenge is how to enable an agent to learn, through interaction with a multi-fidelity experimental environment that captures the structure of the LLM configuration landscape. To achieve this, we propose a systematic framework with two key components: 1) LLMConfig-Gym, a multi-fidelity environment encompassing four critical LLM experiment tasks, supported by over one million GPU hours of verifiable experiment outcomes; 2) A structured training pipeline that formulates configuration research as a long-horizon Markov Decision Process and accordingly incentivizes cross-fidelity extrapolation reasoning. Extensive evaluation against diverse strong baselines on held-out experiments demonstrates the effectiveness, generalization, and interpretability of our framework, supporting its potential as a practical and general solution for scalable real-world LLM experiment automation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AutoLLMResearch, an agentic framework for automating high-cost LLM experiment configuration. It introduces LLMConfig-Gym, a multi-fidelity environment spanning four tasks and backed by over 1M GPU hours of verifiable outcomes, together with a training pipeline that casts configuration search as a long-horizon MDP to incentivize cross-fidelity extrapolation reasoning from cheap to expensive regimes. The central claim is that extensive evaluation on held-out experiments demonstrates effectiveness, generalization, and interpretability relative to strong baselines.
Significance. If the cross-fidelity extrapolation result holds, the work would be significant for scalable LLM research automation, where exhaustive search is infeasible. The release of a large-scale, verifiable multi-fidelity dataset constitutes a concrete strength that could support reproducibility and follow-on studies.
major comments (2)
- [Abstract / LLMConfig-Gym] Abstract and LLMConfig-Gym description: the extrapolation claim rests on the assumption that low-fidelity proxies preserve the relative ordering and local structure of the high-fidelity configuration landscape, yet no fidelity definitions (e.g., model-size or token reductions), correlation statistics, or ablation results showing that low-fidelity rankings predict high-fidelity ones are supplied. Without these, it is impossible to rule out that the learned policy overfits to cheap artifacts.
- [Evaluation] Evaluation section: the abstract states that held-out experiments demonstrate effectiveness against diverse baselines, but the manuscript provides neither the precise baseline implementations, quantitative tables of gains, nor controls (e.g., high-fidelity-only training) that would confirm the gains arise from cross-fidelity reasoning rather than post-hoc selection or environment-specific tuning.
minor comments (1)
- [Training pipeline] Clarify the exact state, action, and reward definitions used in the long-horizon MDP formulation of the training pipeline.
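For orientation, one plausible reading of the requested state, action, and reward definitions is sketched below; every choice here is an assumption by this review, not the paper's actual formulation.

```python
# One plausible reading of the long-horizon MDP; all definitions below are
# assumptions for illustration, not the paper's formulation.
from dataclasses import dataclass

# State: the interaction history so far -- (config, fidelity, score) triples
# plus the remaining evaluation budget at each fidelity level.

@dataclass(frozen=True)
class Action:
    config: tuple   # hashable configuration, e.g. (("lr", 2e-3), ("bs", 128))
    fidelity: int   # which fidelity level to spend one evaluation on

def reward(best_high_fidelity_score: float, terminal: bool) -> float:
    """Sparse terminal reward: the best high-fidelity score found.

    Intermediate steps earn zero, so cheap low-fidelity probes pay off only
    insofar as they improve the final expensive recommendation.
    """
    return best_high_fidelity_score if terminal else 0.0
```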
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and agree that the manuscript requires additional details to fully support the cross-fidelity extrapolation claims and evaluation rigor. Revisions will be made accordingly.
read point-by-point responses
-
Referee: [Abstract / LLMConfig-Gym] Abstract and LLMConfig-Gym description: the extrapolation claim rests on the assumption that low-fidelity proxies preserve the relative ordering and local structure of the high-fidelity configuration landscape, yet no fidelity definitions (e.g., model-size or token reductions), correlation statistics, or ablation results showing that low-fidelity rankings predict high-fidelity ones are supplied. Without these, it is impossible to rule out that the learned policy overfits to cheap artifacts.
Authors: We acknowledge that the current manuscript does not include explicit fidelity definitions, correlation statistics, or ablations to demonstrate preservation of landscape structure across fidelities. In the revised version, we will add a dedicated subsection under LLMConfig-Gym that defines the fidelity levels for each of the four tasks (including specific model-size reductions, token limits, and other proxy parameters), reports Spearman and Pearson correlations between low- and high-fidelity outcomes across sampled configurations, and presents ablation studies comparing the full multi-fidelity policy against low-fidelity-only training to show that gains arise from learned extrapolation rather than overfitting to cheap artifacts. revision: yes
-
Referee: [Evaluation] Evaluation section: the abstract states that held-out experiments demonstrate effectiveness against diverse baselines, but the manuscript provides neither the precise baseline implementations, quantitative tables of gains, nor controls (e.g., high-fidelity-only training) that would confirm the gains arise from cross-fidelity reasoning rather than post-hoc selection or environment-specific tuning.
Authors: We agree that the evaluation section lacks sufficient transparency and controls. We will expand it to provide precise implementation details and hyperparameters for all baselines, include comprehensive quantitative tables with performance metrics, relative gains, and statistical significance tests on held-out experiments, and add a high-fidelity-only training control to isolate the benefit of the cross-fidelity MDP pipeline. These additions will clarify that observed improvements stem from extrapolation reasoning rather than other factors. revision: yes
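The fidelity-agreement statistics promised in the responses above can be sketched in a few lines: Spearman's rho between low- and high-fidelity scores for the same sampled configurations. The implementation below is a generic, self-contained rank correlation, not code from the paper.

```python
def ranks(xs):
    """Rank values (1 = smallest), averaging ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank across the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(low, high):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rl, rh = ranks(low), ranks(high)
    n = len(rl)
    ml, mh = sum(rl) / n, sum(rh) / n
    cov = sum((a - ml) * (b - mh) for a, b in zip(rl, rh))
    vl = sum((a - ml) ** 2 for a in rl) ** 0.5
    vh = sum((b - mh) ** 2 for b in rh) ** 0.5
    return cov / (vl * vh)
```

A rho near 1 over sampled configurations would support the claim that low-fidelity rankings predict high-fidelity ones; a weak or negative rho would indicate the policy may be fitting cheap artifacts.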
Circularity Check
No circularity: performance claims rest on empirical held-out evaluation of independently run experiments
full rationale
The paper constructs LLMConfig-Gym from >1M GPU hours of real experiment outcomes across four tasks, formulates configuration search as a long-horizon MDP, trains an agent to learn cross-fidelity extrapolation, and reports effectiveness via comparison to baselines on held-out experiments. No derivation step reduces by construction to its own inputs: the environment data are independent empirical measurements, the agent's policy is learned from interaction, and success is measured on unseen configurations rather than fitted parameters or self-referential definitions. The central claim is therefore self-contained against the constructed benchmark and does not rely on self-citation chains, ansatz smuggling, or renaming of known results. The cross-fidelity assumption is a validity claim, not a circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Low-fidelity experiments capture transferable structure of the LLM configuration landscape
invented entities (1)
- LLMConfig-Gym: no independent evidence
Reference graph
Works this paper leans on
-
[1]
Droste Effect. 2026. URL https://en.wikipedia.org/wiki/Droste_effect
work page 2026
-
[2]
Towards learning universal hyperparameter optimizers with transformers
Yutian Chen, Xingyou Song, Chansoo Lee, Zi Wang, Richard Zhang, David Dohan, Kazuya Kawakami, Greg Kochanski, Arnaud Doucet, Marc'Aurelio Ranzato, Sagi Perel, and Nando de Freitas. Towards learning universal hyperparameter optimizers with transformers. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Informa...
work page 2022
-
[3]
Meta-learning acquisition functions for transfer learning in bayesian optimization
Michael Volpp, Lukas P. Fröhlich, Kirsten Fischer, Andreas Doerr, Stefan Falkner, Frank Hutter, and Christian Daniel. Meta-learning acquisition functions for transfer learning in bayesian optimization. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=ryeYpJSKwr
work page 2020
-
[4]
End-to-end meta-bayesian optimisation with transformer neural processes
Alexandre Maraval, Matthieu Zimmer, Antoine Grosnit, and Haitham Bou Ammar. End-to-end meta-bayesian optimisation with transformer neural processes. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 11246–11260. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/2561721d0ca69bab22b749cfc4f48f6c-Paper-Conference.pdf
work page 2023
-
[6]
Few-shot bayesian optimization with deep kernel surrogates
Martin Wistuba and Josif Grabocka. Few-shot bayesian optimization with deep kernel surrogates. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=bJxgv5C3sYc
work page 2021
-
[7]
Large language models to enhance bayesian optimization
Tennison Liu, Nicolás Astorga, Nabeel Seedat, and Mihaela van der Schaar. Large language models to enhance bayesian optimization. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=OOxotBmGol
work page 2024
-
[8]
Using large language models for hyperparameter optimization
Michael Zhang, Nishkrit Desai, Juhan Bae, Jonathan Lorraine, and Jimmy Ba. Using large language models for hyperparameter optimization. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023. URL https://openreview.net/forum?id=FUdZ6HEOre
work page 2023
-
[9]
Sequential large language model-based hyper-parameter optimization
Kanan Mahammadli and Seyda Ertekin. Sequential large language model-based hyper-parameter optimization, 2025. URL https://arxiv.org/abs/2410.20302
-
[10]
AgentHPO: Large language model agent for hyper-parameter optimization
Siyi Liu, Chen Gao, and Yong Li. AgentHPO: Large language model agent for hyper-parameter optimization. In Beidi Chen, Shijia Liu, Mert Pilanci, Weijie Su, Jeremias Sulam, Yuxiang Wang, and Zhihui Zhu, editors, Conference on Parsimony and Learning, volume 280 of Proceedings of Machine Learning Research, pages 1146–1169. PMLR, 24–27 Mar 2025. URL https://pro...
work page 2025
-
[11]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361
work page 2020
-
[12]
Hw-gpt-bench: Hardware-aware architecture benchmark for language models
Rhea Sanjay Sukthanker, Arber Zela, Benedikt Staffler, Aaron Klein, Lennart Purucker, Jörg K. H. Franke, and Frank Hutter. Hw-gpt-bench: Hardware-aware architecture benchmark for language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Processing Systems, volume 37, pag...
-
[13]
Tuning large neural networks via zero-shot hyperparameter transfer
Ge Yang, Edward Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tuning large neural networks via zero-shot hyperparameter transfer. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pag...
work page 2021
-
[14]
Data mixing laws: Optimizing data mixtures by predicting language modeling performance
Jiasheng Ye, Peiju Liu, Tianxiang Sun, Jun Zhan, Yunhua Zhou, and Xipeng Qiu. Data mixing laws: Optimizing data mixtures by predicting language modeling performance. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=jjCB27TMK3
work page 2025
-
[15]
An empirical analysis of compute-optimal large language model training
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Thomas Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karén Simonyan, Erich Elsen, Oriol Vinyals, Jack Rae, and Laurent S...
work page 2022
-
[16]
Stay tuned: An empirical study of the impact of hyperparameters on llm tuning in real-world applications
Alon Halfon, Shai Gretz, Ofir Arviv, Artem Spector, Orith Toledo-Ronen, Yoav Katz, Liat Ein-Dor, Michal Shmueli-Scheuer, and Noam Slonim. Stay tuned: An empirical study of the impact of hyperparameters on llm tuning in real-world applications, 2024. URL https://arxiv.org/abs/2407.18990
-
[17]
autoresearch
Andrej Karpathy. autoresearch. https://github.com/karpathy/autoresearch, 2026
work page 2026
-
[18]
OpenAI. Introducing deep research. https://openai.com/index/introducing-deep-research/, 2025
work page 2025
-
[19]
Aide: AI-driven exploration in the space of code, 2025
Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu. Aide: AI-driven exploration in the space of code, 2025. URL https://arxiv.org/abs/2502.13138
-
[20]
Optuna: A Next-generation Hyperparameter Optimization Framework
Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '19, page 2623–2631, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450362016. doi...
-
[21]
Scikit-optimize: Sequential model-based optimization in python
Scikit-Optimize. Scikit-optimize: Sequential model-based optimization in python. URL https://scikit-optimize.github.io/
-
[22]
Speculations concerning the first ultraintelligent machine
Irving John Good. Speculations concerning the first ultraintelligent machine. Volume 6 of Advances in Computers, pages 31–88. Elsevier, 1966. doi: https://doi.org/10.1016/S0065-2458(08)60418-0. URL https://www.sciencedirect.com/science/article/pii/S0065245808604180
work page 1966
-
[23]
Goedel machines: Self-referential universal problem solvers making provably optimal self-improvements
Juergen Schmidhuber. Goedel machines: Self-referential universal problem solvers making provably optimal self-improvements, 2006. URL https://arxiv.org/abs/cs/0309048
-
[24]
AI with recursive self-improvement
Mingchen Zhuge, Ailing Zeng, Deyao Zhu, Sherry Yang, Vikas Chandra, and Jürgen Schmidhuber. AI with recursive self-improvement. In ICLR 2026 Workshop Proposals, 2026. URL https://openreview.net/forum?id=OsPQ6zTQXV
work page 2026
-
[25]
Posttrainbench: Can llm agents automate llm post-training?
Ben Rank, Hardik Bhatnagar, Ameya Prabhu, Shira Eisenberg, Karina Nguyen, Matthias Bethge, and Maksym Andriushchenko. Posttrainbench: Can llm agents automate llm post-training?,
- [26]
-
[27]
Darwin gödel machine: Open-ended evolution of self-improving agents
Jenny Zhang, Shengran Hu, Cong Lu, Robert Tjarko Lange, and Jeff Clune. Darwin gödel machine: Open-ended evolution of self-improving agents. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=pUpzQZTvGY
work page 2026
-
[28]
Huxley-Gödel machine: Human-level coding agent development by an approximation of the optimal self-improving machine
Wenyi Wang, Piotr Piękos, Li Nanbo, Firas Laakom, Yimeng Chen, Mateusz Ostaszewski, Mingchen Zhuge, and Jürgen Schmidhuber. Huxley-Gödel machine: Human-level coding agent development by an approximation of the optimal self-improving machine. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum...
work page 2026
-
[29]
OpenAI. Learning to reason with LLMs. 2024. URL https://openai.com/index/learning-to-reason-with-llms
work page 2024
-
[30]
Deepseek-r1 incentivizes reasoning in llms through reinforcement learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, September 2025. ISSN 1476-4687. doi: 10.1038/s41586-025-09422-z. URL http://dx.doi.org/10.1038/s41586-025-09422-z
-
[32]
Kimi k1.5: Scaling Reinforcement Learning with LLMs
Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, et al. Kimi k1.5: Scaling reinforcement learning with llms, 2025. URL https://arxiv.org/abs/2501.12599
work page 2025
-
[33]
Deepseekmath: Pushing the limits of mathematical reasoning in open language models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300
work page 2024
-
[34]
difflib — Helpers for computing deltas
Python Software Foundation. difflib — Helpers for computing deltas. 2026. URL https://docs.python.org/3/library/difflib.html
work page 2026
-
[35]
Openai o3 and o4-mini system card
OpenAI. Openai o3 and o4-mini system card. https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf, 2025
work page 2025
-
[36]
Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URL https://arxiv.org/abs/2507.06261
work page 2025
-
[37]
Openai gpt-5 system card, 2025
Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, et al. Openai gpt-5 system card, 2025. URL https://arxiv.org/abs/2601.03267
work page 2025
-
[38]
Qwen3 technical report
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, et al. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388
work page 2025
-
[39]
LlamaFactory: Unified efficient fine-tuning of 100+ language models
Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, and Zheyan Luo. LlamaFactory: Unified efficient fine-tuning of 100+ language models. In Yixin Cao, Yang Feng, and Deyi Xiong, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 400–410, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-demos.38. URL https://aclanthology.org/2024.acl-demos.38/
-
[41]
Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale
Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, and Yuxiong He. Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale. In Proceedings of the International Conference on High Performance Computing, Networking, Stor...
work page 2022
-
[42]
Hybridflow: A flexible and efficient rlhf framework
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys '25, page 1279–1297, New York, NY, USA, 2025. Association for Computing Machinery. ISBN 9798400711961. doi: 1...
-
[43]
Sglang: Efficient execution of structured language model programs
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. Sglang: Efficient execution of structured language model programs. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural I...
-
[44]
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery, 2024. URL https://arxiv.org/abs/2408.06292
work page 2024
-
[45]
Self-driving laboratories for chemistry and materials science
Gary Tom, Stefan P. Schmid, Sterling G. Baird, Yang Cao, Kourosh Darvish, Han Hao, Stanley Lo, Sergio Pablo-García, Ella M. Rajaonson, Marta Skreta, Naruki Yoshikawa, Samantha Corapi, Gun Deniz Akkoc, Felix Strieth-Kalthoff, Martin Seifrid, and Alán Aspuru-Guzik. Self-driving laboratories for chemistry and materials science. Chemical Reviews, 124(16):963...
-
[46]
LLMs for Bayesian optimization in scientific domains: Are we there yet?
Rushil Gupta, Jason Hartford, and Bang Liu. LLMs for Bayesian optimization in scientific domains: Are we there yet? In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025, pages 15482–15510, Suzhou, China, November 2025. Association for Computational ...
-
[47]
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The pile: An 800gb dataset of diverse text for language modeling, 2020. URL https://arxiv.org/abs/2101.00027
work page 2020
-
[48]
Predictable scale: Part i, step law – optimal hyperparameter scaling law in large language model pretraining
Houyi Li, Wenzhen Zheng, Qiufeng Wang, Hanshan Zhang, Zili Wang, Shijie Xuyang, Yuantao Fan, Zhenyu Ding, Haoying Wang, Ning Ding, Shuigeng Zhou, Xiangyu Zhang, and Daxin Jiang. Predictable scale: Part i, step law – optimal hyperparameter scaling law in large language model pretraining, 2025. URL https://arxiv.org/abs/2503.04715
-
[49]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/abs/2110.14168
work page 2021
-
[50]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...
work page 2025
-
[51]
Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-pro: A more robust and challenging multi-task language und...
work page 2024
-
[52]
ADMIRE-bayesopt: Accelerated data MIxture RE-weighting for language models with bayesian optimization
Xu Ouyang, Shengzhuang Chen, Michael Arthur Leopold Pearce, Thomas Hartvigsen, and Jonathan Richard Schwarz. ADMIRE-bayesopt: Accelerated data MIxture RE-weighting for language models with bayesian optimization. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=0Euvm9zDpu
work page 2025
-
[53]
Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James Validad Miranda, Alisa Liu, Nouha Dziri, Xinxi Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Christopher Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannan...
work page 2025