TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration
Pith reviewed 2026-05-10 12:30 UTC · model grok-4.3
The pith
A multi-agent system models LLM fine-tuning as a tree search to automate analysis, strategy, data prep, training, and evaluation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TREX automates the entire LLM training life-cycle through collaboration between two modules: a Researcher, which conducts requirement analysis, open-domain literature and data research, and strategy formulation; and an Executor, which prepares data recipes, performs training, and runs evaluation. The multi-round process is represented as a search tree that supports efficient path planning, reuse of historical results, and distillation of high-level insights, yielding consistent gains in model performance on the FT-Bench tasks.
What carries the argument
The search tree that models the experimental process, enabling the agents to plan exploration paths, reuse historical results, and distill insights from trials.
If this is right
- Requirement analysis, literature research, and strategy formulation occur automatically without manual direction.
- Data recipes, training runs, and evaluations are generated and executed in a closed loop that reuses past outcomes.
- High-level insights extracted from trials guide subsequent steps and reduce redundant exploration.
- Performance gains appear across both fundamental capability tasks and domain-specific applications.
- The full workflow from initial requirements to final model evaluation runs with minimal external oversight.
Where Pith is reading between the lines
- If the tree-based reuse mechanism scales, similar structures could apply to automating other open-ended agent workflows such as full model architecture search.
- The separation of Researcher and Executor roles suggests a template for dividing planning from action in other multi-agent scientific systems.
- Success on the ten FT-Bench tasks raises the question of how the system would behave on training pipelines that involve multiple sequential fine-tuning stages.
- Wider adoption could lower the expertise threshold for organizations to produce custom fine-tuned models.
Load-bearing premise
That the multi-agent orchestration and tree-structured planning can reliably handle the open-ended complexity of real-world LLM training without human intervention or major planning errors.
What would settle it
Observing that TREX fails to improve performance on most of the ten FT-Bench tasks or that its tree planning repeatedly produces invalid strategies or requires human corrections would undermine the automation claim.
read the original abstract
While Large Language Models (LLMs) have empowered AI research agents to perform isolated scientific tasks, automating complex, real-world workflows, such as LLM training, remains a significant challenge. In this paper, we introduce TREX, a multi-agent system that automates the entire LLM training life-cycle. By orchestrating collaboration between two core modules, the Researcher and the Executor, the system seamlessly performs requirement analysis, open-domain literature and data research, formulation of training strategies, preparation of data recipes, and model training and evaluation. The multi-round experimental process is modeled as a search tree, enabling the system to efficiently plan exploration paths, reuse historical results, and distill high-level insights from iterative trials. To evaluate the capability of automated LLM training, we construct FT-Bench, a benchmark comprising 10 tasks derived from real-world scenarios, ranging from optimizing fundamental model capabilities to enhancing performance on domain-specific tasks. Experimental results demonstrate that the TREX agent consistently optimizes model performance on target tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TREX, a multi-agent system with Researcher and Executor modules that automates the full LLM fine-tuning lifecycle, including requirement analysis, literature/data research, strategy formulation, data preparation, training, and evaluation. It models the iterative process as a tree-structured search to plan paths, reuse historical results, and distill insights. A new benchmark FT-Bench with 10 tasks derived from real-world scenarios is constructed, and experiments report that TREX consistently outperforms random and greedy baselines on these tasks.
Significance. If the experimental claims hold under scrutiny, this work would represent a meaningful advance in agentic AI by demonstrating reliable automation of complex, open-ended workflows like end-to-end model training. It could reduce reliance on human experts for iterative experimentation and provide a reusable framework for scaling automated scientific discovery in machine learning.
major comments (1)
- Experimental Results section: The quantitative tables demonstrate TREX outperforming random and greedy baselines across the 10 FT-Bench tasks, but the manuscript does not report the number of independent runs, variance across seeds, or any statistical significance tests. This directly weakens support for the central claim of 'consistent' optimization stated in the abstract and conclusion.
minor comments (3)
- The description of the tree-structured planning and Researcher/Executor loop would benefit from a formal pseudocode algorithm or diagram to clarify the multi-round reuse mechanism.
- FT-Bench task definitions are provided, but additional details on how each task's success criteria and evaluation metrics were chosen would strengthen the benchmark's justification.
- The agent prompts are described as concrete, yet the paper could include a short discussion of observed failure modes or edge cases in the orchestration to address the open-ended complexity concern.
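To make the first minor comment concrete, here is one toy rendering of the multi-round Researcher/Executor cycle the referee would like formalized. Every name and signature below is hypothetical, invented for illustration; the "tree" is flattened to a list for brevity:

```python
def researcher(history):
    """Formulate the next strategy from distilled insights.
    Toy insight: nudge the best-known learning rate upward."""
    best_lr, _ = max(history, key=lambda h: h[1], default=(0.0, float("-inf")))
    return best_lr + 1e-5


def executor(lr):
    """Stand-in for data prep + training + evaluation:
    a toy objective that peaks at lr = 5e-5."""
    return -abs(lr - 5e-5)


def multi_round(rounds=10):
    history = []                        # the experiment tree, flattened for brevity
    for _ in range(rounds):
        strategy = researcher(history)  # Researcher: plan from past trials
        score = executor(strategy)      # Executor: run and measure
        history.append((strategy, score))
    return max(history, key=lambda h: h[1])
```

Even at this level of abstraction, the sketch makes the reuse mechanism explicit: each Researcher call reads the full trial history before proposing the next strategy.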
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive overall assessment. We agree that additional statistical details are needed to strengthen the claims of consistent performance and will revise the manuscript accordingly.
read point-by-point responses
Referee: Experimental Results section: The quantitative tables demonstrate TREX outperforming random and greedy baselines across the 10 FT-Bench tasks, but the manuscript does not report the number of independent runs, variance across seeds, or any statistical significance tests. This directly weakens support for the central claim of 'consistent' optimization stated in the abstract and conclusion.
Authors: We agree that the lack of reported run counts, variance, and significance tests weakens the support for 'consistent' gains. In the revised manuscript, we will update the Experimental Results section and all tables to explicitly state that each configuration was evaluated over 5 independent runs with distinct random seeds. We will report mean performance along with standard deviations. We will also add paired statistical significance tests (Wilcoxon signed-rank test) between TREX and each baseline, including p-values, to rigorously substantiate the improvements across the 10 tasks. Revision: yes.
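For concreteness, the paired test the authors promise could look like the following. The per-task scores here are placeholder numbers invented for illustration, not results from the paper:

```python
import numpy as np
from scipy.stats import wilcoxon

# Placeholder per-task mean scores for the 10 FT-Bench tasks (illustrative only).
trex   = np.array([0.71, 0.64, 0.58, 0.80, 0.66, 0.73, 0.69, 0.62, 0.77, 0.70])
greedy = np.array([0.65, 0.61, 0.55, 0.74, 0.60, 0.70, 0.66, 0.59, 0.71, 0.64])

# Paired, one-sided Wilcoxon signed-rank test: does TREX score higher
# on the same tasks more often than chance would allow?
stat, p_value = wilcoxon(trex, greedy, alternative="greater")
print(f"W = {stat}, p = {p_value:.4f}")
```

Pairing by task matters: an unpaired test would discard the fact that both systems are evaluated on the same 10 benchmarks, and the signed-rank test makes no normality assumption about the score differences.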
Circularity Check
No significant circularity
full rationale
The paper describes an empirical multi-agent system (TREX) for automating LLM fine-tuning and reports experimental results on the constructed FT-Bench benchmark. No derivation chain, equations, fitted parameters, or first-principles predictions exist; performance claims rest directly on tabulated comparisons against random and greedy baselines rather than any self-referential reduction or self-citation load-bearing step. The tree-structured exploration and Researcher/Executor loop are presented as design choices with concrete prompts and task definitions supplied, making the work self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Simon Baker, Ilona Silins, Yufan Guo, Imran Ali, Johan Högberg, Ulla Stenius, and Anna Korhonen. Automatic semantic classification of scientific literature according to the hallmarks of cancer. Bioinformatics, 32(3):432–440, 2016.
- [2] Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, et al. MLE-bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095, 2024.
- [3] Daoyuan Chen, Yilun Huang, Zhijian Ma, Hesen Chen, Xuchen Pan, Ce Ge, Dawei Gao, Yuexiang Xie, Zhaoyang Liu, Jinyang Gao, Yaliang Li, Bolin Ding, and Jingren Zhou. Data-Juicer: A one-stop data processing system for large language models. In International Conference on Management of Data, 2024.
- [4] Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, et al. ScienceAgentBench: Toward rigorous assessment of language agents for data-driven scientific discovery. arXiv preprint arXiv:2410.05080, 2024.
- [5] Junyan Cheng, Peter Clark, and Kyle Richardson. Language modeling by language models. In NIPS.
- [6] Yizhou Chi, Yizhang Lin, Sirui Hong, Duyi Pan, Yaying Fei, Guanghao Mei, Bangbang Liu, Tianqi Pang, Jacky Kwok, Ceyao Zhang, Bang Liu, and Chenglin Wu. SELA: Tree-search enhanced LLM agents for automated machine learning, 2024.
- [7] Rémi Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In International Conference on Computers and Games, pages 72–83. Springer, 2006.
- [8] East Money and Shanghai AI Lab. OpenFinData. https://github.com/open-compass/OpenFinData/, 2023.
- [9] Nick Erickson, Jonas Mueller, Alexander Shirkov, Hang Zhang, Pedro Larroy, Mu Li, and Alexander Smola. AutoGluon-Tabular: Robust and accurate AutoML for structured data. arXiv preprint arXiv:2003.06505, 2020.
- [10] Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Alan Huang, Songyang Zhang, Kai Chen, Zhixin Yin, Zongwen Shen, et al. LawBench: Benchmarking legal knowledge of large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7933–7962, 2024.
- [11] Google DeepMind. Gemini 3 Pro model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf, 2025.
- [12] Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an AI co-scientist. arXiv preprint arXiv:2502.18864, 2025.
- [13] Sikun Guo, Amir Hassan Shariatmadari, Guangzhi Xiong, Albert Huang, Myles Kim, Corey M Williams, Stefan Bekiranov, and Aidong Zhang. IdeaBench: Benchmarking large language models for research idea generation. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, pages 5888–5899, 2025.
- [14] Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. MLAgentBench: Evaluating language agents on machine learning experimentation. arXiv preprint arXiv:2310.03302, 2023.
- [15] Yuki Imajuku, Kohki Horie, Yoichi Iwata, Kensho Aoki, Naohiro Takahashi, and Takuya Akiba. ALE-Bench: A benchmark for long-horizon objective-driven algorithm engineering. arXiv preprint arXiv:2506.09050, 2025.
- [17] Peter Jansen, Marc-Alexandre Côté, Tushar Khot, Erin Bransom, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Oyvind Tafjord, and Peter Clark. DiscoveryWorld: A virtual environment for developing and evaluating automated scientific discovery agents. Advances in Neural Information Processing Systems, 37:10088–10116, 2024.
- [18] Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu. AIDE: AI-driven exploration in the space of code, 2025.
- [19] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023.
- [20] Vladimir Karpukhin, Barlas Oguz, Sewon Min, et al. Dense passage retrieval for open-domain question answering. In EMNLP, 2020.
- [21] Trang T Le, Weixuan Fu, and Jason H Moore. Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics, 36(1):250–256, 2020.
- [22] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474, 2020.
- [23] Annan Li, Chufan Wu, Zengle Ge, Yee Hin Chong, Zhinan Hou, Lizhe Cao, Cheng Ju, Jianmin Wu, Huaiming Li, Haobo Zhang, et al. The FM agent. arXiv preprint arXiv:2510.26144, 2025.
- [24] Jiatong Li, Junxian Li, Yunqing Liu, Dongzhan Zhou, and Qing Li. TOMG-Bench: Evaluating LLMs on text-based open molecule generation. arXiv preprint arXiv:2412.14642, 2024.
- [25] Hao Liang, Xiaochen Ma, Zhou Liu, Zhen Hao Wong, Zhengyang Zhao, Zimo Meng, Runming He, Chengyu Shen, Qifeng Cai, Zhaoyang Han, et al. DataFlow: An LLM-driven framework for unified data preparation and workflow automation in the era of data-centric AI. arXiv preprint arXiv:2512.16676, 2025.
- [26] Siyi Liu, Chen Gao, and Yong Li. AgentHPO: Large language model agent for hyper-parameter optimization. In Conference on Parsimony and Learning, volume 280 of Proceedings of Machine Learning Research, pages 1146–1169. PMLR, 2025.
- [27] Wei Liu, Weihao Zeng, Keqing He, Yong Jiang, and Junxian He. What makes good data for alignment? A comprehensive study of automatic data selection in instruction tuning. In The Twelfth International Conference on Learning Representations, 2024.
- [28] Zexi Liu, Yuzhu Cai, Xinyu Zhu, Yujie Zheng, Runkun Chen, Ying Wen, Yanfeng Wang, Weinan E, and Siheng Chen. ML-Master: Towards AI-for-AI via integration of exploration and reasoning, 2025.
- [29] Chris Lu, Samuel Holt, Claudio Fanconi, Alex J. Chan, Jakob Foerster, Mihaela van der Schaar, and Robert Tjarko Lange. Discovering preference optimization algorithms with and for large language models. In NIPS, 2024.
- [31] Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024.
- [32] Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei-ge Chen, Olga Vrousgos, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, and Ahmed Awadallah. AgentInstruct: Toward generative teaching with agentic flows. arXiv preprint arXiv:2407.03502, 2024.
- [33] Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, et al. MLGym: A new framework and benchmark for advancing AI research agents. arXiv preprint arXiv:2502.14499, 2025.
- [35] Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131, 2025.
- [36] Bo Pang, Yalu Ouyang, Hang Xu, Ziqi Jia, Pan Li, Shengzhao Wen, Lu Wang, Shiyong Li, and Yanpeng Wang. FEVO: Financial knowledge expansion and reasoning evolution for large language models. arXiv preprint arXiv:2507.06057, 2025.
- [37] Yinzhu Quan and Zefang Liu. EconLogicQA: A question-answering benchmark for evaluating large language models in economic sequential reasoning. arXiv preprint arXiv:2405.07938, 2024.
- [38] Kai Ruan, Xuan Wang, Jixiang Hong, Peng Wang, Yang Liu, and Hao Sun. LiveIdeaBench: Evaluating LLMs' divergent thinking for scientific idea generation with minimal context. arXiv preprint arXiv:2412.17596, 2024.
- [39] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, 2013.
- [40] Xiaoshuai Song, Muxi Diao, Guanting Dong, Zhengyang Wang, Yujia Fu, Runqi Qiao, Zhexu Wang, Dayuan Fu, Huangxuan Wu, Bin Liang, et al. CS-Bench: A comprehensive benchmark for large language models towards computer science mastery. arXiv preprint arXiv:2406.08587, 2024.
- [41] Jiabin Tang, Lianghao Xia, Zhonghang Li, and Chao Huang. AI-Researcher: Autonomous scientific innovation. arXiv preprint arXiv:2505.18705, 2025.
- [42] Zhiqiang Tang, Haoyang Fang, Su Zhou, Taojiannan Yang, Zihan Zhong, Tony Hu, Katrin Kirchhoff, and George Karypis. AutoGluon-Multimodal (AutoMM): Supercharging multimodal AutoML with foundation models. arXiv preprint arXiv:2404.16233, 2024.
- [43] NovelSeek Team, Bo Zhang, Shiyang Feng, Xiangchao Yan, Jiakang Yuan, Zhiyin Yu, Xiaohan He, Songtao Huang, Shaowei Hou, Zheng Nie, et al. NovelSeek: When agent becomes the scientist: building closed-loop system from hypothesis to verification. arXiv preprint arXiv:2505.16938, 2025.
- [44] Qwen Team. Qwen3 technical report, 2025.
- [45] Jize Wang, Ma Zerun, Yining Li, Songyang Zhang, Cailian Chen, Kai Chen, and Xinyi Le. GTA: A benchmark for general tool agents. Advances in Neural Information Processing Systems, 37:75749–75790, 2024.
- [46] Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. OpenHands: An open platform for AI software developers as generalist agents. arXiv preprint arXiv:2407.16741, 2024.
- [47] Yixuan Weng, Minjun Zhu, Qiujie Xie, Qiyao Sun, Zhen Lin, Sifan Liu, and Yue Zhang. DeepScientist: Advancing frontier-pushing scientific findings progressively. arXiv preprint arXiv:2509.26603, 2025.
- [48] Hjalmar Wijk, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, Michael Chen, Josh Clymer, Jai Dhyani, et al. RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts. arXiv preprint arXiv:2411.15114, 2024.
- [49] Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, and Daxin Jiang. WizardLM: Empowering large pre-trained language models to follow complex instructions. In ICLR, 2024.
- [50] Ruiling Xu, Yifan Zhang, Qingyun Wang, Carl Edwards, and Heng Ji. oMeBench: Towards robust benchmarking of LLMs in organic mechanism elucidation and reasoning. arXiv preprint arXiv:2510.07731, 2025.
- [51] Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist-v2: Workshop-level automated scientific discovery via agentic tree search. arXiv preprint arXiv:2504.08066, 2025.
- [52] Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, and Meliha Yetisgen. ACI-Bench: A novel ambient clinical intelligence dataset for benchmarking automatic visit note generation. Scientific Data, 10(1):586, 2023.
- [53] Jiakang Yuan, Xiangchao Yan, Shiyang Feng, Bo Zhang, Tao Chen, Botian Shi, Wanli Ouyang, Yu Qiao, Lei Bai, and Bowen Zhou. Dolphin: Closed-loop open-ended auto-research through thinking, practice, and feedback. arXiv preprint arXiv:2501.03916, 2025.
- [54] Wenlin Zhang, Xiaopeng Li, Yingyi Zhang, Pengyue Jia, Yichao Wang, Huifeng Guo, Yong Liu, and Xiangyu Zhao. Deep research: A survey of autonomous research agents. arXiv preprint arXiv:2508.12752, 2025.
discussion (0)