TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration
Pith reviewed 2026-05-10 12:30 UTC · model grok-4.3
The pith
A multi-agent system models LLM fine-tuning as a tree search to automate analysis, strategy, data prep, training, and evaluation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TREX automates the entire LLM training life-cycle through collaboration between two modules: a Researcher, which conducts requirement analysis, open-domain literature and data research, and strategy formulation; and an Executor, which prepares data recipes, performs training, and runs evaluation. The multi-round process is represented as a search tree that supports efficient path planning, reuse of historical results, and distillation of high-level insights, yielding consistent gains in model performance on the FT-Bench tasks.
What carries the argument
The search tree that models the experimental process, enabling the agents to plan exploration paths, reuse historical results, and distill insights from trials.
If this is right
- Requirement analysis, literature research, and strategy formulation occur automatically without manual direction.
- Data recipes, training runs, and evaluations are generated and executed in a closed loop that reuses past outcomes.
- High-level insights extracted from trials guide subsequent steps and reduce redundant exploration.
- Performance gains appear across both fundamental capability tasks and domain-specific applications.
- The full workflow from initial requirements to final model evaluation runs with minimal external oversight.
Where Pith is reading between the lines
- If the tree-based reuse mechanism scales, similar structures could apply to automating other open-ended agent workflows such as full model architecture search.
- The separation of Researcher and Executor roles suggests a template for dividing planning from action in other multi-agent scientific systems.
- Success on the ten FT-Bench tasks raises the question of how the system would behave on training pipelines that involve multiple sequential fine-tuning stages.
- Wider adoption could lower the expertise threshold for organizations to produce custom fine-tuned models.
Load-bearing premise
That the multi-agent orchestration and tree-structured planning can reliably handle the open-ended complexity of real-world LLM training without human intervention or major planning errors.
What would settle it
Observing that TREX fails to improve performance on most of the ten FT-Bench tasks or that its tree planning repeatedly produces invalid strategies or requires human corrections would undermine the automation claim.
read the original abstract
While Large Language Models (LLMs) have empowered AI research agents to perform isolated scientific tasks, automating complex, real-world workflows, such as LLM training, remains a significant challenge. In this paper, we introduce TREX, a multi-agent system that automates the entire LLM training life-cycle. By orchestrating collaboration between two core modules, the Researcher and the Executor, the system seamlessly performs requirement analysis, open-domain literature and data research, formulation of training strategies, preparation of data recipes, and model training and evaluation. The multi-round experimental process is modeled as a search tree, enabling the system to efficiently plan exploration paths, reuse historical results, and distill high-level insights from iterative trials. To evaluate the capability of automated LLM training, we construct FT-Bench, a benchmark comprising 10 tasks derived from real-world scenarios, ranging from optimizing fundamental model capabilities to enhancing performance on domain-specific tasks. Experimental results demonstrate that the TREX agent consistently optimizes model performance on target tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TREX, a multi-agent system with Researcher and Executor modules that automates the full LLM fine-tuning lifecycle, including requirement analysis, literature/data research, strategy formulation, data preparation, training, and evaluation. It models the iterative process as a tree-structured search to plan paths, reuse historical results, and distill insights. A new benchmark FT-Bench with 10 tasks derived from real-world scenarios is constructed, and experiments report that TREX consistently outperforms random and greedy baselines on these tasks.
Significance. If the experimental claims hold under scrutiny, this work would represent a meaningful advance in agentic AI by demonstrating reliable automation of complex, open-ended workflows like end-to-end model training. It could reduce reliance on human experts for iterative experimentation and provide a reusable framework for scaling automated scientific discovery in machine learning.
major comments (1)
- Experimental Results section: The quantitative tables demonstrate TREX outperforming random and greedy baselines across the 10 FT-Bench tasks, but the manuscript does not report the number of independent runs, variance across seeds, or any statistical significance tests. This directly weakens support for the central claim of 'consistent' optimization stated in the abstract and conclusion.
minor comments (3)
- The description of the tree-structured planning and Researcher/Executor loop would benefit from a formal pseudocode algorithm or diagram to clarify the multi-round reuse mechanism.
- FT-Bench task definitions are provided, but additional details on how each task's success criteria and evaluation metrics were chosen would strengthen the benchmark's justification.
- The agent prompts are described as concrete, yet the paper could include a short discussion of observed failure modes or edge cases in the orchestration to address the open-ended complexity concern.
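To make the first minor comment concrete, here is one toy rendering of the multi-round Researcher/Executor cycle the referee would like formalized. Every name and signature below is hypothetical, invented for illustration; the "tree" is flattened to a list for brevity:

```python
def researcher(history):
    """Formulate the next strategy from distilled insights.
    Toy insight: nudge the best-known learning rate upward."""
    best_lr, _ = max(history, key=lambda h: h[1], default=(0.0, float("-inf")))
    return best_lr + 1e-5


def executor(lr):
    """Stand-in for data prep + training + evaluation:
    a toy objective that peaks at lr = 5e-5."""
    return -abs(lr - 5e-5)


def multi_round(rounds=10):
    history = []                        # the experiment tree, flattened for brevity
    for _ in range(rounds):
        strategy = researcher(history)  # Researcher: plan from past trials
        score = executor(strategy)      # Executor: run and measure
        history.append((strategy, score))
    return max(history, key=lambda h: h[1])
```

Even at this level of abstraction, the sketch makes the reuse mechanism explicit: each Researcher call reads the full trial history before proposing the next strategy.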
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive overall assessment. We agree that additional statistical details are needed to strengthen the claims of consistent performance and will revise the manuscript accordingly.
read point-by-point responses
Referee: Experimental Results section: The quantitative tables demonstrate TREX outperforming random and greedy baselines across the 10 FT-Bench tasks, but the manuscript does not report the number of independent runs, variance across seeds, or any statistical significance tests. This directly weakens support for the central claim of 'consistent' optimization stated in the abstract and conclusion.
Authors: We agree that the lack of reported run counts, variance, and significance tests weakens the support for 'consistent' gains. In the revised manuscript, we will update the Experimental Results section and all tables to explicitly state that each configuration was evaluated over 5 independent runs with distinct random seeds. We will report mean performance along with standard deviations. We will also add paired statistical significance tests (Wilcoxon signed-rank test) between TREX and each baseline, including p-values, to rigorously substantiate the improvements across the 10 tasks. Revision: yes.
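For concreteness, the paired test the authors promise could look like the following. The per-task scores here are placeholder numbers invented for illustration, not results from the paper:

```python
import numpy as np
from scipy.stats import wilcoxon

# Placeholder per-task mean scores for the 10 FT-Bench tasks (illustrative only).
trex   = np.array([0.71, 0.64, 0.58, 0.80, 0.66, 0.73, 0.69, 0.62, 0.77, 0.70])
greedy = np.array([0.65, 0.61, 0.55, 0.74, 0.60, 0.70, 0.66, 0.59, 0.71, 0.64])

# Paired, one-sided Wilcoxon signed-rank test: does TREX score higher
# on the same tasks more often than chance would allow?
stat, p_value = wilcoxon(trex, greedy, alternative="greater")
print(f"W = {stat}, p = {p_value:.4f}")
```

Pairing by task matters: an unpaired test would discard the fact that both systems are evaluated on the same 10 benchmarks, and the signed-rank test makes no normality assumption about the score differences.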
Circularity Check
No significant circularity
full rationale
The paper describes an empirical multi-agent system (TREX) for automating LLM fine-tuning and reports experimental results on the constructed FT-Bench benchmark. No derivation chain, equations, fitted parameters, or first-principles predictions exist; performance claims rest directly on tabulated comparisons against random and greedy baselines rather than any self-referential reduction or self-citation load-bearing step. The tree-structured exploration and Researcher/Executor loop are presented as design choices with concrete prompts and task definitions supplied, making the work self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Simon Baker, Ilona Silins, Yufan Guo, Imran Ali, Johan Högberg, Ulla Stenius, and Anna Korhonen. Automatic semantic classification of scientific literature according to the hallmarks of cancer. Bioinformatics, 32(3):432–440, 2016.
- [2] Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, et al. MLE-bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095, 2024.
- [3] Daoyuan Chen, Yilun Huang, Zhijian Ma, Hesen Chen, Xuchen Pan, Ce Ge, Dawei Gao, Yuexiang Xie, Zhaoyang Liu, Jinyang Gao, Yaliang Li, Bolin Ding, and Jingren Zhou. Data-Juicer: A one-stop data processing system for large language models. In International Conference on Management of Data, 2024.
- [4] Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, et al. ScienceAgentBench: Toward rigorous assessment of language agents for data-driven scientific discovery. arXiv preprint arXiv:2410.05080, 2024.
- [5] Junyan Cheng, Peter Clark, and Kyle Richardson. Language modeling by language models. In NIPS.
- [6] Yizhou Chi, Yizhang Lin, Sirui Hong, Duyi Pan, Yaying Fei, Guanghao Mei, Bangbang Liu, Tianqi Pang, Jacky Kwok, Ceyao Zhang, Bang Liu, and Chenglin Wu. SELA: Tree-search enhanced LLM agents for automated machine learning, 2024.
- [7] Rémi Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In International Conference on Computers and Games, pages 72–83. Springer, 2006.
- [8] East Money and Shanghai AI Lab. OpenFinData. https://github.com/open-compass/OpenFinData/, 2023.
- [9] Nick Erickson, Jonas Mueller, Alexander Shirkov, Hang Zhang, Pedro Larroy, Mu Li, and Alexander Smola. AutoGluon-Tabular: Robust and accurate AutoML for structured data. arXiv preprint arXiv:2003.06505, 2020.
- [10] Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Alan Huang, Songyang Zhang, Kai Chen, Zhixin Yin, Zongwen Shen, et al. LawBench: Benchmarking legal knowledge of large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7933–7962, 2024.
- [11] Google DeepMind. Gemini 3 Pro model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf, 2025.
- [12] Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an AI co-scientist. arXiv preprint arXiv:2502.18864, 2025.
- [13] Sikun Guo, Amir Hassan Shariatmadari, Guangzhi Xiong, Albert Huang, Myles Kim, Corey M Williams, Stefan Bekiranov, and Aidong Zhang. IdeaBench: Benchmarking large language models for research idea generation. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, pages 5888–5899, 2025.
- [14] Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. MLAgentBench: Evaluating language agents on machine learning experimentation. arXiv preprint arXiv:2310.03302, 2023.
- [15] Yuki Imajuku, Kohki Horie, Yoichi Iwata, Kensho Aoki, Naohiro Takahashi, and Takuya Akiba. ALE-Bench: A benchmark for long-horizon objective-driven algorithm engineering. arXiv preprint arXiv:2506.09050, 2025.
- [17] Peter Jansen, Marc-Alexandre Côté, Tushar Khot, Erin Bransom, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Oyvind Tafjord, and Peter Clark. DiscoveryWorld: A virtual environment for developing and evaluating automated scientific discovery agents. Advances in Neural Information Processing Systems, 37:10088–10116, 2024.
- [18] Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu. AIDE: AI-driven exploration in the space of code, 2025.
- [19] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023.
- [20] Vladimir Karpukhin, Barlas Oguz, Sewon Min, et al. Dense passage retrieval for open-domain question answering. In EMNLP, 2020.
- [21] Trang T Le, Weixuan Fu, and Jason H Moore. Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics, 36(1):250–256, 2020.
- [22] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474, 2020.
- [23] Annan Li, Chufan Wu, Zengle Ge, Yee Hin Chong, Zhinan Hou, Lizhe Cao, Cheng Ju, Jianmin Wu, Huaiming Li, Haobo Zhang, et al. The FM agent. arXiv preprint arXiv:2510.26144, 2025.
- [24] Jiatong Li, Junxian Li, Yunqing Liu, Dongzhan Zhou, and Qing Li. TOMG-Bench: Evaluating LLMs on text-based open molecule generation. arXiv preprint arXiv:2412.14642, 2024.
- [25] Hao Liang, Xiaochen Ma, Zhou Liu, Zhen Hao Wong, Zhengyang Zhao, Zimo Meng, Runming He, Chengyu Shen, Qifeng Cai, Zhaoyang Han, et al. DataFlow: An LLM-driven framework for unified data preparation and workflow automation in the era of data-centric AI. arXiv preprint arXiv:2512.16676, 2025.
- [26] Siyi Liu, Chen Gao, and Yong Li. AgentHPO: Large language model agent for hyper-parameter optimization. In Conference on Parsimony and Learning, volume 280 of Proceedings of Machine Learning Research, pages 1146–1169. PMLR, 2025.
- [27] Wei Liu, Weihao Zeng, Keqing He, Yong Jiang, and Junxian He. What makes good data for alignment? A comprehensive study of automatic data selection in instruction tuning. In The Twelfth International Conference on Learning Representations, 2024.
- [28] Zexi Liu, Yuzhu Cai, Xinyu Zhu, Yujie Zheng, Runkun Chen, Ying Wen, Yanfeng Wang, Weinan E, and Siheng Chen. ML-Master: Towards AI-for-AI via integration of exploration and reasoning, 2025.
- [29] Chris Lu, Samuel Holt, Claudio Fanconi, Alex J. Chan, Jakob Foerster, Mihaela van der Schaar, and Robert Tjarko Lange. Discovering preference optimization algorithms with and for large language models. In NIPS, 2024.
- [31] Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024.
- [32] Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei-ge Chen, Olga Vrousgos, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, and Ahmed Awadallah. AgentInstruct: Toward generative teaching with agentic flows. arXiv preprint arXiv:2407.03502, 2024.
- [33] Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, et al. MLGym: A new framework and benchmark for advancing AI research agents. arXiv preprint arXiv:2502.14499, 2025.
- [35] Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131, 2025.
- [36] Bo Pang, Yalu Ouyang, Hang Xu, Ziqi Jia, Pan Li, Shengzhao Wen, Lu Wang, Shiyong Li, and Yanpeng Wang. FEVO: Financial knowledge expansion and reasoning evolution for large language models. arXiv preprint arXiv:2507.06057, 2025.
- [37] Yinzhu Quan and Zefang Liu. EconLogicQA: A question-answering benchmark for evaluating large language models in economic sequential reasoning. arXiv preprint arXiv:2405.07938, 2024.
- [38] Kai Ruan, Xuan Wang, Jixiang Hong, Peng Wang, Yang Liu, and Hao Sun. LiveIdeaBench: Evaluating LLMs' divergent thinking for scientific idea generation with minimal context. arXiv preprint arXiv:2412.17596, 2024.
- [39] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, 2013.
- [40] Xiaoshuai Song, Muxi Diao, Guanting Dong, Zhengyang Wang, Yujia Fu, Runqi Qiao, Zhexu Wang, Dayuan Fu, Huangxuan Wu, Bin Liang, et al. CS-Bench: A comprehensive benchmark for large language models towards computer science mastery. arXiv preprint arXiv:2406.08587, 2024.
- [41] Jiabin Tang, Lianghao Xia, Zhonghang Li, and Chao Huang. AI-Researcher: Autonomous scientific innovation. arXiv preprint arXiv:2505.18705, 2025.
- [42] Zhiqiang Tang, Haoyang Fang, Su Zhou, Taojiannan Yang, Zihan Zhong, Tony Hu, Katrin Kirchhoff, and George Karypis. AutoGluon-Multimodal (AutoMM): Supercharging multimodal AutoML with foundation models. arXiv preprint arXiv:2404.16233, 2024.
- [43] NovelSeek Team, Bo Zhang, Shiyang Feng, Xiangchao Yan, Jiakang Yuan, Zhiyin Yu, Xiaohan He, Songtao Huang, Shaowei Hou, Zheng Nie, et al. NovelSeek: When agent becomes the scientist: building closed-loop system from hypothesis to verification. arXiv preprint arXiv:2505.16938, 2025.
- [44] Qwen Team. Qwen3 technical report, 2025.
- [45] Jize Wang, Ma Zerun, Yining Li, Songyang Zhang, Cailian Chen, Kai Chen, and Xinyi Le. GTA: A benchmark for general tool agents. Advances in Neural Information Processing Systems, 37:75749–75790, 2024.
- [46] Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. OpenHands: An open platform for AI software developers as generalist agents. arXiv preprint arXiv:2407.16741, 2024.
- [47] Yixuan Weng, Minjun Zhu, Qiujie Xie, Qiyao Sun, Zhen Lin, Sifan Liu, and Yue Zhang. DeepScientist: Advancing frontier-pushing scientific findings progressively. arXiv preprint arXiv:2509.26603, 2025.
- [48] Hjalmar Wijk, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, Michael Chen, Josh Clymer, Jai Dhyani, et al. RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts. arXiv preprint arXiv:2411.15114, 2024.
- [49] Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, and Daxin Jiang. WizardLM: Empowering large pre-trained language models to follow complex instructions. In ICLR, 2024.
- [50] Ruiling Xu, Yifan Zhang, Qingyun Wang, Carl Edwards, and Heng Ji. oMeBench: Towards robust benchmarking of LLMs in organic mechanism elucidation and reasoning. arXiv preprint arXiv:2510.07731, 2025.
- [51] Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist-v2: Workshop-level automated scientific discovery via agentic tree search. arXiv preprint arXiv:2504.08066, 2025.
- [52] Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, and Meliha Yetisgen. ACI-Bench: A novel ambient clinical intelligence dataset for benchmarking automatic visit note generation. Scientific Data, 10(1):586, 2023.
- [53] Jiakang Yuan, Xiangchao Yan, Shiyang Feng, Bo Zhang, Tao Chen, Botian Shi, Wanli Ouyang, Yu Qiao, Lei Bai, and Bowen Zhou. Dolphin: Closed-loop open-ended auto-research through thinking, practice, and feedback. arXiv preprint arXiv:2501.03916, 2025.
- [54] Wenlin Zhang, Xiaopeng Li, Yingyi Zhang, Pengyue Jia, Yichao Wang, Huifeng Guo, Yong Liu, and Xiangyu Zhao. Deep research: A survey of autonomous research agents. arXiv preprint arXiv:2508.12752, 2025.
discussion (0)