
arxiv: 2606.03841 · v1 · pith:BMPDYASTnew · submitted 2026-06-02 · 💻 cs.AI

EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management

Pith reviewed 2026-06-28 09:30 UTC · model grok-4.3

classification 💻 cs.AI
keywords autonomous agents · data science agents · skill acquisition · context compression · reinforcement learning · large language models · self-evolving systems · information bottleneck

The pith

EvoDS lets data science agents acquire reusable skills and learn context compression through reinforcement learning, outperforming state-of-the-art open-source agents by 28.9 percent on average across four benchmarks while eliminating out-of-token failures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that current LLM-based data science agents are held back by fixed action lists and crude truncation of history, so it builds EvoDS to let agents invent, test, and keep executable skills on their own and to treat context length as a policy that can be trained rather than a fixed rule. A two-stage multi-agent reinforcement learning process first trains a skill-acquisition loop and then an adaptive compressor, with a proof that the hierarchy cuts tool-choice mistakes and that the training objective matches the information bottleneck. On four benchmarks the resulting system beats prior open-source agents by 28.9 percent on average and never runs out of tokens. A sympathetic reader would care because the same limits appear in any long-horizon agent task that must reuse experience without human rewriting of prompts.

Core claim

EvoDS introduces an Autonomous Skill Acquisition mechanism that lets the agent synthesize, validate, and reuse executable skills together with an Adaptive Context Compression strategy that frames context management as a learned control problem. These components run inside a two-stage multi-agent training scheme that enables autonomous improvement over time. The authors prove the hierarchical design reduces tool-selection error and that the optimization objective aligns with the information bottleneck principle. Experiments show a 28.9 percent average gain over state-of-the-art open-source agents across four benchmarks and complete removal of out-of-token failures.

What carries the argument

Autonomous Skill Acquisition (ASA) and Adaptive Context Compression (ACC) inside a two-stage multi-agent reinforcement learning scheme that produces and reuses executable skills while learning to compress history.
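
The synthesize, validate, and reuse cycle named above can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the `SkillLibrary` class, the `propose` and `call` methods, and the use of input/output test cases as the validation criterion are hypothetical, since the abstract does not say how EvoDS validates a skill.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical names throughout: SkillLibrary, propose, call, and the
# test-case format are illustrative assumptions, not the EvoDS API.
Case = Tuple[Tuple[Any, ...], Any]  # (positional arguments, expected output)


@dataclass
class SkillLibrary:
    skills: Dict[str, Callable[..., Any]] = field(default_factory=dict)

    def validate(self, fn: Callable[..., Any], cases: List[Case]) -> bool:
        """Accept a candidate only if it runs and matches every expected output."""
        for args, expected in cases:
            try:
                if fn(*args) != expected:
                    return False
            except Exception:
                return False
        return True

    def propose(self, name: str, source: str, cases: List[Case]) -> bool:
        """Compile candidate source, validate it, and store it on success."""
        namespace: Dict[str, Any] = {}
        try:
            exec(source, namespace)  # a real system would sandbox this step
        except Exception:
            return False
        fn = namespace.get(name)
        if not callable(fn) or not self.validate(fn, cases):
            return False
        self.skills[name] = fn
        return True

    def call(self, name: str, *args: Any) -> Any:
        """Reuse a previously validated skill on a new task."""
        return self.skills[name](*args)


library = SkillLibrary()
good = (
    "def fill_missing(values, default):\n"
    "    return [default if v is None else v for v in values]\n"
)
bad = "def fill_missing(values, default):\n    return values\n"
cases = [(([1, None, 3], 0), [1, 0, 3])]

accepted_bad = library.propose("fill_missing", bad, cases)    # rejected by validation
accepted_good = library.propose("fill_missing", good, cases)  # stored for reuse
reused = library.call("fill_missing", [None, 2], -1)          # applied to a later task
```

The point of the sketch is the gate between synthesis and reuse: a candidate that fails its checks never enters the library, which is the step the load-bearing premise below depends on.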

If this is right

  • Agents accumulate executable experience across separate tasks instead of restarting from scratch each time.
  • Multi-stage iterative pipelines become feasible without repeated out-of-token crashes.
  • Tool-selection errors drop because the hierarchy separates high-level planning from low-level execution.
  • Context use becomes efficient by design rather than by manual truncation rules.
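
The contrast between manual truncation and context management as a controllable decision can be sketched as follows. This is a toy under stated assumptions: tokens are counted by whitespace splitting, and the hand-written `keep_score` function merely stands in for a learned policy, which the paper trains with reinforcement learning and which is not reproduced here.

```python
from typing import Callable, List

# Illustrative assumptions: whitespace token counting, and a hand-written
# keep_score standing in for the learned compression policy.


def count_tokens(text: str) -> int:
    return len(text.split())


def truncate(history: List[str], budget: int) -> List[str]:
    """Passive baseline: keep the most recent entries that fit the budget."""
    kept: List[str] = []
    used = 0
    for entry in reversed(history):
        cost = count_tokens(entry)
        if used + cost > budget:
            break
        kept.append(entry)
        used += cost
    return list(reversed(kept))


def compress(history: List[str], budget: int,
             keep_score: Callable[[str], float]) -> List[str]:
    """Controlled variant: keep the highest-scoring entries that fit, in original order."""
    ranked = sorted(range(len(history)),
                    key=lambda i: keep_score(history[i]), reverse=True)
    chosen = set()
    used = 0
    for i in ranked:
        cost = count_tokens(history[i])
        if used + cost <= budget:
            chosen.add(i)
            used += cost
    return [history[i] for i in sorted(chosen)]


def keep_score(entry: str) -> float:
    # Stand-in scoring rule; a trained policy would replace this.
    if entry.startswith("schema"):
        return 1.0
    if "model" in entry:
        return 0.5
    return 0.1


history = [
    "schema: target column is price",
    "plot rendered ok",
    "debug log line one two three",
    "model fit finished",
]

truncated = truncate(history, budget=8)
compressed = compress(history, budget=8, keep_score=keep_score)
```

With a budget of 8 tokens, truncation keeps only the latest entry and loses the schema note, while the scored variant keeps the schema note and the model status. Both stay within the budget, so neither would trigger an out-of-token failure; they differ only in what survives.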

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the skill-validation step can be made fully automatic and bias-free, the same loop could be tried in non-data-science domains such as code refactoring or scientific experiment design.
  • Treating context compression as a trainable policy may transfer to other long-horizon LLM settings where simple truncation currently loses critical details.
  • The information-bottleneck alignment suggests the method could be extended to measure exactly how much task-relevant information survives each compression step.
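
For orientation, the information bottleneck principle in its standard textbook form is written below. The mapping of symbols to EvoDS is an assumption of this reading, not the paper's notation: X is taken as the full interaction history, Z as the compressed context, and Y as the task-relevant outcome. The paper's own objective is not shown in the abstract.

```latex
% Standard information bottleneck Lagrangian (generic form, not the paper's exact objective).
% X: full history, Z: compressed context, Y: task outcome (assumed mapping).
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y), \qquad \beta > 0
```

The first term penalizes how much of the history the compressed context retains, and the second rewards how much it still says about the outcome. Measuring I(Z; Y) before and after each compression step is one way the last bullet above could be made concrete.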

Load-bearing premise

The two-stage training and Autonomous Skill Acquisition will reliably produce stable, reusable skills whose automatic validation introduces no hidden biases and requires no extra human oversight.

What would settle it

Run EvoDS on a new multi-stage data pipeline benchmark where the generated skills either fail validation repeatedly or the learned compressor drops information that later steps need, producing lower accuracy than a static baseline.

Figures

Figures reproduced from arXiv: 2606.03841 by Fan Liu, Hao Liu, Yansong Ning, Zherui Yang.

Figure 1: Overview of EvoDS. (a) EvoDS adopts a hierarchical multi-agent architecture with autonomous skill acquisition and ...
Figure 2: Case study of EvoDS. Case 1 shows skill synthesis for solving a new task. Case 2 demonstrates cross-task skill reuse.
Figure 4: Reward and context length of EvoDS and naive ...
Figure 5: The system prompt used for the Manager agent.
Figure 6: The system prompt used for the Cleaner agent.
Figure 7: The system prompt used for the Featurizer agent.
Figure 8: The system prompt used for the Modeler agent.
Figure 9: The system prompt used for the Visualizer agent.
Figure 10: The system prompt used for the Debugger agent.
Figure 11: The input prompt used for the Cleaner agent.
Figure 12: The input prompt used for the Featurizer agent.
Figure 13: The input prompt used for the Modeler agent.
Figure 14: The input prompt used for the Visualizer agent.
Figure 15: The input prompt used for the Debugger agent.
Original abstract

Recent progress in Large Language Model (LLM) agents has enabled promising advances in automated data science. However, existing approaches remain fundamentally limited by their static action sets and lack of principled long-horizon context management, hindering their ability to accumulate reusable experience across tasks and operate reliably in multi-stage, iterative data science pipelines. To address these challenges, we introduce EvoDS, a self-evolving autonomous data science agent that learns to expand its skills and adaptively manage long-term context through agentic reinforcement learning. Specifically, EvoDS introduces two key strategies: (1) an Autonomous Skill Acquisition (ASA) mechanism, which enables agents to synthesize, validate, and reuse executable skills; and (2) an Adaptive Context Compression (ACC) strategy, which treats context management as a learned control problem rather than passive truncation. These strategies are orchestrated within a two-stage multi-agent training scheme, enabling EvoDS to autonomously improve over time. Theoretically, we prove that EvoDS's hierarchical design reduces tool-selection error, and its optimization objective aligns with an information bottleneck principle, ensuring efficient context use. Empirically, EvoDS outperforms state-of-the-art open-source data science agents by an average of 28.9% across four diverse benchmarks while eliminating out-of-token failures. Our code and data are available at https://github.com/usail-hkust/EvoDS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces EvoDS, a self-evolving LLM-based data science agent using Autonomous Skill Acquisition (ASA) to synthesize/validate/reuse executable skills and Adaptive Context Compression (ACC) for learned context management. These are orchestrated in a two-stage multi-agent training scheme. The manuscript claims a theoretical proof that the hierarchical design reduces tool-selection error and that the optimization objective aligns with an information-bottleneck principle. Empirically, it reports a 28.9% average outperformance over state-of-the-art open-source data science agents across four benchmarks, with zero out-of-token failures, and releases code and data.

Significance. If the empirical gains and theoretical alignment can be substantiated with full experimental protocols, derivations, and validation details, the work would represent a meaningful advance in autonomous LLM agents for iterative data science pipelines by addressing static action sets and context management. The open release of code/data strengthens potential impact and reproducibility.

major comments (3)
  1. [Abstract / theoretical section] Abstract and § on theoretical contributions: the claimed proof that the hierarchical design reduces tool-selection error and that the optimization aligns with an information-bottleneck principle is asserted without any equations, derivation steps, or formal statements, preventing verification of whether the alignment is independent or circular with the training objective.
  2. [Abstract / ASA mechanism] Abstract and ASA description: the Autonomous Skill Acquisition mechanism is described as enabling agents to 'synthesize, validate, and reuse executable skills,' but supplies no concrete validation criteria, success thresholds, failure-mode handling, or quantification of human oversight, which is load-bearing for both the 28.9% benchmark gains and the self-evolving property.
  3. [Abstract / experimental results] Empirical claims: the 28.9% average improvement, elimination of out-of-token failures, and cross-benchmark superiority are stated without dataset details, experimental protocol, error bars, statistical tests, or baseline implementations, rendering the central empirical result unverifiable.
minor comments (1)
  1. [Abstract] The abstract mentions 'four diverse benchmarks' without naming them or providing links to the released code/data repository contents.
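
The reporting asked for in major comment 3 (per-benchmark results with error bars behind the averaged gain) could look roughly like the sketch below. All numbers are synthetic placeholders invented for illustration and are not results from the paper; the benchmark names are hypothetical, and averaging per-benchmark relative gains is only one assumed reading of "an average of 28.9%".

```python
import random
from statistics import mean
from typing import Dict, List, Tuple

# Synthetic placeholder scores for illustration only; these are NOT results
# from the paper. Benchmark names are hypothetical.
# Each entry: (baseline scores over seeds, system scores over seeds).
Runs = Dict[str, Tuple[List[float], List[float]]]
runs: Runs = {
    "bench_a": ([0.50, 0.52, 0.48], [0.62, 0.65, 0.60]),
    "bench_b": ([0.30, 0.33, 0.31], [0.40, 0.41, 0.39]),
}


def relative_gain(baseline: List[float], system: List[float]) -> float:
    """Relative improvement of the mean system score over the mean baseline score."""
    return (mean(system) - mean(baseline)) / mean(baseline)


def average_gain(data: Runs) -> float:
    """Unweighted mean of per-benchmark relative gains (an assumed definition)."""
    return mean(relative_gain(b, s) for b, s in data.values())


def bootstrap_ci(data: Runs, n_resamples: int = 2000, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval, resampling seeds within each benchmark."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        resampled = {
            name: (rng.choices(b, k=len(b)), rng.choices(s, k=len(s)))
            for name, (b, s) in data.items()
        }
        estimates.append(average_gain(resampled))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


point = average_gain(runs)
lower, upper = bootstrap_ci(runs)
```

Reporting `point` together with `(lower, upper)` for each benchmark and for the average would let a reader judge whether a headline gain is stable across seeds, which is the gap the referee flags.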

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will incorporate the requested clarifications, derivations, and experimental details into a revised manuscript.

Point-by-point responses
  1. Referee: [Abstract / theoretical section] Abstract and § on theoretical contributions: the claimed proof that the hierarchical design reduces tool-selection error and that the optimization aligns with an information-bottleneck principle is asserted without any equations, derivation steps, or formal statements, preventing verification of whether the alignment is independent or circular with the training objective.

    Authors: We acknowledge that the current manuscript states the theoretical claims at a high level without providing the supporting equations or derivations. In the revision we will add a dedicated subsection containing the formal statements, the full derivation showing how the hierarchical design reduces tool-selection error, and the step-by-step alignment of the optimization objective with the information-bottleneck principle, explicitly demonstrating that the alignment is not circular with the training loss. revision: yes

  2. Referee: [Abstract / ASA mechanism] Abstract and ASA description: the Autonomous Skill Acquisition mechanism is described as enabling agents to 'synthesize, validate, and reuse executable skills,' but supplies no concrete validation criteria, success thresholds, failure-mode handling, or quantification of human oversight, which is load-bearing for both the 28.9% benchmark gains and the self-evolving property.

    Authors: The manuscript currently presents ASA at a conceptual level. We will expand the ASA section with explicit validation criteria (including execution success thresholds and consistency checks), failure-mode handling procedures, and a clear statement of the (minimal) human oversight involved in the validation loop, thereby making the self-evolving claims fully verifiable. revision: yes

  3. Referee: [Abstract / experimental results] Empirical claims: the 28.9% average improvement, elimination of out-of-token failures, and cross-benchmark superiority are stated without dataset details, experimental protocol, error bars, statistical tests, or baseline implementations, rendering the central empirical result unverifiable.

    Authors: We agree that the empirical section requires substantially more detail. The revised manuscript will include complete dataset descriptions, the full experimental protocol, per-benchmark results with error bars, statistical significance tests, and explicit descriptions of how each baseline was implemented and evaluated. revision: yes

Circularity Check

0 steps flagged

No significant circularity; theoretical claims lack equations for inspection

Full rationale

The provided abstract asserts a proof that the optimization objective aligns with an information bottleneck principle and that the hierarchical design reduces tool-selection error, but supplies no equations, definitions, or derivation steps. No load-bearing step can be quoted that reduces a claimed result to its own inputs by construction, nor is any fitted parameter renamed as a prediction. The empirical performance numbers are presented as benchmark outcomes rather than derived predictions. On the basis of the given text, no circular step can therefore be identified; equally, the absence of mathematical detail means circularity cannot be positively ruled out.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claims rest on unstated assumptions about the stability of learned skills and the correctness of the information-bottleneck alignment.

pith-pipeline@v0.9.1-grok · 5776 in / 1236 out tokens · 20678 ms · 2026-06-28T09:30:55.414964+00:00 · methodology


Reference graph

Works this paper leans on

99 extracted references · 12 linked inside Pith

  1. [1]

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. 2024. Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs. In ACL. 12248–12267

  2. [2]

    Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. 2017. Deep Variational Information Bottleneck. In ICLR

  3. [3]

    Tijl De Bie, Luc De Raedt, José Hernández-Orallo, Holger H. Hoos, Padhraic Smyth, and Christopher K. I. Williams. 2022. Automating data science. Commun. ACM 65, 3 (2022), 76–87

  4. [4]

    Xiaohe Bo, Zeyu Zhang, Quanyu Dai, Xueyang Feng, Lei Wang, Rui Li, Xu Chen, and Ji-Rong Wen. 2024. Reflective Multi-Agent Collaboration based on Large Language Models. In NeurIPS. 138595–138631

  5. [5]

    Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. 2024. Large Language Models as Tool Makers. In ICLR

  6. [6]

    Yibin Chen, Yifu Yuan, Zeyu Zhang, Yan Zheng, Jinyi Liu, Fei Ni, Jianye Hao, Hangyu Mao, and Fuzheng Zhang. 2025. SheetAgent: Towards a Generalist Agent for Spreadsheet Reasoning and Manipulation via Large Language Models. In WWW. 158–177

  7. [7]

    Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, Vishal Dey, Mingyi Xue, Frazier N. Baker, Benjamin Burns, Daniel Adu-Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu Su, and Huan Sun. 2025. ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discover...

  8. [8]

    Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. CoRR abs/2504.19413 (2025)

  10. [10]

    Yaswanth Chittepu, Raghavendra Addanki, Tung Mai, Anup B. Rao, and Branislav Kveton. 2025. ML-Tool-Bench: Tool-Augmented Planning for ML Tasks. CoRR abs/2512.00672 (2025)

  11. [11]

    DeepSeek. 2025. DeepSeek-V3.1 Release. https://api-docs.deepseek.com/news/news250821

  12. [12]

    Shangheng Du, Xiangchao Yan, Dengyang Jiang, Jiakang Yuan, Yusong Hu, Xin Li, Liang He, Bo Zhang, and Lei Bai. 2025. AutoMLGen: Navigating Fine-Grained Optimization for Coding Agents. CoRR abs/2510.08511 (2025)

  13. [13]

    Haoyang Fang, Boran Han, Nick Erickson, Xiyuan Zhang, Su Zhou, Anirudh Dagar, Jiani Zhang, Ali Caner Turkmen, Cuixiong Hu, Huzefa Rangwala, Ying Nian Wu, Bernie Wang, and George Karypis. 2025. MLZero: A Multi-Agent System for End-to-end Machine Learning Automation. In NeurIPS

  14. [14]

    Jinyuan Fang, Yanwen Peng, Xi Zhang, Yingxu Wang, Xinhao Yi, Guibin Zhang, Yi Xu, Bin Wu, Siwei Liu, Zihao Li, Zhaochun Ren, Nikos Aletras, Xi Wang, Han Zhou, and Zaiqiao Meng. 2025. A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems. CoRR abs/2508.07407 (2025)

  15. [15]

    Weizhi Fei, Xueyan Niu, Pingyi Zhou, Lu Hou, Bo Bai, Lei Deng, and Wei Han. Extending Context Window of Large Language Models via Semantic Compression. In ACL (Findings). 5169–5181

  17. [17]

    Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. 2024. Promptbreeder: Self-Referential Self-Improvement via Prompt Evolution. In ICML. 13481–13544

  18. [18]

    Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, Hongru Wang, Han Xiao, Yuhang Zhou, Shaokun Zhang, Jiayi Zhang, Jinyu Xiang, Yixiong Fang, Qiwen Zhao, Dongrui Liu, Qihan Ren, Cheng Qian, Zhenhailong Wang, Minda Hu, Huazheng Wang, Qingyun Wu, Heng Ji, and Mengdi Wang. 2025. A Survey...

  19. [19]

    Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. 2024. Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers. In ICLR

  20. [20]

    Siyuan Guo, Cheng Deng, Ying Wen, Hechang Chen, Yi Chang, and Jun Wang. DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning. In ICML. 16813–16848

  22. [22]

    Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. 2024. Large Language Model Based Multi-agents: A Survey of Progress and Challenges. In IJCAI. 8048–8057

  23. [23]

    Xin He, Kaiyong Zhao, and Xiaowen Chu. 2021. AutoML: A survey of the state-of-the-art. Knowl-based Syst 212 (2021), 106622

  24. [24]

    Haoyang Hong, Jiajun Yin, Yuan Wang, Jingnan Liu, Zhe Chen, Ailing Yu, Ji Li, Zhiling Ye, Hansong Xiao, Yefei Chen, Hualei Zhou, Yun Yue, Minghui Yang, Chunxiao Guo, Junwei Liu, Peng Wei, and Jinjie Gu. 2025. Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO. CoRR abs/2511.13288 (2025)

  25. [25]

    Sirui Hong, Yizhang Lin, Bang Liu, Bangbang Liu, Binhao Wu, Ceyao Zhang, Danyang Li, Jiaqi Chen, Jiayi Zhang, Jinlin Wang, Li Zhang, Lingyao Zhang, Min Yang, Mingchen Zhuge, Taicheng Guo, Tuo Zhou, Wei Tao, Robert Tang, Xiangtao Lu, Xiawu Zheng, Xinbing Liang, Yaying Fei, Yuheng Cheng, Yongxin Ni, Zhibin Gou, Zongze Xu, Yuyu Luo, and Chenglin Wu. 2025. Da...

  26. [26]

    Jian Hu. 2025. REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models. CoRR abs/2501.03262 (2025)

  27. [27]

    Xueyu Hu, Ziyu Zhao, Shuang Wei, Ziwei Chai, Qianli Ma, Guoyin Wang, Xuwu Wang, Jing Su, Jingjing Xu, Ming Zhu, Yao Cheng, Jianbo Yuan, Jiwei Li, Kun Kuang, Yang Yang, Hongxia Yang, and Fei Wu. 2024. InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks. In ICML. 19544–19572

  28. [28]

    Yiming Huang, Jianwen Luo, Yan Yu, Yitong Zhang, Fangyu Lei, Yifan Wei, Shizhu He, Lifu Huang, Xiao Liu, Jun Zhao, and Kang Liu. 2024. DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models. In EMNLP. 13487–13521

  29. [29]

    Dongfu Jiang, Yi Lu, Zhuofeng Li, Zhiheng Lyu, Ping Nie, Haozhe Wang, Alex Su, Hui Chen, Kai Zou, Chao Du, Tianyu Pang, and Wenhu Chen. 2025. VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use. CoRR abs/2509.01055 (2025)

  30. [30]

    Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, and Dong Yu. 2025. DSBench: How Far Are Data Science Agents from Becoming Data Science Experts? In ICLR

  31. [31]

    Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A. Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, and Saravan Rajmohan. 2025. ACON: Optimizing Context Compression for Long-horizon LLM Agents. CoRR abs/2510.00615 (2025)

  32. [32]

    Kuang-Huei Lee, Xinyun Chen, Hiroki Furuta, John F. Canny, and Ian Fischer. A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts. In ICML. 26396–26415

  34. [34]

    Ziming Li, Qianbo Zang, David Ma, Jiawei Guo, Tianyu Zheng, Minghao Liu, Xinyao Niu, Yue Wang, Jian Yang, Jiaheng Liu, Wanjun Zhong, Wangchunshu Zhou, Stephen Huang, and Ge Zhang. 2025. AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions. In DL4C@ICLR

  35. [35]

    Fan Liu, Zhe-Rui Yang, Cancheng Liu, Tianrui SONG, Xiaofeng Gao, and Hao Liu. 2025. MM-Agent: LLM as Agents for Real-world Mathematical Modeling Problem. In NeurIPS

  36. [36]

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the Middle: How Language Models Use Long Contexts. Trans. Assoc. Comput. Linguistics 12 (2024), 157–173

  37. [37]

    Zexi Liu, Yuzhu Cai, Xinyu Zhu, Yujie Zheng, Runkun Chen, Ying Wen, Yanfeng Wang, Weinan E, and Siheng Chen. 2025. ML-Master: Towards AI-for-AI via Integration of Exploration and Reasoning. CoRR abs/2506.16499 (2025)

  38. [38]

    Zexi Liu, Jingyi Chai, Xinyu Zhu, Shuo Tang, Rui Ye, Bolun Zhang, Lei Bai, and Siheng Chen. 2025. ML-Agent: Reinforcing LLM Agents for Autonomous Machine Learning Engineering. CoRR abs/2505.23723 (2025)

  39. [39]

    Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, and Shenghua Liu. 2025. A Survey of Context Engineering for Large Language Models. CoRR abs/2507.13334 (2025)

  40. [40]

    Zhanfeng Mo, Xingxuan Li, Yuntao Chen, and Lidong Bing. 2025. Multi-Agent Tool-Integrated Policy Optimization. CoRR abs/2510.04678 (2025)

  41. [41]

    Sumeet Ramesh Motwani, Chandler Smith, Rocktim Jyoti Das, Rafael Rafailov, Philip Torr, Ivan Laptev, Fabio Pizzati, Ronald Clark, and Christian Schroeder de Witt. 2025. MALT: Improving Reasoning with Multi-Agent LLM Training. In COLM

  42. [42]

    Alhassan Mumuni and Fuseini Mumuni. 2025. Automated data processing and feature engineering for deep learning and big data applications: A survey. J. Inf. Intell. 3, 2 (2025), 113–153

  43. [43]

    Jaehyun Nam, Jinsung Yoon, Jiefeng Chen, Jinwoo Shin, Sercan O Arik, and Tomas Pfister. 2025. MLE-STAR: Machine Learning Engineering Agent via Search and Targeted Refinement. In NeurIPS

  44. [44]

    Xuan-Phi Nguyen, Shrey Pandit, Revanth Gangi Reddy, Austin Xu, Silvio Savarese, Caiming Xiong, and Shafiq Joty. 2025. SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents. CoRR abs/2509.06283 (2025)

  45. [45]

    OpenAI. 2023. Code Interpreter. https://platform.openai.com/docs/guides/tools-code-interpreter

  46. [46]

    OpenAI. 2023. Hello GPT-4. https://openai.com/zh-Hans-CN/index/hello-gpt-4o/

  47. [47]

    OpenAI. 2025. Introducing OpenAI o3 and o4-mini. https://openai.com/zh-Hans-CN/index/introducing-o3-and-o4-mini/

  48. [48]

    Chanwoo Park, Seungju Han, Xingzhi Guo, Asuman E. Ozdaglar, Kaiqing Zhang, and Joo-Kyung Kim. 2025. MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning. In ACL. 30215–30248

  49. [49]

    Rushi Qiang, Yuchen Zhuang, Yinghao Li, Dingu Sagar V K, Rongzhi Zhang, ChangHao Li, Ian Shu-Hei Wong, Sherry Yang, Percy Liang, Chao Zhang, and Bo Dai. 2025. MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering. In NeurIPS

  50. [50]

    Shuofei Qiao, Yanqiu Zhao, Zhisong Qiu, Xiaobin Wang, Jintian Zhang, Zhao Bin, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, and Huajun Chen. 2026. Scaling Generalist Data-Analytic Agents. In ICLR

    (Paper running header: EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management. KDD ’26, August 09–13, 2026, Jeju Island, Republic of Korea)

  51. [51]

    Chuan Qin, Xin Chen, Chengrui Wang, Pengmin Wu, Xi Chen, Yihang Cheng, Jingyi Zhao, Meng Xiao, Xiangchao Dong, Qingqing Long, Boya Pan, Han Wu, Chengzan Li, Yuanchun Zhou, Hui Xiong, and Hengshu Zhu. 2025. SciHorizon: Benchmarking AI-for-Science Readiness from Scientific Data to Large Language Models. In KDD (2). 5754–5765

  52. [52]

    Mizanur Rahman, Amran Bhuiyan, Mohammed Saidul Islam, Md. Tahmid Rahman Laskar, Ridwan Mahbub, Ahmed Masry, Shafiq Joty, and Enamul Hoque. LLM-Based Data Science Agents: A Survey of Capabilities, Challenges, and Future Directions. CoRR abs/2510.04023 (2025)

  54. [54]

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. CoRR abs/1707.06347 (2017)

  56. [56]

    Jiaqi Shao, Yufeng Miao, Wei Zhang, and Bing Luo. 2025. FoldAct: Efficient and Stable Context Folding for Long-Horizon Search Agents. CoRR abs/2512.22733 (2025)

  57. [57]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. CoRR abs/2402.03300 (2024)

  58. [58]

    Maojun Sun, Ruijian Han, Binyan Jiang, Houduo Qi, Defeng Sun, Yancheng Yuan, and Jian Huang. 2025. LAMBDA: A Large Model Based Data Agent. J. Am. Stat. Assoc. 0, 0 (2025), 1–13

  59. [59]

    Maojun Sun, Ruijian Han, Binyan Jiang, Houduo Qi, Defeng Sun, Yancheng Yuan, and Jian Huang. 2025. A survey on large language model-based agents for statistics and data science. Am. Stat. 0, 0 (2025), 1–14

  60. [60]

    Weiwei Sun, Miao Lu, Zhan Ling, Kang Liu, Xuesong Yao, Yiming Yang, and Jiecao Chen. 2025. Scaling Long-Horizon LLM Agent via Context-Folding.CoRR abs/2510.11967 (2025)

  61. [61]

    Zirui Tang, Weizheng Wang, Zihang Zhou, Yang Jiao, Bangrui Xu, Boyu Niu, Xuanhe Zhou, Guoliang Li, Yeye He, Wei Zhou, Yitong Song, Cheng Tan, Bin Wang, Conghui He, Xiaoyang Wang, and Fan Wu. 2025. LLM/Agent-as-Data-Analyst: A Survey. CoRR abs/2509.23988 (2025)

  62. [62]

    Patara Trirat, Wonyong Jeong, and Sung Ju Hwang. 2025. AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML. In ICML. 60099–60146

  63. [63]

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Jirong Wen. 2024. A survey on large language model based autonomous agents. Frontiers Comput. Sci. 18, 6 (2024), 186345

  64. [64]

    Peiran Wang, Yaoning Yu, Ke Chen, Xianyang Zhan, and Haohan Wang. 2025. Large Language Model-based Data Science Agent: A Survey. CoRR abs/2508.02744 (2025)

  65. [65]

    Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. 2025. Agent Workflow Memory. In ICML. 63897–63911

  66. [66]

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. 2024. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversations. In COLM

  67. [67]

    Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-Mem: Agentic Memory for LLM Agents. In NeurIPS

  69. [69]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jian Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Liangha...

  70. [70]

    Yutao Yang, Junsong Li, Qianjun Pan, Bihao Zhan, Yuxuan Cai, Lin Du, Jie Zhou, Kai Chen, Qin Chen, Xin Li, Bo Zhang, and Liang He. 2026. AutoSkill: Experience-Driven Lifelong Learning via Skill Self-Evolution. CoRR abs/2603.01145 (2026)

  71. [71]

    Zhiyu Yang, Zihan Zhou, Shuo Wang, Xin Cong, Xu Han, Yukun Yan, Zhenghao Liu, Zhixing Tan, Pengyuan Liu, Dong Yu, Zhiyuan Liu, Xiaodong Shi, and Maosong Sun. 2024. MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization. In ACL (Findings). 11789–11804

  72. [72]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In ICLR

  73. [73]

    Lifan Yuan, Yangyi Chen, Xingyao Wang, Yi Fung, Hao Peng, and Heng Ji. 2024. CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets. In ICLR

  74. [74]

    Mert Yüksekgönül, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. 2024. TextGrad: Automatic "Differentiation" via Text. CoRR abs/2406.07496 (2024)

  75. [75]

    Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhongzhi Li, Xiangyuan Xue, Yijiang Li, Yifan Zhou, Yang Chen, Chen Zhang, Yutao Fan, Zihu Wang, Songtao Huang, Yue Liao, Hongru Wang, Mengyue Yang, Heng Ji, Michael Littman, Jun Wang, Shuicheng Yan, Philip Torr, and Lei Bai. 2025. The Landscape of Agentic Reinforcemen...

  76. [76]

    Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, and Chenglin Wu. 2025. AFlow: Automating Agentic Workflow Generation. In ICLR

  77. [77]

    Shaolei Zhang, Ju Fan, Meihao Fan, Guoliang Li, and Xiaoyong Du. 2025. DeepAnalyze: Agentic Large Language Models for Autonomous Data Science. CoRR abs/2510.16872 (2025)

  78. [78]

    Wenqi Zhang, Yongliang Shen, Weiming Lu, and Yueting Zhuang. 2024. Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow. In LLMAgents@ICLR

  79. [79]

    Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. 2024. MemoryBank: Enhancing Large Language Models with Long-Term Memory. In AAAI. 19724–19731

  80. [80]

    Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang, Rui Ye, Jiaao Chen, Hanrui Wang, Wei-Chen Wang, Yuzhi Zhang, Linfeng Zhang, Weinan E, Di Jin, Siheng Chen, and Yanfeng Wang. 2026. Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering. CoRR abs/2601.10402 (2026)

Showing first 80 references.