EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management
Pith reviewed 2026-06-28 09:30 UTC · model grok-4.3
The pith
EvoDS lets data science agents acquire reusable skills and learn context compression through reinforcement learning. The authors report a 28.9 percent average gain over state-of-the-art open-source agents across four benchmarks, with no out-of-token failures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EvoDS introduces an Autonomous Skill Acquisition mechanism that lets the agent synthesize, validate, and reuse executable skills together with an Adaptive Context Compression strategy that frames context management as a learned control problem. These components run inside a two-stage multi-agent training scheme that enables autonomous improvement over time. The authors prove the hierarchical design reduces tool-selection error and that the optimization objective aligns with the information bottleneck principle. Experiments show a 28.9 percent average gain over state-of-the-art open-source agents across four benchmarks and complete removal of out-of-token failures.
What carries the argument
Autonomous Skill Acquisition (ASA) and Adaptive Context Compression (ACC) inside a two-stage multi-agent reinforcement learning scheme that produces and reuses executable skills while learning to compress history.
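The synthesize, validate, and reuse loop described here can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the `SkillLibrary` class, the `min_pass_rate` threshold, and the test-case format are hypothetical and are not taken from the EvoDS code or paper, which do not state their validation criteria in the text reviewed here.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical names throughout; none of this is taken from the EvoDS code.
Skill = Callable[..., Any]
TestCase = Tuple[tuple, Any]  # (positional args, expected result)


@dataclass
class SkillLibrary:
    """Stores only skills that pass an execution-based validation gate."""
    min_pass_rate: float = 1.0  # assumed threshold, not the paper's criterion
    skills: Dict[str, Skill] = field(default_factory=dict)

    def validate(self, skill: Skill, cases: List[TestCase]) -> float:
        """Run the candidate on each case and return the fraction that pass."""
        if not cases:
            return 0.0
        passed = 0
        for args, expected in cases:
            try:
                if skill(*args) == expected:
                    passed += 1
            except Exception:
                # A crash counts as a failed case rather than aborting validation.
                pass
        return passed / len(cases)

    def register(self, name: str, skill: Skill, cases: List[TestCase]) -> bool:
        """Admit the skill only if it clears the pass-rate threshold."""
        if self.validate(skill, cases) >= self.min_pass_rate:
            self.skills[name] = skill
            return True
        return False

    def call(self, name: str, *args: Any) -> Any:
        """Reuse a previously validated skill on a new task."""
        return self.skills[name](*args)


def column_mean(values: List[float]) -> float:
    """Candidate skill: mean of a numeric column."""
    return sum(values) / len(values)


def broken_mean(values: List[float]) -> float:
    """Faulty candidate: forgets to divide by the length."""
    return float(sum(values))


library = SkillLibrary()
cases = [(([1.0, 2.0, 3.0],), 2.0), (([4.0],), 4.0)]
accepted = library.register("column_mean", column_mean, cases)
rejected = library.register("broken_mean", broken_mean, cases)
```

In this sketch the faulty candidate passes one of two cases and is kept out of the library, which is the kind of gate the referee asks the authors to specify. Where the test cases come from, and whether they carry bias, is exactly the open question raised under the load-bearing premise.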
If this is right
- Agents accumulate executable experience across separate tasks instead of restarting from scratch each time.
- Multi-stage iterative pipelines become feasible without repeated out-of-token crashes.
- Tool-selection errors drop because the hierarchy separates high-level planning from low-level execution.
- Context use becomes efficient by design rather than by manual truncation rules.
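To make the contrast with manual truncation concrete, here is a minimal sketch of budget-aware history compression. It is not the paper's ACC method: the fixed oldest-first rule below only marks the place where a learned policy would make the decision, and `count_tokens`, `truncate_summary`, and `compress_history` are hypothetical helpers with a crude whitespace token count.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude token estimate (assumption): whitespace-separated words."""
    return len(text.split())


def truncate_summary(text: str, keep: int = 8) -> str:
    """Placeholder summarizer: keep the first `keep` words."""
    words = text.split()
    return text if len(words) <= keep else " ".join(words[:keep]) + " [...]"


def compress_history(
    history: List[str],
    budget: int,
    summarize: Callable[[str], str] = truncate_summary,
) -> List[str]:
    """Shrink older entries until the history fits the token budget.

    The hand-written rule stands in for a trained policy. The latest
    entry is never altered, so a single oversized final entry can
    still exceed the budget.
    """
    result = list(history)
    total = sum(count_tokens(h) for h in result)
    # Pass 1: summarize from oldest to newest, skipping the latest entry.
    for i in range(len(result) - 1):
        if total <= budget:
            return result
        shorter = summarize(result[i])
        total -= count_tokens(result[i]) - count_tokens(shorter)
        result[i] = shorter
    # Pass 2: if still over budget, drop oldest entries, keeping the latest.
    while total > budget and len(result) > 1:
        total -= count_tokens(result.pop(0))
    return result


history = [
    "a b c d e f g h i j k l",
    "m n o p q r s t u v w x",
    "y z",
]
compressed = compress_history(history, budget=15)
untouched = compress_history(history, budget=100)
```

A learned controller would replace the two fixed passes with per-step choices about what to keep, summarize, or drop, which is where the claimed advantage over rule-based truncation would have to come from.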
Where Pith is reading between the lines
- If the skill-validation step can be made fully automatic and bias-free, the same loop could be tried in non-data-science domains such as code refactoring or scientific experiment design.
- Treating context compression as a trainable policy may transfer to other long-horizon LLM settings where simple truncation currently loses critical details.
- The information-bottleneck alignment suggests the method could be extended to measure exactly how much task-relevant information survives each compression step.
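For orientation, the standard information bottleneck objective is written out below. The paper's own objective does not appear in the text reviewed here, so the mapping of variables to EvoDS is an assumption: X is taken to be the full interaction history, Z the compressed context, and Y the information that later steps need.

```latex
% Standard information bottleneck Lagrangian (not the paper's stated objective).
% Assumed mapping: X = full interaction history, Z = compressed context,
% Y = the information later steps need, \beta > 0 = trade-off weight.
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)

% If Y - X - Z forms a Markov chain, the data processing inequality gives
% I(Z; Y) \le I(X; Y), so the retained fraction is bounded:
0 \;\le\; \frac{I(Z; Y)}{I(X; Y)} \;\le\; 1 \qquad \text{(for } I(X; Y) > 0\text{)}
```

Under these assumptions the ratio in the second line is one candidate for the "how much survives" measurement suggested in the last bullet, provided the mutual information terms can be estimated in practice.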
Load-bearing premise
The two-stage training and Autonomous Skill Acquisition will reliably produce stable, reusable skills whose automatic validation introduces no hidden biases and requires no extra human oversight.
What would settle it
Run EvoDS on a new multi-stage data pipeline benchmark where the generated skills either fail validation repeatedly or the learned compressor drops information that later steps need, producing lower accuracy than a static baseline.
Original abstract
Recent progress in Large Language Model (LLM) agents has enabled promising advances in automated data science. However, existing approaches remain fundamentally limited by their static action sets and lack of principled long-horizon context management, hindering their ability to accumulate reusable experience across tasks and operate reliably in multi-stage, iterative data science pipelines. To address these challenges, we introduce EvoDS, a self-evolving autonomous data science agent that learns to expand its skills and adaptively manage long-term context through agentic reinforcement learning. Specifically, EvoDS introduces two key strategies: (1) an Autonomous Skill Acquisition (ASA) mechanism, which enables agents to synthesize, validate, and reuse executable skills; and (2) an Adaptive Context Compression (ACC) strategy, which treats context management as a learned control problem rather than passive truncation. These strategies are orchestrated within a two-stage multi-agent training scheme, enabling EvoDS to autonomously improve over time. Theoretically, we prove that EvoDS's hierarchical design reduces tool-selection error, and its optimization objective aligns with an information bottleneck principle, ensuring efficient context use. Empirically, EvoDS outperforms state-of-the-art open-source data science agents by an average of 28.9% across four diverse benchmarks while eliminating out-of-token failures. Our code and data are available at https://github.com/usail-hkust/EvoDS.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EvoDS, a self-evolving LLM-based data science agent using Autonomous Skill Acquisition (ASA) to synthesize/validate/reuse executable skills and Adaptive Context Compression (ACC) for learned context management. These are orchestrated in a two-stage multi-agent training scheme. The manuscript claims a theoretical proof that the hierarchical design reduces tool-selection error and that the optimization objective aligns with an information-bottleneck principle. Empirically, it reports a 28.9% average outperformance over state-of-the-art open-source data science agents across four benchmarks, with zero out-of-token failures, and releases code and data.
Significance. If the empirical gains and theoretical alignment can be substantiated with full experimental protocols, derivations, and validation details, the work would represent a meaningful advance in autonomous LLM agents for iterative data science pipelines by addressing static action sets and context management. The open release of code/data strengthens potential impact and reproducibility.
major comments (3)
- [Abstract / theoretical section] Abstract and § on theoretical contributions: the claimed proof that the hierarchical design reduces tool-selection error and that the optimization aligns with an information-bottleneck principle is asserted without any equations, derivation steps, or formal statements, preventing verification of whether the alignment is independent or circular with the training objective.
- [Abstract / ASA mechanism] Abstract and ASA description: the Autonomous Skill Acquisition mechanism is described as enabling agents to 'synthesize, validate, and reuse executable skills,' but supplies no concrete validation criteria, success thresholds, failure-mode handling, or quantification of human oversight, which is load-bearing for both the 28.9% benchmark gains and the self-evolving property.
- [Abstract / experimental results] Empirical claims: the 28.9% average improvement, elimination of out-of-token failures, and cross-benchmark superiority are stated without dataset details, experimental protocol, error bars, statistical tests, or baseline implementations, rendering the central empirical result unverifiable.
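The third comment can be illustrated with a short sketch of why an "average improvement" needs a stated protocol. The scores below are synthetic placeholders, not results from the paper, and `average_gains` and `bootstrap_ci` are hypothetical helpers. The sketch shows that the same scores yield different headline numbers depending on whether gains are averaged in absolute points or as relative percentages, and how a simple bootstrap interval could be attached to repeated runs.

```python
import random
from statistics import mean
from typing import Dict, List, Tuple


def average_gains(
    baseline: Dict[str, float], method: Dict[str, float]
) -> Tuple[float, float]:
    """Return (mean absolute gain in points, mean relative gain in percent)."""
    abs_gain = mean(method[k] - baseline[k] for k in baseline)
    rel_gain = mean(
        (method[k] - baseline[k]) / baseline[k] * 100 for k in baseline
    )
    return abs_gain, rel_gain


def bootstrap_ci(
    samples: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(samples, k=len(samples))) for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Synthetic placeholder scores (NOT from the paper); benchmark names are made up.
baseline_scores = {"bench_a": 40.0, "bench_b": 50.0}
method_scores = {"bench_a": 50.0, "bench_b": 60.0}
abs_gain, rel_gain = average_gains(baseline_scores, method_scores)

# Synthetic per-seed scores for one method on one benchmark.
lo, hi = bootstrap_ci([1.0, 2.0, 3.0, 4.0, 5.0])
```

With these placeholder numbers the absolute gain is 10 points on both benchmarks while the mean relative gain is 22.5 percent, so a reader cannot interpret a figure such as 28.9 percent without knowing which convention was used.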
minor comments (1)
- [Abstract] The abstract mentions 'four diverse benchmarks' without naming them or providing links to the released code/data repository contents.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and will incorporate the requested clarifications, derivations, and experimental details into a revised manuscript.
- Referee: [Abstract / theoretical section] Abstract and § on theoretical contributions: the claimed proof that the hierarchical design reduces tool-selection error and that the optimization aligns with an information-bottleneck principle is asserted without any equations, derivation steps, or formal statements, preventing verification of whether the alignment is independent or circular with the training objective.
Authors: We acknowledge that the current manuscript states the theoretical claims at a high level without providing the supporting equations or derivations. In the revision we will add a dedicated subsection containing the formal statements, the full derivation showing how the hierarchical design reduces tool-selection error, and the step-by-step alignment of the optimization objective with the information-bottleneck principle, explicitly demonstrating that the alignment is not circular with the training loss. revision: yes
- Referee: [Abstract / ASA mechanism] Abstract and ASA description: the Autonomous Skill Acquisition mechanism is described as enabling agents to 'synthesize, validate, and reuse executable skills,' but supplies no concrete validation criteria, success thresholds, failure-mode handling, or quantification of human oversight, which is load-bearing for both the 28.9% benchmark gains and the self-evolving property.
Authors: The manuscript currently presents ASA at a conceptual level. We will expand the ASA section with explicit validation criteria (including execution success thresholds and consistency checks), failure-mode handling procedures, and a clear statement of the (minimal) human oversight involved in the validation loop, thereby making the self-evolving claims fully verifiable. revision: yes
- Referee: [Abstract / experimental results] Empirical claims: the 28.9% average improvement, elimination of out-of-token failures, and cross-benchmark superiority are stated without dataset details, experimental protocol, error bars, statistical tests, or baseline implementations, rendering the central empirical result unverifiable.
Authors: We agree that the empirical section requires substantially more detail. The revised manuscript will include complete dataset descriptions, the full experimental protocol, per-benchmark results with error bars, statistical significance tests, and explicit descriptions of how each baseline was implemented and evaluated. revision: yes
Circularity Check
No significant circularity; theoretical claims lack equations for inspection
full rationale
The provided abstract asserts a proof that the optimization objective aligns with an information bottleneck principle and that the hierarchical design reduces tool-selection error, but supplies no equations, definitions, or derivation steps. No load-bearing step can be quoted that reduces a claimed result to its own inputs by construction, nor is any fitted parameter renamed as a prediction. The empirical performance numbers are presented as benchmark outcomes rather than derived predictions. The derivation chain is therefore self-contained on the basis of the given text; the absence of mathematical detail prevents any circularity finding.
Paper venue: KDD ’26, August 09–13, 2026, Jeju Island, Republic of Korea.