
arxiv: 2607.01120 · v1 · pith:SB575SE2new · submitted 2026-07-01 · 💻 cs.DC

Next-Generation Agentic Reinforcement Learning Systems Enable Self-Evolving Agents

Pith reviewed 2026-07-02 05:56 UTC · model grok-4.3

classification 💻 cs.DC
keywords self-evolving agents · agentic reinforcement learning · LLM agents · online RL systems · trajectory data protocol · data proxy · evolution control plane · enterprise deployment

The pith

Self-evolving LLM agents require new agentic RL systems built on three specific pillars rather than better algorithms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that current LLM agents stay frozen after deployment because the systems that would support reinforcement learning from their own experience are missing three pieces: a common format for logging agent steps together with learning signals, a governed way to turn real workloads into training data, and an automatic mechanism for deciding when to evolve the agent. Addressing these would let agents improve continuously from their own runs at enterprise scale, without the manual loop of data curation, offline fine-tuning, and re-deployment. This matters for production uses such as coding assistants and research tools, which currently depend on that manual loop.

Core claim

The paper argues that agentic online RL systems, not RL algorithms, are the bottleneck for self-evolving agents in large-scale enterprise settings. Three components are missing: a standardized agent trajectory data protocol that carries step-granularity RL signals, an enterprise-grade data proxy that turns real workloads into governed learning substrates, and a unified evolution control plane that triggers updates automatically. Co-designing systems around these three pillars is presented as the route to agents that learn from deployed workloads; AReaL2.0 instantiates one branch, policy weight updates.

What carries the argument

The three pillars (a standardized trajectory data protocol, a comprehensive data proxy, and a unified evolution control plane), which together reorganize RL infrastructure for online policy updates from agent experience.
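To make the first pillar concrete, here is a minimal sketch of what a step-granularity trajectory record could look like. The paper's actual protocol is not described on this page, so every name below (`Step`, `Trajectory`, `agent_paradigm`, `policy_version`, `labeled_fraction`) is a hypothetical illustration, not the authors' schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Step:
    """One agent step. All field names are illustrative assumptions."""
    index: int
    observation: str
    action: str
    reward: Optional[float] = None   # step-level learning signal; None if unlabeled
    logprob: Optional[float] = None  # behavior-policy log-probability, if it was logged
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Trajectory:
    """A sequence of steps tagged with the paradigm and policy that produced it."""
    trajectory_id: str
    agent_paradigm: str   # free-form tag, e.g. "react" or "tool-calling"
    policy_version: str
    steps: List[Step] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict recurses into the nested Step dataclasses.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "Trajectory":
        raw = json.loads(payload)
        steps = [Step(**s) for s in raw.pop("steps")]
        return cls(steps=steps, **raw)

    def labeled_fraction(self) -> float:
        """Share of steps that carry a reward signal."""
        if not self.steps:
            return 0.0
        return sum(s.reward is not None for s in self.steps) / len(self.steps)
```

A record like this round-trips through JSON, which is one plausible way for heterogeneous agent frameworks to emit trajectories into a shared store. The `labeled_fraction` helper is the kind of per-trajectory statistic a downstream component might aggregate.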

If this is right

  • Policy weights can be updated directly from production agent trajectories.
  • Data from heterogeneous agent paradigms can be unified for RL learning.
  • Workload data becomes usable for training without manual curation through the proxy.
  • Automatic decisions on when to evolve agents reduce reliance on human loops.
  • Reorganized RL systems like AReaL2.0 demonstrate practical architectures for this.
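The data-proxy bullet above can be illustrated with a small sketch. It assumes a proxy that applies two governance steps before a workload event becomes training data: a tenant allow-list and email redaction. Both steps, the event fields (`tenant`, `prompt`, `response`), and the function name `to_learning_record` are assumptions chosen for illustration; the paper's proxy design is not specified on this page.

```python
import re
from typing import Dict, Optional, Set

# Simple email pattern for illustration; real PII detection needs far more than this.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def to_learning_record(
    event: Dict[str, str], allowed_tenants: Set[str]
) -> Optional[Dict[str, str]]:
    """Turn one workload event into a governed training record, or drop it.

    Returns None when the tenant has not opted in to learning.
    """
    if event["tenant"] not in allowed_tenants:
        return None
    return {
        "tenant": event["tenant"],
        "prompt": EMAIL.sub("[EMAIL]", event["prompt"]),
        "response": EMAIL.sub("[EMAIL]", event["response"]),
    }


record = to_learning_record(
    {"tenant": "acme", "prompt": "mail bob@example.com", "response": "ok"},
    allowed_tenants={"acme"},
)
print(record)
```

The point of the sketch is the ordering: policy checks and redaction happen at the proxy, before any data reaches the learner, so the training side never sees ungoverned workload content.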

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the pillars are adopted, agent maintenance could shift from periodic manual updates to continuous online evolution.
  • Integration with existing observability tools might be needed to make the data proxy enterprise-ready.
  • Testing the protocol across different agent types could reveal if it truly supports cross-paradigm learning.
  • Success might encourage similar system designs in other adaptive AI domains beyond agents.

Load-bearing premise

That the three inadequacies in current agentic RL systems are the main barriers and that implementing the three pillars will be enough to achieve self-evolving deployment at enterprise scale.

What would settle it

A production deployment using systems built on the three pillars that still requires manual human intervention for any agent improvements, or shows no measurable policy evolution from trajectory data.

Original abstract

LLM agents are rapidly being deployed in production, including coding assistants, customer-support chatbots, and scientific research assistants, yet they remain fundamentally static in enterprise deployment. The LLM weights, system prompts, tool repertoires, and in-context harnesses are frozen at deployment time, and any improvement requires a manual loop of human-curated data collection, offline fine-tuning, modification of the agentic paradigm, and re-deployment. Recent work on self-evolving agents, such as OpenClaw for individual users, indicates that the next leap in agent capability will come from agents that continually learn from their own experience. In this paper, we argue that this vision for self-evolving agent deployment is being held back for enterprise-level large-scale agentic service not by reinforcement learning (RL) algorithms but by agentic online RL systems. Specifically, current agentic RL systems and the surrounding observability software stack are inadequate along three essential aspects: (i) there is no standardized agent trajectory data protocol capable of carrying RL learning signals at step granularity across heterogeneous agent paradigms; (ii) there is no enterprise-grade comprehensive data proxy that converts real workloads into governed learning substrates; and (iii) there is no unified agent evolution control plane that automatically decides, based on trajectory statistics, when to update policy weights or evolve the in-context harness. The next generation of agentic RL systems must be co-designed around these three pillars, and we sketch concrete architectures, case studies, and counter-arguments. We instantiate one branch through AReaL2.0, reorganizing existing RL infrastructure into an agent-oriented online RL loop for policy weight updates from deployed workloads.
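Item (iii) of the abstract describes a control plane that decides, from trajectory statistics, when to update policy weights or evolve the in-context harness. The following is a hedged sketch of such a decision rule. The statistics used, the thresholds, and the choice to map a larger regression to a weight update are all assumptions made for illustration; the abstract does not state the actual rule.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    HOLD = "hold"
    EVOLVE_HARNESS = "evolve_harness"
    UPDATE_WEIGHTS = "update_weights"


@dataclass(frozen=True)
class TrajectoryStats:
    """Aggregate statistics over a recent window (hypothetical fields)."""
    num_trajectories: int
    success_rate: float
    baseline_success_rate: float


def decide(
    stats: TrajectoryStats,
    min_trajectories: int = 500,   # assumed: too little data means no action
    harness_drop: float = 0.02,    # assumed: small regression, try the cheaper fix
    weights_drop: float = 0.10,    # assumed: large regression, retrain the policy
) -> Decision:
    """Pick an evolution action from windowed trajectory statistics."""
    if stats.num_trajectories < min_trajectories:
        return Decision.HOLD
    drop = stats.baseline_success_rate - stats.success_rate
    if drop >= weights_drop:
        return Decision.UPDATE_WEIGHTS
    if drop >= harness_drop:
        return Decision.EVOLVE_HARNESS
    return Decision.HOLD
```

For example, `decide(TrajectoryStats(1000, 0.60, 0.80))` returns `Decision.UPDATE_WEIGHTS` under these default thresholds, while the same rates over only 100 trajectories return `Decision.HOLD`.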

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript argues that self-evolving LLM agents for enterprise use are limited not by RL algorithms but by the surrounding agentic online RL systems. It identifies three key inadequacies: (i) no standardized agent trajectory data protocol for RL signals at step granularity, (ii) no enterprise-grade data proxy for governed learning, and (iii) no unified agent evolution control plane for automatic policy updates. The authors advocate co-designing systems around these pillars, provide sketches of architectures, case studies, and counter-arguments, and describe an instantiation via AReaL2.0 for reorganizing RL infrastructure into an agent-oriented online loop.

Significance. If the central thesis holds, the paper could usefully redirect community attention toward systems-level infrastructure for agentic RL, potentially enabling scalable self-evolution. The sketches of architectures and counter-arguments provide a constructive starting point for discussion in the field. However, the lack of any empirical support or references validating the sufficiency of existing RL methods under the proposed setups substantially reduces the immediate significance.

major comments (2)
  1. [Abstract] The assertion that the vision 'is being held back ... not by reinforcement learning (RL) algorithms but by agentic online RL systems' is load-bearing for the entire argument but is presented without any supporting experiment, ablation study, or citation demonstrating that standard RL algorithms can produce stable updates from the proposed step-granularity trajectories in production agent workloads.
  2. [Abstract / Proposed Pillars] The claim that addressing the three pillars is sufficient for self-evolving agent deployment at enterprise scale rests on the untested assumption that current RL methods require no additional machinery for credit assignment, exploration, or safety when operating under the sketched control plane and data formats.
minor comments (1)
  1. The manuscript would benefit from explicit section headings or numbered sections to facilitate reference to specific arguments about the architectures and case studies.
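For readers unfamiliar with the credit-assignment concern raised in major comment 2, the simplest baseline is the discounted return-to-go, G_t = r_t + γ·G_{t+1}, which spreads a sparse terminal reward back over earlier steps. The sketch below is the textbook recursion, not a method taken from the paper; the function name and default discount are illustrative choices.

```python
from typing import List


def returns_to_go(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Discounted return-to-go for each step: G_t = r_t + gamma * G_{t+1}."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


# A three-step trajectory with a single terminal reward.
print(returns_to_go([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

Whether a baseline this simple is adequate for long, tool-heavy agent trajectories is exactly the open question the referee raises.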

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments on our position paper. We address each major comment below, noting the conceptual nature of the work and where revisions can clarify assumptions.

Point-by-point responses
  1. Referee: [Abstract] The assertion that the vision 'is being held back ... not by reinforcement learning (RL) algorithms but by agentic online RL systems' is load-bearing for the entire argument but is presented without any supporting experiment, ablation study, or citation demonstrating that standard RL algorithms can produce stable updates from the proposed step-granularity trajectories in production agent workloads.

    Authors: The manuscript is a systems position paper that identifies infrastructure bottlenecks based on analysis of existing production deployments and literature on agent limitations. No new experiments are included because the contribution centers on co-design of data protocols, proxies, and control planes rather than RL algorithm validation. We will revise the abstract and add a related-work subsection with citations to trajectory-based RL methods that have demonstrated stability in offline and online settings with fine-grained data. revision: partial

  2. Referee: [Abstract / Proposed Pillars] The claim that addressing the three pillars is sufficient for self-evolving agent deployment at enterprise scale rests on the untested assumption that current RL methods require no additional machinery for credit assignment, exploration, or safety when operating under the sketched control plane and data formats.

    Authors: The paper sketches architectures and includes counter-arguments addressing potential RL challenges under the proposed setups. We acknowledge that sufficiency is not empirically demonstrated and that the manuscript assumes standard RL techniques can leverage the new data formats without major extensions. We will revise the discussion of the pillars to explicitly list remaining open RL questions (e.g., safety under automatic updates) and frame the pillars as necessary but not necessarily complete infrastructure. revision: partial

standing simulated objections not resolved
  • Empirical validation via experiments or ablations showing that standard RL algorithms suffice under the proposed trajectory protocols and control plane without additional machinery for credit assignment, exploration, or safety.

Circularity Check

0 steps flagged

No circularity: position paper with no derivations or fitted quantities

Full rationale

The manuscript advances a position argument that self-evolving agent deployment at enterprise scale is limited by three system-level inadequacies rather than by RL algorithms themselves. It describes current shortcomings, proposes three pillars (standardized trajectory protocol, data proxy, evolution control plane), and sketches architectures including AReaL2.0, but contains no equations, fitted parameters, derivations, or load-bearing self-citations that reduce any claim to its own inputs by construction. The central claim rests on stated premises about observability stacks and deployment realities, which are externally falsifiable and not internally self-referential. This is the expected non-finding for a non-derivational position paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no technical derivations, parameters, or entities are described.

pith-pipeline@v0.9.1-grok · 5907 in / 1127 out tokens · 26041 ms · 2026-07-02T05:56:21.729604+00:00 · methodology

