Next-Generation Agentic Reinforcement Learning Systems Enable Self-Evolving Agents

Binhang Yuan; Chuyi He; Haitao Wang; Hao Dai; Huaijie Wang; Jiale Li; Jiarui Zhang; Jiawei Zhang; Jiaxuan Gao; Jun Mei

arxiv: 2607.01120 · v1 · pith:SB575SE2new · submitted 2026-07-01 · 💻 cs.DC

Ran Yan, Wei Fu, Jiale Li, Shusheng Xu, Zhiyu Mei, Jiaxuan Gao, Jiarui Zhang, Xujie Shen, Hao Dai, Chuyi He, Zhen Pu, Jun Mei, Zhiyao Lin, Haitao Wang, Zhiqiang Ding, Jiawei Zhang, Huaijie Wang, Ruida Xu, Youhe Jiang, Yi Wu, Tongkai Yang, Binhang Yuan

Pith reviewed 2026-07-02 05:56 UTC · model grok-4.3

keywords: self-evolving agents · agentic reinforcement learning · LLM agents · online RL systems · trajectory data protocol · data proxy · evolution control plane · enterprise deployment

The pith

Self-evolving LLM agents require new agentic RL systems built on three specific pillars rather than better algorithms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that current LLM agents stay frozen after deployment because the systems that would support reinforcement learning from their own experience are missing key pieces. It names three gaps: no common format for logging agent steps together with learning signals, no governed way to turn real workloads into training data, and no automatic mechanism for deciding when to evolve the agent. Closing these gaps would, on the paper's account, let agents improve continuously from their own runs at enterprise scale without constant human oversight. The shift matters for production uses such as coding assistants and research tools, which currently depend on manual retraining loops.

Core claim

The paper argues that agentic online RL systems, not RL algorithms, are the bottleneck for self-evolving agents in large-scale enterprise settings. Three pieces are missing: a standardized agent trajectory data protocol that carries RL signals at step granularity, an enterprise-grade data proxy that turns real workloads into governed learning substrates, and a unified evolution control plane that triggers updates automatically. Co-designing systems around these three pillars, with AReaL2.0 as one instantiation, is presented as the route to agents that learn from deployed workloads.

What carries the argument

Three pillars carry the argument: a standardized trajectory data protocol, a comprehensive data proxy, and a unified evolution control plane. Together they reorganize RL infrastructure for online policy updates from agent experience.
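To make the first pillar concrete, here is a minimal Python sketch of what a step-granularity trajectory record might look like. The abstract describes the protocol only at a high level, so every class and field name below (Step, Trajectory, agent_paradigm, policy_version, logprob) is a hypothetical illustration and not the paper's schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One agent step. All field names are hypothetical, not from the paper."""
    index: int
    observation: str
    action: str
    reward: Optional[float] = None    # step-level learning signal, if one was logged
    logprob: Optional[float] = None   # behavior-policy log-probability, if available
    done: bool = False


@dataclass
class Trajectory:
    """A sequence of steps plus the metadata a learner would need to group them."""
    trajectory_id: str
    agent_paradigm: str               # e.g. "react" or "multi_agent" (assumed labels)
    policy_version: str
    steps: List[Step] = field(default_factory=list)

    def total_reward(self) -> float:
        # Steps without a logged reward contribute zero.
        return sum(s.reward or 0.0 for s in self.steps)

    def to_json(self) -> str:
        return json.dumps(asdict(self))


def from_json(payload: str) -> Trajectory:
    """Rebuild a Trajectory from the JSON produced by Trajectory.to_json."""
    raw = json.loads(payload)
    steps = [Step(**s) for s in raw.pop("steps")]
    return Trajectory(steps=steps, **raw)
```

The point of the sketch is that rewards and log-probabilities attach to individual steps rather than to the whole episode, and that the record serializes to a neutral format that different agent frameworks could emit.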

If this is right

Policy weights can be updated directly from production agent trajectories.
Data from heterogeneous agent paradigms can be unified for RL learning.
Workload data becomes usable for training without manual curation through the proxy.
Automatic decisions on when to evolve agents reduce reliance on human loops.
Reorganized RL systems like AReaL2.0 demonstrate practical architectures for this.
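The data-proxy idea in the list above can be illustrated with a small Python sketch. It assumes a deliberately simple governance policy: records carry a training-permission flag, and e-mail addresses are redacted before anything reaches a learner. The WorkloadRecord type, the training_allowed flag, and the govern function are hypothetical names for illustration; the paper's abstract does not specify the proxy's interface.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Iterator

# Simple e-mail pattern for illustration; real redaction would need far more.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


@dataclass(frozen=True)
class WorkloadRecord:
    tenant: str
    text: str
    training_allowed: bool  # hypothetical governance flag


def govern(records: Iterable[WorkloadRecord]) -> Iterator[WorkloadRecord]:
    """Drop records without training permission and redact e-mail addresses."""
    for record in records:
        if not record.training_allowed:
            continue
        redacted = EMAIL.sub("[EMAIL]", record.text)
        yield WorkloadRecord(record.tenant, redacted, True)
```

A real proxy would presumably also handle access control, retention, and audit logging; the sketch only shows the shape of a filter that sits between production traffic and training data.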

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the pillars are adopted, agent maintenance could shift from periodic manual updates to continuous online evolution.
Integration with existing observability tools might be needed to make the data proxy enterprise-ready.
Testing the protocol across different agent types could reveal if it truly supports cross-paradigm learning.
Success might encourage similar system designs in other adaptive AI domains beyond agents.

Load-bearing premise

That the three inadequacies in current agentic RL systems are the main barriers and that implementing the three pillars will be enough to achieve self-evolving deployment at enterprise scale.

What would settle it

A production deployment using systems built on the three pillars that still requires manual human intervention for any agent improvements, or shows no measurable policy evolution from trajectory data.

Original abstract

LLM agents are rapidly being deployed in production, including coding assistants, customer-support chatbots, and scientific research assistants, yet they remain fundamentally static in enterprise deployment. The LLM weights, system prompts, tool repertoires, and in-context harnesses are frozen at deployment time, and any improvement requires a manual loop of human-curated data collection, offline fine-tuning, modification of the agentic paradigm, and re-deployment. Recent work on self-evolving agents, such as OpenClaw for individual users, indicates that the next leap in agent capability will come from agents that continually learn from their own experience. In this paper, we argue that this vision for self-evolving agent deployment is being held back for enterprise-level large-scale agentic service not by reinforcement learning (RL) algorithms but by agentic online RL systems. Specifically, current agentic RL systems and the surrounding observability software stack are inadequate along three essential aspects: (i) there is no standardized agent trajectory data protocol capable of carrying RL learning signals at step granularity across heterogeneous agent paradigms; (ii) there is no enterprise-grade comprehensive data proxy that converts real workloads into governed learning substrates; and (iii) there is no unified agent evolution control plane that automatically decides, based on trajectory statistics, when to update policy weights or evolve the in-context harness. The next generation of agentic RL systems must be co-designed around these three pillars, and we sketch concrete architectures, case studies, and counter-arguments. We instantiate one branch through AReaL2.0, reorganizing existing RL infrastructure into an agent-oriented online RL loop for policy weight updates from deployed workloads.
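The third pillar in the abstract, a control plane that decides from trajectory statistics when to update policy weights or evolve the in-context harness, can be sketched in Python as a threshold rule. The abstract does not say which statistics or thresholds are used, so the success-rate statistic, the Thresholds values, and the mapping from drop size to action are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Decision(Enum):
    HOLD = "hold"
    EVOLVE_HARNESS = "evolve_harness"
    UPDATE_WEIGHTS = "update_weights"


@dataclass(frozen=True)
class Thresholds:
    # All values are illustrative assumptions, not taken from the paper.
    min_trajectories: int = 100   # do nothing until enough evidence exists
    harness_drop: float = 0.05    # small regression: adjust prompts or tools
    weights_drop: float = 0.15    # large regression: trigger a weight update


def decide(
    baseline_success: float,
    recent_outcomes: Sequence[bool],
    thresholds: Thresholds = Thresholds(),
) -> Decision:
    """Pick an action from the drop in success rate relative to a baseline."""
    if len(recent_outcomes) < thresholds.min_trajectories:
        return Decision.HOLD
    recent_success = sum(recent_outcomes) / len(recent_outcomes)
    drop = baseline_success - recent_success
    if drop >= thresholds.weights_drop:
        return Decision.UPDATE_WEIGHTS
    if drop >= thresholds.harness_drop:
        return Decision.EVOLVE_HARNESS
    return Decision.HOLD
```

For example, with a baseline of 0.9 and 100 recent trajectories of which 70 succeeded, the rule returns Decision.UPDATE_WEIGHTS. A production control plane would need guardrails this sketch omits, such as rollback and safety checks before any automatic update.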

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Position paper names three infrastructure gaps for self-evolving agents but supplies zero evidence that standard RL would actually work once those gaps close.

The core takeaway is that this is a vision piece claiming current RL algorithms are sufficient for online agent self-evolution and that only three system pieces are missing: a standardized step-granularity trajectory protocol, an enterprise data proxy, and a unified evolution control plane. They sketch architectures around those pillars and mention reorganizing existing infrastructure into AReaL2.0.

What the paper does reasonably is lay out practical deployment barriers in plain terms. The three pillars map to real pain points when moving from research prototypes like OpenClaw to production services where agents must improve without constant human retraining. Framing the problem as co-design of data formats, proxies, and control logic is a useful organizational move even if the individual ideas draw from prior systems work.

The soft spot is exactly the one the stress-test flags. The argument that RL algorithms themselves are not the blocker rests on assertion alone. There are no experiments, ablations, or even references showing that off-the-shelf methods can ingest the proposed trajectories, assign credit stably, or handle exploration and safety under the sketched control plane. Without that, the claim that fixing the three pillars is sufficient remains untested. The manuscript also gives no measurements of current observability stacks or any concrete counter-argument handling.

This is aimed at applied researchers and engineers working on production LLM agents who need a checklist of missing pieces. It has little to offer readers looking for new algorithms, validated architectures, or reproducible results. The thinking is coherent on its own terms and engages the literature, but the evidential bar for a systems paper is not met.

I would not send this to peer review. It reads like an extended position statement that would benefit from at least a small-scale demonstration before taking up referee time.

Referee Report

2 major / 1 minor

Summary. The manuscript argues that self-evolving LLM agents for enterprise use are limited not by RL algorithms but by the surrounding agentic online RL systems. It identifies three key inadequacies: (i) no standardized agent trajectory data protocol for RL signals at step granularity, (ii) no enterprise-grade data proxy for governed learning, and (iii) no unified agent evolution control plane for automatic policy updates. The authors advocate co-designing systems around these pillars, provide sketches of architectures, case studies, and counter-arguments, and describe an instantiation via AReaL2.0 for reorganizing RL infrastructure into an agent-oriented online loop.

Significance. If the central thesis holds, the paper could usefully redirect community attention toward systems-level infrastructure for agentic RL, potentially enabling scalable self-evolution. The sketches of architectures and counter-arguments provide a constructive starting point for discussion in the field. However, the lack of any empirical support or references validating the sufficiency of existing RL methods under the proposed setups substantially reduces the immediate significance.

major comments (2)

[Abstract] The assertion that the vision 'is being held back ... not by reinforcement learning (RL) algorithms but by agentic online RL systems' is load-bearing for the entire argument but is presented without any supporting experiment, ablation study, or citation demonstrating that standard RL algorithms can produce stable updates from the proposed step-granularity trajectories in production agent workloads.
[Abstract / Proposed Pillars] The claim that addressing the three pillars is sufficient for self-evolving agent deployment at enterprise scale rests on the untested assumption that current RL methods require no additional machinery for credit assignment, exploration, or safety when operating under the sketched control plane and data formats.

minor comments (1)

The manuscript would benefit from explicit section headings or numbered sections to facilitate reference to specific arguments about the architectures and case studies.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments on our position paper. We address each major comment below, noting the conceptual nature of the work and where revisions can clarify assumptions.

Referee: [Abstract] The assertion that the vision 'is being held back ... not by reinforcement learning (RL) algorithms but by agentic online RL systems' is load-bearing for the entire argument but is presented without any supporting experiment, ablation study, or citation demonstrating that standard RL algorithms can produce stable updates from the proposed step-granularity trajectories in production agent workloads.

Authors: The manuscript is a systems position paper that identifies infrastructure bottlenecks based on analysis of existing production deployments and literature on agent limitations. No new experiments are included because the contribution centers on co-design of data protocols, proxies, and control planes rather than RL algorithm validation. We will revise the abstract and add a related-work subsection with citations to trajectory-based RL methods that have demonstrated stability in offline and online settings with fine-grained data.

Revision: partial

Referee: [Abstract / Proposed Pillars] The claim that addressing the three pillars is sufficient for self-evolving agent deployment at enterprise scale rests on the untested assumption that current RL methods require no additional machinery for credit assignment, exploration, or safety when operating under the sketched control plane and data formats.

Authors: The paper sketches architectures and includes counter-arguments addressing potential RL challenges under the proposed setups. We acknowledge that sufficiency is not empirically demonstrated and that the manuscript assumes standard RL techniques can leverage the new data formats without major extensions. We will revise the discussion of the pillars to explicitly list remaining open RL questions (e.g., safety under automatic updates) and frame the pillars as necessary but not necessarily complete infrastructure.

Revision: partial

standing simulated objections not resolved

Empirical validation via experiments or ablations showing that standard RL algorithms suffice under the proposed trajectory protocols and control plane without additional machinery for credit assignment, exploration, or safety.

Circularity Check

0 steps flagged

No circularity: position paper with no derivations or fitted quantities

The manuscript advances a position argument that self-evolving agent deployment at enterprise scale is limited by three system-level inadequacies rather than by RL algorithms themselves. It describes current shortcomings, proposes three pillars (standardized trajectory protocol, data proxy, evolution control plane), and sketches architectures including AReaL2.0, but contains no equations, fitted parameters, derivations, or load-bearing self-citations that reduce any claim to its own inputs by construction. The central claim rests on stated premises about observability stacks and deployment realities, which are externally falsifiable and not internally self-referential. This is the expected non-finding for a non-derivational position paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no technical derivations, parameters, or entities are described.

Reference graph

Works this paper leans on

38 extracted references · 20 canonical work pages · 10 internal anchors

[1] OpenClaw. Openclaw: The ai that actually does things, 2026.
[2] Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, and Ling Yang. Openclaw-rl: Train any agent simply by talking. arXiv preprint arXiv:2603.10165, 2026.
[3] Peng Xia, Jianwen Chen, Xinyu Yang, Haoqin Tu, Jiaqi Liu, Kaiwen Xiong, Siwei Han, Shi Qiu, Haonian Ji, Yuyin Zhou, et al. Metaclaw: Just talk–an agent that meta-learns and evolves in the wild. arXiv preprint arXiv:2603.17187, 2026.
[4] Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. Skillclaw: Let skills evolve collectively with agentic evolver. arXiv preprint arXiv:2604.08377, 2026.
[5] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in neural information processing systems, 36:8634–8652, 2023.
[6] Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Menglong Zhang, Yihang Chen, Jinsong Li, et al. Memento-skills: Let agents design agents. arXiv preprint arXiv:2603.18743, 2026.
[7] Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, et al. Agentic context engineering: Evolving contexts for self-improving language models, 2026.
[8] Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, WANG JIASHU, et al. Areal: A large-scale asynchronous reinforcement learning system for language reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
[9] Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, et al. A survey of reinforcement learning for large reasoning models. arXiv preprint arXiv:2509.08827, 2025.
[10] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
[11] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022.
[12] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[13] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023.
[14] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[15] Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhongzhi Li, Xiangyuan Xue, Yijiang Li, et al. The landscape of agentic reinforcement learning for llms: A survey. arXiv preprint arXiv:2509.02547, 2025.
[16] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
[17] Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. Textgrad: Automatic "differentiation" via text. arXiv preprint arXiv:2406.07496, 2024.
[18] Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, and Yi Wu. Unlocking long-horizon agentic search with large-scale end-to-end rl. In The Fourteenth International Conference on Learning Representations, 2026.
[19] Zhiyu Mei, Wei Fu, Kaiwei Li, Guangju Wang, Huanchen Zhang, and Yi Wu. Real: Efficient rlhf training of large language models with parameter reallocation. arXiv preprint arXiv:2406.14088, 2024.
[20] Yinmin Zhong, Zili Zhang, Bingyang Wu, Shengyu Liu, Yukun Chen, Changyi Wan, Hanpeng Hu, Lei Xia, Ranchen Ming, Yibo Zhu, et al. Optimizing RLHF training for large language models with stage fusion. In 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25), pages 489–503, 2025.
[21] Junyu Wu, Weiming Chang, Xiaotao Liu, Guanyou He, Haoqiang Hong, Boqi Liu, Hongtao Tian, Tao Yang, Yunsheng Shi, Feng Lin, et al. G-core: A simple, scalable and balanced rlhf trainer. arXiv preprint arXiv:2507.22789, 2025.
[22] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025.
[23] Yinmin Zhong, Zili Zhang, Xiaoniu Song, Hanpeng Hu, Chao Jin, Bingyang Wu, Nuo Chen, Yukun Chen, Yu Zhou, Changyi Wan, et al. Streamrl: Scalable, heterogeneous, and elastic rl for llms with disaggregated stream generation. arXiv preprint arXiv:2504.15930, 2025.
[24] Zhenyu Han, Ansheng You, Haibo Wang, Kui Luo, Guang Yang, Wenqi Shi, Menglong Chen, Sicheng Zhang, Zeshun Lan, Chunshi Deng, et al. Asyncflow: An asynchronous streaming rl framework for efficient llm post-training. arXiv preprint arXiv:2507.01663, 2025.
[25] Anthropic. Introducing the Model Context Protocol. https://www.anthropic.com/news/model-context-protocol, 2024.
[26] Google. Agent2agent (a2a) protocol. https://a2a-protocol.org/, 2025.
[27] Yingxuan Yang, Huacan Chai, Yuanyi Song, Siyuan Qi, Muning Wen, Ning Li, Junwei Liao, Haoyi Hu, Jianghao Lin, Gaowei Chang, et al. A survey of ai agent protocols. arXiv preprint arXiv:2504.16736, 2025.
[28] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
[29] Sabela Ramos, Sertan Girgin, Léonard Hussenot, Damien Vincent, Hanna Yakubovich, Daniel Toyama, Anita Gergely, Piotr Stanczyk, Raphael Marinier, Jeremiah Harmsen, et al. Rlds: an ecosystem to generate, share and use datasets in reinforcement learning. arXiv preprint arXiv:2111.02767, 2021.
[30] Yueqi Song, Ketan Ramaneti, Zaid Sheikh, Ziru Chen, Boyu Gou, Tianbao Xie, Yiheng Xu, Danyang Zhang, Apurva Gandhi, Fan Yang, et al. Agent data protocol: Unifying datasets for diverse, effective fine-tuning of llm agents. arXiv preprint arXiv:2510.24702, 2025.
[31] LangChain, Inc. LangChain: The agent engineering platform. https://github.com/langchain-ai/langchain, 2025.
[32] LangChain, Inc. LangGraph: Build resilient language agents as graphs. https://github.com/langchain-ai/langgraph, 2025.
[33] crewAI, Inc. CrewAI: Framework for orchestrating role-playing, autonomous AI agents. https://github.com/crewAIInc/crewAI, 2025.
[34] OpenAI. OpenAI Agents SDK: A lightweight, powerful framework for multi-agent workflows. https://github.com/openai/openai-agents-python, 2025.
[35] Anthropic. Claude Agent SDK. https://github.com/anthropics/claude-agent-sdk-python, 2025.
[36] Zhiheng Xi, Chenyang Liao, Guanyu Li, Zhihao Zhang, Wenxiang Chen, Binghai Wang, Senjie Jin, Yuhao Zhou, Jian Guan, Wei Wu, et al. Agentprm: Process reward models for llm agents via step-wise promise and progress. In Proceedings of the ACM Web Conference 2026, pages 4184–4195, 2026.
[37] Yinjie Wang, Tianbao Xie, Ke Shen, Mengdi Wang, and Ling Yang. Rlanything: Forge environment, policy, and reward model in completely dynamic rl system. arXiv preprint arXiv:2602.02488, 2026.
[38] Nous Research. Hermes agent: The self-improving ai agent built by nous research. https://github.com/NousResearch/hermes-agent, 2026. Accessed: 2026-06-30.

[1] [1]

Openclaw: The ai that actually does things, 2026

OpenClaw. Openclaw: The ai that actually does things, 2026

2026

[2] [2]

OpenClaw-RL: Train Any Agent Simply by Talking

Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, and Ling Yang. Openclaw-rl: Train any agent simply by talking. arXiv preprint arXiv:2603.10165, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [3]

Metaclaw: Just talk–an agent that meta-learns and evolves in the wild

Peng Xia, Jianwen Chen, Xinyu Yang, Haoqin Tu, Jiaqi Liu, Kaiwen Xiong, Siwei Han, Shi Qiu, Haonian Ji, Yuyin Zhou, et al. Metaclaw: Just talk–an agent that meta-learns and evolves in the wild. arXiv preprint arXiv:2603.17187, 2026

work page arXiv 2026

[4] [4]

SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. Skillclaw: Let skills evolve collectively with agentic evolver. arXiv preprint arXiv:2604.08377, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[5] [5]

Reflexion: Language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in neural information processing systems, 36:8634–8652, 2023

2023

[6] [6]

arXiv preprint arXiv:2603.18743 , year=

Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Menglong Zhang, Yihang Chen, Jinsong Li, et al. Memento-skills: Let agents design agents. arXiv preprint arXiv:2603.18743, 2026

work page arXiv 2026

[7] [7]

Agentic context engineering: Evolving contexts for self-improving language models

Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, et al. Agentic context engineering: Evolving contexts for self-improving language models. 2026

2026

[8] [8]

Areal: A large-scale asynchronous reinforcement learning system for language reasoning

Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, WANG JIASHU, et al. Areal: A large-scale asynchronous reinforcement learning system for language reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025

[9] [9]

A Survey of Reinforcement Learning for Large Reasoning Models

Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, et al. A survey of reinforcement learning for large reasoning models. arXiv preprint arXiv:2509.08827, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [10]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

2022

[11] [11]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[12] [12]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[13] [13]

Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

2023

[14] [14]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhongzhi Li, Xi- angyuan Xue, Yijiang Li, et al. The landscape of agentic reinforcement learning for llms: A survey. arXiv preprint arXiv:2509.02547, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025

[16] [16]

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[17] [17]

TextGrad: Automatic "Differentiation" via Text

Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. Textgrad: Automatic" differentiation" via text. arXiv preprint arXiv:2406.07496, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[18] [18]

Unlocking long-horizon agentic search with large-scale end-to-end rl

Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, and Yi Wu. Unlocking long-horizon agentic search with large-scale end-to-end rl. In The Fourteenth International Conference on Learning Representations, 2026

2026

[19] [19]

Real: Efficient rlhf training of large language models with parameter reallocation

Zhiyu Mei, Wei Fu, Kaiwei Li, Guangju Wang, Huanchen Zhang, and Yi Wu. Real: Efficient rlhf training of large language models with parameter reallocation. arXiv preprint arXiv:2406.14088, 2024

work page arXiv 2024

[20] [20]

Optimizing {RLHF} training for large language models with stage fusion

Yinmin Zhong, Zili Zhang, Bingyang Wu, Shengyu Liu, Yukun Chen, Changyi Wan, Hanpeng Hu, Lei Xia, Ranchen Ming, Yibo Zhu, et al. Optimizing {RLHF} training for large language models with stage fusion. In 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25), pages 489–503, 2025

[21]

G-Core: A simple, scalable and balanced RLHF trainer

Junyu Wu, Weiming Chang, Xiaotao Liu, Guanyou He, Haoqiang Hong, Boqi Liu, Hongtao Tian, Tao Yang, Yunsheng Shi, Feng Lin, et al. G-Core: A simple, scalable and balanced RLHF trainer. arXiv preprint arXiv:2507.22789, 2025.

[22]

HybridFlow: A flexible and efficient RLHF framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025.

[23]

StreamRL: Scalable, heterogeneous, and elastic RL for LLMs with disaggregated stream generation

Yinmin Zhong, Zili Zhang, Xiaoniu Song, Hanpeng Hu, Chao Jin, Bingyang Wu, Nuo Chen, Yukun Chen, Yu Zhou, Changyi Wan, et al. StreamRL: Scalable, heterogeneous, and elastic RL for LLMs with disaggregated stream generation. arXiv preprint arXiv:2504.15930, 2025.

[24]

AsyncFlow: An asynchronous streaming RL framework for efficient LLM post-training

Zhenyu Han, Ansheng You, Haibo Wang, Kui Luo, Guang Yang, Wenqi Shi, Menglong Chen, Sicheng Zhang, Zeshun Lan, Chunshi Deng, et al. AsyncFlow: An asynchronous streaming RL framework for efficient LLM post-training. arXiv preprint arXiv:2507.01663, 2025.

[25]

Introducing the Model Context Protocol

Anthropic. Introducing the Model Context Protocol. https://www.anthropic.com/news/model-context-protocol, 2024.

[26]

Agent2Agent (A2A) protocol

Google. Agent2Agent (A2A) protocol. https://a2a-protocol.org/, 2025.

[27]

A survey of AI agent protocols

Yingxuan Yang, Huacan Chai, Yuanyi Song, Siyuan Qi, Muning Wen, Ning Li, Junwei Liao, Haoyi Hu, Jianghao Lin, Gaowei Chang, et al. A survey of AI agent protocols. arXiv preprint arXiv:2504.16736, 2025.

[28]

D4RL: Datasets for Deep Data-Driven Reinforcement Learning

Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.

[29]

RLDS: An ecosystem to generate, share and use datasets in reinforcement learning

Sabela Ramos, Sertan Girgin, Léonard Hussenot, Damien Vincent, Hanna Yakubovich, Daniel Toyama, Anita Gergely, Piotr Stanczyk, Raphael Marinier, Jeremiah Harmsen, et al. RLDS: An ecosystem to generate, share and use datasets in reinforcement learning. arXiv preprint arXiv:2111.02767, 2021.

[30]

Agent data protocol: Unifying datasets for diverse, effective fine-tuning of LLM agents

Yueqi Song, Ketan Ramaneti, Zaid Sheikh, Ziru Chen, Boyu Gou, Tianbao Xie, Yiheng Xu, Danyang Zhang, Apurva Gandhi, Fan Yang, et al. Agent data protocol: Unifying datasets for diverse, effective fine-tuning of LLM agents. arXiv preprint arXiv:2510.24702, 2025.

[31]

LangChain: The agent engineering platform

LangChain, Inc. LangChain: The agent engineering platform. https://github.com/langchain-ai/langchain, 2025.

[32]

LangGraph: Build resilient language agents as graphs

LangChain, Inc. LangGraph: Build resilient language agents as graphs. https://github.com/langchain-ai/langgraph, 2025.

[33]

CrewAI: Framework for orchestrating role-playing, autonomous AI agents

crewAI, Inc. CrewAI: Framework for orchestrating role-playing, autonomous AI agents. https://github.com/crewAIInc/crewAI, 2025.

[34]

OpenAI Agents SDK: A lightweight, powerful framework for multi-agent workflows

OpenAI. OpenAI Agents SDK: A lightweight, powerful framework for multi-agent workflows. https://github.com/openai/openai-agents-python, 2025.

[35]

Claude Agent SDK

Anthropic. Claude Agent SDK. https://github.com/anthropics/claude-agent-sdk-python, 2025.

[36]

AgentPRM: Process reward models for LLM agents via step-wise promise and progress

Zhiheng Xi, Chenyang Liao, Guanyu Li, Zhihao Zhang, Wenxiang Chen, Binghai Wang, Senjie Jin, Yuhao Zhou, Jian Guan, Wei Wu, et al. AgentPRM: Process reward models for LLM agents via step-wise promise and progress. In Proceedings of the ACM Web Conference 2026, pages 4184–4195, 2026.

[37]

RLAnything: Forge environment, policy, and reward model in completely dynamic RL system

Yinjie Wang, Tianbao Xie, Ke Shen, Mengdi Wang, and Ling Yang. RLAnything: Forge environment, policy, and reward model in completely dynamic RL system. arXiv preprint arXiv:2602.02488, 2026.

[38]

Hermes Agent: The self-improving AI agent built by Nous Research

Nous Research. Hermes Agent: The self-improving AI agent built by Nous Research. https://github.com/NousResearch/hermes-agent, 2026. Accessed: 2026-06-30.
