pith. machine review for the scientific record.

arxiv: 2605.08013 · v1 · submitted 2026-05-08 · 💻 cs.AI

Recognition: no theorem link

Learning CLI Agents with Structured Action Credit under Selective Observation

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:55 UTC · model grok-4.3

classification 💻 cs.AI
keywords CLI agents · reinforcement learning · credit assignment · selective observation · abstract syntax tree · ShellOps dataset · agentic RL

The pith

CLI agents can assign credit to actions using A³, which derives turn-level advantages from episode-level feedback, AST sub-chain residuals, and trajectory margins without increasing RL complexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses two challenges in training CLI agents with reinforcement learning: handling partial observations of large codebases, and assigning credit across long action sequences with sparse terminal rewards. It introduces σ-Reveal, an inference-time method for selecting relevant context within a token budget, and Action Advantage Assignment (A³), which creates finer-grained advantage signals. A³ leverages the structured attributes of CLI commands, including abstract syntax trees over sub-actions and margins between trajectories, to build turn-level advantages from relative episode feedback while keeping the same computational cost as basic agentic RL. The authors also provide the ShellOps dataset for evaluating such agents on information extraction and file editing tasks in shell environments.
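As a rough illustration of the selective-observation idea (not the paper's actual mechanism, which is not specified at this level of detail), the sketch below greedily packs observation fragments under a token budget; the relevance scorer and whitespace tokenizer are hypothetical stand-ins:

    from dataclasses import dataclass

    @dataclass
    class Chunk:
        text: str          # candidate observation fragment (file excerpt, command output)
        relevance: float   # assumed task-relevance score; how it is computed is not specified

    def count_tokens(text: str) -> int:
        # Placeholder tokenizer: whitespace split stands in for the model tokenizer.
        return len(text.split())

    def select_context(chunks: list[Chunk], budget: int) -> list[Chunk]:
        """Greedily keep the highest-relevance chunks that fit the token budget."""
        selected, used = [], 0
        for chunk in sorted(chunks, key=lambda c: c.relevance, reverse=True):
            cost = count_tokens(chunk.text)
            if used + cost <= budget:
                selected.append(chunk)
                used += cost
        return selected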

Core claim

The central discovery is that turn-level advantages for CLI agent actions can be constructed from episode-level relative feedback combined with abstract syntax tree based action sub-chain residuals and tree-level trajectory margins, all while preserving the algorithmic complexity of standard agentic reinforcement learning.

What carries the argument

Action Advantage Assignment (A³), which turns episode feedback, AST-based sub-chain residuals, and trajectory margins into turn-level advantages for credit assignment in agent RL.
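A minimal sketch of how the three signals could be fused into per-turn advantages, assuming a GRPO-style group standardization for the episode term; the fusion weights and the precise residual and margin definitions are assumptions, since the paper's formulas are not reproduced here:

    import numpy as np

    def a3_turn_advantages(
        episode_rewards: np.ndarray,  # shape (G,): terminal rewards for a group of G rollouts
        ast_residuals: np.ndarray,    # shape (G, T): assumed per-turn AST sub-chain residuals
        traj_margins: np.ndarray,     # shape (G,): assumed tree-level trajectory margins
        w_ast: float = 0.5,           # hypothetical fusion weights
        w_margin: float = 0.5,
    ) -> np.ndarray:
        """Broadcast group-relative episode feedback to every turn, then add the
        structured per-turn and per-trajectory terms. Returns shape (G, T)."""
        ep_adv = (episode_rewards - episode_rewards.mean()) / (episode_rewards.std() + 1e-8)
        turn_adv = ep_adv[:, None] + w_ast * ast_residuals
        return turn_adv + w_margin * traj_margins[:, None]

Each term amounts to a constant number of elementwise passes over the G × T turn grid, which is consistent with the claim that the construction preserves the complexity of standard agentic RL.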

If this is right

  • Turn-level advantages improve credit assignment for long-horizon CLI tasks.
  • Selective observation with σ-Reveal allows effective operation under partial views of codebases.
  • The method applies directly to verifiable tasks like file editing and information extraction.
  • Preserving standard RL complexity means it can be integrated into existing agent training pipelines without overhead.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar structured credit methods could apply to other agent domains with hierarchical actions, such as web navigation or robotics.
  • Reliance on ASTs suggests that parsing action syntax is key to better learning signals in code-related agents.
  • Testing on more diverse CLI environments could reveal if the structural advantages generalize beyond the ShellOps suite.

Load-bearing premise

The structured properties of CLI actions, including their syntax trees and trajectory relations, yield unbiased and useful signals for improving reinforcement learning credit assignment beyond what standard methods achieve.

What would settle it

A direct comparison on the ShellOps tasks where standard RL achieves equal or higher success rates and efficiency than A³-augmented agents would falsify the benefit of the structured credit construction.
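Stated as code, the falsification condition is a conjunction over both metrics; the function below is purely illustrative and every name in it is hypothetical:

    def a3_benefit_falsified(success_standard: float, success_a3: float,
                             cost_per_step_standard: float, cost_per_step_a3: float) -> bool:
        """True when standard RL matches or beats the A3-augmented agent on both
        ShellOps success rate and efficiency, falsifying the structured-credit benefit."""
        return (success_standard >= success_a3
                and cost_per_step_standard <= cost_per_step_a3)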

Figures

Figures reproduced from arXiv: 2605.08013 by Haoyang Su, Ying Wen.

Figure 1: Overview of the verifiable CLI task workflow. (a) ShellOps task instance with a natural …
Figure 2: Schematic overview of the algorithm. An AST comparison for a pair of shell commands …
Figure 3: Training diagnostics on Qwen3-14B under matched data and rollout settings.
Figure 4: Ablation and efficiency diagnostics for A³. (a) Advantage-component magnitudes and training-mean contribution shares after fusion weighting. (b) ShellOps accuracy-cost Pareto view, using macro exact match and per-step advantage cost, with dashed non-A³ RL SOTA and annotated A³ with σ-Reveal gain.
Figure 5: ShellOps standard corpus task workspace profile across 1624 tasks. Panel A ranks file …
Figure 6: ShellOps-Pro task workspace profile across 150 tasks. Panel A ranks file extensions by file …
Original abstract

Command line interface (CLI) agents are emerging as a practical paradigm for agent-computer interaction over evolving filesystems, executable command line programs, and online execution feedback. Recent work has used reinforcement learning (RL) to learn these interaction abilities from verifiable task feedback, yet few methods exploit the native structured attributes of CLI actions as learning signals. Beyond this underused action structure, CLI learning also couples two bottlenecks for coding agents. First, the agent must identify task-relevant evidence in a large codebase from partial observations. Second, sparse terminal rewards must be assigned to the actions that shape a long multi-turn trajectory. We study these bottlenecks through shell-driven information extraction and file editing tasks. For selective observation, we introduce $\sigma$-Reveal, an inference-time mechanism that selects token-budgeted context for the same CLI. For credit assignment, we propose Action Advantage Assignment ($\mathrm{A}^3$), a native agentic RL method that preserves the algorithmic complexity of standard agentic RL. $\mathrm{A}^3$ constructs turn-level advantages from episode-level relative feedback, abstract syntax tree (AST) based action sub-chain residuals, and tree-level trajectory margins. To further evaluate this problem setting, we construct ShellOps, a verifiable dataset suite covering CLI tasks in repository environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces σ-Reveal, an inference-time token-budgeted selective observation mechanism for CLI agents, and Action Advantage Assignment (A³), which constructs turn-level advantages from episode-level relative feedback, AST-based action sub-chain residuals, and tree-level trajectory margins while preserving standard agentic RL complexity. It evaluates these on shell-driven information extraction and file editing tasks using the new ShellOps dataset suite.

Significance. If the empirical results support the claims, A³ offers a structured credit-assignment approach that leverages native CLI action attributes (syntax trees, sub-chains, margins) to address sparse rewards in long trajectories without added algorithmic complexity, potentially improving sample efficiency for agentic RL in partially observable repository environments.

major comments (2)
  1. [Abstract and §3, A³ description] The construction of AST-based sub-chain residuals and tree-level trajectory margins is presented as robust, yet σ-Reveal's token-budgeted partial observations could produce truncated or syntactically incomplete command strings; no error-tolerant parsing, fallback to string-level residuals, or handling of undefined ASTs is specified, which risks systematically biasing advantage estimates toward well-formed commands.
  2. [§4, experimental setup] The claim that A³ improves over standard RL baselines relies on reliable margin computation and on AST residuals providing better signals than vanilla credit assignment; without ablations isolating the contribution of each component (episode feedback vs. AST residuals vs. margins) under selective observation, it is unclear whether the reported gains are load-bearing or artifactual.
minor comments (3)
  1. [Abstract] The phrase 'preserves the algorithmic complexity of standard agentic RL' should be quantified (e.g., big-O or wall-clock overhead), since AST parsing at every turn adds non-trivial cost in long trajectories.
  2. [§2, related work] Add an explicit comparison to prior structured credit-assignment methods (e.g., those using program ASTs in code-generation RL) to clarify the novelty of the sub-chain residual formulation.
  3. [Figure 1 or §3.2] Include a small worked example showing how a truncated CLI command under σ-Reveal yields an AST residual and a margin value.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify the presentation of our methods. We address each major point below, providing clarifications based on the manuscript design and committing to revisions that strengthen the work without altering its core claims.

Point-by-point responses
  1. Referee: [Abstract and §3, A³ description] The construction of AST-based sub-chain residuals and tree-level trajectory margins is presented as robust, yet σ-Reveal's token-budgeted partial observations could produce truncated or syntactically incomplete command strings; no error-tolerant parsing, fallback to string-level residuals, or handling of undefined ASTs is specified, which risks systematically biasing advantage estimates toward well-formed commands.

    Authors: We appreciate this observation on potential edge cases. In the manuscript, σ-Reveal performs token-budgeted selection exclusively over environment observations (e.g., command outputs, directory listings), while agent actions remain complete, syntactically valid command strings produced by the policy network. AST construction and sub-chain residuals are therefore applied only to these full post-execution actions. When AST parsing fails (an infrequent occurrence given our training distribution), the implementation already falls back to string-level edit-distance residuals for advantage computation (a sketch of this parse-or-fallback pattern appears after these responses). To make this explicit and address any risk of bias, we will expand §3 with a dedicated paragraph on error-tolerant parsing and the fallback rule. revision: yes

  2. Referee: [§4, experimental setup] The claim that A³ improves over standard RL baselines relies on reliable margin computation and on AST residuals providing better signals than vanilla credit assignment; without ablations isolating the contribution of each component (episode feedback vs. AST residuals vs. margins) under selective observation, it is unclear whether the reported gains are load-bearing or artifactual.

    Authors: We agree that isolating the three components of A³ would provide stronger evidence for their individual contributions, especially under selective observation. The current §4 reports end-to-end results comparing A³ against standard RL baselines in the presence of σ-Reveal. In the revised manuscript we will add targeted ablations in §4 (and the appendix) that systematically disable episode-level relative feedback, AST sub-chain residuals, and trajectory margins in turn, while keeping σ-Reveal active (see the ablation-grid sketch after these responses). These experiments will quantify the marginal benefit of each element and confirm that the observed improvements are not artifacts of the combined setting. revision: yes
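A minimal sketch of the parse-or-fallback pattern described in response 1, with a toy pipeline splitter standing in for the paper's shell AST and a similarity ratio standing in for its residual definition; none of these names come from the paper:

    import difflib

    def toy_shell_parse(cmd: str) -> list[str]:
        """Toy stand-in for an AST parser: split a pipeline into command nodes,
        rejecting commands with unbalanced quotes as unparseable."""
        if cmd.count('"') % 2 or cmd.count("'") % 2:
            raise ValueError("unparseable command")
        return [stage.strip() for stage in cmd.split("|")]

    def residual(action: str, reference: str) -> float:
        """Similarity-style residual in [0, 1]; higher means closer to the reference."""
        try:
            a_nodes = toy_shell_parse(action)
            r_nodes = toy_shell_parse(reference)
            matched = sum(1 for a, r in zip(a_nodes, r_nodes) if a == r)
            return matched / max(len(a_nodes), len(r_nodes))  # structural overlap
        except ValueError:
            # String-level fallback when parsing fails, per the authors' response.
            return difflib.SequenceMatcher(None, action, reference).ratio()

And a similarly hedged sketch of the ablation grid promised in response 2, toggling one A³ component at a time while σ-Reveal stays active; the component names follow the abstract, and how each configuration feeds a training run is assumed:

    COMPONENTS = ("episode_feedback", "ast_residuals", "trajectory_margins")

    def ablation_configs():
        full = dict.fromkeys(COMPONENTS, True)
        yield "full_A3", dict(full)
        for dropped in COMPONENTS:
            cfg = dict(full)
            cfg[dropped] = False
            yield f"no_{dropped}", cfg

    for name, cfg in ablation_configs():
        print(name, cfg)  # each cfg would parameterize one training run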

Circularity Check

0 steps flagged

No circularity: A³ is an additive construction over standard agentic RL

Full rationale

The paper defines A³ explicitly as a method that builds turn-level advantages from three external signals (episode relative feedback, AST sub-chain residuals, tree-level margins) while preserving standard RL algorithmic complexity. No equations are shown that equate the output advantages to the input signals by construction, no parameters are fitted on a subset and then renamed as predictions, and no load-bearing self-citations or uniqueness theorems from the same authors are invoked. The derivation chain is therefore additive rather than self-referential; the method's value rests on empirical improvement over baselines, not on internal redefinition of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly relies on standard RL assumptions (MDP structure, advantage estimation) and the claim that CLI syntax trees yield useful residuals, but these are not enumerated.

pith-pipeline@v0.9.0 · 5512 in / 1187 out tokens · 41012 ms · 2026-05-11T02:55:50.674818+00:00 · methodology

