pith. machine review for the scientific record.

arxiv: 2604.20987 · v1 · submitted 2026-04-22 · 💻 cs.AI

Recognition: unknown

Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

Authors on Pith no claims yet

Pith reviewed 2026-05-10 00:09 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM agents · skill discovery · long-horizon tasks · co-evolution · game benchmarks · decision making · skill bank

The pith

A co-evolving skill bank and decision agent framework enables LLMs to better handle long-horizon tasks in games.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that LLMs can overcome their limitations in long-horizon decision making by co-evolving two components: a decision agent that retrieves and uses skills from a bank to guide actions, and a skill pipeline that extracts reusable skills from the agent's own rollouts to update the bank. This mutual improvement allows the system to discover, retain, and reuse structured skills across episodes without supervision. If successful, it would mean smaller LLMs can achieve higher performance in complex interactive environments like games compared to larger frontier models that lack such mechanisms.

Core claim

COSPLAY is a co-evolution framework in which an LLM decision agent retrieves skills from a learnable skill bank to guide action taking, while an agent-managed skill pipeline discovers reusable skills from the agent's unlabeled rollouts to form and update the skill bank. This setup improves the decision agent's skill retrieval and action generation while the skill bank continually extracts, refines, and updates skills with their contracts, leading to better performance in long-horizon game environments.
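The loop this claim describes can be sketched in a few lines. This is a toy illustration, not the paper's implementation: `SkillBank`, `decide`, `extract_skills`, and the string-matching retrieval are all invented stand-ins for the LLM components.

```python
# Toy sketch of the co-evolution loop: the decision agent consumes skills
# retrieved from the bank, and the skill pipeline promotes patterns found
# in the resulting rollout back into the bank. All names here are invented
# stand-ins, not the paper's API.

class SkillBank:
    def __init__(self):
        self.skills = {}  # name -> description (a stand-in for a "contract")

    def retrieve(self, observation):
        # Toy retrieval: any skill whose name appears in the observation.
        return [desc for name, desc in self.skills.items() if name in observation]

    def update(self, new_skills):
        self.skills.update(new_skills)

def decide(observation, skills):
    # Stand-in for the LLM decision agent: act on a retrieved skill if any.
    return "use_skill" if skills else "explore"

def extract_skills(rollout):
    # Stand-in for the skill pipeline: promote observations seen at least
    # twice into named "skills" (real extraction segments trajectories and
    # learns contracts from state deltas).
    counts = {}
    for obs, _action in rollout:
        counts[obs] = counts.get(obs, 0) + 1
    return {obs: f"skill for {obs}" for obs, n in counts.items() if n >= 2}

def run_episode(bank, observations):
    rollout = [(obs, decide(obs, bank.retrieve(obs))) for obs in observations]
    bank.update(extract_skills(rollout))  # the co-evolution step
    return rollout

bank = SkillBank()
ep1 = run_episode(bank, ["enemy", "enemy", "town"])  # no skills yet
ep2 = run_episode(bank, ["enemy", "town"])           # "enemy" skill now fires
```

The point the sketch makes is structural: the bank is empty in episode 1, the pipeline populates it from that unlabeled rollout, and the decision agent behaves differently in episode 2 without any supervision entering the loop.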

What carries the argument

The co-evolution between the LLM decision agent and the skill bank agent, where skills are retrieved for decision making and extracted from rollouts for bank updates.

If this is right

  • The decision agent learns better skill retrieval and action generation through interaction with the skill bank.
  • The skill bank agent extracts, refines, and contracts skills from unlabeled rollouts, enabling reuse across episodes.
  • Experiments across six game environments demonstrate over 25.1 percent average reward improvement with an 8B base model against frontier LLM baselines on single-player benchmarks.
  • Competitive performance is maintained on multi-player social reasoning games.
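For context on how a headline number like the 25.1 percent average reward improvement is typically computed, here is the arithmetic with invented placeholder rewards; the paper's per-environment numbers are not reproduced here.

```python
# Arithmetic behind an "average reward improvement" figure like the 25.1%
# above, using invented placeholder rewards (not the paper's data).

def avg_pct_improvement(ours, baseline):
    """Mean of per-environment relative improvements, in percent."""
    pcts = [100.0 * (o - b) / b for o, b in zip(ours, baseline)]
    return sum(pcts) / len(pcts)

ours     = [1.30, 0.90, 2.40]  # hypothetical per-game rewards for the 8B agent
baseline = [1.00, 0.80, 2.00]  # hypothetical best-baseline rewards per game

print(round(avg_pct_improvement(ours, baseline), 1))  # 20.8 on these numbers
```

Note that averaging per-environment percentages (as here) and taking the percentage change of the averaged rewards generally give different figures, which is one reason a reader needs the exact metric behind any such headline number.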

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This mutual bootstrapping could reduce the need for extensive human-labeled data or supervision in agent training.
  • The approach might extend to other long-horizon domains such as robotics or planning tasks if the skill extraction generalizes.
  • Smaller models enhanced this way could become more efficient alternatives to scaling up model size for interactive tasks.

Load-bearing premise

That the skill pipeline can reliably extract, refine, and contract genuinely reusable skills from unlabeled rollouts without supervision, and that this produces transferable improvements rather than environment-specific overfitting.

What would settle it

If the skills extracted by the pipeline fail to improve the decision agent's performance when used in new episodes or environments, or if removing the co-evolution loop eliminates the observed reward gains.

Figures

Figures reproduced from arXiv: 2604.20987 by Alexander Duffy, Dinesh Manocha, Guangyao Shi, Matthew Lyle Olson, Tianyi Zhou, Tyler Marques, Xiyang Wu, Zongxia Li.

Figure 1
Figure 1: Overview of COSPLAY. COSPLAY is a multi-agent co-evolution framework that couples gameplay with skill learning. It consists of a decision agent (Orange Box), a skill bank agent (Red Box), and a skill bank (Purple Box). The decision agent interacts with the game by retrieving skills, updating intentions, and selecting actions. After each episode, the skill bank agent segments trajectories, learns skill co…
Figure 2
Figure 2: Skill bank agent pipeline on one Diplomacy episode (Austria). (a) Raw Trajectory. A decision-agent rollout; shaded rows mark skill transitions. (b) Boundary Proposal. We score each timestep for transition signals and discard low scorers. (c) Infer Segmentation. We select true boundaries and label each segment with a bank skill or new skill. (d) Contract Learning. We aggregate state deltas across all inst…
Figure 3
Figure 3: Skill bank evolution over Diplomacy training. (a) Development of Strategic Function Categories from the first to the last training step. Compared with the initial skill bank, the final bank shows notable increases in phase transition and territory loss skills, indicating a broader tactical repertoire. (b) Changes in Intention Composition between the first and last training steps, suggesting increasingly go…
Figure 4
Figure 4: Co-evolution reward curves for all games. Single-player games show steady gains, indicating improved strategies from joint decision-agent and skill-bank training. Multiplayer self-play remains flat because all players improve symmetrically, pushing rewards toward equilibrium.
Figure 5
Figure 5: Step-level comparison between GPT-5.4 and our method in Candy Crush (1/2: Setting & Early Game).
Figure 6
Figure 6: Step-level comparison between GPT-5.4 and our method in Candy Crush (2/2: Mid & Late Game).
Figure 7
Figure 7: Step-level comparison between GPT-5.4 and our method in Diplomacy as Austria (1/2: Setting & Early Game).
Figure 8
Figure 8: Step-level comparison between GPT-5.4 and our method in Diplomacy as Austria (2/2: Mid & Late Game).
Figure 9
Figure 9: Failure analysis for both methods in Diplomacy. COSPLAY fails by stagnation (5/28 episodes plateau at 3 SC); GPT-5.4 fails by collapse (16/60 decline to 1–2 SC). Stagnation is the safer failure mode.
Figure 10
Figure 10: Skill retrieval patterns and causal mechanism in Diplomacy. Skills function as a curriculum schedule for action exploration, as they impose temporal structure that broadens the action distribution and establishes a safety floor.
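The first two stages of the Figure 2 pipeline, boundary proposal and segmentation, can be sketched on a toy scalar trajectory. The scoring rule here (absolute state delta against a fixed threshold) is an illustrative assumption; the paper's actual transition signals are not reproduced.

```python
# Toy version of Figure 2's stages (b) and (c) on a scalar trajectory:
# score each timestep by |state delta|, keep high scorers as boundary
# proposals, then cut the trajectory into segments. The scoring rule and
# threshold are assumptions for illustration only.

def propose_boundaries(states, threshold=1):
    # (b) Boundary proposal: score transitions, discard low scorers.
    scores = [abs(b - a) for a, b in zip(states, states[1:])]
    return [i + 1 for i, s in enumerate(scores) if s > threshold]

def segment(states, boundaries):
    # (c) Segmentation: cut the trajectory at the proposed boundaries.
    cuts = [0] + boundaries + [len(states)]
    return [states[a:b] for a, b in zip(cuts, cuts[1:])]

states = [0, 0, 1, 5, 5, 9]          # toy "game states"
bounds = propose_boundaries(states)  # large jumps mark skill transitions
segs = segment(states, bounds)       # segments would then be labeled
```

In the paper's pipeline each resulting segment would then be labeled with an existing bank skill or a new one, and stage (d) would aggregate state deltas across instances into a contract.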
read the original abstract

Long horizon interactive environments are a testbed for evaluating agents' skill usage abilities. These environments demand multi step reasoning, the chaining of multiple skills over many timesteps, and robust decision making under delayed rewards and partial observability. Games are a good testbed for evaluating agent skill usage in environments. Large Language Models (LLMs) offer a promising alternative as game playing agents, but they often struggle with consistent long horizon decision making because they lack a mechanism to discover, retain, and reuse structured skills across episodes. We present COSPLAY, a co evolution framework in which an LLM decision agent retrieves skills from a learnable skill bank to guide action taking, while an agent managed skill pipeline discovers reusable skills from the agent's unlabeled rollouts to form a skill bank. Our framework improves both the decision agent to learn better skill retrieval and action generation, while the skill bank agent continually extracts, refines, and updates skills together with their contracts. Experiments across six game environments show that COSPLAY with an 8B base model achieves over 25.1 percent average reward improvement against four frontier LLM baselines on single player game benchmarks while remaining competitive on multi player social reasoning games.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces COSPLAY, a co-evolution framework for LLM agents in long-horizon interactive environments such as games. A decision agent retrieves skills from a learnable skill bank to guide multi-step actions under partial observability and delayed rewards, while a separate skill pipeline agent extracts, refines, and contracts reusable skills from the decision agent's unlabeled rollouts to populate and update the bank. Experiments across six game environments claim that COSPLAY instantiated with an 8B base model yields over 25.1% average reward improvement versus four frontier LLM baselines on single-player benchmarks while remaining competitive on multi-player social-reasoning games.

Significance. If the empirical gains prove robust, the framework would represent a meaningful advance for LLM agents by enabling unsupervised, iterative skill discovery and reuse without human supervision or hand-crafted skill libraries. The co-evolution loop between decision and skill agents directly targets the long-horizon consistency problem that current prompting and retrieval methods struggle with. The fact that an 8B model reportedly outperforms larger frontier baselines on single-player tasks would be noteworthy if supported by proper controls and ablations.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): The headline claim of a 25.1% average reward improvement is presented without any description of the six environments, the four frontier baselines, number of evaluation episodes, variance across runs, or statistical tests. This information is load-bearing for the central empirical claim and its absence prevents verification that the gains arise from the skill bank rather than prompting artifacts or environment-specific overfitting.
  2. [§3] §3 (Method, Skill Pipeline): The unsupervised extraction, refinement, and contraction of skills from unlabeled rollouts is described at a high level with no concrete criteria, similarity metric, or validation step for determining reusability or transferability. Without such mechanisms or accompanying ablations that isolate the skill bank's contribution from the base 8B model's retrieval, it is impossible to rule out that reported gains reflect environment-specific correlations rather than genuinely reusable skills.
  3. [§4] §4 (Experiments): No cross-environment transfer tests or ablation studies (e.g., skill bank disabled, random skills, or fixed bank) are reported. Such controls are necessary to substantiate that the co-evolution produces transferable improvements rather than per-environment overfitting, which directly bears on the weakest assumption identified in the manuscript.
minor comments (2)
  1. [Abstract] The abstract and introduction use both 'co evolution' and 'co-evolution' inconsistently; standardize the hyphenated form.
  2. [Figures and Tables] Figure captions and table headers should explicitly state the base model size (8B) and the exact reward metric used for the 25.1% figure to improve readability.
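The ablation grid requested in major comment 3 can be written down directly. `make_bank` and the condition names are illustrative scaffolding, not the paper's code; the benchmark runs themselves are out of scope here.

```python
# The ablation conditions from major comment 3, written as a configuration
# grid: skill bank disabled, random skills, fixed (frozen) bank, and the
# full co-evolving system. Names are illustrative assumptions.

def make_bank(condition, learned_skills):
    """Skill set an agent would see under each ablation condition."""
    if condition == "disabled":
        return []  # no skill bank: isolates the base model's contribution
    if condition == "random":
        # Same bank size, but content decoupled from the agent's rollouts.
        return [f"random_skill_{i}" for i in range(len(learned_skills))]
    if condition in ("fixed", "full"):
        # "fixed" freezes this set; "full" would additionally keep updating it.
        return list(learned_skills)
    raise ValueError(f"unknown condition: {condition}")

conditions = ["full", "fixed", "random", "disabled"]
grid = {c: make_bank(c, ["open_center", "defend_home"]) for c in conditions}
```

Comparing "full" against "fixed" isolates the co-evolution loop; "random" and "disabled" isolate skill content and skill retrieval respectively, which is exactly the decomposition the referee's comment calls for.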

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed feedback on our manuscript. We believe the suggested clarifications and additional analyses will strengthen the presentation of our co-evolution framework. Below we respond to each major comment and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): The headline claim of a 25.1% average reward improvement is presented without any description of the six environments, the four frontier baselines, number of evaluation episodes, variance across runs, or statistical tests. This information is load-bearing for the central empirical claim and its absence prevents verification that the gains arise from the skill bank rather than prompting artifacts or environment-specific overfitting.

    Authors: We agree that the abstract and experimental summary should be more self-contained to allow readers to assess the claims immediately. In the revised manuscript, we will expand the abstract to briefly name the six game environments and the four frontier LLM baselines. In §4, we will add explicit details on the number of evaluation episodes per environment, report standard deviations or variances across multiple runs, and include results from statistical significance tests comparing COSPLAY to baselines. These additions will help confirm that the reported gains are attributable to the co-evolution mechanism rather than other factors. revision: yes

  2. Referee: [§3] §3 (Method, Skill Pipeline): The unsupervised extraction, refinement, and contraction of skills from unlabeled rollouts is described at a high level with no concrete criteria, similarity metric, or validation step for determining reusability or transferability. Without such mechanisms or accompanying ablations that isolate the skill bank's contribution from the base 8B model's retrieval, it is impossible to rule out that reported gains reflect environment-specific correlations rather than genuinely reusable skills.

    Authors: The description in §3 was intentionally high-level to focus on the overall co-evolution loop, but we recognize the need for concreteness. We will revise §3 to provide the concrete criteria for skill extraction, the similarity metric used for refinement and contraction, and the validation steps for determining reusability and transferability. We will also incorporate ablations that isolate the skill bank's contribution from the base 8B model's retrieval to rule out environment-specific correlations. revision: yes

  3. Referee: [§4] §4 (Experiments): No cross-environment transfer tests or ablation studies (e.g., skill bank disabled, random skills, or fixed bank) are reported. Such controls are necessary to substantiate that the co-evolution produces transferable improvements rather than per-environment overfitting, which directly bears on the weakest assumption identified in the manuscript.

    Authors: We acknowledge that the current experiments primarily demonstrate in-environment performance improvements. To address concerns about overfitting versus transferable skills, we will include in the revised §4 additional ablation experiments such as COSPLAY with the skill bank disabled, using randomly generated skills, and a fixed skill bank without updates. We will also report cross-environment transfer tests, where skills discovered in one game environment are applied to another, to demonstrate reusability across tasks. These will be added as new tables or figures. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical claims rest on external benchmarks

full rationale

The paper describes an empirical co-evolution framework (COSPLAY) for LLM agents and skill banks, with central claims consisting of reported reward improvements (e.g., 25.1% average on single-player benchmarks) across six game environments. No derivation chain, equations, fitted parameters, or self-referential definitions exist; the method is presented as a proposed architecture whose performance is evaluated via external benchmarks rather than reducing to its own inputs by construction. Self-citations, if any, are not load-bearing for the core results, which are falsifiable against frontier LLM baselines. This is a standard experimental paper with no circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The framework rests on domain assumptions about automatic skill extraction from rollouts and the utility of skill contracts; the abstract quantifies no free parameters, and neither invented entity comes with independent supporting evidence.

axioms (2)
  • domain assumption LLMs can improve decision making by retrieving and applying structured skills from an external bank
    Core premise of the decision agent component.
  • domain assumption Reusable skills with usage contracts can be discovered from unlabeled agent rollouts
    Foundation for the skill pipeline agent.
invented entities (2)
  • Skill bank no independent evidence
    purpose: Store and supply reusable skills to the decision agent
    Central new component of the framework
  • Skill pipeline agent no independent evidence
    purpose: Discover, refine, and update skills and contracts from rollouts
    The co-evolving counterpart to the decision agent
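A minimal data-structure sketch of these two entities, assuming a contract is a (precondition, effect) pair; the field names and the keyword retrieval are assumptions, not the paper's actual skill schema.

```python
# Minimal data-structure sketch: skills carry a (precondition, effect)
# "contract", and the bank supports add-with-refinement and retrieval.
# The schema is an illustrative assumption, not the paper's.

from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    precondition: str  # when the skill applies
    effect: str        # expected state delta, i.e. the contract

class SkillBank:
    def __init__(self):
        self._skills = []

    def add(self, skill):
        # Refinement: a re-extracted skill replaces its older version.
        self._skills = [s for s in self._skills if s.name != skill.name]
        self._skills.append(skill)

    def retrieve(self, observation):
        # Toy retrieval: every precondition word must appear in the obs.
        return [s for s in self._skills
                if all(w in observation for w in s.precondition.split())]

bank = SkillBank()
bank.add(Skill("fortify", "enemy adjacent", "hold territory"))
bank.add(Skill("expand", "neutral adjacent", "gain territory"))
hits = bank.retrieve("enemy unit adjacent to Vienna")  # matches "fortify"
```

The add-with-replacement step is what makes the bank "learnable" in the ledger's sense: the skill pipeline agent can overwrite a skill's contract as more rollout evidence accumulates.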

pith-pipeline@v0.9.0 · 5526 in / 1443 out tokens · 88299 ms · 2026-05-10T00:09:49.625126+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while sh...

Reference graph

Works this paper leans on

37 extracted references · 35 canonical work pages · cited by 1 Pith paper · 18 internal anchors

  1. [1]

    Dota 2 with Large Scale Deep Reinforcement Learning

    Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680,

  2. [2]

    OpenAI Gym

    Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540,

  3. [3]

    Can vlms play action role-playing games? take black myth wukong as a study case, 2024

    Peng Chen, Pi Bu, Jun Song, Yuan Gao, and Bo Zheng. Can vlms play action role-playing games? take black myth wukong as a study case.arXiv preprint arXiv:2409.12889,

  4. [4]

    Group-in-Group Policy Optimization for LLM Agent Training

    Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training.arXiv preprint arXiv:2505.10978,

  5. [5]

    Visplay: Self-evolving vision-language models from images,

    Yicheng He, Chengsong Huang, Zongxia Li, Jiaxin Huang, and Yonghui Yang. Visplay: Self-evolving vision-language models from images.arXiv preprint arXiv:2511.15661,

  6. [6]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874,

  7. [7]

    Lmgame-Bench: How Good Are LLMs at Playing Games?

    Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P Xing, Ion Stoica, Tajana Rosing, Haojian Jin, and Hao Zhang. Lmgame-bench: How good are llms at playing games? arXiv preprint arXiv:2505.15146,

  8. [8]

    A survey on large language model-based game agents,

    Sihao Hu, Tiansheng Huang, Gaowen Liu, Ramana Rao Kompella, Fatih Ilhan, Selim Furkan Tekin, Yichang Xu, Zachary Yahn, and Ling Liu. A survey on large language model-based game agents.arXiv preprint arXiv:2404.02039,

  9. [9]

    R-Zero: Self-Evolving Reasoning LLM from Zero Data

    Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-zero: Self-evolving reasoning llm from zero data.arXiv preprint arXiv:2508.05004, 2025a. Xu Huang, Junwu Chen, Yuxing Fei, Zhuohan Li, Philippe Schwaller, and Gerbrand Ceder. Cascade: Cumulative agentic skill creation through autonomou...

  10. [10]

    SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

    Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, et al. Skillsbench: Benchmarking how well agent skills work across diverse tasks.arXiv preprint arXiv:2602.12670, 2026a. Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, and Liqiang Nie. Optimus- 1: Hybrid multimodal m...

  11. [11]

    Self-Rewarding Vision-Language Model via Reasoning Decomposition

    Zongxia Li, Wenhao Yu, Chengsong Huang, Rui Liu, Zhenwen Liang, Fuxiao Liu, Jingxi Che, Dian Yu, Jordan Boyd-Graber, Haitao Mi, et al. Self-rewarding vision-language model via reasoning decomposition. arXiv preprint arXiv:2508.19652,

  12. [12]

    Mm-zero: Self-evolving multi-model vision language models from zero data,

    Zongxia Li, Hongyang Du, Chengsong Huang, Xiyang Wu, Lantao Yu, Yicheng He, Jing Xie, Xiaomin Wu, Zhichao Liu, Jiarui Zhang, et al. Mm-zero: Self-evolving multi-model vision language models from zero data.arXiv preprint arXiv:2603.09206, 2026b. Yi Liao, Yu Gu, Yuan Sui, Zining Zhu, Yifan Lu, Guohua Tang, Zhongqian Sun, and Wei Yang. Think in games: Learni...

  13. [13]

    From text to tactic: Evaluating llms playing the game of avalon

    Jonathan Light, Min Cai, Sheng Shen, and Ziniu Hu. Avalonbench: Evaluating llms playing the game of avalon.arXiv preprint arXiv:2310.05036,

  14. [14]

    Let's Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step.arXiv preprint arXiv:2305.20050,

  15. [15]

    Agentic reinforcement learning with implicit step rewards, 2025

    Xiaoqian Liu, Ke Wang, Yuchuan Wu, Fei Huang, Yongbin Li, Junge Zhang, and Jianbin Jiao. Agentic reinforcement learning with implicit step rewards.arXiv preprint arXiv:2509.19199,

  16. [16]

    AVA: Attentive VLM Agent for Mastering StarCraft II

    Weiyu Ma, Yuqian Fu, Zecheng Zhang, Guohao Li, and Bernard Ghanem. Vlms play starcraft ii: A benchmark and multimodal decision method.arXiv preprint arXiv:2503.05383,

  17. [17]

    Skill-Pro: Learning Reusable Skills from Experience via Non-Parametric PPO for LLM Agents

    Qirui Mi, Zhijian Ma, Mengyue Yang, Haoxuan Li, Yisen Wang, Haifeng Zhang, and Jun Wang. Procmem: Learning reusable procedural memory from experience via non-parametric ppo for llm agents. arXiv preprint arXiv:2602.01869,

  18. [18]

    Playing Atari with Deep Reinforcement Learning

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602,

  19. [19]

    BALROG: Benchmarking Agentic LLM and VLM Reasoning on Games

    OpenAI. Introducing gpt-5.4, 2026a. OpenAI. GPT-5 mini Model (gpt-5-mini), 2026b. Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus, et al. Balrog: Benchmarking agentic llm and vlm reasoning on games. arXiv preprint arXiv:2411.13543,

  20. [20]

    Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games

    Dongmin Park, Minkyu Kim, Beongjun Choi, Junhyuck Kim, Keon Lee, Jonghyun Lee, Inkyu Park, Byeong-Uk Lee, Jaeyoung Hwang, Jaewoo Ahn, et al. Orak: A foundational benchmark for training and evaluating llm agents on diverse video games.arXiv preprint arXiv:2506.03610,

  21. [21]

    Team et al.Scaling Instructable Agents Across Many Simulated Worlds

    Maria Abi Raad, Arun Ahuja, Catarina Barros, Frederic Besse, Andrew Bolt, Adrian Bolton, Bethanie Brownfield, Gavin Buttimore, Max Cant, Sarah Chakera, et al. Scaling instructable agents across many simulated worlds. arXiv preprint arXiv:2404.10179,

  22. [22]

    Bayesian Social Deduction with Graph-Informed Language Models

    Shahab Rahimirad, Guven Gergerli, Lucia Romero, Angela Qian, Matthew Lyle Olson, Simon Stepputtis, and Joseph Campbell. Bayesian social deduction with graph-informed language models.arXiv preprint arXiv:2506.17788,

  23. [23]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

  24. [24]

    Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

    David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815,

  25. [25]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models.arXiv preprint arXiv:2305.16291,

  26. [26]

    Reinforcement learning for self-improving agent with skill library.arXiv preprint arXiv:2512.17102, 2025

    Jiongxiao Wang, Qiaojing Yan, Yawei Wang, Yijun Tian, Soumya Smruti Mishra, Zhichao Xu, Megha Gandhi, Panpan Xu, and Lin Lee Cheong. Reinforcement learning for self-improving agent with skill library. arXiv preprint arXiv:2512.17102, 2025a. Qinsi Wang, Bo Liu, Tianyi Zhou, Jing Shi, Yueqian Lin, Yiran Chen, Hai Helen Li, Kun Wan, and Wentian Zhao. Vision-...

  27. [27]

    VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents

    Zirui Wang, Junyi Zhang, Jiaxin Ge, Long Lian, Letian Fu, Lisa Dunlap, Ken Goldberg, XuDong Wang, Ion Stoica, David M Chan, et al. Visgym: Diverse, customizable, scalable environments for multimodal agents. arXiv preprint arXiv:2601.16973,

  28. [28]

    SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning

    Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, et al. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning.arXiv preprint arXiv:2602.08234,

  29. [29]

    Ui-mem: Self-evolving experience memory for online reinforcement learning in mobile gui agents

    Han Xiao, Guozhi Wang, Hao Wang, Shilong Liu, Yuxiang Chai, Yue Pan, Yufeng Zhou, Xiaoxin Chen, Yafei Wen, and Hongsheng Li. Ui-mem: Self-evolving experience memory for online reinforcement learning in mobile gui agents.arXiv preprint arXiv:2602.05832,

  30. [30]

    Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward

    Renjun Xu and Yang Yan. Agent skills for large language models: Architecture, acquisition, security, and the path forward.arXiv preprint arXiv:2602.12430,

  31. [31]

    Vs-bench: Evaluating vlms for strategic reasoning and decision-making in multi-agent environments

    Zelai Xu, Zhexuan Xu, Xiangmin Yi, Huining Yuan, Xinlei Chen, Yi Wu, Chao Yu, and Yu Wang. Vs-bench: Evaluating vlms for strategic reasoning and decision-making in multi-agent environments. 2025a. Zhongwen Xu, Xianliang Wang, Siyi Li, Tao Yu, Liang Wang, Qiang Fu, and Wei Yang. Agents play thousands of 3d video games.arXiv preprint arXiv:2503.13356, 2025b...

  32. [32]

    Memweaver: A hierarchical memory from textual interactive behaviors for personalized generation.arXiv preprint arXiv:2510.07713, 2025a

    Shuo Yu, Mingyue Cheng, Daoyu Wang, Qi Liu, Zirui Liu, Ze Guo, and Xiaoyu Tao. Memweaver: A hierarchical memory from textual interactive behaviors for personalized generation.arXiv preprint arXiv:2510.07713, 2025a. Simon Yu, Gang Li, Weiyan Shi, and Peng Qi. Polyskill: Learning generalizable skills through polymorphic abstraction.arXiv preprint arXiv:2510...

  33. [33]

    Agentevolver: Towards efficient self-evolving agent system.arXiv, 2025

    Yunpeng Zhai, Shuchang Tao, Cheng Chen, Anni Zou, Ziqian Chen, Qingxu Fu, Shinji Mai, Li Yu, Jiaji Deng, Zouying Cao, et al. Agentevolver: Towards efficient self-evolving agent system.arXiv preprint arXiv:2511.10395,

  34. [34]

    VideoGameBench: Can Vision-Language Models Complete Popular Video Games?

    Alex L Zhang, Thomas L Griffiths, Karthik R Narasimhan, and Ofir Press. Videogamebench: Can vision-language models complete popular video games? arXiv preprint arXiv:2505.18134,

  35. [35]

    MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

    Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, and Wenya Wang. Memskill: Learning and evolving memory skills for self-evolving agents. arXiv preprint arXiv:2602.02474, 2026a. Shengtao Zhang, Jiaqian Wang, Ruiwen Zhou, Junwei Liao, Yuchen Feng, Weinan Zhang, Ying Wen, Zhiyu Li, Feiyu Xiong, Yutao Qi, et al. Memrl: Self-evolvi...

  36. [36]

    Ckpt Int

    C Key Hyperparameters We summarize the main hyperparameters used in co-evolution training for the six game environments in the main paper. Table 3 lists the game-specific settings used in our main experiments. All training runs are conducted on an 8 ×A100 GPU cluster. For games without explicit GRPO overrides, we report the default values directly: GRPO c...

  37. [37]

    Avalon is a team-based competitive game in which only one side can win. It is structurally harder for the Good side, since Good players must infer hidden roles from sparse signals such as proposals, votes, and quest outcomes, while Evil players begin with full coordination and can strategically hide or sabotage (Light et al., 2023). As shown in Table 1, o...