pith. machine review for the scientific record.

arxiv: 2602.20867 · v1 · submitted 2026-02-24 · 💻 cs.CR · cs.AI · cs.CE · cs.ET

Recognition: 3 theorem links · Lean Theorem

SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 23:14 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.CE · cs.ET
keywords agentic skills · LLM agents · systematization of knowledge · supply chain security · prompt injection · agent evaluation · marketplace risks

The pith

Agentic skills function as reusable procedural modules that let LLM agents handle long-horizon tasks reliably across domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Agentic skills package procedural knowledge into callable modules that include applicability conditions, execution policies, termination criteria, and interfaces, so agents can apply the same capability to many tasks instead of building one-off plans each time. The paper maps the complete skill lifecycle from discovery through update and supplies two taxonomies that organize how skills are built and used in practice. Security analysis shows that the same reusability creates supply-chain exposure, with prompt-injection payloads able to travel through skills, as demonstrated when nearly 1,200 malicious skills entered a major marketplace and stole credentials at scale. Evaluation data indicate that hand-curated skills raise agent success rates while skills generated by the agents themselves tend to lower them. The work therefore frames the move toward robust, verifiable, and certifiable skills as a necessary step for trustworthy autonomous agents.

Core claim

Agentic skills are distinct from atomic tool calls because they carry explicit conditions, policies, and reusable interfaces that let them operate reliably across tasks; systematizing their design patterns and representations reveals both performance gains from curated skills and concrete supply-chain and injection risks that must be addressed for safe deployment.

What carries the argument

The skill layer, consisting of reusable modules that combine procedural knowledge with applicability conditions, execution policies, termination criteria, and standardized interfaces.
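The four components named above can be made concrete in a short sketch. This is a minimal illustration, assuming a plausible shape for a skill module; the class, field, and method names are hypothetical, not the paper's notation.

```python
# Hypothetical sketch of a skill module with the four components the paper
# names: applicability conditions, an execution policy, a termination
# criterion, and a declared interface. Names are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Skill:
    name: str
    # Applicability condition: predicate over the current task/environment state.
    applicable: Callable[[dict], bool]
    # Execution policy: maps state to the next action (here, a state transform).
    policy: Callable[[dict], Callable[[dict], dict]]
    # Termination criterion: when to stop and hand control back to the agent.
    done: Callable[[dict], bool]
    # Reusable interface: declared inputs/outputs so skills can be composed.
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)

    def run(self, state: dict, max_steps: int = 50) -> dict:
        """Apply the policy until the termination criterion fires."""
        if not self.applicable(state):
            raise ValueError(f"skill {self.name!r} not applicable here")
        for _ in range(max_steps):
            if self.done(state):
                break
            action = self.policy(state)
            state = action(state)  # illustrative: actions transform state
        return state
```

The point of the sketch is that the same `Skill` object can be invoked on many tasks whose states satisfy its applicability condition, which is exactly what distinguishes it from a one-off plan.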

Load-bearing premise

That the two proposed taxonomies capture the essential structure of agentic skills, and that the risks observed in one marketplace case will recur on other agent platforms.

What would settle it

A subsequent marketplace audit or controlled deployment that finds zero successful skill-based exfiltrations despite widespread adoption, or that shows a simpler taxonomy organizes existing skills more cleanly than the seven-pattern plus representation-by-scope scheme.

Original abstract

Agentic systems increasingly rely on reusable procedural capabilities, a.k.a. agentic skills, to execute long-horizon workflows reliably. These capabilities are callable modules that package procedural knowledge with explicit applicability conditions, execution policies, termination criteria, and reusable interfaces. Unlike one-off plans or atomic tool calls, skills operate (and often do well) across tasks. This paper maps the skill layer across the full lifecycle (discovery, practice, distillation, storage, composition, evaluation, and update) and introduces two complementary taxonomies. The first is a system-level set of seven design patterns capturing how skills are packaged and executed in practice, from metadata-driven progressive disclosure and executable code skills to self-evolving libraries and marketplace distribution. The second is an orthogonal representation × scope taxonomy describing what skills are (natural language, code, policy, hybrid) and what environments they operate over (web, OS, software engineering, robotics). We analyze the security and governance implications of skill-based agents, covering supply-chain risks, prompt injection via skill payloads, and trust-tiered execution, grounded by a case study of the ClawHavoc campaign in which nearly 1{,}200 malicious skills infiltrated a major agent marketplace, exfiltrating API keys, cryptocurrency wallets, and browser credentials at scale. We further survey deterministic evaluation approaches, anchored by recent benchmark evidence that curated skills can substantially improve agent success rates while self-generated skills may degrade them. We conclude with open challenges toward robust, verifiable, and certifiable skills for real-world autonomous agents.
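One governance idea the abstract names, trust-tiered execution, can be sketched as a capability gate. The tier names and capability sets below are assumptions chosen for illustration, not the paper's actual scheme.

```python
# Minimal sketch of trust-tiered execution for skills: a skill's trust tier
# determines which runtime capabilities it may exercise. Tier names and
# capability sets are illustrative assumptions, not taken from the paper.
TIER_CAPS = {
    "untrusted": set(),                              # metadata disclosure only
    "community": {"read_files"},                     # vetted but unsigned
    "certified": {"read_files", "network", "exec"},  # signed and audited
}

def allowed(skill_tier: str, capability: str) -> bool:
    """Gate a skill's requested capability by its trust tier.
    Unknown tiers get no capabilities (fail closed)."""
    return capability in TIER_CAPS.get(skill_tier, set())
```

Failing closed on unknown tiers is the design choice that matters here: a marketplace skill whose provenance cannot be established should default to the metadata-only tier rather than inherit any execution rights.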

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 4 minor

Summary. The paper is a Systematization of Knowledge (SoK) on agentic skills in LLM agents, defining them as reusable procedural modules with explicit applicability conditions, execution policies, termination criteria, and interfaces that operate reliably across tasks. It maps the full skill lifecycle (discovery through update), proposes two complementary taxonomies (seven design patterns for packaging/execution and an orthogonal representation-by-scope taxonomy covering natural language/code/policy/hybrid forms across web/OS/SE/robotics environments), analyzes security/governance risks (supply-chain attacks and prompt injection) grounded in the ClawHavoc campaign of nearly 1,200 malicious skills, surveys deterministic evaluation methods with benchmark evidence on curated vs. self-generated skills, and outlines open challenges for verifiable skills.

Significance. If the proposed taxonomies hold as stable organizing frameworks and the security analysis generalizes beyond the single case study, this SoK would provide a timely, practical reference for designing reliable agentic systems while highlighting concrete risks, potentially influencing both research taxonomies and marketplace governance standards in the LLM agent space.

major comments (1)
  1. [Security and Governance Implications] Security and Governance section: The security implications (supply-chain infiltration and prompt injection) are anchored exclusively to the ClawHavoc marketplace campaign; the manuscript does not provide evidence or discussion of whether comparable risks manifest in non-marketplace or closed agent platforms, which limits the load-bearing claim that these risks are inherent to skill-based agents broadly.
minor comments (4)
  1. [Abstract] Abstract: The notation '1{,}200' is a LaTeX artifact; standardize to '1,200' or 'nearly twelve hundred' for readability in the final version.
  2. [Taxonomies] Taxonomy presentation: The seven design patterns and representation-by-scope taxonomy are described in text; a single summary table or diagram mapping literature examples to each category would substantially improve clarity and allow readers to assess coverage.
  3. [Evaluation] Evaluation section: Benchmark evidence is cited for curated skills improving success rates while self-generated skills may degrade them, but no specific quantitative results (e.g., success-rate deltas or benchmark names) are tabulated; adding a small results table would make the survey more concrete.
  4. [Skill Lifecycle] Lifecycle mapping: The seven-stage lifecycle (discovery, practice, distillation, etc.) is introduced without an accompanying figure showing dependencies or feedback loops between stages; a diagram would aid comprehension.
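The dependency structure that minor comment 4 asks to see can be sketched minimally as a stage graph. The linear ordering follows the stage list in the abstract; the feedback edge from update back to discovery is an assumption suggested by the referee's request, since the paper's own figure is not reproduced here.

```python
# The seven lifecycle stages from the paper, sketched as a dependency graph.
# The update -> discovery feedback edge is an assumed illustration of the
# loop the referee asks to see, not a structure taken from the paper.
LIFECYCLE = ["discovery", "practice", "distillation", "storage",
             "composition", "evaluation", "update"]

# Sequential edges between adjacent stages, plus the assumed feedback loop.
EDGES = [(LIFECYCLE[i], LIFECYCLE[i + 1]) for i in range(len(LIFECYCLE) - 1)]
EDGES.append(("update", "discovery"))

def successors(stage: str) -> list:
    """Stages reachable in one step from the given stage."""
    return [b for a, b in EDGES if a == stage]
```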

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the security analysis. We agree that the discussion would benefit from explicit treatment of generalizability beyond the marketplace case study and will revise accordingly.

Point-by-point responses
  1. Referee: Security and Governance section: The security implications (supply-chain infiltration and prompt injection) are anchored exclusively to the ClawHavoc marketplace campaign; the manuscript does not provide evidence or discussion of whether comparable risks manifest in non-marketplace or closed agent platforms, which limits the load-bearing claim that these risks are inherent to skill-based agents broadly.

    Authors: We thank the referee for this observation. The ClawHavoc campaign is used as the sole concrete, large-scale public example because it is the only documented incident with detailed data on nearly 1,200 malicious skills. The manuscript does not present evidence from closed or non-marketplace platforms, as such data is not publicly available. At the same time, the risks of supply-chain infiltration and prompt injection arise from the core properties of skills as modular, distributable units with explicit interfaces and payloads, which are captured in both the seven design patterns and the representation-by-scope taxonomy. These properties exist independently of the distribution channel. In the revised manuscript we will expand the Security and Governance section to (1) state that the mechanisms generalize to any skill-sharing setting (internal libraries, direct imports, self-evolving libraries), (2) illustrate the point with references to the design patterns that apply outside marketplaces, and (3) note the lack of public empirical data from closed systems as a limitation. This change will make the broader claim explicit while remaining faithful to the available evidence. revision: yes

Circularity Check

0 steps flagged

No significant circularity

Full rationale

This SoK paper surveys the agentic skills literature, proposes two complementary taxonomies (seven design patterns and representation-by-scope) as organizing frameworks, and grounds its security analysis in the external ClawHavoc campaign. No load-bearing claims reduce to self-defined quantities, fitted parameters, or self-citation chains by construction. The taxonomies are explicitly presented as non-exhaustive complementary views rather than derived results, and all empirical references (benchmarks, case study) are anchored to external sources without internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces no mathematical free parameters, formal axioms, or newly postulated physical entities; it relies entirely on literature synthesis and one external case study.

pith-pipeline@v0.9.0 · 5613 in / 1186 out tokens · 42543 ms · 2026-05-14T23:14:01.776364+00:00 · methodology

discussion (0)

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Five Attacks on x402 Agentic Payment Protocol

    cs.CR 2026-05 conditional novelty 7.0

    Five practical attacks on the x402 agentic payment protocol are demonstrated across authorization, binding, replay protection, and web handling, validated on local chains, Base Sepolia, live endpoints, and three open-...

  2. OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    OLIVIA treats LLM agent action selection as a contextual linear bandit over frozen hidden states and applies UCB exploration to adapt online, yielding consistent gains over static ReAct and prompt-based baselines on f...

  3. Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries

    cs.SE 2026-05 conditional novelty 7.0

    SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round rep...

  4. Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck

    cs.LG 2026-05 unverdicted novelty 7.0

    CMIB uses a conditional multimodal information bottleneck to create reusable agent skills that separate verbalizable text content from predictive perceptual residuals, improving execution stability.

  5. SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    SkillRet benchmark shows fine-tuned retrievers improve NDCG@10 by 13+ points over prior models on large-scale skill retrieval for LLM agents.

  6. Sealing the Audit-Runtime Gap for LLM Skills

    cs.CR 2026-05 unverdicted novelty 7.0

    SIGIL cryptographically seals the audit-runtime gap for LLM skills via an on-chain registry with four publication types, DAO vetting, and a runtime verification loader that enforces integrity and permissions.

  7. Uncertainty Propagation in LLM-Based Systems

    cs.SE 2026-04 unverdicted novelty 7.0

    This paper introduces a systems-level conceptual framing and a three-level taxonomy (intra-model, system-level, socio-technical) for uncertainty propagation in compound LLM applications, along with engineering insight...

  8. Knows: Agent-Native Structured Research Representations

    cs.AI 2026-04 conditional novelty 7.0

    Knows uses a YAML sidecar specification to provide structured, agent-consumable representations of research papers, yielding large accuracy gains for small LLMs on comprehension tasks and rapid community adoption via ...

  9. SoK: Blockchain Agent-to-Agent Payments

    q-fin.GN 2026-04 unverdicted novelty 7.0

    The first systematization of blockchain-based agent-to-agent payments organizes designs into discovery, authorization, execution, and accounting stages while identifying trust and security gaps.

  10. SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

    cs.CR 2026-05 unverdicted novelty 6.0

    SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.

  11. SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.

  12. Group of Skills: Group-Structured Skill Retrieval for Agent Skill Libraries

    cs.CL 2026-05 unverdicted novelty 6.0

    GoSkills converts flat skill lists into role-labeled execution contexts via anchor-centered groups and graph expansion, preserving coverage and improving rewards on SkillsBench and ALFWorld under small skill budgets.

  13. Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    Skill1 trains one policy to jointly evolve skill query generation, re-ranking, task solving, and distillation from a single task-success signal, with low-frequency trends crediting selection and high-frequency variati...

  14. SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology

    cs.AI 2026-04 unverdicted novelty 6.0

    SkillGraph jointly evolves agent skills and collaboration topologies in multi-agent vision-language systems using a multimodal graph transformer and a skill designer, yielding consistent performance gains on benchmarks.

  15. Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents

    cs.AI 2026-04 conditional novelty 6.0

    The Experience Compression Spectrum unifies memory, skills, and rules in LLM agents along increasing compression levels and identifies the absence of adaptive cross-level compression as the missing diagonal.

  16. Safety in Embodied AI: A Survey of Risks, Attacks, and Defenses

    cs.CR 2026-03 unverdicted novelty 6.0

    The survey organizes over 400 papers on embodied AI safety into a multi-level taxonomy and flags overlooked issues such as fragile multimodal fusion and unstable planning under jailbreaks.

  17. Ace-Skill: Bootstrapping Multimodal Agents with Prioritized and Clustered Evolution

    cs.AI 2026-05 unverdicted novelty 5.0

    Ace-Skill boosts multimodal agent self-evolution via prioritized rollouts with lazy-decay tracking and semantic knowledge clustering, yielding up to 35% relative gains on tool-use benchmarks and zero-shot transfer to ...

  18. Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 5.0

    Skill1 co-evolves skill selection, utilization, and distillation inside a single policy using only task-outcome reward, with low-frequency trends crediting selection and high-frequency variation crediting distillation...

  19. Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 5.0

    Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency var...

  20. Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

    cs.SE 2026-04 accept novelty 5.0

    LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.

  21. A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications

    cs.IR 2026-05 unverdicted novelty 4.0

    The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.

  22. SciFi: A Safe, Lightweight, User-Friendly, and Fully Autonomous Agentic AI Workflow for Scientific Applications

    cs.AI 2026-04 unverdicted novelty 4.0

    SciFi is a safe, lightweight agentic AI framework that automates structured scientific tasks with minimal human intervention via isolated environments and layered self-assessing agents.

  23. ChromaFlow: A Negative Ablation Study of Orchestration Overhead in Tool-Augmented Agent Evaluation

    cs.AI 2026-05 unverdicted novelty 3.0

    Expanded orchestration in ChromaFlow lowered accuracy on GAIA tasks from 29/53 to 27/53 while increasing timeouts, tool failures, and costs.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · cited by 21 Pith papers · 28 internal anchors

  1. [1]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y . Bisk, D. Fried, U. Alon, and G. Neubig, “WebArena: A realistic web environment for building autonomous agents,” in International Conference on Learning Representations (ICLR), 2024, arXiv:2307.13854

  2. [2]

    SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

    J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press, “SWE-agent: Agent-computer interfaces enable au- tomated software engineering,” inAdvances in Neural Information Processing Systems (NeurIPS), 2024, arXiv:2405.15793

  3. [3]

    Measuring and augmenting large language models for solving capture-the-flag challenges,

    Z. Ji, D. Wu, W. Jiang, P. Ma, Z. Li, and S. Wang, “Measuring and augmenting large language models for solving capture-the-flag challenges,” inProceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), 2025, pp. 603–617

  4. [4]

    HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    Y . Shen, K. Song, X. Tan, D. Li, W. Lu, and Y . Zhuang, “Hug- gingGPT: Solving AI tasks with ChatGPT and its friends in Hug- ging Face,” inAdvances in Neural Information Processing Systems (NeurIPS), 2023, arXiv:2303.17580

  5. [5]

    Can large language model agents simulate human trust behavior?

    C. Xie, C. Chen, F. Jia, Z. Ye, S. Lai, K. Shu, J. Gu, A. Bibi, Z. Hu, D. Jurgenset al., “Can large language model agents simulate human trust behavior?”Advances in Neural Information Processing Systems (NeurIPS), vol. 37, pp. 15 674–15 729, 2024

  6. [6]

    MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

    S. Hong, M. Zhuge, J. Chen, X. Zheng, Y . Cheng, J. Wang, C. Zhang, Z. Wang, S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, and J. Schmidhuber, “MetaGPT: Meta programming for a multi-agent collaborative framework,” inInternational Conference on Learning Representations (ICLR), 2024, arXiv:2308.00352

  7. [7]

    AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

    Q. Wu, G. Bansal, J. Zhang, Y . Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu, A. H. Awadallah, R. W. White, D. Burger, and C. Wang, “AutoGen: Enabling next-gen LLM applications via multi- agent conversation,” inConference on Language Modeling (COLM), 2024, arXiv:2308.08155

  8. [8]

    An integrated theory of the mind

    J. R. Anderson, D. Bothell, M. D. Byrne, S. Douglass, C. Lebiere, and Y . Qin, “An integrated theory of the mind.”Psychological Review, vol. 111, no. 4, pp. 1036–1060, 2004

  9. [9]

    J. E. Laird,The Soar Cognitive Architecture. MIT Press, 2012

  10. [10]

    Between MDPs and semi- MDPs: A framework for temporal abstraction in reinforcement learn- ing,

    R. S. Sutton, D. Precup, and S. Singh, “Between MDPs and semi- MDPs: A framework for temporal abstraction in reinforcement learn- ing,”Artificial Intelligence, vol. 112, no. 1–2, pp. 181–211, 1999

  11. [11]

    The landscape of agentic reinforcement learning for llms: A survey,

    G. Zhang, H. Geng, X. Yu, Z. Yin, Z. Zhang, Z. Tan, H. Zhou, Z.-Z. Li, X. Xue, Y . Liet al., “The landscape of agentic reinforcement learning for llms: A survey,”Transactions on Machine Learning Research (TMLR)

  12. [12]

    ToolRL: Reward is all tool learning needs,

    C. Qian, E. C. Acikgoz, Q. He, H. W ANG, X. Chen, D. Hakkani- Tür, G. Tur, and H. Ji, “ToolRL: Reward is all tool learning needs,” inAnnual Conference on Neural Information Processing Systems (NeurIPS)

  13. [13]

    A survey of self-evolving agents: What, when, how, and where to evolve on the path to artificial super intelligence,

    H.-a. Gao, J. Geng, W. Hua, M. Hu, X. Juan, H. Liu, S. Liu, J. Qiu, X. Qi, Q. Renet al., “A survey of self-evolving agents: What, when, how, and where to evolve on the path to artificial super intelligence,” Transactions on Machine Learning Research (TMLR)

  14. [14]

    Safety at scale: A comprehensive survey of large model and agent safety,

    X. Ma, Y . Gao, Y . Wang, R. Wang, X. Wang, Y . Sun, Y . Ding, H. Xu, Y . Chen, Y . Zhaoet al., “Safety at scale: A comprehensive survey of large model and agent safety,”Foundations and Trends in Privacy and Security, vol. 8, no. 3-4, pp. 1–240, 2026

  15. [15]

    A Survey on Large Language Model based Autonomous Agents

    L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y . Lin, W. X. Zhao, Z. Wei, and J.-R. Wen, “A survey on large language model based autonomous agents,”Frontiers of Computer Science, vol. 18, no. 6, p. 186345, 2024, extended from arXiv:2308.11432

  16. [16]

    A survey on agentic security: Applications, threats and defenses,

    A. Shahriar, M. N. Rahman, S. Ahmed, F. Sadeque, and M. R. Parvez, “A survey on agentic security: Applications, threats and defenses,” arXiv preprint arXiv:2510.06445, 2025

  17. [17]

    Understanding the planning of LLM agents: A survey

    X. Huang, W. Liu, X. Chen, X. Wang, H. Wang, D. Lian, Y . Wang, R. Tang, and E. Chen, “Understanding the planning of LLM agents: A survey,”arXiv preprint arXiv:2402.02716, 2024

  18. [18]

    Large Language Model based Multi-Agents: A Survey of Progress and Challenges

    T. Guo, X. Chen, Y . Wang, R. Chang, S. Pei, N. V . Chawla, O. Wiest, and X. Zhang, “Large language model based multi-agents: A survey of progress and challenges,”arXiv preprint arXiv:2402.01680, 2024

  19. [19]

    Survey on Evaluation of LLM-based Agents

    A. Yehudai, L. Eden, A. Li, G. Uziel, Y . Zhao, R. Bar-Haim, A. Co- han, and M. Shmueli-Scheuer, “Survey on evaluation of LLM-based agents,”arXiv preprint arXiv:2503.16416, 2025

  20. [20]

    Information fidelity in tool-using llm agents: A martingale analysis of the model context protocol,

    F. X. Fan, C. Tan, R. Wattenhofer, and Y .-S. Ong, “Information fidelity in tool-using llm agents: A martingale analysis of the model context protocol,”arXiv preprint arXiv:2602.13320, 2026

  21. [21]

    Tool learning with foundation models,

    Y . Qin, S. Hu, Y . Lin, W. Chen, N. Ding, G. Cui, Z. Zeng, X. Zhou, Y . Huang, C. Xiao, C. Han, Y . R. Fung, Y . Su, H. Wang, C. Qian, R. Tian, K. Zhu, S. Liang, X. Shen, B. Xu, Z. Zhang, Y . Ye, B. Li et al., “Tool learning with foundation models,”ACM Computing Surveys (CSUR), vol. 57, no. 4, pp. 101:1–101:40, 2025

  22. [22]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Tool- former: Language models can teach themselves to use tools,” in Advances in Neural Information Processing Systems (NeurIPS), 2023, arXiv:2302.04761

  23. [23]

    Large language model based multi-agents: A survey of progress and challenges,

    T. Guo, X. Chen, Y . Wang, R. Chang, S. Pei, N. V . Chawla, O. Wiest, and X. Zhang, “Large language model based multi-agents: A survey of progress and challenges,” inProceedings of the Thirty-Third In- ternational Joint Conference on Artificial Intelligence (IJCAI), 2024, pp. 8048–8057, survey track

  24. [24]

    MemGPT: Towards LLMs as Operating Systems

    C. Packer, V . Fang, S. G. Patil, K. Lin, S. Wooders, and J. E. Gonzalez, “MemGPT: Towards LLMs as operating systems,”arXiv preprint arXiv:2310.08560, 2023

  25. [25]

    Memory matters: The need to improve long-term memory in llm-agents,

    K. Hatalis, D. Christou, J. Myers, S. Jones, K. Lambert, A. Amos- Binks, Z. Dannenhauer, and D. Dannenhauer, “Memory matters: The need to improve long-term memory in llm-agents,” inProceedings of the AAAI Symposium Series (AAAI), vol. 2, no. 1, 2023, pp. 277–280

  26. [26]

    Sok: Semantic privacy in large language models,

    B. Ma, Y . Jiang, X. Wang, G. Yu, Q. Wang, C. Sun, C. Li, X. Qi, Y . He, W. Niet al., “Sok: Semantic privacy in large language models,” arXiv preprint arXiv:2506.23603, 2025

  27. [27]

    A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT

    J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. El- nashar, J. Spencer-Smith, and D. C. Schmidt, “A prompt pattern catalog to enhance prompt engineering with chatgpt,”arXiv preprint arXiv:2302.11382, 2023

  28. [28]

    SHOP2: An HTN planning system,

    D. S. Nau, T.-C. Au, O. Ilghami, U. Kuter, J. W. Murdock, D. Wu, and F. Yaman, “SHOP2: An HTN planning system,”Journal of Artificial Intelligence Research, vol. 20, pp. 379–404, 2003

  29. [29]

    BDI agents: From theory to practice,

    A. S. Rao and M. P. Georgeff, “BDI agents: From theory to practice,” inProceedings of the First International Conference on Multi-Agent Systems (ICMAS), 1995, pp. 312–319

  30. [30]

    STRIPS: A new approach to the appli- cation of theorem proving to problem solving,

    R. E. Fikes and N. J. Nilsson, “STRIPS: A new approach to the appli- cation of theorem proving to problem solving,”Artificial Intelligence, vol. 2, no. 3–4, pp. 189–208, 1971

  31. [31]

    Perception in chess,

    W. G. Chase and H. A. Simon, “Perception in chess,”Cognitive Psychology, vol. 4, no. 1, pp. 55–81, 1973

  32. [32]

    SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

    X. Li, W. Chen, Y . Liu, S. Zheng, X. Chen, Y . He, Y . Li, B. You, H. Shen, J. Sunet al., “SkillsBench: Benchmarking how well agent skills work across diverse tasks,”arXiv preprint arXiv:2602.12670, 2026

  33. [33]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    G. Wang, Y . Xie, Y . Jiang, A. Mandlekar, C. Xiao, Y . Zhu, L. Fan, and A. Anandkumar, “V oyager: An open-ended embodied agent with large language models,”Transactions on Machine Learning Research (TMLR), 2024, arXiv:2305.16291

  34. [34]

    ReAct: Synergizing Reasoning and Acting in Language Models

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao, “ReAct: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations (ICLR), 2023, arXiv:2210.03629

  35. [35]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” in Advances in Neural Information Processing Systems (NeurIPS), 2023, arXiv:2303.11366

  36. [36]

    Jarvis-1: Open-world multi-task agents with memory-augmented multimodal language models

    Z. Wang, S. Cai, A. Liu, Y . Jin, J. Hou, B. Zhang, H. Lin, Z. He, Z. Zheng, Y . Yang, X. Ma, and Y . Liang, “JARVIS-1: Open- world multi-task agents with memory-augmented multimodal lan- guage models,”IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 47, no. 3, pp. 1894–1907, 2025, extended from arXiv:2311.05997

  37. [37]

    Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, and Yu Qiao

    Z. Wang, S. Cai, A. Liu, X. Ma, and Y . Liang, “Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents,” inAdvances in Neural Information Processing Systems (NeurIPS), 2023, arXiv:2302.01560

  38. [38]

    Skill-it! a data-driven skills framework for understanding and training language models,

    M. F. Chen, N. Roberts, K. Bhatia, J. Wang, C. Zhang, F. Sala, and C. Ré, “Skill-it! a data-driven skills framework for understanding and training language models,” inAdvances in Neural Information Processing Systems (NeurIPS), 2023, arXiv:2307.14430

  39. [39]

    Kangning Zhang, Yingjie Qin, Jiarui Jin, Yifan Liu, Ruilong Su, Weinan Zhang, and Yong Yu

    C. Zhang, Z. Yang, J. Liu, Y . Han, X. Chen, Z. Huang, B. Fu, and G. Yu, “AppAgent: Multimodal agents as smartphone users,” inPro- ceedings of the CHI Conference on Human Factors in Computing Sys- tems (CHI), 2025, pp. 70:1–70:20, extended from arXiv:2312.13771

  40. [40]

    Cradle: Empowering foundation agents towards general computer control,

    W. Tan, W. Zhang, X. Xu, H. Xia, Z. Ding, B. Li, B. Zhouet al., “Cradle: Empowering foundation agents towards general computer control,” inInternational Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 267, 2025, pp. 58 658–58 725, extended from arXiv:2403.03186

  41. [41]

    Agenttuning: Enabling generalized agent abilities for llms,

    A. Zeng, M. Liu, R. Lu, B. Wang, X. Liu, Y . Dong, and J. Tang, “AgentTuning: Enabling generalized agent abilities for LLMs,” in Findings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 3053–3077, arXiv:2310.12823

  42. [42]

    Executable code actions elicit better LLM agents

    X. Wang, Y . Chen, L. Yuan, Y . Zhang, Y . Li, H. Peng, and H. Ji, “Executable code actions elicit better LLM agents,” inInternational Conference on Machine Learning (ICML), 2024, arXiv:2402.01030

  43. [43]

    arXiv preprint arXiv:2310.05915 , year=

    B. Chen, C. Shu, E. Shareghi, N. Collier, K. Narasimhan, and S. Yao, “FireAct: Toward language agent fine-tuning,”arXiv preprint arXiv:2310.05915, 2023

  44. [44]

    B. Qiao, L. Li, X. Zhang, S. He, Y. Kang, C. Zhang, F. Yang, H. Dong, J. Zhang, L. Wang, M. Ma, P. Zhao, S. Qin, X. Qin, C. Du, Y. Xu, Q. Lin, S. Rajmohan, and D. Zhang, "TaskWeaver: A code-first agent framework," arXiv preprint arXiv:2311.17541, 2023

  45. [45]

    M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gober, K. Hausman et al., "Do as I can, not as I say: Grounding language in robotic affordances," in Conference on Robot Learning (CoRL), 2022, arXiv:2204.01691

  46. [46]

    J. S. Park, J. C. O'Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein, "Generative agents: Interactive simulacra of human behavior," in ACM Symposium on User Interface Software and Technology (UIST), 2023, arXiv:2304.03442

  47. [47]

    Y. J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, and A. Anandkumar, "Eureka: Human-level reward design via coding large language models," in International Conference on Learning Representations (ICLR), 2024, arXiv:2310.12931

  48. [48]

    K. Nottingham, P. Ammanabrolu, A. Suhr, Y. Choi, H. Hajishirzi, S. Singh, and R. Fox, "Do embodied agents dream of pixelated sheep: Embodied decision making using language guided world modelling," in International Conference on Machine Learning (ICML), 2023, arXiv:2301.12050

  49. [49]

    W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y. Chebotar et al., "Inner monologue: Embodied reasoning through planning with language models," in Conference on Robot Learning (CoRL), 2022, arXiv:2207.05608

  50. [50]

    E. Gamma, R. Helm, R. Johnson, and J. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley Professional, 1994

  51. [51]

    Microsoft, "Semantic Kernel: A lightweight SDK for AI agent development," https://github.com/microsoft/semantic-kernel, 2023, accessed: 2026-02-21

  52. [52]

    J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng, "Code as policies: Language model programs for embodied control," in IEEE International Conference on Robotics and Automation (ICRA), 2023, arXiv:2209.07753

  53. [53]

    I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, and A. Garg, "ProgPrompt: Generating situated robot task plans using large language models," in IEEE International Conference on Robotics and Automation (ICRA), 2023, arXiv:2209.11302

  54. [54]

    A. Zhou, K. Yan, M. Shlapentokh-Rothman, H. Wang, and Y.-X. Wang, "Language agent tree search unifies reasoning, acting, and planning in language models," in International Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 235, 2024, pp. 62138–62160, arXiv:2310.04406

  55. [55]

    Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi, "Self-Instruct: Aligning language models with self-generated instructions," in Annual Meeting of the Association for Computational Linguistics (ACL), 2023, pp. 13484–13508

  56. [56]

    C. Qian, C. Han, Y. R. Fung, Y. Qin, Z. Liu, and H. Ji, "CREATOR: Tool creation for disentangling abstract and concrete reasoning of large language models," in Findings of the Association for Computational Linguistics (EMNLP), 2023, arXiv:2305.14318

  57. [57]

    Anthropic, "Introducing the model context protocol," https://www.anthropic.com/news/model-context-protocol, 2024, accessed: 2026-02-21

  58. [58]

    Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian et al., "ToolLLM: Facilitating large language models to master 16000+ real-world APIs," in International Conference on Learning Representations (ICLR), 2024, arXiv:2307.16789

  59. [59]

    OpenClaw Project, "OpenClaw: Personal AI assistant," https://github.com/openclaw/openclaw, 2026, official repository (216k stars at access time). Accessed: 2026-02-22

  60. [60]

    Alex and Oren Yomtov, "ClawHavoc: 341 malicious clawed skills found by the bot they were targeting," https://www.koi.ai/blog/clawhavoc-341-malicious-clawedbot-skills-found-by-the-bot-they-were-targeting, 2026, Koi Research blog post; update dated Feb 16, 2026 reports 824 malicious skills. Accessed: 2026-02-22

  61. [61]

    X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su, "Mind2Web: Towards a generalist agent for the web," in Advances in Neural Information Processing Systems (NeurIPS), 2023, spotlight. arXiv:2306.06070

  62. [62]

    T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, Y. Liu, Y. Xu, S. Zhou, S. Savarese, C. Xiong, V. Zhong, and T. Yu, "OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments," in Advances in Neural Information Processing Systems (NeurIPS), 2024, Datasets and Benchmarks track. arX...

  63. [63]

    C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan, "SWE-bench: Can language models resolve real-world GitHub issues?" in International Conference on Learning Representations (ICLR), 2024, arXiv:2310.06770

  64. [64]

    H. Ravichandar, A. S. Polydoros, S. Chernova, and A. Billard, "Recent advances in robot learning from demonstration," Annual Review of Control, Robotics, and Autonomous Systems, vol. 3, pp. 297–330, 2020

  65. [65]

    Significant Gravitas, "AutoGPT: An autonomous GPT-4 experiment," https://github.com/Significant-Gravitas/AutoGPT, 2023, accessed: 2026-02-21

  66. [66]

    L. Wang, X. Zhang, H. Su, and J. Zhu, "A comprehensive survey of continual learning: Theory, method and application," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 46, no. 8, pp. 5362–5383, 2024

  67. [67]

    C. Zhang, K. Yang, S. Hu, Z. Wang, G. Li, Y. Sun, C. Zhang, Z. Zhang, A. Liu, S.-C. Zhu, X. Chang, J. Zhang, F. Yin, Y. Liang, and Y. Yang, "ProAgent: Building proactive cooperative agents with large language models," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 38, no. 16, 2024, pp. 17591–17599, arXiv:2308.11339

  68. [68]

    P. Ladisa, H. Plate, M. Martinez, and O. Barais, "SoK: Taxonomy of attacks on open-source software supply chains," in IEEE Symposium on Security and Privacy (SP), 2023, pp. 1509–1526

  69. [69]

    K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, "Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection," in ACM Workshop on Artificial Intelligence and Security (AISec), 2023, arXiv:2302.12173

  70. [70]

    B. Quintero, "From automation to infection: How OpenClaw AI agent skills are being weaponized," https://blog.virustotal.com/2026/02/from-automation-to-infection-how.html, 2026, VirusTotal Blog, February 2, 2026. Accessed: 2026-02-22

  71. [71]

    B. Van, "Agent skills guard," https://github.com/brucevanfdm/agent-skills-guard, 2026, desktop scanner/manager; README reports 8 risk categories and 22 hard-trigger rules. Accessed: 2026-02-22

  72. [72]

    G. Singh, "SkillGuard: AI agent security scanner," https://skillgaurd.up.railway.app/, 2026, website and linked source repo describe AST analysis for JS/TS, 9-language coverage, and 20+ attack patterns. Accessed: 2026-02-22

  73. [73]

    Y. Liu, G. Deng, Z. Xu, Y. Li, Y. Zheng, Y. Zhang, L. Zhao, T. Zhang, K. Wang, and Y. Liu, "Jailbreaking ChatGPT via prompt engineering: An empirical study," arXiv preprint arXiv:2305.13860, 2023

  74. [74]

    G. Mialon, C. Fourrier, T. Wolf, Y. LeCun, and T. Scialom, "GAIA: A benchmark for general AI assistants," in International Conference on Learning Representations (ICLR), 2024, poster. arXiv:2311.12983

  75. [75]

    X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, S. Zhang, X. Deng, A. Zeng, Z. Du, C. Zhang, S. Shen, T. Zhang, Y. Su, H. Sun, M. Huang, Y. Dong, and J. Tang, "AgentBench: Evaluating LLMs as agents," in International Conference on Learning Representations (ICLR), 2024, arXiv:2308.03688

  76. [76]

    C. Rawles, S. Clinckemaillie, Y. Chang, J. Waltz, G. Lau, M. Fair, A. Li, W. E. Bishop, W. Li, F. Campbell-Ajala, D. K. Toyama, R. J. Berry, D. Tyamagundlu, T. P. Lillicrap, and O. Riva, "AndroidWorld: A dynamic benchmarking environment for autonomous agents," in International Conference on Learning Representations (ICLR), 2025, arXiv:2405.14573