pith. machine review for the scientific record.

arxiv: 2602.20867 · v1 · submitted 2026-02-24 · 💻 cs.CR · cs.AI · cs.CE · cs.ET

Recognition: 3 theorem links · Lean Theorem

SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 23:14 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.CE · cs.ET
keywords agentic skills · LLM agents · systematization of knowledge · supply chain security · prompt injection · agent evaluation · marketplace risks

The pith

Agentic skills function as reusable procedural modules that let LLM agents handle long-horizon tasks reliably across domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Agentic skills package procedural knowledge into callable modules that include applicability conditions, execution policies, termination criteria, and interfaces, so agents can apply the same capability to many tasks instead of building one-off plans each time. The paper maps the complete skill lifecycle from discovery through update and supplies two taxonomies that organize how skills are built and used in practice. Security analysis shows that the same reusability creates supply-chain exposure, with prompt-injection payloads able to travel through skills, as demonstrated when nearly 1,200 malicious skills entered a major marketplace and stole credentials at scale. Evaluation data indicate that hand-curated skills raise agent success rates while skills generated by the agents themselves tend to lower them. The work therefore frames the move toward robust, verifiable, and certifiable skills as a necessary step for trustworthy autonomous agents.

Core claim

Agentic skills are distinct from atomic tool calls because they carry explicit conditions, policies, and reusable interfaces that let them operate reliably across tasks; systematizing their design patterns and representations reveals both performance gains from curated skills and concrete supply-chain and injection risks that must be addressed for safe deployment.

What carries the argument

The skill layer, consisting of reusable modules that combine procedural knowledge with applicability conditions, execution policies, termination criteria, and standardized interfaces.
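The four components named above can be made concrete in a short sketch. This is a minimal illustration, assuming a plausible shape for a skill module; the class, field, and method names are hypothetical, not the paper's notation.

```python
# Hypothetical sketch of a skill module with the four components the paper
# names: applicability conditions, an execution policy, a termination
# criterion, and a declared interface. Names are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Skill:
    name: str
    # Applicability condition: predicate over the current task/environment state.
    applicable: Callable[[dict], bool]
    # Execution policy: maps state to the next action (here, a state transform).
    policy: Callable[[dict], Callable[[dict], dict]]
    # Termination criterion: when to stop and hand control back to the agent.
    done: Callable[[dict], bool]
    # Reusable interface: declared inputs/outputs so skills can be composed.
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)

    def run(self, state: dict, max_steps: int = 50) -> dict:
        """Apply the policy until the termination criterion fires."""
        if not self.applicable(state):
            raise ValueError(f"skill {self.name!r} not applicable here")
        for _ in range(max_steps):
            if self.done(state):
                break
            action = self.policy(state)
            state = action(state)  # illustrative: actions transform state
        return state
```

The point of the sketch is that the same `Skill` object can be invoked on many tasks whose states satisfy its applicability condition, which is exactly what distinguishes it from a one-off plan.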

Load-bearing premise

That the two proposed taxonomies capture the essential structure of agentic skills, and that the risks observed in one marketplace case will recur on other agent platforms.

What would settle it

A subsequent marketplace audit or controlled deployment that finds zero successful skill-based exfiltrations despite widespread adoption, or that shows a simpler taxonomy organizes existing skills more cleanly than the seven-pattern plus representation-by-scope scheme.

Original abstract

Agentic systems increasingly rely on reusable procedural capabilities, a.k.a. agentic skills, to execute long-horizon workflows reliably. These capabilities are callable modules that package procedural knowledge with explicit applicability conditions, execution policies, termination criteria, and reusable interfaces. Unlike one-off plans or atomic tool calls, skills operate (and often do well) across tasks. This paper maps the skill layer across the full lifecycle (discovery, practice, distillation, storage, composition, evaluation, and update) and introduces two complementary taxonomies. The first is a system-level set of seven design patterns capturing how skills are packaged and executed in practice, from metadata-driven progressive disclosure and executable code skills to self-evolving libraries and marketplace distribution. The second is an orthogonal representation × scope taxonomy describing what skills are (natural language, code, policy, hybrid) and what environments they operate over (web, OS, software engineering, robotics). We analyze the security and governance implications of skill-based agents, covering supply-chain risks, prompt injection via skill payloads, and trust-tiered execution, grounded by a case study of the ClawHavoc campaign in which nearly 1{,}200 malicious skills infiltrated a major agent marketplace, exfiltrating API keys, cryptocurrency wallets, and browser credentials at scale. We further survey deterministic evaluation approaches, anchored by recent benchmark evidence that curated skills can substantially improve agent success rates while self-generated skills may degrade them. We conclude with open challenges toward robust, verifiable, and certifiable skills for real-world autonomous agents.
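One governance idea the abstract names, trust-tiered execution, can be sketched as a capability gate. The tier names and capability sets below are assumptions chosen for illustration, not the paper's actual scheme.

```python
# Minimal sketch of trust-tiered execution for skills: a skill's trust tier
# determines which runtime capabilities it may exercise. Tier names and
# capability sets are illustrative assumptions, not taken from the paper.
TIER_CAPS = {
    "untrusted": set(),                              # metadata disclosure only
    "community": {"read_files"},                     # vetted but unsigned
    "certified": {"read_files", "network", "exec"},  # signed and audited
}

def allowed(skill_tier: str, capability: str) -> bool:
    """Gate a skill's requested capability by its trust tier.
    Unknown tiers get no capabilities (fail closed)."""
    return capability in TIER_CAPS.get(skill_tier, set())
```

Failing closed on unknown tiers is the design choice that matters here: a marketplace skill whose provenance cannot be established should default to the metadata-only tier rather than inherit any execution rights.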

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 4 minor

Summary. The paper is a Systematization of Knowledge (SoK) on agentic skills in LLM agents, defining them as reusable procedural modules with explicit applicability conditions, execution policies, termination criteria, and interfaces that operate reliably across tasks. It maps the full skill lifecycle (discovery through update), proposes two complementary taxonomies (seven design patterns for packaging/execution and an orthogonal representation-by-scope taxonomy covering natural language/code/policy/hybrid forms across web/OS/SE/robotics environments), analyzes security/governance risks (supply-chain attacks and prompt injection) grounded in the ClawHavoc campaign of nearly 1,200 malicious skills, surveys deterministic evaluation methods with benchmark evidence on curated vs. self-generated skills, and outlines open challenges for verifiable skills.

Significance. If the proposed taxonomies hold as stable organizing frameworks and the security analysis generalizes beyond the single case study, this SoK would provide a timely, practical reference for designing reliable agentic systems while highlighting concrete risks, potentially influencing both research taxonomies and marketplace governance standards in the LLM agent space.

major comments (1)
  1. [Security and Governance Implications] Security and Governance section: The security implications (supply-chain infiltration and prompt injection) are anchored exclusively to the ClawHavoc marketplace campaign; the manuscript does not provide evidence or discussion of whether comparable risks manifest in non-marketplace or closed agent platforms, which limits the load-bearing claim that these risks are inherent to skill-based agents broadly.
minor comments (4)
  1. [Abstract] Abstract: The notation '1{,}200' is a LaTeX artifact; standardize to '1,200' or 'nearly twelve hundred' for readability in the final version.
  2. [Taxonomies] Taxonomy presentation: The seven design patterns and representation-by-scope taxonomy are described in text; a single summary table or diagram mapping literature examples to each category would substantially improve clarity and allow readers to assess coverage.
  3. [Evaluation] Evaluation section: Benchmark evidence is cited for curated skills improving success rates while self-generated skills may degrade them, but no specific quantitative results (e.g., success-rate deltas or benchmark names) are tabulated; adding a small results table would make the survey more concrete.
  4. [Skill Lifecycle] Lifecycle mapping: The seven-stage lifecycle (discovery, practice, distillation, etc.) is introduced without an accompanying figure showing dependencies or feedback loops between stages; a diagram would aid comprehension.
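The dependency structure that minor comment 4 asks to see can be sketched minimally as a stage graph. The linear ordering follows the stage list in the abstract; the feedback edge from update back to discovery is an assumption suggested by the referee's request, since the paper's own figure is not reproduced here.

```python
# The seven lifecycle stages from the paper, sketched as a dependency graph.
# The update -> discovery feedback edge is an assumed illustration of the
# loop the referee asks to see, not a structure taken from the paper.
LIFECYCLE = ["discovery", "practice", "distillation", "storage",
             "composition", "evaluation", "update"]

# Sequential edges between adjacent stages, plus the assumed feedback loop.
EDGES = [(LIFECYCLE[i], LIFECYCLE[i + 1]) for i in range(len(LIFECYCLE) - 1)]
EDGES.append(("update", "discovery"))

def successors(stage: str) -> list:
    """Stages reachable in one step from the given stage."""
    return [b for a, b in EDGES if a == stage]
```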

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the security analysis. We agree that the discussion would benefit from explicit treatment of generalizability beyond the marketplace case study and will revise accordingly.

Point-by-point responses
  1. Referee: Security and Governance section: The security implications (supply-chain infiltration and prompt injection) are anchored exclusively to the ClawHavoc marketplace campaign; the manuscript does not provide evidence or discussion of whether comparable risks manifest in non-marketplace or closed agent platforms, which limits the load-bearing claim that these risks are inherent to skill-based agents broadly.

    Authors: We thank the referee for this observation. The ClawHavoc campaign is used as the sole concrete, large-scale public example because it is the only documented incident with detailed data on nearly 1,200 malicious skills. The manuscript does not present evidence from closed or non-marketplace platforms, as such data is not publicly available. At the same time, the risks of supply-chain infiltration and prompt injection arise from the core properties of skills as modular, distributable units with explicit interfaces and payloads, which are captured in both the seven design patterns and the representation-by-scope taxonomy. These properties exist independently of the distribution channel. In the revised manuscript we will expand the Security and Governance section to (1) state that the mechanisms generalize to any skill-sharing setting (internal libraries, direct imports, self-evolving libraries), (2) illustrate the point with references to the design patterns that apply outside marketplaces, and (3) note the lack of public empirical data from closed systems as a limitation. This change will make the broader claim explicit while remaining faithful to the available evidence. revision: yes

Circularity Check

0 steps flagged

No significant circularity

Full rationale

This SoK paper surveys the agentic skills literature, proposes two complementary taxonomies (seven design patterns and representation-by-scope) as organizing frameworks, and grounds its security analysis in the external ClawHavoc campaign. No load-bearing claims reduce to self-defined quantities, fitted parameters, or self-citation chains by construction. The taxonomies are explicitly presented as non-exhaustive complementary views rather than derived results, and all empirical references (benchmarks, case study) are anchored to external sources without internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces no mathematical free parameters, formal axioms, or newly postulated physical entities; it relies entirely on literature synthesis and one external case study.

pith-pipeline@v0.9.0 · 5613 in / 1186 out tokens · 42543 ms · 2026-05-14T23:14:01.776364+00:00 · methodology

discussion (0)

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Five Attacks on x402 Agentic Payment Protocol

    cs.CR 2026-05 conditional novelty 7.0

    Five practical attacks on the x402 agentic payment protocol are demonstrated across authorization, binding, replay protection, and web handling, validated on local chains, Base Sepolia, live endpoints, and three open-...

  2. OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    OLIVIA treats LLM agent action selection as a contextual linear bandit over frozen hidden states and applies UCB exploration to adapt online, yielding consistent gains over static ReAct and prompt-based baselines on f...

  3. Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries

    cs.SE 2026-05 conditional novelty 7.0

    SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round rep...

  4. Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck

    cs.LG 2026-05 unverdicted novelty 7.0

    CMIB uses a conditional multimodal information bottleneck to create reusable agent skills that separate verbalizable text content from predictive perceptual residuals, improving execution stability.

  5. SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    SkillRet benchmark shows fine-tuned retrievers improve NDCG@10 by 13+ points over prior models on large-scale skill retrieval for LLM agents.

  6. Sealing the Audit-Runtime Gap for LLM Skills

    cs.CR 2026-05 unverdicted novelty 7.0

    SIGIL cryptographically seals the audit-runtime gap for LLM skills via an on-chain registry with four publication types, DAO vetting, and a runtime verification loader that enforces integrity and permissions.

  7. Uncertainty Propagation in LLM-Based Systems

    cs.SE 2026-04 unverdicted novelty 7.0

    This paper introduces a systems-level conceptual framing and a three-level taxonomy (intra-model, system-level, socio-technical) for uncertainty propagation in compound LLM applications, along with engineering insight...

  8. Knows: Agent-Native Structured Research Representations

    cs.AI 2026-04 conditional novelty 7.0

    Knows uses a YAML sidecar specification to provide structured, agent-consumable representations of research papers, yielding large accuracy gains for small LLMs on comprehension tasks and rapid community adoption via ...

  9. SoK: Blockchain Agent-to-Agent Payments

    q-fin.GN 2026-04 unverdicted novelty 7.0

    The first systematization of blockchain-based agent-to-agent payments organizes designs into discovery, authorization, execution, and accounting stages while identifying trust and security gaps.

  10. SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

    cs.CR 2026-05 unverdicted novelty 6.0

    SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.

  11. SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.

  12. Group of Skills: Group-Structured Skill Retrieval for Agent Skill Libraries

    cs.CL 2026-05 unverdicted novelty 6.0

    GoSkills converts flat skill lists into role-labeled execution contexts via anchor-centered groups and graph expansion, preserving coverage and improving rewards on SkillsBench and ALFWorld under small skill budgets.

  13. Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    Skill1 trains one policy to jointly evolve skill query generation, re-ranking, task solving, and distillation from a single task-success signal, with low-frequency trends crediting selection and high-frequency variati...

  14. SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology

    cs.AI 2026-04 unverdicted novelty 6.0

    SkillGraph jointly evolves agent skills and collaboration topologies in multi-agent vision-language systems using a multimodal graph transformer and a skill designer, yielding consistent performance gains on benchmarks.

  15. Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents

    cs.AI 2026-04 conditional novelty 6.0

    The Experience Compression Spectrum unifies memory, skills, and rules in LLM agents along increasing compression levels and identifies the absence of adaptive cross-level compression as the missing diagonal.

  16. Safety in Embodied AI: A Survey of Risks, Attacks, and Defenses

    cs.CR 2026-03 unverdicted novelty 6.0

    The survey organizes over 400 papers on embodied AI safety into a multi-level taxonomy and flags overlooked issues such as fragile multimodal fusion and unstable planning under jailbreaks.

  17. Ace-Skill: Bootstrapping Multimodal Agents with Prioritized and Clustered Evolution

    cs.AI 2026-05 unverdicted novelty 5.0

    Ace-Skill boosts multimodal agent self-evolution via prioritized rollouts with lazy-decay tracking and semantic knowledge clustering, yielding up to 35% relative gains on tool-use benchmarks and zero-shot transfer to ...

  18. Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 5.0

    Skill1 co-evolves skill selection, utilization, and distillation inside a single policy using only task-outcome reward, with low-frequency trends crediting selection and high-frequency variation crediting distillation...

  19. Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 5.0

    Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency var...

  20. Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

    cs.SE 2026-04 accept novelty 5.0

    LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.

  21. A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications

    cs.IR 2026-05 unverdicted novelty 4.0

    The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.

  22. SciFi: A Safe, Lightweight, User-Friendly, and Fully Autonomous Agentic AI Workflow for Scientific Applications

    cs.AI 2026-04 unverdicted novelty 4.0

    SciFi is a safe, lightweight agentic AI framework that automates structured scientific tasks with minimal human intervention via isolated environments and layered self-assessing agents.

  23. ChromaFlow: A Negative Ablation Study of Orchestration Overhead in Tool-Augmented Agent Evaluation

    cs.AI 2026-05 unverdicted novelty 3.0

    Expanded orchestration in ChromaFlow lowered accuracy on GAIA tasks from 29/53 to 27/53 while increasing timeouts, tool failures, and costs.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · cited by 21 Pith papers · 28 internal anchors

  1. [1]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y . Bisk, D. Fried, U. Alon, and G. Neubig, “WebArena: A realistic web environment for building autonomous agents,” in International Conference on Learning Representations (ICLR), 2024, arXiv:2307.13854

  2. [2]

    SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

    J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press, “SWE-agent: Agent-computer interfaces enable au- tomated software engineering,” inAdvances in Neural Information Processing Systems (NeurIPS), 2024, arXiv:2405.15793

  3. [3]

    Measuring and augmenting large language models for solving capture-the-flag challenges,

    Z. Ji, D. Wu, W. Jiang, P. Ma, Z. Li, and S. Wang, “Measuring and augmenting large language models for solving capture-the-flag challenges,” inProceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), 2025, pp. 603–617

  4. [4]

    HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    Y . Shen, K. Song, X. Tan, D. Li, W. Lu, and Y . Zhuang, “Hug- gingGPT: Solving AI tasks with ChatGPT and its friends in Hug- ging Face,” inAdvances in Neural Information Processing Systems (NeurIPS), 2023, arXiv:2303.17580

  5. [5]

    Can large language model agents simulate human trust behavior?

    C. Xie, C. Chen, F. Jia, Z. Ye, S. Lai, K. Shu, J. Gu, A. Bibi, Z. Hu, D. Jurgenset al., “Can large language model agents simulate human trust behavior?”Advances in Neural Information Processing Systems (NeurIPS), vol. 37, pp. 15 674–15 729, 2024

  6. [6]

    MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

    S. Hong, M. Zhuge, J. Chen, X. Zheng, Y . Cheng, J. Wang, C. Zhang, Z. Wang, S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, and J. Schmidhuber, “MetaGPT: Meta programming for a multi-agent collaborative framework,” inInternational Conference on Learning Representations (ICLR), 2024, arXiv:2308.00352

  7. [7]

    AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

    Q. Wu, G. Bansal, J. Zhang, Y . Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu, A. H. Awadallah, R. W. White, D. Burger, and C. Wang, “AutoGen: Enabling next-gen LLM applications via multi- agent conversation,” inConference on Language Modeling (COLM), 2024, arXiv:2308.08155

  8. [8]

    An integrated theory of the mind

    J. R. Anderson, D. Bothell, M. D. Byrne, S. Douglass, C. Lebiere, and Y . Qin, “An integrated theory of the mind.”Psychological Review, vol. 111, no. 4, pp. 1036–1060, 2004

  9. [9]

    J. E. Laird,The Soar Cognitive Architecture. MIT Press, 2012

  10. [10]

    Between MDPs and semi- MDPs: A framework for temporal abstraction in reinforcement learn- ing,

    R. S. Sutton, D. Precup, and S. Singh, “Between MDPs and semi- MDPs: A framework for temporal abstraction in reinforcement learn- ing,”Artificial Intelligence, vol. 112, no. 1–2, pp. 181–211, 1999

  11. [11]

    The landscape of agentic reinforcement learning for llms: A survey,

    G. Zhang, H. Geng, X. Yu, Z. Yin, Z. Zhang, Z. Tan, H. Zhou, Z.-Z. Li, X. Xue, Y . Liet al., “The landscape of agentic reinforcement learning for llms: A survey,”Transactions on Machine Learning Research (TMLR)

  12. [12]

    ToolRL: Reward is all tool learning needs,

    C. Qian, E. C. Acikgoz, Q. He, H. W ANG, X. Chen, D. Hakkani- Tür, G. Tur, and H. Ji, “ToolRL: Reward is all tool learning needs,” inAnnual Conference on Neural Information Processing Systems (NeurIPS)

  13. [13]

    A survey of self-evolving agents: What, when, how, and where to evolve on the path to artificial super intelligence,

    H.-a. Gao, J. Geng, W. Hua, M. Hu, X. Juan, H. Liu, S. Liu, J. Qiu, X. Qi, Q. Renet al., “A survey of self-evolving agents: What, when, how, and where to evolve on the path to artificial super intelligence,” Transactions on Machine Learning Research (TMLR)

  14. [14]

    Safety at scale: A comprehensive survey of large model and agent safety,

    X. Ma, Y . Gao, Y . Wang, R. Wang, X. Wang, Y . Sun, Y . Ding, H. Xu, Y . Chen, Y . Zhaoet al., “Safety at scale: A comprehensive survey of large model and agent safety,”Foundations and Trends in Privacy and Security, vol. 8, no. 3-4, pp. 1–240, 2026

  15. [15]

    A Survey on Large Language Model based Autonomous Agents

    L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y . Lin, W. X. Zhao, Z. Wei, and J.-R. Wen, “A survey on large language model based autonomous agents,”Frontiers of Computer Science, vol. 18, no. 6, p. 186345, 2024, extended from arXiv:2308.11432

  16. [16]

    A survey on agentic security: Applications, threats and defenses,

    A. Shahriar, M. N. Rahman, S. Ahmed, F. Sadeque, and M. R. Parvez, “A survey on agentic security: Applications, threats and defenses,” arXiv preprint arXiv:2510.06445, 2025

  17. [17]

    Understanding the planning of LLM agents: A survey

    X. Huang, W. Liu, X. Chen, X. Wang, H. Wang, D. Lian, Y . Wang, R. Tang, and E. Chen, “Understanding the planning of LLM agents: A survey,”arXiv preprint arXiv:2402.02716, 2024

  18. [18]

    Large Language Model based Multi-Agents: A Survey of Progress and Challenges

    T. Guo, X. Chen, Y . Wang, R. Chang, S. Pei, N. V . Chawla, O. Wiest, and X. Zhang, “Large language model based multi-agents: A survey of progress and challenges,”arXiv preprint arXiv:2402.01680, 2024

  19. [19]

    Survey on Evaluation of LLM-based Agents

    A. Yehudai, L. Eden, A. Li, G. Uziel, Y . Zhao, R. Bar-Haim, A. Co- han, and M. Shmueli-Scheuer, “Survey on evaluation of LLM-based agents,”arXiv preprint arXiv:2503.16416, 2025

  20. [20]

    Information fidelity in tool-using llm agents: A martingale analysis of the model context protocol,

    F. X. Fan, C. Tan, R. Wattenhofer, and Y .-S. Ong, “Information fidelity in tool-using llm agents: A martingale analysis of the model context protocol,”arXiv preprint arXiv:2602.13320, 2026

  21. [21]

    Tool learning with foundation models,

    Y . Qin, S. Hu, Y . Lin, W. Chen, N. Ding, G. Cui, Z. Zeng, X. Zhou, Y . Huang, C. Xiao, C. Han, Y . R. Fung, Y . Su, H. Wang, C. Qian, R. Tian, K. Zhu, S. Liang, X. Shen, B. Xu, Z. Zhang, Y . Ye, B. Li et al., “Tool learning with foundation models,”ACM Computing Surveys (CSUR), vol. 57, no. 4, pp. 101:1–101:40, 2025

  22. [22]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Tool- former: Language models can teach themselves to use tools,” in Advances in Neural Information Processing Systems (NeurIPS), 2023, arXiv:2302.04761

  23. [23]

    Large language model based multi-agents: A survey of progress and challenges,

    T. Guo, X. Chen, Y . Wang, R. Chang, S. Pei, N. V . Chawla, O. Wiest, and X. Zhang, “Large language model based multi-agents: A survey of progress and challenges,” inProceedings of the Thirty-Third In- ternational Joint Conference on Artificial Intelligence (IJCAI), 2024, pp. 8048–8057, survey track

  24. [24]

    MemGPT: Towards LLMs as Operating Systems

    C. Packer, V . Fang, S. G. Patil, K. Lin, S. Wooders, and J. E. Gonzalez, “MemGPT: Towards LLMs as operating systems,”arXiv preprint arXiv:2310.08560, 2023

  25. [25]

    Memory matters: The need to improve long-term memory in llm-agents,

    K. Hatalis, D. Christou, J. Myers, S. Jones, K. Lambert, A. Amos- Binks, Z. Dannenhauer, and D. Dannenhauer, “Memory matters: The need to improve long-term memory in llm-agents,” inProceedings of the AAAI Symposium Series (AAAI), vol. 2, no. 1, 2023, pp. 277–280

  26. [26]

    Sok: Semantic privacy in large language models,

    B. Ma, Y . Jiang, X. Wang, G. Yu, Q. Wang, C. Sun, C. Li, X. Qi, Y . He, W. Niet al., “Sok: Semantic privacy in large language models,” arXiv preprint arXiv:2506.23603, 2025

  27. [27]

    A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT

    J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. El- nashar, J. Spencer-Smith, and D. C. Schmidt, “A prompt pattern catalog to enhance prompt engineering with chatgpt,”arXiv preprint arXiv:2302.11382, 2023

  28. [28]

    SHOP2: An HTN planning system,

    D. S. Nau, T.-C. Au, O. Ilghami, U. Kuter, J. W. Murdock, D. Wu, and F. Yaman, “SHOP2: An HTN planning system,”Journal of Artificial Intelligence Research, vol. 20, pp. 379–404, 2003

  29. [29]

    BDI agents: From theory to practice,

    A. S. Rao and M. P. Georgeff, “BDI agents: From theory to practice,” inProceedings of the First International Conference on Multi-Agent Systems (ICMAS), 1995, pp. 312–319

  30. [30]

    STRIPS: A new approach to the appli- cation of theorem proving to problem solving,

    R. E. Fikes and N. J. Nilsson, “STRIPS: A new approach to the appli- cation of theorem proving to problem solving,”Artificial Intelligence, vol. 2, no. 3–4, pp. 189–208, 1971

  31. [31]

    Perception in chess,

    W. G. Chase and H. A. Simon, “Perception in chess,”Cognitive Psychology, vol. 4, no. 1, pp. 55–81, 1973

  32. [32]

    SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

    X. Li, W. Chen, Y . Liu, S. Zheng, X. Chen, Y . He, Y . Li, B. You, H. Shen, J. Sunet al., “SkillsBench: Benchmarking how well agent skills work across diverse tasks,”arXiv preprint arXiv:2602.12670, 2026

  33. [33]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    G. Wang, Y . Xie, Y . Jiang, A. Mandlekar, C. Xiao, Y . Zhu, L. Fan, and A. Anandkumar, “V oyager: An open-ended embodied agent with large language models,”Transactions on Machine Learning Research (TMLR), 2024, arXiv:2305.16291

  34. [34]

    ReAct: Synergizing Reasoning and Acting in Language Models

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao, “ReAct: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations (ICLR), 2023, arXiv:2210.03629

  35. [35]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” in Advances in Neural Information Processing Systems (NeurIPS), 2023, arXiv:2303.11366

  36. [36]

    Jarvis-1: Open-world multi-task agents with memory-augmented multimodal language models

    Z. Wang, S. Cai, A. Liu, Y . Jin, J. Hou, B. Zhang, H. Lin, Z. He, Z. Zheng, Y . Yang, X. Ma, and Y . Liang, “JARVIS-1: Open- world multi-task agents with memory-augmented multimodal lan- guage models,”IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 47, no. 3, pp. 1894–1907, 2025, extended from arXiv:2311.05997

  37. [37]

    Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, and Yu Qiao

    Z. Wang, S. Cai, A. Liu, X. Ma, and Y . Liang, “Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents,” inAdvances in Neural Information Processing Systems (NeurIPS), 2023, arXiv:2302.01560

  38. [38]

    Skill-it! a data-driven skills framework for understanding and training language models,

    M. F. Chen, N. Roberts, K. Bhatia, J. Wang, C. Zhang, F. Sala, and C. Ré, “Skill-it! a data-driven skills framework for understanding and training language models,” inAdvances in Neural Information Processing Systems (NeurIPS), 2023, arXiv:2307.14430

  39. [39]

    Kangning Zhang, Yingjie Qin, Jiarui Jin, Yifan Liu, Ruilong Su, Weinan Zhang, and Yong Yu

    C. Zhang, Z. Yang, J. Liu, Y . Han, X. Chen, Z. Huang, B. Fu, and G. Yu, “AppAgent: Multimodal agents as smartphone users,” inPro- ceedings of the CHI Conference on Human Factors in Computing Sys- tems (CHI), 2025, pp. 70:1–70:20, extended from arXiv:2312.13771

  40. [40]

    Cradle: Empowering foundation agents towards general computer control,

    W. Tan, W. Zhang, X. Xu, H. Xia, Z. Ding, B. Li, B. Zhouet al., “Cradle: Empowering foundation agents towards general computer control,” inInternational Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 267, 2025, pp. 58 658–58 725, extended from arXiv:2403.03186

  41. [41]

    Agenttuning: Enabling generalized agent abilities for llms,

    A. Zeng, M. Liu, R. Lu, B. Wang, X. Liu, Y . Dong, and J. Tang, “AgentTuning: Enabling generalized agent abilities for LLMs,” in Findings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 3053–3077, arXiv:2310.12823

  42. [42]

    Executable code actions elicit better LLM agents

    X. Wang, Y . Chen, L. Yuan, Y . Zhang, Y . Li, H. Peng, and H. Ji, “Executable code actions elicit better LLM agents,” inInternational Conference on Machine Learning (ICML), 2024, arXiv:2402.01030

  43. [43]

    arXiv preprint arXiv:2310.05915 , year=

    B. Chen, C. Shu, E. Shareghi, N. Collier, K. Narasimhan, and S. Yao, “FireAct: Toward language agent fine-tuning,”arXiv preprint arXiv:2310.05915, 2023

  44. [44]

    B. Qiao, L. Li, X. Zhang, S. He, Y. Kang, C. Zhang, F. Yang, H. Dong, J. Zhang, L. Wang, M. Ma, P. Zhao, S. Qin, X. Qin, C. Du, Y. Xu, Q. Lin, S. Rajmohan, and D. Zhang, "TaskWeaver: A code-first agent framework," arXiv preprint arXiv:2311.17541, 2023

  45. [45]

    M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gober, K. Hausman et al., "Do as I can, not as I say: Grounding language in robotic affordances," in Conference on Robot Learning (CoRL), 2022, arXiv:2204.01691

  46. [46]

    J. S. Park, J. C. O'Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein, "Generative agents: Interactive simulacra of human behavior," in ACM Symposium on User Interface Software and Technology (UIST), 2023, arXiv:2304.03442

  47. [47]

    Y. J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, and A. Anandkumar, "Eureka: Human-level reward design via coding large language models," in International Conference on Learning Representations (ICLR), 2024, arXiv:2310.12931

  48. [48]

    K. Nottingham, P. Ammanabrolu, A. Suhr, Y. Choi, H. Hajishirzi, S. Singh, and R. Fox, "Do embodied agents dream of pixelated sheep: Embodied decision making using language guided world modelling," in International Conference on Machine Learning (ICML), 2023, arXiv:2301.12050

  49. [49]

    W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y. Chebotar et al., "Inner monologue: Embodied reasoning through planning with language models," in Conference on Robot Learning (CoRL), 2022, arXiv:2207.05608

  50. [50]

    E. Gamma, R. Helm, R. Johnson, and J. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley Professional, 1994

  51. [51]

    Microsoft, "Semantic Kernel: A lightweight SDK for AI agent development," https://github.com/microsoft/semantic-kernel, 2023, accessed: 2026-02-21

  52. [52]

    J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng, "Code as policies: Language model programs for embodied control," in IEEE International Conference on Robotics and Automation (ICRA), 2023, arXiv:2209.07753

  53. [53]

    I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, and A. Garg, "ProgPrompt: Generating situated robot task plans using large language models," in IEEE International Conference on Robotics and Automation (ICRA), 2023, arXiv:2209.11302

  54. [54]

    A. Zhou, K. Yan, M. Shlapentokh-Rothman, H. Wang, and Y.-X. Wang, "Language agent tree search unifies reasoning, acting, and planning in language models," in International Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 235, 2024, pp. 62138–62160, arXiv:2310.04406

  55. [55]

    Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi, "Self-Instruct: Aligning language models with self-generated instructions," in Annual Meeting of the Association for Computational Linguistics (ACL), 2023, pp. 13484–13508

  56. [56]

    C. Qian, C. Han, Y. R. Fung, Y. Qin, Z. Liu, and H. Ji, "CREATOR: Tool creation for disentangling abstract and concrete reasoning of large language models," in Findings of the Association for Computational Linguistics (EMNLP), 2023, arXiv:2305.14318

  57. [57]

    Anthropic, "Introducing the model context protocol," https://www.anthropic.com/news/model-context-protocol, 2024, accessed: 2026-02-21

  58. [58]

    Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian et al., "ToolLLM: Facilitating large language models to master 16000+ real-world APIs," in International Conference on Learning Representations (ICLR), 2024, arXiv:2307.16789

  59. [59]

    OpenClaw Project, "OpenClaw: Personal AI assistant," https://github.com/openclaw/openclaw, 2026, official repository (216k stars at access time). Accessed: 2026-02-22

  60. [60]

    Alex and Oren Yomtov, "ClawHavoc: 341 malicious clawed skills found by the bot they were targeting," https://www.koi.ai/blog/clawhavoc-341-malicious-clawedbot-skills-found-by-the-bot-they-were-targeting, 2026, Koi Research blog post; update dated Feb 16, 2026 reports 824 malicious skills. Accessed: 2026-02-22

  61. [61]

    X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su, "Mind2Web: Towards a generalist agent for the web," in Advances in Neural Information Processing Systems (NeurIPS), 2023, spotlight. arXiv:2306.06070

  62. [62]

    T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, Y. Liu, Y. Xu, S. Zhou, S. Savarese, C. Xiong, V. Zhong, and T. Yu, "OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments," in Advances in Neural Information Processing Systems (NeurIPS), 2024, Datasets and Benchmarks track. arX...

  63. [63]

    C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan, "SWE-bench: Can language models resolve real-world GitHub issues?" in International Conference on Learning Representations (ICLR), 2024, arXiv:2310.06770

  64. [64]

    H. Ravichandar, A. S. Polydoros, S. Chernova, and A. Billard, "Recent advances in robot learning from demonstration," Annual Review of Control, Robotics, and Autonomous Systems, vol. 3, pp. 297–330, 2020

  65. [65]

    Significant Gravitas, "AutoGPT: An autonomous GPT-4 experiment," https://github.com/Significant-Gravitas/AutoGPT, 2023, accessed: 2026-02-21

  66. [66]

    L. Wang, X. Zhang, H. Su, and J. Zhu, "A comprehensive survey of continual learning: Theory, method and application," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 46, no. 8, pp. 5362–5383, 2024

  67. [67]

    C. Zhang, K. Yang, S. Hu, Z. Wang, G. Li, Y. Sun, C. Zhang, Z. Zhang, A. Liu, S.-C. Zhu, X. Chang, J. Zhang, F. Yin, Y. Liang, and Y. Yang, "ProAgent: Building proactive cooperative agents with large language models," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 38, no. 16, 2024, pp. 17591–17599, arXiv:2308.11339

  68. [68]

    P. Ladisa, H. Plate, M. Martinez, and O. Barais, "SoK: Taxonomy of attacks on open-source software supply chains," in IEEE Symposium on Security and Privacy (SP), 2023, pp. 1509–1526

  69. [69]

    K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, "Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection," in ACM Workshop on Artificial Intelligence and Security (AISec), 2023, arXiv:2302.12173

  70. [70]

    B. Quintero, "From automation to infection: How OpenClaw AI agent skills are being weaponized," https://blog.virustotal.com/2026/02/from-automation-to-infection-how.html, 2026, VirusTotal Blog, February 2, 2026. Accessed: 2026-02-22

  71. [71]

    B. Van, "Agent skills guard," https://github.com/brucevanfdm/agent-skills-guard, 2026, desktop scanner/manager; README reports 8 risk categories and 22 hard-trigger rules. Accessed: 2026-02-22

  72. [72]

    G. Singh, "SkillGuard: AI agent security scanner," https://skillgaurd.up.railway.app/, 2026, website and linked source repo describe AST analysis for JS/TS, 9-language coverage, and 20+ attack patterns. Accessed: 2026-02-22

  73. [73]

    Y. Liu, G. Deng, Z. Xu, Y. Li, Y. Zheng, Y. Zhang, L. Zhao, T. Zhang, K. Wang, and Y. Liu, "Jailbreaking ChatGPT via prompt engineering: An empirical study," arXiv preprint arXiv:2305.13860, 2023

  74. [74]

    G. Mialon, C. Fourrier, T. Wolf, Y. LeCun, and T. Scialom, "GAIA: A benchmark for general AI assistants," in International Conference on Learning Representations (ICLR), 2024, poster. arXiv:2311.12983

  75. [75]

    X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, S. Zhang, X. Deng, A. Zeng, Z. Du, C. Zhang, S. Shen, T. Zhang, Y. Su, H. Sun, M. Huang, Y. Dong, and J. Tang, "AgentBench: Evaluating LLMs as agents," in International Conference on Learning Representations (ICLR), 2024, arXiv:2308.03688

  76. [76]

    C. Rawles, S. Clinckemaillie, Y. Chang, J. Waltz, G. Lau, M. Fair, A. Li, W. E. Bishop, W. Li, F. Campbell-Ajala, D. K. Toyama, R. J. Berry, D. Tyamagundlu, T. P. Lillicrap, and O. Riva, "AndroidWorld: A dynamic benchmarking environment for autonomous agents," in International Conference on Learning Representations (ICLR), 2025, arXiv:2405.14573