Recognition: 3 theorem links
· Lean TheoremSoK: Agentic Skills -- Beyond Tool Use in LLM Agents
Pith reviewed 2026-05-14 23:14 UTC · model grok-4.3
The pith
Agentic skills function as reusable procedural modules that let LLM agents handle long-horizon tasks reliably across domains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Agentic skills are distinct from atomic tool calls because they carry explicit conditions, policies, and reusable interfaces that let them operate reliably across tasks; systematizing their design patterns and representations reveals both performance gains from curated skills and concrete supply-chain and injection risks that must be addressed for safe deployment.
What carries the argument
The skill layer, consisting of reusable modules that combine procedural knowledge with applicability conditions, execution policies, termination criteria, and standardized interfaces.
Load-bearing premise
The two proposed taxonomies capture the essential structure of agentic skills and that risks observed in one marketplace case will appear in other agent platforms.
What would settle it
A subsequent marketplace audit or controlled deployment that finds zero successful skill-based exfiltrations despite widespread adoption, or that shows a simpler taxonomy organizes existing skills more cleanly than the seven-pattern plus representation-by-scope scheme.
read the original abstract
Agentic systems increasingly rely on reusable procedural capabilities, \textit{a.k.a., agentic skills}, to execute long-horizon workflows reliably. These capabilities are callable modules that package procedural knowledge with explicit applicability conditions, execution policies, termination criteria, and reusable interfaces. Unlike one-off plans or atomic tool calls, skills operate (and often do well) across tasks. This paper maps the skill layer across the full lifecycle (discovery, practice, distillation, storage, composition, evaluation, and update) and introduces two complementary taxonomies. The first is a system-level set of \textbf{seven design patterns} capturing how skills are packaged and executed in practice, from metadata-driven progressive disclosure and executable code skills to self-evolving libraries and marketplace distribution. The second is an orthogonal \textbf{representation $\times$ scope} taxonomy describing what skills \emph{are} (natural language, code, policy, hybrid) and what environments they operate over (web, OS, software engineering, robotics). We analyze the security and governance implications of skill-based agents, covering supply-chain risks, prompt injection via skill payloads, and trust-tiered execution, grounded by a case study of the ClawHavoc campaign in which nearly 1{,}200 malicious skills infiltrated a major agent marketplace, exfiltrating API keys, cryptocurrency wallets, and browser credentials at scale. We further survey deterministic evaluation approaches, anchored by recent benchmark evidence that curated skills can substantially improve agent success rates while self-generated skills may degrade them. We conclude with open challenges toward robust, verifiable, and certifiable skills for real-world autonomous agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper is a Systematization of Knowledge (SoK) on agentic skills in LLM agents, defining them as reusable procedural modules with explicit applicability conditions, execution policies, termination criteria, and interfaces that operate reliably across tasks. It maps the full skill lifecycle (discovery through update), proposes two complementary taxonomies (seven design patterns for packaging/execution and an orthogonal representation-by-scope taxonomy covering natural language/code/policy/hybrid forms across web/OS/SE/robotics environments), analyzes security/governance risks (supply-chain attacks and prompt injection) grounded in the ClawHavoc campaign of nearly 1,200 malicious skills, surveys deterministic evaluation methods with benchmark evidence on curated vs. self-generated skills, and outlines open challenges for verifiable skills.
Significance. If the proposed taxonomies hold as stable organizing frameworks and the security analysis generalizes beyond the single case study, this SoK would provide a timely, practical reference for designing reliable agentic systems while highlighting concrete risks, potentially influencing both research taxonomies and marketplace governance standards in the LLM agent space.
major comments (1)
- [Security and Governance Implications] Security and Governance section: The security implications (supply-chain infiltration and prompt injection) are anchored exclusively to the ClawHavoc marketplace campaign; the manuscript does not provide evidence or discussion of whether comparable risks manifest in non-marketplace or closed agent platforms, which limits the load-bearing claim that these risks are inherent to skill-based agents broadly.
minor comments (4)
- [Abstract] Abstract: The notation '1{,}200' is a LaTeX artifact; standardize to '1,200' or 'nearly twelve hundred' for readability in the final version.
- [Taxonomies] Taxonomy presentation: The seven design patterns and representation-by-scope taxonomy are described in text; a single summary table or diagram mapping literature examples to each category would substantially improve clarity and allow readers to assess coverage.
- [Evaluation] Evaluation section: Benchmark evidence is cited for curated skills improving success rates while self-generated skills may degrade them, but no specific quantitative results (e.g., success-rate deltas or benchmark names) are tabulated; adding a small results table would make the survey more concrete.
- [Skill Lifecycle] Lifecycle mapping: The seven-stage lifecycle (discovery, practice, distillation, etc.) is introduced without an accompanying figure showing dependencies or feedback loops between stages; a diagram would aid comprehension.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the security analysis. We agree that the discussion would benefit from explicit treatment of generalizability beyond the marketplace case study and will revise accordingly.
read point-by-point responses
-
Referee: Security and Governance section: The security implications (supply-chain infiltration and prompt injection) are anchored exclusively to the ClawHavoc marketplace campaign; the manuscript does not provide evidence or discussion of whether comparable risks manifest in non-marketplace or closed agent platforms, which limits the load-bearing claim that these risks are inherent to skill-based agents broadly.
Authors: We thank the referee for this observation. The ClawHavoc campaign is used as the sole concrete, large-scale public example because it is the only documented incident with detailed data on nearly 1,200 malicious skills. The manuscript does not present evidence from closed or non-marketplace platforms, as such data is not publicly available. At the same time, the risks of supply-chain infiltration and prompt injection arise from the core properties of skills as modular, distributable units with explicit interfaces and payloads, which are captured in both the seven design patterns and the representation-by-scope taxonomy. These properties exist independently of the distribution channel. In the revised manuscript we will expand the Security and Governance section to (1) state that the mechanisms generalize to any skill-sharing setting (internal libraries, direct imports, self-evolving libraries), (2) illustrate the point with references to the design patterns that apply outside marketplaces, and (3) note the lack of public empirical data from closed systems as a limitation. This change will make the broader claim explicit while remaining faithful to the available evidence. revision: yes
Circularity Check
No significant circularity
full rationale
This SoK paper surveys the agentic skills literature, proposes two complementary taxonomies (seven design patterns and representation-by-scope) as organizing frameworks, and grounds its security analysis in the external ClawHavoc campaign. No load-bearing claims reduce to self-defined quantities, fitted parameters, or self-citation chains by construction. The taxonomies are explicitly presented as non-exhaustive complementary views rather than derived results, and all empirical references (benchmarks, case study) are anchored to external sources without internal reduction.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 23 Pith papers
-
Five Attacks on x402 Agentic Payment Protocol
Five practical attacks on the x402 agentic payment protocol are demonstrated across authorization, binding, replay protection, and web handling, validated on local chains, Base Sepolia, live endpoints, and three open-...
-
OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents
OLIVIA treats LLM agent action selection as a contextual linear bandit over frozen hidden states and applies UCB exploration to adapt online, yielding consistent gains over static ReAct and prompt-based baselines on f...
-
Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries
SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round rep...
-
Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck
CMIB uses a conditional multimodal information bottleneck to create reusable agent skills that separate verbalizable text content from predictive perceptual residuals, improving execution stability.
-
SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents
SkillRet benchmark shows fine-tuned retrievers improve NDCG@10 by 13+ points over prior models on large-scale skill retrieval for LLM agents.
-
Sealing the Audit-Runtime Gap for LLM Skills
SIGIL cryptographically seals the audit-runtime gap for LLM skills via an on-chain registry with four publication types, DAO vetting, and a runtime verification loader that enforces integrity and permissions.
-
Uncertainty Propagation in LLM-Based Systems
This paper introduces a systems-level conceptual framing and a three-level taxonomy (intra-model, system-level, socio-technical) for uncertainty propagation in compound LLM applications, along with engineering insight...
-
Knows: Agent-Native Structured Research Representations
Knows uses a YAML sidecar specification to provide structured, agent-consumable representations of research papers, yielding large accuracy gains for small LLMs on comprehension tasks and rapid community adoption via ...
-
SoK: Blockchain Agent-to-Agent Payments
The first systematization of blockchain-based agent-to-agent payments organizes designs into discovery, authorization, execution, and accounting stages while identifying trust and security gaps.
-
SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.
-
SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation
SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.
-
Group of Skills: Group-Structured Skill Retrieval for Agent Skill Libraries
GoSkills converts flat skill lists into role-labeled execution contexts via anchor-centered groups and graph expansion, preserving coverage and improving rewards on SkillsBench and ALFWorld under small skill budgets.
-
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
Skill1 trains one policy to jointly evolve skill query generation, re-ranking, task solving, and distillation from a single task-success signal, with low-frequency trends crediting selection and high-frequency variati...
-
SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology
SkillGraph jointly evolves agent skills and collaboration topologies in multi-agent vision-language systems using a multimodal graph transformer and a skill designer, yielding consistent performance gains on benchmarks.
-
Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents
The Experience Compression Spectrum unifies memory, skills, and rules in LLM agents along increasing compression levels and identifies the absence of adaptive cross-level compression as the missing diagonal.
-
Safety in Embodied AI: A Survey of Risks, Attacks, and Defenses
The survey organizes over 400 papers on embodied AI safety into a multi-level taxonomy and flags overlooked issues such as fragile multimodal fusion and unstable planning under jailbreaks.
-
Ace-Skill: Bootstrapping Multimodal Agents with Prioritized and Clustered Evolution
Ace-Skill boosts multimodal agent self-evolution via prioritized rollouts with lazy-decay tracking and semantic knowledge clustering, yielding up to 35% relative gains on tool-use benchmarks and zero-shot transfer to ...
-
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
Skill1 co-evolves skill selection, utilization, and distillation inside a single policy using only task-outcome reward, with low-frequency trends crediting selection and high-frequency variation crediting distillation...
-
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency var...
-
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
-
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.
-
SciFi: A Safe, Lightweight, User-Friendly, and Fully Autonomous Agentic AI Workflow for Scientific Applications
SciFi is a safe, lightweight agentic AI framework that automates structured scientific tasks with minimal human intervention via isolated environments and layered self-assessing agents.
-
ChromaFlow: A Negative Ablation Study of Orchestration Overhead in Tool-Augmented Agent Evaluation
Expanded orchestration in ChromaFlow lowered accuracy on GAIA tasks from 29/53 to 27/53 while increasing timeouts, tool failures, and costs.
Reference graph
Works this paper leans on
-
[1]
WebArena: A Realistic Web Environment for Building Autonomous Agents
S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y . Bisk, D. Fried, U. Alon, and G. Neubig, “WebArena: A realistic web environment for building autonomous agents,” in International Conference on Learning Representations (ICLR), 2024, arXiv:2307.13854
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[2]
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press, “SWE-agent: Agent-computer interfaces enable au- tomated software engineering,” inAdvances in Neural Information Processing Systems (NeurIPS), 2024, arXiv:2405.15793
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[3]
Measuring and augmenting large language models for solving capture-the-flag challenges,
Z. Ji, D. Wu, W. Jiang, P. Ma, Z. Li, and S. Wang, “Measuring and augmenting large language models for solving capture-the-flag challenges,” inProceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), 2025, pp. 603–617
work page 2025
-
[4]
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
Y . Shen, K. Song, X. Tan, D. Li, W. Lu, and Y . Zhuang, “Hug- gingGPT: Solving AI tasks with ChatGPT and its friends in Hug- ging Face,” inAdvances in Neural Information Processing Systems (NeurIPS), 2023, arXiv:2303.17580
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
Can large language model agents simulate human trust behavior?
C. Xie, C. Chen, F. Jia, Z. Ye, S. Lai, K. Shu, J. Gu, A. Bibi, Z. Hu, D. Jurgenset al., “Can large language model agents simulate human trust behavior?”Advances in Neural Information Processing Systems (NeurIPS), vol. 37, pp. 15 674–15 729, 2024
work page 2024
-
[6]
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework
S. Hong, M. Zhuge, J. Chen, X. Zheng, Y . Cheng, J. Wang, C. Zhang, Z. Wang, S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, and J. Schmidhuber, “MetaGPT: Meta programming for a multi-agent collaborative framework,” inInternational Conference on Learning Representations (ICLR), 2024, arXiv:2308.00352
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[7]
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
Q. Wu, G. Bansal, J. Zhang, Y . Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu, A. H. Awadallah, R. W. White, D. Burger, and C. Wang, “AutoGen: Enabling next-gen LLM applications via multi- agent conversation,” inConference on Language Modeling (COLM), 2024, arXiv:2308.08155
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[8]
An integrated theory of the mind
J. R. Anderson, D. Bothell, M. D. Byrne, S. Douglass, C. Lebiere, and Y . Qin, “An integrated theory of the mind.”Psychological Review, vol. 111, no. 4, pp. 1036–1060, 2004
work page 2004
-
[9]
J. E. Laird,The Soar Cognitive Architecture. MIT Press, 2012
work page 2012
-
[10]
Between MDPs and semi- MDPs: A framework for temporal abstraction in reinforcement learn- ing,
R. S. Sutton, D. Precup, and S. Singh, “Between MDPs and semi- MDPs: A framework for temporal abstraction in reinforcement learn- ing,”Artificial Intelligence, vol. 112, no. 1–2, pp. 181–211, 1999
work page 1999
-
[11]
The landscape of agentic reinforcement learning for llms: A survey,
G. Zhang, H. Geng, X. Yu, Z. Yin, Z. Zhang, Z. Tan, H. Zhou, Z.-Z. Li, X. Xue, Y . Liet al., “The landscape of agentic reinforcement learning for llms: A survey,”Transactions on Machine Learning Research (TMLR)
-
[12]
ToolRL: Reward is all tool learning needs,
C. Qian, E. C. Acikgoz, Q. He, H. W ANG, X. Chen, D. Hakkani- Tür, G. Tur, and H. Ji, “ToolRL: Reward is all tool learning needs,” inAnnual Conference on Neural Information Processing Systems (NeurIPS)
-
[13]
H.-a. Gao, J. Geng, W. Hua, M. Hu, X. Juan, H. Liu, S. Liu, J. Qiu, X. Qi, Q. Renet al., “A survey of self-evolving agents: What, when, how, and where to evolve on the path to artificial super intelligence,” Transactions on Machine Learning Research (TMLR)
-
[14]
Safety at scale: A comprehensive survey of large model and agent safety,
X. Ma, Y . Gao, Y . Wang, R. Wang, X. Wang, Y . Sun, Y . Ding, H. Xu, Y . Chen, Y . Zhaoet al., “Safety at scale: A comprehensive survey of large model and agent safety,”Foundations and Trends in Privacy and Security, vol. 8, no. 3-4, pp. 1–240, 2026
work page 2026
-
[15]
A Survey on Large Language Model based Autonomous Agents
L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y . Lin, W. X. Zhao, Z. Wei, and J.-R. Wen, “A survey on large language model based autonomous agents,”Frontiers of Computer Science, vol. 18, no. 6, p. 186345, 2024, extended from arXiv:2308.11432
work page internal anchor Pith review arXiv 2024
-
[16]
A survey on agentic security: Applications, threats and defenses,
A. Shahriar, M. N. Rahman, S. Ahmed, F. Sadeque, and M. R. Parvez, “A survey on agentic security: Applications, threats and defenses,” arXiv preprint arXiv:2510.06445, 2025
-
[17]
Understanding the planning of LLM agents: A survey
X. Huang, W. Liu, X. Chen, X. Wang, H. Wang, D. Lian, Y . Wang, R. Tang, and E. Chen, “Understanding the planning of LLM agents: A survey,”arXiv preprint arXiv:2402.02716, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[18]
Large Language Model based Multi-Agents: A Survey of Progress and Challenges
T. Guo, X. Chen, Y . Wang, R. Chang, S. Pei, N. V . Chawla, O. Wiest, and X. Zhang, “Large language model based multi-agents: A survey of progress and challenges,”arXiv preprint arXiv:2402.01680, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
Survey on Evaluation of LLM-based Agents
A. Yehudai, L. Eden, A. Li, G. Uziel, Y . Zhao, R. Bar-Haim, A. Co- han, and M. Shmueli-Scheuer, “Survey on evaluation of LLM-based agents,”arXiv preprint arXiv:2503.16416, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[20]
Information fidelity in tool-using llm agents: A martingale analysis of the model context protocol,
F. X. Fan, C. Tan, R. Wattenhofer, and Y .-S. Ong, “Information fidelity in tool-using llm agents: A martingale analysis of the model context protocol,”arXiv preprint arXiv:2602.13320, 2026
-
[21]
Tool learning with foundation models,
Y . Qin, S. Hu, Y . Lin, W. Chen, N. Ding, G. Cui, Z. Zeng, X. Zhou, Y . Huang, C. Xiao, C. Han, Y . R. Fung, Y . Su, H. Wang, C. Qian, R. Tian, K. Zhu, S. Liang, X. Shen, B. Xu, Z. Zhang, Y . Ye, B. Li et al., “Tool learning with foundation models,”ACM Computing Surveys (CSUR), vol. 57, no. 4, pp. 101:1–101:40, 2025
work page 2025
-
[22]
Toolformer: Language Models Can Teach Themselves to Use Tools
T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Tool- former: Language models can teach themselves to use tools,” in Advances in Neural Information Processing Systems (NeurIPS), 2023, arXiv:2302.04761
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[23]
Large language model based multi-agents: A survey of progress and challenges,
T. Guo, X. Chen, Y . Wang, R. Chang, S. Pei, N. V . Chawla, O. Wiest, and X. Zhang, “Large language model based multi-agents: A survey of progress and challenges,” inProceedings of the Thirty-Third In- ternational Joint Conference on Artificial Intelligence (IJCAI), 2024, pp. 8048–8057, survey track
work page 2024
-
[24]
MemGPT: Towards LLMs as Operating Systems
C. Packer, V . Fang, S. G. Patil, K. Lin, S. Wooders, and J. E. Gonzalez, “MemGPT: Towards LLMs as operating systems,”arXiv preprint arXiv:2310.08560, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[25]
Memory matters: The need to improve long-term memory in llm-agents,
K. Hatalis, D. Christou, J. Myers, S. Jones, K. Lambert, A. Amos- Binks, Z. Dannenhauer, and D. Dannenhauer, “Memory matters: The need to improve long-term memory in llm-agents,” inProceedings of the AAAI Symposium Series (AAAI), vol. 2, no. 1, 2023, pp. 277–280
work page 2023
-
[26]
Sok: Semantic privacy in large language models,
B. Ma, Y . Jiang, X. Wang, G. Yu, Q. Wang, C. Sun, C. Li, X. Qi, Y . He, W. Niet al., “Sok: Semantic privacy in large language models,” arXiv preprint arXiv:2506.23603, 2025
-
[27]
A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT
J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. El- nashar, J. Spencer-Smith, and D. C. Schmidt, “A prompt pattern catalog to enhance prompt engineering with chatgpt,”arXiv preprint arXiv:2302.11382, 2023
work page internal anchor Pith review arXiv 2023
-
[28]
SHOP2: An HTN planning system,
D. S. Nau, T.-C. Au, O. Ilghami, U. Kuter, J. W. Murdock, D. Wu, and F. Yaman, “SHOP2: An HTN planning system,”Journal of Artificial Intelligence Research, vol. 20, pp. 379–404, 2003
work page 2003
-
[29]
BDI agents: From theory to practice,
A. S. Rao and M. P. Georgeff, “BDI agents: From theory to practice,” inProceedings of the First International Conference on Multi-Agent Systems (ICMAS), 1995, pp. 312–319
work page 1995
-
[30]
STRIPS: A new approach to the appli- cation of theorem proving to problem solving,
R. E. Fikes and N. J. Nilsson, “STRIPS: A new approach to the appli- cation of theorem proving to problem solving,”Artificial Intelligence, vol. 2, no. 3–4, pp. 189–208, 1971
work page 1971
-
[31]
W. G. Chase and H. A. Simon, “Perception in chess,”Cognitive Psychology, vol. 4, no. 1, pp. 55–81, 1973
work page 1973
-
[32]
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
X. Li, W. Chen, Y . Liu, S. Zheng, X. Chen, Y . He, Y . Li, B. You, H. Shen, J. Sunet al., “SkillsBench: Benchmarking how well agent skills work across diverse tasks,”arXiv preprint arXiv:2602.12670, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[33]
Voyager: An Open-Ended Embodied Agent with Large Language Models
G. Wang, Y . Xie, Y . Jiang, A. Mandlekar, C. Xiao, Y . Zhu, L. Fan, and A. Anandkumar, “V oyager: An open-ended embodied agent with large language models,”Transactions on Machine Learning Research (TMLR), 2024, arXiv:2305.16291
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[34]
ReAct: Synergizing Reasoning and Acting in Language Models
S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao, “ReAct: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations (ICLR), 2023, arXiv:2210.03629
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[35]
Reflexion: Language Agents with Verbal Reinforcement Learning
N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” in Advances in Neural Information Processing Systems (NeurIPS), 2023, arXiv:2303.11366
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[36]
Jarvis-1: Open-world multi-task agents with memory-augmented multimodal language models
Z. Wang, S. Cai, A. Liu, Y . Jin, J. Hou, B. Zhang, H. Lin, Z. He, Z. Zheng, Y . Yang, X. Ma, and Y . Liang, “JARVIS-1: Open- world multi-task agents with memory-augmented multimodal lan- guage models,”IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 47, no. 3, pp. 1894–1907, 2025, extended from arXiv:2311.05997
-
[37]
Z. Wang, S. Cai, A. Liu, X. Ma, and Y . Liang, “Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents,” inAdvances in Neural Information Processing Systems (NeurIPS), 2023, arXiv:2302.01560
-
[38]
Skill-it! a data-driven skills framework for understanding and training language models,
M. F. Chen, N. Roberts, K. Bhatia, J. Wang, C. Zhang, F. Sala, and C. Ré, “Skill-it! a data-driven skills framework for understanding and training language models,” inAdvances in Neural Information Processing Systems (NeurIPS), 2023, arXiv:2307.14430
-
[39]
Kangning Zhang, Yingjie Qin, Jiarui Jin, Yifan Liu, Ruilong Su, Weinan Zhang, and Yong Yu
C. Zhang, Z. Yang, J. Liu, Y . Han, X. Chen, Z. Huang, B. Fu, and G. Yu, “AppAgent: Multimodal agents as smartphone users,” inPro- ceedings of the CHI Conference on Human Factors in Computing Sys- tems (CHI), 2025, pp. 70:1–70:20, extended from arXiv:2312.13771
-
[40]
Cradle: Empowering foundation agents towards general computer control,
W. Tan, W. Zhang, X. Xu, H. Xia, Z. Ding, B. Li, B. Zhouet al., “Cradle: Empowering foundation agents towards general computer control,” inInternational Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 267, 2025, pp. 58 658–58 725, extended from arXiv:2403.03186
-
[41]
Agenttuning: Enabling generalized agent abilities for llms,
A. Zeng, M. Liu, R. Lu, B. Wang, X. Liu, Y . Dong, and J. Tang, “AgentTuning: Enabling generalized agent abilities for LLMs,” in Findings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 3053–3077, arXiv:2310.12823
-
[42]
Executable code actions elicit better LLM agents
X. Wang, Y . Chen, L. Yuan, Y . Zhang, Y . Li, H. Peng, and H. Ji, “Executable code actions elicit better LLM agents,” inInternational Conference on Machine Learning (ICML), 2024, arXiv:2402.01030
-
[43]
arXiv preprint arXiv:2310.05915 , year=
B. Chen, C. Shu, E. Shareghi, N. Collier, K. Narasimhan, and S. Yao, “FireAct: Toward language agent fine-tuning,”arXiv preprint arXiv:2310.05915, 2023
-
[44]
TaskWeaver: A code-first agent framework,
B. Qiao, L. Li, X. Zhang, S. He, Y . Kang, C. Zhang, F. Yang, H. Dong, J. Zhang, L. Wang, M. Ma, P. Zhao, S. Qin, X. Qin, C. Du, Y . Xu, Q. Lin, S. Rajmohan, and D. Zhang, “TaskWeaver: A code-first agent framework,”arXiv preprint arXiv:2311.17541, 2023
-
[45]
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
M. Ahn, A. Brohan, N. Brown, Y . Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gober, K. Hausmanet al., “Do as i can, not as i say: Grounding language in robotic affordances,” inConference on Robot Learning (CoRL), 2022, arXiv:2204.01691
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[46]
Generative Agents: Interactive Simulacra of Human Behavior
J. S. Park, J. C. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein, “Generative agents: Interactive simulacra of human behav- ior,” inACM Symposium on User Interface Software and Technology (UIST), 2023, arXiv:2304.03442
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[47]
Eureka: Human-Level Reward Design via Coding Large Language Models
Y . J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y . Zhu, L. Fan, and A. Anandkumar, “Eureka: Human-level reward design via coding large language models,” inInternational Confer- ence on Learning Representations (ICLR), 2024, arXiv:2310.12931
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[48]
K. Nottingham, P. Ammanabrolu, A. Suhr, Y . Choi, H. Hajishirzi, S. Singh, and R. Fox, “Do embodied agents dream of pixelated sheep: Embodied decision making using language guided world modelling,” inInternational Conference on Machine Learning (ICML), 2023, arXiv:2301.12050
-
[49]
Inner Monologue: Embodied Reasoning through Planning with Language Models
W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y . Chebotaret al., “Inner monologue: Embodied reasoning through planning with language models,” in Conference on Robot Learning (CoRL), 2022, arXiv:2207.05608
work page internal anchor Pith review Pith/arXiv arXiv 2022
- [50]
-
[51]
Semantic Kernel: A lightweight SDK for AI agent de- velopment,
Microsoft, “Semantic Kernel: A lightweight SDK for AI agent de- velopment,” https://github.com/microsoft/semantic-kernel, 2023, accessed: 2026-02-21
work page 2023
-
[52]
Code as Policies: Language Model Programs for Embodied Control
J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng, “Code as policies: Language model programs for embodied control,” inIEEE International Conference on Robotics and Automation (ICRA), 2023, arXiv:2209.07753
work page internal anchor Pith review arXiv 2023
-
[53]
Progprompt: Generating situated robot task plans using large language models
I. Singh, V . Blukis, A. Mousavian, A. Goyal, D. Xu, J. Trem- blay, D. Fox, J. Thomason, and A. Garg, “ProgPrompt: Generating situated robot task plans using large language models,” inIEEE International Conference on Robotics and Automation (ICRA), 2023, arXiv:2209.11302
-
[54]
A. Zhou, K. Yan, M. Shlapentokh-Rothman, H. Wang, and Y .- X. Wang, “Language agent tree search unifies reasoning, acting, and planning in language models,” inInternational Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 235, 2024, pp. 62 138–62 160, arXiv:2310.04406
-
[55]
Self-instruct: Aligning language models with self- generated instructions,
Y . Wang, Y . Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi, “Self-instruct: Aligning language models with self- generated instructions,” inAnnual Meeting of the Association for Computational Linguistics (ACL), 2023, pp. 13 484–13 508
work page 2023
-
[56]
CREATOR: Tool creation for disentangling abstract and concrete reasoning of large language models,
C. Qian, C. Han, Y . R. Fung, Y . Qin, Z. Liu, and H. Ji, “CREATOR: Tool creation for disentangling abstract and concrete reasoning of large language models,” inFindings of the Association for Computa- tional Linguistics (EMNLP), 2023, arXiv:2305.14318
-
[57]
Introducing the model context protocol,
Anthropic, “Introducing the model context protocol,” https://www.an thropic.com/news/model-context-protocol, 2024, accessed: 2026-02- 21
work page 2024
-
[58]
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
Y . Qin, S. Liang, Y . Ye, K. Zhu, L. Yan, Y . Lu, Y . Lin, X. Cong, X. Tang, B. Qianet al., “ToolLLM: Facilitating large language models to master 16000+ real-world APIs,” inInternational Conference on Learning Representations (ICLR), 2024, arXiv:2307.16789
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[59]
OpenClaw: Personal ai assistant,
OpenClaw Project, “OpenClaw: Personal ai assistant,” https://gith ub.com/openclaw/openclaw, 2026, official repository (216k stars at access time). Accessed: 2026-02-22
work page 2026
-
[60]
ClawHavoc: 341 malicious clawed skills found by the bot they were targeting,
Alex and Oren Yomtov, “ClawHavoc: 341 malicious clawed skills found by the bot they were targeting,” https://www.koi.ai/blog/claw havoc-341-malicious-clawedbot-skills-found-by-the-bot-they-wer e-targeting, 2026, koi Research blog post; update dated Feb 16, 2026 reports 824 malicious skills. Accessed: 2026-02-22
work page 2026
-
[61]
Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su
X. Deng, Y . Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y . Su, “Mind2Web: Towards a generalist agent for the web,” in Advances in Neural Information Processing Systems (NeurIPS), 2023, spotlight. arXiv:2306.06070
-
[62]
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, Y . Liu, Y . Xu, S. Zhou, S. Savarese, C. Xiong, V . Zhong, and T. Yu, “OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments,” inAdvances in Neural Information Processing Systems (NeurIPS), 2024, datasets and Benchmarks track. arX...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[63]
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan, “SWE-bench: Can language models resolve real- world GitHub issues?” inInternational Conference on Learning Representations (ICLR), 2024, arXiv:2310.06770
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[64]
Re- cent advances in robot learning from demonstration,
H. Ravichandar, A. S. Polydoros, S. Chernova, and A. Billard, “Re- cent advances in robot learning from demonstration,”Annual Review of Control, Robotics, and Autonomous Systems, vol. 3, pp. 297–330, 2020
work page 2020
-
[65]
AutoGPT: An autonomous GPT-4 experiment,
Significant Gravitas, “AutoGPT: An autonomous GPT-4 experiment,” https://github.com/Significant-Gravitas/AutoGPT, 2023, accessed: 2026-02-21
work page 2023
-
[66]
A comprehensive survey of continual learning: Theory, method and application,
L. Wang, X. Zhang, H. Su, and J. Zhu, “A comprehensive survey of continual learning: Theory, method and application,”IEEE Transac- tions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 46, no. 8, pp. 5362–5383, 2024
work page 2024
-
[67]
Proagent: Building proactive cooperative ai with large language models
C. Zhang, K. Yang, S. Hu, Z. Wang, G. Li, Y . Sun, C. Zhang, Z. Zhang, A. Liu, S.-C. Zhu, X. Chang, J. Zhang, F. Yin, Y . Liang, and Y . Yang, “ProAgent: Building proactive cooperative agents with large language models,” inProceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 38, no. 16, 2024, pp. 17 591– 17 599, arXiv:2308.11339
-
[68]
SoK: Taxonomy of attacks on open-source software supply chains,
P. Ladisa, H. Plate, M. Martinez, and O. Barais, “SoK: Taxonomy of attacks on open-source software supply chains,” inIEEE Symposium on Security and Privacy (SP), 2023, pp. 1509–1526
work page 2023
-
[69]
K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, “Not what you’ve signed up for: Compromising real- world LLM-integrated applications with indirect prompt injection,” inACM Workshop on Artificial Intelligence and Security (AISec), 2023, arXiv:2302.12173
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[70]
From automation to infection: How OpenClaw AI agent skills are being weaponized,
B. Quintero, “From automation to infection: How OpenClaw AI agent skills are being weaponized,” https://blog.virustotal.com/2026/0 2/from-automation-to-infection-how.html, 2026, virusTotal Blog, February 2, 2026. Accessed: 2026-02-22
work page 2026
-
[71]
B. Van, “Agent skills guard,” https://github.com/brucevanfdm/agent -skills-guard, 2026, desktop scanner/manager; README reports 8 risk categories and 22 hard-trigger rules. Accessed: 2026-02-22
work page 2026
-
[72]
SkillGuard: AI agent security scanner,
G. Singh, “SkillGuard: AI agent security scanner,” https://skillgaurd .up.railway.app/, 2026, website and linked source repo describe AST analysis for JS/TS, 9-language coverage, and 20+ attack patterns. Accessed: 2026-02-22
work page 2026
-
[73]
Jailbreaking chat- gpt via prompt engineering: An empirical study
Y . Liu, G. Deng, Z. Xu, Y . Li, Y . Zheng, Y . Zhang, L. Zhao, T. Zhang, K. Wang, and Y . Liu, “Jailbreaking chatgpt via prompt engineering: An empirical study,”arXiv preprint arXiv:2305.13860, 2023
-
[74]
GAIA: a benchmark for General AI Assistants
G. Mialon, C. Fourrier, T. Wolf, Y . LeCun, and T. Scialom, “GAIA: A benchmark for general AI assistants,” inInternational Conference on Learning Representations (ICLR), 2024, poster. arXiv:2311.12983
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[75]
AgentBench: Evaluating LLMs as Agents
X. Liu, H. Yu, H. Zhang, Y . Xu, X. Lei, H. Lai, Y . Gu, H. Ding, K. Men, K. Yang, S. Zhang, X. Deng, A. Zeng, Z. Du, C. Zhang, S. Shen, T. Zhang, Y . Su, H. Sun, M. Huang, Y . Dong, and J. Tang, “AgentBench: Evaluating LLMs as agents,” inInternational Confer- ence on Learning Representations (ICLR), 2024, arXiv:2308.03688
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[76]
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
C. Rawles, S. Clinckemaillie, Y . Chang, J. Waltz, G. Lau, M. Fair, A. Li, W. E. Bishop, W. Li, F. Campbell-Ajala, D. K. Toyama, R. J. Berry, D. Tyamagundlu, T. P. Lillicrap, and O. Riva, “AndroidWorld: A dynamic benchmarking environment for autonomous agents,” in International Conference on Learning Representations (ICLR), 2025, arXiv:2405.14573
work page internal anchor Pith review Pith/arXiv arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.