pith. machine review for the scientific record.

arxiv: 2602.12430 · v3 · submitted 2026-02-12 · 💻 cs.MA · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward

Renjun Xu, Yang Yan

Pith reviewed 2026-05-13 07:35 UTC · model grok-4.3

classification 💻 cs.MA cs.AI
keywords agent skills · LLM agents · skill acquisition · security vulnerabilities · governance framework · Model Context Protocol · skill provenance

The pith

Agent skills shift LLMs from monolithic models to modular systems where composable packages of instructions and code load on demand for dynamic capability extension.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey organizes the rapid development of agent skills into four areas: architectural foundations such as the SKILL.md format and Model Context Protocol integration, acquisition methods including reinforcement learning and autonomous discovery, deployment through computer-use stacks and benchmarks like OSWorld, and security risks where 26.1 percent of community skills show vulnerabilities. The central proposal is a four-tier Skill Trust and Lifecycle Governance Framework that ties skill provenance to graduated deployment permissions via gate-based controls. The work identifies seven open challenges and sets a research agenda for trustworthy self-improving skill ecosystems, distinguishing this layer from broader agent or tool surveys.
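The architectural axis above centers on progressive disclosure: only a lightweight index of skills sits in the agent's context, and a skill's full instructions load on demand. The sketch below illustrates that loading pattern only; the `Skill` fields and registry API are illustrative assumptions, not the actual SKILL.md schema.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    # Hypothetical fields; the real SKILL.md spec defines its own schema.
    name: str
    description: str  # always in context (cheap)
    body: str         # full instructions, loaded only when invoked


class SkillRegistry:
    """Progressive disclosure: the model sees only name + description for
    every skill on each turn; the full body enters context on demand."""

    def __init__(self, skills):
        self._skills = {s.name: s for s in skills}

    def index(self) -> str:
        # Lightweight always-on listing shown to the model.
        return "\n".join(
            f"- {s.name}: {s.description}" for s in self._skills.values()
        )

    def load(self, name: str) -> str:
        # Full instructions, pulled into context only when the skill fires.
        return self._skills[name].body


skills = SkillRegistry([
    Skill("pdf-extract", "Pull tables out of PDF files", "Step 1: open the file ..."),
])
print(skills.index())            # small index, every turn
print(skills.load("pdf-extract"))  # full body, only on invocation
```

The point of the split is token economy: the index cost grows with the number of skills, while the (much larger) bodies cost nothing until used.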

Core claim

Agent skills are formalized as portable, composable packages that enable progressive disclosure without model retraining. Security analyses showing substantial vulnerabilities in community contributions motivate a four-tier, gate-based permission model that maps skill provenance to graduated deployment capabilities.

What carries the argument

The Skill Trust and Lifecycle Governance Framework, a four-tier gate-based permission model that assigns graduated deployment rights according to skill provenance and verification status.
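A minimal sketch of what such a gate-based provenance-to-permission mapping could look like. The tier labels and capability names below are invented for illustration; the paper specifies four tiers but this is not its concrete schema.

```python
from enum import IntEnum


class Tier(IntEnum):
    # Hypothetical tier names, ordered by increasing trust in provenance.
    UNTRUSTED = 0    # anonymous community upload
    COMMUNITY = 1    # known author, unreviewed
    REVIEWED = 2     # passed static analysis / human review
    FIRST_PARTY = 3  # platform-authored and signed


# Gate-based mapping: each tier unlocks a superset of the one below it,
# so promotion through a gate strictly widens deployment capabilities.
PERMISSIONS = {
    Tier.UNTRUSTED: {"read_context"},
    Tier.COMMUNITY: {"read_context", "network_fetch"},
    Tier.REVIEWED: {"read_context", "network_fetch", "file_write"},
    Tier.FIRST_PARTY: {"read_context", "network_fetch", "file_write", "shell_exec"},
}


def allowed(tier: Tier, capability: str) -> bool:
    """Check a skill's requested capability against its provenance tier."""
    return capability in PERMISSIONS[tier]


assert allowed(Tier.REVIEWED, "file_write")
assert not allowed(Tier.COMMUNITY, "shell_exec")
```

The monotone-superset property is the design choice worth noting: it makes the gates auditable, since moving a skill up a tier can never silently revoke a capability it already exercised.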

If this is right

  • Skills acquired via reinforcement learning or compositional synthesis allow capability extension without retraining the base model.
  • Deployment at scale proceeds through computer-use agent stacks, GUI grounding, and benchmarks such as OSWorld and SWE-bench.
  • Seven open challenges remain, including cross-platform portability and capability-based rather than provenance-based permissions.
  • Proper governance enables self-improving skill ecosystems that maintain trustworthiness over time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same provenance-to-permission mapping could extend to other modular components such as tools or memory stores in agent systems.
  • Empirical validation of the framework would require longitudinal tracking of skill usage and incident rates rather than snapshot sampling.
  • Integration points with existing agent platforms would determine how quickly the proposed tiers can be adopted in practice.

Load-bearing premise

The 26.1 percent vulnerability rate found in sampled community skills is representative enough to make a new governance framework the main path forward.

What would settle it

A larger multi-platform sample of skills that finds vulnerability rates below 10 percent and no clear link between provenance and risk would remove the empirical basis for prioritizing the four-tier framework.
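One way to see why sample size governs whether the 26.1 percent figure can be overturned: a Wilson score confidence interval around that estimate narrows as the sample grows, and a true rate below 10 percent would fall outside the interval even at modest samples. The `n` values below are purely illustrative, since the underlying study's sample size is not reported here.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96):
    """95% Wilson score interval for a binomial proportion."""
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)
    )
    return centre - half, centre + half


# Hypothetical sample sizes for a reported 26.1% vulnerability rate.
for n in (100, 1000, 10000):
    lo, hi = wilson_interval(0.261, n)
    print(f"n={n:>6}: [{lo:.3f}, {hi:.3f}]")
```

Even at n=100 the lower bound sits well above 10 percent, so a new multi-platform study finding a sub-10-percent rate would genuinely conflict with the original estimate rather than fall within its sampling noise.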

read the original abstract

The transition from monolithic language models to modular, skill-equipped agents marks a defining shift in how large language models (LLMs) are deployed in practice. Rather than encoding all procedural knowledge within model weights, agent skills -- composable packages of instructions, code, and resources that agents load on demand -- enable dynamic capability extension without retraining. It is formalized in a paradigm of progressive disclosure, portable skill definitions, and integration with the Model Context Protocol (MCP). This survey provides a comprehensive treatment of the agent skills landscape, as it has rapidly evolved during the last few months. We organize the field along four axes: (i) architectural foundations, examining the SKILL.md specification, progressive context loading, and the complementary roles of skills and MCP; (ii) skill acquisition, covering reinforcement learning with skill libraries, autonomous skill discovery (SEAgent), and compositional skill synthesis; (iii) deployment at scale, including the computer-use agent (CUA) stack, GUI grounding advances, and benchmark progress on OSWorld and SWE-bench; and (iv) security, where recent empirical analyses reveal that 26.1% of community-contributed skills contain vulnerabilities, motivating our proposed Skill Trust and Lifecycle Governance Framework -- a four-tier, gate-based permission model that maps skill provenance to graduated deployment capabilities. We identify seven open challenges -- from cross-platform skill portability to capability-based permission models -- and propose a research agenda for realizing trustworthy, self-improving skill ecosystems. Unlike prior surveys that broadly cover LLM agents or tool use, this work focuses specifically on the emerging skill abstraction layer and its implications for the next generation of agentic systems. Project repo: https://github.com/scienceaix/agentskills

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper surveys the agent skills landscape for LLMs, organizing recent work along four axes: architectural foundations (SKILL.md specification, progressive context loading, and integration with the Model Context Protocol), skill acquisition (reinforcement learning with skill libraries, autonomous discovery via SEAgent, and compositional synthesis), deployment at scale (computer-use agent stacks, GUI grounding, and benchmarks such as OSWorld and SWE-bench), and security. It reports that 26.1% of community-contributed skills contain vulnerabilities according to recent empirical analyses, which motivates the proposal of a four-tier Skill Trust and Lifecycle Governance Framework that maps skill provenance to graduated deployment permissions. The manuscript identifies seven open challenges and outlines a research agenda for trustworthy, self-improving skill ecosystems, distinguishing itself from broader LLM-agent surveys by focusing on the skill abstraction layer.

Significance. If the central claims hold, the survey would provide a timely, focused synthesis of a rapidly evolving subfield and introduce a concrete governance proposal that could guide standardization of skill provenance and permissions. The work is strengthened by its grounding in very recent literature and by naming a specific empirical statistic on vulnerabilities, which supplies a falsifiable anchor for the security discussion and the proposed framework.

major comments (1)
  1. [Security section] Security section (abstract and corresponding §4): The 26.1% vulnerability rate is presented as direct motivation for the Skill Trust and Lifecycle Governance Framework, yet the manuscript supplies no sample size, selection method, verification procedure, platform scope, or citation to the underlying empirical study. Because this statistic is load-bearing for the claim that a four-tier gate-based permission model is the indicated path forward, its representativeness must be demonstrated or the motivation for the framework must be qualified.
minor comments (1)
  1. [Abstract] Abstract: The seven open challenges are referenced but not enumerated; listing them explicitly would improve readability and allow readers to connect them directly to the proposed research agenda.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the security discussion. We address the concern about the 26.1% statistic below and will revise the manuscript accordingly to strengthen the motivation for the proposed framework.

read point-by-point responses
  1. Referee: [Security section] Security section (abstract and corresponding §4): The 26.1% vulnerability rate is presented as direct motivation for the Skill Trust and Lifecycle Governance Framework, yet the manuscript supplies no sample size, selection method, verification procedure, platform scope, or citation to the underlying empirical study. Because this statistic is load-bearing for the claim that a four-tier gate-based permission model is the indicated path forward, its representativeness must be demonstrated or the motivation for the framework must be qualified.

    Authors: We agree that additional context is required for the 26.1% figure. In the revised manuscript we will insert the full citation to the underlying empirical study together with its sample size, selection method (e.g., sampling from public skill repositories), verification procedure (e.g., the static-analysis pipeline employed), and platform scope. This will allow readers to evaluate representativeness directly. If the study’s coverage is narrower than the full ecosystem, we will qualify the motivation by framing the four-tier model as a response to the broader class of provenance and permission vulnerabilities documented across recent analyses, while retaining the statistic as a concrete, falsifiable illustration. revision: yes

Circularity Check

0 steps flagged

Low circularity: survey motivates governance framework from external empirical statistic without self-referential reduction

full rationale

The paper is a survey organizing external literature on agent skills and proposing the Skill Trust and Lifecycle Governance Framework as a response to a cited 26.1% vulnerability rate from 'recent empirical analyses.' No equations, fitted parameters, or derivations appear in the provided text. The central forward claim does not reduce to a self-definition, a renamed known result, or a load-bearing self-citation chain; the statistic is treated as independent motivation rather than an author-derived input. This yields only minor self-citation risk at most, consistent with normal survey practice.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The paper rests on standard assumptions from the LLM agent literature (e.g., composability of instructions and code) and introduces one new proposed entity without independent validation here.

invented entities (1)
  • Skill Trust and Lifecycle Governance Framework · no independent evidence
    purpose: Four-tier permission model to gate deployment based on skill provenance and address observed vulnerabilities
    Proposed in response to the 26.1% vulnerability finding; no independent falsifiable test provided in the work

pith-pipeline@v0.9.0 · 5613 in / 1131 out tokens · 32002 ms · 2026-05-13T07:35:20.366373+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.LawOfExistence defect_zero_iff_one · unclear

    recent empirical analyses reveal that 26.1% of community-contributed skills contain vulnerabilities, motivating our proposed Skill Trust and Lifecycle Governance Framework—a four-tier, gate-based permission model

Forward citations

Cited by 29 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Under the Hood of SKILL.md: Semantic Supply-chain Attacks on AI Agent Skill Registry

    cs.AI 2026-05 unverdicted novelty 8.0

    Semantic manipulations of SKILL.md descriptions enable effective supply-chain attacks that bias AI agent skill registries toward adversarial skills in discovery, selection, and governance.

  2. HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

    cs.CR 2026-04 unverdicted novelty 8.0

    Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

  3. OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    OLIVIA treats LLM agent action selection as a contextual linear bandit over frozen hidden states and applies UCB exploration to adapt online, yielding consistent gains over static ReAct and prompt-based baselines on f...

  4. Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries

    cs.SE 2026-05 conditional novelty 7.0

    SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round rep...

  5. Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck

    cs.LG 2026-05 unverdicted novelty 7.0

    CMIB uses a conditional multimodal information bottleneck to create reusable agent skills that separate verbalizable text content from predictive perceptual residuals, improving execution stability.

  6. SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    SkillRet benchmark shows fine-tuned retrievers improve NDCG@10 by 13+ points over prior models on large-scale skill retrieval for LLM agents.

  7. Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

    cs.AI 2026-04 unverdicted novelty 7.0

    COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.

  8. SAGER: Self-Evolving User Policy Skills for Recommendation Agent

    cs.IR 2026-04 unverdicted novelty 7.0

    SAGER equips LLM recommendation agents with per-user evolving policy skills via two-representation architecture, contrastive CoT diagnosis, and skill-augmented listwise reasoning, yielding SOTA gains orthogonal to mem...

  9. Skill-Conditioned Visual Geolocation for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    GeoSkill uses an evolving Skill-Graph initialized from expert trajectories and grown via autonomous analysis of successful and failed reasoning rollouts to boost geolocation accuracy, faithfulness, and generalization ...

  10. SKILLFOUNDRY: Building Self-Evolving Agent Skill Libraries from Heterogeneous Scientific Resources

    cs.AI 2026-04 unverdicted novelty 7.0

    SkillFoundry mines heterogeneous scientific resources into a self-evolving library of validated agent skills, with 71.1% novelty versus prior libraries and measurable gains on coding benchmarks plus two genomics tasks.

  11. AutoVerifier: An Agentic Automated Verification Framework Using Large Language Models

    cs.AI 2026-04 unverdicted novelty 7.0

    AutoVerifier decomposes technical claims into triples and uses layered LLM verification to assess validity, demonstrated on a quantum computing paper by finding overclaims and conflicts.

  12. MMSkills: Towards Multimodal Skills for General Visual Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.

  13. SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs

    cs.CL 2026-05 unverdicted novelty 6.0

    SkillGraph represents skills as nodes in an evolving directed graph with typed dependency edges and updates the graph from RL trajectories to boost compositional task performance.

  14. SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

    cs.CR 2026-05 unverdicted novelty 6.0

    SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.

  15. SkillEvolver: Skill Learning as a Meta-Skill

    cs.AI 2026-05 unverdicted novelty 6.0

    A meta-skill authors and refines prose-and-code skills for agents by learning from post-deployment failures with an overfit audit, achieving 56.8% accuracy on SkillsBench tasks versus 43.6% for human-curated skills.

  16. SkillGen: Verified Inference-Time Agent Skill Synthesis

    cs.LG 2026-05 unverdicted novelty 6.0

    SkillGen synthesizes auditable skills from agent trajectories via contrastive induction on successes and failures, then verifies net performance impact by comparing outcomes with and without the skill on identical tasks.

  17. SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    SkillLens organizes skills into policies-strategies-procedures-primitives layers, retrieves via degree-corrected random walk, and uses a verifier for local adaptation, yielding up to 6.31 pp gains on MuLocbench and ra...

  18. Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    Skill1 trains one policy to jointly evolve skill query generation, re-ranking, task solving, and distillation from a single task-success signal, with low-frequency trends crediting selection and high-frequency variati...

  19. From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills

    cs.CL 2026-04 unverdicted novelty 6.0

    SSL representation disentangles skill scheduling, structure, and logic using an LLM normalizer, improving skill discovery MRR@50 from 0.649 to 0.729 and risk assessment macro F1 from 0.409 to 0.509 over text baselines.

  20. ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation

    cs.AI 2026-04 unverdicted novelty 6.0

    ClawTrace enables cost-aware LLM agent skill distillation by tracing per-step costs and generating preserve, prune, and repair patches, with ablations showing reduced regressions and prune rules transferring to cut co...

  21. Co-evolving Agent Architectures and Interpretable Reasoning for Automated Optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    EvoOR-Agent co-evolves agent architectures as AOE-style networks with graph-mediated recombination and knowledge-base-assisted mutation to outperform fixed LLM pipelines on OR benchmarks.

  22. Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents

    cs.AI 2026-04 conditional novelty 6.0

    The Experience Compression Spectrum unifies memory, skills, and rules in LLM agents along increasing compression levels and identifies the absence of adaptive cross-level compression as the missing diagonal.

  23. TrajOnco: a multi-agent framework for temporal reasoning over longitudinal EHR for multi-cancer early detection

    cs.AI 2026-04 unverdicted novelty 6.0

    TrajOnco uses a chain-of-agents LLM architecture with memory to perform temporal reasoning on longitudinal EHR, achieving 0.64-0.80 AUROC for 1-year multi-cancer risk prediction in zero-shot mode on matched cohorts wh...

  24. SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills

    cs.CR 2026-04 unverdicted novelty 6.0

    SkillSieve is a hierarchical triage framework combining regex/AST/XGBoost filtering, parallel LLM subtasks, and multi-LLM jury voting to detect malicious AI agent skills, reaching 0.800 F1 on a 400-skill benchmark at ...

  25. Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 5.0

    Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency var...

  26. Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 5.0

    Skill1 co-evolves skill selection, utilization, and distillation inside a single policy using only task-outcome reward, with low-frequency trends crediting selection and high-frequency variation crediting distillation...

  27. EvoAgent: An Evolvable Agent Framework with Skill Learning and Multi-Agent Delegation

    cs.AI 2026-04 unverdicted novelty 5.0

    EvoAgent is an evolvable LLM agent framework using structured skill learning, user-feedback loops, and hierarchical delegation that boosts GPT5.2 performance by about 28% in real-world trade scenarios under LLM-as-Jud...

  28. From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution

    cs.SE 2026-04 unverdicted novelty 5.0

    Compact Gene representations of experience outperform documentation-oriented Skill packages for test-time control and iterative evolution in code-solving tasks, with measured gains on CritPt from 9.1% to 18.57% and 17...

  29. Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

    cs.SE 2026-04 accept novelty 5.0

    LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · cited by 27 Pith papers · 5 internal anchors

  1. [1]

    Equipping agents for the real world with agent skills. https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills, 2025

    Barry Zhang, Keith Lazuka, and Mahesh Murag. Equipping agents for the real world with agent skills. https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills, 2025. Anthropic Engineering Blog, Oct 2025

  2. [2]

    Introducing agent skills. https://www.anthropic.com/news/skills, 2025

    Anthropic. Introducing agent skills. https://www.anthropic.com/news/skills, 2025. Product Announcement, Oct 2025

  3. [3]

    Agent skills open standard.https://agentskills.io, 2025

    Anthropic. Agent skills open standard.https://agentskills.io, 2025. Open standard specification

  4. [4]

    Donating the model context protocol and establishing the agentic AI foundation. https://www.anthropic.com/news/donating-the-model-context-protocol-and-establishing-of-the-agentic-ai-foundation, 2025

    Anthropic. Donating the model context protocol and establishing the agentic AI foundation. https://www.anthropic.com/news/donating-the-model-context-protocol-and-establishing-of-the-agentic-ai-foundation, 2025

  5. [5]

    Large Language Model Agent: A Survey on Methodology, Applications and Challenges

    Junyu Luo et al. Large language model agent: A survey on methodology, applications and challenges.arXiv preprint arXiv:2503.21460, 2025

  6. [6]

    Agentic large language models: A survey.arXiv preprint arXiv:2503.23037, 2025

    Aske Plaat et al. Agentic large language models: A survey.arXiv preprint arXiv:2503.23037, 2025

  7. [7]

    Tool learning with large language models: A survey.Frontiers of Computer Science, 19(8), 2025

    Changle Qu et al. Tool learning with large language models: A survey.Frontiers of Computer Science, 19(8), 2025

  8. [8]

    Large Language Model-Brained GUI Agents: A Survey

    Chaoyun Zhang et al. Large language model-brained GUI agents: A survey. arXiv preprint arXiv:2411.18279, 2024. Updated May 2025

  9. [9]

    OS agents: A survey on MLLM-based agents for general computing devices use

    Xueyu Hu et al. OS agents: A survey on MLLM-based agents for general computing devices use. InProceedings of the Association for Computational Linguistics (ACL), 2025

  10. [10]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang et al. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

  11. [11]

    CREATOR: Tool creation for disentangling abstract and concrete reasonings of large language models

    Cheng Qian et al. CREATOR: Tool creation for disentangling abstract and concrete reasonings of large language models. InFindings of EMNLP, 2024

  12. [12]

    Large Language Models as Tool Makers

    Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. Large language models as tool makers. arXiv preprint arXiv:2305.17126, 2023

  13. [13]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick et al. Toolformer: Language models can teach themselves to use tools. 2024

  14. [14]

    Introducing the model context protocol. https://www.anthropic.com/news/model-context-protocol, 2024

    Anthropic. Introducing the model context protocol. https://www.anthropic.com/news/model-context-protocol, 2024. Nov 2024

  15. [15]

    Model context protocol specification (2025-11-25). https://modelcontextprotocol.io/specification/2025-11-25, 2025

    Model Context Protocol. Model context protocol specification (2025-11-25). https://modelcontextprotocol.io/specification/2025-11-25, 2025

  16. [16]

    Introducing advanced tool use on the claude developer platform. https://www.anthropic.com/engineering/advanced-tool-use, 2025

    Anthropic. Introducing advanced tool use on the claude developer platform. https://www.anthropic.com/engineering/advanced-tool-use, 2025. Nov 24, 2025

  17. [17]

    Reinforcement learning for self-improving agent with skill library, 2025

    Jiongxiao Wang et al. Reinforcement learning for self-improving agent with skill library.arXiv preprint arXiv:2512.17102, 2025

  18. [18]

    SEAgent: Self-evolving computer use agent with autonomous learning from experience. arXiv preprint arXiv:2508.04700, 2025

    Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, and Jiaqi Wang. SEAgent: Self-evolving computer use agent with autonomous learning from experience. arXiv preprint arXiv:2508.04700, 2025

  19. [19]

    CUA-Skill: Develop Skills for Computer Using Agent

    Tianyi Chen et al. CUA-Skill: Develop skills for computer using agent. arXiv preprint arXiv:2601.21123, 2026

  20. [20]

    Agentic proposing: Enhancing large language model reasoning via compositional skill synthesis.arXiv preprint arXiv:2602.03279,

    Zhengbo Jiao et al. Agentic proposing: Enhancing large language model reasoning via compositional skill synthesis. arXiv preprint arXiv:2602.03279, 2026

  21. [21]

    When single-agent with skills replace multi-agent systems and when they fail,

    Xiaoxiao Li. When single-agent with skills replace multi-agent systems and when they fail.arXiv preprint arXiv:2601.04748, 2026

  22. [22]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Yujia Qin et al. UI-TARS: Pioneering automated GUI interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

  23. [23]

    UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    Haoming Wang et al. UI-TARS-2 technical report: Advancing GUI agent with multi-turn reinforcement learning.arXiv preprint arXiv:2509.02544, 2025

  24. [24]

    Agent S2: A compositional generalist-specialist framework for computer use agents.arXiv preprint arXiv:2504.00906, 2025

    Saaket Agashe et al. Agent S2: A compositional generalist-specialist framework for computer use agents.arXiv preprint arXiv:2504.00906, 2025

  25. [25]

    OpenCUA: Open foundations for computer-use agents

    Xinyuan Wang et al. OpenCUA: Open foundations for computer-use agents. In Advances in Neural Information Processing Systems (NeurIPS), 2025. Spotlight

  26. [26]

    UGround: Universal visual grounding for GUI agents

    Boyu Gou et al. UGround: Universal visual grounding for GUI agents. InInternational Conference on Learning Representations (ICLR), 2025. Oral

  27. [27]

    Scaling computer-use grounding via user interface decomposition and synthesis

    Tianbao Xie et al. Scaling computer-use grounding via user interface decomposition and synthesis. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. Spotlight

  28. [28]

    Enhancing visual grounding for GUI agents via self-evolutionary reinforcement learning.arXiv preprint arXiv:2505.12370, 2025

    Xinbin Yuan et al. Enhancing visual grounding for GUI agents via self-evolutionary reinforcement learning.arXiv preprint arXiv:2505.12370, 2025

  29. [29]

    GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents

    Qianhui Wu et al. GUI-Actor: Coordinate-free visual grounding for GUI agents. arXiv preprint arXiv:2506.03143, 2025

  30. [30]

    OS-Marathon: Benchmarking computer-use agents on long-horizon repetitive tasks.arXiv preprint arXiv:2601.20650, 2026

    Jing Wu et al. OS-Marathon: Benchmarking computer-use agents on long-horizon repetitive tasks.arXiv preprint arXiv:2601.20650, 2026

  31. [31]

    CoAct-1: Computer-using agents with coding as actions.arXiv preprint arXiv:2508.03923, 2025

    Linxin Song et al. CoAct-1: Computer-using agents with coding as actions.arXiv preprint arXiv:2508.03923, 2025

  32. [32]

    Agent skills enable a new class of realistic and trivially simple prompt injections,

    David Schmotz, Sahar Abdelnabi, and Maksym Andriushchenko. Agent skills enable a new class of realistic and trivially simple prompt injections.arXiv preprint arXiv:2510.26328, 2025

  33. [33]

    Agent skills in the wild: An empirical study of security vulnerabilities at scale,

    Yi Liu et al. Agent skills in the wild: An empirical study of security vulnerabilities at scale.arXiv preprint arXiv:2601.10338, 2026

  34. [34]

    Malicious agent skills in the wild: A large-scale security empirical study.arXiv preprint arXiv:2602.06547, 2026

    Yi Liu et al. Malicious agent skills in the wild: A large-scale security empirical study.arXiv preprint arXiv:2602.06547, 2026

  35. [35]

    Demystifying evals for AI agents. https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents, 2025

    Anthropic. Demystifying evals for AI agents. https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents, 2025

  36. [36]

    Self-Distillation Enables Continual Learning

    Idan Shenfeld, Mehul Damani, et al. Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897, 2026