{"total":28,"items":[{"citing_arxiv_id":"2607.02357","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Cloak and Detonate: Scanner Evasion and Dynamic Detection of Agent Skill Malware","primary_cat":"cs.CR","submitted_at":"2026-07-02T16:00:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SkillCloak evades existing static scanners for agent skill malware at high rates, while SkillDetonate detects 97% of attacks at 2% false-positive rate using sandboxed runtime behavior analysis.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00911","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Registry to Repository: How AI Agent Skills Are Written, Adapted, and Maintained","primary_cat":"cs.SE","submitted_at":"2026-07-01T13:14:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Empirical study of 41k+ AI agent skills finds reuse is mostly one-time verbatim copying with 53% never modified afterward and maintenance focused on additive local adaptations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11671","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Runtime Skill Audit: Targeted Runtime Probing for Agent Skill Security","primary_cat":"cs.CR","submitted_at":"2026-06-10T05:29:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Runtime Skill Audit introduces targeted runtime probing to detect malicious LLM agent skills, reporting 90% accuracy and resilience to self-evolving attacks on 100 skills versus static 
baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20659","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Skill Coverage: A Test Adequacy Metric for Agent Skills","primary_cat":"cs.AI","submitted_at":"2026-06-09T10:16:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Skill coverage is a binary test adequacy metric that extracts observable behavior constraints from skill documents and judges whether trajectories provide sufficient evidence to cover each constraint, revealing 39.90-43.98% coverage on SkillsBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.08671","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SkillHone: A Harness for Continual Agent Skill Evolution Through Persistent Decision History","primary_cat":"cs.LG","submitted_at":"2026-06-07T15:21:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SkillHone introduces a harness that maintains persistent decision histories to support continual evolution of language-model agent skills, reporting 15.8-point gains on GAIA over a commercial deep-research agent.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07131","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MalSkillBench: A Runtime-Verified Benchmark of Malicious Agent Skills","primary_cat":"cs.CR","submitted_at":"2026-06-05T10:43:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"MalSkillBench supplies the first sandbox-verified dataset of malicious agent skills and shows that existing 
detectors achieve high recall on code injection but collapse on prompt injection and agent-control attacks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06893","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Workflow-to-Skill: Skill Creation via Routing-Workflow-Semantics-Attachments Decomposition","primary_cat":"cs.AI","submitted_at":"2026-06-05T04:19:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"W2S framework with RWSA decomposition converts heterogeneous traces into Skills and improves behavioral replay consistency by 10.5% over summarization baselines on 70 Skills.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06416","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Unsupervised Skill Discovery for Agentic Data Analysis","primary_cat":"cs.AI","submitted_at":"2026-06-04T17:20:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DataCOPE uses verifier-guided contrastive distillation from agent trajectories to discover skills, yielding average gains of 9.71% on report-style and 32.30% on reasoning-style data analysis tasks across four model settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06079","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SkillComposer: Learning to Evolve Agent Skills for Specification and Generalization","primary_cat":"cs.CL","submitted_at":"2026-06-04T12:16:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SkillComposer decomposes skill construction into create/improve/merge 
operations trained by rejection sampling, enabling self-evolving skills that improve agent and code task performance while generalizing to unseen domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05525","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SciVisAgentSkills: Design and Evaluation of Agent Skills for Scientific Data Analysis and Visualization","primary_cat":"cs.AI","submitted_at":"2026-06-04T00:14:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SciVisAgentSkills provides reusable agent skills that raise mean task scores on a 108-task SciVis benchmark when paired with Codex and Claude Code agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03980","ref_index":67,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill","primary_cat":"cs.LG","submitted_at":"2026-06-02T17:56:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Skill-RM unifies heterogeneous reward criteria by modeling reward computation as dynamic execution of a reusable Reward-Evaluation Skill within an agent framework.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03143","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FederatedSkill: Federated Learning for Agentic Skill Evolution","primary_cat":"cs.LG","submitted_at":"2026-06-02T04:38:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FederatedSkill aggregates client semantic skill diffs via a server evolution agent to enable 
strictly personalized skill evolution, reporting up to 44.4% higher success rates and 37.5% lower compute cost than self-evolving baselines across 20 task families.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03024","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SkillGuard: A Permission Framework for Agent Skills","primary_cat":"cs.CR","submitted_at":"2026-06-02T02:01:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SkillGuard presents a dual-plane permission framework for agent skills that achieves 99.76% taxonomy coverage and reduces attack success rates in evaluations on 315 skills.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00510","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Skill or Skip? 
Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning","primary_cat":"cs.CL","submitted_at":"2026-05-30T04:00:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28424","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Skill0.5: Joint Skill Internalization and Utilization for Out-of-Distribution Generalization in Agentic Reinforcement Learning","primary_cat":"cs.CL","submitted_at":"2026-05-27T12:54:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Skill0.5 is an agentic RL framework that internalizes general skills for hard tasks and utilizes task-specific skills for easy tasks via a dynamic difficulty-aware router to improve out-of-distribution generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27466","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AgensFlow: A Coordination-Policy Substrate for Multi-Agent Systems","primary_cat":"cs.MA","submitted_at":"2026-05-26T08:10:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AgensFlow learns coordination policies from task trajectories and outperforms fixed pipelines on distributed-systems incident and security-advisory 
tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25430","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CODESKILL: Learning Self-Evolving Skills for Coding Agents","primary_cat":"cs.AI","submitted_at":"2026-05-25T05:12:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CODESKILL trains an LLM policy via RL on hybrid rewards to extract and maintain multi-granularity skills from agent trajectories, raising pass rates 9.69 points over no-skill baselines on three coding benchmarks while keeping the skill bank compact.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22634","ref_index":5,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents","primary_cat":"cs.SE","submitted_at":"2026-05-21T15:40:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Contractual skills framework structures SKILL.md files as readable task contracts; A/B tests on synthetic tasks show mean quality rising from 4.692 to 4.914 and critical-error rate falling from 0.083 to 0.013 across models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18401","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution","primary_cat":"cs.CL","submitted_at":"2026-05-18T13:44:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SkillsVote is a governance system for agent skills that profiles corpora, recommends via search, 
and gates updates on successful reusable outcomes, yielding benchmark gains without model changes.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"experience more compact than a full trajectory while preserving more executable context than an isolated natural-language summary [19]. At ecosystem scale, the problem is no longer only how to author an individual skill, but how to control a continuously expanding library. Public skill ecosystems already exhibit scale, redundancy, uneven quality, and safety risks [29]. Skill benchmarks further show that the benefit of skills depends on task, domain, and retrieval setting; weakly related or low-quality skills can degrade agent performance [23, 32]. Treating skills as ecosystem artifacts also changes the failure mode: larger libraries increase coverage, but they also enlarge the search space and amplify library pollution when weakly supported lessons are incorporated"},{"citing_arxiv_id":"2605.12015","ref_index":20,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces","primary_cat":"cs.CR","submitted_at":"2026-05-12T12:03:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SkillSafetyBench is a benchmark of 155 cases across 47 tasks and 6 risk domains showing that non-user attacks via skills, artifacts, or environments can consistently induce unsafe agent behavior.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09038","ref_index":18,"ref_count":3,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill 
Banks","primary_cat":"cs.AI","submitted_at":"2026-05-09T16:23:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SearchSkill improves LLM query planning on knowledge QA by using explicit skill selection from an evolving SkillBank and a two-stage SFT process that aligns training with inference-time skill-grounded execution.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Recent work on agent skills offers a complementary angle. Anthropic's Agent Skills made skills a first-class interface for packaging instructions, code, and resources that an agent can load on demand. Follow-up analyses show that skills rapidly became a practical mechanism for extending model functionality, while also raising ecosystem-level questions about organization and safe reuse [18, 19]. AgentSkillOS studies selection and benchmarking over large skill ecosystems [16], while Reinforcement Learning for Self-Improving Agent with Skill Library, MemSkill, and SkillRL explore how agents can maintain and evolve skill libraries or skill banks over training [30, 36, 33]. 
However, applying this intuition to search tool use is still non-trivial."},{"citing_arxiv_id":"2605.08526","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck","primary_cat":"cs.LG","submitted_at":"2026-05-08T22:17:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CMIB uses a conditional multimodal information bottleneck to create reusable agent skills that separate verbalizable text content from predictive perceptual residuals, improving execution stability.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"where Σθ and Σϕ are diagonal, σθ(M, c∗) is the elementwise standard-deviation vector, the posterior conditions multimodal rollout features on the fixed text card, and the prior represents the default multimodal expectation induced by c∗ alone. During training, we use the standard reparameterization z = µθ(M, c∗) + σθ(M, c∗) ⊙ ϵ, ϵ ∼ N(0, I), Σθ(M, c∗) = diag(σθ(M, c∗) ⊙ σθ(M, c∗)) (13) Finally, the realized latent z is fused with the text card through the control map already introduced in Equation (5). 
Concretely, the projected latent gω(z) is prepended together with the fixed card c∗ and the rollout bundle B to the frozen task model πtsk, so that the prediction term in Equation (10) measures how much additional task-relevant information"},{"citing_arxiv_id":"2605.02709","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"An Empirical Study of Agent Skills for Healthcare: Practice, Gaps, and Governance","primary_cat":"cs.AI","submitted_at":"2026-05-04T15:16:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Public healthcare agent skills emphasize workflow automation over clinical diagnostics and treatments, with uneven lifecycle coverage and weak alignment between technical and clinical risk.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24594","ref_index":16,"ref_count":3,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Skill Retrieval Augmentation for Agentic AI","primary_cat":"cs.CL","submitted_at":"2026-04-27T15:19:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces SRA paradigm and SRA-Bench benchmark (5,400 tasks, 26,262 skills) showing retrieval improves performance but LLMs fail to selectively incorporate retrieved skills.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15415","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?","primary_cat":"cs.CR","submitted_at":"2026-04-16T17:31:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs 
by lowering refusal rates when tasks are presented via pre-installed skills.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Our measurements in Section 5 reveal that 4,858 skills (4.93% in the evaluated registries) are openly accessible and ready for immediate deployment. Despite their prevalence, existing agent-safety [2, 14, 77] and jailbreak benchmarks [63, 9, 12, 44] focus on distinct threat models, failing to treat the skill itself as a primary harm vector. Similarly, while concurrent ecosystem studies [38, 40, 35] explore prevalence, covert attacks, or functionality, they overlook whether agents comply with overtly harmful skills. To bridge this gap, we propose HARMFULSKILLBENCH (Figure 10), a benchmark designed to answer how harmful skills affect the safety behavior of LLM-based agents (RQ3). 6.1 Benchmark Construction To construct HARMFULSKILLBENCH, we select 200 harm-"},{"citing_arxiv_id":"2604.08224","ref_index":86,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering","primary_cat":"cs.SE","submitted_at":"2026-04-09T13:19:41+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03460","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific 
Simulations","primary_cat":"physics.chem-ph","submitted_at":"2026-04-03T21:09:19+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"FermiLink is a unified AI agent framework that automates multidomain scientific simulations via separated package knowledge bases and a four-layer progressive disclosure mechanism, reproducing 56% of target figures in benchmarks and generating research-grade results on unpublished problems.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"[32] A. Mandal, M. A. Taylor, B. M. Weight, E. R. Koessler, X. Li, and P. Huo, Theoretical Advances in Polariton Chemistry and Molecular Cavity Quantum Electrodynamics, Chem. Rev. 123, 9786 (2023). [33] M. Ruggenthaler, D. Sidler, and A. Rubio, Understanding Polaritonic Chemistry from Ab Initio Quantum Electrodynamics, Chem. Rev. 123, 11191 (2023). [34] G. Ling, S. Zhong, and R. Huang, Agent Skills: A Data-Driven Analysis of Claude Skills for Extending Large Language Model Functionality, arXiv:2602.08004 (2026). [35] L. Hatton, The t experiments: errors in scientific software, IEEE Comput. Sci. Eng. 4, 27 (1997). [36] K. T. Williams, Y. Yao, J. Li, L. Chen, H. Shi, M. Motta, C. Niu, U. Ray, S. Guo, R."},{"citing_arxiv_id":"2604.13064","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Red Skills or Blue Skills? A Dive Into Skills Published on ClawHub","primary_cat":"cs.CL","submitted_at":"2026-03-19T14:31:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Analysis of ClawHub shows language-based functional divides in agent skills, with over 30% flagged suspicious and submission-time documentation enabling 73% accurate risk prediction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}