arxiv: 2605.12015 · v1 · submitted 2026-05-12 · 💻 cs.CR · cs.AI· cs.CL· cs.LG· cs.MA

Recognition: 1 theorem link

· Lean Theorem

SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

An Wang, Biaojie Zeng, Chang Jin, Chao Yang, Jingjing Qu, Kai Wang, Qiaosheng Zhang, Xia Hu, Xingcheng Xu, Zeming Wei

Authors on Pith no claims yet

Pith reviewed 2026-05-13 05:04 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.CLcs.LGcs.MA

keywords LLM agentsskill safetyadversarial evaluationsafety benchmarkreusable skillsagent attacksrisk domains

0 comments

The pith

SkillSafetyBench shows that attacks on reusable skills can induce unsafe actions in LLM agents even from benign user requests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SkillSafetyBench as a way to test how modular skills in LLM agents create new safety problems that standard evaluations overlook. Reusable skills give agents access to tools and contexts that can be poisoned with adversarial material, leading agents to perform harmful actions without the user asking for them. Through 155 test cases in various risk areas, the authors demonstrate that these attacks work reliably on different agents and models, revealing unique failure types for each combination. This matters because as agents rely more on shared skills, their safety becomes tied to how they process and trust those skills in real execution environments, beyond just the underlying model's training.

Core claim

SkillSafetyBench includes 155 adversarial cases across 47 tasks, 6 risk domains, and 30 safety categories, each with a case-specific rule-based verifier. Experiments with multiple CLI agents and model backends show that localized non-user attacks can consistently induce unsafe behavior, with distinct failure patterns across domains, attack methods, and scaffold-model pairings. The findings indicate that agent safety depends not only on model-level alignment, but also on how agents interpret skills, trust workflow context, and act through executable environments.

What carries the argument

SkillSafetyBench, a runnable benchmark for evaluating skill-mediated safety failures using adversarial cases and rule-based verifiers.

If this is right

Agent safety evaluations need to include tests for skill-facing attacks in addition to direct user prompts.
Distinct failure patterns suggest that safety improvements must be tailored to specific agent scaffolds and model backends.
Trust in workflow context from skills can be exploited to bypass safety measures in executable environments.
Reusable skills should be designed with safeguards against local adversarial artifacts to maintain agent safety.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending the benchmark to include more diverse agent types beyond CLI could reveal additional vulnerabilities in deployed systems.
Skill providers might need to incorporate validation mechanisms for skill content to reduce attack surfaces.
The results imply that future agent designs could benefit from isolated execution environments for skills to limit the impact of compromised context.

Load-bearing premise

The constructed adversarial cases and rule-based verifiers in SkillSafetyBench correctly identify and measure real-world skill-mediated safety failures without missing important cases or introducing errors in verification.

What would settle it

Re-running the experiments on the 155 cases with new agent-model combinations and observing that no or very few unsafe behaviors are triggered according to the verifiers would challenge the claim of consistent induction of unsafe behavior.

Figures

Figures reproduced from arXiv: 2605.12015 by An Wang, Biaojie Zeng, Chang Jin, Chao Yang, Jingjing Qu, Kai Wang, Qiaosheng Zhang, Xia Hu, Xingcheng Xu, Zeming Wei.

**Figure 2.** Figure 2: The construction pipeline of a specific case under the taxonomy of SkillSafetyBench. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: An example case in RD3 from SkillSafetyBench. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Attack success versus task success across evaluated agent systems. Each point represents one CLI agent system–model backend pairing. The x-axis reports the task success rate, while the y-axis reports the overall attack success rate (ASR) on SkillSafetyBench. Dashed lines show the median task success rate (37.4%) and median ASR (41.8%) across evaluated systems. 5.4 Main Results Agent CLI and Model Compariso… view at source ↗

**Figure 6.** Figure 6: Average ASR by risk domain. Bars show the mean attack success rate (ASR) across completed agentmodel runs for each risk domain, and error bars indicate standard deviation across systems. Risk domains are sorted by mean ASR in descending order. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗

read the original abstract

Reusable skills are becoming a common interface for extending large language model agents, packaging procedural guidance with access to files, tools, memory, and execution environments. However, this modularity introduces attack surfaces that are largely missed by existing safety evaluations: even when the user request is benign, task-relevant skill materials or local artifacts can steer an agent toward unsafe actions. We present SkillSafetyBench, a runnable benchmark for evaluating such skill-mediated safety failures. SkillSafetyBench includes 155 adversarial cases across 47 tasks, 6 risk domains, and 30 safety categories, each evaluated with a case-specific rule-based verifier. Experiments with multiple CLI agents and model backends show that localized non-user attacks can consistently induce unsafe behavior, with distinct failure patterns across domains, attack methods, and scaffold-model pairings. Our findings suggest that agent safety depends not only on model-level alignment, but also on how agents interpret skills, trust workflow context, and act through executable environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SkillSafetyBench flags a genuine gap in how modular agent skills create attack surfaces, but the rule-based verifiers make the reported failure rates hard to trust without more validation details.

read the letter

The paper's core point is that reusable skills in LLM agents open up attack vectors that standard safety checks miss, because even benign user prompts can be steered by local artifacts or skill materials. It builds SkillSafetyBench with 155 cases spanning 47 tasks, 6 domains, and 30 categories, then runs them on several CLI agents and backends to show consistent unsafe outputs with varying patterns by domain and setup. That framing of skill-facing surfaces is new enough to matter for anyone building tool-using agents, and the experiments do demonstrate that model alignment alone is not enough when execution context and workflows are involved. The work is straightforward about its scope and avoids overclaiming theoretical fixes. The soft spot is the evaluation method. Each case uses a custom rule-based verifier, yet the abstract and available details give no information on how those rules were tested for false positives, whether they account for intent or downstream effects, or how the adversarial cases were generated without knowledge of the target agents' weaknesses. If the rules lean on surface keywords or tool-call patterns, the distinct failure patterns could partly reflect benchmark construction rather than real-world risk. The paper would be useful for agent safety researchers who need concrete test cases to probe scaffolding and skill packaging. Readers working on red-teaming or deployment standards would get practical value from the domain breakdowns. It deserves a serious referee because the underlying concern is real and the empirical setup is reproducible in principle, even if the current verifiers require scrutiny. I would send it for review with the expectation that the authors add verifier validation steps and clearer comparisons to existing agent red-teaming benchmarks.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SkillSafetyBench, a runnable benchmark for evaluating safety failures in LLM agents induced by reusable skills that grant access to files, tools, memory, and execution environments. It comprises 155 adversarial cases across 47 tasks, 6 risk domains, and 30 safety categories, each paired with a case-specific rule-based verifier. Experiments with multiple CLI agents and model backends demonstrate that localized non-user attacks can consistently induce unsafe behavior, with distinct failure patterns varying by domain, attack method, and scaffold-model pairing. The authors argue that agent safety requires attention to skill interpretation, workflow context, and executable environments beyond model-level alignment.

Significance. If the benchmark's cases and verifiers hold up under validation, the work is significant for identifying an overlooked attack surface in modular LLM agents. It supplies empirical evidence of how benign user requests combined with adversarial skill materials can steer agents toward unsafe actions, highlighting the need for skill-aware safety mechanisms. The runnable design and multi-domain coverage are strengths that could aid reproducibility and future extensions.

major comments (2)

[Benchmark Design] Benchmark Design section (around the description of the 155 cases and verifiers): The central claim of consistent unsafe behavior induction depends on the case-specific rule-based verifiers correctly identifying safety failures. However, no details are provided on rule development, validation against human judgments, inter-rater agreement, or checks that rules capture intent/context rather than surface keywords (e.g., file writes or tool calls). This is load-bearing, as overfitting or misclassification could artifactually generate the reported distinct failure patterns across domains and scaffolds.
[Experimental Results] Experimental Results section (around the experiments with CLI agents and model backends): The abstract reports consistent induction of unsafe behavior but omits information on case construction (e.g., independence from tested agents' failure modes), statistical significance, controls for prompt sensitivity, or confounding factors. Without these, the generalizability of the distinct failure patterns across domains, attack methods, and pairings cannot be assessed reliably.

minor comments (2)

[Abstract] The abstract would be clearer if it specified the exact number and identities of CLI agents and model backends tested.
[Benchmark Design] Consider adding a summary table or figure showing the distribution of the 155 cases across the 6 risk domains and 30 safety categories to aid reader comprehension.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript introducing SkillSafetyBench. The comments identify areas where additional methodological transparency will strengthen the presentation of the benchmark and results. We address each major comment below and will incorporate the suggested clarifications in a revised version.

read point-by-point responses

Referee: [Benchmark Design] Benchmark Design section (around the description of the 155 cases and verifiers): The central claim of consistent unsafe behavior induction depends on the case-specific rule-based verifiers correctly identifying safety failures. However, no details are provided on rule development, validation against human judgments, inter-rater agreement, or checks that rules capture intent/context rather than surface keywords (e.g., file writes or tool calls). This is load-bearing, as overfitting or misclassification could artifactually generate the reported distinct failure patterns across domains and scaffolds.

Authors: We agree that greater detail on verifier construction is warranted to support the central claims. The case-specific rules were authored to detect observable violations of the safety categories within each task's defined context, rather than relying on isolated keywords; for example, a rule for unauthorized file access checks both the target path and the absence of required permissions given the workflow state. In the revision we will add a dedicated subsection describing the rule development process, including how rules were derived from the 30 safety categories and 47 tasks. We will also report results from a human validation study on a representative subset of cases, including inter-annotator agreement metrics and alignment between automated verdicts and expert judgments. These additions will directly address concerns about potential misclassification and allow readers to assess the reliability of the observed failure patterns. revision: yes
Referee: [Experimental Results] Experimental Results section (around the experiments with CLI agents and model backends): The abstract reports consistent induction of unsafe behavior but omits information on case construction (e.g., independence from tested agents' failure modes), statistical significance, controls for prompt sensitivity, or confounding factors. Without these, the generalizability of the distinct failure patterns across domains, attack methods, and pairings cannot be assessed reliably.

Authors: We acknowledge the value of these additional details for evaluating generalizability. The 155 cases were constructed from domain-specific risk scenarios and common agent workflow patterns prior to selecting the evaluation scaffolds, ensuring independence from any particular agent's failure modes. In the revised manuscript we will expand the experimental section to include: (1) a description of the case construction methodology and its separation from the tested CLI agents and model backends; (2) statistical significance testing and confidence intervals for the reported unsafe behavior rates; and (3) discussion of controls for prompt sensitivity (e.g., template variations) and other potential confounders such as environment initialization and temperature settings. These changes will provide a clearer basis for interpreting the distinct failure patterns across domains, attack methods, and scaffold-model pairings. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper presents SkillSafetyBench as an empirical benchmark consisting of 155 adversarial cases across tasks, domains, and categories, each paired with a case-specific rule-based verifier. It reports experimental outcomes from running multiple CLI agents and model backends under localized non-user attacks. No mathematical derivations, equations, fitted parameters, predictions, or self-citations appear in the abstract or described structure. The central claim—that such attacks induce unsafe behavior with distinct patterns—is a direct reporting of benchmark results rather than any reduction to inputs by construction, self-definition, or load-bearing self-citation. The evaluation is self-contained as an observational study of agent behavior.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim depends on the assumption that the constructed adversarial cases and rule-based verifiers faithfully represent skill-facing attack surfaces; no free parameters, axioms, or invented entities are invoked in the abstract.

pith-pipeline@v0.9.0 · 5499 in / 1155 out tokens · 39925 ms · 2026-05-13T05:04:22.667508+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear
SkillSafetyBench includes 155 adversarial cases across 47 tasks, 6 risk domains, and 30 safety categories, each evaluated with a case-specific rule-based verifier.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 24 internal anchors

[9]

Advances in Neural Information Processing Systems , volume=

Taskbench: Benchmarking large language models for task automation , author=. Advances in Neural Information Processing Systems , volume=

work page
[11]

Advances in Neural Information Processing Systems , volume=

GTA: a benchmark for general tool agents , author=. Advances in Neural Information Processing Systems , volume=

work page
[19]

Advances in Neural Information Processing Systems , volume=

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments , author=. Advances in Neural Information Processing Systems , volume=

work page
[21]

Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

Creator: Tool creation for disentangling abstract and concrete reasoning of large language models , author=. Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

work page 2023
[22]

Proceedings of the 16th ACM workshop on artificial intelligence and security , pages=

Not what you've signed up for: Compromising real-world llm-integrated applications with indirect prompt injection , author=. Proceedings of the 16th ACM workshop on artificial intelligence and security , pages=

work page
[23]

33rd USENIX Security Symposium (USENIX Security 24) , pages=

Formalizing and benchmarking prompt injection attacks and defenses , author=. 33rd USENIX Security Symposium (USENIX Security 24) , pages=

work page
[24]

Advances in Neural Information Processing Systems , volume=

Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents , author=. Advances in Neural Information Processing Systems , volume=

work page
[27]

Findings of the Association for Computational Linguistics: ACL 2024 , pages=

Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents , author=. Findings of the Association for Computational Linguistics: ACL 2024 , pages=

work page 2024
[28]

Advances in Neural Information Processing Systems , volume=

Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases , author=. Advances in Neural Information Processing Systems , volume=

work page
[35]

Yao, Shunyu and Shinn, Noah and Razavi, Pedram and Narasimhan, Karthik , journal=

work page
[36]

Findings of the Association for Computational Linguistics: NAACL 2025 , pages=

Toolsandbox: A stateful, conversational, interactive evaluation benchmark for llm tool use capabilities , author=. Findings of the Association for Computational Linguistics: NAACL 2025 , pages=

work page 2025
[37]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Appworld: A controllable world of apps and people for benchmarking interactive coding agents , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page
[43]

CREATOR: Tool creation for disentangling abstract and concrete reasoning of large language models,

CREATOR: disentangling abstract and concrete reasonings of large language models through tool creation. CoRR, abs/2305.14318, 2023b. doi: 10.48550 , author=

work page arXiv
[44]

Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V

Benchmarking and defending against indirect prompt injection attacks on large language models , author=. Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 1 , pages=

work page
[46]

34th USENIX Security Symposium (USENIX Security 25) , pages=

\ StruQ \ : Defending against prompt injection with structured queries , author=. 34th USENIX Security Symposium (USENIX Security 25) , pages=

work page
[47]

34th USENIX Security Symposium (USENIX Security 25) , pages=

\ PoisonedRAG \ : Knowledge corruption attacks to \ Retrieval-Augmented \ generation of large language models , author=. 34th USENIX Security Symposium (USENIX Security 25) , pages=

work page
[49]

Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, and 1 others. 2024. Agentharm: A benchmark for measuring harmfulness of llm agents. arXiv preprint arXiv:2410.09024

work page internal anchor Pith review arXiv 2024
[50]

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, and 1 others. 2024. Mle-bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095

work page Pith review arXiv 2024
[51]

Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. 2025. \ StruQ \ : Defending against prompt injection with structured queries. In 34th USENIX Security Symposium (USENIX Security 25), pages 2383--2400

work page 2025
[52]

Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. 2024. Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases. Advances in Neural Information Processing Systems, 37:130185--130213

work page 2024
[53]

Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tram \`e r. 2024. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents. Advances in Neural Information Processing Systems, 37:82895--82920

work page 2024
[54]

Zenghao Duan, Yuxin Tian, Zhiyi Yin, Liang Pang, Jingcheng Deng, Zihao Wei, Shicheng Xu, Yuyao Ge, and Xueqi Cheng. 2026. Skillattack: Automated red teaming of agent skills through attack path refinement. arXiv preprint arXiv:2604.04989

work page internal anchor Pith review Pith/arXiv arXiv 2026
[55]

Ivan Evtimov, Arman Zharmagambetov, Aaron Grattafiori, Chuan Guo, and Kamalika Chaudhuri. 2025. Wasp: Benchmarking web agent security against prompt injection attacks. arXiv preprint arXiv:2504.18575

work page arXiv 2025
[56]

Yunhao Feng, Yifan Ding, Yingshui Tan, Boren Zheng, Yanming Guo, Xiaolong Li, Kun Zhai, Yishan Li, and Wenke Huang. 2026. Skilltrojan: Backdoor attacks on skill-based agent systems. arXiv preprint arXiv:2604.06811

work page internal anchor Pith review Pith/arXiv arXiv 2026
[57]

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not what you've signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM workshop on artificial intelligence and security, pages 79--90

work page 2023
[58]

Yinghan Hou and Zongyou Yang. 2026. Skillsieve: A hierarchical triage framework for detecting malicious ai agent skills. arXiv preprint arXiv:2604.06550

work page internal anchor Pith review Pith/arXiv arXiv 2026
[59]

Yuki Imajuku, Kohki Horie, Yoichi Iwata, Kensho Aoki, Naohiro Takahashi, and Takuya Akiba. 2025. Ale-bench: A benchmark for long-horizon objective-driven algorithm engineering. arXiv preprint arXiv:2506.09050

work page arXiv 2025
[60]

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2024. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974

work page internal anchor Pith review Pith/arXiv arXiv 2024
[61]

Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang, Qin Wang, and Guangsheng Yu. 2026. Sok: Agentic skills--beyond tool use in llm agents. arXiv preprint arXiv:2602.20867

work page internal anchor Pith review arXiv 2026
[62]

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770

work page internal anchor Pith review Pith/arXiv arXiv 2023
[63]

Hwiwon Lee, Ziqi Zhang, Hanxiao Lu, and Lingming Zhang. 2025. Sec-bench: Automated benchmarking of llm agents on real-world software security tasks. arXiv preprint arXiv:2506.11791

work page arXiv 2025
[64]

Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, and 1 others. 2026 a . Skillsbench: Benchmarking how well agent skills work across diverse tasks. arXiv preprint arXiv:2602.12670

work page internal anchor Pith review Pith/arXiv arXiv 2026
[65]

Zhiyuan Li, Jingzheng Wu, Xiang Ling, Xing Cui, and Tianyue Luo. 2026 b . Towards secure agent skills: Architecture, threat taxonomy, and security analysis. arXiv preprint arXiv:2604.02837

work page internal anchor Pith review Pith/arXiv arXiv 2026
[66]

George Ling, Shanshan Zhong, and Richard Huang. 2026. Agent skills: A data-driven analysis of claude skills for extending large language model functionality. arXiv preprint arXiv:2602.08004

work page arXiv 2026
[67]

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, and 1 others. 2023 a . Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688

work page internal anchor Pith review Pith/arXiv arXiv 2023
[68]

Yi Liu, Zhihao Chen, Yanjun Zhang, Gelei Deng, Yuekang Li, Jianting Ning, Ying Zhang, and Leo Yu Zhang. 2026 a . Malicious agent skills in the wild: A large-scale security empirical study. arXiv preprint arXiv:2602.06547

work page arXiv 2026
[69]

Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and 1 others. 2023 b . Prompt injection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499

work page internal anchor Pith review Pith/arXiv arXiv 2023
[70]

Yi Liu, Weizhe Wang, Ruitao Feng, Yao Zhang, Guangquan Xu, Gelei Deng, Yuekang Li, and Leo Zhang. 2026 b . Agent skills in the wild: An empirical study of security vulnerabilities at scale. arXiv preprint arXiv:2601.10338

work page internal anchor Pith review arXiv 2026
[71]

Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. 2024. Formalizing and benchmarking prompt injection attacks and defenses. In 33rd USENIX Security Symposium (USENIX Security 24), pages 1831--1847

work page 2024
[72]

Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Haoping Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, and 1 others. 2025. Toolsandbox: A stateful, conversational, interactive evaluation benchmark for llm tool use capabilities. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1160--1183

work page 2025
[73]

Cheng Qian, Chi Han, Yi Fung, Yujia Qin, Zhiyuan Liu, and Heng Ji. 2023. Creator: Tool creation for disentangling abstract and concrete reasoning of large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6922--6939

work page 2023
[74]

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, and 1 others. 2023. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789

work page internal anchor Pith review Pith/arXiv arXiv 2023
[75]

Ayush RoyChowdhury, Mulong Luo, Prateek Sahu, Sarbartha Banerjee, and Mohit Tiwari. 2024. Confusedpilot: Confused deputy risks in rag-based llms. arXiv preprint arXiv:2408.04870

work page arXiv 2024
[76]

Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J Maddison, and Tatsunori Hashimoto. 2023. Identifying the risks of lm agents with an lm-emulated sandbox. arXiv preprint arXiv:2309.15817

work page internal anchor Pith review Pith/arXiv arXiv 2023
[77]

David Schmotz, Luca Beurer-Kellner, Sahar Abdelnabi, and Maksym Andriushchenko. 2026. Skill-inject: Measuring agent vulnerability to skill file attacks. arXiv preprint arXiv:2602.20156

work page arXiv 2026
[78]

Yongliang Shen, Kaitao Song, Xu Tan, Wenqi Zhang, Kan Ren, Siyu Yuan, Weiming Lu, Dongsheng Li, and Yueting Zhuang. 2024. Taskbench: Benchmarking large language models for task automation. Advances in Neural Information Processing Systems, 37:4540--4574

work page 2024
[79]

Guiyao Tie, Jiawen Shi, Pan Zhou, and Lichao Sun. 2026. Badskill: Backdoor attacks on agent skills via model-in-skill poisoning. arXiv preprint arXiv:2604.09378

work page internal anchor Pith review Pith/arXiv arXiv 2026
[80]

Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. 2024. Appworld: A controllable world of apps and people for benchmarking interactive coding agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pap...

work page 2024
[81]

Chenxi Wang, Zhuoyun Yu, Xin Xie, Wuguannan Yao, Runnan Fang, Shuofei Qiao, Kexin Cao, Guozhou Zheng, Xiang Qi, Peng Zhang, and 1 others. 2026. Skillx: Automatically constructing skill knowledge bases for agents. arXiv preprint arXiv:2604.04804

work page internal anchor Pith review Pith/arXiv arXiv 2026
[82]

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291

work page internal anchor Pith review Pith/arXiv arXiv 2023
[83]

Jize Wang, Zerun Ma, Yining Li, Songyang Zhang, Cailian Chen, Kai Chen, and Xinyi Le. 2024. Gta: a benchmark for general tool agents. Advances in Neural Information Processing Systems, 37:75749--75790

work page 2024
[84]

Zora Zhiruo Wang, Apurva Gandhi, Graham Neubig, and Daniel Fried. 2025. Inducing programmatic skills for agentic tasks. arXiv preprint arXiv:2504.06821

work page arXiv 2025
[85]

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, and 1 others. 2024. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040--52094

work page 2024
[86]

Frank F Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, and 1 others. 2024. Theagentcompany: benchmarking llm agents on consequential real world tasks. arXiv preprint arXiv:2412.14161

work page internal anchor Pith review arXiv 2024
[87]

Renjun Xu and Yang Yan. 2026. Agent skills for large language models: Architecture, acquisition, security, and the path forward. arXiv preprint arXiv:2602.12430

work page internal anchor Pith review arXiv 2026
[88]

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. 2024. -bench : A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045

work page internal anchor Pith review Pith/arXiv arXiv 2024
[89]

Asaf Yehudai, Lilach Eden, Alan Li, Guy Uziel, Yilun Zhao, Roy Bar-Haim, Arman Cohan, and Michal Shmueli-Scheuer. 2025. Survey on evaluation of llm-based agents. arXiv preprint arXiv:2503.16416

work page internal anchor Pith review Pith/arXiv arXiv 2025
[90]

Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. 2025. Benchmarking and defending against indirect prompt injection attacks on large language models. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 1, pages 1809--1820

work page 2025
[91]

Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. 2024. Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents. In Findings of the Association for Computational Linguistics: ACL 2024, pages 10471--10506

work page 2024
[92]

Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, and Yongfeng Zhang. 2024. Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents. arXiv preprint arXiv:2410.02644

work page internal anchor Pith review Pith/arXiv arXiv 2024
[93]

Boyuan Zheng, Michael Y Fatemi, Xiaolong Jin, Zora Zhiruo Wang, Apurva Gandhi, Yueqi Song, Yu Gu, Jayanth Srinivasa, Gaowen Liu, Graham Neubig, and 1 others. 2025. Skillweaver: Web agents can self-improve by discovering and honing skills. arXiv preprint arXiv:2504.07079

work page internal anchor Pith review arXiv 2025
[94]

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, and 1 others. 2023. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854

work page internal anchor Pith review Pith/arXiv arXiv 2023
[95]

Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia. 2025. \ PoisonedRAG \ : Knowledge corruption attacks to \ Retrieval-Augmented \ generation of large language models. In 34th USENIX Security Symposium (USENIX Security 25), pages 3827--3844

work page 2025