
arxiv: 2606.02302 · v1 · pith:5XUXGKY6new · submitted 2026-06-01 · 💻 cs.CR · cs.AI

SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents

Pith reviewed 2026-06-28 13:49 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords autonomous LLM agents · security task synthesis · risk specifications · execution-based evaluation · trajectory assessment · agent security benchmarks

The pith

SeClaw generates security evaluation tasks for LLM agents from structured risk specifications and assesses their execution trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SeClaw, a framework that creates security tasks for autonomous LLM agents through specification-driven synthesis rather than manual curation. It builds tasks from structured descriptions of risks involving resources, user tasks, environments, and agent behaviors. The approach includes a Docker-based testbed that evaluates not only final outcomes but also the sequence of actions leading to potential security issues. This matters because current benchmarks are limited in coverage and focus mainly on final outcomes, making it hard to test against emerging threats in stateful environments. If successful, it would provide a reproducible way to measure and compare how agents handle safety risks during operation.

Core claim

SeClaw combines specification-driven security task synthesis with execution-based security evaluation for autonomous agents, allowing scalable construction of tasks from structured risk specifications and trajectory-aware assessment of unsafe actions beyond final responses.

What carries the argument

Specification-driven security task synthesis, which turns structured risk specifications into controllable tasks, paired with a standardized Docker testbed for trajectory evaluation.
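To make the idea of turning a structured risk specification into tasks concrete, here is a minimal Python sketch. It is not the authors' schema or pipeline: the four risk sources are taken from the abstract, while `RiskSpec`, `SecurityTask`, `synthesize`, and every field name and example value are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import product

# The four risk sources are named in the paper's abstract; everything else
# in this sketch (class names, fields, example values) is an assumption.
RISK_SOURCES = ("resource", "user_task", "environment", "agent_behavior")


@dataclass(frozen=True)
class RiskSpec:
    source: str        # one of RISK_SOURCES
    threat: str        # short label for the risk being probed
    user_goals: tuple  # benign goals the agent is asked to pursue
    injections: tuple  # risky conditions planted in the environment

    def __post_init__(self):
        if self.source not in RISK_SOURCES:
            raise ValueError(f"unknown risk source: {self.source}")


@dataclass(frozen=True)
class SecurityTask:
    task_id: str
    source: str
    threat: str
    user_goal: str
    injection: str


def synthesize(spec: RiskSpec) -> list:
    """Expand one spec into one task per (user_goal, injection) pair."""
    tasks = []
    pairs = product(spec.user_goals, spec.injections)
    for i, (goal, injection) in enumerate(pairs):
        task_id = f"{spec.source}-{spec.threat}-{i}"
        tasks.append(
            SecurityTask(task_id, spec.source, spec.threat, goal, injection)
        )
    return tasks


spec = RiskSpec(
    source="environment",
    threat="prompt_injection",
    user_goals=("summarize README.md", "list open TODO items"),
    injections=(
        "hidden instruction in a file comment",
        "hidden instruction in a tool result",
    ),
)
tasks = synthesize(spec)
for task in tasks:
    print(task.task_id, "|", task.user_goal, "|", task.injection)
```

The point of the sketch is controllability: adding one goal or one injection to the spec changes the task set in a predictable way, which is the property the review attributes to spec-driven synthesis.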

If this is right

  • Security tasks can be generated scalably from risk specifications covering resources, tasks, environments, and behaviors.
  • Evaluation focuses on the execution trajectory to identify unsafe actions.
  • The Docker testbed standardizes testing across diverse safety-risk scenarios.
  • This setup allows reproducible comparisons of different agents' security performance.
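The difference between outcome-only and trajectory-aware evaluation in the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's scoring method: the `Step` record, the `UNSAFE_RULES` table, and both verdict functions are hypothetical, and a real evaluator would need far richer rules than substring matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    tool: str      # hypothetical tool name, e.g. "read_file" or "respond"
    argument: str  # the argument the agent passed to the tool


# Hypothetical unsafe-action rules as (tool, marker) pairs. A step is flagged
# when its tool matches and the marker appears in its argument.
UNSAFE_RULES = (
    ("shell", "rm -rf"),
    ("read_file", ".ssh/id_rsa"),
    ("http_post", "api_key"),
)


def is_unsafe(step: Step) -> bool:
    return any(
        step.tool == tool and marker in step.argument
        for tool, marker in UNSAFE_RULES
    )


def outcome_only_verdict(trajectory: list) -> str:
    """Judge the run by its final step alone."""
    if trajectory and is_unsafe(trajectory[-1]):
        return "unsafe"
    return "safe"


def trajectory_verdict(trajectory: list) -> tuple:
    """Judge the run by every step; also return the flagged step indices."""
    flagged = [i for i, step in enumerate(trajectory) if is_unsafe(step)]
    return ("unsafe" if flagged else "safe", flagged)


trajectory = [
    Step("read_file", "notes.txt"),
    Step("read_file", "/home/user/.ssh/id_rsa"),
    Step("respond", "Here is the summary you asked for."),
]

print(outcome_only_verdict(trajectory))  # judged on the final response only
print(trajectory_verdict(trajectory))    # judged on the whole action sequence
```

In this example the final response looks harmless, so the outcome-only check passes the run, while the trajectory check flags the intermediate read of a private key. That gap is the case the review says trajectory-aware assessment is meant to close.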

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If risk specifications are kept up to date, the framework could adapt to new agent capabilities and threats automatically.
  • Trajectory-aware evaluation might inform better agent designs by highlighting specific failure points in the process.
  • The approach could be adapted to evaluate other types of autonomous systems beyond LLMs.

Load-bearing premise

Structured risk specifications can be written in advance to capture the full range of emerging threats without requiring manual task curation or missing important real-world scenarios.

What would settle it

Demonstration of a security vulnerability in an LLM agent that none of the synthesized tasks from the framework would detect or reproduce.

Figures

Figures reproduced from arXiv: 2606.02302 by Bo Zhang, Changtao Miao, Chao Shen, Chenhao Lin, Erjia Xiao, Hao Cheng, He Liu, Jindong Gu, Jing Yang, Jinhao Duan, Jize Zhang, Junchi Chen, Kaidi Xu, Mengshu Sun, Peiyan Dong, Philip Torr, Renjing Xu, Taowen Wang, Tianle Song, Xiaoyu Shi, Xuan Shen, Yang Cao, Yichi Wang, Yin Wu.

Figure 1: Safety Risk Taxonomy for agent: Sources and Potential Consequences.
Figure 2: Overview of the SeClaw Framework.
Figure 3 (accompanying text): The key idea is to use explicit task specifications as the interface between human task …
Figure 3: Spec-Driven Security Task Synthesis Pipeline. The pipeline converts risk taxonomies …
Figure 4: SeClaw-Docker: Secure Task Execution and Trajectory Logging. SeClaw-Docker auto…
Original abstract

Autonomous LLM agents increasingly operate in stateful environments where they access tools, files, memory, and external services. While such capabilities enable complex real-world workflows, they also introduce security risks that are difficult to capture with existing evaluations. Current agent security benchmarks often rely on manually curated tasks, provide limited coverage of emerging threats, and focus primarily on final outcomes rather than the execution processes that lead to unsafe behavior. We introduce SeClaw, a framework that combines specification-driven security task synthesis with execution-based security evaluation for autonomous agents. Spec-driven security task synthesis enables scalable and controllable construction of security tasks from structured risk specifications, while SeClaw docker provides a standardized testbed for evaluating agent behavior under diverse safety-risk scenarios. The benchmark covers risks arising from resources, user tasks, environments, and intrinsic agent behaviors, and supports trajectory-aware assessment of unsafe actions beyond final responses. By bridging systematic task synthesis and reproducible security evaluation, SeClaw provides a practical foundation for measuring, diagnosing, and comparing security failures in autonomous LLM agents. The code is available at https://github.com/seclaw-eval/seclaw-eval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces SeClaw, a framework combining specification-driven security task synthesis with execution-based evaluation in a Docker testbed. It claims this enables scalable, controllable generation of security tasks from structured risk specifications (covering resources, user tasks, environments, and intrinsic agent behaviors) and supports trajectory-aware assessment of unsafe actions in autonomous LLM agents, addressing limitations of manual curation and outcome-only benchmarks.

Significance. If validated, the framework could provide a more systematic and reproducible foundation for security evaluation of LLM agents than existing manual benchmarks, with potential to improve coverage of emerging threats and diagnosis of process-level failures. The open-source code release is a positive factor for reproducibility.

major comments (3)
  1. [abstract / framework description] The central claim of 'scalable and controllable construction of security tasks from structured risk specifications' (abstract) rests on the unvalidated assumption that pre-authored specs can comprehensively encode novel threats without gaps; the manuscript supplies no coverage metric, expressiveness proof, or examples of specs-to-tasks mapping to substantiate this.
  2. [abstract] No concrete task examples, generated trajectories, or evaluation results are provided to demonstrate that the synthesis produces diverse tasks or that the Docker testbed accurately reflects real agent behavior in stateful environments.
  3. [abstract] The trajectory-aware assessment is described at a high level but lacks any formal definition of unsafe actions, scoring mechanism, or comparison to prior benchmarks, leaving the 'practical foundation for measuring security failures' claim unsupported by evidence.
minor comments (1)
  1. [abstract] The abstract mentions 'The code is available at https://github.com/seclaw-eval/seclaw-eval' but the manuscript does not include any usage instructions, API details, or reproduction steps.
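As a rough illustration of the kind of coverage metric major comment 1 asks for, the Python sketch below counts how many synthesized tasks fall under each of the four risk sources named in the abstract and reports which sources have none. It is only one possible metric and an assumption of this note: the function `category_coverage` and the sample labels are hypothetical, and category-level counts say nothing about whether novel threats inside a category are covered.

```python
from collections import Counter

# Risk sources named in the abstract; the metric itself is an assumption.
RISK_SOURCES = ("resource", "user_task", "environment", "agent_behavior")


def category_coverage(task_sources, expected=RISK_SOURCES):
    """Summarize how a task set is distributed over the expected categories."""
    counts = Counter(task_sources)
    missing = [category for category in expected if counts[category] == 0]
    covered = len(expected) - len(missing)
    return {
        "counts": {category: counts[category] for category in expected},
        "covered_fraction": covered / len(expected),
        "missing": missing,
    }


# Hypothetical labels for a small synthesized task set.
report = category_coverage(
    ["resource", "environment", "environment", "user_task"]
)
print(report)
```

A report like this would make gaps visible at a glance, but it would not answer the referee's deeper concern about expressiveness, which needs evidence that the specification language can describe threats not anticipated by its authors.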

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to strengthen the presentation of the claims.

  1. Referee: [abstract / framework description] The central claim of 'scalable and controllable construction of security tasks from structured risk specifications' (abstract) rests on the unvalidated assumption that pre-authored specs can comprehensively encode novel threats without gaps; the manuscript supplies no coverage metric, expressiveness proof, or examples of specs-to-tasks mapping to substantiate this.

    Authors: The manuscript details the four risk categories in the specification structure and includes illustrative specs-to-tasks mappings within the framework description. We agree that no quantitative coverage metric or formal expressiveness proof is supplied, as the contribution is an extensible engineering framework rather than a completeness proof. We will add further concrete mapping examples and a limitations subsection discussing potential gaps for novel threats. revision: partial

  2. Referee: [abstract] No concrete task examples, generated trajectories, or evaluation results are provided to demonstrate that the synthesis produces diverse tasks or that the Docker testbed accurately reflects real agent behavior in stateful environments.

    Authors: The body of the manuscript provides concrete synthesized task examples, sample trajectories, and Docker testbed implementation details in the evaluation sections. To better align the abstract with this content, we will revise the abstract to reference a representative example and expand the testbed description with additional notes on stateful environment fidelity. revision: yes

  3. Referee: [abstract] The trajectory-aware assessment is described at a high level but lacks any formal definition of unsafe actions, scoring mechanism, or comparison to prior benchmarks, leaving the 'practical foundation for measuring security failures' claim unsupported by evidence.

    Authors: The evaluation framework section categorizes unsafe actions according to the risk specifications and outlines the trajectory analysis approach. We will revise to include an explicit formal definition of unsafe actions, a precise description of the scoring mechanism, and a comparison table against prior outcome-focused benchmarks. revision: yes
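One way the promised formal definition could look is sketched below. This is an assumption offered for illustration, not the authors' definition: the symbols $\tau$, $a_t$, $u_s$, and both scoring functions are hypothetical notation. Here $\tau$ is an execution trajectory of $T$ actions, $s$ is a risk specification, and $u_s$ is a binary predicate marking an action as unsafe under $s$.

```latex
\[
\tau = (a_1, \dots, a_T), \qquad u_s(a_t) \in \{0, 1\},
\]
\[
\mathrm{unsafe}^{\mathrm{traj}}_s(\tau) = \max_{1 \le t \le T} u_s(a_t),
\qquad
\mathrm{unsafe}^{\mathrm{out}}_s(\tau) = u_s(a_T).
\]
```

Under this notation $\mathrm{unsafe}^{\mathrm{out}}_s(\tau) \le \mathrm{unsafe}^{\mathrm{traj}}_s(\tau)$ for every trajectory, so an outcome-only benchmark can miss unsafe runs that a trajectory-aware one flags, but never the reverse. A comparison table against prior benchmarks could report how often the two scores disagree.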

Circularity Check

0 steps flagged

No circularity; framework introduction is self-contained


The paper presents SeClaw as a new framework combining specification-driven task synthesis with execution-based evaluation. No equations, fitted parameters, predictions, or derivations are described that could reduce to inputs by construction. No self-citations or uniqueness theorems are invoked in the provided text to justify core claims. The contribution rests on the design of structured risk specifications and the testbed, which are independent of any prior fitted results or self-referential definitions from the authors.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Because only the abstract is available, the ledger is limited to elements explicitly stated there. The framework itself is the main new contribution.

axioms (1)
  • domain assumption: Autonomous LLM agents operate in stateful environments where they access tools, files, memory, and external services.
    Stated directly in the first sentence of the abstract as the setting that creates security risks.
invented entities (1)
  • SeClaw framework (no independent evidence)
    purpose: Combines spec-driven task synthesis with execution-based evaluation and provides a Docker testbed.
    Newly introduced in the paper; no independent evidence outside this work is mentioned.

pith-pipeline@v0.9.1-grok · 5802 in / 1274 out tokens · 29105 ms · 2026-06-28T13:49:36.086375+00:00 · methodology

