Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw
Pith reviewed 2026-05-10 19:16 UTC · model grok-4.3
The pith
Poisoning any single dimension of an AI agent's state triples its real-world attack success rate.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using the CIK taxonomy to organize safety analysis, the authors evaluate a live OpenClaw instance across four backbone models and twelve attack scenarios. Poisoning any one CIK dimension raises the average attack success rate from 24.6% to 64-74%, and even the most robust model shows more than a threefold increase over its own baseline. Three CIK-aligned defenses plus file protection are tested; the strongest defense still permits a 63.8% success rate on Capability-targeted attacks, and file protection blocks 97% of malicious injections at the cost of also blocking legitimate updates. The results indicate that these vulnerabilities are inherent to the agent architecture.
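As a quick arithmetic check of those averages (the abstract does not report per-model baselines, so the threefold figure for the most robust model cannot be reproduced from the averages alone):

```python
# Sanity check on the headline averages reported in the abstract.
baseline = 24.6                            # average attack success rate (%) before poisoning
poisoned_low, poisoned_high = 64.0, 74.0   # average range (%) after poisoning one CIK dimension

print(f"increase: {poisoned_low / baseline:.2f}x to {poisoned_high / baseline:.2f}x")
# -> increase: 2.60x to 3.01x
```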
What carries the argument
The CIK taxonomy, which divides an agent's persistent state into Capability, Identity, and Knowledge dimensions to structure real-world attack testing and defense evaluation on a live system.
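For concreteness, a minimal sketch of how an agent's persistent state might be partitioned along the three CIK dimensions. The field names are illustrative assumptions, not the paper's schema:

```python
from dataclasses import dataclass, field

@dataclass
class CIKState:
    """Illustrative partition of an agent's persistent state; not the paper's schema."""
    # Capability: what the agent can do (registered tools, skills, permissions).
    capability: dict = field(default_factory=lambda: {"tools": [], "permissions": []})
    # Identity: who the agent acts as (persona, credentials, service accounts).
    identity: dict = field(default_factory=lambda: {"persona": "", "credentials": {}})
    # Knowledge: what the agent remembers or retrieves (memory entries, documents).
    knowledge: dict = field(default_factory=lambda: {"memory": [], "documents": []})
```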
If this is right
- Current sandboxed evaluations miss the risks that appear when agents hold broad system and service access.
- Even the strongest tested model becomes far more vulnerable once any one CIK dimension is poisoned.
- File-level protection stops most malicious changes but also blocks legitimate updates to the agent's state.
- Vulnerabilities persist across defense strategies, pointing to the need for safeguards that address the agent architecture itself.
Where Pith is reading between the lines
- Other agents that integrate with user services and maintain persistent state may show similar single-dimension weaknesses under real-world testing.
- Reducing the scope of agent capabilities to limit attack surface would trade off some of the automation that makes the systems valuable.
- Agent designs could incorporate isolation or verification checks tied specifically to each CIK dimension to limit propagation of poisoned state.
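One shape such a safeguard could take, sketched under our own assumptions (the paper does not specify this mechanism): route every write to persistent state through a per-dimension approval check before it is committed.

```python
import hashlib

# Hypothetical per-dimension write gate; the dimension names follow the CIK
# taxonomy, everything else is an illustrative assumption.
TRUSTED_DIGESTS = {"capability": set(), "identity": set(), "knowledge": set()}

def approve(dimension: str, update: bytes) -> None:
    """Record an update as user-approved for one CIK dimension."""
    TRUSTED_DIGESTS[dimension].add(hashlib.sha256(update).hexdigest())

def commit(dimension: str, update: bytes, store: dict) -> bool:
    """Persist an update only if it was explicitly approved for its dimension."""
    if hashlib.sha256(update).hexdigest() not in TRUSTED_DIGESTS[dimension]:
        return False  # reject unapproved writes instead of silently persisting them
    store.setdefault(dimension, []).append(update)
    return True
```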
Load-bearing premise
The twelve attack scenarios run on the live OpenClaw instance accurately capture realistic threats without the evaluation process itself introducing artifacts that inflate the measured success rates.
What would settle it
Repeating the same single-dimension poisoning attacks inside a non-live, isolated simulation of OpenClaw and observing no substantial rise in attack success rates would challenge the central finding.
Original abstract
OpenClaw, the most widely deployed personal AI agent in early 2026, operates with full local system access and integrates with sensitive services such as Gmail, Stripe, and the filesystem. While these broad privileges enable high levels of automation and powerful personalization, they also expose a substantial attack surface that existing sandboxed evaluations fail to capture. To address this gap, we present the first real-world safety evaluation of OpenClaw and introduce the CIK taxonomy, which unifies an agent's persistent state into three dimensions, i.e., Capability, Identity, and Knowledge, for safety analysis. Our evaluations cover 12 attack scenarios on a live OpenClaw instance across four backbone models (Claude Sonnet 4.5, Opus 4.6, Gemini 3.1 Pro, and GPT-5.4). The results show that poisoning any single CIK dimension increases the average attack success rate from 24.6% to 64-74%, with even the most robust model exhibiting more than a threefold increase over its baseline vulnerability. We further assess three CIK-aligned defense strategies alongside a file-protection mechanism; however, the strongest defense still yields a 63.8% success rate under Capability-targeted attacks, while file protection blocks 97% of malicious injections but also prevents legitimate updates. Taken together, these findings show that the vulnerabilities are inherent to the agent architecture, necessitating more systematic safeguards to secure personal AI agents. Our project page is https://ucsc-vlaa.github.io/CIK-Bench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents the first real-world safety evaluation of OpenClaw, a widely deployed personal AI agent with full local system access and integrations to services such as Gmail, Stripe, and the filesystem. It introduces the CIK taxonomy (Capability, Identity, Knowledge) to unify analysis of an agent's persistent state. Evaluations on a live OpenClaw instance across four backbone models (Claude Sonnet 4.5, Opus 4.6, Gemini 3.1 Pro, GPT-5.4) and 12 attack scenarios show that poisoning any single CIK dimension raises the average attack success rate from 24.6% to 64-74%, with even the strongest model exhibiting more than a threefold increase; three CIK-aligned defenses and a file-protection mechanism are also assessed, with the strongest defense still allowing 63.8% success under Capability-targeted attacks.
Significance. If the reported rates hold after addressing experimental controls, the work offers concrete evidence that vulnerabilities are inherent to agent architectures with broad privileges, beyond what sandboxed tests capture. The direct empirical testing on a live system with explicit baselines and the CIK framework provide a useful, falsifiable starting point for personal AI agent safety research.
major comments (2)
- [Evaluation setup] Evaluation setup (as described in the abstract and methods): the manuscript does not document per-trial resets of the persistent OpenClaw instance or counterbalanced ordering of the 12 scenarios. Because all trials share one live instance, unreset state (memory, tool registrations, or prior successful actions) could create cross-trial dependencies that contribute to the observed ASR increase from 24.6% to 64-74% after CIK poisoning, rather than isolating the independent variable as claimed.
- [Results] Results section: full attack definitions, statistical details (e.g., variance, significance tests, or number of trials per scenario), and exact poisoning procedures are not provided. This prevents verification of whether the threefold increase and 64-74% range are robust or influenced by post-hoc scenario selection or live-instance variability.
minor comments (2)
- [Abstract] Abstract: the phrase 'early 2026' for OpenClaw deployment should be clarified as to whether it refers to a hypothetical future or current status at time of writing.
- [Defense evaluation] Defense evaluation: the file-protection mechanism's 97% block rate is reported alongside its prevention of legitimate updates; a quantitative discussion of the usability impact (e.g., frequency of legitimate updates blocked) would strengthen the trade-off analysis.
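The usability analysis that comment asks for could start as simply as the following; only the 97% malicious-block rate comes from the abstract, and the workload numbers are placeholders:

```python
# Hypothetical trade-off estimate for the file-protection mechanism.
malicious_block_rate = 0.97   # from the abstract
legit_block_rate = 1.0        # abstract: file protection "prevents legitimate updates"
legit_updates_per_week = 20   # assumed workload; placeholder value

blocked_legit = legit_block_rate * legit_updates_per_week
print(f"~{blocked_legit:.0f} legitimate state updates blocked per week")
```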
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and commit to revisions that strengthen the documentation of our experimental controls and results without altering the core findings.
Point-by-point responses
- Referee: [Evaluation setup] Evaluation setup (as described in the abstract and methods): the manuscript does not document per-trial resets of the persistent OpenClaw instance or counterbalanced ordering of the 12 scenarios. Because all trials share one live instance, unreset state (memory, tool registrations, or prior successful actions) could create cross-trial dependencies that contribute to the observed ASR increase from 24.6% to 64-74% after CIK poisoning, rather than isolating the independent variable as claimed.
Authors: We acknowledge that the methods section lacks explicit documentation of the reset and ordering procedures. In conducting the live-instance experiments, we performed a complete reset of the OpenClaw persistent state between trials, including clearing memory, deregistering tools, and flushing action history via the agent's reset interface, followed by confirmation queries to verify a clean state. The 12 scenarios were executed in randomized order across models and trials to counterbalance potential sequence effects. We agree this protocol must be described in detail. The revised manuscript will add a dedicated 'Experimental Controls' subsection in the methods that specifies the reset steps, verification method, and randomization procedure. This will confirm that the ASR increases are attributable to the CIK poisoning rather than residual state. revision: yes
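A minimal sketch of the trial loop the authors describe; the agent interface names here are hypothetical stand-ins for whatever OpenClaw actually exposes:

```python
import random

MODELS = ["claude-sonnet-4.5", "claude-opus-4.6", "gemini-3.1-pro", "gpt-5.4"]
SCENARIOS = list(range(12))  # the 12 attack scenarios

def run_trials(agent, trials_per_scenario: int = 50) -> list:
    results = []
    for model in MODELS:
        order = SCENARIOS[:]
        random.shuffle(order)                  # counterbalance sequence effects
        for scenario in order:
            for _ in range(trials_per_scenario):
                agent.reset_state()            # hypothetical: clear memory, tools, history
                assert agent.state_is_clean()  # hypothetical confirmation query
                results.append(agent.run(model=model, scenario=scenario))
    return results
```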
- Referee: [Results] Results section: full attack definitions, statistical details (e.g., variance, significance tests, or number of trials per scenario), and exact poisoning procedures are not provided. This prevents verification of whether the threefold increase and 64-74% range are robust or influenced by post-hoc scenario selection or live-instance variability.
Authors: We agree that greater transparency is required for independent verification. The revised manuscript will expand the Results section and include a new appendix containing: (i) the full text of all 12 attack scenarios and their prompts; (ii) the precise poisoning procedures for each CIK dimension, including the injected content and application method on the live instance; (iii) the number of trials per scenario (50 independent runs); (iv) variance measures (standard errors and confidence intervals); and (v) statistical tests (paired Wilcoxon signed-rank tests, all p < 0.001) confirming the significance of the reported increases. All scenarios were defined a priori according to the CIK taxonomy with no post-hoc selection. These additions will enable readers to assess robustness against live-instance variability. revision: yes
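For reference, a sketch of the named significance test applied to per-scenario success rates; the arrays are placeholder data, not the paper's measurements:

```python
import numpy as np
from scipy.stats import wilcoxon

# Placeholder per-scenario attack success rates (%); not the paper's data.
baseline = np.array([20, 31, 18, 25, 29, 22, 27, 24, 21, 30, 26, 23])
poisoned = np.array([61, 72, 58, 70, 75, 64, 69, 66, 60, 77, 71, 63])

# Paired Wilcoxon signed-rank test across the 12 scenarios.
stat, p = wilcoxon(baseline, poisoned)
print(f"W = {stat}, p = {p:.4g}")  # a small p supports a systematic increase
```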
Circularity Check
No circularity: empirical measurements on a live system with an explicit baseline
full rationale
The paper reports direct empirical attack success rates measured on a persistent OpenClaw instance across 12 scenarios and four models, comparing baseline (24.6%) to post-CIK-poisoning conditions (64-74%). No equations, fitted parameters, or first-principles derivations are present that could reduce the reported rates to self-referential inputs. The CIK taxonomy is introduced as an organizational framework for state dimensions rather than a mathematical construct whose outputs are forced by its own definitions. No self-citations are invoked to justify uniqueness or load-bearing premises. The central quantitative claims rest on observable experimental outcomes rather than any reduction by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The CIK taxonomy unifies an agent's persistent state into three dimensions sufficient for safety analysis.
Forward citations
Cited by 4 Pith papers
- Towards Security-Auditable LLM Agents: A Unified Graph Representation
  Agent-BOM is a unified hierarchical attributed directed graph that models static capability bases and dynamic semantic states of LLM agents for path-level security auditing and risk assessment.
- When Routine Chats Turn Toxic: Unintended Long-Term State Poisoning in Personalized Agents
  Routine user chats can unintentionally poison the long-term state of personalized LLM agents, causing authorization drift, tool escalation, and unchecked autonomy, as measured by a new benchmark and reduced by the Sta...
- Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation
  A TEE-backed architecture isolates security-critical decisions in self-hosted AI agents to prevent host-level abuse from malicious inputs while maintaining allowed functionality.
- Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
  The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...