pith. machine review for the scientific record.

arxiv: 2605.04808 · v1 · submitted 2026-05-06 · 💻 cs.AI

Recognition: unknown

DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 16:07 UTC · model grok-4.3

classification 💻 cs.AI
keywords AI agents · red-teaming · security evaluation · simulation environments · autonomous attacker · prompt injection · vulnerability assessment · benchmark dataset

The pith

DTap supplies the first set of controllable simulations and an autonomous attacker to expose vulnerabilities in AI agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes DTap as a platform offering over 50 simulation environments that replicate systems such as Google Workspace, PayPal, and Slack across 14 domains. It further introduces DTap-Red, an autonomous red-teaming agent that explores injection vectors, including prompts, tools, skills, environments, and their combinations, to discover effective attack strategies for different malicious goals. The authors use this setup to curate DTap-Bench, a dataset of attack instances paired with automatic judges, and run evaluations on multiple AI agents to map out vulnerability patterns. A sympathetic reader would care because AI agents now execute long-horizon tasks in untrusted settings, where manipulation can cause data leaks or unauthorized actions, yet prior tools lacked the realism and scale needed for thorough assessment.

Core claim

We introduce the DecodingTrust-Agent Platform (DTap), the first controllable and interactive red-teaming platform for AI agents, spanning 14 real-world domains and over 50 simulation environments that replicate widely used systems such as Google Workspace, PayPal, and Slack. To scale the risk assessment of agents in DTap, we further propose DTap-Red, the first autonomous red-teaming agent that systematically explores diverse injection vectors (e.g., prompt, tool, skill, environment, combinations) and autonomously discovers effective attack strategies tailored to varying malicious goals. Using DTap-Red, we curate DTap-Bench, a large-scale red-teaming dataset comprising high-quality instances.

What carries the argument

DTap is the collection of controllable simulation environments replicating real tools and workflows; DTap-Red is the autonomous agent that systematically searches across prompt, tool, skill, and environment injection vectors to locate successful attacks.
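The shape of that search can be sketched as a loop over injection vectors and their combinations, keeping whichever placements succeed most often. This is a minimal illustrative sketch, not DTap-Red's actual interface: every name (`VECTORS`, `run_attack`, `search_strategies`) is hypothetical, and the attack executor is a random stub standing in for a real victim-agent rollout in a simulated environment.

```python
import itertools
import random

# Hypothetical injection vectors, following the paper's list (prompt, tool,
# skill, environment, and combinations thereof).
VECTORS = ["prompt", "tool", "skill", "environment"]

def run_attack(goal, vectors, rng):
    # Stand-in for executing the victim agent in a simulated environment
    # with injections placed along the chosen vectors. Here a random stub:
    # more vectors -> higher chance of success, purely for illustration.
    return rng.random() < 0.1 * len(vectors)

def search_strategies(goal, max_combo=2, trials=20, seed=0):
    """Enumerate vector combinations, estimate each one's success rate,
    and return the successful strategies ranked by empirical ASR."""
    rng = random.Random(seed)
    successes = []
    for r in range(1, max_combo + 1):
        for combo in itertools.combinations(VECTORS, r):
            wins = sum(run_attack(goal, combo, rng) for _ in range(trials))
            if wins:
                successes.append((combo, wins / trials))
    return sorted(successes, key=lambda s: -s[1])

strategies = search_strategies("exfiltrate-contacts")
```

A real system would replace the stub with agent rollouts and add memory of past attempts, but the enumerate-evaluate-rank skeleton is the same.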

If this is right

  • Evaluations across backbone models and domains reveal systematic vulnerability patterns in current AI agents.
  • The platform supports testing against varied security policies and risk categories using a single controllable setup.
  • DTap-Bench supplies paired attack instances and automatic validators that can serve as a reusable resource for defense research.
  • Attack strategies found by DTap-Red give concrete examples that can guide the design of more robust next-generation agents.
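The "automatic judges" in DTap-Bench are described as verifying concrete outcomes in the environment rather than grading transcripts. A minimal sketch of that idea, with all names (`SimEnv`, `exfiltration_judge`) hypothetical rather than drawn from the paper's code:

```python
from dataclasses import dataclass, field

@dataclass
class SimEnv:
    # Toy simulated email backend: the judge inspects this state,
    # not the agent's chat transcript.
    sent_emails: list = field(default_factory=list)

    def send_email(self, to, body):
        self.sent_emails.append({"to": to, "body": body})

def exfiltration_judge(env, attacker_addr):
    """Attack counts as successful only if mail actually reached
    the attacker-controlled address in the final environment state."""
    return any(m["to"] == attacker_addr for m in env.sent_emails)

env = SimEnv()
env.send_email("colleague@example.com", "weekly report")
assert not exfiltration_judge(env, "attacker@evil.example")
env.send_email("attacker@evil.example", "contact list dump")
assert exfiltration_judge(env, "attacker@evil.example")
```

Judging on environment state is what lets a benchmark like this reduce false positives: an agent that merely discusses an attack, without producing the harmful side effect, is not scored as compromised.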

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar simulation-plus-autonomous-attacker designs could be adapted to evaluate other interactive AI systems such as multi-agent workflows or tool-augmented chat interfaces.
  • Iterative use of DTap-Red discoveries inside agent training loops might reduce the gap between simulated and real-world robustness without requiring manual red-teaming.
  • If the platform's environment library grows, it could become a shared testbed that enables consistent cross-model comparisons of agent security.

Load-bearing premise

The simulated environments replicate the dynamic tool-using and user-interaction behavior of real systems closely enough that attacks discovered inside them will succeed against the same agents when connected to actual external services.

What would settle it

A direct comparison test in which an attack strategy produced by DTap-Red succeeds inside the simulation but fails to manipulate the identical agent when the simulation is replaced by live connections to Google Workspace or Slack.
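The settling experiment is a paired comparison, which could be harnessed roughly as follows. This is a sketch under assumptions: `paired_transfer_check` and the backend stubs are invented for illustration and correspond to nothing in the paper.

```python
def paired_transfer_check(attack, agent, sim_backend, live_backend):
    """Run the same attack strategy against the same agent twice:
    once on simulated backends, once on live connections."""
    sim_success = attack(agent, sim_backend)
    live_success = attack(agent, live_backend)
    if sim_success and not live_success:
        return "sim-only"      # evidence against the fidelity premise
    if sim_success and live_success:
        return "transfers"     # attack carries over to the real system
    return "inconclusive"      # attack failed even in simulation

# Toy illustration with stub backends that record whether a payload lands.
attack = lambda agent, backend: backend["injectable"]
print(paired_transfer_check(attack, None,
                            {"injectable": True}, {"injectable": False}))
```

Aggregated over many attack instances, the "sim-only" rate is the quantity of interest: a high rate would mean DTap-Red's discoveries are artifacts of the simulation rather than transferable vulnerabilities.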

Figures

Figures reproduced from arXiv: 2605.04808 by Bo Li, Chaowei Xiao, Chejian Xu, Chengquan Guo, Dawn Song, Haibo Tong, Jiawei Zhang, Mintong Kang, Percy Liang, Qichang Liu, Sanmi Koyejo, Tianneng Shi, Wenbo Guo, Xiaogeng Liu, Xun Liu, Yuzhou Nie, Zhaorun Chen.

Figure 1: DECODINGTRUST-AGENT PLATFORM provides a comprehensive security evaluation for AI agents with advanced red-teaming, covering both indirect and direct threat models.
Figure 2: Overview of DECODINGTRUST-AGENT PLATFORM (DTAP). DTAP is the first controllable and interactive simulation platform for advanced red-teaming for AI agents. It spans 14 high-stakes domains (e.g., finance, workflows, coding, customer service) and over 50 widely used environments (e.g., Google Workspace, PayPal, Slack), replicating their real-world counterparts while providing realistic agent interfaces.
Figure 3: Overview of red-teaming agent DTAP-RED. Given a malicious goal, DTAP-RED first retrieves past experiences from a multi-layer memory module and invokes relevant attack algorithms and injection strategies from its attack skill library. It then performs diverse injections to induce the victim agent to achieve the malicious goal.
Figure 4: Dataset statistics of DTAP-BENCH. Left: Distribution of all tasks by domain; Center: Distribution of attack injection types (or combinations) across red-teaming tasks; Right: Mean and confidence interval of the minimum number of tool calls across tasks by domain.
Figure 5: TSNE of the agent trajectories (Claude Code) of DTAP and AgentDojo.
Figure 6: Per-category results in the CRM domain. Left: BSR across different task categories; Center: ASR across different risk categories under the indirect threat model; Right: ASR across different risk categories under the direct threat model.
Figure 7: Mean and confidence interval of the minimum number of tool calls per task for each domain.
Figure 8: Statistics of red-teaming efficiency of DTAP-RED. Left: Average number of iterations required to achieve 100% ASR for each risk category in the CRM domain; Right: Attack success rate as a function of optimization iterations in the CRM domain.
Figure 9: ASR of different attack skills during end-to-end red-teaming.
Figure 10: Left: Benign task success rates across different task categories in the workflow domain; Center: Attack success rates across different risk categories under the indirect threat model; Right: Attack success rates across different risk categories under the direct threat model.
Figure 11: Trade-off between agent vulnerability and benign utility in the workflow domain.
Figure 12: Left: Comparison of attack vulnerability in the workflow domain across different agent frameworks; Center: Attack success rates across different risk categories or attack settings; Right: Distributional summary of workflow attack patterns.
Figure 13: Average attack success rate (ASR) across major workflow environments under the direct threat model.
Figure 14: Environment injection used in the indirect-threat workflow case study.
Figure 15: Attack trajectory of the indirect-threat case in the workflow domain.
Figure 16: Attack outcome of the indirect-threat case in the workflow domain.
Figure 17: Victim agent definition for the direct-threat workflow case study.
Figure 18: Direct malicious prompt used in the workflow case study.
Figure 19: Case study of the direct threat model in the workflow domain.
Figure 20: Final attack outcome produced by the victim trajectory in Fig. 19 under the direct threat model.
Figure 21: Left: Benign task success rates across different task categories in the CRM domain; Center: Attack success rates across different risk categories under the indirect threat model; Right: Attack success rates across different risk categories under the direct threat model.
Figure 22: Trade-off between agent vulnerability and benign utility in the CRM domain.
Figure 23: Left: Comparison of attack vulnerability in the CRM domain across different agent frameworks (we evaluate the ASR with their latest built-in models); Center: Attack success rates across different injection types and their combinations; Right: Task distribution across injection types and their combinations.
Figure 24: The attacker injects a stealthy malicious instruction by appending it as an action item.
Figure 25: An example of the victim agent trajectory.
Figure 26: An example attack outcome for the Data Exfiltration risk category under the indirect threat model.
Figure 27: The attacker injects a flip-attack jailbreak to induce the agent to append malicious weaponization instructions to Salesforce CRM records.
Figure 28: An example of the victim trajectory.
Figure 29: An example attack outcome for the General AI Restrictions (e.g., weaponization) risk category under the direct threat model.
Figure 30: Left: Benign task success rates across different task categories in the customer service domain; Center: Attack success rates across different risk categories under the indirect threat model; Right: Attack success rates across different risk categories under the direct threat model.
Figure 31: Left: Comparison of attack vulnerability in the customer service domain across different agent frameworks (we evaluate the ASR with their latest built-in models); Center: Attack success rates across different injection types and their combinations; Right: Task distribution across injection types and their combinations.
Figure 32: Trade-off between agent vulnerability and benign utility in the customer service domain.
Figure 33: Agent configuration for the Financial Fraud case study under the indirect threat model.
Figure 34: Attack configuration for the Financial Fraud case study under the indirect threat model.
Figure 35: Victim agent trajectory for the Financial Fraud case study under the indirect threat model.
Figure 36: Attack outcome for the Financial Fraud case study (Fig. 33 and 34).
Figure 37: Agent configuration for the PII Disclosure case study under the direct threat model.
Figure 38: Direct prompt-injection payload for the PII Disclosure case study (Fig. 37).
Figure 39: Victim agent trajectory for the PII Disclosure case study under the direct threat model.
Figure 40: Attack outcome for the PII Disclosure case study (Fig. 37).
Figure 41: Left: Benign task success rates across different task categories in the travel domain; Center: Attack success rates across different risk categories under the indirect threat model; Right: Attack success rates across different risk categories under the direct threat model.
Figure 42: Trade-off between agent vulnerability and benign utility in the travel domain.
Figure 43: Left: Comparison of attack vulnerability in the travel domain across different agent frameworks (we evaluate the ASR with their latest built-in models); Center: Attack success rates across different injection types and their combinations; Right: Task distribution across injection types and their combinations.
Figure 44: Travel agent configuration for the case study of the indirect threat model.
Figure 45: Environment, skill, and tool injections of the case study under the indirect threat model.
Figure 46: An example of a victim trajectory for the Data Exfiltration risk category under the indirect threat model.
Figure 47: Travel agent configuration for the case study of the direct threat model.
Figure 48: An example of a victim trajectory for the Sybil Reviews Manipulation risk category.
Figure 49: Left: Benign task success rates across different task categories in the code domain; Center: Attack success rates across different risk categories under the indirect threat model; Right: Attack success rates across different risk categories under the direct threat model.
Figure 50: Trade-off between agent vulnerability and benign utility in the code domain.
Figure 51: Left: Comparison of attack vulnerability in the code domain across different agent frameworks; Center: Attack success rates across different injection types and their combinations; Right: Task distribution across injection types and their combinations.
Figure 52: Victim trajectory for the direct attack case study (Delete Sensitive Files) in the code domain.
Figure 53: Environment and tool injections of the case study for the indirect threat model in the code domain.
Figure 54: Victim trajectory for the indirect attack case study (Add Risky Alias) in the code domain.
Figure 55: Left: Benign task success rates across different task categories in the browser domain; Center: Attack success rates across different risk categories under the indirect threat model; Right: Attack success rates across different risk categories under the direct threat model.
Figure 56: Trade-off between agent vulnerability and benign utility in the browser domain.
Figure 57: Left: Comparison of attack vulnerability in the browser domain across different agent frameworks (we evaluate the ASR with their latest built-in models); Center: Attack success rates across different injection types and their combinations; Right: Task distribution across injection types and their combinations.
Figure 58: Tool injections for the indirect threat model case study in the browser domain.
Figure 59: An example of a victim trajectory for the data exfiltration risk category under the indirect threat model.
Figure 60: An example of the attack outcome for the data exfiltration risk category under the indirect threat model.
Figure 61: Tool injections for the direct threat model case study in the browser domain.
Figure 62: An example of a victim trajectory for the spam bot risk category under the direct threat model.
Figure 63: An example of the attack outcome for the spam bot risk category under the direct threat model.
Figure 64: Left: Benign task success rates across different task categories in the research domain; Center: Attack success rates across different risk categories under the indirect threat model; Right: Attack success rates across different risk categories under the direct threat model.
Figure 65: Trade-off between agent vulnerability and benign utility in the research domain.
Figure 66: Left: Comparison of attack vulnerability in the research domain across different agent frameworks; Center: Attack success rates across different injection types and their combinations; Right: Task distribution across injection types and their combinations.
Figure 67: Victim trajectory for the direct attack case study (Nuclear Weapons) in the research domain.
Figure 68: Tool injection of the case study for the indirect threat model in the research domain.
Figure 69: Victim trajectory for the indirect attack case study (Arms-Control Circumvention) in the research domain.
Figure 70: Left: Benign task success rates across different task categories in the OS-filesystem domain; Center: Attack success rates across different risk categories under the indirect threat model; Right: Attack success rates across different risk categories under the direct threat model.
Figure 71: Trade-off between agent vulnerability and benign utility in the OS-filesystem domain.
Figure 72: Left: Comparison of attack vulnerability in the OS-filesystem domain across different agent frameworks (we evaluate the ASR with their latest built-in models); Center: Attack success rates across different injection types and their combinations; Right: Task distribution across injection types in the OS-filesystem domain.
Figure 73: Victim trajectory for the direct attack case study (Security & Integrity Failures) in the OS-filesystem domain.
Figure 74: Attack outcome for the direct case study.
Figure 75: Victim trajectory for the indirect attack case study (Security & Integrity Failures) in the OS-filesystem domain.
Figure 76: Attack outcome for the indirect case study.
Figure 77: Indirect ASR (%) by risk category across models and frameworks on Windows.
Figure 78: Capability–security tradeoff on Windows: benign success rate (BSR) vs. attack success rate (ASR).
Figure 79: Direct and indirect ASR (%) across all model–framework configurations on Windows.
Figure 80: Victim trajectory for the indirect attack on Malware-13 (GPT-5.4).
Figure 81: Attack outcome for Malware-13: the Windows Task Scheduler shows the “DailyCleanup” task.
Figure 82: Victim trajectory for the indirect image-injection case study.
Figure 83: The attacker uses a social engineering pretext (QA monitoring deployment).
Figure 84: Victim trajectory for the direct surveillance attack (SV-2, GPT-5.2).
Figure 85: Attack outcome for SV-2. (a) The compromised VM shows the surveillance script, captured screenshots, and HTTP exfiltration to the C2 server. (b) The attacker’s exfiltration listener confirms receipt of stolen data: hostname, username, timestamp, and a 121 KB desktop screenshot.
Figure 86: Left: Benign task success rates across different task categories in the finance domain; Center: Attack success rates across different risk categories under the indirect threat model; Right: Attack success rates across different risk categories under the direct threat model.
Figure 87: Trade-off between agent vulnerability and benign utility in the finance domain.
Figure 88: Left: Comparison of attack vulnerability in the finance domain across different agent frameworks (we evaluate the ASR with their latest built-in models); Center: Attack success rates across different injection types and their combinations; Right: Task distribution across injection types and their combinations.
Figure 89: Agent configuration for the symbol substitution case study under the indirect threat model.
Figure 90: Attack configuration for the symbol substitution case study under the indirect threat model.
Figure 91: Victim agent trajectory for the stock symbol substitution risk category under the indirect threat model.
Figure 92: Attack outcome for the symbol substitution case study (Fig. 89 and 90).
Figure 93: Agent configuration for the pump-and-dump case study under the direct threat model.
Figure 94: Victim agent trajectory for the pump-and-dump risk category under the direct threat model.
Figure 95: Left: Benign task success rates across different task categories in the legal domain; Center: Attack success rates across different risk categories under the indirect threat model; Right: Attack success rates across different risk categories under the direct threat model.
Figure 96: Trade-off between agent vulnerability and benign utility in the legal domain.
Figure 97: Left: Comparison of attack vulnerability in the legal domain across different agent frameworks (we evaluate the ASR with their latest built-in models); Center: Attack success rates across different injection types and their combinations; Right: Task distribution across injection types and their combinations.
Figure 98: Agent configuration for the strategy leak case study under the indirect threat model in the legal domain.
Figure 99: Attack configuration for the strategy leak case study under the indirect threat model in the legal domain.
Figure 100: Victim trajectory for the strategy leak risk category (strategy_leak/13) under the indirect threat model.
Figure 101: Attack outcome for the strategy leak case study (Fig. 98 and 99).
Figure 102: Agent configuration for the fabricated evidence case study under the direct threat model.
Figure 103: Victim trajectory for the fabricated evidence risk category under the direct threat model in the legal domain. The red-teaming agent crafts a jailbreak prompt (shown in the user query) that frames the fabricated country-conditions data as pre-validated, authenticated case materials.
Figure 104: Left: Benign task success rates across different task categories in the Telecom domain; Center: Attack success rates across different risk categories under the indirect threat model; Right: Attack success rates across different risk categories under the direct threat model.
Figure 105: Trade-off between agent vulnerability and benign utility in the Telecom domain.
Figure 106: Left: Comparison of attack vulnerability in the Telecom domain across different agent frameworks; Center: Attack success rates across different injection types and their combinations; Right: Task distribution across injection types and their combinations.
Figure 107: Indirect-threat case study in the Telecom domain.
Figure 108: Direct-threat case study in the Telecom domain.
Figure 109: Left: Benign task success rates across different clinical workflow categories in the medical domain; Center: Attack success rates across different medical risk categories under the indirect threat model; Right: Attack success rates across different medical risk categories under the direct threat model.
Figure 110: Trade-off between agent vulnerability and benign clinical utility in the medical domain.
Figure 111: Left: Comparison of attack vulnerability in the medical domain across different agent frameworks (evaluated using their latest supported models); Center: Attack success rates across different injection types and their combinations; Right: Task distribution across injection types and their combinations.
Figure 112: Indirect-threat case study in the Medical Service domain.
Figure 113: Direct-threat case study in the Medical Service domain.
Figure 114: Statistics of environments in DTAP, including (a) the number of REST API endpoints per environment; (b) the number of MCP tools per environment; (c) the number of UI pages per environment; and (d) the number of injection MCP tools per environment (each corresponding to a practical injection point).
Figure 115: Simulated Gmail environment. Representative views used for inbox navigation, message inspection, email drafting, and reply actions in workflow-agent evaluation.
Figure 116: Simulated Google Calendar environment. Representative views used for scheduling, editing events, and inviting attendees.
Figure 117: Simulated Google Docs environment.
Figure 118: Simulated Google Forms environment. Representative views used for form-content inspection and detail-page review in workflow-agent evaluation.
Figure 119: Simulated Google Drive environment. Representative views used for file browsing, directory filtering, and document retrieval in workflow-agent evaluation.
Figure 120: Simulated Google Sheets environment. Representative views used for spreadsheet browsing and sheet-level data inspection in workflow-agent evaluation.
Figure 121: Simulated Slack environment. Representative views used for reading messages and posting notifications in workflow-agent evaluation.
Figure 122: Simulated PayPal environment. Representative views used for payment workflows (recipient selection, amount entry, and confirmation).
Figure 123: Simulated Zoom environment. Representative views used for meeting management, scheduling, and invitations.
Figure 124: Simulated Atlassian environment. Representative views used for project browsing, issue discovery, board-based workflow tracking, and detailed issue inspection in workflow-agent evaluation.
Figure 125: Simulated WhatsApp environment. Representative views used for login, message review, call history inspection, and call-related interactions in workflow-agent evaluation.
Figure 126: Simulated GitHub environment. Representative views used for repository access, issue tracking, pull-request review, and commit-history inspection in workflow-agent evaluation.
Figure 127: Simulated GitLab environment. Representative views used for project access, issue tracking, detailed issue inspection, and board-based workflow management in workflow-agent evaluation.
Figure 128: Simulated Salesforce CRM dashboard.
Figure 129: Windows 11 simulation environment with PowerShell and File Explorer.
Figure 130: macOS Sonoma simulation environment with Terminal and Finder.
Figure 131
Figure 131. Figure 131: Simulated browser environment. MCP Tools. The browser environment exposes 27 MCP tools organized into 7 functional categories (Tab. 104): navigation and history management, page state and capture, element interaction, tab and viewport control, coordinate-based mouse control, credential management, and payment information management. These tools allow agents to navigate pages, inspect DOM snapshots and sc… view at source ↗
Figure 132
Figure 132. Figure 132: Simulated arXiv Website Graphical User Interface (GUI) view at source ↗
Figure 133
Figure 133. Figure 133: Customer service GUI. Left: Agent workspace home page with metric summary cards and active case table. Center: Case queue with sidebar filtering and sortable columns. Right: Case detail view with case fields, activity timeline, and compose area. P.25 Booking We construct a simulated travel booking platform where a travel agent helps users plan, book and pay trips. The environment is populated with large-… view at source ↗
Figure 134
Figure 134. Figure 134: Simulated travel booking website, including the search page and results page. view at source ↗
Figure 135
Figure 135. Figure 135: Simulated Yahoo Finance brokerage platform used in the finance domain. The platform view at source ↗
Figure 136
Figure 136. Figure 136: illustrates the FedEx-style interface, including both the homepage and the tracking page. The homepage is inspired by the official FedEx website and features a navigation bar (e.g., Shipping, Tracking, Support) and quick-access actions such as creating shipments. The tracking page provides a dedicated interface for entering tracking numbers and displays the current shipment status along with the full chr… view at source ↗
Figure 137
Figure 137. Figure 137: Simulated X environment. Representative views used for account access, profile inspection, topic exploration, and timeline browsing in workflow-agent evaluation. 239 view at source ↗
Figure 138
Figure 138. Figure 138: Simulated LinkedIn environment. Representative views used for account access, feed browsing, job exploration, and professional network management in workflow-agent evaluation. MCP Tools. The LinkedIn environment exposes a comprehensive MCP interface for professional networking workflows. As summarized in view at source ↗
Figure 139
Figure 139. Figure 139: Simulated Chase environment. Representative views used for account overview, balance and transaction inspection, card management, and money movement in workflow-agent evaluation. MCP Tools. The Chase environment exposes a comprehensive MCP interface for retail banking workflows. As summarized in view at source ↗
Figure 140
Figure 140. Figure 140: Simulated Notion environment. Representative views used for account access, on￾boarding content, individual page editing, and personalized workspace browsing in workflow-agent evaluation. 249 view at source ↗
Figure 141
Figure 141. Figure 141: Simulated Reddit environment. Representative views used for account access, person￾alized feed browsing, user profile inspection, and post / comment-thread reading in workflow-agent evaluation. MCP Tools. The Reddit environment exposes an MCP interface that covers the full community interaction surface. As summarized in view at source ↗
Figure 142
Figure 142. Figure 142: Simulated Robinhood environment. Representative views used for account access, market discovery, crypto and equity trading, order tracking, and cash movement in workflow-agent evaluation. MCP Tools. The Robinhood environment exposes a comprehensive MCP interface for retail bro￾kerage workflows across both equities and crypto. As summarized in view at source ↗
Figure 143
Figure 143. Figure 143: Simulated Dropbox environment. Representative views used for account access, top-level file browsing, folder inspection, and inbound file-request collection in workflow-agent evaluation. MCP Tools. The Dropbox environment exposes an MCP interface that covers the full cloud-storage surface area. As summarized in view at source ↗
Figure 144
Figure 144. Figure 144: Simulated Southwest Airlines dashboard. MCP Tools. The Southwest environment exposes a comprehensive MCP interface for low-cost-carrier flight-booking workflows. As summarized in view at source ↗
Figure 145
Figure 145. Figure 145: Simulated United Airlines dashboard. MCP Tools. The United environment exposes a comprehensive MCP interface for airline-reservation workflows. As summarized in view at source ↗
Figure 146
Figure 146. Figure 146: Simulated Enterprise Rent-A-Car dashboard. view at source ↗
Figure 147
Figure 147. Figure 147: Simulated DoorDash dashboard. MCP Tools. The DoorDash environment exposes a comprehensive MCP interface for on-demand food and grocery delivery workflows. As summarized in view at source ↗
Figure 148
Figure 148. Figure 148: Simulated Expedia website, including the search page and results page. view at source ↗
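The captions above show environments exposing their actions through MCP-style tool interfaces. As a hedged illustration of that pattern only (not DTap's actual code; the registry API and the `gmail.send_email` tool name are invented for this sketch), a minimal in-process tool registry might look like:

```python
# Sketch of an MCP-style tool registry for a simulated environment.
# All names here are hypothetical stand-ins, not DTap's real interface.
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class ToolRegistry:
    tools: Dict[str, Callable[..., str]] = field(default_factory=dict)

    def register(self, name: str):
        # Decorator that records a callable under a namespaced tool name.
        def wrap(fn):
            self.tools[name] = fn
            return fn
        return wrap

    def call(self, name: str, **kwargs) -> str:
        # Dispatch a tool call by name, as an MCP server would on request.
        return self.tools[name](**kwargs)

registry = ToolRegistry()

@registry.register("gmail.send_email")
def send_email(to: str, subject: str, body: str) -> str:
    # A real simulator would mutate environment state; here we just echo.
    return f"sent to {to}: {subject}"

result = registry.call("gmail.send_email", to="a@b.com", subject="hi", body="hello")
```

In a full simulator, each registered tool would read and write persistent environment state (inbox, calendar, ledger), which is what makes attack outcomes verifiable after the fact.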
Original abstract

AI agents are increasingly deployed across diverse domains to automate complex workflows through long-horizon and high-stakes action executions. Due to their high capability and flexibility, such agents raise significant security and safety concerns. A growing number of real-world incidents have shown that adversaries can easily manipulate agents into performing harmful actions, such as leaking API keys, deleting user data, or initiating unauthorized transactions. Evaluating agent security is inherently challenging, as agents operate in dynamic, untrusted environments involving external tools, heterogeneous data sources, and frequent user interactions. However, realistic, controllable, and reproducible environments for large-scale risk assessment remain largely underexplored. To address this gap, we introduce the DecodingTrust-Agent Platform (DTap), the first controllable and interactive red-teaming platform for AI agents, spanning 14 real-world domains and over 50 simulation environments that replicate widely used systems such as Google Workspace, Paypal, and Slack. To scale the risk assessment of agents in DTap, we further propose DTap-Red, the first autonomous red-teaming agent that systematically explores diverse injection vectors (e.g., prompt, tool, skill, environment, combinations) and autonomously discovers effective attack strategies tailored to varying malicious goals. Using DTap-Red, we curate DTap-Bench, a large-scale red-teaming dataset comprising high-quality instances across domains, each paired with a verifiable judge to automatically validate attack outcomes. Through DTap, we conduct large-scale evaluations of popular AI agents built on various backbone models, spanning security policies, risk categories, and attack strategies, revealing systematic vulnerability patterns and providing valuable insights for developing secure next-generation agents.
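The abstract's description of DTap-Red — systematically exploring injection vectors and validating outcomes with a verifiable judge — can be pictured as a search-and-verify loop. The sketch below is illustrative only: `inject`, `run_agent`, and `judge` are hypothetical stand-ins (a real system would drive an LLM attacker against live simulated environments), and the toy victim rule exists purely to make the loop concrete.

```python
# Hedged sketch of an autonomous red-teaming loop in the spirit of DTap-Red.
# Function names and the toy victim behavior are assumptions for illustration.
from itertools import product

VECTORS = ["prompt", "tool", "environment"]
PAYLOADS = ["exfiltrate_api_key", "unauthorized_transfer"]

def inject(vector: str, payload: str) -> dict:
    # Craft one attack instance for a (vector, payload) pair.
    return {"vector": vector, "payload": payload}

def run_agent(attack: dict) -> str:
    # Hypothetical victim-agent rollout; a toy rule stands in for a model.
    return "leaked" if attack["vector"] == "prompt" else "refused"

def judge(attack: dict, transcript: str) -> bool:
    # A verifiable judge checks the transcript for the attack's success condition.
    return transcript == "leaked"

successes = [
    a for a in (inject(v, p) for v, p in product(VECTORS, PAYLOADS))
    if judge(a, run_agent(a))
]
```

Attack instances that pass the judge are exactly the kind of validated examples a benchmark like DTap-Bench would retain.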

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the DecodingTrust-Agent Platform (DTap), the first controllable and interactive red-teaming platform for AI agents, spanning 14 real-world domains and over 50 simulation environments that replicate systems such as Google Workspace, PayPal, and Slack. It proposes DTap-Red, an autonomous red-teaming agent that explores diverse injection vectors (prompt, tool, skill, environment, and combinations of these) to discover effective attack strategies. Using DTap-Red, the authors curate the DTap-Bench dataset of high-quality instances paired with verifiable automatic judges, and conduct large-scale evaluations of popular AI agents that reveal systematic vulnerability patterns across security policies, risk categories, and attack strategies.

Significance. If the simulation environments accurately replicate real-world agent dynamics and the evaluations are supported by rigorous quantitative validation, this platform and benchmark would represent a meaningful advance in AI agent security research by enabling scalable, reproducible red-teaming in dynamic, tool-using settings. The autonomous DTap-Red component and the curated dataset with judges could help standardize assessment practices and support development of more secure agents.

major comments (2)
  1. Abstract: the claim of conducting large-scale evaluations that reveal systematic vulnerability patterns is central to the paper's empirical contribution, yet the abstract (and provided summary) supplies no quantitative results such as attack success rates, error analysis, or validation of the automatic judges. This omission is load-bearing for assessing the strength and reliability of the reported findings.
  2. Abstract (DTap environment descriptions): the platform's value for real-world red-teaming rests on the assertion that the 50+ simulation environments faithfully replicate dynamic tool use, state changes, and user interactions in systems such as Google Workspace and Slack. No fidelity metrics, API parity tests, or transfer experiments demonstrating that attacks discovered in simulation generalize to actual deployments are mentioned, which directly affects the generalizability of DTap-Red's attack strategies.
minor comments (1)
  1. Abstract: the abstract is dense and lists multiple contributions in a single paragraph; splitting key claims or adding a brief limitations sentence could improve clarity for readers.
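The API-parity testing that the second major comment calls for could, in spirit, look like the sketch below. Both "specs" are toy dictionaries invented for illustration, not real Slack schemas, and `parity_gaps` is a hypothetical helper, not anything the paper provides.

```python
# Illustrative API-parity check: compare a simulator's declared endpoints
# against a reference spec. Endpoint names and params are toy assumptions.
real_spec = {
    "POST /chat.postMessage": {"channel", "text"},
    "GET /conversations.list": {"limit"},
}
sim_spec = {
    "POST /chat.postMessage": {"channel", "text"},
    # "GET /conversations.list" omitted: a parity gap the check should surface
}

def parity_gaps(real: dict, sim: dict) -> dict:
    """Return endpoints missing from the simulator or with mismatched params."""
    gaps = {}
    for endpoint, params in real.items():
        if endpoint not in sim:
            gaps[endpoint] = "missing"
        elif sim[endpoint] != params:
            gaps[endpoint] = "param mismatch"
    return gaps

gaps = parity_gaps(real_spec, sim_spec)
```

Reporting such gaps per environment would directly address the referee's concern about how faithfully the simulations cover real API surfaces.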

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: Abstract: the claim of conducting large-scale evaluations that reveal systematic vulnerability patterns is central to the paper's empirical contribution, yet the abstract (and provided summary) supplies no quantitative results such as attack success rates, error analysis, or validation of the automatic judges. This omission is load-bearing for assessing the strength and reliability of the reported findings.

    Authors: We agree that the abstract would benefit from including key quantitative highlights to better convey the empirical strength of the work. The full manuscript reports detailed results in the experiments section, including attack success rates across models and attack vectors as well as validation metrics for the automatic judges. In the revised version, we will update the abstract to concisely summarize representative quantitative findings from the large-scale evaluations while keeping the focus on the platform's contributions. revision: yes

  2. Referee: Abstract (DTap environment descriptions): the platform's value for real-world red-teaming rests on the assertion that the 50+ simulation environments faithfully replicate dynamic tool use, state changes, and user interactions in systems such as Google Workspace and Slack. No fidelity metrics, API parity tests, or transfer experiments demonstrating that attacks discovered in simulation generalize to actual deployments are mentioned, which directly affects the generalizability of DTap-Red's attack strategies.

    Authors: We acknowledge that explicit fidelity metrics and transfer experiments would further support claims about real-world applicability. The current manuscript details the environment construction based on real API specifications and state management but does not include dedicated fidelity quantification or generalization tests. In the revision, we will add a dedicated subsection on environment design and validation (including API parity where implemented) and explicitly discuss the limitations regarding direct transfer to production deployments. revision: partial

Circularity Check

0 steps flagged

No circularity: platform, agent, and dataset presented as original constructions

full rationale

The paper introduces DTap as a newly built platform spanning 14 domains and 50+ simulation environments, DTap-Red as an autonomous red-teaming agent, and DTap-Bench as a curated dataset with judges. These are described as constructed artifacts for evaluation, with no equations, fitted parameters, predictions, or derivation steps that reduce by construction to prior inputs, self-citations, or ansatzes. Large-scale evaluations of agents are performed directly on the built system, rendering the work self-contained without load-bearing circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that high-fidelity simulations can stand in for real agent environments; no free parameters are introduced and the only invented entity is the autonomous red-teaming agent itself.

axioms (1)
  • domain assumption Simulation environments can accurately replicate the dynamic interactions, tool usage, and user behaviors of real-world systems such as Google Workspace and Slack.
    Invoked to justify that attacks found in DTap will be relevant to deployed agents.
invented entities (1)
  • DTap-Red no independent evidence
    purpose: Autonomous agent that explores injection vectors and discovers effective attack strategies tailored to malicious goals.
    Newly proposed component whose effectiveness is asserted but not independently validated in the abstract.

pith-pipeline@v0.9.0 · 5659 in / 1433 out tokens · 69623 ms · 2026-05-08T16:07:32.049427+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

147 extracted references · 8 canonical work pages · 1 internal anchor

  1. [1]

    Dodd-frank wall street reform and consumer protection act

    111th United States Congress. Dodd-frank wall street reform and consumer protection act. https://www.congress.gov/bill/111th-congress/house-bill/4173, 2010. Pub.L. 111–203, 124 Stat. 1376

  2. [2]

    Terms of Service

    Airbnb. Terms of Service. https://www.airbnb.com/help/article/2908, 2026. Accessed 2026-04-01

  3. [3]

    Airbnb’s Content Policy

    Airbnb. Airbnb’s Content Policy. https://www.airbnb.com/help/article/546, n.d. Accessed 2026-04-01

  4. [4]

    Off-Platform and Fee Transparency Policy

    Airbnb. Off-Platform and Fee Transparency Policy. https://www.airbnb.com/help/article/2799, n.d. Accessed 2026-04-01

  5. [5]

    Model rules of professional conduct

    American Bar Association. Model rules of professional conduct. https://www.americanbar.org/groups/professional_responsibility/publications/model_rules_of_professional_conduct/, 2024. Accessed: 2026-04-01

  6. [6]

    Agentharm: A benchmark for measuring harmfulness of llm agents

    Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, J Zico Kolter, Matt Fredrikson, et al. Agentharm: A benchmark for measuring harmfulness of llm agents. In The Thirteenth International Conference on Learning Representations

  7. [7]

    Claude agent sdk overview, 2025

    Anthropic. Claude agent sdk overview, 2025

  8. [8]

    Claude code by anthropic, 2026

    Anthropic. Claude code by anthropic, 2026. Accessed: 2026-04-23

  9. [9]

    Claude cowork, 2026

    Anthropic. Claude cowork, 2026

  10. [10]

    Cursor: The best way to code with ai, 2023

    Anysphere. Cursor: The best way to code with ai, 2023

  11. [11]

    Atlassian acceptable use policy

    Atlassian. Atlassian acceptable use policy. https://www.atlassian.com/legal/acceptable-use-policy. Accessed: 2026-03-19

  12. [12]

    Shieldagent: Shielding agents via verifiable safety policy reasoning

    Zhaorun Chen, Mintong Kang, and Bo Li. Shieldagent: Shielding agents via verifiable safety policy reasoning. In Forty-second International Conference on Machine Learning

  13. [13]

    Arms: Adaptive red-teaming agent against multimodal models with plug-and-play attacks

    Zhaorun Chen, Xun Liu, Mintong Kang, Jiawei Zhang, Minzhou Pan, Shuang Yang, and Bo Li. Arms: Adaptive red-teaming agent against multimodal models with plug-and-play attacks. arXiv preprint arXiv:2510.02677, 2025

  14. [14]

    Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases

    Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases. Advances in Neural Information Processing Systems, 37:130185–130213, 2024

  15. [15]

    Eu cbrn action plan

    Council of the European Union. Eu cbrn action plan. https://data.consilium.europa.eu/doc/document/ST-15505-2009-REV-1/en/pdf, 2009. Accessed: 2026-04-03

  16. [16]

    Interim measures for the management of generative artificial intelligence services

    Cyberspace Administration of China. Interim measures for the management of generative artificial intelligence services. https://www.cac.gov.cn/2023-07/13/c_1690898327029107.htm, 2023. Accessed: 2026-04-01

  17. [17]

    Databricks acceptable use policy

    Databricks. Databricks acceptable use policy. https://www.databricks.com/legal/acceptable-use-policy-fe. Accessed: 2026-03-19

  18. [18]

    Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents

    Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents. Advances in Neural Information Processing Systems, 37:82895–82920, 2024

  19. [19]

    Regulation (EU) 2019/881 — the EU cybersecurity act

    European Union. Regulation (EU) 2019/881 — the EU cybersecurity act. https://eur-lex.europa.eu/eli/reg/2019/881/oj. Accessed: 2026-04-01

  20. [20]

    The eu artificial intelligence act, 2024

    European Union. The eu artificial intelligence act, 2024

  21. [21]

    Ftc policy statement on deception, 1983

    Federal Trade Commission. Ftc policy statement on deception, 1983. Appended to Cliffdale Associates, Inc., 103 F.T.C. 110, 174 (1984)

  22. [22]

    FINRA rules

    Financial Industry Regulatory Authority. FINRA rules. https://www.finra.org/rules-guidance/rulebooks/finra-rules, 2025. Accessed: 2026-04-01

  23. [23]

    Petri: An open-source auditing tool to accelerate ai safety research

    K Fronsdal, I Gupta, A Sheshadri, J Michala, S Mcaleer, and R Wang. Petri: An open-source auditing tool to accelerate ai safety research. Alignment Science Blog, 10, 2025

  24. [24]

    Chrome services acceptable use policy

    Google. Chrome services acceptable use policy. https://chromeenterprise.google/terms/aup/. Accessed: 2026-04-01

  25. [25]

    Gmail program policies

    Google. Gmail program policies. https://support.google.com/mail/answer/16734397?hl=en. Accessed: 2026-03-19

  26. [26]

    Google calendar program policies

    Google. Google calendar program policies. https://www.google.com/intl/en_GB/googlecalendar/program_policies.html. Accessed: 2026-05-04

  27. [27]

    Google docs editors help: Abuse program policies & enforcement

    Google. Google docs editors help: Abuse program policies & enforcement. https://support.google.com/docs/answer/148505?hl=en. Accessed: 2026-03-19

  28. [28]

    Agent development kit (adk) documentation, 2025

    Google. Agent development kit (adk) documentation, 2025

  29. [29]

    Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, pages 79–90, 2023

  30. [30]

    Artprompt: Ascii art-based jailbreak attacks against aligned llms

    Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, and Radha Poovendran. Artprompt: Ascii art-based jailbreak attacks against aligned llms. In Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 15157–15173, 2024

  31. [31]

    Shade-arena: Evaluating sabotage and monitoring in llm agents

    Jonathan Kutasov, Yuqi Sun, Paul Colognese, Teun van der Weij, Linda Petrini, Chen Bo Calvin Zhang, John Hughes, Xiang Deng, Henry Sleight, Tyler Tracy, et al. Shade-arena: Evaluating sabotage and monitoring in llm agents. arXiv preprint arXiv:2506.15740, 2025

  32. [32]

    Langchain framework documentation, 2025

    LangChain. Langchain framework documentation, 2025

  33. [33]

    St-webagentbench: A benchmark for evaluating safety and trustworthiness in web agents

    Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, and Segev Shlomov. St-webagentbench: A benchmark for evaluating safety and trustworthiness in web agents. In International Conference on Machine Learning, 2025

  34. [34]

    Agent hospital: A simulacrum of hospital with evolvable medical agents

    Junkai Li, Yunghwei Lai, Weitao Li, Jingyi Ren, Meng Zhang, Xinhui Kang, Siyu Wang, Peng Li, Ya-Qin Zhang, Weizhi Ma, and Yang Liu. Agent hospital: A simulacrum of hospital with evolvable medical agents. arXiv preprint arXiv:2405.02957, 2024

  35. [35]

    Redteamcua: Realistic adversarial testing of computer-use agents in hybrid web-os environments

    Zeyi Liao, Jaylen Jones, Linxi Jiang, Yuting Ning, Eric Fosler-Lussier, Yu Su, Zhiqiang Lin, and Huan Sun. Redteamcua: Realistic adversarial testing of computer-use agents in hybrid web-os environments. arXiv preprint arXiv:2505.21936, 2025

  36. [36]

    Autodan: Generating stealthy jailbreak prompts on aligned large language models

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models. In The Twelfth International Conference on Learning Representations, 2024

  37. [37]

    Welcome — manus documentation, 2026

    Manus. Welcome — manus documentation, 2026. Accessed: 2026-04-23

  38. [38]

    Cwe top 25 most dangerous software weaknesses

    MITRE. Cwe top 25 most dangerous software weaknesses. https://cwe.mitre.org/top25/archive/2025/2025_cwe_top25.html, 2025. Accessed: 2026-04-03

  39. [39]

    Mitre att&ck framework

    MITRE. Mitre att&ck framework. https://attack.mitre.org/, 2025. Accessed: 2026-04-03

  40. [40]

    Mozilla acceptable use policy

    Mozilla. Mozilla acceptable use policy. https://www.mozilla.org/en-US/about/legal/acceptable-use/. Accessed: 2026-04-01

  41. [41]

    Security and privacy controls for information systems and organizations (sp 800-53 rev. 5)

    National Institute of Standards and Technology (NIST). Security and privacy controls for information systems and organizations (sp 800-53 rev. 5). https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-53r5.pdf, 2020. Accessed: 2026-04-03

  42. [42]

    Towards a standard for identifying and managing bias in artificial intelligence

    National Institute of Standards and Technology (NIST). Towards a standard for identifying and managing bias in artificial intelligence. https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1270.pdf, 2022. Accessed: 2026-04-03

  43. [43]

    Personal information protection law of the people’s republic of china

    National People’s Congress of China. Personal information protection law of the people’s republic of china. http://www.npc.gov.cn/npc/c30834/202108/a8c4e3672c74491a80b53a172bb753fe.shtml, 2021. Accessed: 2026-04-01

  44. [44]

    Nato’s chemical, biological, radiological and nuclear (cbrn) defence policy

    NATO. Nato’s chemical, biological, radiological and nuclear (cbrn) defence policy. https://www.nato.int/en/about-us/official-texts-and-resources/official-texts/2022/06/14/natos-chemical-biological-radiological-and-nuclear-cbrn-defence-policy. Accessed: 2026-04-03

  46. [46]

    Chatgpt atlas, 2025

    OpenAI. Chatgpt atlas, 2025

  47. [47]

    Openai agents python sdk, 2025

    OpenAI. Openai agents python sdk, 2025

  48. [48]

    Codex | ai coding partner from openai, 2026

    OpenAI. Codex | ai coding partner from openai, 2026. Accessed: 2026-04-23

  49. [49]

    Introducing openclaw, 2026

    OpenClaw. Introducing openclaw, 2026

  50. [50]

    Openclaw — personal ai assistant, 2026

    OpenClaw. Openclaw — personal ai assistant, 2026. Accessed: 2026-04-23

  51. [51]

    Owasp top 10 for large language model applications

    OWASP Foundation. Owasp top 10 for large language model applications. https://genai.owasp.org/llm-top-10/, 2025. Accessed: 2026-04-03

  52. [52]

    Paypal acceptable use policy

    PayPal. Paypal acceptable use policy. https://www.paypal.com/us/legalhub/paypal/acceptableuse-full. Accessed: 2026-03-19

  53. [53]

    Paypal user agreement

    PayPal. Paypal user agreement. https://www.paypal.com/us/legalhub/paypal/useragreement-full. Accessed: 2026-03-19

  54. [54]

    PCI Data Security Standard (PCI DSS) v4.0.1

    PCI Security Standards Council. PCI Data Security Standard (PCI DSS) v4.0.1. https://docs-prv.pcisecuritystandards.org/PCI%20DSS/Standard/PCI-DSS-v4_0_1.pdf

  55. [55]

    Salesforce artificial intelligence acceptable use policy

    Salesforce, Inc. Salesforce artificial intelligence acceptable use policy. https://www.salesforce.com/en-us/wp-content/uploads/sites/4/documents/legal/Agreements/policies/ai-acceptable-use-policy.pdf, 2023

  56. [56]

    Salesforce acceptable use and external-facing services policy

    Salesforce, Inc. Salesforce acceptable use and external-facing services policy. https://www.salesforce.com/en-us/wp-content/uploads/sites/4/documents/legal/Agreements/policies/ExternalFacing_Services_Policy.pdf, 2025

  57. [57]

    Shopify acceptable use policy, 2024

    Shopify, Inc. Shopify acceptable use policy, 2024

  58. [58]

    Slack acceptable use policy

    Slack. Slack acceptable use policy. https://slack.com/acceptable-use-policy. Accessed: 2026-03-19

  59. [59]

    Snowflake acceptable use policy

    Snowflake. Snowflake acceptable use policy. https://www.snowflake.com/en/legal/addenda/acceptable-use-policy/. Accessed: 2026-03-19

  60. [60]

    California consumer privacy act / california privacy rights act (CCPA/CPRA)

    State of California. California consumer privacy act / california privacy rights act (CCPA/CPRA). https://oag.ca.gov/privacy/ccpa. Accessed: 2026-04-01

  61. [61]

    Safearena: Evaluating the safety of autonomous web agents

    Ada Defne Tur, Nicholas Meade, Xing Han Lù, Alejandra Zambrano, Arkil Patel, Esin Durmus, Spandana Gella, Karolina Stanczak, and Siva Reddy. Safearena: Evaluating the safety of autonomous web agents. In International Conference on Machine Learning, pages 60404–60441. PMLR, 2025

  62. [62]

    Recommendation on the ethics of artificial intelligence

    UNESCO. Recommendation on the ethics of artificial intelligence. https://www.unesco.org/en/artificial-intelligence/recommendation-ethics, 2021. Adopted 23 November 2021

  63. [63]

    General data protection regulation, 2018

    European Union. General data protection regulation, 2018

  64. [64]

    18 U.S.C. § 1001 — statements or entries generally

    United States Code. 18 U.S.C. § 1001 — statements or entries generally. https://www.law.cornell.edu/uscode/text/18/1001, 1948. Title 18, Part I, Chapter 47

  65. [65]

    18 U.S.C. § 1341 — frauds and swindles

    United States Code. 18 U.S.C. § 1341 — frauds and swindles. https://www.law.cornell.edu/uscode/text/18/1341, 1948. Title 18, Part I, Chapter 63

  66. [66]

    18 U.S.C. § 1951 — interference with commerce by threats or violence

    United States Code. 18 U.S.C. § 1951 — interference with commerce by threats or violence. https://www.law.cornell.edu/uscode/text/18/1951, 1948. Title 18, Part I, Chapter 95 (Hobbs Act)

  67. [67]

    18 U.S.C. § 873 — blackmail

    United States Code. 18 U.S.C. § 873 — blackmail. https://www.law.cornell.edu/uscode/text/18/873, 1948. Title 18, Part I, Chapter 41

  68. [68]

    18 U.S.C. § 1343 — fraud by wire, radio, or television

    United States Code. 18 U.S.C. § 1343 — fraud by wire, radio, or television. https://www.law.cornell.edu/uscode/text/18/1343, 1952. Title 18, Part I, Chapter 63

  69. [69]

    18 U.S.C. § 1028 — fraud and related activity in connection with identification documents

    United States Code. 18 U.S.C. § 1028 — fraud and related activity in connection with identification documents. https://www.law.cornell.edu/uscode/text/18/1028, 1982. Title 18, Part I, Chapter 47

  70. [70]

    Computer fraud and abuse act, 18 U.S.C. §1030

    United States Congress. Computer fraud and abuse act, 18 U.S.C. §1030. https://www.law.cornell.edu/uscode/text/18/1030. Accessed: 2026-04-01

  71. [71]

    Electronic communications privacy act, 18 U.S.C. §§2511, 2701

    United States Congress. Electronic communications privacy act, 18 U.S.C. §§2511, 2701. https://www.law.cornell.edu/uscode/text/18/part-I/chapter-119. Accessed: 2026-04-01

  73. [73]

    Bank secrecy act

    U.S. Department of the Treasury. Bank secrecy act. https://www.fincen.gov/resources/statutes-and-regulations/bank-secrecy-act, 1970. 31 U.S.C. §§ 5311–5332

  74. [74]

    Rule 10b-5: Employment of manipulative and deceptive devices

    U.S. Securities and Exchange Commission. Rule 10b-5: Employment of manipulative and deceptive devices. https://www.law.cornell.edu/cfr/text/17/240.10b-5, 1942. 17 CFR § 240.10b-5

  75. [75]

    Regulation S-P: Privacy of consumer financial information

    U.S. Securities and Exchange Commission. Regulation S-P: Privacy of consumer financial information. https://www.law.cornell.edu/cfr/text/17/part-248, 2000. 17 CFR Part 248

  76. [76]

    Regulation best interest: The broker-dealer standard of conduct

    U.S. Securities and Exchange Commission. Regulation best interest: The broker-dealer standard of conduct. https://www.sec.gov/rules-regulations/2019/06/s7-07-18, 2019. 17 CFR § 240.15l-1, Release No. 34-86031

  77. [77]

    Law: Legal agentic workflows for custody and fund services contracts

    William Watson, Nicole Cho, Nishan Srishankar, Zhen Zeng, Lucas Cecchi, Daniel Scott, Suchetha Siddagangappa, Rachneet Kaur, Tucker Balch, and Manuela Veloso. Law: Legal agentic workflows for custody and fund services contracts. In Proceedings of the 31st International Conference on Computational Linguistics: Industry Track, pages 583–594, Abu Dhabi, UAE. Association for Computational Linguistics

  79. [79]

    Whatsapp business messaging policy

    WhatsApp. Whatsapp business messaging policy. https://business.whatsapp.com/policy. Accessed: 2026-03-19

  80. [80]

    Whatsapp business terms of service

    WhatsApp. Whatsapp business terms of service. https://www.whatsapp.com/legal/business-terms. Accessed: 2026-03-19

Showing first 80 references.