VESTA: A Fully Automated Scenario Generation and Safety Evaluation Framework for LLM Agents
Pith reviewed 2026-06-27 19:00 UTC · model grok-4.3
The pith
VESTA automates creation of 1,072 safety scenarios and shows LLM agents carry an average 47.1 percent attack success rate during tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VESTA instantiates abstract safety risks from five dimensions into 1,072 measurable scenarios and runs an automated pipeline that evaluates 12 LLM agents under two authority contexts. The results establish that current agents face substantial behavioral safety risks during task execution, with an average attack success rate of 47.1 percent and several models exceeding 70 percent.
What carries the argument
VESTA, the fully automated framework that converts five risk dimensions into executable scenarios and applies a pipeline for process-level safety measurement instead of final-output judgment.
If this is right
- Process-level checks during execution surface risks that final-output checks miss.
- Agents require safeguards that operate across the full task trajectory rather than only at termination.
- Automated generation scales evaluation beyond what hand-written prompts can achieve.
- Authority context changes the measured risk exposure for the same underlying agent.
Where Pith is reading between the lines
- Developers could integrate the scenario generator into training loops to reduce failures before deployment.
- The same dimension-based approach might apply to non-LLM autonomous systems that execute multi-step plans.
- High failure rates indicate that alignment techniques focused solely on text outputs leave tool-use and memory behaviors under-protected.
- Repeated runs of the pipeline could track whether new agent releases reduce the measured rates over time.
Load-bearing premise
The five risk dimensions and the automated instantiation process produce scenarios that are both diverse and representative of real-world safety failures that agents will encounter in deployment.
What would settle it
A direct comparison that maps a large collection of documented real-world agent incidents onto the five risk dimensions and finds that most incidents fall outside the generated scenario set.
Figures
read the original abstract
Large language models (LLMs) are increasingly evolving from simple text-based interaction systems into LLM agents that can maintain memory, use tools, access external environments, and execute tasks. As their capabilities and autonomy expand, the safety risks they face also become more diverse. Existing evaluations often rely on manually written scenarios, static prompts, or final-output judgments, making it difficult to capture the diverse risks that agents may face during task execution. We introduce VESTA, a fully automated scenario generation and safety evaluation framework for LLM agents. Based on five risk dimensions, VESTA instantiaes abstract and diverse safety risks in real-world task execution into 1,072 measurable evaluation scenarios. Using the automated evaluation pipeline, 12 LLM agents are evaluated under two authority contexts. The results show that current agents still face substantial behavioral safety risks during task execution, with an average ASR of 47.1% and several models exceeding 70%. These findings demonstrate the importance of executable, process-level evaluation for understanding and improving LLM agent safety.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents VESTA, a fully automated scenario generation and safety evaluation framework for LLM agents. Based on five risk dimensions, it instantiates abstract risks into 1,072 measurable evaluation scenarios. Using an automated pipeline, it evaluates 12 LLM agents under two authority contexts and reports an average attack success rate (ASR) of 47.1%, with several models exceeding 70%. The work concludes that current agents face substantial behavioral safety risks during task execution and argues for the importance of executable, process-level evaluation.
Significance. If the generated scenarios prove representative, the scale of the evaluation (1,072 scenarios across 12 agents) supplies concrete empirical evidence of behavioral safety vulnerabilities in LLM agents. The fully automated pipeline is a clear strength, enabling reproducible, large-scale assessment without manual scenario curation and supporting process-level rather than final-output judgments.
major comments (1)
- [Abstract] Abstract: the generalization that 'current agents still face substantial behavioral safety risks during task execution' with an average ASR of 47.1% rests on the claim that the five risk dimensions plus automated instantiation yield scenarios that are both diverse and representative of real-world failures. No external validation (mapping to documented incidents, expert coverage checks, or comparison against manually curated benchmarks) is supplied to anchor this representativeness.
minor comments (1)
- [Abstract] Abstract: concrete numbers (1,072 scenarios, 47.1% ASR) are reported without accompanying details on validation of generated scenarios, error bars, baseline comparisons, or operationalization of the two authority contexts.
Simulated Author's Rebuttal
We thank the referee for the constructive comment on the abstract's generalization. We address the point directly below and will make the requested changes.
read point-by-point responses
-
Referee: [Abstract] Abstract: the generalization that 'current agents still face substantial behavioral safety risks during task execution' with an average ASR of 47.1% rests on the claim that the five risk dimensions plus automated instantiation yield scenarios that are both diverse and representative of real-world failures. No external validation (mapping to documented incidents, expert coverage checks, or comparison against manually curated benchmarks) is supplied to anchor this representativeness.
Authors: We agree that the current abstract overstates the representativeness claim without supporting validation. The five risk dimensions are synthesized from prior AI safety literature, and the automated pipeline is intended to ensure systematic coverage and diversity within those dimensions, but no external anchoring (real-incident mapping, expert review, or benchmark comparison) is performed. We will revise the abstract to qualify the statement (e.g., replace the broad generalization with a scoped claim limited to the generated scenarios) and add an explicit limitations paragraph discussing the absence of such validation together with directions for future work. revision: yes
Circularity Check
Empirical measurement of ASR on external agents with no self-referential reduction
full rationale
The paper introduces VESTA as a scenario generation framework based on five author-defined risk dimensions and reports an empirical average ASR of 47.1% measured directly from the observed behaviors of 12 external LLM agents across the generated scenarios. No equations, fitted parameters, or derivations are present that would reduce the reported ASR to a quantity defined in terms of itself or to a self-citation chain. The result is a straightforward empirical count of failure instances under the generated conditions and remains independent of any internal construction that would render it tautological.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Five risk dimensions capture the main safety concerns for LLM agents in real-world task execution.
Reference graph
Works this paper leans on
-
[1]
Kimi K2: Open Agentic Intelligence
Kimi K2: Open Agentic Intelligence , author =. arXiv preprint arXiv:2507.20534 , year =
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
2025 , urldate =
Introducing Claude Haiku 4.5 , url =. 2025 , urldate =
2025
-
[3]
Introducing Claude Opus 4.7 , year =
-
[4]
2026 , urldate =
Introducing Claude Sonnet 4.6 , url =. 2026 , urldate =
2026
-
[5]
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models , author =. arXiv preprint arXiv:2512.02556 , year =
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
2026 , urldate =
DeepSeek V4 Preview Release , url =. 2026 , urldate =
2026
-
[7]
2025 , urldate =
Doubao Large Model , url =. 2025 , urldate =
2025
-
[8]
2025 , urldate =
Gemini Flash , url =. 2025 , urldate =
2025
-
[9]
2025 , urldate =
Gemini 3 Flash: Frontier Intelligence Built for Speed , url =. 2025 , urldate =
2025
-
[10]
2026 , urldate =
GLM-5.1 Overview , url =. 2026 , urldate =
2026
-
[11]
Introducing GPT-5.4 , year =
-
[12]
The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation , year =
-
[13]
GPT-4o System Card , author =. arXiv preprint arXiv:2410.21276 , year =
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
2024 , urldate =
GPT-4o mini: Advancing Cost-Efficient Intelligence , url =. 2024 , urldate =
2024
-
[15]
2024 , urldate =
Llama 3.3 Model Cards and Prompt Formats , url =. 2024 , urldate =
2024
-
[16]
2026 , urldate =
Qwen3.6-Plus: Towards Real World Agents , url =. 2026 , urldate =
2026
-
[17]
International Conference on Learning Representations , volume=
Identifying the risks of lm agents with an lm-emulated sandbox , author=. International Conference on Learning Representations , volume=
-
[18]
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
Toolsword: Unveiling safety issues of large language models in tool learning across three stages , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
-
[19]
ReAct: Synergizing Reasoning and Acting in Language Models
React: Synergizing reasoning and acting in language models , author=. arXiv preprint arXiv:2210.03629 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
Advances in neural information processing systems , volume=
Toolformer: Language models can teach themselves to use tools , author=. Advances in neural information processing systems , volume=
-
[21]
Frontiers of Computer Science , volume=
A survey on large language model based autonomous agents , author=. Frontiers of Computer Science , volume=. 2024 , publisher=
2024
-
[22]
Advances in Neural Information Processing Systems , volume=
Gorilla: Large language model connected with massive apis , author=. Advances in Neural Information Processing Systems , volume=
-
[23]
International Conference on Learning Representations , volume=
Toolllm: Facilitating large language models to master 16000+ real-world apis , author=. International Conference on Learning Representations , volume=
-
[24]
International Conference on Learning Representations , volume=
Webarena: A realistic web environment for building autonomous agents , author=. International Conference on Learning Representations , volume=
-
[25]
Advances in Neural Information Processing Systems , volume=
Swe-agent: Agent-computer interfaces enable automated software engineering , author=. Advances in Neural Information Processing Systems , volume=
-
[26]
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=
ToolSafety: A Comprehensive Dataset for Enhancing Safety in LLM-Based Agent Tool Invocations , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=
2025
-
[27]
arXiv preprint arXiv:2509.07315 , year=
SafeToolBench: Pioneering a Prospective Benchmark to Evaluating Tool Utilization Safety in LLMs , author=. arXiv preprint arXiv:2509.07315 , year=
-
[28]
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=
Learning to ask: When llm agents meet unclear instruction , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=
2025
-
[29]
International Conference on Learning Representations , volume=
Agentharm: A benchmark for measuring harmfulness of llm agents , author=. International Conference on Learning Representations , volume=
-
[30]
Advances in Neural Information Processing Systems , volume=
Os-harm: A benchmark for measuring safety of computer use agents , author=. Advances in Neural Information Processing Systems , volume=
-
[31]
arXiv preprint arXiv:2601.08012 , year=
Towards Verifiably Safe Tool Use for LLM Agents , author=. arXiv preprint arXiv:2601.08012 , year=
-
[32]
Mind the gap: Text safety does not transfer to tool-call safety in llm agents
Mind the GAP: Text safety does not transfer to tool-call safety in LLM agents , author=. arXiv preprint arXiv:2602.16943 , year=
-
[33]
Agent-SafetyBench: Evaluating the Safety of LLM Agents
Agent-safetybench: Evaluating the safety of llm agents , author=. arXiv preprint arXiv:2412.14470 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[34]
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
Safetybench: Evaluating the safety of large language models , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
-
[35]
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Harmbench: A standardized evaluation framework for automated red teaming and robust refusal , author=. arXiv preprint arXiv:2402.04249 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[36]
Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=
R-judge: Benchmarking safety risk awareness for llm agents , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=
2024
-
[37]
WebGPT: Browser-assisted question-answering with human feedback
Webgpt: Browser-assisted question-answering with human feedback , author=. arXiv preprint arXiv:2112.09332 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[38]
International Conference on Learning Representations , volume=
Agentbench: Evaluating llms as agents , author=. International Conference on Learning Representations , volume=
-
[39]
$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains , author=. arXiv preprint arXiv:2406.12045 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[40]
Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers) , pages=
Self-instruct: Aligning language models with self-generated instructions , author=. Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers) , pages=
-
[41]
WizardLM: Empowering large pre-trained language models to follow complex instructions
Wizardlm: Empowering large language models to follow complex instructions , author=. arXiv preprint arXiv:2304.12244 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[42]
ACM Transactions on Embedded Computing Systems , volume=
Adatest: Reinforcement learning and adaptive sampling for on-chip hardware trojan detection , author=. ACM Transactions on Embedded Computing Systems , volume=. 2023 , publisher=
2023
-
[43]
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=
Red teaming language models with language models , author=. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=
2022
-
[44]
Advances in Neural Information Processing Systems , year =
AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents , author =. Advances in Neural Information Processing Systems , year =
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.