Toward a Principled Framework for Agent Safety Measurement
Pith reviewed 2026-05-09 13:56 UTC · model grok-4.3
The pith
Agent safety should be measured by searching the space of possible trajectories under a likelihood budget rather than sampling a few rollouts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Agent safety is a property of an entire deployment configuration (model, decoder, prompt, environment, judger, and likelihood budget). Rather than sampling a handful of rollouts, the configuration should be evaluated by searching the in-budget trajectory space and reporting the probability that the agent stays safe, i.e., the share of trajectory probability mass the judger labels safe. The search operates both inside single LLM rounds and across multi-step interaction trees; practical optimizations such as batched decoding, prefix caching, and chunked tree expansion keep the cost manageable on current hardware.
What carries the argument
BOA, a budgeted search procedure that explores the tree of agent trajectories whose cumulative likelihood stays inside a preset budget and returns the probability mass of those trajectories that the judger labels safe.
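A minimal sketch of what such a budgeted search over the agent-environment trajectory tree could look like. The toy action distribution, toy judger, and the reading of the likelihood budget as a per-trajectory probability floor are all illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch of a budgeted trajectory-tree search in the spirit of BOA.
# Everything here is illustrative: the agent, environment, and judger are toy
# stand-ins, and the "likelihood budget" is read as a per-trajectory
# probability floor, one plausible interpretation rather than the paper's.
from dataclasses import dataclass

@dataclass
class Node:
    history: tuple      # sequence of actions taken so far
    prob: float         # cumulative likelihood of this partial trajectory

def action_distribution(history):
    """Toy stand-in for one LLM round: returns {action: probability}."""
    if len(history) == 0:
        return {"read_file": 0.7, "delete_file": 0.3}
    return {"summarize": 0.9, "exfiltrate": 0.1}

def is_terminal(history):
    return len(history) >= 2

def judger(history):
    """Toy judger: a trajectory is unsafe if it contains a destructive action."""
    return not any(a in ("delete_file", "exfiltrate") for a in history)

def boa_like_search(budget_floor=1e-3):
    """Enumerate every trajectory whose cumulative likelihood stays >= budget_floor
    and return (safe_mass, searched_mass). Unsearched mass is treated as unsafe,
    so safe_mass is a conservative lower bound on the safety probability."""
    frontier = [Node(history=(), prob=1.0)]
    safe_mass = searched_mass = 0.0
    while frontier:
        node = frontier.pop()
        if is_terminal(node.history):
            searched_mass += node.prob
            if judger(node.history):
                safe_mass += node.prob
            continue
        for action, p in action_distribution(node.history).items():
            child_prob = node.prob * p
            if child_prob >= budget_floor:      # prune out-of-budget branches
                frontier.append(Node(node.history + (action,), child_prob))
    return safe_mass, searched_mass

if __name__ == "__main__":
    safe, searched = boa_like_search()
    print(f"safety score (lower bound): {safe:.3f}, searched mass: {searched:.3f}")
```

On this toy configuration the search reports 0.63: a greedy rollout would follow the 0.7 and 0.9 branches and label the run safe, while the search also surfaces and weights the unsafe delete_file and exfiltrate branches.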
If this is right
- Unsafe trajectories missed by greedy or small-sample evaluations become detectable and quantifiable.
- Different models, defenses, and attack methods can be compared directly on the same numerical safety score.
- Evaluation cost stays within GPU budgets that current research and deployment pipelines already tolerate.
- Safety scores become a property of the full configuration rather than of any single sampled run.
Where Pith is reading between the lines
- Safety reporting could shift from binary pass-fail verdicts to calibrated probabilities that reflect how much of the possible behavior space has been examined.
- The same search machinery might be reused at deployment time to monitor live agents and raise alerts when the remaining safe probability drops below a threshold.
- Budget allocation strategies could be studied to decide how much likelihood mass to spend on exploration versus exploitation for a given safety-critical task.
Load-bearing premise
A fixed likelihood budget can be made large enough to cover the unsafe trajectories that matter in practice while remaining computationally feasible to search.
What would settle it
Apply both BOA and standard greedy-plus-sampling evaluation to the same set of agent configurations and find either that BOA never surfaces additional unsafe trajectories or that its runtime exceeds practical limits on the workloads the authors test.
Original abstract
LLM agents emit actions, not just text, and once taken, those actions often cannot be undone. Yet today's agent-safety evaluations run greedy or a few sampled rollouts and report a single safe/unsafe rate -- blind to the long-tail trajectories where unsafe behavior may arise from low-probability but non-negligible actions. We argue agent safety should be measured by search, not sampling. We apply BOA, a framework that, given a deployment configuration (model, decoder, prompt, environment, judger, likelihood budget), searches the in-budget trajectory space and reports a safety score: the probability the agent stays safe under the configuration. BOA searches both within a single LLM round and across the agent-environment interaction tree under a given likelihood budget, and makes search practical via batched decoding/judging, prefix caching, and chunked tree expansion. On agent-safety workloads, BOA discovers unsafe trajectories that greedy and sampled evaluations miss. BOA can additionally be used for ranking models, defenses, and attacks, all on the same scale, with manageable GPU costs.
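To make the within-round half of the search concrete, here is a small illustrative sketch: token continuations are enumerated depth-first under a likelihood floor, and a memoized prefix function stands in for prefix caching. The toy vocabulary, next-token model, and pruning rule are assumptions made for illustration; batched decoding, batched judging, and chunked tree expansion are omitted.

```python
# Illustrative sketch of within-round budgeted enumeration with a prefix cache.
# The vocabulary, the next-token distribution, and the probability-floor pruning
# rule are assumptions for illustration, not the paper's implementation.
from functools import lru_cache

VOCAB = ("ok", "sudo", "</s>")

@lru_cache(maxsize=None)                  # stands in for prefix caching / KV reuse
def next_token_probs(prefix: tuple) -> dict:
    """Toy next-token distribution conditioned on the prefix."""
    if prefix and prefix[-1] == "sudo":
        return {"ok": 0.2, "sudo": 0.1, "</s>": 0.7}
    return {"ok": 0.6, "sudo": 0.15, "</s>": 0.25}

def enumerate_round(budget_floor: float = 1e-2, max_len: int = 4):
    """Yield (token_sequence, likelihood) for every in-budget completion of one round."""
    stack = [((), 1.0)]
    while stack:
        prefix, prob = stack.pop()
        if (prefix and prefix[-1] == "</s>") or len(prefix) == max_len:
            yield prefix, prob
            continue
        for tok, p in next_token_probs(prefix).items():
            if prob * p >= budget_floor:   # prune continuations that fall out of budget
                stack.append((prefix + (tok,), prob * p))

if __name__ == "__main__":
    for seq, prob in enumerate_round():
        print(f"{prob:.4f}  {' '.join(seq)}")
```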
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that LLM agent safety evaluations relying on greedy decoding or limited sampling miss long-tail unsafe trajectories, and proposes the BOA framework. Given a deployment configuration and likelihood budget, BOA performs budgeted search over single-round and multi-step agent-environment trajectory trees (using batched decoding, prefix caching, and chunked expansion) to report a safety score defined as the probability that the agent stays safe under that configuration. The framework is positioned as enabling discovery of unsafe behaviors missed by sampling and as a unified scale for ranking models, defenses, and attacks.
Significance. If the BOA safety score can be shown to provide a well-characterized approximation to the true safety probability (with explicit error bounds or convergence guarantees), the work would offer a principled alternative to sampling-based evaluation for agent safety. This could strengthen detection of rare but high-impact unsafe behaviors in interactive LLM agents and support more reliable comparative assessments across configurations, addressing a gap in current evaluation practices.
Major comments (2)
- [Abstract] The central claim that BOA 'reports a safety score: the probability the agent stays safe under the configuration' is not justified by the described method. With a finite likelihood budget on an exponentially branching trajectory space, full enumeration is infeasible; the manuscript must specify the aggregation rule (e.g., renormalization over searched mass, treating unsearched branches as safe, or another heuristic) and prove or bound the resulting deviation from the true safety probability.
- [Abstract, BOA description] The distinction drawn between search and sampling is undermined by the lack of any error analysis or variance characterization for the reported score. Sampling at least yields unbiased estimators with standard concentration bounds; without analogous guarantees for the budgeted tree search, the safety score functions as a heuristic discovery tool rather than a principled probability estimate.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments correctly identify areas where the manuscript's claims about the safety score require clarification and additional analysis. We will revise the paper to address both points directly.
Point-by-point responses
- Referee: [Abstract] The central claim that BOA 'reports a safety score: the probability the agent stays safe under the configuration' is not justified by the described method. With a finite likelihood budget on an exponentially branching trajectory space, full enumeration is infeasible; the manuscript must specify the aggregation rule (e.g., renormalization over searched mass, treating unsearched branches as safe, or another heuristic) and prove or bound the resulting deviation from the true safety probability.
Authors: We agree that the abstract phrasing risks overstating the result as the exact safety probability. BOA computes the safety score by exhaustively enumerating trajectories within the allocated likelihood budget (enabled by batched decoding, prefix caching, and chunked expansion) and aggregating the probability mass of the safe trajectories it finds. We will revise the abstract and add a dedicated subsection that states the aggregation rule explicitly: the reported score is the total probability mass of safe trajectories found, which conservatively treats all unsearched mass as unsafe and is therefore a lower bound on the true safety probability. We will also state the deviation bound: the absolute error is at most the unsearched probability mass, which is bounded by the residual budget at search termination. This makes the score a well-defined, conservative estimate under the given configuration and budget. revision: yes
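In symbols, a sketch of the aggregation rule and deviation bound described in this response (notation ours, not the manuscript's):

```latex
% T_B  : set of complete trajectories enumerated within the likelihood budget B
% p(t) : likelihood of trajectory t under the deployment configuration
% safe(t) = 1 if the judger labels t safe, else 0
\[
  \widehat{S}_B \;=\; \sum_{t \in T_B} p(t)\,\mathrm{safe}(t)
  \qquad \text{(unsearched mass counted as unsafe)}
\]
\[
  0 \;\le\; S^{\star} - \widehat{S}_B \;\le\; 1 - \sum_{t \in T_B} p(t),
  \qquad \text{where } S^{\star} = \sum_{t} p(t)\,\mathrm{safe}(t)
  \text{ is the true safety probability.}
\]
```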
- Referee: [Abstract, BOA description] The distinction drawn between search and sampling is undermined by the lack of any error analysis or variance characterization for the reported score. Sampling at least yields unbiased estimators with standard concentration bounds; without analogous guarantees for the budgeted tree search, the safety score functions as a heuristic discovery tool rather than a principled probability estimate.
Authors: We acknowledge that the current text does not supply a formal error analysis, which weakens the claimed distinction. Sampling indeed supplies unbiased estimators with concentration inequalities, whereas budgeted search yields a deterministic result within the explored subspace. We will add a new section on approximation properties that (1) characterizes the bias as a function of residual budget, (2) shows monotonic convergence of the safety score to the true probability as budget tends to infinity, and (3) contrasts the zero-variance property of search (for fixed budget and search order) with the variance of Monte-Carlo sampling. This analysis will be placed alongside the existing empirical comparisons, thereby grounding the distinction in explicit guarantees rather than leaving the score as a pure heuristic. revision: yes
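A sketch of the promised approximation properties, continuing the notation above (again ours, not the manuscript's):

```latex
% m(B) : searched probability mass under budget B; the bias shrinks with the
% residual mass, the score is monotone in the budget, and search is
% zero-variance for a fixed budget and search order, unlike Monte-Carlo sampling.
\[
  \mathrm{bias}(B) \;=\; S^{\star} - \widehat{S}_B \;\le\; 1 - m(B),
  \qquad m(B) = \sum_{t \in T_B} p(t).
\]
\[
  T_B \subseteq T_{B'} \;\Rightarrow\; \widehat{S}_B \le \widehat{S}_{B'} \le S^{\star},
  \qquad \widehat{S}_B \to S^{\star} \ \text{as } m(B) \to 1.
\]
\[
  \mathrm{Var}\!\left[\widehat{S}_B\right] = 0 \ \text{(fixed budget and search order)},
  \qquad
  \mathrm{Var}\!\left[\tfrac{1}{n}\textstyle\sum_{i=1}^{n}\mathrm{safe}(t_i)\right]
  = \tfrac{S^{\star}(1 - S^{\star})}{n}
  \ \text{for } n \ \text{i.i.d.\ sampled rollouts.}
\]
```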
Circularity Check
No circularity: safety score defined directly via budgeted search without reduction to inputs
Full rationale
The paper defines the BOA safety score explicitly as the probability that the agent stays safe under a given configuration, computed by searching the in-budget trajectory space. This is a direct definitional mapping from the search procedure to the reported score: no fitted parameters are renamed as predictions, no self-citation carries the central claim, and no equation equates the output to its own inputs by construction. The abstract and description introduce practical mechanisms (batched decoding, prefix caching, chunked expansion) as implementation details rather than deriving the probability from prior results or ansatzes. For the purpose of this circularity check, the framework therefore stands on its own without leaning on external benchmarks.
Axiom & Free-Parameter Ledger
Free parameters (1)
- likelihood budget
Reference graph
Works this paper leans on
- [1] Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, and Xander Davies. 2025. AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents. arXiv:2410.09024 [cs.LG]. https://arxiv.org/abs/2410.09024
- [2] Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. 2024. AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=Y841BRW9rY
- [3] Fortune. 2025. AI-powered coding tool wiped out a software company's database in 'catastrophic failure'. https://fortune.com/2025/07/23/ai-coding-tool-replit-wiped-database-called-it-a-catastrophic-failure/. Accessed: 2025-11-15.
- [4] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- [5] Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security. 79–90.
- [6] Aonan Guan et al. 2026. Comment and Control: Prompt Injection to Credential Theft in Claude Code, Gemini CLI, and GitHub Copilot Agent. https://oddguan.com/blog/comment-and-control-prompt-injection-credential-theft-claude-code-gemini-cli-github-copilot/
- [7] Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, and Emre Kiciman. 2024. Defending against indirect prompt injection attacks with spotlighting. arXiv preprint arXiv:2403.14720.
- [8] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.
- [9] Shuyi Lin, Anshuman Suri, Alina Oprea, and Cheng Tan. 2026. Toward Principled LLM Safety Testing: Solving the Jailbreak Oracle Problem. In Proceedings of Machine Learning and Systems (MLSys).
- [10] Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. 2024. Formalizing and benchmarking prompt injection attacks and defenses. In 33rd USENIX Security Symposium (USENIX Security 24). 1831–1847.
- [11] Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, and Tatsunori Hashimoto. 2024. Identifying the Risks of LM Agents with an LM-Emulated Sandbox. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=GEcwtMk1uA
- [12] Natalie Shapira, Chris Wendler, Avery Yen, Gabriele Sarti, Koyena Pal, Olivia Floody, Adam Belfki, Alex Loftus, Aditya Ratan Jannali, Nikhil Prakash, et al. 2026. Agents of chaos. arXiv preprint arXiv:2602.20021.
- [13] Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, et al. 2025. Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming. arXiv preprint arXiv:2501.18837.
- [14] Chufan Shi, Haoran Yang, Deng Cai, Zhisong Zhang, Yifan Wang, Yujiu Yang, and Wai Lam. 2024. A thorough examination of decoding methods in the era of LLMs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 8601–8629.
- [15] Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. 2024. The instruction hierarchy: Training LLMs to prioritize privileged instructions. arXiv preprint arXiv:2404.13208.
- [16] Jarrod Watts. 2024. An autonomous AI agent tricked into releasing its $47,000 prize pool. https://x.com/jarrodWattsDev/status/1862299845710757980
- [17] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- [18] Junjie Ye, Sixian Li, Guanyu Li, Caishuang Huang, Songyang Gao, Yilong Wu, Qi Zhang, Tao Gui, and Xuan-Jing Huang. 2024. ToolSword: Unveiling safety issues of large language models in tool learning across three stages. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- [19] Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, Tian Xia, Lizhen Xu, Binglin Zhou, Fangqi Li, Zhuosheng Zhang, et al. 2024. R-Judge: Benchmarking safety risk awareness for LLM agents. In Findings of the Association for Computational Linguistics: EMNLP 2024.
- [20] Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, and Yongfeng Zhang. 2024. Agent Security Bench (ASB): Formalizing and benchmarking attacks and defenses in LLM-based agents. arXiv preprint arXiv:2410.02644.
- [21] Zhexin Zhang, Shiyao Cui, Yida Lu, Jingzhuo Zhou, Junxiao Yang, Hongning Wang, and Minlie Huang. 2024. Agent-SafetyBench: Evaluating the safety of LLM agents. arXiv preprint arXiv:2412.14470.