Toward a Principled Framework for Agent Safety Measurement
Pith reviewed 2026-05-09 13:56 UTC · model grok-4.3
The pith
Agent safety should be measured by searching the space of possible trajectories under a likelihood budget rather than sampling a few rollouts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Agent safety is a property of an entire deployment configuration (model, decoder, prompt, environment, judger, and likelihood budget). Rather than sampling a handful of rollouts, the configuration should be evaluated by searching the in-budget trajectory space and reporting the probability that the agent stays safe, i.e., the share of trajectory probability mass the judger labels safe. The search operates both inside single LLM rounds and across multi-step interaction trees; practical optimizations such as batched decoding, prefix caching, and chunked tree expansion keep the cost manageable on current hardware.
What carries the argument
BOA, a budgeted search procedure that explores the tree of agent trajectories whose cumulative likelihood stays inside a preset budget and returns the probability mass of those trajectories that the judger labels safe.
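A minimal sketch of what such a budgeted search over the agent-environment trajectory tree could look like. The toy action distribution, toy judger, and the reading of the likelihood budget as a per-trajectory probability floor are all illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch of a budgeted trajectory-tree search in the spirit of BOA.
# Everything here is illustrative: the agent, environment, and judger are toy
# stand-ins, and the "likelihood budget" is read as a per-trajectory
# probability floor, one plausible interpretation rather than the paper's.
from dataclasses import dataclass

@dataclass
class Node:
    history: tuple      # sequence of actions taken so far
    prob: float         # cumulative likelihood of this partial trajectory

def action_distribution(history):
    """Toy stand-in for one LLM round: returns {action: probability}."""
    if len(history) == 0:
        return {"read_file": 0.7, "delete_file": 0.3}
    return {"summarize": 0.9, "exfiltrate": 0.1}

def is_terminal(history):
    return len(history) >= 2

def judger(history):
    """Toy judger: a trajectory is unsafe if it contains a destructive action."""
    return not any(a in ("delete_file", "exfiltrate") for a in history)

def boa_like_search(budget_floor=1e-3):
    """Enumerate every trajectory whose cumulative likelihood stays >= budget_floor
    and return (safe_mass, searched_mass). Unsearched mass is treated as unsafe,
    so safe_mass is a conservative lower bound on the safety probability."""
    frontier = [Node(history=(), prob=1.0)]
    safe_mass = searched_mass = 0.0
    while frontier:
        node = frontier.pop()
        if is_terminal(node.history):
            searched_mass += node.prob
            if judger(node.history):
                safe_mass += node.prob
            continue
        for action, p in action_distribution(node.history).items():
            child_prob = node.prob * p
            if child_prob >= budget_floor:      # prune out-of-budget branches
                frontier.append(Node(node.history + (action,), child_prob))
    return safe_mass, searched_mass

if __name__ == "__main__":
    safe, searched = boa_like_search()
    print(f"safety score (lower bound): {safe:.3f}, searched mass: {searched:.3f}")
```

On this toy configuration the search reports 0.63: a greedy rollout would follow the 0.7 and 0.9 branches and label the run safe, while the search also surfaces and weights the unsafe delete_file and exfiltrate branches.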
If this is right
- Unsafe trajectories missed by greedy or small-sample evaluations become detectable and quantifiable.
- Different models, defenses, and attack methods can be compared directly on the same numerical safety score.
- Evaluation cost stays within GPU budgets that current research and deployment pipelines already tolerate.
- Safety scores become a property of the full configuration rather than of any single sampled run.
Where Pith is reading between the lines
- Safety reporting could shift from binary pass-fail verdicts to calibrated probabilities that reflect how much of the possible behavior space has been examined.
- The same search machinery might be reused at deployment time to monitor live agents and raise alerts when the remaining safe probability drops below a threshold.
- Budget allocation strategies could be studied to decide how much likelihood mass to spend on exploration versus exploitation for a given safety-critical task.
Load-bearing premise
A fixed likelihood budget can be made large enough to cover the unsafe trajectories that matter in practice while remaining computationally feasible to search.
What would settle it
Apply both BOA and standard greedy-plus-sampling evaluation to the same set of agent configurations and find either that BOA never surfaces additional unsafe trajectories or that its runtime exceeds practical limits on the workloads the authors test.
Original abstract
LLM agents emit actions, not just text, and once taken, those actions often cannot be undone. Yet today's agent-safety evaluations run greedy or a few sampled rollouts and report a single safe/unsafe rate -- blind to the long-tail trajectories where unsafe behavior may arise from low-probability but non-negligible actions. We argue agent safety should be measured by search, not sampling. We apply BOA, a framework that, given a deployment configuration (model, decoder, prompt, environment, judger, likelihood budget), searches the in-budget trajectory space and reports a safety score: the probability the agent stays safe under the configuration. BOA searches both within a single LLM round and across the agent-environment interaction tree under a given likelihood budget, and makes search practical via batched decoding/judging, prefix caching, and chunked tree expansion. On agent-safety workloads, BOA discovers unsafe trajectories that greedy and sampled evaluations miss. BOA can additionally be used for ranking models, defenses, and attacks, all on the same scale, with manageable GPU costs.
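To make the within-round half of the search concrete, here is a small illustrative sketch: token continuations are enumerated depth-first under a likelihood floor, and a memoized prefix function stands in for prefix caching. The toy vocabulary, next-token model, and pruning rule are assumptions made for illustration; batched decoding, batched judging, and chunked tree expansion are omitted.

```python
# Illustrative sketch of within-round budgeted enumeration with a prefix cache.
# The vocabulary, the next-token distribution, and the probability-floor pruning
# rule are assumptions for illustration, not the paper's implementation.
from functools import lru_cache

VOCAB = ("ok", "sudo", "</s>")

@lru_cache(maxsize=None)                  # stands in for prefix caching / KV reuse
def next_token_probs(prefix: tuple) -> dict:
    """Toy next-token distribution conditioned on the prefix."""
    if prefix and prefix[-1] == "sudo":
        return {"ok": 0.2, "sudo": 0.1, "</s>": 0.7}
    return {"ok": 0.6, "sudo": 0.15, "</s>": 0.25}

def enumerate_round(budget_floor: float = 1e-2, max_len: int = 4):
    """Yield (token_sequence, likelihood) for every in-budget completion of one round."""
    stack = [((), 1.0)]
    while stack:
        prefix, prob = stack.pop()
        if (prefix and prefix[-1] == "</s>") or len(prefix) == max_len:
            yield prefix, prob
            continue
        for tok, p in next_token_probs(prefix).items():
            if prob * p >= budget_floor:   # prune continuations that fall out of budget
                stack.append((prefix + (tok,), prob * p))

if __name__ == "__main__":
    for seq, prob in enumerate_round():
        print(f"{prob:.4f}  {' '.join(seq)}")
```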
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that LLM agent safety evaluations relying on greedy decoding or limited sampling miss long-tail unsafe trajectories, and proposes the BOA framework. Given a deployment configuration and likelihood budget, BOA performs budgeted search over single-round and multi-step agent-environment trajectory trees (using batched decoding, prefix caching, and chunked expansion) to report a safety score defined as the probability that the agent stays safe under that configuration. The framework is positioned as enabling discovery of unsafe behaviors missed by sampling and as a unified scale for ranking models, defenses, and attacks.
Significance. If the BOA safety score can be shown to provide a well-characterized approximation to the true safety probability (with explicit error bounds or convergence guarantees), the work would offer a principled alternative to sampling-based evaluation for agent safety. This could strengthen detection of rare but high-impact unsafe behaviors in interactive LLM agents and support more reliable comparative assessments across configurations, addressing a gap in current evaluation practices.
Major comments (2)
- [Abstract] The central claim that BOA 'reports a safety score: the probability the agent stays safe under the configuration' is not justified by the described method. With a finite likelihood budget on an exponentially branching trajectory space, full enumeration is infeasible; the manuscript must specify the aggregation rule (e.g., renormalization over searched mass, treating unsearched branches as safe, or another heuristic) and prove or bound the resulting deviation from the true safety probability.
- [Abstract, BOA description] The distinction drawn between search and sampling is undermined by the lack of any error analysis or variance characterization for the reported score. Sampling at least yields unbiased estimators with standard concentration bounds; without analogous guarantees for the budgeted tree search, the safety score functions as a heuristic discovery tool rather than a principled probability estimate.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments correctly identify areas where the manuscript's claims about the safety score require clarification and additional analysis. We will revise the paper to address both points directly.
Point-by-point responses
- Referee: [Abstract] The central claim that BOA 'reports a safety score: the probability the agent stays safe under the configuration' is not justified by the described method. With a finite likelihood budget on an exponentially branching trajectory space, full enumeration is infeasible; the manuscript must specify the aggregation rule (e.g., renormalization over searched mass, treating unsearched branches as safe, or another heuristic) and prove or bound the resulting deviation from the true safety probability.
Authors: We agree that the abstract phrasing risks overstating the result as the exact safety probability. BOA computes the safety score by exhaustively enumerating trajectories within the allocated likelihood budget (enabled by batched decoding, prefix caching, and chunked expansion) and aggregating the probability mass of the safe trajectories it finds. We will revise the abstract and add a dedicated subsection that states the aggregation rule explicitly: the reported score is the total probability mass of safe trajectories found, which conservatively treats all unsearched mass as unsafe and is therefore a lower bound on the true safety probability. We will also state the deviation bound: the absolute error is at most the unsearched probability mass, which is bounded by the residual budget at search termination. This makes the score a well-defined, conservative estimate under the given configuration and budget. revision: yes
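In symbols, a sketch of the aggregation rule and deviation bound described in this response (notation ours, not the manuscript's):

```latex
% T_B  : set of complete trajectories enumerated within the likelihood budget B
% p(t) : likelihood of trajectory t under the deployment configuration
% safe(t) = 1 if the judger labels t safe, else 0
\[
  \widehat{S}_B \;=\; \sum_{t \in T_B} p(t)\,\mathrm{safe}(t)
  \qquad \text{(unsearched mass counted as unsafe)}
\]
\[
  0 \;\le\; S^{\star} - \widehat{S}_B \;\le\; 1 - \sum_{t \in T_B} p(t),
  \qquad \text{where } S^{\star} = \sum_{t} p(t)\,\mathrm{safe}(t)
  \text{ is the true safety probability.}
\]
```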
- Referee: [Abstract, BOA description] The distinction drawn between search and sampling is undermined by the lack of any error analysis or variance characterization for the reported score. Sampling at least yields unbiased estimators with standard concentration bounds; without analogous guarantees for the budgeted tree search, the safety score functions as a heuristic discovery tool rather than a principled probability estimate.
Authors: We acknowledge that the current text does not supply a formal error analysis, which weakens the claimed distinction. Sampling indeed supplies unbiased estimators with concentration inequalities, whereas budgeted search yields a deterministic result within the explored subspace. We will add a new section on approximation properties that (1) characterizes the bias as a function of residual budget, (2) shows monotonic convergence of the safety score to the true probability as budget tends to infinity, and (3) contrasts the zero-variance property of search (for fixed budget and search order) with the variance of Monte-Carlo sampling. This analysis will be placed alongside the existing empirical comparisons, thereby grounding the distinction in explicit guarantees rather than leaving the score as a pure heuristic. revision: yes
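A sketch of the promised approximation properties, continuing the notation above (again ours, not the manuscript's):

```latex
% m(B) : searched probability mass under budget B; the bias shrinks with the
% residual mass, the score is monotone in the budget, and search is
% zero-variance for a fixed budget and search order, unlike Monte-Carlo sampling.
\[
  \mathrm{bias}(B) \;=\; S^{\star} - \widehat{S}_B \;\le\; 1 - m(B),
  \qquad m(B) = \sum_{t \in T_B} p(t).
\]
\[
  T_B \subseteq T_{B'} \;\Rightarrow\; \widehat{S}_B \le \widehat{S}_{B'} \le S^{\star},
  \qquad \widehat{S}_B \to S^{\star} \ \text{as } m(B) \to 1.
\]
\[
  \mathrm{Var}\!\left[\widehat{S}_B\right] = 0 \ \text{(fixed budget and search order)},
  \qquad
  \mathrm{Var}\!\left[\tfrac{1}{n}\textstyle\sum_{i=1}^{n}\mathrm{safe}(t_i)\right]
  = \tfrac{S^{\star}(1 - S^{\star})}{n}
  \ \text{for } n \ \text{i.i.d.\ sampled rollouts.}
\]
```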
Circularity Check
No circularity: safety score defined directly via budgeted search without reduction to inputs
Full rationale
The paper defines the BOA safety score explicitly as the probability that the agent stays safe under a given configuration, computed by searching the in-budget trajectory space. This is a direct definitional mapping from the search procedure to the reported score: no fitted parameters are renamed as predictions, no self-citation carries the central claim, and no equation equates the output to its own inputs by construction. The abstract and description introduce practical mechanisms (batched decoding, prefix caching, chunked expansion) as implementation details rather than deriving the probability from prior results or ansatzes. For the purpose of this circularity check, the framework therefore stands on its own without leaning on external benchmarks.
Axiom & Free-Parameter Ledger
Free parameters (1)
- likelihood budget
Reference graph
Works this paper leans on
- [1] Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, and Xander Davies. 2025. AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents. arXiv:2410.09024 [cs.LG]. https://arxiv.org/abs/2410.09024
- [2] Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. 2024. AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=Y841BRW9rY
- [3] Fortune. 2025. AI-powered coding tool wiped out a software company's database in 'catastrophic failure'. https://fortune.com/2025/07/23/ai-coding-tool-replit-wiped-database-called-it-a-catastrophic-failure/. Accessed: 2025-11-15.
- [4] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- [5] Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security. 79–90.
- [6] Aonan Guan et al. 2026. Comment and Control: Prompt Injection to Credential Theft in Claude Code, Gemini CLI, and GitHub Copilot Agent. https://oddguan.com/blog/comment-and-control-prompt-injection-credential-theft-claude-code-gemini-cli-github-copilot/
- [7] Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, and Emre Kiciman. 2024. Defending against indirect prompt injection attacks with spotlighting. arXiv preprint arXiv:2403.14720.
- [8] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.
- [9] Shuyi Lin, Anshuman Suri, Alina Oprea, and Cheng Tan. 2026. Toward Principled LLM Safety Testing: Solving the Jailbreak Oracle Problem. In Proceedings of Machine Learning and Systems (MLSys).
- [10] Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. 2024. Formalizing and benchmarking prompt injection attacks and defenses. In 33rd USENIX Security Symposium (USENIX Security 24). 1831–1847.
- [11] Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, and Tatsunori Hashimoto. 2024. Identifying the Risks of LM Agents with an LM-Emulated Sandbox. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=GEcwtMk1uA
- [12] Natalie Shapira, Chris Wendler, Avery Yen, Gabriele Sarti, Koyena Pal, Olivia Floody, Adam Belfki, Alex Loftus, Aditya Ratan Jannali, Nikhil Prakash, et al. 2026. Agents of chaos. arXiv preprint arXiv:2602.20021.
- [13] Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, et al. 2025. Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming. arXiv preprint arXiv:2501.18837.
- [14] Chufan Shi, Haoran Yang, Deng Cai, Zhisong Zhang, Yifan Wang, Yujiu Yang, and Wai Lam. 2024. A thorough examination of decoding methods in the era of LLMs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 8601–8629.
- [15] Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. 2024. The instruction hierarchy: Training LLMs to prioritize privileged instructions. arXiv preprint arXiv:2404.13208.
- [16] Jarrod Watts. 2024. An autonomous AI agent tricked into releasing its $47,000 prize pool. https://x.com/jarrodWattsDev/status/1862299845710757980
- [17] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- [18] Junjie Ye, Sixian Li, Guanyu Li, Caishuang Huang, Songyang Gao, Yilong Wu, Qi Zhang, Tao Gui, and Xuan-Jing Huang. 2024. ToolSword: Unveiling safety issues of large language models in tool learning across three stages. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- [19] Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, Tian Xia, Lizhen Xu, Binglin Zhou, Fangqi Li, Zhuosheng Zhang, et al. 2024. R-Judge: Benchmarking safety risk awareness for LLM agents. In Findings of the Association for Computational Linguistics: EMNLP 2024.
- [20] Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, and Yongfeng Zhang. 2024. Agent Security Bench (ASB): Formalizing and benchmarking attacks and defenses in LLM-based agents. arXiv preprint arXiv:2410.02644.
- [21] Zhexin Zhang, Shiyao Cui, Yida Lu, Jingzhuo Zhou, Junxiao Yang, Hongning Wang, and Minlie Huang. 2024. Agent-SafetyBench: Evaluating the safety of LLM agents. arXiv preprint arXiv:2412.14470.