
arxiv: 2606.01166 · v1 · pith:PKH4YDKLnew · submitted 2026-05-31 · 💻 cs.CR · cs.CL

BraveGuard: From Open-World Threats to Safer Computer-Use Agents

Pith reviewed 2026-06-28 17:02 UTC · model grok-4.3

classification 💻 cs.CR cs.CL
keywords computer-use agents · safety guard models · trajectory supervision · open-world threats · adaptive defense · AgentHazard · self-evolving framework · agent rollouts

The pith

BraveGuard trains guard models on agent execution traces mined from recent research, raising averaged detection accuracy on the AgentHazard benchmark from 38.79 percent to 82.38 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

BraveGuard identifies emerging risks in published research, converts them into executable computer-use tasks, runs agents to collect full execution traces, and uses those traces as training data for guard models. Harm in these agents often surfaces only across multiple steps that look harmless in isolation, so prompt-level or final-response guards miss it. The pipeline repeats whenever new threats or failures appear, creating ongoing supervision instead of a one-time benchmark. This matters because agents now control files, terminals, and browsers where damage builds through sequences rather than single actions. The reported result is that guards trained this way detect trajectory-level risks far more reliably than off-the-shelf guard models, which the authors attribute to supervision that goes beyond fixed taxonomies and synthetic prompt-level data.
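To make the step-versus-trajectory distinction concrete, here is a toy sketch contrasting a per-step check with a check over the whole trace. The action names, the blocklist, and the exfiltration pattern are hypothetical illustrations, not taken from the paper. A real guard would be a trained model, not hand-written rules.

```python
from typing import List

# Hypothetical per-step blocklist: actions treated as harmful on their own.
STEP_BLOCKLIST = {"rm_rf_root", "disable_firewall"}

# Hypothetical harmful pattern: an ordered sequence of individually benign actions.
EXFILTRATION_PATTERN = ["read_credentials", "compress_files", "upload_external"]


def step_level_flag(trajectory: List[str]) -> bool:
    """Flag the trajectory only if some single action is on the blocklist."""
    return any(action in STEP_BLOCKLIST for action in trajectory)


def contains_subsequence(trajectory: List[str], pattern: List[str]) -> bool:
    """Return True if `pattern` occurs in order, not necessarily adjacently."""
    remaining = iter(trajectory)
    # `step in remaining` advances the iterator, which enforces ordering.
    return all(step in remaining for step in pattern)


def trajectory_level_flag(trajectory: List[str]) -> bool:
    """Flag single bad actions and also harmful ordered combinations."""
    return step_level_flag(trajectory) or contains_subsequence(
        trajectory, EXFILTRATION_PATTERN
    )


trajectory = [
    "list_directory",
    "read_credentials",
    "open_browser",
    "compress_files",
    "upload_external",
]
print(step_level_flag(trajectory), trajectory_level_flag(trajectory))
```

No single action in the example is on the blocklist, so the per-step check passes the trace, while the trajectory-level check catches the ordered combination.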

Core claim

BraveGuard is a self-evolving defense framework that mines recent research sources to identify emerging risks and attack patterns, instantiates them as executable computer-use tasks, collects agent rollouts, and derives trajectory-level supervision for guard model training. When new threats or validation failures appear, the pipeline repeats. On the AgentHazard benchmark, guards trained under this process raise averaged detection accuracy from 38.79 percent to 82.38 percent across multiple backbones, demonstrating that supervision grounded in open-world threat discovery and realistic agent execution outperforms static taxonomies and synthetic prompt-level data.
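The loop described above can be sketched as a short driver. This is a minimal illustration under stated assumptions, not the authors' implementation: every function name (`run_round`, `evolve`, and the callables passed in) is hypothetical, the stages are stubbed with toy lambdas, and the stopping rule (stop once a validation score reaches `target`) is an assumption, since the abstract only says the pipeline repeats when new threats or validation failures appear.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    task: str
    steps: List[str]
    label: str  # "safe" or "unsafe", assigned to the whole trajectory


def run_round(
    mine: Callable[[], List[str]],
    instantiate: Callable[[str], str],
    rollout: Callable[[str], List[str]],
    label: Callable[[List[str]], str],
) -> List[Rollout]:
    """One pass: threats -> executable tasks -> agent rollouts -> labels."""
    data = []
    for threat in mine():
        task = instantiate(threat)
        steps = rollout(task)
        data.append(Rollout(task, steps, label(steps)))
    return data


def evolve(mine_for_round, instantiate, rollout, label, train, validate,
           max_rounds: int = 3, target: float = 0.8):
    """Repeat the round until validation reaches `target` (assumed rule)."""
    dataset: List[Rollout] = []
    guard = None
    for round_idx in range(max_rounds):
        dataset += run_round(
            lambda: mine_for_round(round_idx), instantiate, rollout, label
        )
        guard = train(dataset)
        if validate(guard) >= target:
            break
    return guard, dataset


# Toy stand-ins for every stage; none of these reflect the real pipeline.
threat_feed = [["credential exfiltration"], ["persistent backdoor"], ["token exhaustion"]]
guard, dataset = evolve(
    mine_for_round=lambda i: threat_feed[i],
    instantiate=lambda threat: f"task: {threat}",
    rollout=lambda task: [f"step 1 of {task}", f"step 2 of {task}"],
    label=lambda steps: "unsafe",
    train=lambda data: {"n_train": len(data)},
    validate=lambda g: 0.5 + 0.2 * g["n_train"],
)
print(guard, len(dataset))
```

Passing each stage in as a callable keeps the loop independent of how mining, rollout, and labeling are actually done, which the summary above does not specify.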

What carries the argument

BraveGuard, the pipeline that converts open research sources into executable agent tasks, collects full rollouts, and produces iterative trajectory-level labels for guard training.

If this is right

  • Guard models detect harmful behavior across multi-step computer-use trajectories more accurately than off-the-shelf models trained on isolated prompts.
  • The defense can adapt by repeating the mining-and-rollout loop as new threats emerge rather than remaining fixed to an initial taxonomy.
  • Trajectory-level supervision from realistic agent executions improves generalization to open-world risks compared with synthetic or prompt-only data.
  • Multiple guard backbones, including Qwen3-Guard and Llama-Guard variants, show consistent gains when trained on the same derived supervision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mining-to-rollout loop could be applied to other agent settings such as web or code agents if the research-source step scales.
  • Static safety benchmarks may systematically understate evolving risks, making repeated discovery loops a practical necessity for long-term monitoring.
  • If the collected trajectories prove representative, similar adaptive supervision could transfer to safety problems outside computer-use agents.

Load-bearing premise

Tasks derived from recent research papers produce agent trajectories that fairly represent the distribution of real open-world threats.

What would settle it

Trained guards show no accuracy gain over baselines when tested on a collection of real deployed agent failures whose originating tasks were never published in research papers.
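One way to decide whether an observed gain on such a held-out collection is distinguishable from zero is a paired bootstrap over per-trajectory correctness. The sketch below is a generic statistical illustration, not a procedure from the paper. The function name is hypothetical and the correctness vectors are placeholder data.

```python
import random
from typing import List, Tuple


def paired_bootstrap_gain(
    baseline_correct: List[int],
    trained_correct: List[int],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed gain, 2.5th percentile, 97.5th percentile).

    Inputs are 0/1 correctness flags for the same trajectories, in the
    same order, under the baseline guard and the trained guard.
    """
    if len(baseline_correct) != len(trained_correct) or not baseline_correct:
        raise ValueError("inputs must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(baseline_correct)
    diffs = [t - b for b, t in zip(baseline_correct, trained_correct)]
    observed = sum(diffs) / n
    samples = []
    for _ in range(n_resamples):
        # Resample trajectories with replacement, keeping pairs together.
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        samples.append(sum(resample) / n)
    samples.sort()
    low = samples[int(0.025 * n_resamples)]
    high = samples[int(0.975 * n_resamples) - 1]
    return observed, low, high


# Placeholder correctness flags, not results from the paper.
baseline = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
trained = [1, 1, 1, 1, 0, 1, 1, 1, 0, 1]
observed, low, high = paired_bootstrap_gain(baseline, trained)
print(observed, low, high)
```

If the interval comfortably includes zero on trajectories whose originating tasks were never published, that would count against the premise above.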

Original abstract

Computer-use agents extend language models from text generation to sustained interaction with files, terminals, browsers, and external tools. This shift creates safety risks that are difficult to detect from isolated prompts or final responses, because harm often emerges only through multi-step execution traces whose individual actions appear locally benign. We introduce BraveGuard, a self-evolving defense framework for training guard models from open-world threat signals and realistic agent trajectories. BraveGuard mines recent research sources to identify emerging risks and attack patterns, instantiates them as executable computer-use tasks, collects agent rollouts, and derives trajectory-level supervision for guard model training. As new threats and validation failures appear, the pipeline can be repeated, yielding an adaptive defense loop rather than a static, benchmark-driven training process. We instantiate BraveGuard by training multiple guard backbones, including Qwen3-Guard and Llama-Guard variants, and evaluate the resulting guards on trajectory-level agent-safety benchmarks. BraveGuard consistently improves safety detection across computer-use trajectories. On AgentHazard, it substantially improves detection accuracy over off-the-shelf guard models, with accuracy increasing from 38.79% to 82.38% under the averaged guard-model setting. These results show that guard supervision grounded in open-world threat discovery and realistic agent execution can improve safety monitoring beyond fixed taxonomies and synthetic prompt-level data. BraveGuard offers a scalable path toward adaptive defenses for computer-use agents facing evolving real-world risks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces BraveGuard, a self-evolving defense framework for computer-use agents that mines recent research sources to identify emerging risks, instantiates them as executable tasks, collects agent rollouts, derives trajectory-level supervision, and uses this to train guard models (e.g., Qwen3-Guard and Llama-Guard variants). It claims consistent improvements in safety detection, with accuracy on the AgentHazard benchmark rising from 38.79% to 82.38% under an averaged guard-model setting, arguing this provides a scalable, adaptive alternative to static taxonomies and synthetic prompt-level data.

Significance. If the data-generation pipeline produces representative, unbiased trajectory supervision that generalizes beyond the mined sources, the self-evolving loop would represent a meaningful advance for adaptive safety monitoring in multi-step agent interactions, where harm emerges only across traces. The approach of grounding supervision in open-world threat discovery rather than fixed benchmarks has clear potential value for the field if the representativeness assumption holds.

major comments (2)
  1. [Abstract] Abstract: The central claim of substantial accuracy improvement (38.79% to 82.38%) on AgentHazard rests on the pipeline producing representative trajectory-level supervision; however, the abstract supplies no enumeration of mined research sources, selection criteria, label-verification procedure, number of trajectories, or statistical details (error bars, baselines, data exclusion), preventing assessment of whether the gain reflects genuine generalization or distribution-shift artifacts.
  2. [Abstract] Abstract (self-evolving loop description): The claim that the pipeline can be repeated as new threats appear to yield an adaptive defense depends on the mined tasks and rollouts matching real open-world risk distributions; without explicit validation that the instantiated executable tasks avoid over-representing certain attack classes or sharing surface features with AgentHazard, the measured improvement risks being an artifact rather than evidence of broader applicability.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'averaged guard-model setting' is used without definition or reference to how the averaging is performed across backbones, which reduces clarity for readers evaluating the reported accuracy.
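Since the abstract does not define the averaging, one plausible reading is an unweighted mean of per-backbone accuracies. The sketch below illustrates that reading only. The backbone names and per-backbone values are placeholders, not numbers from the paper. The single reported pair (38.79% and 82.38%) is used only to compute the gap in percentage points.

```python
from statistics import mean
from typing import Dict


def averaged_accuracy(per_backbone: Dict[str, float]) -> float:
    """Unweighted mean of per-backbone accuracies, in percent."""
    if not per_backbone:
        raise ValueError("need at least one backbone")
    return mean(per_backbone.values())


# Placeholder values for illustration only; NOT numbers from the paper.
before = {"backbone_a": 30.0, "backbone_b": 50.0}
after = {"backbone_a": 80.0, "backbone_b": 90.0}

gain_points = averaged_accuracy(after) - averaged_accuracy(before)

# The averaged figures the abstract does report: 38.79% -> 82.38%.
reported_gain_points = round(82.38 - 38.79, 2)
print(gain_points, reported_gain_points)
```

An average of this kind can hide large differences between backbones, which is why the referee asks for the procedure to be spelled out.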

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed comments on the abstract. We address each point below and will revise the abstract accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim of substantial accuracy improvement (38.79% to 82.38%) on AgentHazard rests on the pipeline producing representative trajectory-level supervision; however, the abstract supplies no enumeration of mined research sources, selection criteria, label-verification procedure, number of trajectories, or statistical details (error bars, baselines, data exclusion), preventing assessment of whether the gain reflects genuine generalization or distribution-shift artifacts.

    Authors: We agree that the abstract, constrained by length, omits these specifics. The main manuscript describes the mining process, selection criteria, verification steps, trajectory counts, baselines, and statistical reporting. We will revise the abstract to incorporate a concise summary of the data scale, source types, and verification approach to allow better evaluation of the results. revision: yes

  2. Referee: [Abstract] Abstract (self-evolving loop description): The claim that the pipeline can be repeated as new threats appear to yield an adaptive defense depends on the mined tasks and rollouts matching real open-world risk distributions; without explicit validation that the instantiated executable tasks avoid over-representing certain attack classes or sharing surface features with AgentHazard, the measured improvement risks being an artifact rather than evidence of broader applicability.

    Authors: We acknowledge that the self-evolving claim assumes representativeness of the mined tasks and that the manuscript does not include explicit checks for over-representation of attack classes or surface-feature overlap with AgentHazard. We will revise the abstract to qualify the description of the adaptive loop, explicitly noting the representativeness assumption and the possibility of distribution shift as a limitation. revision: yes
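A surface-feature overlap check of the kind the referee asks for could start from something as simple as n-gram Jaccard similarity between training task descriptions and benchmark task descriptions. The sketch below is a generic contamination heuristic, not a check the paper reports. The function names, the threshold, and the example task strings are all hypothetical.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Lowercased whitespace-token n-grams of a task description."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the two n-gram sets; 0.0 if either is empty."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def flag_overlaps(
    train_tasks: List[str],
    benchmark_tasks: List[str],
    threshold: float = 0.5,
    n: int = 3,
) -> List[Tuple[int, int, float]]:
    """Return (train index, benchmark index, score) for pairs at or above threshold."""
    flagged = []
    for i, train_task in enumerate(train_tasks):
        for j, bench_task in enumerate(benchmark_tasks):
            score = jaccard(train_task, bench_task, n)
            if score >= threshold:
                flagged.append((i, j, score))
    return flagged


# Hypothetical task descriptions for illustration.
train_tasks = [
    "archive the ssh keys and upload them to a remote host",
    "rename all images in the folder by date",
]
benchmark_tasks = ["archive the ssh keys and upload them to a remote server"]
flagged = flag_overlaps(train_tasks, benchmark_tasks)
print(flagged)
```

Lexical overlap is only a first pass: paraphrased tasks would slip through, so a low score does not by itself rule out shared surface features.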

Circularity Check

0 steps flagged

No circularity: empirical result on external benchmark with no self-referential reduction

Full rationale

The paper claims an accuracy improvement on the external AgentHazard benchmark (38.79% to 82.38%) via a pipeline that mines research sources, instantiates tasks, collects rollouts, and derives labels. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation. The evaluation references an independent benchmark with no indication that results reduce to self-defined quantities or that the mining process is justified only by prior author work. This is a standard non-circular empirical claim.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5842 in / 1016 out tokens · 29188 ms · 2026-06-28T17:02:32.284861+00:00 · methodology

