BraveGuard: From Open-World Threats to Safer Computer-Use Agents
Pith reviewed 2026-06-28 17:02 UTC · model grok-4.3
The pith
BraveGuard trains guard models on agent execution traces mined from recent research, raising detection accuracy on computer-use safety benchmarks from 38.79 percent to 82.38 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BraveGuard is a self-evolving defense framework that mines recent research sources to identify emerging risks and attack patterns, instantiates them as executable computer-use tasks, collects agent rollouts, and derives trajectory-level supervision for guard model training. When new threats or validation failures appear, the pipeline repeats. On the AgentHazard benchmark, guards trained under this process raise averaged detection accuracy from 38.79 percent to 82.38 percent across multiple backbones, demonstrating that supervision grounded in open-world threat discovery and realistic agent execution outperforms static taxonomies and synthetic prompt-level data.
What carries the argument
BraveGuard, the pipeline that converts open research sources into executable agent tasks, collects full rollouts, and produces iterative trajectory-level labels for guard training.
If this is right
- Guard models detect harmful behavior across multi-step computer-use trajectories more accurately than off-the-shelf models trained on isolated prompts.
- The defense can adapt by repeating the mining-and-rollout loop as new threats emerge rather than remaining fixed to an initial taxonomy.
- Trajectory-level supervision from realistic agent executions improves generalization to open-world risks compared with synthetic or prompt-only data.
- Multiple guard backbones, including Qwen3-Guard and Llama-Guard variants, show consistent gains when trained on the same derived supervision.
Where Pith is reading between the lines
- The same mining-to-rollout loop could be applied to other agent settings such as web or code agents if the research-source step scales.
- Static safety benchmarks may systematically understate evolving risks, making repeated discovery loops a practical necessity for long-term monitoring.
- If the collected trajectories prove representative, similar adaptive supervision could transfer to safety problems outside computer-use agents.
Load-bearing premise
Tasks derived from recent research papers produce agent trajectories that fairly represent the distribution of real open-world threats.
What would settle it
Trained guards show no accuracy gain over baselines when tested on a collection of real deployed agent failures whose originating tasks were never published in research papers.
read the original abstract
Computer-use agents extend language models from text generation to sustained interaction with files, terminals, browsers, and external tools. This shift creates safety risks that are difficult to detect from isolated prompts or final responses, because harm often emerges only through multi-step execution traces whose individual actions appear locally benign. We introduce BraveGuard, a self-evolving defense framework for training guard models from open-world threat signals and realistic agent trajectories. BraveGuard mines recent research sources to identify emerging risks and attack patterns, instantiates them as executable computer-use tasks, collects agent rollouts, and derives trajectory-level supervision for guard model training. As new threats and validation failures appear, the pipeline can be repeated, yielding an adaptive defense loop rather than a static, benchmark-driven training process. We instantiate BraveGuard by training multiple guard backbones, including Qwen3-Guard and Llama-Guard variants, and evaluate the resulting guards on trajectory-level agent-safety benchmarks. BraveGuard consistently improves safety detection across computer-use trajectories. On AgentHazard, it substantially improves detection accuracy over off-the-shelf guard models, with accuracy increasing from 38.79% to 82.38% under the averaged guard-model setting. These results show that guard supervision grounded in open-world threat discovery and realistic agent execution can improve safety monitoring beyond fixed taxonomies and synthetic prompt-level data. BraveGuard offers a scalable path toward adaptive defenses for computer-use agents facing evolving real-world risks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BraveGuard, a self-evolving defense framework for computer-use agents that mines recent research sources to identify emerging risks, instantiates them as executable tasks, collects agent rollouts, derives trajectory-level supervision, and uses this to train guard models (e.g., Qwen3-Guard and Llama-Guard variants). It claims consistent improvements in safety detection, with accuracy on the AgentHazard benchmark rising from 38.79% to 82.38% under an averaged guard-model setting, arguing this provides a scalable, adaptive alternative to static taxonomies and synthetic prompt-level data.
Significance. If the data-generation pipeline produces representative, unbiased trajectory supervision that generalizes beyond the mined sources, the self-evolving loop would represent a meaningful advance for adaptive safety monitoring in multi-step agent interactions, where harm emerges only across traces. The approach of grounding supervision in open-world threat discovery rather than fixed benchmarks has clear potential value for the field if the representativeness assumption holds.
major comments (2)
- [Abstract] Abstract: The central claim of substantial accuracy improvement (38.79% o 82.38%) on AgentHazard rests on the pipeline producing representative trajectory-level supervision; however, the abstract supplies no enumeration of mined research sources, selection criteria, label-verification procedure, number of trajectories, or statistical details (error bars, baselines, data exclusion), preventing assessment of whether the gain reflects genuine generalization or distribution-shift artifacts.
- [Abstract] Abstract (self-evolving loop description): The claim that the pipeline can be repeated as new threats appear to yield an adaptive defense depends on the mined tasks and rollouts matching real open-world risk distributions; without explicit validation that the instantiated executable tasks avoid over-representing certain attack classes or sharing surface features with AgentHazard, the measured improvement risks being an artifact rather than evidence of broader applicability.
minor comments (1)
- [Abstract] Abstract: The phrase 'averaged guard-model setting' is used without definition or reference to how the averaging is performed across backbones, which reduces clarity for readers evaluating the reported accuracy.
Simulated Author's Rebuttal
We thank the referee for the detailed comments on the abstract. We address each point below and will revise the abstract accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim of substantial accuracy improvement (38.79% o 82.38%) on AgentHazard rests on the pipeline producing representative trajectory-level supervision; however, the abstract supplies no enumeration of mined research sources, selection criteria, label-verification procedure, number of trajectories, or statistical details (error bars, baselines, data exclusion), preventing assessment of whether the gain reflects genuine generalization or distribution-shift artifacts.
Authors: We agree that the abstract, constrained by length, omits these specifics. The main manuscript describes the mining process, selection criteria, verification steps, trajectory counts, baselines, and statistical reporting. We will revise the abstract to incorporate a concise summary of the data scale, source types, and verification approach to allow better evaluation of the results. revision: yes
-
Referee: [Abstract] Abstract (self-evolving loop description): The claim that the pipeline can be repeated as new threats appear to yield an adaptive defense depends on the mined tasks and rollouts matching real open-world risk distributions; without explicit validation that the instantiated executable tasks avoid over-representing certain attack classes or sharing surface features with AgentHazard, the measured improvement risks being an artifact rather than evidence of broader applicability.
Authors: We acknowledge that the self-evolving claim assumes representativeness of the mined tasks and that the manuscript does not include explicit checks for over-representation of attack classes or surface-feature overlap with AgentHazard. We will revise the abstract to qualify the description of the adaptive loop, explicitly noting the representativeness assumption and the possibility of distribution shift as a limitation. revision: yes
Circularity Check
No circularity: empirical result on external benchmark with no self-referential reduction
full rationale
The paper claims an accuracy improvement on the external AgentHazard benchmark (38.79% to 82.38%) via a pipeline that mines research sources, instantiates tasks, collects rollouts, and derives labels. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation. The evaluation references an independent benchmark with no indication that results reduce to self-defined quantities or that the mining process is justified only by prior author work. This is a standard non-circular empirical claim.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Introducing claude sonnet 4.6, 2026
Anthropic. Introducing claude sonnet 4.6, 2026. URLhttps://www.anthropic.com/news/claude-sonnet-4-6
2026
-
[2]
A trajectory-based safety audit of clawdbot (openclaw)
Tianyu Chen, Dongrui Liu, Xia Hu, Jingyi Yu, and Wenjie Wang. A trajectory-based safety audit of clawdbot (openclaw). arXivpreprintarXiv:2602.14364, 2026
-
[3]
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, and Yun-Nung Chen. Tracesafe: A systematic assessment of llm guardrails on multi-step tool-calling trajectories.arXivpreprintarXiv:2604.07223, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[4]
Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases.Advancesin NeuralInformation Processing Systems, 37:130185–130213, 2024
Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases.Advancesin NeuralInformation Processing Systems, 37:130185–130213, 2024
2024
-
[5]
Pengzhou Cheng, Yidong Ding, Tianjie Ju, Zongru Wu, Wei Du, Ping Yi, Zhuosheng Zhang, and Gongshen Liu. Trojanrag: Retrieval-augmented generation can be backdoor driver in large language models.arXiv preprint arXiv:2405.13401, 2024
-
[6]
Zhengchunmin Dai, Jiaxiong Tang, Liantao Wu, Peng Sun, and Honglong Chen. Stateful agent backdoor.arXiv preprint arXiv:2605.06158, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[7]
Xinhao Deng, Yixiang Zhang, Jiaqing Wu, Jiaqi Bai, Sibo Yi, Zhuoheng Zou, Yue Xiao, Rennai Qiu, Jianan Ma, Jialuo Chen, et al. Taming openclaw: Security analysis and mitigation of autonomous llm agent threats.arXivpreprint arXiv:2603.11619, 2026
-
[8]
Eyes on google’s notebooklm: using generative ai to create ophthalmology podcasts with a single click.Eye, 39(2):215–216, 2025
Qais A Dihan, Bharti R Nihalani, Andrea A Tooley, and Abdelrahman M Elhusseiny. Eyes on google’s notebooklm: using generative ai to create ophthalmology podcasts with a single click.Eye, 39(2):215–216, 2025
2025
-
[9]
Ben Dong, Hui Feng, and Qian Wang. Clawdrain: Exploiting tool-calling chains for stealthy token exhaustion in openclaw agents.arXivpreprintarXiv:2603.00902, 2026
-
[10]
AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents
Yunhao Feng, Yifan Ding, Yingshui Tan, Xingjun Ma, Yige Li, Yutao Wu, Yifeng Gao, Kun Zhai, and Yanming Guo. Agenthazard: Abenchmarkforevaluatingharmfulbehaviorincomputer-useagents. arXivpreprintarXiv:2604.02947, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[11]
SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems
Yunhao Feng, Yifan Ding, Yingshui Tan, Boren Zheng, Yanming Guo, Xiaolong Li, Kun Zhai, Yishan Li, and Wenke Huang. Skilltrojan: Backdoor attacks on skill-based agent systems.arXiv preprintarXiv:2604.06811, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[12]
YunhaoFeng,YigeLi,YutaoWu,YingshuiTan,YanmingGuo,YifanDing,KunZhai,XingjunMa,andYu-GangJiang. Backdooragent: A unified framework for backdoor attacks on llm-based agents.arXiv preprint arXiv:2601.04566, 2026
-
[13]
Gemini 3.1 pro: A smarter model for your most complex tasks, 2026
Google. Gemini 3.1 pro: A smarter model for your most complex tasks, 2026. URLhttps://blog.google/innova tion-and-ai/models-and-research/gemini-models/gemini-3-1-pro/
2026
-
[14]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[15]
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprintarXiv:2312.06674, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[16]
Proof-of-Guardrail in AI Agents and What (Not) to Trust from It
Xisen Jin, Michael Duan, Qin Lin, Aaron Chan, Zhenglun Chen, Junyi Du, and Xiang Ren. Proof-of-guardrail in ai agents and what (not) to trust from it.arXivpreprint arXiv:2603.05786, 2026
work page internal anchor Pith review arXiv 2026
-
[17]
Openclaw prism: A zero-fork, defense-in-depth runtime security layer for tool-augmented llm agents
Frank Li. Openclaw prism: A zero-fork, defense-in-depth runtime security layer for tool-augmented llm agents. arXiv preprintarXiv:2603.11853, 2026
-
[18]
Autobackdoor: Automating backdoor attacks via llm agents.arXiv preprintarXiv:2511.16709, 2025
Yige Li, Zhe Li, Wei Zhao, Nay Myat Min, Hanxun Huang, Xingjun Ma, and Jun Sun. Autobackdoor: Automating backdoor attacks via llm agents.arXiv preprintarXiv:2511.16709, 2025
-
[19]
ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis
Yu Li, Haoyu Luo, Yuejin Xie, Yuqian Fu, Zhonghao Yang, Shuai Shao, Qihan Ren, Wanying Qu, Yanwei Fu, Yujiu Yang, et al. Atbench: A diverse and realistic trajectory benchmark for long-horizon agent safety.arXiv preprint arXiv:2604.02022, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[20]
arXiv preprint arXiv:2601.15588 , year=
Junyu Lin, Meizhen Liu, Xiufeng Huang, Jinfeng Li, Haiwen Hong, Xiaohan Yuan, Yuefeng Chen, Longtao Huang, Hui Xue, Ranjie Duan, et al. Yufeng-xguard: A reasoning-centric, interpretable, and flexible guardrail model for large language models.arXivpreprintarXiv:2601.15588, 2026
-
[21]
AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security
Dongrui Liu, Qihan Ren, Chen Qian, Shuai Shao, Yuejin Xie, Yu Li, Zhonghao Yang, Haoyu Luo, Peng Wang, Qingyu Liu, et al. Agentdog: A diagnostic guardrail framework for ai agent safety and security.arXiv preprint arXiv:2601.18491, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[22]
Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems
Jiacheng Liu, Xiaohan Zhao, Xinyi Shang, and Zhiqiang Shen. Dive into claude code: The design space of today’s and future ai agent systems.arXiv preprintarXiv:2604.14228, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[23]
Prompt Injection attack against LLM-integrated Applications
Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang,YanZheng,etal. Promptinjectionattackagainstllm-integratedapplications. arXivpreprintarXiv:2306.05499, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[24]
Agentauditor: Human-levelsafetyandsecurityevaluationforllmagents
Hanjun Luo, Shenyu Dai, Chiming Ni, Xinfeng Li, Guibin Zhang, Kun Wang, Tongliang Liu, and Hanan Salam. Agentauditor: Human-levelsafetyandsecurityevaluationforllmagents. AdvancesinNeuralInformationProcessing Systems, 38:43241–43298, 2026
2026
-
[25]
GuardPhish: Securing Open-Source LLMs from Phishing Abuse
Rina Mishra, Gaurav Varshney, and Doddipatla Sesha Sahithi. Guardphish: Securing open-source llms from phishing abuse.arXivpreprintarXiv:2604.17313, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[26]
Introducing gpt-5.5, 2026
OpenAI. Introducing gpt-5.5, 2026. URLhttps://openai.com/index/introducing-gpt-5-5/
2026
-
[27]
Towards policy-adaptive image guardrail: Benchmark and method.arXivpreprintarXiv:2603.01228, 2026
Caiyong Piao, Zhiyuan Yan, Haoming Xu, Yunzhen Zhao, Kaiqing Lin, Feiyang Xu, and Shuigeng Zhou. Towards policy-adaptive image guardrail: Benchmark and method.arXivpreprintarXiv:2603.01228, 2026
-
[28]
Locobench-agent: An interactive benchmark for llm agents in long-context software engineering
Jielin Qiu, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Jianguo Zhang, Haolin Chen, Shiyu Wang, Ming Zhu, Liangwei Yang, Juntao Tan, et al. Locobench-agent: An interactive benchmark for llm agents in long-context software engineering. arXivpreprintarXiv:2511.13998, 2025
-
[29]
Qwen3-235b, 2025
Qwen Team. Qwen3-235b, 2025. URLhttps://qwen.ai/blog?id=qwen3
2025
-
[30]
Nemo guardrails: A toolkit for controllable and safe llm applications with programmable rails
Traian Rebedea, Razvan Dinu, Makesh Narsimhan Sreedhar, Christopher Parisien, and Jonathan Cohen. Nemo guardrails: A toolkit for controllable and safe llm applications with programmable rails. InProceedings of the 2023 conferenceon empiricalmethods in naturallanguageprocessing: systemdemonstrations, pages 431–445, 2023
2023
-
[31]
arXiv preprint arXiv:2603.10387 , year=
Zhengyang Shan, Jiayun Xin, Yue Zhang, and Minghui Xu. Don’t let the claw grip your hand: A security analysis and defense framework for openclaw.arXivpreprintarXiv:2603.10387, 2026
-
[32]
Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report.arXivpreprintarXiv:2504.07491, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[33]
Qwen Team. Qwen3. 5-omni technical report.arXivpreprintarXiv:2604.15804, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[34]
BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning
Guiyao Tie, Jiawen Shi, Pan Zhou, and Lichao Sun. Badskill: Backdoor attacks on agent skills via model-in-skill poisoning. arXiv preprintarXiv:2604.09378, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[35]
Weixun Wang, XiaoXiao Xu, Wanhe An, Fangwen Dai, Wei Gao, Yancheng He, Ju Huang, Qiang Ji, Hanqi Jin, Xiaoyang Li, et al. Let it flow: Agentic crafting on rock and roll, building the rome model within an open agentic learning ecosystem.arXiv preprintarXiv:2512.24873, 2025
-
[36]
Openhands: An open platform for ai software developers as generalist agents
XingyaoWang,BoxuanLi,YufanSong,FrankFXu,XiangruTang,MingchenZhuge,JiayiPan,YueqiSong,BowenLi, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents. InInternational Conferenceon Learning Representations, volume 2025, pages 65882–65919, 2025
2025
-
[37]
OpenClaw-RL: Train Any Agent Simply by Talking
Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, and Ling Yang. Openclaw-rl: Train any agent simply by talking. arXivpreprintarXiv:2603.10165, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[38]
A Systematic Security Evaluation of OpenClaw and Its Variants
Yuhang Wang, Haichang Gao, Zhenxing Niu, Zhaoxiang Liu, Wenjing Zhang, Xiang Wang, and Shiguo Lian. A systematic security evaluation of openclaw and its variants.arXivpreprintarXiv:2604.03131, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[39]
Internal safety collapse in frontier large language models,
YutaoWu,XiaoLiu,YifengGao,XiangZheng,HanxunHuang,YigeLi,CongWang,BoLi,XingjunMa,andYu-Gang Jiang. Internal safety collapse in frontier large language models.arXivpreprintarXiv:2603.23509, 2026
-
[40]
Guardtrace-vl: Detecting unsafe multimodel reasoning via iterative safety supervision
Yuxiao Xiang, Junchi Chen, Zhenchao Jin, Changtao Miao, Haojie Yuan, Qi Chu, Tao Gong, and Nenghai Yu. Guardtrace-vl: Detecting unsafe multimodel reasoning via iterative safety supervision. arXiv preprint arXiv:2511.20994, 2025
-
[41]
GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning
Zhen Xiang, Linzhi Zheng, Yanjie Li, Junyuan Hong, Qinbin Li, Han Xie, Jiawei Zhang, Zidi Xiong, Chulin Xie, Carl Yang, et al. Guardagent: Safeguard llm agents by a guard agent via knowledge-enabled reasoning.arXiv preprint arXiv:2406.09187, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[42]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXivpreprint arXiv:2505.09388, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[43]
Swe-agent: Agent-computer interfaces enable automated software engineering.Advancesin NeuralInformation ProcessingSystems, 37:50528–50652, 2024
John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering.Advancesin NeuralInformation ProcessingSystems, 37:50528–50652, 2024
2024
-
[44]
R-judge: Benchmarkingsafetyriskawarenessforllmagents
TongxinYuan,ZhiweiHe,LingzhongDong,YimingWang,RuijieZhao,TianXia,LizhenXu,BinglinZhou,FangqiLi, ZhuoshengZhang,etal. R-judge: Benchmarkingsafetyriskawarenessforllmagents. In FindingsoftheAssociation forComputationalLinguistics: EMNLP 2024, pages 1467–1490, 2024
2024
-
[45]
Haiquan Zhao, Chenhan Yuan, Fei Huang, Xiaomeng Hu, Yichang Zhang, An Yang, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin, et al. Qwen3guard technical report.arXivpreprintarXiv:2510.14276, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[46]
existence audit
Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia.{PoisonedRAG}: Knowledge corruption attacks to {Retrieval-Augmented} generation of large language models. In34th USENIX Security Symposium (USENIX Security25), pages 3827–3844, 2025. A Data Collection Details This appendix provides additional details on the data collection stage of BraveGuard. We summar...
2025
-
[47]
Second, the comparison with Llama3.1-8B- Instruct highlights an important distinction betweendetecting many unsafe cases and making balanced safetyjudgments
Even when evaluated on external benchmarks built for agent-risk judgment, BraveGuard-trained guards remain strong and often achieve the best accuracy. Second, the comparison with Llama3.1-8B- Instruct highlights an important distinction betweendetecting many unsafe cases and making balanced safetyjudgments. Extremely high recall can be achieved by over-pr...
2000
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.