BraveGuard: From Open-World Threats to Safer Computer-Use Agents

Baihui Zheng; Kerui Cao; Ming Wen; Wenke Huang; Xiaohu Du; Xingjun Ma; Xinhao Deng; Yanming Guo; Yifan Ding; Yige Li

arxiv: 2606.01166 · v1 · pith:PKH4YDKLnew · submitted 2026-05-31 · 💻 cs.CR · cs.CL

Yunhao Feng, Yifan Ding, Xiaohu Du, Ming Wen, Xinhao Deng, Yanming Guo, Yuxiang Xie, Baihui Zheng, Yingshui Tan, Yige Li, Yutao Wu, Yixu Wang, Kerui Cao, Wenke Huang, Xingjun Ma, Yu-Gang Jiang

Pith reviewed 2026-06-28 17:02 UTC · model grok-4.3

classification 💻 cs.CR cs.CL

keywords computer-use agents · safety guard models · trajectory supervision · open-world threats · adaptive defense · AgentHazard · self-evolving framework · agent rollouts

The pith

BraveGuard trains guard models on agent execution traces derived from threats mined in recent research, raising detection accuracy on the AgentHazard benchmark from 38.79 percent to 82.38 percent, averaged across guard backbones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

BraveGuard identifies emerging risks in published research, converts them into executable computer-use tasks, runs agents to collect full execution traces, and uses those traces as training data for guard models. Harm in these agents often surfaces only across multiple steps that look harmless in isolation, so prompt-level or final-response guards miss it. The pipeline repeats whenever new threats or failures appear, creating ongoing supervision instead of a one-time benchmark. This matters because agents now control files, terminals, and browsers where damage builds through sequences rather than single actions. The reported result is that guards trained this way detect trajectory-level risks far more reliably than models trained on fixed taxonomies or synthetic prompts.
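
To make the multi-step point concrete, here is a minimal Python sketch that is not taken from the paper. The `Step` fields, the blocklist, and the `step_guard` and `trajectory_guard` functions are all hypothetical, and a real guard would be a trained model rather than a hand-written rule. The sketch only illustrates a trace in which every action passes a per-step check while the sequence as a whole is flagged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    tool: str    # hypothetical tool name, e.g. "terminal" or "browser"
    action: str  # short label for what the agent did
    target: str  # file path or URL the action touched


# Hypothetical blocklist: the kind of thing a per-step check might look at.
BLOCKED_ACTIONS = {"delete_system_files", "disable_firewall"}


def step_guard(step: Step) -> bool:
    """Return True if this single step looks unsafe in isolation."""
    return step.action in BLOCKED_ACTIONS


def trajectory_guard(steps: list[Step]) -> bool:
    """Return True if a secret is read and data is sent out afterwards."""
    read_secret = False
    for step in steps:
        if step.action == "read_file" and ".ssh" in step.target:
            read_secret = True
        if step.action == "http_post" and read_secret:
            return True
    return False


# Illustrative trace: each step is individually unremarkable.
trace = [
    Step("terminal", "list_dir", "/home/user"),
    Step("terminal", "read_file", "/home/user/.ssh/id_rsa"),
    Step("browser", "http_post", "https://example.com/upload"),
]

per_step_flags = [step_guard(s) for s in trace]
trajectory_flag = trajectory_guard(trace)
print(per_step_flags, trajectory_flag)
```

In this toy trace the per-step check flags nothing, while the sequence-aware check fires because the upload follows the key read. The order matters: the same three steps reversed would not trigger the rule.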

Core claim

BraveGuard is a self-evolving defense framework that mines recent research sources to identify emerging risks and attack patterns, instantiates them as executable computer-use tasks, collects agent rollouts, and derives trajectory-level supervision for guard model training. When new threats or validation failures appear, the pipeline repeats. On the AgentHazard benchmark, guards trained under this process raise averaged detection accuracy from 38.79 percent to 82.38 percent across multiple backbones, demonstrating that supervision grounded in open-world threat discovery and realistic agent execution outperforms static taxonomies and synthetic prompt-level data.
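
The loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the names `evolve`, `GuardState`, and `Trajectory`, the callables passed in, and the stopping rule are all hypothetical, and guard training is collapsed into a single `validate` stand-in.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Trajectory:
    task: str
    steps: list[str]
    unsafe: bool  # trajectory-level label


@dataclass
class GuardState:
    training_set: list[Trajectory] = field(default_factory=list)
    rounds: int = 0


def evolve(
    state: GuardState,
    mine_threats: Callable[[], list[str]],
    instantiate: Callable[[str], str],
    rollout: Callable[[str], list[str]],
    label: Callable[[list[str]], bool],
    validate: Callable[[list[Trajectory]], float],
    target_accuracy: float,
    max_rounds: int = 3,
) -> GuardState:
    """Repeat mine -> instantiate -> rollout -> label until validation passes."""
    while state.rounds < max_rounds:
        for threat in mine_threats():
            task = instantiate(threat)   # threat description -> executable task
            steps = rollout(task)        # run an agent, keep the full trace
            state.training_set.append(Trajectory(task, steps, label(steps)))
        state.rounds += 1
        # Hypothetical stand-in: validate() represents "train the guard, then evaluate".
        if validate(state.training_set) >= target_accuracy:
            break
    return state


# Toy stubs so the sketch runs end to end; none of these reflect the paper's data.
state = evolve(
    GuardState(),
    mine_threats=lambda: ["prompt injection via web page", "token exhaustion"],
    instantiate=lambda threat: f"task for: {threat}",
    rollout=lambda task: [f"open {task}", "run command"],
    label=lambda steps: "injection" in steps[0],
    validate=lambda data: min(1.0, len(data) / 4),
    target_accuracy=1.0,
)
print(state.rounds, len(state.training_set))
```

Passing each stage in as a callable keeps the loop itself independent of how mining, rollout, or labeling are actually done, which is the part the abstract leaves unspecified.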

What carries the argument

BraveGuard, the pipeline that converts open research sources into executable agent tasks, collects full rollouts, and produces iterative trajectory-level labels for guard training.

If this is right

Guard models detect harmful behavior across multi-step computer-use trajectories more accurately than off-the-shelf models trained on isolated prompts.
The defense can adapt by repeating the mining-and-rollout loop as new threats emerge rather than remaining fixed to an initial taxonomy.
Trajectory-level supervision from realistic agent executions improves generalization to open-world risks compared with synthetic or prompt-only data.
Multiple guard backbones, including Qwen3-Guard and Llama-Guard variants, show consistent gains when trained on the same derived supervision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same mining-to-rollout loop could be applied to other agent settings such as web or code agents if the research-source step scales.
Static safety benchmarks may systematically understate evolving risks, making repeated discovery loops a practical necessity for long-term monitoring.
If the collected trajectories prove representative, similar adaptive supervision could transfer to safety problems outside computer-use agents.

Load-bearing premise

Tasks derived from recent research papers produce agent trajectories that fairly represent the distribution of real open-world threats.

What would settle it

The claim would fail if trained guards showed no accuracy gain over baselines when tested on a collection of real deployed agent failures whose originating tasks never appeared in published research.

Original abstract

Computer-use agents extend language models from text generation to sustained interaction with files, terminals, browsers, and external tools. This shift creates safety risks that are difficult to detect from isolated prompts or final responses, because harm often emerges only through multi-step execution traces whose individual actions appear locally benign. We introduce BraveGuard, a self-evolving defense framework for training guard models from open-world threat signals and realistic agent trajectories. BraveGuard mines recent research sources to identify emerging risks and attack patterns, instantiates them as executable computer-use tasks, collects agent rollouts, and derives trajectory-level supervision for guard model training. As new threats and validation failures appear, the pipeline can be repeated, yielding an adaptive defense loop rather than a static, benchmark-driven training process. We instantiate BraveGuard by training multiple guard backbones, including Qwen3-Guard and Llama-Guard variants, and evaluate the resulting guards on trajectory-level agent-safety benchmarks. BraveGuard consistently improves safety detection across computer-use trajectories. On AgentHazard, it substantially improves detection accuracy over off-the-shelf guard models, with accuracy increasing from 38.79% to 82.38% under the averaged guard-model setting. These results show that guard supervision grounded in open-world threat discovery and realistic agent execution can improve safety monitoring beyond fixed taxonomies and synthetic prompt-level data. BraveGuard offers a scalable path toward adaptive defenses for computer-use agents facing evolving real-world risks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

BraveGuard's self-evolving pipeline from mined research threats to trajectory labels is the main new piece, but the abstract gives no methods or verification steps to support the 38.79% to 82.38% jump.

The paper's central move is to close the loop from published attack research to guard training: it turns open research papers on attacks into executable tasks, runs agent rollouts, and uses those traces to supervise guard models. That adaptive loop is the actual novelty here; most prior guard work stays at the prompt level or relies on fixed taxonomies.

What works is the problem framing. Computer-use agents create harms that only show up across steps, so trajectory-level supervision makes sense as a direction. Training multiple backbones and testing on AgentHazard is a reasonable start.

The soft spot is exactly where the stress-test points: nothing in the abstract shows how the mining step avoids over-representing certain attack classes or how the instantiated tasks differ from the evaluation benchmark. The accuracy numbers are large, but without source lists, selection rules, label checks, or even basic baselines and variance, it's impossible to tell if the gain is real generalization or just matching the test distribution. The self-evolving claim inherits the same gap.

This is for readers already working on agent safety or guard models who want to see one concrete attempt at open-world adaptation. It is not ready for citation yet because the evidence is missing.

If the full paper supplies the missing pipeline details, data sources, and controls, it should go to review. On the abstract alone it would be a desk reject.

Referee Report

2 major / 1 minor

Summary. The paper introduces BraveGuard, a self-evolving defense framework for computer-use agents that mines recent research sources to identify emerging risks, instantiates them as executable tasks, collects agent rollouts, derives trajectory-level supervision, and uses this to train guard models (e.g., Qwen3-Guard and Llama-Guard variants). It claims consistent improvements in safety detection, with accuracy on the AgentHazard benchmark rising from 38.79% to 82.38% under an averaged guard-model setting, arguing this provides a scalable, adaptive alternative to static taxonomies and synthetic prompt-level data.

Significance. If the data-generation pipeline produces representative, unbiased trajectory supervision that generalizes beyond the mined sources, the self-evolving loop would represent a meaningful advance for adaptive safety monitoring in multi-step agent interactions, where harm emerges only across traces. The approach of grounding supervision in open-world threat discovery rather than fixed benchmarks has clear potential value for the field if the representativeness assumption holds.

major comments (2)

[Abstract] Abstract: The central claim of substantial accuracy improvement (38.79% to 82.38%) on AgentHazard rests on the pipeline producing representative trajectory-level supervision; however, the abstract supplies no enumeration of mined research sources, selection criteria, label-verification procedure, number of trajectories, or statistical details (error bars, baselines, data exclusion), preventing assessment of whether the gain reflects genuine generalization or distribution-shift artifacts.
[Abstract] Abstract (self-evolving loop description): The claim that the pipeline can be repeated as new threats appear to yield an adaptive defense depends on the mined tasks and rollouts matching real open-world risk distributions; without explicit validation that the instantiated executable tasks avoid over-representing certain attack classes or sharing surface features with AgentHazard, the measured improvement risks being an artifact rather than evidence of broader applicability.
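
One simple way to probe the surface-feature concern is a word n-gram overlap check between training tasks and benchmark tasks. The Python sketch below is an assumption about how such a check could look, not something the paper reports; the function names, the choice of trigrams, and the example task strings are all hypothetical.

```python
def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Lowercased word n-grams of a task description."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity; defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def max_overlap(train_tasks: list[str], benchmark_tasks: list[str], n: int = 3) -> list[float]:
    """For each training task, its highest n-gram Jaccard score against any benchmark task."""
    bench = [ngrams(t, n) for t in benchmark_tasks]
    return [
        max((jaccard(ngrams(t, n), b) for b in bench), default=0.0)
        for t in train_tasks
    ]


# Made-up task strings, for illustration only.
train = ["copy the ssh key to a remote server", "summarize the quarterly report"]
bench = ["copy the ssh key to a remote server now"]
scores = max_overlap(train, bench)
print(scores)
```

A high score for many training tasks would suggest near-duplicates of benchmark items; a low score does not rule out overlap in attack class, which needs a semantic check rather than a lexical one.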

minor comments (1)

[Abstract] Abstract: The phrase 'averaged guard-model setting' is used without definition or reference to how the averaging is performed across backbones, which reduces clarity for readers evaluating the reported accuracy.
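
The ambiguity matters because two common readings of "averaged" can give different numbers. The Python sketch below contrasts them; it is purely illustrative, the backbone names and counts are invented, and the abstract does not say which reading (if either) the authors use.

```python
def accuracy(correct: int, total: int) -> float:
    return correct / total


def macro_average(per_backbone: dict[str, tuple[int, int]]) -> float:
    """Mean of per-backbone accuracies: every backbone counts equally."""
    return sum(accuracy(c, t) for c, t in per_backbone.values()) / len(per_backbone)


def micro_average(per_backbone: dict[str, tuple[int, int]]) -> float:
    """Pooled accuracy: every evaluated trajectory counts equally."""
    correct = sum(c for c, _ in per_backbone.values())
    total = sum(t for _, t in per_backbone.values())
    return correct / total


# Invented (correct, total) counts for two hypothetical guard backbones.
results = {"guard_a": (80, 100), "guard_b": (30, 50)}
macro = macro_average(results)
micro = micro_average(results)
print(macro, micro)
```

The two averages coincide only when every backbone is evaluated on the same number of trajectories, so stating the evaluation set size per backbone would resolve the question.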

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed comments on the abstract. We address each point below and will revise the abstract accordingly.

Referee: [Abstract] Abstract: The central claim of substantial accuracy improvement (38.79% to 82.38%) on AgentHazard rests on the pipeline producing representative trajectory-level supervision; however, the abstract supplies no enumeration of mined research sources, selection criteria, label-verification procedure, number of trajectories, or statistical details (error bars, baselines, data exclusion), preventing assessment of whether the gain reflects genuine generalization or distribution-shift artifacts.

Authors: We agree that the abstract, constrained by length, omits these specifics. The main manuscript describes the mining process, selection criteria, verification steps, trajectory counts, baselines, and statistical reporting. We will revise the abstract to incorporate a concise summary of the data scale, source types, and verification approach to allow better evaluation of the results.

revision: yes

Referee: [Abstract] Abstract (self-evolving loop description): The claim that the pipeline can be repeated as new threats appear to yield an adaptive defense depends on the mined tasks and rollouts matching real open-world risk distributions; without explicit validation that the instantiated executable tasks avoid over-representing certain attack classes or sharing surface features with AgentHazard, the measured improvement risks being an artifact rather than evidence of broader applicability.

Authors: We acknowledge that the self-evolving claim assumes representativeness of the mined tasks and that the manuscript does not include explicit checks for over-representation of attack classes or surface-feature overlap with AgentHazard. We will revise the abstract to qualify the description of the adaptive loop, explicitly noting the representativeness assumption and the possibility of distribution shift as a limitation.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical result on external benchmark with no self-referential reduction

full rationale

The paper claims an accuracy improvement on the external AgentHazard benchmark (38.79% to 82.38%) via a pipeline that mines research sources, instantiates tasks, collects rollouts, and derives labels. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation. The evaluation references an independent benchmark with no indication that results reduce to self-defined quantities or that the mining process is justified only by prior author work. This is a standard non-circular empirical claim.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are described.


[1] [1]

Introducing claude sonnet 4.6, 2026

Anthropic. Introducing claude sonnet 4.6, 2026. URLhttps://www.anthropic.com/news/claude-sonnet-4-6

2026

[2] [2]

A trajectory-based safety audit of clawdbot (openclaw)

Tianyu Chen, Dongrui Liu, Xia Hu, Jingyi Yu, and Wenjie Wang. A trajectory-based safety audit of clawdbot (openclaw). arXivpreprintarXiv:2602.14364, 2026

work page arXiv 2026

[3] [3]

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, and Yun-Nung Chen. Tracesafe: A systematic assessment of llm guardrails on multi-step tool-calling trajectories.arXivpreprintarXiv:2604.07223, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[4] [4]

Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases.Advancesin NeuralInformation Processing Systems, 37:130185–130213, 2024

Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases.Advancesin NeuralInformation Processing Systems, 37:130185–130213, 2024

2024

[5] [5]

Trojanrag: Retrieval-augmented generation can be backdoor driver in large language models.arXiv preprint arXiv:2405.13401, 2024

Pengzhou Cheng, Yidong Ding, Tianjie Ju, Zongru Wu, Wei Du, Ping Yi, Zhuosheng Zhang, and Gongshen Liu. Trojanrag: Retrieval-augmented generation can be backdoor driver in large language models.arXiv preprint arXiv:2405.13401, 2024

work page arXiv 2024

[6] [6]

Stateful Agent Backdoor

Zhengchunmin Dai, Jiaxiong Tang, Liantao Wu, Peng Sun, and Honglong Chen. Stateful agent backdoor.arXiv preprint arXiv:2605.06158, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[7] [7]

Taming openclaw: Security analysis and mitigation of autonomous llm agent threats.arXiv preprint arXiv:2603.11619,

Xinhao Deng, Yixiang Zhang, Jiaqing Wu, Jiaqi Bai, Sibo Yi, Zhuoheng Zou, Yue Xiao, Rennai Qiu, Jianan Ma, Jialuo Chen, et al. Taming openclaw: Security analysis and mitigation of autonomous llm agent threats.arXivpreprint arXiv:2603.11619, 2026

work page arXiv 2026

[8] [8]

Eyes on google’s notebooklm: using generative ai to create ophthalmology podcasts with a single click.Eye, 39(2):215–216, 2025

Qais A Dihan, Bharti R Nihalani, Andrea A Tooley, and Abdelrahman M Elhusseiny. Eyes on google’s notebooklm: using generative ai to create ophthalmology podcasts with a single click.Eye, 39(2):215–216, 2025

2025

[9] [9]

Clawdrain: Exploiting tool-calling chains for stealthy token exhaustion in openclaw agents.arXivpreprintarXiv:2603.00902, 2026

Ben Dong, Hui Feng, and Qian Wang. Clawdrain: Exploiting tool-calling chains for stealthy token exhaustion in openclaw agents.arXivpreprintarXiv:2603.00902, 2026

work page arXiv 2026

[10] [10]

AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents

Yunhao Feng, Yifan Ding, Yingshui Tan, Xingjun Ma, Yige Li, Yutao Wu, Yifeng Gao, Kun Zhai, and Yanming Guo. Agenthazard: Abenchmarkforevaluatingharmfulbehaviorincomputer-useagents. arXivpreprintarXiv:2604.02947, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[11] Yunhao Feng, Yifan Ding, Yingshui Tan, Boren Zheng, Yanming Guo, Xiaolong Li, Kun Zhai, Yishan Li, and Wenke Huang. Skilltrojan: Backdoor attacks on skill-based agent systems. arXiv preprint arXiv:2604.06811, 2026

[12] Yunhao Feng, Yige Li, Yutao Wu, Yingshui Tan, Yanming Guo, Yifan Ding, Kun Zhai, Xingjun Ma, and Yu-Gang Jiang. Backdooragent: A unified framework for backdoor attacks on llm-based agents. arXiv preprint arXiv:2601.04566, 2026

[13] Google. Gemini 3.1 pro: A smarter model for your most complex tasks, 2026. URL https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/

[14] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

[15] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674, 2023

[16] Xisen Jin, Michael Duan, Qin Lin, Aaron Chan, Zhenglun Chen, Junyi Du, and Xiang Ren. Proof-of-guardrail in ai agents and what (not) to trust from it. arXiv preprint arXiv:2603.05786, 2026

[17] Frank Li. Openclaw prism: A zero-fork, defense-in-depth runtime security layer for tool-augmented llm agents. arXiv preprint arXiv:2603.11853, 2026

[18] Yige Li, Zhe Li, Wei Zhao, Nay Myat Min, Hanxun Huang, Xingjun Ma, and Jun Sun. Autobackdoor: Automating backdoor attacks via llm agents. arXiv preprint arXiv:2511.16709, 2025

[19] Yu Li, Haoyu Luo, Yuejin Xie, Yuqian Fu, Zhonghao Yang, Shuai Shao, Qihan Ren, Wanying Qu, Yanwei Fu, Yujiu Yang, et al. Atbench: A diverse and realistic trajectory benchmark for long-horizon agent safety. arXiv preprint arXiv:2604.02022, 2026

[20] Junyu Lin, Meizhen Liu, Xiufeng Huang, Jinfeng Li, Haiwen Hong, Xiaohan Yuan, Yuefeng Chen, Longtao Huang, Hui Xue, Ranjie Duan, et al. Yufeng-xguard: A reasoning-centric, interpretable, and flexible guardrail model for large language models. arXiv preprint arXiv:2601.15588, 2026

[21] Dongrui Liu, Qihan Ren, Chen Qian, Shuai Shao, Yuejin Xie, Yu Li, Zhonghao Yang, Haoyu Luo, Peng Wang, Qingyu Liu, et al. Agentdog: A diagnostic guardrail framework for ai agent safety and security. arXiv preprint arXiv:2601.18491, 2026

[22] Jiacheng Liu, Xiaohan Zhao, Xinyi Shang, and Zhiqiang Shen. Dive into claude code: The design space of today’s and future ai agent systems. arXiv preprint arXiv:2604.14228, 2026

[23] Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, et al. Prompt injection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499, 2023

[24] Hanjun Luo, Shenyu Dai, Chiming Ni, Xinfeng Li, Guibin Zhang, Kun Wang, Tongliang Liu, and Hanan Salam. Agentauditor: Human-level safety and security evaluation for llm agents. Advances in Neural Information Processing Systems, 38:43241–43298, 2026

[25] Rina Mishra, Gaurav Varshney, and Doddipatla Sesha Sahithi. Guardphish: Securing open-source llms from phishing abuse. arXiv preprint arXiv:2604.17313, 2026

[26] OpenAI. Introducing gpt-5.5, 2026. URL https://openai.com/index/introducing-gpt-5-5/

[27] Caiyong Piao, Zhiyuan Yan, Haoming Xu, Yunzhen Zhao, Kaiqing Lin, Feiyang Xu, and Shuigeng Zhou. Towards policy-adaptive image guardrail: Benchmark and method. arXiv preprint arXiv:2603.01228, 2026

[28] Jielin Qiu, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Jianguo Zhang, Haolin Chen, Shiyu Wang, Ming Zhu, Liangwei Yang, Juntao Tan, et al. Locobench-agent: An interactive benchmark for llm agents in long-context software engineering. arXiv preprint arXiv:2511.13998, 2025

[29] Qwen Team. Qwen3-235b, 2025. URL https://qwen.ai/blog?id=qwen3

[30] Traian Rebedea, Razvan Dinu, Makesh Narsimhan Sreedhar, Christopher Parisien, and Jonathan Cohen. Nemo guardrails: A toolkit for controllable and safe llm applications with programmable rails. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 431–445, 2023

[31] Zhengyang Shan, Jiayun Xin, Yue Zhang, and Minghui Xu. Don’t let the claw grip your hand: A security analysis and defense framework for openclaw. arXiv preprint arXiv:2603.10387, 2026

[32] Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report. arXiv preprint arXiv:2504.07491, 2025

[33] Qwen Team. Qwen3.5-omni technical report. arXiv preprint arXiv:2604.15804, 2026

[34] Guiyao Tie, Jiawen Shi, Pan Zhou, and Lichao Sun. Badskill: Backdoor attacks on agent skills via model-in-skill poisoning. arXiv preprint arXiv:2604.09378, 2026

[35] Weixun Wang, XiaoXiao Xu, Wanhe An, Fangwen Dai, Wei Gao, Yancheng He, Ju Huang, Qiang Ji, Hanqi Jin, Xiaoyang Li, et al. Let it flow: Agentic crafting on rock and roll, building the rome model within an open agentic learning ecosystem. arXiv preprint arXiv:2512.24873, 2025

[36] Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents. In International Conference on Learning Representations, volume 2025, pages 65882–65919, 2025

[37] Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, and Ling Yang. Openclaw-rl: Train any agent simply by talking. arXiv preprint arXiv:2603.10165, 2026

[38] Yuhang Wang, Haichang Gao, Zhenxing Niu, Zhaoxiang Liu, Wenjing Zhang, Xiang Wang, and Shiguo Lian. A systematic security evaluation of openclaw and its variants. arXiv preprint arXiv:2604.03131, 2026

[39] Yutao Wu, Xiao Liu, Yifeng Gao, Xiang Zheng, Hanxun Huang, Yige Li, Cong Wang, Bo Li, Xingjun Ma, and Yu-Gang Jiang. Internal safety collapse in frontier large language models. arXiv preprint arXiv:2603.23509, 2026

[40] Yuxiao Xiang, Junchi Chen, Zhenchao Jin, Changtao Miao, Haojie Yuan, Qi Chu, Tao Gong, and Nenghai Yu. Guardtrace-vl: Detecting unsafe multimodel reasoning via iterative safety supervision. arXiv preprint arXiv:2511.20994, 2025

[41] Zhen Xiang, Linzhi Zheng, Yanjie Li, Junyuan Hong, Qinbin Li, Han Xie, Jiawei Zhang, Zidi Xiong, Chulin Xie, Carl Yang, et al. Guardagent: Safeguard llm agents by a guard agent via knowledge-enabled reasoning. arXiv preprint arXiv:2406.09187, 2024

[42] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[43] John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024

[44] Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, Tian Xia, Lizhen Xu, Binglin Zhou, Fangqi Li, Zhuosheng Zhang, et al. R-judge: Benchmarking safety risk awareness for llm agents. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1467–1490, 2024

[45] Haiquan Zhao, Chenhan Yuan, Fei Huang, Xiaomeng Hu, Yichang Zhang, An Yang, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin, et al. Qwen3guard technical report. arXiv preprint arXiv:2510.14276, 2025

[46] Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia. PoisonedRAG: Knowledge corruption attacks to Retrieval-Augmented generation of large language models. In 34th USENIX Security Symposium (USENIX Security 25), pages 3827–3844, 2025

A Data Collection Details

This appendix provides additional details on the data collection stage of BraveGuard. We summar...

Even when evaluated on external benchmarks built for agent-risk judgment, BraveGuard-trained guards remain strong and often achieve the best accuracy. Second, the comparison with Llama3.1-8B-Instruct highlights an important distinction between detecting many unsafe cases and making balanced safety judgments. Extremely high recall can be achieved by over-pr...