arXiv: 2601.18491 · v2 · submitted 2026-01-26 · 💻 cs.AI · cs.CC · cs.CL · cs.CV · cs.LG

AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

Dongrui Liu, Qihan Ren, Chen Qian, Shuai Shao, Yuejin Xie, Yu Li, Zhonghao Yang, Haoyu Luo, Peng Wang, Qingyu Liu, Binxin Hu, Ling Tang, Jilin Mei, Dadi Guo, Leitao Yuan, Junyao Yang, Guanxu Chen, Qihao Lin, Yi Yu, Bo Zhang, Jiaxuan Guo, Jie Zhang, Wenqi Shao, Huiqi Deng, Zhiheng Xi, Wenjie Wang, Wenxuan Wang, Wen Shen, Zhikai Chen, Haoyu Xie, Jialing Tao, Juntao Dai, Jiaming Ji, Zhongjie Ba, Linfeng Zhang, Yong Liu, Quanshi Zhang, Lei Zhu, Zhihua Wei, Hui Xue, Chaochao Lu, Jing Shao, Xia Hu


Pith reviewed 2026-05-16 11:24 UTC · model grok-4.3

classification 💻 cs.AI · cs.CC · cs.CL · cs.CV · cs.LG

keywords: agentic safety · guardrail framework · risk taxonomy · root cause diagnosis · AI agents · safety benchmark · ATBench


The pith

AgentDoG uses a three-dimensional taxonomy to diagnose root causes of unsafe actions in AI agents beyond binary labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a unified taxonomy that categorizes agentic risks orthogonally by source, failure mode, and consequence to structure safety analysis. This taxonomy guides construction of the ATBench benchmark and the AgentDoG framework, which monitors full agent trajectories and identifies why unsafe or unreasonable actions occur. Unlike prior guardrails limited to binary detection, AgentDoG supplies provenance and transparency to support alignment. The approach addresses autonomous tool use and environmental interactions that create complex risks current methods cannot fully capture. Variants in 4B, 7B, and 8B sizes across model families demonstrate state-of-the-art moderation performance in diverse interactive scenarios.

Core claim

AgentDoG is a diagnostic guardrail framework that provides fine-grained and contextual monitoring across agent trajectories and diagnoses the root causes of unsafe actions and seemingly safe but unreasonable actions, offering provenance and transparency beyond binary labels to facilitate effective agent alignment.

What carries the argument

The three-dimensional taxonomy that orthogonally categorizes agentic risks by source (where), failure mode (how), and consequence (what), which structures both the benchmark and the diagnostic monitoring process.
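
One way to picture the orthogonal structure is as a label with three independent fields, so that the label space is the Cartesian product of the axes. The sketch below is illustrative only: the category names in `SOURCES`, `FAILURE_MODES`, and `CONSEQUENCES` are hypothetical placeholders (the paper's actual taxonomy entries are not listed in this review), and `RiskLabel`, `Diagnosis`, and `all_labels` are names invented here, not part of any released AgentDoG code.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Optional

# Hypothetical placeholder categories: the paper's actual taxonomy entries
# are not listed in this review, so none of these names come from it.
SOURCES = ("user_instruction", "tool_output", "environment")
FAILURE_MODES = ("unsafe_tool_call", "goal_drift")
CONSEQUENCES = ("data_leak", "financial_loss")


@dataclass(frozen=True)
class RiskLabel:
    """One point in the three-dimensional label space."""
    source: str        # where the risk originates
    failure_mode: str  # how the agent goes wrong
    consequence: str   # what harm results

    def __post_init__(self) -> None:
        # Each axis is validated independently of the other two.
        if self.source not in SOURCES:
            raise ValueError(f"unknown source: {self.source}")
        if self.failure_mode not in FAILURE_MODES:
            raise ValueError(f"unknown failure mode: {self.failure_mode}")
        if self.consequence not in CONSEQUENCES:
            raise ValueError(f"unknown consequence: {self.consequence}")


@dataclass(frozen=True)
class Diagnosis:
    """A verdict plus, for flagged trajectories, a root-cause label."""
    unsafe: bool
    label: Optional[RiskLabel] = None


def all_labels() -> List[RiskLabel]:
    """Enumerate the label space as the product of the three axes."""
    return [RiskLabel(s, f, c)
            for s, f, c in product(SOURCES, FAILURE_MODES, CONSEQUENCES)]
```

With these placeholder axes the label space has 3 × 2 × 2 = 12 entries. If the real dimensions are orthogonal in the sense claimed, every such combination should be at least conceivable; combinations that can never occur would be evidence of dependence between axes.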

If this is right

Enables transparent diagnosis that supports targeted fixes during agent alignment.
Provides fine-grained monitoring that captures risks emerging across entire interaction trajectories.
Achieves superior performance in safety moderation for complex, tool-using agent scenarios.
Releases models and datasets to allow community extension of the diagnostic approach.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The taxonomy structure could be reused to create diagnostic guardrails for non-agent AI systems with sequential decision making.
Root-cause outputs might generate synthetic training data focused on specific failure modes to reduce recurrence.
Integration into agent runtime loops could allow real-time intervention before consequences materialize.
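
The last extension, real-time intervention, could look roughly like the loop below. This is a minimal sketch under stated assumptions, not a description of how AgentDoG is deployed: `guarded_run`, the `judge` callable, and the `execute` callable are all hypothetical names, and the judge is assumed to return `True` when the trajectory so far, including the candidate step, looks unsafe.

```python
from typing import Callable, List, Sequence, Tuple

Step = str
# Hypothetical judge interface: receives the trajectory including the
# candidate step and returns True if it should be blocked.
Judge = Callable[[Sequence[Step]], bool]


def guarded_run(
    proposed: Sequence[Step],
    judge: Judge,
    execute: Callable[[Step], None],
) -> Tuple[List[Step], bool]:
    """Execute steps in order, stopping before the first flagged one.

    Returns the executed history and whether the run was blocked.
    """
    history: List[Step] = []
    for step in proposed:
        # The judge sees the whole trajectory, not just the latest step,
        # so risks that build up across steps can still be caught.
        if judge(history + [step]):
            return history, True
        execute(step)
        history.append(step)
    return history, False
```

The design choice that matters is calling the judge before `execute`, so a flagged step never runs; a post-hoc monitor would only report the consequence after the fact.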

Load-bearing premise

The three-dimensional taxonomy is orthogonal, comprehensive, and sufficient to cover all relevant agent behaviors for accurate root-cause diagnosis.

What would settle it

A collection of agent trajectories containing unsafe behaviors where human experts identify root causes outside the taxonomy categories or where AgentDoG diagnosis mismatches expert analysis.
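
The two failure signals named above can be turned into simple rates. The sketch below assumes a hypothetical annotation format in which each expert label is a `(source, failure_mode, consequence)` tuple, or `None` when the experts judged the root cause to fall outside the taxonomy; `falsification_report` and its output keys are names invented for this illustration.

```python
from typing import Dict, List, Optional, Tuple

Label = Tuple[str, str, str]  # (source, failure_mode, consequence)


def falsification_report(
    expert: List[Optional[Label]],
    model: List[Label],
) -> Dict[str, float]:
    """Summarize the two ways the taxonomy-based diagnosis could fail.

    expert[i] is None when annotators placed the root cause outside
    the taxonomy (an assumption of this sketch, not a paper convention).
    """
    if len(expert) != len(model) or not expert:
        raise ValueError("need two non-empty lists of equal length")
    outside = sum(1 for e in expert if e is None)
    # Only trajectories the taxonomy covers can be checked for mismatch.
    covered = [(e, m) for e, m in zip(expert, model) if e is not None]
    mismatches = sum(1 for e, m in covered if e != m)
    return {
        "out_of_taxonomy_rate": outside / len(expert),
        "mismatch_rate": mismatches / len(covered) if covered else 0.0,
    }
```

A high `out_of_taxonomy_rate` would challenge the comprehensiveness premise, while a high `mismatch_rate` would challenge the diagnostic accuracy claim; the two are reported separately because they point to different fixes.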

Original abstract

The rise of AI agents introduces complex safety and security challenges arising from autonomous tool use and environmental interactions. Current guardrail models lack agentic risk awareness and transparency in risk diagnosis. To introduce an agentic guardrail that covers complex and numerous risky behaviors, we first propose a unified three-dimensional taxonomy that orthogonally categorizes agentic risks by their source (where), failure mode (how), and consequence (what). Guided by this structured and hierarchical taxonomy, we introduce a new fine-grained agentic safety benchmark (ATBench) and a Diagnostic Guardrail framework for agent safety and security (AgentDoG). AgentDoG provides fine-grained and contextual monitoring across agent trajectories. More crucially, AgentDoG can diagnose the root causes of unsafe actions and seemingly safe but unreasonable actions, offering provenance and transparency beyond binary labels to facilitate effective agent alignment. AgentDoG variants are available in three sizes (4B, 7B, and 8B parameters) across Qwen and Llama model families. Extensive experimental results demonstrate that AgentDoG achieves state-of-the-art performance in agentic safety moderation in diverse and complex interactive scenarios. All models and datasets are openly released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

AgentDoG gives a practical 3D taxonomy and root-cause diagnosis for agent risks, but the SOTA claim sits on a self-built benchmark with no external anchors shown.


The core contribution is a three-dimensional taxonomy that splits agentic risks by source, failure mode, and consequence, then uses that structure to build both ATBench and a guardrail model that outputs diagnoses instead of binary flags. That diagnostic angle is the part that actually moves the needle for downstream alignment work, because knowing the provenance of an unsafe trajectory is more actionable than a simple reject. The open release of the 4B/7B/8B variants and the datasets is also straightforwardly useful; anyone can now run the same checks on their own agents.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a three-dimensional taxonomy for agentic risks (source, failure mode, consequence), constructs the ATBench benchmark guided by this taxonomy, and introduces the AgentDoG diagnostic guardrail framework. AgentDoG comprises 4B/7B/8B models (Qwen and Llama families) that perform fine-grained trajectory monitoring and root-cause diagnosis of unsafe or unreasonable agent actions, claiming state-of-the-art performance on diverse interactive scenarios with open release of models and data.

Significance. If the performance claims hold without circularity, the work could meaningfully advance agent safety research by shifting from binary guardrails to transparent, provenance-aware diagnosis. The open release of models and datasets is a clear strength for reproducibility. The structured taxonomy may also aid systematic benchmark design in the field.

major comments (2)

[Abstract] The SOTA claim rests on ATBench, which is 'guided by' the proposed taxonomy. This creates a circularity risk where reported gains may reflect taxonomy alignment rather than superior risk detection; the manuscript must report results on independent suites (e.g., ToolBench or WebArena safety subsets) to substantiate generalization to 'diverse and complex interactive scenarios'.
[Taxonomy and Benchmark sections] The assertion that the three dimensions are orthogonal and comprehensive is load-bearing for both ATBench construction and the diagnostic claims, yet no empirical validation (e.g., coverage analysis against real agent logs or inter-annotator agreement on category assignment) is provided.
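
For the inter-annotator agreement mentioned in the second comment, Cohen's kappa is one standard choice for two annotators: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each annotator's label frequencies. The sketch below is a plain-Python illustration; the paper does not say which agreement statistic, if any, the authors would use, and the function name `cohen_kappa` is chosen here.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the annotators' marginal frequencies,
    # summed over categories (a missing key in a Counter counts as 0).
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

Because the taxonomy has three axes, one reasonable protocol is to compute kappa separately for source, failure mode, and consequence, which shows which dimension annotators find hardest to assign consistently.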

minor comments (2)

Ensure experimental sections explicitly list all baselines, exact metrics (precision/recall/F1 per category), statistical tests, and ablation results on the diagnostic component so that the SOTA claim can be independently verified.
Clarify model training details (e.g., instruction tuning data composition, loss weighting for diagnosis vs. detection) to distinguish the contribution of the taxonomy from standard fine-tuning.
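
The per-category metrics requested in the first minor comment can be computed as follows. This is a minimal sketch that treats each diagnostic category one-vs-rest; the function name `per_category_prf` and the convention of returning 0.0 when a denominator is zero are assumptions of this illustration, not choices documented in the paper.

```python
from typing import Dict, Sequence, Tuple


def per_category_prf(
    gold: Sequence[str],
    pred: Sequence[str],
) -> Dict[str, Tuple[float, float, float]]:
    """Return {category: (precision, recall, f1)}, one-vs-rest."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have equal length")
    scores: Dict[str, Tuple[float, float, float]] = {}
    for cat in sorted(set(gold) | set(pred)):
        tp = sum(g == cat and p == cat for g, p in zip(gold, pred))
        fp = sum(g != cat and p == cat for g, p in zip(gold, pred))
        fn = sum(g == cat and p != cat for g, p in zip(gold, pred))
        # Undefined ratios are reported as 0.0 (a convention of this sketch).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[cat] = (precision, recall, f1)
    return scores
```

Reporting the full table rather than a single aggregate makes it visible when a strong overall score hides weak performance on rare categories.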

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful for the referee's detailed and constructive feedback on our manuscript. We address each major comment point-by-point below, outlining planned revisions to strengthen the work.


Referee: [Abstract] The SOTA claim rests on ATBench, which is 'guided by' the proposed taxonomy. This creates a circularity risk where reported gains may reflect taxonomy alignment rather than superior risk detection; the manuscript must report results on independent suites (e.g., ToolBench or WebArena safety subsets) to substantiate generalization to 'diverse and complex interactive scenarios'.

Authors: We thank the referee for highlighting this important point on potential circularity. ATBench was deliberately constructed to provide systematic coverage of the taxonomy for evaluating agentic risks, but we agree that claims of generalization to diverse scenarios benefit from evaluation on independent benchmarks. In the revised manuscript, we will report AgentDoG performance on safety-related subsets of ToolBench and WebArena to better substantiate effectiveness beyond the taxonomy-guided benchmark.
revision: yes

Referee: [Taxonomy and Benchmark sections] The assertion that the three dimensions are orthogonal and comprehensive is load-bearing for both ATBench construction and the diagnostic claims, yet no empirical validation (e.g., coverage analysis against real agent logs or inter-annotator agreement on category assignment) is provided.

Authors: We appreciate the referee's observation regarding the need for empirical support of the taxonomy's properties. The three dimensions (source, failure mode, consequence) were derived from a comprehensive review of agent safety literature and documented real-world incidents to promote orthogonality and coverage. To address this directly, the revised manuscript will include an inter-annotator agreement analysis on category assignments for a sample of trajectories and a coverage study comparing ATBench categories against logs from public agent datasets.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; new taxonomy, benchmark, and framework are constructed without definitional or fitted reduction.

full rationale

The paper proposes a three-dimensional taxonomy, then builds ATBench guided by it and introduces AgentDoG for diagnosis on agent trajectories. The SOTA claim rests on performance within this new benchmark, but no equations, parameters, or predictions reduce by construction to the taxonomy inputs or to fitted values from the same data. No self-citations are load-bearing in the provided text, and the derivation chain introduces novel elements rather than renaming or smuggling prior results. This is self-contained construction against a purpose-built benchmark, which is a normal non-circular outcome per the evaluation rules.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the introduced 3D taxonomy is a valid and complete organizing structure for agentic risks; no numerical free parameters are introduced beyond standard model training, and no new physical or mathematical entities are postulated.

axioms (1)

Domain assumption: The three risk dimensions (source, failure mode, consequence) are orthogonal and collectively exhaustive for agentic safety and security risks.
Invoked to justify the taxonomy that guides both benchmark creation and the diagnostic capability of AgentDoG.

pith-pipeline@v0.9.0 · 5678 in / 1301 out tokens · 33627 ms · 2026-05-16T11:24:11.593981+00:00 · methodology


Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment
cs.AI 2026-05 unverdicted novelty 6.0

FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.

ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis
cs.AI 2026-04 unverdicted novelty 6.0

ATBench supplies 1,000 trajectories (503 safe, 497 unsafe) organized by risk source, failure mode, and harm to evaluate long-horizon safety in LLM-based agents.

ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis
cs.AI 2026-04 unverdicted novelty 6.0

ATBench is a new trajectory-level benchmark with 1,000 diverse and realistic scenarios for assessing safety in LLM agents.

HearthNet: Edge Multi-Agent Orchestration for Smart Homes
cs.DC 2026-03 unverdicted novelty 6.0

HearthNet is an edge multi-agent orchestration system that runs role-specialized LLM agents locally to handle natural-language smart-home control, conflict resolution, and failure recovery through MQTT and shared state.

Security Considerations for Multi-agent Systems
cs.CR 2026-03 unverdicted novelty 6.0

No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.

Beyond Task Success: An Evidence-Synthesis Framework for Evaluating, Governing, and Orchestrating Agentic AI
cs.SE 2026-04 unverdicted novelty 5.0

Agentic AI evaluation and governance lack mechanisms to bind obligations to actions and prove compliance at runtime; a new synthesis framework with ODTA criteria and action-evidence bundles addresses this closure gap.

Benchmarks for Trajectory Safety Evaluation and Diagnosis in OpenClaw and Codex: ATBench-Claw and ATBench-Codex
cs.AI 2026-04 unverdicted novelty 5.0

ATBench-Claw and ATBench-Codex extend the ATBench framework by customizing a three-dimensional safety taxonomy for trajectory evaluation in OpenClaw and Codex agent settings.

Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
cs.CL 2026-05 unverdicted novelty 4.0

The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...

From Governance Norms to Enforceable Controls: A Layered Translation Method for Runtime Guardrails in Agentic AI
cs.AI 2026-04 unverdicted novelty 4.0

The paper presents a layered method to translate governance objectives from standards such as ISO/IEC 42001 into four control layers for agentic AI, with runtime guardrails limited to observable, determinate, and time...