AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security
Pith reviewed 2026-05-16 11:24 UTC · model grok-4.3
The pith
AgentDoG uses a three-dimensional taxonomy to diagnose root causes of unsafe actions in AI agents beyond binary labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AgentDoG is a diagnostic guardrail framework that provides fine-grained and contextual monitoring across agent trajectories and diagnoses the root causes of unsafe actions and seemingly safe but unreasonable actions, offering provenance and transparency beyond binary labels to facilitate effective agent alignment.
What carries the argument
The three-dimensional taxonomy that orthogonally categorizes agentic risks by source (where), failure mode (how), and consequence (what), which structures both the benchmark and the diagnostic monitoring process.
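One way to picture a diagnosis under this taxonomy is as a point in a three-axis label space. The sketch below is illustrative only: the class names `RiskLabel` and `Diagnosis` and the category strings are hypothetical and are not taken from the paper, which may structure its outputs differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RiskLabel:
    """A point in the (source, failure_mode, consequence) label space."""
    source: str         # where the risk originated
    failure_mode: str   # how the agent failed
    consequence: str    # what harm resulted or could result


@dataclass(frozen=True)
class Diagnosis:
    """Verdict for one trajectory; label stays None when the verdict is safe."""
    unsafe: bool
    label: Optional[RiskLabel] = None

    def __post_init__(self):
        # An unsafe verdict must carry a root-cause label; a safe one must not.
        if self.unsafe != (self.label is not None):
            raise ValueError("unsafe verdicts need a label; safe verdicts must not have one")


# Hypothetical category names, for illustration only (not from the paper).
example = Diagnosis(
    unsafe=True,
    label=RiskLabel(
        source="tool_output",
        failure_mode="followed_injected_instruction",
        consequence="data_exfiltration",
    ),
)
safe = Diagnosis(unsafe=False)
```

Keeping the three axes as separate fields, rather than one flat category string, is what lets a binary verdict and a root-cause explanation coexist in the same record.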
If this is right
- Enables transparent diagnosis that supports targeted fixes during agent alignment.
- Provides fine-grained monitoring that captures risks emerging across entire interaction trajectories.
- Achieves superior performance in safety moderation for complex, tool-using agent scenarios.
- Releases models and datasets to allow community extension of the diagnostic approach.
Where Pith is reading between the lines
- The taxonomy structure could be reused to create diagnostic guardrails for non-agent AI systems with sequential decision making.
- Root-cause outputs might generate synthetic training data focused on specific failure modes to reduce recurrence.
- Integration into agent runtime loops could allow real-time intervention before consequences materialize.
Load-bearing premise
The three-dimensional taxonomy is orthogonal, comprehensive, and sufficient to cover all relevant agent behaviors for accurate root-cause diagnosis.
What would settle it
A collection of agent trajectories containing unsafe behaviors where human experts identify root causes outside the taxonomy categories or where AgentDoG diagnosis mismatches expert analysis.
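The test described above reduces to two numbers: how often expert root causes fall outside the taxonomy, and how often the model's diagnosis disagrees with the expert's on each dimension. The following is a minimal sketch under assumed data shapes; the function name `settle_check` and the dictionary layout are hypothetical, not part of the paper.

```python
DIMENSIONS = ("source", "failure_mode", "consequence")


def settle_check(expert, model, taxonomy):
    """Compare expert and model diagnoses for the same trajectories.

    expert, model: lists of dicts keyed by DIMENSIONS, aligned by trajectory.
    taxonomy: dict mapping each dimension to the set of allowed categories.
    """
    n = len(expert)
    if n == 0 or n != len(model):
        raise ValueError("expert and model must be non-empty and the same length")

    # Trajectories where the expert named a root cause the taxonomy lacks.
    outside = sum(
        any(e[d] not in taxonomy[d] for d in DIMENSIONS) for e in expert
    )
    # Per-dimension disagreement between model and expert.
    mismatch = {
        d: sum(e[d] != m[d] for e, m in zip(expert, model)) / n
        for d in DIMENSIONS
    }
    return {"out_of_taxonomy_rate": outside / n, "mismatch_rate": mismatch}
```

A high out-of-taxonomy rate would challenge the comprehensiveness premise, while a high mismatch rate on one dimension would point at the diagnostic model rather than the taxonomy.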
Original abstract
The rise of AI agents introduces complex safety and security challenges arising from autonomous tool use and environmental interactions. Current guardrail models lack agentic risk awareness and transparency in risk diagnosis. To introduce an agentic guardrail that covers complex and numerous risky behaviors, we first propose a unified three-dimensional taxonomy that orthogonally categorizes agentic risks by their source (where), failure mode (how), and consequence (what). Guided by this structured and hierarchical taxonomy, we introduce a new fine-grained agentic safety benchmark (ATBench) and a Diagnostic Guardrail framework for agent safety and security (AgentDoG). AgentDoG provides fine-grained and contextual monitoring across agent trajectories. More crucially, AgentDoG can diagnose the root causes of unsafe actions and seemingly safe but unreasonable actions, offering provenance and transparency beyond binary labels to facilitate effective agent alignment. AgentDoG variants are available in three sizes (4B, 7B, and 8B parameters) across Qwen and Llama model families. Extensive experimental results demonstrate that AgentDoG achieves state-of-the-art performance in agentic safety moderation in diverse and complex interactive scenarios. All models and datasets are openly released.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a three-dimensional taxonomy for agentic risks (source, failure mode, consequence), constructs the ATBench benchmark guided by this taxonomy, and introduces the AgentDoG diagnostic guardrail framework. AgentDoG comprises 4B/7B/8B models (Qwen and Llama families) that perform fine-grained trajectory monitoring and root-cause diagnosis of unsafe or unreasonable agent actions, claiming state-of-the-art performance on diverse interactive scenarios with open release of models and data.
Significance. If the performance claims hold without circularity, the work could meaningfully advance agent safety research by shifting from binary guardrails to transparent, provenance-aware diagnosis. The open release of models and datasets is a clear strength for reproducibility. The structured taxonomy may also aid systematic benchmark design in the field.
major comments (2)
- [Abstract] The SOTA claim rests on ATBench, which is 'guided by' the proposed taxonomy. This creates a circularity risk where reported gains may reflect taxonomy alignment rather than superior risk detection; the manuscript must report results on independent suites (e.g., ToolBench or WebArena safety subsets) to substantiate generalization to 'diverse and complex interactive scenarios'.
- [Taxonomy and Benchmark sections] The assertion that the three dimensions are orthogonal and comprehensive is load-bearing for both ATBench construction and the diagnostic claims, yet no empirical validation (e.g., coverage analysis against real agent logs or inter-annotator agreement on category assignment) is provided.
minor comments (2)
- Ensure experimental sections explicitly list all baselines, exact metrics (precision/recall/F1 per category), statistical tests, and ablation results on the diagnostic component so that the SOTA claim can be independently verified.
- Clarify model training details (e.g., instruction tuning data composition, loss weighting for diagnosis vs. detection) to distinguish the contribution of the taxonomy from standard fine-tuning.
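The per-category metrics requested in the first minor comment can be computed from aligned gold and predicted labels. This is a generic sketch of standard one-vs-rest precision, recall, and F1; the function name `per_category_prf` is hypothetical, and nothing here reflects the paper's actual evaluation code.

```python
from collections import Counter


def per_category_prf(gold, pred):
    """Return {category: (precision, recall, f1)} for single-label predictions."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted category p, but it was wrong
            fn[g] += 1  # gold category g was missed

    scores = {}
    for c in set(gold) | set(pred):
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[c] = (precision, recall, f1)
    return scores
```

Running this once per taxonomy dimension would show whether a headline score hides weak categories.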
Simulated Author's Rebuttal
We are grateful for the referee's detailed and constructive feedback on our manuscript. We address each major comment point-by-point below, outlining planned revisions to strengthen the work.
- Referee: [Abstract] The SOTA claim rests on ATBench, which is 'guided by' the proposed taxonomy. This creates a circularity risk where reported gains may reflect taxonomy alignment rather than superior risk detection; the manuscript must report results on independent suites (e.g., ToolBench or WebArena safety subsets) to substantiate generalization to 'diverse and complex interactive scenarios'.
Authors: We thank the referee for highlighting this important point on potential circularity. ATBench was deliberately constructed to provide systematic coverage of the taxonomy for evaluating agentic risks, but we agree that claims of generalization to diverse scenarios benefit from evaluation on independent benchmarks. In the revised manuscript, we will report AgentDoG performance on safety-related subsets of ToolBench and WebArena to better substantiate effectiveness beyond the taxonomy-guided benchmark. revision: yes
- Referee: [Taxonomy and Benchmark sections] The assertion that the three dimensions are orthogonal and comprehensive is load-bearing for both ATBench construction and the diagnostic claims, yet no empirical validation (e.g., coverage analysis against real agent logs or inter-annotator agreement on category assignment) is provided.
Authors: We appreciate the referee's observation regarding the need for empirical support of the taxonomy's properties. The three dimensions (source, failure mode, consequence) were derived from a comprehensive review of agent safety literature and documented real-world incidents to promote orthogonality and coverage. To address this directly, the revised manuscript will include an inter-annotator agreement analysis on category assignments for a sample of trajectories and a coverage study comparing ATBench categories against logs from public agent datasets. revision: yes
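The inter-annotator agreement analysis promised in the rebuttal is commonly reported as Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is the standard two-annotator formula; the rebuttal does not say which statistic the authors will use, so the choice of kappa is an assumption.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("both annotators must label the same non-empty item set")

    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if p_e == 1.0:
        # Both annotators used one identical category for every item.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)
```

Computed separately for source, failure mode, and consequence, a low kappa on one dimension would suggest its categories are ambiguous to human annotators.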
Circularity Check
No significant circularity; new taxonomy, benchmark, and framework are constructed without definitional or fitted reduction.
The paper proposes a three-dimensional taxonomy, then builds ATBench guided by it and introduces AgentDoG for diagnosis on agent trajectories. The SOTA claim rests on performance within this new benchmark, but no equations, parameters, or predictions reduce by construction to the taxonomy inputs or to fitted values from the same data. No self-citations are load-bearing in the provided text, and the derivation chain introduces novel elements rather than renaming or smuggling prior results. This is self-contained construction against a purpose-built benchmark, which is a normal non-circular outcome per the evaluation rules.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The three risk dimensions (source, failure mode, consequence) are orthogonal and collectively exhaustive for agentic safety and security risks.
Forward citations
Cited by 9 Pith papers
- On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment
  FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.
- ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis
  ATBench supplies 1,000 trajectories (503 safe, 497 unsafe) organized by risk source, failure mode, and harm to evaluate long-horizon safety in LLM-based agents.
- ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis
  ATBench is a new trajectory-level benchmark with 1,000 diverse and realistic scenarios for assessing safety in LLM agents.
- HearthNet: Edge Multi-Agent Orchestration for Smart Homes
  HearthNet is an edge multi-agent orchestration system that runs role-specialized LLM agents locally to handle natural-language smart-home control, conflict resolution, and failure recovery through MQTT and shared state.
- Security Considerations for Multi-agent Systems
  No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.
- Beyond Task Success: An Evidence-Synthesis Framework for Evaluating, Governing, and Orchestrating Agentic AI
  Agentic AI evaluation and governance lack mechanisms to bind obligations to actions and prove compliance at runtime; a new synthesis framework with ODTA criteria and action-evidence bundles addresses this closure gap.
- Benchmarks for Trajectory Safety Evaluation and Diagnosis in OpenClaw and Codex: ATBench-Claw and ATBench-Codex
  ATBench-Claw and ATBench-Codex extend the ATBench framework by customizing a three-dimensional safety taxonomy for trajectory evaluation in OpenClaw and Codex agent settings.
- Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
  The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...
- From Governance Norms to Enforceable Controls: A Layered Translation Method for Runtime Guardrails in Agentic AI
  The paper presents a layered method to translate governance objectives from standards such as ISO/IEC 42001 into four control layers for agentic AI, with runtime guardrails limited to observable, determinate, and time...