Decision Evidence Maturity Model for Agentic AI: A Property-Level Method Specification
Pith reviewed 2026-05-09 20:36 UTC · model grok-4.3
The pith
DEMM classifies agentic AI decision evidence into four sufficiency categories plus conflicts and aggregates them into a five-level capability rubric.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper specifies the Decision Evidence Maturity Model (DEMM), a property-level reconstructability method for agentic decisions. DEMM classifies evidence sufficiency into four executable categories plus a protocol-level 'conflicting' category and aggregates per-property verdicts into a five-level capability rubric anchored to the established maturity-model lineage. The open-source Decision Trace Reconstructor ships ten executable adapter-fallback classes spanning vendor SDKs, protocol traces, public-postmortem prose, and generic JSONL records, with a reproducible feasibility exercise on 140 synthetic scenarios plus three public incidents.
What carries the argument
The Decision Evidence Maturity Model (DEMM), which applies four executable sufficiency categories to each decision property and rolls the verdicts into a five-level capability rubric.
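The paper's own category and level names are not reproduced in this summary. As an illustration only, per-property classification and the rubric roll-up might be sketched as follows; the category names, the `None` sentinel for the protocol-level "conflicting" verdict, and the weakest-link (min) aggregation rule are all assumptions of this sketch, not the paper's definitions.

```python
from enum import IntEnum

class Sufficiency(IntEnum):
    """Hypothetical ordering of the four sufficiency categories.
    The paper's actual category names are not reproduced here."""
    ABSENT = 0
    PARTIAL = 1
    RECONSTRUCTABLE = 2
    DIRECT = 3

# Protocol-level verdict that sits outside the ordered scale (assumed encoding).
CONFLICTING = None

def aggregate(verdicts):
    """Roll per-property verdicts into one of five capability levels (0-4).
    Assumption: any conflicting verdict pins the aggregate to the floor;
    otherwise the weakest property bounds the level (a min rule)."""
    if any(v is CONFLICTING for v in verdicts.values()):
        return 0
    return 1 + min(verdicts.values())  # maps category 0..3 onto levels 1..4

verdicts = {
    "inputs": Sufficiency.DIRECT,
    "alternatives": Sufficiency.PARTIAL,
    "rationale": Sufficiency.RECONSTRUCTABLE,
}
level = aggregate(verdicts)  # bounded by the weakest property
```

A min rule is only one plausible aggregation choice; it encodes the intuition that a single unreconstructable property caps what an external auditor can certify about the decision as a whole.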
If this is right
- Evidence sufficiency becomes measurable at the individual property level rather than only at the level of entire logs or containers.
- Per-property verdicts can be aggregated into a standardized five-level rubric that reports overall decision-process auditability.
- External parties can assess reconstructability using only existing telemetry and traces, without model internals.
- The open-source adapter set enables the method to run across vendor SDKs, protocol traces, and public records with measurable completeness.
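The adapter-fallback design can be illustrated with a two-adapter chain: each record is handed to the first adapter that accepts it, with a prose adapter as the last resort. The class names and the `accepts`/`extract` interface are assumptions of this sketch, not the Decision Trace Reconstructor's actual API.

```python
import json

class JsonlAdapter:
    """Illustrative adapter for generic JSONL records (hypothetical interface)."""
    def accepts(self, raw):
        try:
            return isinstance(json.loads(raw), dict)
        except (ValueError, TypeError):
            return False
    def extract(self, raw):
        rec = json.loads(raw)
        return {"timestamp": rec.get("ts"), "action": rec.get("action")}

class ProseAdapter:
    """Last-resort adapter for public-postmortem prose: passes text through."""
    def accepts(self, raw):
        return isinstance(raw, str)
    def extract(self, raw):
        return {"narrative": raw}

def reconstruct(raw, adapters):
    """Try adapters in declared priority order; the first match wins."""
    for adapter in adapters:
        if adapter.accepts(raw):
            return adapter.extract(raw)
    return None  # no adapter accepted the record

chain = [JsonlAdapter(), ProseAdapter()]
reconstruct('{"ts": 1, "action": "tool_call"}', chain)
```

Ordering the chain from most to least structured is what lets the same pipeline degrade gracefully from vendor SDK traces down to postmortem prose.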
Where Pith is reading between the lines
- Regulators could reference the rubric when setting minimum evidence requirements for deployed agentic systems.
- Developers could instrument their agents to target higher maturity categories by default during design.
- The same property-level approach might transfer to auditing non-agentic AI once decision properties are defined for those systems.
Load-bearing premise
Available execution telemetry, protocol traces, and public records contain sufficient raw material to reconstruct decision properties at the level required by the four sufficiency categories without needing internal model state or additional instrumentation.
What would settle it
A set of real agentic AI deployments in which external reviewers apply the four categories to answer concrete governance questions, testing whether the resulting sufficiency ratings predict whether the evidence actually supports those answers.
Original abstract
Agentic AI systems produce decision evidence at scale through execution telemetry, but property-level reconstruction often fails when an external party asks a specific governance question about a specific decision: the assembled evidence is insufficient to answer it. We name this pattern the container fallacy: the automatic equation of evidence-container presence with audit sufficiency. This paper specifies the Decision Evidence Maturity Model (DEMM), a property-level reconstructability method for agentic decisions. DEMM classifies evidence sufficiency into four executable categories plus a protocol-level "conflicting" category and aggregates per-property verdicts into a five-level capability rubric anchored to the established maturity-model lineage. The open-source Decision Trace Reconstructor ships ten executable adapter-fallback classes spanning vendor SDKs, protocol traces, public-postmortem prose, and generic JSONL records. A reproducible feasibility exercise runs the protocol on 140 synthetic scenarios plus three public incidents; the resulting completeness range (53.6% to 100%) is implementation behaviour, not external validation.
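The abstract's completeness figures are ratios over the tested inputs. As a hedged illustration (the paper's exact metric definition is not given here), per-scenario completeness could be the fraction of expected decision properties that a run reconstructs; 15 of 28 properties happens to reproduce the reported 53.6% floor and is chosen purely for illustration, not as the paper's actual counts.

```python
def completeness(expected, reconstructed):
    """Fraction of expected decision properties the run reconstructed.
    Assumption: the paper's metric may weight or define this differently."""
    expected = set(expected)
    if not expected:
        return 1.0
    return len(set(reconstructed) & expected) / len(expected)

# Illustrative only: 15 of 28 properties recovered matches the reported floor.
round(completeness(range(28), range(15)) * 100, 1)  # 53.6
```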
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript specifies the Decision Evidence Maturity Model (DEMM) for agentic AI, which classifies decision evidence sufficiency into four executable categories (plus a 'conflicting' category at protocol level) and aggregates per-property verdicts into a five-level capability rubric. It introduces the 'container fallacy' and provides an open-source Decision Trace Reconstructor with ten adapter-fallback classes for various data sources. A feasibility exercise on 140 synthetic scenarios and three public incidents reports completeness ranging from 53.6% to 100%, presented as implementation behavior rather than external validation.
Significance. If the DEMM categories prove reconstructable from real-world telemetry, the model would offer a valuable, property-level framework for assessing audit sufficiency in agentic systems, extending maturity model approaches to AI governance. The provision of executable adapters and reproducible code is a notable strength that supports practical adoption and further testing. The work addresses a timely issue in AI accountability by distinguishing evidence presence from sufficiency.
major comments (1)
- [feasibility exercise section] The evaluation relies exclusively on 140 synthetic scenarios generated to exercise the protocol and three selected public post-mortems. This setup does not test whether typical production agentic execution telemetry and protocol traces contain sufficient raw material to reconstruct decision properties into the four sufficiency categories without internal model state or additional instrumentation. As a result, the reported 53.6–100% completeness range reflects implementation coverage on favorable inputs rather than demonstrating the core reconstructability assumption for real governance applications.
minor comments (1)
- [abstract] The abstract appropriately notes that the completeness range 'is implementation behaviour, not external validation,' but the main text would benefit from an explicit limitations subsection reiterating the scope of the synthetic and curated data.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and for recognizing the timeliness of the work on agentic AI accountability. We address the major comment below and outline targeted revisions to improve clarity without altering the manuscript's scope as a method specification.
Point-by-point responses
- Referee: The evaluation relies exclusively on 140 synthetic scenarios generated to exercise the protocol and three selected public post-mortems. This setup does not test whether typical production agentic execution telemetry and protocol traces contain sufficient raw material to reconstruct decision properties into the four sufficiency categories without internal model state or additional instrumentation. As a result, the reported 53.6–100% completeness range reflects implementation coverage on favorable inputs rather than demonstrating the core reconstructability assumption for real governance applications.
Authors: We agree that the feasibility exercise uses synthetic scenarios and selected public post-mortems to demonstrate protocol executability rather than to test reconstructability from typical production telemetry. The manuscript already states that the completeness range 'is implementation behaviour, not external validation' precisely to avoid implying broader empirical claims. The primary contributions are the DEMM specification, the four sufficiency categories, the conflicting category, the five-level rubric, and the open-source Decision Trace Reconstructor with its ten adapter-fallback classes. The exercise confirms that the categories can be applied programmatically to the provided input types. We will revise the feasibility exercise section to add explicit discussion of the assumptions required for real-world governance use, including the potential need for sufficient telemetry or instrumentation in production agentic systems, and to frame the reported range as coverage on the tested inputs. This will better delineate the exercise's purpose while preserving the paper's focus.
Revision: partial
Circularity Check
No significant circularity; DEMM is a definitional specification with feasibility explicitly non-validating.
full rationale
The paper is a method specification that defines the four sufficiency categories and five-level rubric as part of the DEMM itself, then applies those same definitions to produce per-property verdicts before aggregation. This is definitional rather than a derivation that reduces an independent claim to its inputs. The feasibility exercise on 140 synthetic scenarios plus three public incidents is explicitly labeled 'implementation behaviour, not external validation' in the abstract, so the reported completeness range does not function as a prediction or test of the reconstructability assumption. No self-citation, uniqueness theorem, or ansatz is invoked as load-bearing; the framework is anchored to the general maturity-model lineage without circular reduction. The derivation chain is therefore self-contained as a proposed classification protocol.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Agentic AI execution telemetry and traces contain properties that can be classified by sufficiency for answering specific governance questions.
invented entities (2)
- Container fallacy: no independent evidence
- Decision Evidence Maturity Model (DEMM) categories and rubric: no independent evidence