Pith · machine review for the scientific record

arxiv: 2605.04093 · v1 · submitted 2026-04-29 · 💻 cs.CY

Recognition: unknown

Decision Evidence Maturity Model for Agentic AI: A Property-Level Method Specification


Pith reviewed 2026-05-09 20:36 UTC · model grok-4.3

classification 💻 cs.CY
keywords agentic AI · decision evidence · maturity model · evidence sufficiency · auditability · property-level reconstruction · governance questions

The pith

DEMM classifies agentic AI decision evidence into four sufficiency categories plus conflicts and aggregates them into a five-level capability rubric.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Agentic AI systems produce large volumes of execution telemetry and traces, but this evidence frequently cannot answer a specific external question about why a particular decision was made. The paper names this error, treating the presence of data containers as proof of audit readiness, the container fallacy. To address it, the Decision Evidence Maturity Model specifies four executable categories for evidence sufficiency at the level of individual decision properties, plus a protocol-level conflicting category. These per-property verdicts roll up into a five-level overall capability rubric drawn from established maturity-model practice. An open-source reconstructor tool with adapter classes for SDKs, traces, and records demonstrates the method on synthetic scenarios and public incidents, yielding completeness from 53.6% to 100%, a range the paper frames as implementation behaviour rather than external validation.

Core claim

The paper specifies the Decision Evidence Maturity Model (DEMM), a property-level reconstructability method for agentic decisions. DEMM classifies evidence sufficiency into four executable categories plus a protocol-level 'conflicting' category and aggregates per-property verdicts into a five-level capability rubric anchored to the established maturity-model lineage. The open-source Decision Trace Reconstructor ships ten executable adapter-fallback classes spanning vendor SDKs, protocol traces, public-postmortem prose, and generic JSONL records, with a reproducible feasibility exercise on 140 synthetic scenarios plus three public incidents.
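The per-property classification and five-level roll-up can be sketched in a few lines. This is a hypothetical reading of the mechanism, not the paper's implementation: the category labels, evidence fields, and level thresholds below are invented for illustration, since the review does not reproduce DEMM's exact definitions.

```python
from collections import Counter

# Hypothetical sufficiency categories; DEMM's actual labels may differ.
# "conflicting" sits outside the four, as a protocol-level state.
CATEGORIES = ["absent", "partial", "reconstructable", "verified"]

def classify_property(evidence: dict, prop: str) -> str:
    """Toy per-property verdict: how far the evidence for one decision
    property goes toward answering a governance question about it."""
    record = evidence.get(prop)
    if record is None:
        return "absent"
    if record.get("conflicts"):       # protocol-level 'conflicting' state
        return "conflicting"
    if record.get("attested"):
        return "verified"
    return "reconstructable" if record.get("complete") else "partial"

def maturity_level(verdicts: list[str]) -> int:
    """Toy aggregation into a five-level rubric; thresholds are invented."""
    counts = Counter(verdicts)
    n = len(verdicts)
    if counts["conflicting"]:
        return 1                      # conflicts cap the overall rating
    if counts["verified"] == n:
        return 5
    if counts["verified"] + counts["reconstructable"] == n:
        return 4
    if counts["absent"] == 0:
        return 3
    return 2 if counts["absent"] < n else 1
```

The design point the sketch illustrates is that the unit of assessment is the individual property, not the log as a whole: one conflicting or absent property degrades the rubric even when the rest of the telemetry is rich.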

What carries the argument

The Decision Evidence Maturity Model (DEMM), which applies four executable sufficiency categories to each decision property and rolls the verdicts into a five-level capability rubric.

If this is right

  • Evidence sufficiency becomes measurable at the individual property level rather than only at the level of entire logs or containers.
  • Per-property verdicts can be aggregated into a standardized five-level rubric that reports overall decision-process auditability.
  • External parties can assess reconstructability using only existing telemetry and traces, without model internals.
  • The open-source adapter set enables the method to run across vendor SDKs, protocol traces, and public records with measurable completeness.
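The adapter set described above suggests a fallback-chain shape: try the richest evidence source first, then degrade to more generic ones. A minimal sketch, assuming nothing about the real Decision Trace Reconstructor's API; the class names and input fields here are hypothetical.

```python
class Adapter:
    """Base adapter: extract decision evidence from one kind of source."""
    def extract(self, raw):
        raise NotImplementedError

class SDKAdapter(Adapter):
    def extract(self, raw):
        return raw.get("sdk_events")    # vendor SDK telemetry, if present

class TraceAdapter(Adapter):
    def extract(self, raw):
        return raw.get("trace_spans")   # protocol traces

class JSONLAdapter(Adapter):
    def extract(self, raw):
        return raw.get("records")       # generic JSONL records as last resort

def reconstruct(raw, chain=(SDKAdapter(), TraceAdapter(), JSONLAdapter())):
    """Walk the fallback chain; the first adapter that yields evidence wins."""
    for adapter in chain:
        result = adapter.extract(raw)
        if result is not None:
            return type(adapter).__name__, result
    return None, None
```

A chain like this is what makes the method runnable on heterogeneous inputs: the sufficiency verdicts stay the same regardless of which adapter supplied the evidence.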

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Regulators could reference the rubric when setting minimum evidence requirements for deployed agentic systems.
  • Developers could instrument their agents to target higher maturity categories by default during design.
  • The same property-level approach might transfer to auditing non-agentic AI once decision properties are defined for those systems.

Load-bearing premise

Available execution telemetry, protocol traces, and public records contain sufficient raw material to reconstruct decision properties at the level required by the four sufficiency categories without needing internal model state or additional instrumentation.

What would settle it

A set of real agentic AI deployments in which external reviewers apply the four categories to answer concrete governance questions and find that the resulting sufficiency ratings do not predict whether the evidence actually supports those answers.
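A study like that reduces to a simple scoring question: across deployments, do the sufficiency verdicts agree with whether reviewers could actually answer the governance question? A minimal sketch of that scoring rule, with an invented function name and made-up case data:

```python
def verdicts_predict_answers(cases):
    """Fraction of cases where the sufficiency verdict (evidence judged
    sufficient?) matches the outcome (question actually answerable?).
    Each case is a (sufficient, answerable) pair of booleans."""
    hits = sum(1 for sufficient, answerable in cases if sufficient == answerable)
    return hits / len(cases)
```

An agreement rate near chance would be the refuting result the section describes; a high rate would support treating the categories as predictive of real answerability.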

read the original abstract

Agentic AI systems produce decision evidence at scale through execution telemetry, but property-level reconstruction often fails when an external party asks a specific governance question about a specific decision: the assembled evidence is insufficient to answer it. We name this pattern the container fallacy: the automatic equation of evidence-container presence with audit sufficiency. This paper specifies the Decision Evidence Maturity Model (DEMM), a property-level reconstructability method for agentic decisions. DEMM classifies evidence sufficiency into four executable categories plus a protocol-level "conflicting" category and aggregates per-property verdicts into a five-level capability rubric anchored to the established maturity-model lineage. The open-source Decision Trace Reconstructor ships ten executable adapter-fallback classes spanning vendor SDKs, protocol traces, public-postmortem prose, and generic JSONL records. A reproducible feasibility exercise runs the protocol on 140 synthetic scenarios plus three public incidents; the resulting completeness range (53.6% to 100%) is implementation behaviour, not external validation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript specifies the Decision Evidence Maturity Model (DEMM) for agentic AI, which classifies decision evidence sufficiency into four executable categories (plus a 'conflicting' category at protocol level) and aggregates per-property verdicts into a five-level capability rubric. It introduces the 'container fallacy' and provides an open-source Decision Trace Reconstructor with ten adapter-fallback classes for various data sources. A feasibility exercise on 140 synthetic scenarios and three public incidents reports completeness ranging from 53.6% to 100%, presented as implementation behavior rather than external validation.

Significance. If the DEMM categories prove reconstructable from real-world telemetry, the model would offer a valuable, property-level framework for assessing audit sufficiency in agentic systems, extending maturity model approaches to AI governance. The provision of executable adapters and reproducible code is a notable strength that supports practical adoption and further testing. The work addresses a timely issue in AI accountability by distinguishing evidence presence from sufficiency.

major comments (1)
  1. [feasibility exercise section] The evaluation relies exclusively on 140 synthetic scenarios generated to exercise the protocol and three selected public post-mortems. This setup does not test whether typical production agentic execution telemetry and protocol traces contain sufficient raw material to reconstruct decision properties into the four sufficiency categories without internal model state or additional instrumentation. As a result, the reported 53.6–100% completeness range reflects implementation coverage on favorable inputs rather than demonstrating the core reconstructability assumption for real governance applications.
minor comments (1)
  1. [abstract] The abstract appropriately notes that the completeness range 'is implementation behaviour, not external validation,' but the main text would benefit from an explicit limitations subsection reiterating the scope of the synthetic and curated data.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comments and for recognizing the timeliness of the work on agentic AI accountability. We address the major comment below and outline targeted revisions to improve clarity without altering the manuscript's scope as a method specification.

read point-by-point responses
  1. Referee: [feasibility exercise section] The evaluation relies exclusively on 140 synthetic scenarios generated to exercise the protocol and three selected public post-mortems. This setup does not test whether typical production agentic execution telemetry and protocol traces contain sufficient raw material to reconstruct decision properties into the four sufficiency categories without internal model state or additional instrumentation. As a result, the reported 53.6–100% completeness range reflects implementation coverage on favorable inputs rather than demonstrating the core reconstructability assumption for real governance applications.

    Authors: We agree that the feasibility exercise uses synthetic scenarios and selected public post-mortems to demonstrate protocol executability rather than testing reconstructability from typical production telemetry. The manuscript already states that the completeness range 'is implementation behaviour, not external validation' precisely to avoid implying broader empirical claims. The primary contribution is the DEMM specification, the four sufficiency categories, the conflicting category, the five-level rubric, and the open-source Decision Trace Reconstructor with its ten adapter-fallback classes. The exercise confirms that the categories can be applied programmatically to the provided input types. We will revise the feasibility exercise section to add explicit discussion of the assumptions required for real-world governance use, including the potential need for sufficient telemetry or instrumentation in production agentic systems, and to frame the reported range as coverage on the tested inputs. This will better delineate the exercise's purpose while preserving the paper's focus.
    revision: partial

Circularity Check

0 steps flagged

No significant circularity; DEMM is a definitional specification with feasibility explicitly non-validating.

full rationale

The paper is a method specification that defines the four sufficiency categories and five-level rubric as part of the DEMM itself, then applies those same definitions to produce per-property verdicts before aggregation. This is definitional rather than a derivation that reduces an independent claim to its inputs. The feasibility exercise on 140 synthetic scenarios plus three public incidents is explicitly labeled 'implementation behaviour, not external validation' in the abstract, so the reported completeness range does not function as a prediction or test of the reconstructability assumption. No self-citation, uniqueness theorem, or ansatz is invoked as load-bearing; the framework is anchored to the general maturity-model lineage without circular reduction. The derivation chain is therefore self-contained as a proposed classification protocol.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The model rests on the domain assumption that decision properties are reconstructable from telemetry and traces, introduces the container fallacy and the DEMM categories as new constructs, and performs no parameter fitting.

axioms (1)
  • domain assumption Agentic AI execution telemetry and traces contain properties that can be classified by sufficiency for answering specific governance questions.
    This assumption underpins the entire four-category classification and is stated as the motivation for the model.
invented entities (2)
  • Container fallacy no independent evidence
    purpose: Names the automatic equation of evidence-container presence with audit sufficiency.
    New term coined in the paper to describe the core problem.
  • Decision Evidence Maturity Model (DEMM) categories and rubric no independent evidence
    purpose: Provides the four executable sufficiency categories, conflicting state, and five-level aggregation.
    New specification introduced by the authors.

pith-pipeline@v0.9.0 · 5461 in / 1406 out tokens · 31527 ms · 2026-05-09T20:36:32.687952+00:00 · methodology

discussion (0)

