DEMM-Bench: A Cross-Regime Benchmark for Agent-Runtime Governance-Evidence Sufficiency

Oleg Solozobov

arxiv: 2606.20634 · v1 · pith:Z3SG5TFPnew · submitted 2026-05-30 · 💻 cs.AI · cs.CY

Pith reviewed 2026-06-28 18:49 UTC · model grok-4.3

keywords agent runtime · governance evidence · decision sufficiency · benchmark · overclaim · property reconstruction · DEMM

The pith

A benchmark shows that presence-based checks on common agent evidence (traces, schemas, ledgers) overclaim sufficiency for governance questions on 50 to 75 percent of cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DEMM-Bench to determine whether records emitted by agent-runtime systems contain enough information to answer concrete governance questions about a decision, rather than merely existing in some container. It normalizes eight evidence regimes through adapters, poses questions about actor, authority, action, policy, decision basis, resource touch, lifecycle context, and verification strength, and applies eight deterministic degradation conditions to 64 manuscript cases. Trace-present and schema-present baselines overclaim on 75 percent of cases while ledger-present overclaims on 50 percent; a redacted property-level candidate scorer reaches 56.25 percent mean Property Sufficiency Accuracy with zero overclaim. The deposited package supplies the dataset, labels, baselines, and adapters for reproducible tests across heterogeneous substrates.
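The difference between a presence-style baseline and a property-level scorer can be illustrated with a minimal sketch. The summary does not give the paper's formal definitions, so everything here is an assumption: a case "overclaims" when at least one property is claimed sufficient but the oracle labels it insufficient, and Property Sufficiency Accuracy is the share of properties whose claim matches the oracle. The record layout, the function names, and the four toy cases are hypothetical and are not taken from the deposited package.

```python
from dataclasses import dataclass

# The eight property questions named in the text.
PROPERTIES = (
    "actor", "authority", "action", "policy", "decision_basis",
    "resource_touch", "lifecycle_context", "verification_strength",
)

@dataclass(frozen=True)
class Case:
    record: dict  # normalized evidence: property name -> value (hypothetical layout)
    oracle: dict  # property name -> True if the property is truly reconstructable

def labels(*sufficient):
    """Build an oracle label dict with the given properties marked sufficient."""
    return {p: p in sufficient for p in PROPERTIES}

def presence_baseline(case):
    # Presence-style check: any record at all counts as sufficient for everything.
    claim = bool(case.record)
    return {p: claim for p in PROPERTIES}

def property_scorer(case):
    # Property-level check: claim a property only if its own field is populated.
    return {p: bool(case.record.get(p)) for p in PROPERTIES}

def overclaims(pred, oracle):
    # Assumed definition: some property is claimed sufficient but labeled insufficient.
    return any(pred[p] and not oracle[p] for p in PROPERTIES)

def psa(pred, oracle):
    # Assumed definition: share of properties whose claim matches the oracle label.
    return sum(pred[p] == oracle[p] for p in PROPERTIES) / len(PROPERTIES)

def evaluate(scorer, cases):
    n = len(cases)
    overclaim_rate = sum(overclaims(scorer(c), c.oracle) for c in cases) / n
    mean_psa = sum(psa(scorer(c), c.oracle) for c in cases) / n
    return overclaim_rate, mean_psa

# Four toy cases (invented for illustration, not drawn from the 64-case dataset).
CASES = [
    Case({p: "x" for p in PROPERTIES}, labels(*PROPERTIES)),
    Case({"actor": "agent-7", "action": "send_email"}, labels("actor", "action")),
    Case({}, labels()),
    Case({"actor": "agent-7"}, labels("actor", "policy")),  # scorer will underclaim policy
]

print(evaluate(presence_baseline, CASES))
print(evaluate(property_scorer, CASES))
```

On these toy cases the presence baseline overclaims on half of them, while the property-level scorer never overclaims but loses some accuracy by underclaiming. That mirrors the shape of the reported result, not its numbers.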

Core claim

DEMM-Bench demonstrates that evidence regimes do not necessarily answer governance questions about specific decisions. Across the 64 cases and eight degradation conditions, trace-present and schema-present baselines overclaim on 75 percent of cases and ledger-present overclaims on 50 percent, while the redacted property-level candidate scorer achieves zero overclaim with 56.25 percent mean Property Sufficiency Accuracy.
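As a quick sanity check, the reported percentages convert to whole case counts. The sketch assumes the overclaim rates are taken over all 64 cases, as the abstract states. It also assumes each case is scored on all eight properties, which the summary does not confirm.

```python
from fractions import Fraction

N_CASES = 64
N_PROPERTIES = 8  # assumption: every case is scored on all eight properties

# Overclaim rates as reported in the abstract.
overclaim_rate = {
    "trace-present": Fraction(75, 100),
    "schema-present": Fraction(75, 100),
    "ledger-present": Fraction(50, 100),
    "property-level scorer": Fraction(0),
}

# Number of cases each rate corresponds to.
overclaim_cases = {name: int(rate * N_CASES) for name, rate in overclaim_rate.items()}

# 56.25% mean Property Sufficiency Accuracy as an exact fraction.
mean_psa = Fraction(5625, 10000)
mean_correct_properties = float(mean_psa * N_PROPERTIES)

print(overclaim_cases)
print(mean_psa, mean_correct_properties)
```

Under these assumptions the baselines overclaim on 48, 48, and 32 cases, and a mean accuracy of 9/16 corresponds to 4.5 correctly judged properties per case on average.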

What carries the argument

DEMM-Bench, which uses adapters to normalize eight evidence regimes and evaluates sufficiency by asking property questions under eight deterministic degradation conditions.
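A rough sketch of adapter-based normalization follows. The raw field names (`agent.id`, `resource.uri`, `signer`, `grant_id`, and so on), the two adapters, and the `normalize` helper are all hypothetical. The deposited package's actual adapter interface is not described in this summary.

```python
from typing import Callable, Dict, Optional

PROPERTIES = (
    "actor", "authority", "action", "policy", "decision_basis",
    "resource_touch", "lifecycle_context", "verification_strength",
)

NormalizedRecord = Dict[str, Optional[str]]

def adapt_trace(span: dict) -> NormalizedRecord:
    # Hypothetical trace span: a name plus free-form attributes.
    attrs = span.get("attributes", {})
    return {
        "actor": attrs.get("agent.id"),
        "action": span.get("name"),
        "resource_touch": attrs.get("resource.uri"),
    }

def adapt_ledger(entry: dict) -> NormalizedRecord:
    # Hypothetical ledger entry: a signed operation with an optional grant.
    return {
        "actor": entry.get("signer"),
        "action": entry.get("op"),
        "authority": entry.get("grant_id"),
        "verification_strength": "signed" if entry.get("signature") else None,
    }

ADAPTERS: Dict[str, Callable[[dict], NormalizedRecord]] = {
    "trace": adapt_trace,
    "ledger": adapt_ledger,
}

def normalize(regime: str, raw: dict) -> NormalizedRecord:
    # Start with every property unanswered, then fill in what the regime provides.
    record: NormalizedRecord = {p: None for p in PROPERTIES}
    record.update(ADAPTERS[regime](raw))
    return record

span = {"name": "send_email", "attributes": {"agent.id": "agent-7"}}
print(normalize("trace", span))
```

Mapping every regime onto the same eight keys, with `None` for anything the regime cannot supply, is one way to let the same property questions be asked across heterogeneous substrates.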

If this is right

Mere presence of traces, schemas, or ledgers does not guarantee sufficiency for reconstructing decision-level governance properties.
A property-level scorer can eliminate overclaiming while retaining measurable accuracy.
The benchmark supplies a reproducible method to compare evidence maturity across different agent-runtime substrates.
Eight degradation conditions provide a controlled way to test how evidence quality affects governance answers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agent system builders may need to instrument evidence collection specifically around the eight property categories rather than relying on general logs.
Governance policies for agents could specify minimum sufficiency thresholds derived from benchmarks like this one.
The same sufficiency-testing approach might transfer to audit needs in other autonomous decision domains such as robotics or financial agents.

Load-bearing premise

The 64 manuscript cases together with the eight deterministic degradation conditions and the chosen property questions form a representative sample of real agent-runtime governance scenarios.

What would settle it

Applying DEMM-Bench to a fresh collection of 100 or more real deployed agent decision traces and obtaining substantially different overclaim rates or Property Sufficiency Accuracy figures.

Original abstract

Agent-runtime systems emit traces, ledgers, provenance graphs, policy logs, delegation tokens, cache events, and tool-firewall records, but those containers do not necessarily answer governance questions about a specific decision. DEMM-Bench is a cross-regime benchmark for agent-runtime governance-evidence sufficiency, grounded in the Decision Evidence Maturity Model (DEMM): it measures whether records across eight evidence regimes are sufficient to reconstruct decision-level properties rather than merely present. The benchmark normalizes the regimes through adapters, asks property questions over actor, authority, action, policy, decision basis, resource touch, lifecycle context, and verification strength, and applies eight deterministic degradation conditions. Across 64 manuscript cases, trace-present and schema-present baselines overclaim on 75% of cases, ledger-present overclaims on 50%, and the redacted property-level candidate scorer has zero overclaim with 56.25% mean Property Sufficiency Accuracy. The deposited package provides the 64-case dataset, construction-oracle labels, baselines, and adapters, supporting reproducible evaluation of decision-evidence maturity across heterogeneous agent-runtime evidence substrates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

DEMM-Bench gives a concrete cross-regime test for whether runtime evidence answers governance properties, with a scorer showing zero overclaim on the reported cases, but the 64 manuscript cases lack any shown sampling justification.

The main takeaway is that a property-level scorer avoids overclaiming decision-evidence sufficiency, at 56.25% mean Property Sufficiency Accuracy, while trace, schema, and ledger baselines overclaim on half to three-quarters of the 64 cases.

What is new is the combination of eight evidence regimes normalized through adapters, eight fixed property questions covering actor through verification strength, and eight deterministic degradation conditions applied to the same cases. The deposited package with the dataset, oracle labels, baselines, and adapters makes the evaluation reproducible, which is a practical step forward for this corner of governance work.

The paper does well at turning the abstract idea of evidence sufficiency into measurable property questions and reporting direct overclaim rates instead of derived predictions.

The soft spot is the case set itself. All 64 cases come from manuscripts with internally generated labels, and no sampling frame, diversity metrics, or external validation is described. The stress-test concern about representativeness holds up on the given details; without such validation, the zero-overclaim result stays tied to this particular collection and does not yet demonstrate general detection of evidence maturity.

This is for researchers building or auditing agent runtimes who need standardized ways to check whether records answer specific decision questions. A reader focused on evidence standards or benchmark design would get direct value from the construction and the public materials.

It deserves peer review because the benchmark is defined clearly enough to discuss and the package lets referees and others inspect the claims directly.

Referee Report

3 major / 0 minor

Summary. The manuscript introduces DEMM-Bench, a benchmark grounded in the Decision Evidence Maturity Model (DEMM) for evaluating whether records from eight agent-runtime evidence regimes (traces, ledgers, provenance graphs, etc.) are sufficient to answer governance questions on eight decision-level properties (actor, authority, action, policy, decision basis, resource touch, lifecycle context, verification strength). Regimes are normalized via adapters; eight deterministic degradation conditions are applied to 64 manuscript cases. Reported results: trace-present and schema-present baselines overclaim on 75% of cases, ledger-present on 50%, while a redacted property-level scorer achieves zero overclaim and 56.25% mean Property Sufficiency Accuracy. A deposited package supplies the dataset, construction-oracle labels, baselines, and adapters for reproducibility.

Significance. If the case selection and scoring procedures hold under scrutiny, the benchmark offers a concrete, cross-regime method to distinguish evidence presence from sufficiency for governance questions in agent systems. The deposited package enabling reproducible evaluation of decision-evidence maturity is a clear strength, allowing direct community testing across heterogeneous substrates. This could support more rigorous auditability requirements in deployed agent runtimes.

major comments (3)

[Abstract] The central numerical claims (56.25% mean Property Sufficiency Accuracy for the redacted scorer; specific overclaim percentages for the three baselines) are stated without any description of how the eight property questions are scored, how the redacted scorer is constructed, or the procedure used to select and label the 64 cases. These omissions make the reported performance figures unverifiable from the manuscript text.
[Abstract] The validity of the test distribution rests on the 64 manuscript cases plus the eight degradation conditions forming a representative sample of real agent-runtime governance scenarios, yet no sampling frame, diversity metrics, external validation, or coverage analysis (e.g., multi-agent or long-horizon decisions) is supplied. The construction-oracle labels are generated internally, leaving open whether the sufficiency-accuracy results generalize beyond the chosen manuscript-derived scenarios.
[Abstract] The eight deterministic degradation conditions are invoked as part of the evaluation protocol but receive no definition or operational description, preventing assessment of whether they adequately probe evidence sufficiency across the eight properties.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for these focused comments on the abstract. We address each point below. Where the abstract's brevity has created gaps in verifiability, we will revise it; where the full manuscript already supplies the requested detail, we will ensure the abstract points to it explicitly.

Referee: [Abstract] The central numerical claims (56.25% mean Property Sufficiency Accuracy for the redacted scorer; specific overclaim percentages for the three baselines) are stated without any description of how the eight property questions are scored, how the redacted scorer is constructed, or the procedure used to select and label the 64 cases. These omissions make the reported performance figures unverifiable from the manuscript text.

Authors: We agree the abstract is too terse on these points. The full manuscript (Sections 3.1–3.3 and 4.1) defines the eight property questions, describes the redacted property-level scorer as a deterministic evaluator that inspects only the normalized property fields after adapter application, and details the case-selection procedure (systematic enumeration across the eight regimes with construction-oracle labeling). We will revise the abstract to include one-sentence summaries of each element so the numerical claims become verifiable from the abstract alone.
revision: yes

Referee: [Abstract] The validity of the test distribution rests on the 64 manuscript cases plus the eight degradation conditions forming a representative sample of real agent-runtime governance scenarios, yet no sampling frame, diversity metrics, external validation, or coverage analysis (e.g., multi-agent or long-horizon decisions) is supplied. The construction-oracle labels are generated internally, leaving open whether the sufficiency-accuracy results generalize beyond the chosen manuscript-derived scenarios.

Authors: The 64 cases were constructed to exhaustively cover the eight regimes and eight properties using an internal construction oracle; no external sampling frame or diversity metrics were applied. We acknowledge this is a genuine limitation of the current benchmark and that generalization beyond the manuscript-derived scenarios remains untested. In revision we will add an explicit Limitations paragraph stating these constraints and noting that future releases will incorporate external validation and multi-agent/long-horizon cases.
revision: partial

Referee: [Abstract] The eight deterministic degradation conditions are invoked as part of the evaluation protocol but receive no definition or operational description, preventing assessment of whether they adequately probe evidence sufficiency across the eight properties.

Authors: The eight degradation conditions (trace truncation, schema stripping, ledger redaction, provenance edge removal, policy-log omission, token invalidation, cache eviction, and firewall-log suppression) are defined with operational pseudocode in Section 3.2 and implemented in the deposited package. The abstract will be revised to name and briefly characterize them so readers can immediately assess their coverage of the eight properties.
revision: yes
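To make "deterministic degradation" concrete, here is a minimal sketch of two of the conditions named in the simulated rebuttal above. The condition names come from that simulated text, not from a verified reading of the paper. What each condition removes, the evidence layout, and the function names are assumptions made for illustration.

```python
import copy

def truncate_trace(evidence: dict, keep: int = 1) -> dict:
    # Assumed behavior: keep only the first `keep` trace events.
    out = copy.deepcopy(evidence)
    out["trace"] = out.get("trace", [])[:keep]
    return out

def omit_policy_log(evidence: dict) -> dict:
    # Assumed behavior: drop the policy log entirely.
    out = copy.deepcopy(evidence)
    out.pop("policy_log", None)
    return out

# Hypothetical registry; the other six conditions are not sketched here.
DEGRADATIONS = {
    "trace_truncation": truncate_trace,
    "policy_log_omission": omit_policy_log,
}

def degrade(evidence: dict, condition: str) -> dict:
    # No randomness is involved: the same input and condition give the same output.
    return DEGRADATIONS[condition](evidence)

evidence = {
    "trace": [{"name": "plan"}, {"name": "send_email"}],
    "policy_log": [{"rule": "allow-email"}],
}
print(degrade(evidence, "trace_truncation"))
print(degrade(evidence, "policy_log_omission"))
```

Each function works on a deep copy, so the original case stays intact and the same case can be passed through every condition in turn.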

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct measurements

The paper describes construction of a benchmark dataset (64 manuscript cases plus eight deterministic degradation conditions), application of adapters and property questions, and direct reporting of measured sufficiency accuracies and overclaim rates for baselines versus the candidate scorer. No equations, fitted parameters, predictions derived from inputs, or self-citations appear in the load-bearing claims. Results are presented as empirical observations on the chosen cases rather than derivations that reduce to the inputs by construction. The representativeness assumption is noted as external but does not create circularity in the reported measurements themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unvalidated assumption that the DEMM model correctly captures decision-level properties and that the 64 cases plus degradation conditions are representative; no free parameters or invented entities are visible in the abstract.

axioms (1)

domain assumption: The Decision Evidence Maturity Model (DEMM) supplies a sound definition of evidence sufficiency for governance questions.
The entire benchmark is grounded in DEMM with no independent justification supplied.

pith-pipeline@v0.9.1-grok · 5724 in / 1230 out tokens · 26233 ms · 2026-06-28T18:49:08.631968+00:00 · methodology
