Pith · machine review for the scientific record

arxiv: 2605.04093 · v1 · submitted 2026-04-29 · 💻 cs.CY

Recognition: unknown

Decision Evidence Maturity Model for Agentic AI: A Property-Level Method Specification


Pith reviewed 2026-05-09 20:36 UTC · model grok-4.3

classification 💻 cs.CY
keywords agentic AI · decision evidence · maturity model · evidence sufficiency · auditability · property-level reconstruction · governance questions

The pith

DEMM classifies agentic AI decision evidence into four sufficiency categories plus conflicts and aggregates them into a five-level capability rubric.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Agentic AI systems produce large volumes of execution telemetry and traces, but this evidence frequently cannot answer a specific external question about why a particular decision was made. The paper names this error, treating the presence of data containers as proof of audit readiness, the container fallacy. To address it, the Decision Evidence Maturity Model specifies four executable categories for evidence sufficiency at the level of individual decision properties, plus a protocol-level conflicting category. These per-property verdicts roll up into a five-level overall capability rubric drawn from established maturity-model practice. An open-source reconstructor tool with adapter classes for SDKs, traces, and records demonstrates the method on synthetic scenarios and public incidents, yielding completeness from 53.6% to 100%, a range the paper frames as implementation behaviour rather than external validation.

Core claim

The paper specifies the Decision Evidence Maturity Model (DEMM), a property-level reconstructability method for agentic decisions. DEMM classifies evidence sufficiency into four executable categories plus a protocol-level 'conflicting' category and aggregates per-property verdicts into a five-level capability rubric anchored to the established maturity-model lineage. The open-source Decision Trace Reconstructor ships ten executable adapter-fallback classes spanning vendor SDKs, protocol traces, public-postmortem prose, and generic JSONL records, with a reproducible feasibility exercise on 140 synthetic scenarios plus three public incidents.
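The per-property classification and five-level roll-up can be sketched in a few lines. This is a hypothetical reading of the mechanism, not the paper's implementation: the category labels, evidence fields, and level thresholds below are invented for illustration, since the review does not reproduce DEMM's exact definitions.

```python
from collections import Counter

# Hypothetical sufficiency categories; DEMM's actual labels may differ.
# "conflicting" sits outside the four, as a protocol-level state.
CATEGORIES = ["absent", "partial", "reconstructable", "verified"]

def classify_property(evidence: dict, prop: str) -> str:
    """Toy per-property verdict: how far the evidence for one decision
    property goes toward answering a governance question about it."""
    record = evidence.get(prop)
    if record is None:
        return "absent"
    if record.get("conflicts"):       # protocol-level 'conflicting' state
        return "conflicting"
    if record.get("attested"):
        return "verified"
    return "reconstructable" if record.get("complete") else "partial"

def maturity_level(verdicts: list[str]) -> int:
    """Toy aggregation into a five-level rubric; thresholds are invented."""
    counts = Counter(verdicts)
    n = len(verdicts)
    if counts["conflicting"]:
        return 1                      # conflicts cap the overall rating
    if counts["verified"] == n:
        return 5
    if counts["verified"] + counts["reconstructable"] == n:
        return 4
    if counts["absent"] == 0:
        return 3
    return 2 if counts["absent"] < n else 1
```

The design point the sketch illustrates is that the unit of assessment is the individual property, not the log as a whole: one conflicting or absent property degrades the rubric even when the rest of the telemetry is rich.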

What carries the argument

The Decision Evidence Maturity Model (DEMM), which applies four executable sufficiency categories to each decision property and rolls the verdicts into a five-level capability rubric.

If this is right

  • Evidence sufficiency becomes measurable at the individual property level rather than only at the level of entire logs or containers.
  • Per-property verdicts can be aggregated into a standardized five-level rubric that reports overall decision-process auditability.
  • External parties can assess reconstructability using only existing telemetry and traces, without model internals.
  • The open-source adapter set enables the method to run across vendor SDKs, protocol traces, and public records with measurable completeness.
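The adapter set described above suggests a fallback-chain shape: try the richest evidence source first, then degrade to more generic ones. A minimal sketch, assuming nothing about the real Decision Trace Reconstructor's API; the class names and input fields here are hypothetical.

```python
class Adapter:
    """Base adapter: extract decision evidence from one kind of source."""
    def extract(self, raw):
        raise NotImplementedError

class SDKAdapter(Adapter):
    def extract(self, raw):
        return raw.get("sdk_events")    # vendor SDK telemetry, if present

class TraceAdapter(Adapter):
    def extract(self, raw):
        return raw.get("trace_spans")   # protocol traces

class JSONLAdapter(Adapter):
    def extract(self, raw):
        return raw.get("records")       # generic JSONL records as last resort

def reconstruct(raw, chain=(SDKAdapter(), TraceAdapter(), JSONLAdapter())):
    """Walk the fallback chain; the first adapter that yields evidence wins."""
    for adapter in chain:
        result = adapter.extract(raw)
        if result is not None:
            return type(adapter).__name__, result
    return None, None
```

A chain like this is what makes the method runnable on heterogeneous inputs: the sufficiency verdicts stay the same regardless of which adapter supplied the evidence.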

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Regulators could reference the rubric when setting minimum evidence requirements for deployed agentic systems.
  • Developers could instrument their agents to target higher maturity categories by default during design.
  • The same property-level approach might transfer to auditing non-agentic AI once decision properties are defined for those systems.

Load-bearing premise

Available execution telemetry, protocol traces, and public records contain sufficient raw material to reconstruct decision properties at the level required by the four sufficiency categories without needing internal model state or additional instrumentation.

What would settle it

A set of real agentic AI deployments in which external reviewers apply the four categories to answer concrete governance questions and find that the resulting sufficiency ratings do not predict whether the evidence actually supports those answers.
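A study like that reduces to a simple scoring question: across deployments, do the sufficiency verdicts agree with whether reviewers could actually answer the governance question? A minimal sketch of that scoring rule, with an invented function name and made-up case data:

```python
def verdicts_predict_answers(cases):
    """Fraction of cases where the sufficiency verdict (evidence judged
    sufficient?) matches the outcome (question actually answerable?).
    Each case is a (sufficient, answerable) pair of booleans."""
    hits = sum(1 for sufficient, answerable in cases if sufficient == answerable)
    return hits / len(cases)
```

An agreement rate near chance would be the refuting result the section describes; a high rate would support treating the categories as predictive of real answerability.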

read the original abstract

Agentic AI systems produce decision evidence at scale through execution telemetry, but property-level reconstruction often fails when an external party asks a specific governance question about a specific decision: the assembled evidence is insufficient to answer it. We name this pattern the container fallacy: the automatic equation of evidence-container presence with audit sufficiency. This paper specifies the Decision Evidence Maturity Model (DEMM), a property-level reconstructability method for agentic decisions. DEMM classifies evidence sufficiency into four executable categories plus a protocol-level "conflicting" category and aggregates per-property verdicts into a five-level capability rubric anchored to the established maturity-model lineage. The open-source Decision Trace Reconstructor ships ten executable adapter-fallback classes spanning vendor SDKs, protocol traces, public-postmortem prose, and generic JSONL records. A reproducible feasibility exercise runs the protocol on 140 synthetic scenarios plus three public incidents; the resulting completeness range (53.6% to 100%) is implementation behaviour, not external validation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript specifies the Decision Evidence Maturity Model (DEMM) for agentic AI, which classifies decision evidence sufficiency into four executable categories (plus a 'conflicting' category at protocol level) and aggregates per-property verdicts into a five-level capability rubric. It introduces the 'container fallacy' and provides an open-source Decision Trace Reconstructor with ten adapter-fallback classes for various data sources. A feasibility exercise on 140 synthetic scenarios and three public incidents reports completeness ranging from 53.6% to 100%, presented as implementation behavior rather than external validation.

Significance. If the DEMM categories prove reconstructable from real-world telemetry, the model would offer a valuable, property-level framework for assessing audit sufficiency in agentic systems, extending maturity model approaches to AI governance. The provision of executable adapters and reproducible code is a notable strength that supports practical adoption and further testing. The work addresses a timely issue in AI accountability by distinguishing evidence presence from sufficiency.

major comments (1)
  1. [feasibility exercise section] The evaluation relies exclusively on 140 synthetic scenarios generated to exercise the protocol and three selected public post-mortems. This setup does not test whether typical production agentic execution telemetry and protocol traces contain sufficient raw material to reconstruct decision properties into the four sufficiency categories without internal model state or additional instrumentation. As a result, the reported 53.6–100% completeness range reflects implementation coverage on favorable inputs rather than demonstrating the core reconstructability assumption for real governance applications.
minor comments (1)
  1. [abstract] The abstract appropriately notes that the completeness range 'is implementation behaviour, not external validation,' but the main text would benefit from an explicit limitations subsection reiterating the scope of the synthetic and curated data.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comments and for recognizing the timeliness of the work on agentic AI accountability. We address the major comment below and outline targeted revisions to improve clarity without altering the manuscript's scope as a method specification.

read point-by-point responses
  1. Referee: [feasibility exercise section] The evaluation relies exclusively on 140 synthetic scenarios generated to exercise the protocol and three selected public post-mortems. This setup does not test whether typical production agentic execution telemetry and protocol traces contain sufficient raw material to reconstruct decision properties into the four sufficiency categories without internal model state or additional instrumentation. As a result, the reported 53.6–100% completeness range reflects implementation coverage on favorable inputs rather than demonstrating the core reconstructability assumption for real governance applications.

    Authors: We agree that the feasibility exercise uses synthetic scenarios and selected public post-mortems to demonstrate protocol executability rather than testing reconstructability from typical production telemetry. The manuscript already states that the completeness range 'is implementation behaviour, not external validation' precisely to avoid implying broader empirical claims. The primary contribution is the DEMM specification, the four sufficiency categories, the conflicting category, the five-level rubric, and the open-source Decision Trace Reconstructor with its ten adapter-fallback classes. The exercise confirms that the categories can be applied programmatically to the provided input types. We will revise the feasibility exercise section to add explicit discussion of the assumptions required for real-world governance use, including the potential need for sufficient telemetry or instrumentation in production agentic systems, and to frame the reported range as coverage on the tested inputs. This will better delineate the exercise's purpose while preserving the paper's focus.
    revision: partial

Circularity Check

0 steps flagged

No significant circularity; DEMM is a definitional specification with feasibility explicitly non-validating.

full rationale

The paper is a method specification that defines the four sufficiency categories and five-level rubric as part of the DEMM itself, then applies those same definitions to produce per-property verdicts before aggregation. This is definitional rather than a derivation that reduces an independent claim to its inputs. The feasibility exercise on 140 synthetic scenarios plus three public incidents is explicitly labeled 'implementation behaviour, not external validation' in the abstract, so the reported completeness range does not function as a prediction or test of the reconstructability assumption. No self-citation, uniqueness theorem, or ansatz is invoked as load-bearing; the framework is anchored to the general maturity-model lineage without circular reduction. The derivation chain is therefore self-contained as a proposed classification protocol.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The model rests on the domain assumption that decision properties are reconstructable from telemetry and traces, introduces the container fallacy and the DEMM categories as new constructs, and performs no parameter fitting.

axioms (1)
  • domain assumption Agentic AI execution telemetry and traces contain properties that can be classified by sufficiency for answering specific governance questions.
    This assumption underpins the entire four-category classification and is stated as the motivation for the model.
invented entities (2)
  • Container fallacy no independent evidence
    purpose: Names the automatic equation of evidence-container presence with audit sufficiency.
    New term coined in the paper to describe the core problem.
  • Decision Evidence Maturity Model (DEMM) categories and rubric no independent evidence
    purpose: Provides the four executable sufficiency categories, conflicting state, and five-level aggregation.
    New specification introduced by the authors.

pith-pipeline@v0.9.0 · 5461 in / 1406 out tokens · 31527 ms · 2026-05-09T20:36:32.687952+00:00 · methodology

discussion (0)

