Recognition: no theorem link
Governed Reasoning for Institutional AI
Pith reviewed 2026-05-10 15:28 UTC · model grok-4.3
The pith
Institutional decisions require a governed AI architecture that mandates human review before execution to prevent silent errors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Cognitive Core is a governed decision substrate built from nine typed cognitive primitives (retrieve, classify, investigate, verify, challenge, reflect, deliberate, govern, generate), a four-tier governance model in which human review is a condition of execution rather than a post-hoc check, a tamper-evident SHA-256 hash-chain audit ledger endogenous to computation, and a demand-driven delegation architecture supporting both declared and autonomously reasoned epistemic sequences. On an 11-case balanced prior authorization appeal evaluation set, Cognitive Core reaches 91% accuracy with zero silent errors, while prompt-based ReAct reaches 55% and Plan-and-Solve 45%, each producing 5-6 silent errors.
What carries the argument
Cognitive Core, the system of nine typed cognitive primitives combined with four-tier governance requiring human review as a precondition for execution and an endogenous tamper-evident hash-chain audit ledger.
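The paper publishes no code here, but the contract this sentence describes, execution blocked until an explicit governance signal arrives, can be sketched minimally. All names and tier semantics below are hypothetical illustrations, not the paper's actual four-tier definitions:

```python
from enum import Enum

class Tier(Enum):
    # Hypothetical tier names; the paper defines its own four-tier semantics.
    AUTONOMOUS = 1   # may execute without review
    NOTIFY = 2       # executes, human is informed
    REVIEW = 3       # human approval is a precondition of execution
    PROHIBITED = 4   # never executes autonomously

class GovernanceError(Exception):
    """Raised when execution is attempted without the required signal."""

def execute(determination: str, tier: Tier, human_approved: bool = False) -> str:
    """Execution is conditional on governance state, not a post-hoc check."""
    if tier is Tier.PROHIBITED:
        raise GovernanceError("determination may not execute autonomously")
    if tier is Tier.REVIEW and not human_approved:
        raise GovernanceError("human review signal required before execution")
    return f"EXECUTED: {determination}"
```

The structural point carried by the argument: at the review tier a silent error is impossible by construction, because the absence of a review signal halts execution rather than letting an incorrect determination pass through.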
If this is right
- New institutional decision domains can be deployed by editing YAML configuration files instead of writing new code.
- Accountability is maintained through a hash-chain ledger that records every reasoning step as part of the computation itself.
- Governability, defined as reliably knowing when to defer to humans, becomes a required evaluation axis alongside accuracy.
- Execution of any determination depends on explicit governance signals, eliminating the possibility of silent incorrect outputs.
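The tamper-evident ledger described in these bullets can be sketched as a minimal SHA-256 hash chain. This is a generic construction under stated assumptions, not the paper's implementation:

```python
import hashlib
import json

class AuditLedger:
    """Append-only ledger: each entry's hash covers the previous entry's
    hash, so any retroactive edit breaks every subsequent link."""

    def __init__(self):
        self.entries = []
        self._prev_hash = "0" * 64  # genesis value

    def record(self, step: dict) -> str:
        # Deterministic serialization so verification is reproducible.
        payload = json.dumps({"prev": self._prev_hash, "step": step},
                             sort_keys=True).encode()
        digest = hashlib.sha256(payload).hexdigest()
        self.entries.append({"prev": self._prev_hash, "step": step, "hash": digest})
        self._prev_hash = digest
        return digest

    def verify(self) -> bool:
        """Recompute the whole chain; any tampered entry fails the check."""
        prev = "0" * 64
        for e in self.entries:
            payload = json.dumps({"prev": prev, "step": e["step"]},
                                 sort_keys=True).encode()
            if e["prev"] != prev or hashlib.sha256(payload).hexdigest() != e["hash"]:
                return False
            prev = e["hash"]
        return True
```

"Endogenous to computation" in the paper's framing means recording happens inside each primitive's execution path, not in a separate logging service that could be skipped; the sketch above only shows the chain itself.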
Where Pith is reading between the lines
- The same primitive-and-governance structure could be tested in adjacent regulated domains such as financial compliance or legal document review.
- The demand-driven delegation mechanism might allow the system to scale to longer reasoning chains without increasing silent-error risk.
- Making the nine primitives the fixed interface could simplify auditing and verification across different institutions.
- The configuration-driven deployment model reduces the engineering barrier for adopting governed AI in smaller organizations.
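The configuration-driven claim implies a domain file with roughly the following shape. The schema below is entirely hypothetical, the paper does not publish its YAML format, though the dispositions are the four named in its appendix:

```yaml
# Hypothetical domain configuration; field names are illustrative,
# not the paper's published schema.
domain: prior_authorization_appeal
dispositions: [OVERTURN, UPHOLD, PARTIAL, REMAND]
governance:
  default_tier: review        # human approval precedes execution
  autonomous_allowed: false
sequence:                     # declared epistemic sequence of typed primitives
  - retrieve
  - classify
  - verify
  - challenge
  - deliberate
  - govern
  - generate
```

If deployment really reduces to editing a file like this, the engineering cost of a new domain is bounded by schema literacy rather than software development capacity, which is the adoption argument the bullet makes.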
Load-bearing premise
That the 11-case evaluation set is representative of real-world institutional decisions, and that prompt-based implementations of ReAct and Plan-and-Solve accurately reflect realistic deployment alternatives to a governed framework.
What would settle it
A larger test on hundreds of actual prior authorization cases in which Cognitive Core produces silent errors at rates comparable to or higher than the ReAct and Plan-and-Solve baselines.
Figures
Original abstract
Institutional decisions -- regulatory compliance, clinical triage, prior authorization appeal -- require a different AI architecture than general-purpose agents provide. Agent frameworks infer authority conversationally, reconstruct accountability from logs, and produce silent errors: incorrect determinations that execute without any human review signal. We propose Cognitive Core: a governed decision substrate built from nine typed cognitive primitives (retrieve, classify, investigate, verify, challenge, reflect, deliberate, govern, generate), a four-tier governance model where human review is a condition of execution rather than a post-hoc check, a tamper-evident SHA-256 hash-chain audit ledger endogenous to computation, and a demand-driven delegation architecture supporting both declared and autonomously reasoned epistemic sequences. We benchmark three systems on an 11-case balanced prior authorization appeal evaluation set. Cognitive Core achieves 91% accuracy against 55% (ReAct) and 45% (Plan-and-Solve). The governance result is more significant: CC produced zero silent errors while both baselines produced 5-6. We introduce governability -- how reliably a system knows when it should not act autonomously -- as a primary evaluation axis for institutional AI alongside accuracy. The baselines are implemented as prompts, representing the realistic deployment alternative to a governed framework. A configuration-driven domain model means deploying a new institutional decision domain requires YAML configuration, not engineering capacity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Cognitive Core as a governed decision substrate for institutional AI tasks such as prior authorization appeals. It is built from nine typed cognitive primitives (retrieve, classify, investigate, verify, challenge, reflect, deliberate, govern, generate), a four-tier governance model that treats human review as a precondition for execution, an endogenous tamper-evident SHA-256 hash-chain audit ledger, and demand-driven delegation supporting declared and reasoned epistemic sequences. The authors benchmark the system against prompt-based ReAct and Plan-and-Solve implementations on an 11-case balanced prior authorization appeal evaluation set, reporting 91% accuracy and zero silent errors for Cognitive Core versus 55% and 45% accuracy with 5-6 silent errors for the baselines. They introduce governability as a primary evaluation axis alongside accuracy and note that new domains can be deployed via YAML configuration rather than custom engineering.
Significance. If the reported performance and governance advantages are substantiated with complete methodological detail, the work could meaningfully advance institutional AI by shifting emphasis from general-purpose agent frameworks to systems that embed accountability, auditability, and explicit non-autonomous behavior as first-class properties. The configuration-driven deployment model and the explicit focus on silent-error prevention are practical strengths that address real deployment barriers in regulated domains.
major comments (1)
- [Abstract and Evaluation] Abstract and Evaluation section: The central claims of 91% accuracy with zero silent errors (versus 55%/45% and 5-6 silent errors for the baselines) rest on an 11-case evaluation set, yet the manuscript supplies no case-selection criteria, protocol for establishing ground truth, inter-annotator agreement statistics, operational definition of silent error (including how it is detected), or statistical significance tests. With n=11, these omissions make it impossible to determine whether the accuracy gap or the governability result is robust, reproducible, or free of selection bias, rendering the primary empirical support for the architecture load-bearing but unsupported.
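The referee's request for an operational definition can be made concrete. One plausible operationalization, an assumption for illustration, not necessarily the paper's definition, is: a silent error is an incorrect determination that executed with no human-review signal. The headline metrics then reduce to a small tally:

```python
def tally(cases):
    """cases: list of dicts with keys 'correct' (bool) and
    'review_signaled' (bool).  A silent error is counted when a
    determination is incorrect AND no review signal was emitted --
    one plausible operationalization, not the paper's stated one."""
    accuracy = sum(c["correct"] for c in cases) / len(cases)
    silent_errors = sum(1 for c in cases
                        if not c["correct"] and not c["review_signaled"])
    return accuracy, silent_errors
```

With n=11 each case moves accuracy by about 9 percentage points, which is why the referee's call for selection criteria and significance testing is load-bearing.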
minor comments (1)
- [Abstract] The abstract refers to a 'balanced' evaluation set without specifying the balance criterion or case distribution; this should be clarified in the evaluation section for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the careful review and for identifying key gaps in the reporting of our evaluation. We address the comment below and will make corresponding revisions to improve transparency.
Point-by-point responses
- Referee: Abstract and Evaluation section: The central claims of 91% accuracy with zero silent errors (versus 55%/45% and 5-6 silent errors for the baselines) rest on an 11-case evaluation set, yet the manuscript supplies no case-selection criteria, protocol for establishing ground truth, inter-annotator agreement statistics, operational definition of silent error (including how it is detected), or statistical significance tests. With n=11, these omissions make it impossible to determine whether the accuracy gap or the governability result is robust, reproducible, or free of selection bias, rendering the primary empirical support for the architecture load-bearing but unsupported.
  Authors: We agree that the current manuscript does not supply adequate methodological detail on the evaluation. In the revised version we will expand the Evaluation section to describe the case-selection criteria used to construct the balanced 11-case set, the protocol followed to establish ground truth, inter-annotator agreement statistics where they were collected, an explicit operational definition of silent error together with the mechanism by which it is detected via the governance tiers and audit ledger, and a discussion of the limitations imposed by the small sample size. We will also clarify that, given n=11, the results are presented as descriptive illustrations of the architecture rather than as statistically powered claims, and we will make the evaluation cases available in supplementary material to support reproducibility and independent assessment of selection bias.
  revision: yes
Circularity Check
No circularity; empirical claims rest on direct measurement rather than self-referential derivation.
Full rationale
The paper describes an architecture (Cognitive Core with nine primitives, four-tier governance, hash-chain ledger) and then reports direct empirical results on an 11-case prior-authorization set: 91% accuracy and zero silent errors versus 55%/45% and 5-6 silent errors for prompt-based ReAct and Plan-and-Solve. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the supplied text. Governability is introduced as an explicit new evaluation axis defined by the presence/absence of human-review signals, not presupposed by the accuracy numbers. The central claims are therefore measurements of the implemented system against baselines, not reductions of outputs to inputs by construction. The small evaluation set raises external-validity concerns but does not create circularity.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Nine typed cognitive primitives are sufficient to model institutional decision processes.
- Domain assumption: Human review can be reliably inserted as a condition of execution via the four-tier model.
invented entities (1)
- Cognitive Core: no independent evidence.
Reference graph
Works this paper leans on
- [1] Weske, M. Business Process Management: Concepts, Languages, Architectures. Springer, 3rd ed., 2019.
- [2] European Commission. Proposal for a Regulation on Artificial Intelligence (AI Act). EUR-Lex, 2021.
- [3] Doshi-Velez, F. and Kim, B. Towards a Rigorous Science of Interpretable Machine Learning. arXiv:1702.08608, 2017.
- [4] LangChain AI. LangGraph: Build stateful, multi-actor applications with LLMs. https://github.com/langchain-ai/langgraph, 2024.
- [5] Joint Commission. Provision of Care, Treatment, and Services Standards. The Joint Commission, 2024.
- [6] Klein, G. Sources of Power: How People Make Decisions. MIT Press, 1998.
- [7] March, J.G. and Olsen, J.P. Rediscovering Institutions: The Organizational Basis of Politics. Free Press, 1989.
- [8] Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, 2019.
- [9]
- [10] Simon, H.A. Administrative Behavior. Macmillan, 4th ed., 1997.
- [11] Simon, H.A. The Sciences of the Artificial. MIT Press, 3rd ed., 1996.
- [12] Temporal Technologies. Temporal: Build invincible apps. https://temporal.io, 2024.
- [13] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR, 2023.
- [14] Zeigler, B.P., Kim, T.G., and Praehofer, H. Theory of Modeling and Simulation. Academic Press, 2nd ed., 2000.
- [15] Van Tendeloo, Y. and Vangheluwe, H. The Modelling and Simulation of DEVS Models in PythonPDEVS. Simulation: Transactions of the Society for Modeling and Simulation International, 2014.
- [16] Seck, M. Cognitive Core: Governed Reasoning for Institutional AI. https://github.com/dioufseck-rgb/cognitive-core, 2026.
- [17] Wang, L., Xu, W., Lan, Y., Hu, Z., Lan, Y., Lee, R.K.W., and Lim, E.P. Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models. In Proceedings of the 61st Annual Meeting of the ACL, pp. 2609–2634, Toronto, Canada, 2023.
- [18] Gu, L., Zhu, Y., Sang, H., Wang, Z., Sui, D., Tang, W., Harrison, E., Gao, J., Yu, L., and Ma, L. MedAgentAudit: Diagnosing and Quantifying Collaborative Failure Modes in Medical Multi-Agent Systems. arXiv:2510.10185, 2025.
- [19] Gent, E. AI's Wrong Answers Are Bad. Its Wrong Reasoning Is Worse. IEEE Spectrum, December 2025. https://spectrum.ieee.org/ai-reasoning-failures
discussion (0)