pith. machine review for the scientific record.

arxiv: 2604.16870 · v1 · submitted 2026-04-18 · 💻 cs.CR · cs.AI · cs.OS

Recognition: unknown

Governed MCP: Kernel-Level Tool Governance for AI Agents via Logit-Based Safety Primitives

Daeyeon Son

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:51 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.OS

keywords AI agents · tool governance · kernel security · Model Context Protocol · safety primitives · WASM runtime · logit-based detection · complete mediation
0 comments

The pith

Kernel-resident gateway for AI agent tool calls makes 10-line userspace bypasses structurally impossible.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that safety enforcement for AI agents calling external tools must move from userspace libraries into the operating system kernel. Current guardrails can be defeated by short scripts, but a kernel-level gateway can interpose on every tool call and route all host functions through its checks. The system uses a six-layer pipeline that includes schema validation, rate limits, and a semantic safety decision based on model logits. An ablation on a 101-prompt benchmark shows the semantic layer is essential, lifting F1 from 0.327 to 0.773. Implementation in a custom bare-metal OS demonstrates that the non-semantic layers add only tens of microseconds while delivering complete mediation of the WASM ABI surface.

Core claim

Governed MCP places a kernel-resident gateway in the path of every MCP tool call from AI agents. The gateway runs a pipeline of schema validation, trust tier check, rate limit, adversarial pre-filter, ProbeLogits semantic gate, and constitutional policy match, with a hashed audit trail. Because all fifteen WASM-to-system host functions are forced through this gateway, any userspace bypass that works against existing guardrail libraries becomes structurally impossible.

What carries the argument

The Governed MCP gateway, a kernel-resident 6-layer pipeline that performs complete mediation on MCP tool calls, with the ProbeLogits semantic gate as the load-bearing step that makes the safety decision from model logits.
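The six layers named above form an ordered, fail-closed chain: the first layer that rejects a call ends the pipeline. A minimal sketch of that control flow, with layer names taken from the paper but all check logic below invented for illustration (this is not the Anima OS implementation):

```python
# Sketch of the paper's 6-layer gateway pipeline. Layer names come from the
# abstract; every check body here is a hypothetical stand-in.

def govern_tool_call(call, checks):
    """Run a tool call through ordered (name, check) layers.

    Each check returns True (pass) or False (deny); the first failing
    layer short-circuits the pipeline, mirroring fail-closed mediation.
    """
    for name, check in checks:
        if not check(call):
            return ("deny", name)
    return ("allow", None)

# Hypothetical stand-ins for the six layers described in the paper.
LAYERS = [
    ("schema_validation",     lambda c: isinstance(c.get("args"), dict)),
    ("trust_tier",            lambda c: c.get("tier", 0) >= 1),
    ("rate_limit",            lambda c: c.get("calls_this_minute", 0) < 60),
    ("adversarial_prefilter", lambda c: "rm -rf" not in str(c.get("args"))),
    ("probelogits_gate",      lambda c: c.get("logit_safety_score", 0.0) > 0.5),
    ("policy_match",          lambda c: c.get("tool") in {"read_file", "fetch_url"}),
]

call = {"tool": "read_file", "args": {"path": "/tmp/a"}, "tier": 2,
        "calls_this_minute": 3, "logit_safety_score": 0.9}
print(govern_tool_call(call, LAYERS))  # ('allow', None)
```

The ordering matters for the paper's latency claim: the five cheap layers run first, so a call rejected by a hand rule never pays the semantic gate's inference cost.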

If this is right

  • All fifteen WASM-to-system host functions are forced through the gateway, eliminating the possibility of userspace bypass.
  • Removing the semantic ProbeLogits layer drops F1 score from 0.773 to 0.327 on the 101-prompt benchmark, showing hand-rule methods alone are insufficient.
  • The five non-inference layers add 65.3 microseconds of overhead per call on the custom OS.
  • The ProbeLogits semantic decision adds 65 milliseconds per token-class on a 7B Q4_0 model.
  • A Blake3-hashed audit chain records every mediated call for later inspection.
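The audit chain in the last bullet is a standard hash chain: each record's digest covers the previous digest, so tampering with any entry invalidates every later one. A sketch of that property, using SHA-256 from the standard library as a stand-in for Blake3 (which is not in Python's stdlib), with illustrative record fields:

```python
import hashlib
import json

# Hash-chained audit trail, modeled on the paper's Blake3 chain.
# SHA-256 stands in for Blake3; record fields are illustrative.

def append_record(chain, record):
    """Append a record whose hash covers the previous record's hash."""
    prev = chain[-1]["hash"] if chain else "0" * 64
    payload = json.dumps(record, sort_keys=True)
    h = hashlib.sha256((prev + payload).encode()).hexdigest()
    chain.append({"record": record, "prev": prev, "hash": h})

def verify_chain(chain):
    """Recompute every link; any edited entry breaks verification."""
    prev = "0" * 64
    for entry in chain:
        payload = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True

chain = []
append_record(chain, {"tool": "read_file", "verdict": "allow"})
append_record(chain, {"tool": "fetch_url", "verdict": "deny"})
assert verify_chain(chain)
chain[0]["record"]["verdict"] = "deny"  # tamper with an earlier entry
assert not verify_chain(chain)          # every later hash now fails
```

The same structure, kept kernel-resident, is what lets the paper claim the trail records every mediated call without userspace being able to rewrite history.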

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar kernel-level mediation could be applied to other agent tool-calling interfaces beyond MCP.
  • OS designers may need to treat AI agent tool calls as a new class of privileged operation equivalent to syscalls.
  • The approach suggests that semantic safety checks can be made part of the OS rather than left to application-level libraries.
  • Scaling the gateway to production workloads would require measuring its effect on overall agent latency and throughput.

Load-bearing premise

The ProbeLogits gate can correctly identify unsafe or unintended tool calls across realistic adversarial and benign cases.

What would settle it

Finding either a working short userspace script that evades the kernel gateway or a set of prompts on which ProbeLogits produces incorrect safety decisions while the paper claims reliable detection.

Figures

Figures reproduced from arXiv: 2604.16870 by Daeyeon Son.

Figure 1. Trust-boundary placement of the gateway.
Original abstract

AI agents increasingly call external tools (file system, network, APIs) through the Model Context Protocol (MCP). These tool calls are the agent's syscalls -- privileged operations with side effects on shared state -- yet today's safety enforcement lives entirely in userspace, where a 10-line script can bypass it. I propose Governed MCP, a kernel-resident tool governance gateway built on a logit-based safety primitive (ProbeLogits, companion paper: arXiv:2604.11943). The gateway interposes on every MCP tool call in a 6-layer pipeline: schema validation, trust tier check, rate limit, adversarial pre-filter, ProbeLogits gate (the load-bearing semantic check), and constitutional policy match, with a Blake3-hashed audit chain. I implement Governed MCP in Anima OS, a bare-metal x86_64 OS in approximately 86,000 lines of Rust. The five non-inference layers add 65.3 microseconds of overhead per call; ProbeLogits adds 65 ms (per-token-class semantic decision) on 7B Q4_0. A 4-config ablation on a 101-prompt MCP-domain benchmark shows that removing the ProbeLogits layer collapses F1 from 0.773 to 0.327 (Delta F1 = -0.446) -- hand-rule firewalling alone is insufficient. All 15 WASM-to-system host functions in the runtime route through the gateway (complete mediation of the WASM ABI surface; the scope and caveats of this claim are stated in Section 4.6); a 10-LoC userspace bypass that defeats existing guardrail libraries is structurally impossible against the kernel-resident gate.
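The abstract's latency numbers imply that the semantic gate dominates the per-call budget by roughly three orders of magnitude. A back-of-envelope check (the per-call token-class count is not stated in the abstract, so one class is assumed here):

```python
# Latency budget from the abstract: 65.3 us for the five non-inference
# layers, 65 ms per token-class for ProbeLogits on 7B Q4_0.
non_inference_us = 65.3
probelogits_ms_per_class = 65.0
token_classes = 1  # assumption; the abstract gives no per-call count

total_ms = non_inference_us / 1000 + probelogits_ms_per_class * token_classes
ratio = probelogits_ms_per_class * 1000 / non_inference_us

print(round(total_ms, 2))  # 65.07
print(round(ratio))        # 995
```

So nearly all of the per-call cost is the single semantic decision; the kernel-residency overhead itself is in the noise.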

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript presents Governed MCP, a kernel-resident tool governance gateway for AI agents using the Model Context Protocol (MCP). Implemented in Anima OS (an ~86,000-line bare-metal x86_64 Rust OS), it interposes a 6-layer pipeline (schema validation, trust tier check, rate limit, adversarial pre-filter, ProbeLogits semantic gate from companion paper arXiv:2604.11943, and constitutional policy match) with a Blake3-hashed audit chain on MCP tool calls. Key results include 65.3 μs overhead for the five non-inference layers, 65 ms added by ProbeLogits (per-token-class decision on 7B Q4_0), an ablation on a 101-prompt benchmark showing F1 collapse from 0.773 to 0.327 without ProbeLogits, and the claim that all 15 WASM-to-system host functions route through the gateway, rendering 10-LoC userspace bypasses structurally impossible.

Significance. If the complete mediation of the WASM ABI surface holds and the ProbeLogits primitive reliably identifies unsafe tool calls, the work provides a concrete advance in AI agent safety by relocating enforcement from easily bypassed userspace guardrails to a kernel-resident gate, backed by an implemented system, explicit performance numbers, and ablation evidence that rule-based layers alone are insufficient.

major comments (3)
  1. [Section 4.6] The claim that 'all 15 WASM-to-system host functions in the runtime route through the gateway' with complete mediation of the WASM ABI surface (making 10-LoC userspace bypasses structurally impossible) lacks an explicit enumeration of the 15 functions, a formal argument that these exhaust every MCP tool invocation path, and analysis of indirect mechanisms such as alternative ABI encodings or direct kernel interfaces outside the WASM host surface. This assumption in the 86k-line Anima OS is load-bearing for the security guarantee.
  2. [Abstract] Abstract and ablation section: The 101-prompt MCP-domain benchmark reports F1 of 0.773 (full pipeline) vs. 0.327 (without ProbeLogits), but provides no details on benchmark construction, prompt diversity, adversarial coverage, error bars, or statistical significance. This weakens the evidence for the ProbeLogits layer's contribution given the small sample size.
  3. [Abstract] Abstract and Section 4 (ProbeLogits gate): The load-bearing semantic decision reduces to the ProbeLogits primitive from companion paper arXiv:2604.11943. The ablation demonstrates necessity within this work but does not independently validate the primitive's reliability across realistic adversarial and benign scenarios here.
minor comments (1)
  1. [Abstract] The parenthetical note that 'the scope and caveats of this claim are stated in Section 4.6' would benefit from a one-sentence summary of those caveats, so readers can assess the security claim without immediately consulting the section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback on our manuscript. We address each of the major comments below, providing clarifications and indicating where revisions will be made to strengthen the paper.

Point-by-point responses
  1. Referee: [Section 4.6] The claim that 'all 15 WASM-to-system host functions in the runtime route through the gateway' with complete mediation of the WASM ABI surface (making 10-LoC userspace bypasses structurally impossible) lacks an explicit enumeration of the 15 functions, a formal argument that these exhaust every MCP tool invocation path, and analysis of indirect mechanisms such as alternative ABI encodings or direct kernel interfaces outside the WASM host surface. This assumption in the 86k-line Anima OS is load-bearing for the security guarantee.

    Authors: We agree that providing an explicit enumeration and a more formal argument would enhance the clarity and verifiability of our security claims. In the revised version, we will add a table in Section 4.6 enumerating all 15 WASM-to-system host functions, along with a structured argument demonstrating that these functions constitute the complete set of MCP tool invocation paths within the Anima OS WASM runtime. We will also address potential indirect mechanisms by explaining the isolation properties of the kernel-resident gateway and why alternative ABI encodings or direct kernel interfaces are not exposed to userspace code. This revision will make the complete mediation claim more robust without altering the core results. revision: yes
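The single-dispatch structure this response describes (every side-effecting host function reachable only through the gateway) can be sketched as follows. The function names, the table contents, and the check logic are hypothetical stand-ins; the paper's actual table of 15 host functions is not reproduced here:

```python
# Complete-mediation sketch: every host function the runtime exposes is a
# row in one table, and the only path to a side effect is gateway_dispatch.
# Names and logic are hypothetical illustrations, not the Anima OS ABI.

HOST_FUNCTIONS = {
    "fs_read":   lambda args: f"read {args['path']}",
    "fs_write":  lambda args: f"wrote {args['path']}",
    "net_fetch": lambda args: f"fetched {args['url']}",
    # ... a real runtime would enumerate its full host-function set here
}

def gateway_check(name, args):
    # Stand-in for the 6-layer pipeline; here just a denylist example.
    return name != "fs_write"

def gateway_dispatch(name, args):
    """Sole entry point: unknown functions are errors, known ones are checked."""
    if name not in HOST_FUNCTIONS:
        raise KeyError(f"unknown host function: {name}")
    if not gateway_check(name, args):
        return ("deny", name)
    return ("allow", HOST_FUNCTIONS[name](args))

print(gateway_dispatch("fs_read", {"path": "/tmp/a"}))   # ('allow', 'read /tmp/a')
print(gateway_dispatch("fs_write", {"path": "/etc/x"}))  # ('deny', 'fs_write')
```

The referee's point maps directly onto this sketch: the security argument needs evidence that the table is exhaustive, i.e. that no side-effecting operation exists outside `HOST_FUNCTIONS`.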

  2. Referee: [Abstract] Abstract and ablation section: The 101-prompt MCP-domain benchmark reports F1 of 0.773 (full pipeline) vs. 0.327 (without ProbeLogits), but provides no details on benchmark construction, prompt diversity, adversarial coverage, error bars, or statistical significance. This weakens the evidence for the ProbeLogits layer's contribution given the small sample size.

    Authors: We acknowledge that additional details on the benchmark would improve the manuscript. The 101-prompt set was curated to include a balanced mix of benign tool calls and adversarial attempts targeting common MCP operations such as file access and network requests. We will expand the ablation section to describe the benchmark construction process, note the diversity (covering multiple tool types and prompt variations), and clarify that the evaluation is deterministic, hence no error bars or statistical tests were applied. While the sample size is modest, the large F1 delta (0.446) provides clear evidence of the layer's contribution; we will add a note on this limitation. revision: yes

  3. Referee: [Abstract] Abstract and Section 4 (ProbeLogits gate): The load-bearing semantic decision reduces to the ProbeLogits primitive from companion paper arXiv:2604.11943. The ablation demonstrates necessity within this work but does not independently validate the primitive's reliability across realistic adversarial and benign scenarios here.

    Authors: The ProbeLogits primitive's reliability is established in the companion paper (arXiv:2604.11943), which includes evaluations on diverse datasets. This manuscript's contribution is the integration into the 6-layer pipeline and the ablation study demonstrating its necessity for high F1 performance. We do not perform new independent validation here, as that would duplicate the companion work. In revision, we will include a concise summary of the key validation results from the companion paper to make the reliance explicit and self-contained for readers. revision: partial

Circularity Check

1 step flagged

Core semantic safety gate reduces to self-cited companion primitive

specific steps
  1. self-citation, load-bearing [Abstract]
    "I propose Governed MCP, a kernel-resident tool governance gateway built on a logit-based safety primitive (ProbeLogits, companion paper: arXiv:2604.11943). ... ProbeLogits gate (the load-bearing semantic check), ... All 15 WASM-to-system host functions in the runtime route through the gateway (complete mediation of the WASM ABI surface; the scope and caveats of this claim are stated in Section 4.6)"

    The load-bearing semantic decision is the ProbeLogits gate, which is defined and justified only in the companion paper by the same author. The present paper's claims about structural impossibility of bypasses and the 6-layer pipeline's effectiveness therefore reduce to the correctness of that external primitive; the ablation merely demonstrates necessity on this paper's benchmark rather than deriving the gate.

full rationale

The paper's central guarantee—that kernel residency makes 10-LoC userspace bypasses structurally impossible—rests on complete mediation plus the load-bearing ProbeLogits semantic check. The latter is imported wholesale from the companion paper by the same author and is only shown to be necessary via an ablation on a 101-prompt benchmark; no independent derivation or validation of the primitive occurs inside this work. The mediation claim itself is asserted as an implementation fact (Section 4.6) without exhibited enumeration or formal exhaustion argument, but does not constitute a definitional reduction. This produces partial circularity: the safety result is not self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests primarily on the correctness of the ProbeLogits primitive from the companion paper and on the assumption that the custom OS implementation correctly interposes on all relevant paths; no new physical entities or large numbers of fitted parameters are introduced in the abstract.

axioms (1)
  • domain assumption: ProbeLogits provides reliable semantic safety classification for MCP tool calls
    This is the load-bearing layer whose removal collapses benchmark F1; details reside in the companion paper.

pith-pipeline@v0.9.0 · 5612 in / 1470 out tokens · 66619 ms · 2026-05-10T06:51:26.482129+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

24 extracted references · 5 canonical work pages · 3 internal anchors

  1. [1] Anthropic, "Model Context Protocol Specification," 2024. https://modelcontextprotocol.io
  2. [2] A. Rossberg (ed.), "WebAssembly Core Specification," W3C Recommendation, 2019/2024. https://www.w3.org/TR/wasm-core/
  3. [3] JSON-RPC Working Group, "JSON-RPC 2.0 Specification," 2013. https://www.jsonrpc.org/specification
  4. [4] OpenAI, "Function Calling and Tool Use Documentation," 2023–2024. https://platform.openai.com/docs/guides/function-calling
  5. [5] H. Chase, "LangChain: Building Applications with LLMs through Composability," GitHub repository, 2022.
  6. [6] D. Son, "ProbeLogits: Kernel-Level LLM Inference Primitives for AI-Native Operating Systems," arXiv:2604.11943, 2026.
  7. [7] T. Rebedea, R. Dinu, M. Sreedhar, et al., "NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails," arXiv:2310.10501, NVIDIA, 2023.
  8. [8] Microsoft, "Agent Governance Toolkit (AGT) for LLM Tool Calls," Microsoft Research preview, 2026.
  9. [9] T. B. Richards (Significant-Gravitas), "AutoGPT: An Autonomous GPT-4 Experiment," GitHub repository, 2023.
  10. [10] H. Inan et al., "Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations," arXiv:2312.06674, 2023.
  11. [11] Meta, "Llama Guard 3-8B Model Card," https://github.com/meta-llama/PurpleLlama, 2024.
  12. [12] S. Han et al., "WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs," arXiv:2406.18495, 2024.
  13. [13] "Guillotine: Hypervisor-Based Isolation for Adversarial AI Agents," HotOS 2025. Affiliation: Harvard/Princeton.
  14. [14] K. Mei et al., "AIOS: LLM Agent Operating System," COLM 2025.
  15. [15] J. P. Anderson, "Computer Security Technology Planning Study," Tech. Rep. ESD-TR-73-51, 1972.
  16. [16] K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, "Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection," AISec Workshop @ CCS 2023.
  17. [17] J. H. Saltzer and M. D. Schroeder, "The Protection of Information in Computer Systems," Proceedings of the IEEE, 63(9):1278–1308, 1975.
  18. [18] R. Spencer, S. Smalley, et al., "The Flask Security Architecture," USENIX Security 1999.
  19. [19] R. N. M. Watson et al., "Capsicum: Practical Capabilities for UNIX," USENIX Security 2010.
  20. [20] B. Willard and R. Louf, "Efficient Guided Generation for Large Language Models," arXiv:2307.09702, 2023.
  21. [21] Microsoft, "llguidance: Fast Constrained Decoding Library," GitHub repository, 2024.
  22. [22] M. Mazeika et al., "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal," ICML 2024.
  23. [23] P. Röttger et al., "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models," NAACL 2024.
  24. [24] Z. Lin et al., "ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversation," Findings of EMNLP 2023.