Governed MCP: Kernel-Level Tool Governance for AI Agents via Logit-Based Safety Primitives
Pith reviewed 2026-05-10 06:51 UTC · model grok-4.3
The pith
A kernel-resident gateway for AI agent tool calls makes 10-line userspace bypasses structurally impossible.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Governed MCP places a kernel-resident gateway in the path of every MCP tool call from AI agents, running a pipeline of schema validation, trust tier check, rate limit, adversarial pre-filter, ProbeLogits semantic gate, and constitutional policy match, with a hashed audit trail; because all fifteen WASM-to-system host functions are forced through this gateway, any userspace bypass that works against existing guardrail libraries becomes structurally impossible.
What carries the argument
The Governed MCP gateway, a kernel-resident 6-layer pipeline that performs complete mediation on MCP tool calls, with the ProbeLogits semantic gate as the load-bearing step that makes the safety decision from model logits.
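The ordering of the six layers can be sketched as a short-circuiting chain. This is a minimal illustration in Python, not the paper's Rust implementation: the `ToolCall` shape, the tier names, the tool names, the blocked phrase, and the hard-coded logit values are all hypothetical, and the `probe_logits_gate` stub only mimics the idea of comparing two class logits rather than reading them from a model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class ToolCall:
    """Hypothetical shape of an MCP tool call as the gateway sees it."""
    tier: str
    tool: str
    args: Dict[str, str] = field(default_factory=dict)


# A layer returns None to pass the call on, or a reason string to deny it.
Layer = Callable[[ToolCall], Optional[str]]

_calls_seen: Dict[str, int] = {}  # per-tier counter for the toy rate limit


def schema_validation(call: ToolCall) -> Optional[str]:
    ok = bool(call.tool) and all(isinstance(v, str) for v in call.args.values())
    return None if ok else "malformed call"


def trust_tier_check(call: ToolCall) -> Optional[str]:
    # Illustrative tiers only; the paper's tier model is not described here.
    allowed = {"trusted": {"fs.read", "fs.write"}, "sandboxed": {"fs.read"}}
    return None if call.tool in allowed.get(call.tier, set()) else "tool not permitted"


def rate_limit(call: ToolCall, max_calls: int = 100) -> Optional[str]:
    _calls_seen[call.tier] = _calls_seen.get(call.tier, 0) + 1
    return None if _calls_seen[call.tier] <= max_calls else "rate limit exceeded"


def adversarial_prefilter(call: ToolCall) -> Optional[str]:
    text = " ".join(call.args.values()).lower()
    return "known injection phrase" if "ignore previous instructions" in text else None


def probe_logits_gate(call: ToolCall) -> Optional[str]:
    # Stand-in only: a fixed rule supplies the two "logits" that a real
    # gate would obtain from a model forward pass.
    risky = "/etc/shadow" in call.args.get("path", "")
    unsafe_logit, safe_logit = (2.0, 0.5) if risky else (0.1, 3.0)
    return "classified unsafe" if unsafe_logit > safe_logit else None


def constitutional_policy(call: ToolCall) -> Optional[str]:
    path = call.args.get("path", "")
    if call.tool == "fs.write" and not path.startswith("/tmp/"):
        return "writes outside /tmp/ are forbidden"
    return None


PIPELINE: List[Tuple[str, Layer]] = [
    ("schema_validation", schema_validation),
    ("trust_tier_check", trust_tier_check),
    ("rate_limit", rate_limit),
    ("adversarial_prefilter", adversarial_prefilter),
    ("probe_logits_gate", probe_logits_gate),
    ("constitutional_policy", constitutional_policy),
]


def run_gateway(call: ToolCall) -> Tuple[bool, str]:
    """Run the layers in order; the first denial short-circuits the rest."""
    for name, layer in PIPELINE:
        reason = layer(call)
        if reason is not None:
            return False, f"{name}: {reason}"
    return True, "allowed"
```

Placing the cheap checks first means most malformed or unauthorized calls are rejected before the expensive semantic step runs, which is one plausible reason for the ordering given in the paper.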
If this is right
- All fifteen WASM-to-system host functions are forced through the gateway, eliminating the possibility of userspace bypass.
- Removing the semantic ProbeLogits layer drops F1 score from 0.773 to 0.327 on the 101-prompt benchmark, showing hand-rule methods alone are insufficient.
- The five non-inference layers add 65.3 microseconds of overhead per call on the custom OS.
- Each ProbeLogits semantic decision (made per token class) adds 65 milliseconds on a 7B Q4_0 model.
- A Blake3-hashed audit chain records every mediated call for later inspection.
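The hashed audit chain in the last bullet can be illustrated with a small hash-linked log. This is a sketch under stated assumptions: the record layout and function names are hypothetical, and because Blake3 is not in the Python standard library the example substitutes `hashlib.blake2b`, a different hash function, purely to show the chaining idea.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder predecessor hash for the first record


def _digest(prev_hash: str, entry: Dict[str, str]) -> str:
    # Stand-in hash: BLAKE2b from the standard library, NOT Blake3.
    payload = prev_hash + json.dumps(entry, sort_keys=True)
    return hashlib.blake2b(payload.encode("utf-8"), digest_size=32).hexdigest()


def append_record(chain: List[Dict], entry: Dict[str, str]) -> None:
    """Append an entry whose hash covers the previous record's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"entry": entry, "prev": prev_hash, "hash": _digest(prev_hash, entry)})


def verify_chain(chain: List[Dict]) -> bool:
    """Recompute every link; any edited or reordered record breaks the chain."""
    prev_hash = GENESIS
    for record in chain:
        expected = _digest(prev_hash, record["entry"])
        if record["prev"] != prev_hash or record["hash"] != expected:
            return False
        prev_hash = record["hash"]
    return True
```

Because each hash covers its predecessor, changing one recorded call invalidates every later record unless the whole tail is recomputed, which is what makes such a log useful for later inspection.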
Where Pith is reading between the lines
- Similar kernel-level mediation could be applied to other agent tool-calling interfaces beyond MCP.
- OS designers may need to treat AI agent tool calls as a new class of privileged operation equivalent to syscalls.
- The approach suggests that semantic safety checks can be made part of the OS rather than left to application-level libraries.
- Scaling the gateway to production workloads would require measuring its effect on overall agent latency and throughput.
Load-bearing premise
The ProbeLogits gate can correctly identify unsafe or unintended tool calls across realistic adversarial and benign cases.
What would settle it
Either a short, working userspace script that evades the kernel gateway, or a set of prompts on which ProbeLogits makes incorrect safety decisions where the paper claims reliable detection.
Original abstract
AI agents increasingly call external tools (file system, network, APIs) through the Model Context Protocol (MCP). These tool calls are the agent's syscalls -- privileged operations with side effects on shared state -- yet today's safety enforcement lives entirely in userspace, where a 10-line script can bypass it. I propose Governed MCP, a kernel-resident tool governance gateway built on a logit-based safety primitive (ProbeLogits, companion paper: arXiv:2604.11943). The gateway interposes on every MCP tool call in a 6-layer pipeline: schema validation, trust tier check, rate limit, adversarial pre-filter, ProbeLogits gate (the load-bearing semantic check), and constitutional policy match, with a Blake3-hashed audit chain. I implement Governed MCP in Anima OS, a bare-metal x86_64 OS in approximately 86,000 lines of Rust. The five non-inference layers add 65.3 microseconds of overhead per call; ProbeLogits adds 65 ms (per-token-class semantic decision) on 7B Q4_0. A 4-config ablation on a 101-prompt MCP-domain benchmark shows that removing the ProbeLogits layer collapses F1 from 0.773 to 0.327 (Delta F1 = -0.446) -- hand-rule firewalling alone is insufficient. All 15 WASM-to-system host functions in the runtime route through the gateway (complete mediation of the WASM ABI surface; the scope and caveats of this claim are stated in Section 4.6); a 10-LoC userspace bypass that defeats existing guardrail libraries is structurally impossible against the kernel-resident gate.
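The headline numbers in the abstract can be put side by side with a few lines of arithmetic. The sketch below assumes that each mediated call triggers exactly one ProbeLogits decision and that the two latency figures simply add; neither assumption is stated in the abstract, and the variable names are illustrative.

```python
# Figures quoted from the abstract.
NON_INFERENCE_OVERHEAD_S = 65.3e-6  # five non-inference layers, per call
PROBELOGITS_OVERHEAD_S = 65e-3      # one semantic decision on 7B Q4_0
F1_FULL = 0.773
F1_WITHOUT_PROBELOGITS = 0.327

# Assumption: one ProbeLogits decision per call, latencies are additive.
total_s = NON_INFERENCE_OVERHEAD_S + PROBELOGITS_OVERHEAD_S
non_inference_share = NON_INFERENCE_OVERHEAD_S / total_s
delta_f1 = F1_WITHOUT_PROBELOGITS - F1_FULL

print(f"total added latency: {total_s * 1e3:.3f} ms")
print(f"non-inference share: {non_inference_share:.2%}")
print(f"delta F1: {delta_f1:+.3f}")
```

Under those assumptions the five rule-based layers account for roughly a thousandth of the added latency, so the per-call cost is dominated by the semantic gate, and the F1 difference matches the reported Delta F1 of -0.446.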
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Governed MCP, a kernel-resident tool governance gateway for AI agents using the Model Context Protocol (MCP). Implemented in Anima OS (an ~86,000-line bare-metal x86_64 Rust OS), it interposes a 6-layer pipeline (schema validation, trust tier check, rate limit, adversarial pre-filter, ProbeLogits semantic gate from companion paper arXiv:2604.11943, and constitutional policy match) with a Blake3-hashed audit chain on MCP tool calls. Key results include 65.3 μs overhead for the five non-inference layers, 65 ms added by ProbeLogits (per-token-class decision on 7B Q4_0), an ablation on a 101-prompt benchmark showing F1 collapse from 0.773 to 0.327 without ProbeLogits, and the claim that all 15 WASM-to-system host functions route through the gateway, rendering 10-LoC userspace bypasses structurally impossible.
Significance. If the complete mediation of the WASM ABI surface holds and the ProbeLogits primitive reliably identifies unsafe tool calls, the work provides a concrete advance in AI agent safety by relocating enforcement from easily bypassed userspace guardrails to a kernel-resident gate, backed by an implemented system, explicit performance numbers, and ablation evidence that rule-based layers alone are insufficient.
major comments (3)
- [Section 4.6] Section 4.6: The claim that 'all 15 WASM-to-system host functions in the runtime route through the gateway' with complete mediation of the WASM ABI surface (making 10-LoC userspace bypasses structurally impossible) lacks an explicit enumeration of the 15 functions, a formal argument that these exhaust every MCP tool invocation path, and analysis of indirect mechanisms such as alternative ABI encodings or direct kernel interfaces outside the WASM host surface. This assumption in the 86k-line Anima OS is load-bearing for the security guarantee.
- [Abstract] Abstract and ablation section: The 101-prompt MCP-domain benchmark reports F1 of 0.773 (full pipeline) vs. 0.327 (without ProbeLogits), but provides no details on benchmark construction, prompt diversity, adversarial coverage, error bars, or statistical significance. This weakens the evidence for the ProbeLogits layer's contribution given the small sample size.
- [Abstract] Abstract and Section 4 (ProbeLogits gate): The load-bearing semantic decision reduces to the ProbeLogits primitive from companion paper arXiv:2604.11943. The ablation demonstrates necessity within this work but does not independently validate the primitive's reliability across realistic adversarial and benign scenarios here.
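To give a sense of scale for the sample-size concern in the second comment, the sketch below computes a Wilson score interval for a proportion measured on 101 items. It deliberately treats 0.773 as if it were a plain proportion, which F1 is not, so the result indicates only the order of magnitude of sampling uncertainty at this benchmark size. A proper interval for F1 would need the per-prompt labels, which are not available here.

```python
import math
from typing import Tuple


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Illustration only: F1 is not a simple proportion over the 101 prompts.
low, high = wilson_interval(0.773, 101)
print(f"n=101, p=0.773 -> [{low:.3f}, {high:.3f}]")
```

The interval spans roughly eight points on either side, far less than the reported 0.446 gap, but it shows that a deterministic evaluation still carries uncertainty from which prompts were sampled.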
minor comments (1)
- [Abstract] Abstract: The parenthetical note that 'the scope and caveats of this claim are stated in Section 4.6' would benefit from a one-sentence summary of those caveats to help readers assess the security claim without immediately consulting the section.
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive feedback on our manuscript. We address each of the major comments below, providing clarifications and indicating where revisions will be made to strengthen the paper.
Point-by-point responses
- Referee: [Section 4.6] Section 4.6: The claim that 'all 15 WASM-to-system host functions in the runtime route through the gateway' with complete mediation of the WASM ABI surface (making 10-LoC userspace bypasses structurally impossible) lacks an explicit enumeration of the 15 functions, a formal argument that these exhaust every MCP tool invocation path, and analysis of indirect mechanisms such as alternative ABI encodings or direct kernel interfaces outside the WASM host surface. This assumption in the 86k-line Anima OS is load-bearing for the security guarantee.
  Authors: We agree that providing an explicit enumeration and a more formal argument would enhance the clarity and verifiability of our security claims. In the revised version, we will add a table in Section 4.6 enumerating all 15 WASM-to-system host functions, along with a structured argument demonstrating that these functions constitute the complete set of MCP tool invocation paths within the Anima OS WASM runtime. We will also address potential indirect mechanisms by explaining the isolation properties of the kernel-resident gateway and why alternative ABI encodings or direct kernel interfaces are not exposed to userspace code. This revision will make the complete mediation claim more robust without altering the core results.
  Revision: yes
- Referee: [Abstract] Abstract and ablation section: The 101-prompt MCP-domain benchmark reports F1 of 0.773 (full pipeline) vs. 0.327 (without ProbeLogits), but provides no details on benchmark construction, prompt diversity, adversarial coverage, error bars, or statistical significance. This weakens the evidence for the ProbeLogits layer's contribution given the small sample size.
  Authors: We acknowledge that additional details on the benchmark would improve the manuscript. The 101-prompt set was curated to include a balanced mix of benign tool calls and adversarial attempts targeting common MCP operations such as file access and network requests. We will expand the ablation section to describe the benchmark construction process, note the diversity (covering multiple tool types and prompt variations), and clarify that the evaluation is deterministic, which is why no error bars or statistical tests were applied. While the sample size is modest, the large F1 delta (0.446) provides clear evidence of the layer's contribution; we will add a note on this limitation.
  Revision: yes
- Referee: [Abstract] Abstract and Section 4 (ProbeLogits gate): The load-bearing semantic decision reduces to the ProbeLogits primitive from companion paper arXiv:2604.11943. The ablation demonstrates necessity within this work but does not independently validate the primitive's reliability across realistic adversarial and benign scenarios here.
  Authors: The ProbeLogits primitive's reliability is established in the companion paper (arXiv:2604.11943), which includes evaluations on diverse datasets. This manuscript's contribution is the integration into the 6-layer pipeline and the ablation study demonstrating its necessity for high F1 performance. We do not perform new independent validation here, as that would duplicate the companion work. In revision, we will include a concise summary of the key validation results from the companion paper to make the reliance explicit and self-contained for readers.
  Revision: partial
Circularity Check
Core semantic safety gate reduces to self-cited companion primitive
specific steps
- Self-citation, load-bearing [Abstract]
"I propose Governed MCP, a kernel-resident tool governance gateway built on a logit-based safety primitive (ProbeLogits, companion paper: arXiv:2604.11943). ... ProbeLogits gate (the load-bearing semantic check), ... All 15 WASM-to-system host functions in the runtime route through the gateway (complete mediation of the WASM ABI surface; the scope and caveats of this claim are stated in Section 4.6)"
The load-bearing semantic decision is the ProbeLogits gate, which is defined and justified only in the companion paper by the same author. The present paper's claims about structural impossibility of bypasses and the 6-layer pipeline's effectiveness therefore reduce to the correctness of that external primitive; the ablation merely demonstrates necessity on this paper's benchmark rather than deriving the gate.
full rationale
The paper's central guarantee—that kernel residency makes 10-LoC userspace bypasses structurally impossible—rests on complete mediation plus the load-bearing ProbeLogits semantic check. The latter is imported wholesale from the companion paper by the same author and is only shown to be necessary via an ablation on a 101-prompt benchmark; no independent derivation or validation of the primitive occurs inside this work. The mediation claim itself is asserted as an implementation fact (Section 4.6) without exhibited enumeration or formal exhaustion argument, but does not constitute a definitional reduction. This produces partial circularity: the safety result is not self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: ProbeLogits provides reliable semantic safety classification for MCP tool calls.
Reference graph
Works this paper leans on
- [1] Anthropic, “Model Context Protocol Specification,” 2024. https://modelcontextprotocol.io
- [2] A. Rossberg (ed.), “WebAssembly Core Specification,” W3C Recommendation, 2019/2024. https://www.w3.org/TR/wasm-core/
- [3] JSON-RPC Working Group, “JSON-RPC 2.0 Specification,” 2013. https://www.jsonrpc.org/specification
- [4] OpenAI, “Function Calling and Tool Use Documentation,” 2023–2024. https://platform.openai.com/docs/guides/function-calling
- [5] H. Chase, “LangChain: Building Applications with LLMs through Composability,” GitHub repository, 2022.
- [6] D. Son, “ProbeLogits: Kernel-Level LLM Inference Primitives for AI-Native Operating Systems,” arXiv:2604.11943, 2026.
- [7] T. Rebedea, R. Dinu, M. Sreedhar, et al., “NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails,” arXiv:2310.10501, NVIDIA, 2023.
- [8] Microsoft, “Agent Governance Toolkit (AGT) for LLM Tool Calls,” Microsoft Research preview, 2026.
- [9] T. B. Richards (Significant-Gravitas), “AutoGPT: An Autonomous GPT-4 Experiment,” GitHub repository, 2023.
- [10] H. Inan et al., “Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations,” arXiv:2312.06674, 2023.
- [11] Meta, “Llama Guard 3-8B Model Card,” https://github.com/meta-llama/PurpleLlama, 2024.
- [12] S. Han et al., “WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs,” arXiv:2406.18495, 2024.
- [13] “Guillotine: Hypervisor-Based Isolation for Adversarial AI Agents,” HotOS 2025. Affiliation: Harvard/Princeton.
- [14] K. Mei et al., “AIOS: LLM Agent Operating System,” COLM 2025.
- [15] J. P. Anderson, “Computer Security Technology Planning Study,” Tech. Rep. ESD-TR-73-51, 1972.
- [16] K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, “Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection,” AISec Workshop @ CCS 2023.
- [17] J. H. Saltzer and M. D. Schroeder, “The Protection of Information in Computer Systems,” Proceedings of the IEEE, 63(9):1278–1308, 1975.
- [18] R. Spencer, S. Smalley, et al., “The Flask Security Architecture,” USENIX Security 1999.
- [19] R. N. M. Watson et al., “Capsicum: Practical Capabilities for UNIX,” USENIX Security 2010.
- [20] B. Willard and R. Louf, “Efficient Guided Generation for Large Language Models,” arXiv:2307.09702, 2023.
- [21] Microsoft, “llguidance: Fast Constrained Decoding Library,” GitHub repository, 2024.
- [22] M. Mazeika et al., “HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal,” ICML 2024.
- [23] P. Röttger et al., “XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models,” NAACL 2024.
- [24] Z. Lin et al., “ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversation,” Findings of EMNLP 2023.