pith. machine review for the scientific record.

arxiv: 2604.11943 · v2 · submitted 2026-04-13 · 💻 cs.OS · cs.LG

Recognition: unknown

ProbeLogits: Kernel-Level LLM Inference Primitives for AI-Native Operating Systems

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:23 UTC · model grok-4.3

classification 💻 cs.OS cs.LG
keywords: kernel LLM inference · logit probing · AI agent safety · zero-parameter classification · verbalizer calibration · AI-native operating systems · HarmBench evaluation · ToxicChat benchmark

The pith

Kernel reads specific token logits in one forward pass to classify AI agent actions as safe or dangerous with no learned parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that an operating system kernel can run LLM inference internally and use a single forward pass to inspect logits for chosen tokens, deciding whether an agent action should be blocked before any text is generated. This ProbeLogits primitive works on unmodified base models across benchmarks, reaching 97-99 percent block rates on HarmBench non-copyright prompts and F1 scores that match or exceed Llama Guard 3 on ToxicChat while running faster because it avoids token generation. The method places the check below the WASM sandbox and before kernel host functions, so enforcement cannot be bypassed at the application layer. A calibration value alpha adjusts for verbalizer bias at deployment time without retraining. If correct, this turns safety classification into a lightweight kernel governance tool rather than a separate model or post-processing step.

Core claim

ProbeLogits is a kernel-level operation that performs one forward pass on a base LLM and reads the logits of specific vocabulary tokens defined by a verbalizer to label an agent action safe or dangerous. Across Qwen 2.5-7B, Llama 3 8B, and Mistral 7B, the approach yields 97-99 percent block rates on HarmBench non-copyright cases with suitable verbalizers. On ToxicChat the strongest configuration reaches F1 0.812 with confidence intervals disjoint from Llama Guard 3, while latency is roughly 2.5 times lower in hosted settings and 65 ms in the bare-metal native runtime, because only a single logit position is inspected instead of generating output. The implementation sits inside Anima OS, below the WASM sandbox boundary and ahead of the kernel-mediated host functions.

What carries the argument

ProbeLogits, a kernel operation that executes one forward pass and reads logits for a fixed verbalizer's safe and dangerous tokens to produce a classification, using alpha as a deployment-time knob to correct prior asymmetry.
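
For concreteness, here is a minimal user-space sketch of the primitive as described above (Python with Hugging Face transformers). The prompt template, the content-free "N/A" input, and the reading of alpha as a contextual-calibration strength in the sense of Zhao et al. [27] are editorial assumptions, not details quoted from the paper; the actual implementation is a Rust kernel operation inside Anima OS.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B"  # one of the three evaluated base models
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical prompt template; the paper's exact wording is not given here.
PROMPT = (
    "Classify the following agent action as Safe or Dangerous.\n"
    "Action: {action}\nClassification:"
)

# Verbalizer: fixed vocabulary tokens whose logits are read out directly.
# Assumes each label word is a single token when preceded by a space.
SAFE = tok.encode(" Safe", add_special_tokens=False)[0]
DANGER = tok.encode(" Dangerous", add_special_tokens=False)[0]

def next_token_logits(text: str) -> torch.Tensor:
    """One forward pass; logits over the vocabulary for the next token."""
    ids = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        return model(**ids).logits[0, -1]

def probe_logits(action: str, alpha: float = 0.0) -> str:
    z = next_token_logits(PROMPT.format(action=action))
    score = z[SAFE] - z[DANGER]
    if alpha != 0.0:
        # Contextual calibration in the spirit of Zhao et al. [27]:
        # subtract alpha times the prior asymmetry measured on a
        # content-free input. alpha = 0.0 is the uncalibrated baseline.
        z0 = next_token_logits(PROMPT.format(action="N/A"))
        score = score - alpha * (z0[SAFE] - z0[DANGER])
    return "safe" if score.item() > 0 else "dangerous"

print(probe_logits("delete all files under /home and disable auditd"))
```

The sketch makes the cost claim concrete: classification is one forward pass plus two scalar reads from the logit vector, with no decoding loop, which is where the latency advantage over a generate-then-parse guard model comes from.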

If this is right

  • Classification occurs without generating any tokens, cutting latency compared with full guard-model inference.
  • Checks run below the WASM sandbox boundary and before the 15 kernel host functions, raising the bar for circumvention.
  • Alpha provides a runtime policy dial that corrects verbalizer bias without retraining or added parameters.
  • The same primitive applies across multiple base models without model-specific fine-tuning.
  • Enforcement integrates directly into agent execution paths inside the OS rather than as an external service.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Placing safety decisions at the kernel logit level could let OS designers compose multiple lightweight primitives into richer policies without multiplying inference cost.
  • The approach suggests safety can move from separate application-layer models to shared OS primitives that every agent must cross.
  • If alpha calibration generalizes, deployment teams could tune safety thresholds per environment without touching model weights.

Load-bearing premise

A fixed verbalizer and single logit read on pre-trained models will separate safe from dangerous actions reliably across new tasks and models without any post-hoc tuning that effectively adds parameters.

What would settle it

A new benchmark of agent action prompts where ProbeLogits F1 falls significantly below Llama Guard 3 or where block rate on clearly harmful actions drops below 90 percent for all tested verbalizers.

Original abstract

An OS kernel that runs LLM inference internally can read logit distributions before any text is generated and act on them as a governance primitive. This paper presents ProbeLogits, a kernel-level operation that performs a single forward pass and reads specific token logits to classify agent actions as safe or dangerous, with zero learned parameters. I evaluate ProbeLogits across three base models (Qwen 2.5-7B, Llama 3 8B, Mistral 7B) on three external benchmarks: HarmBench, XSTest, and ToxicChat. On HarmBench non-copyright (n=300), all three models reach 97-99% block rate with the right verbalizer. On ToxicChat (n=1,000), ProbeLogits achieves F1 parity-or-better against Llama Guard 3 in the same hosted environment: the strongest configuration (Qwen 2.5-7B Safe/Dangerous, alpha=0.0) reaches F1=0.812 with bootstrap 95% CIs disjoint from LG3 (+13.7pp significant); Llama 3 S/D matches LG3 within CI (+0.4pp, parity); Mistral Y/N exceeds by +4.4pp. Latency is approximately 2.5x faster than LG3 in the same hosted environment because the primitive reads a single logit position instead of generating tokens; in the bare-metal native runtime ProbeLogits drops to 65 ms. A key design contribution is the calibration strength alpha, which serves as a deployment-time policy knob rather than a learned hyperparameter. Contextual calibration corrects verbalizer prior asymmetry, with bias magnitude varying by (model, verbalizer) pair. I implement ProbeLogits within Anima OS, a bare-metal x86_64 OS written in approximately 86,000 lines of Rust. Because agent actions must pass through 15 kernel-mediated host functions, ProbeLogits enforcement operates below the WASM sandbox boundary, making it significantly harder to circumvent than application-layer classifiers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes ProbeLogits, a kernel-level primitive for AI-native OSes that performs a single forward pass on a pre-trained LLM and reads fixed token logits (via a verbalizer) to classify agent actions as safe or dangerous, claiming zero learned parameters. It reports 97-99% block rates on HarmBench non-copyright (n=300) across Qwen 2.5-7B, Llama 3 8B, and Mistral 7B with model-specific verbalizers, and on ToxicChat (n=1000) the strongest config (Qwen S/D + alpha=0.0) reaches F1=0.812 (bootstrap 95% CI disjoint from Llama Guard 3 by +13.7pp), with 2.5x lower latency than LG3 in the same environment and 65ms in bare-metal Anima OS (86k LOC Rust). Alpha is presented as a deployment knob for contextual calibration whose bias varies by (model, verbalizer) pair.

Significance. If the zero-parameter generalization holds without per-model search, this would be a notable contribution to efficient, kernel-enforced safety primitives that operate below the WASM boundary and are harder to circumvent than application-layer classifiers. The concrete benchmark results with bootstrap CIs, direct same-environment latency comparison, and full OS implementation provide reproducible empirical grounding; the design of alpha as a tunable policy knob (rather than learned) is a practical strength.

major comments (3)
  1. [Abstract] Abstract and evaluation: the central 'zero learned parameters' claim is load-bearing but qualified by references to 'the right verbalizer' (Safe/Dangerous for Qwen, Y/N for Mistral) and 'strongest configuration' (Qwen S/D + alpha=0.0). If verbalizer selection requires trying options on target data or per-model search, this introduces post-hoc choices that function as hidden parameters and undermine the no-tuning generalization for unseen models/tasks. Please provide an explicit, a-priori procedure for verbalizer choice that does not rely on evaluation-set performance.
  2. [Abstract] Abstract and calibration section: alpha is described as a deployment-time knob whose bias magnitude 'varies by (model, verbalizer) pair,' yet the reported peak results use specific values (e.g., alpha=0.0). If determining the appropriate alpha or pair for a new model requires calibration on held-out data, this contradicts the zero-parameter framing. Clarify whether alpha can be set without reference to the evaluation distribution and report performance for a fixed alpha across all models.
  3. [Evaluation] Evaluation on HarmBench and ToxicChat: results are reported only for the per-model 'right' verbalizer and strongest config. To support the claim that a fixed verbalizer plus single-pass logit reading works across unseen tasks/models, include an ablation with a single fixed verbalizer (e.g., always Y/N) applied uniformly and report the resulting F1/block rates with CIs.
minor comments (2)
  1. [Abstract] The abstract states latency is 'approximately 2.5x faster' and 'drops to 65 ms' in bare-metal; move the exact per-model latency tables and measurement methodology into the main text with error bars.
  2. Clarify the exact logit indices read for each verbalizer (e.g., which token IDs correspond to 'Safe' vs 'Dangerous') and whether they are model-specific or derived from a fixed mapping (a tokenizer sketch illustrating the ambiguity follows this report).
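
An editorial illustration of the ambiguity behind minor comment 2 (the Hugging Face repository names are assumptions of this sketch): verbalizer token IDs are tokenizer-specific, and a label word with or without a leading space can tokenize differently, so even a "fixed mapping" must be resolved per model.

```python
from transformers import AutoTokenizer

# Print how each candidate verbalizer word tokenizes per model.
# Multi-token encodings force a first-subtoken (or other) convention.
for name in [
    "Qwen/Qwen2.5-7B",
    "meta-llama/Meta-Llama-3-8B",
    "mistralai/Mistral-7B-v0.3",
]:
    tok = AutoTokenizer.from_pretrained(name)
    for word in ["Safe", " Safe", "Dangerous", " Dangerous", "Yes", " Yes"]:
        print(name, repr(word), tok.encode(word, add_special_tokens=False))
```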

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on clarifying the zero-parameter claim and strengthening the evaluation. We address each major comment point-by-point below, committing to revisions that preserve the manuscript's core contributions while improving rigor.

Point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation: the central 'zero learned parameters' claim is load-bearing but qualified by references to 'the right verbalizer' (Safe/Dangerous for Qwen, Y/N for Mistral) and 'strongest configuration' (Qwen S/D + alpha=0.0). If verbalizer selection requires trying options on target data or per-model search, this introduces post-hoc choices that function as hidden parameters and undermine the no-tuning generalization for unseen models/tasks. Please provide an explicit, a-priori procedure for verbalizer choice that does not rely on evaluation-set performance.

    Authors: The zero learned parameters claim refers to the absence of any trained classifier weights; the verbalizer is a static token-to-label mapping chosen from the model's vocabulary. We will revise the abstract and methods to state an explicit a-priori procedure: select the pair of single-token English words most commonly used for binary affirmation/negation in the model's tokenizer (prioritizing 'Yes'/'No' when available as single tokens, otherwise 'Safe'/'Danger' or equivalents documented in prior LLM safety literature). This selection is made from model documentation alone, without reference to any evaluation data or performance metrics. The revised manuscript will document this rule and note that it yields the verbalizers used in our experiments (one mechanical reading of this rule is sketched after these responses). revision: yes

  2. Referee: [Abstract] Abstract and calibration section: alpha is described as a deployment-time knob whose bias magnitude 'varies by (model, verbalizer) pair,' yet the reported peak results use specific values (e.g., alpha=0.0). If determining the appropriate alpha or pair for a new model requires calibration on held-out data, this contradicts the zero-parameter framing. Clarify whether alpha can be set without reference to the evaluation distribution and report performance for a fixed alpha across all models.

    Authors: Alpha is a fixed, non-learned scalar set at deployment; its default value of 0.0 requires no reference to any data distribution and represents the uncalibrated baseline. We will revise the calibration section and abstract to emphasize that alpha=0.0 is the recommended fixed setting for all models and verbalizers. We will also add a table reporting block rates and F1 scores for this fixed alpha=0.0 across all three models on both benchmarks, with bootstrap CIs, to demonstrate performance without per-model tuning. revision: yes

  3. Referee: [Evaluation] Evaluation on HarmBench and ToxicChat: results are reported only for the per-model 'right' verbalizer and strongest config. To support the claim that a fixed verbalizer plus single-pass logit reading works across unseen tasks/models, include an ablation with a single fixed verbalizer (e.g., always Y/N) applied uniformly and report the resulting F1/block rates with CIs.

    Authors: We agree that a uniform-verbalizer ablation directly addresses generalization concerns. We will add this ablation to the evaluation section: apply the fixed 'Y/N' verbalizer (or nearest single-token equivalent) uniformly across Qwen 2.5-7B, Llama 3 8B, and Mistral 7B. We will report the resulting block rates on HarmBench (n=300) and F1 scores on ToxicChat (n=1000) with bootstrap 95% CIs, alongside the per-model results, to quantify the performance trade-off of a single fixed choice. revision: yes
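
Two editorial sketches of what the committed revisions could look like. First, one mechanical reading of the a-priori verbalizer rule from response 1 (the single-token test with a leading space is an assumption, not a procedure quoted from the paper):

```python
def choose_verbalizer(tok):
    """A-priori rule per the rebuttal (sketch): prefer Yes/No when both
    encode to single tokens in this tokenizer, else fall back to
    Safe/Danger. No evaluation data or performance metric is consulted."""
    for pair in (("Yes", "No"), ("Safe", "Danger")):
        if all(
            len(tok.encode(" " + w, add_special_tokens=False)) == 1
            for w in pair
        ):
            return pair
    return ("Safe", "Danger")  # documented fallback
```

Second, a minimal percentile-bootstrap 95% CI of the kind the promised ablation tables would report (the paper's exact resampling scheme is not specified in the material above):

```python
import numpy as np

def f1(y_true, y_pred):
    # Binary F1 with label 1 ("dangerous"/"toxic") as the positive class.
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def bootstrap_f1_ci(y_true, y_pred, n_boot=10_000, seed=0):
    """Percentile bootstrap over examples: resample (truth, prediction)
    pairs with replacement and take the 2.5th/97.5th percentiles."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = [
        f1(y_true[idx], y_pred[idx])
        for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))
    ]
    return np.percentile(stats, [2.5, 97.5])
```

Two configurations are "disjoint" in the paper's sense when these intervals do not overlap.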

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper presents ProbeLogits as an empirical kernel primitive evaluated on independent external benchmarks (HarmBench n=300, ToxicChat n=1,000, XSTest). Reported metrics (block rates of 97-99%, F1 scores with bootstrap CIs) are measured directly against these held-out datasets rather than derived from internally fitted parameters, self-defined quantities, or equations that reduce to the inputs by construction. Alpha is described explicitly as a deployment-time policy knob whose bias varies by model-verbalizer pair but is not optimized on the evaluation data. No mathematical derivation chains, load-bearing self-citations, or ansatz smuggling appear in the manuscript text; the zero-learned-parameter claim rests on the single-pass logit-read design and the external benchmark results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the domain assumption that pre-trained LLM logit distributions already encode actionable safety signals that can be read out with a fixed verbalizer and a single scalar knob.

free parameters (1)
  • alpha
    Calibration strength used to correct verbalizer prior asymmetry; treated as a deployment-time policy knob rather than a fitted hyperparameter.
axioms (1)
  • domain assumption: Logit values of specific verbalizer tokens from a single forward pass suffice to classify agent actions as safe or dangerous across models and tasks
    This is the load-bearing premise that allows the method to claim zero learned parameters.

pith-pipeline@v0.9.0 · 5680 in / 1362 out tokens · 52767 ms · 2026-05-10T16:23:46.973780+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Governed MCP: Kernel-Level Tool Governance for AI Agents via Logit-Based Safety Primitives

    cs.CR · 2026-04 · unverdicted · novelty 7.0

    Governed MCP implements kernel-level governance for MCP tool calls in AI agents through a 6-layer pipeline including ProbeLogits semantic verification, with an ablation showing F1 drop from 0.773 to 0.327 without it a...

Reference graph

Works this paper leans on

32 extracted references · 9 canonical work pages · cited by 1 Pith paper · 5 internal anchors

  1. [1]

    AutoGPT: An Autonomous GPT-4 Experiment,

    T. B. Richards (Significant-Gravitas), “AutoGPT: An Autonomous GPT-4 Experiment,” GitHub repository, 2023

  2. [2]

    Constitutional AI: Harmlessness from AI Feedback

    Y. Bai et al., “Constitutional AI: Harmlessness from AI Feedback,” arXiv:2212.08073, 2022

  3. [3]

    Discovering Latent Knowledge in Language Models Without Supervision,

    C. Burns, H. Ye, D. Klein, and J. Steinhardt, “Discovering Latent Knowledge in Language Models Without Supervision,” ICLR 2023

  4. [4]

    CrewAI: Framework for Orchestrating Role-Playing Autonomous AI Agents,

    J. Moura, “CrewAI: Framework for Orchestrating Role-Playing Autonomous AI Agents,” GitHub repository, 2024

  5. [5]

    Guidance: A Language for Controlling Large Language Models,

    S. Lundberg et al., “Guidance: A Language for Controlling Large Language Models,” GitHub repository, 2023

  6. [6]

    LangChain: Building Applications with LLMs through Composability,

    H. Chase, “LangChain: Building Applications with LLMs through Composability,” GitHub repository, 2022

  7. [7]

    llama.cpp: LLM inference in C/C++,

    G. Gerganov, “llama.cpp: LLM inference in C/C++,” GitHub repository, 2023

  8. [8]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    H. Inan et al., “Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations,” arXiv:2312.06674, 2023

  9. [9]

    llguidance: Fast Constrained Decoding Library,

    Microsoft, “llguidance: Fast Constrained Decoding Library,” GitHub repository, 2024

  10. [10]

    MetaGPT: Meta Programming for A Multi- Agent Collaborative Framework,

    S. Hong et al., “MetaGPT: Meta Programming for A Multi- Agent Collaborative Framework,” ICLR 2024

  11. [11]

    NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails,

    T. Rebedea, R. Dinu, M. Sreedhar, et al., “NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails,” arXiv:2310.10501, NVIDIA, 2023

  12. [12]

    OS-Copilot: Towards Generalist Computer Agents with Self-Improvement

    Z. Wu et al., “OS-Copilot: Towards Generalist Computer Agents with Self-Improvement,” arXiv:2402.07456, 2024

  13. [13]

    Efficient Guided Generation for Large Language Models

    B. Willard and R. Louf, “Efficient Guided Generation for Large Language Models,” arXiv:2307.09702, 2023

  14. [14]

    Petals: Collaborative Inference and Fine-tuning of Large Models,

    A. Borzunov et al., “Petals: Collaborative Inference and Fine-tuning of Large Models,” ACL 2023 (demo)

  15. [15]

    In Search of an Understandable Consensus Algorithm,

    D. Ongaro and J. Ousterhout, “In Search of an Understandable Consensus Algorithm,” USENIX ATC 2014

  16. [16]

    seL4: Formal Verification of an OS Kernel,

    G. Klein et al., “seL4: Formal Verification of an OS Kernel,” SOSP 2009

  17. [17]

    SGLang: Efficient Execution of Structured Language Model Programs

    L. Zheng et al., “SGLang: Efficient Execution of Structured Language Model Programs,” arXiv:2312.07104, 2023

  18. [18]

    Theseus: An Experiment in Operating System Structure and State Management,

    K. Boos, N. Liber, and L. Zhong, “Theseus: An Experiment in Operating System Structure and State Management,” OSDI 2020

  19. [19]

    Efficient Memory Management for Large Language Model Serving with PagedAttention,

    W. Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention,” SOSP 2023

  20. [20]

    Llama Guard 3-8B Model Card,

    Meta, “Llama Guard 3-8B Model Card,” https://github.com/meta-llama/PurpleLlama, 2024

  21. [21]

    Agent Governance Toolkit (AGT) for LLM Tool Calls,

    Microsoft, “Agent Governance Toolkit (AGT) for LLM Tool Calls,” Microsoft Research preview, 2026

  22. [22]

    AIOS: LLM Agent Operating System,

    K. Mei et al., “AIOS: LLM Agent Operating System,” COLM 2025

  23. [23]

    Right to History: A Sovereignty Kernel for Veri- fiable AI Agent Execution,

    Z. Zhang, “Right to History: A Sovereignty Kernel for Veri- fiable AI Agent Execution,” arXiv:2602.20214, 2026

  24. [24]

    Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference,

    T. Schick and H. Schütze, “Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference,” EACL 2021

  25. [25]

    ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversation,

    Z. Lin et al., “ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversation,” Findings of EMNLP 2023

  26. [26]

    WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs,

    S. Han et al., “WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs,” arXiv:2406.18495, 2024

  27. [27]

    Calibrate Before Use: Improving Few-Shot Performance of Language Models,

    Z. Zhao, E. Wallace, S. Feng, D. Klein, and S. Singh, “Calibrate Before Use: Improving Few-Shot Performance of Language Models,” ICML 2021

  28. [28]

    Representation Engineering: A Top-Down Approach to AI Transparency

    A. Zou et al., “Representation Engineering: A Top-Down Approach to AI Transparency,” arXiv:2310.01405, 2023

  29. [29]

    Governed MCP: Kernel-Level Tool Governance for AI Agents via Logit-Based Safety Primitives,

    D. Son, “Governed MCP: Kernel-Level Tool Governance for AI Agents via Logit-Based Safety Primitives,” Anima OS companion paper, 2026 (in preparation)

  30. [30]

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal,

    M. Mazeika et al., “HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal,” ICML 2024

  31. [31]

    Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right,

    A. Holtzman, P. West, V. Shwartz, Y. Choi, and L. Zettlemoyer, “Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right,” EMNLP 2021

  32. [32]

    XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models,

    P. Röttger et al., “XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models,” NAACL 2024