pith. machine review for the scientific record.

arxiv: 2604.22032 · v1 · submitted 2026-04-23 · 💻 cs.LG · cs.PL


Kernel Contracts: A Specification Language for ML Kernel Correctness Across Heterogeneous Silicon


Pith reviewed 2026-05-09 22:27 UTC · model grok-4.3

classification 💻 cs.LG cs.PL
keywords kernel contracts · specification language · ML kernels · heterogeneous silicon · precision tolerance · conformance testing · compiler-induced failures · exceptional values

The pith

ML kernel disagreements across silicon can be arbitrated by writing explicit eight-part contracts that define expected behavior and measurable violations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a specification language for making the implicit contracts of machine learning kernels explicit and testable. Each contract is built from eight required parts: an identifier, scope, precondition, postcondition, tolerance for differences, a reference oracle, a measurement protocol, and a violation signature. Twelve reusable contract classes are defined to cover precision, ordering, compiler-induced, and exceptional-value problems, each drawn from published empirical measurements of real hardware differences. The approach is demonstrated by mapping three documented incidents on Huawei, Sakana AI, and AMD platforms onto specific, measurable contract violations. If the language works as described, kernel conformance could be graded against a contract suite in the same way industrial systems are certified against security standards.

Core claim

We present a specification language for kernel contracts. A contract has eight parts: identifier, scope, precondition, postcondition, tolerance, reference oracle, measurement protocol, and violation signature. We use it to state twelve contract classes covering precision, ordering, compiler-induced, and exceptional-value failure modes, each grounded in published empirical evidence. We require a three-state calibration: every contract must admit at least one reference-conforming implementation and at least one contract-violating implementation that passes basic functional tests. We apply the framework to three documented incidents and show that each informal diagnosis maps to a specific, falsifiable contract violation with a measurable signature.

What carries the argument

The eight-part kernel contract, which combines a reference oracle, tolerance bounds, and a measurable violation signature to formalize what an ML kernel must compute on heterogeneous hardware.
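One way to picture the eight-part structure is as a record with one field per part. The sketch below is a hypothetical rendering; the paper's concrete syntax is not reproduced on this page, so every field name and type here is an illustrative assumption:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class KernelContract:
    """Illustrative stand-in for the paper's eight-part contract."""
    identifier: str                          # unique name, e.g. "KC-PREC-01"
    scope: str                               # kernels and platforms covered
    precondition: Callable[..., bool]        # must hold on the inputs
    postcondition: Callable[..., bool]       # must hold on (inputs, outputs)
    tolerance: float                         # allowed deviation from the oracle
    reference_oracle: Callable[..., object]  # trusted reference computation
    measurement_protocol: str                # how deviation is measured
    violation_signature: str                 # observable pattern when violated
```

A contract suite would then be a collection of such records, each calibrated against at least one conforming and one violating implementation.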

If this is right

  • Every kernel disagreement across hardware platforms can be diagnosed as a specific contract violation with a measurable signature.
  • Conformance of any implementation can be graded against a contract suite using the reference oracle and violation signature.
  • The twelve classes supply a catalog of common failure modes that is already anchored in empirical data from published studies.
  • Kernel development and testing can shift from informal debugging to checking against explicit, arbitrable contracts.
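As a toy illustration of what grading against a contract could look like: run the reference oracle, measure the deviation, and emit the violation signature when the tolerance is exceeded. All names, the relative-error metric, and the deliberately sloppy kernel below are assumptions of this sketch, not the paper's own measurement protocol:

```python
def max_rel_error(candidate, reference, eps=1e-30):
    # Largest elementwise relative error of candidate vs. the oracle output.
    return max(abs(c - r) / max(abs(r), eps)
               for c, r in zip(candidate, reference))

def grade(kernel, oracle, inputs, tolerance, signature):
    # Grade one implementation against one contract: conforming if the
    # measured deviation is within tolerance, else report the signature.
    got = kernel(*inputs)
    want = oracle(*inputs)
    err = max_rel_error(got, want)
    return ("conforming", err) if err <= tolerance else (signature, err)

# Toy kernel that rounds its accumulator each step (simulated downcast).
def oracle_sum(xs):
    return [sum(xs)]

def sloppy_sum(xs):
    acc = 0.0
    for x in xs:
        acc = round(acc + x, 2)  # 0.001 always rounds back to 0.0
    return [acc]

verdict, err = grade(sloppy_sum, oracle_sum, ([0.001] * 1000,), 1e-6,
                     "KC-PREC: accumulator downcast")
# verdict == "KC-PREC: accumulator downcast", err == 1.0
```

The same `grade` call with a conforming implementation returns `("conforming", 0.0)`, which is the two-sided behavior the three-state calibration requires a contract to exhibit.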

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Automated verification tools could check new kernels against contract suites during compilation or runtime.
  • Hardware and software vendors might publish contract suites as part of their kernel libraries to certify compatibility.
  • The same contract structure could be applied to numerical kernels outside machine learning, such as linear algebra routines in scientific computing.

Load-bearing premise

That disagreements between ML kernels on different chips can always be captured by these eight contract parts and twelve classes, and that both reference-conforming and violating implementations can be found for practical calibration.

What would settle it

A documented case of kernels producing different results on different silicon platforms that cannot be expressed as a violation of any of the twelve contract classes or for which no reference-conforming and violating implementations can be identified.

Figures

Figures reproduced from arXiv: 2604.22032 by Cooper Veit.

Figure 1. Cross-silicon Freivalds verification overhead vs. matrix dimension, FP32.
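Figure 1 measures the overhead of Freivalds' probabilistic check, a standard way to verify a matmul result in O(n²) work per trial without recomputing the O(n³) product. A minimal pure-Python sketch follows; the paper's cross-silicon harness is not shown here, and the floating-point `tol` parameter is an assumption of this sketch, since the classical algorithm is exact over integers:

```python
import random

def matvec(M, v):
    # O(n^2) matrix-vector product over lists of lists.
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

def freivalds(A, B, C, trials=20, tol=1e-9):
    """Probabilistically check A @ B == C.

    Each trial draws a random 0/1 vector r and compares A @ (B @ r)
    with C @ r in O(n^2). A mismatch proves A @ B != C; agreement on
    all trials wrongly accepts with probability at most 2**-trials.
    """
    n = len(C[0])
    for _ in range(trials):
        r = [random.randint(0, 1) for _ in range(n)]
        lhs = matvec(A, matvec(B, r))
        rhs = matvec(C, r)
        if any(abs(x - y) > tol for x, y in zip(lhs, rhs)):
            return False  # definite violation
    return True  # conforming with high probability
```

This is the shape of check whose cost the figure plots against matrix dimension: per-trial work grows quadratically, so the relative overhead of verification shrinks as the kernel itself grows cubically.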
Original abstract

Every ML kernel ships with an implicit contract about what it computes. People rarely write the contract down. When two kernels disagree -- when a matmul on AMD produces a different gradient than the same matmul on NVIDIA, when a fused attention kernel silently downcasts an accumulator, when an out-of-bounds access returns zero on one stack and garbage on another -- there is no formal artifact to arbitrate the dispute. Recent empirical work has measured the gap across silicon platforms, but none of it specifies the contract being violated. We present a specification language for kernel contracts. A contract has eight parts: identifier, scope, precondition, postcondition, tolerance, reference oracle, measurement protocol, and violation signature. We use it to state twelve contract classes covering precision, ordering, compiler-induced, and exceptional-value failure modes, each grounded in published empirical evidence. We require a three-state calibration: every contract must admit at least one reference-conforming implementation and at least one contract-violating implementation that passes basic functional tests. We apply the framework to three documented incidents -- Huawei Ascend silent precision coercion, Sakana AI CUDA Engineer reward hacking, AMD out-of-bounds silent acceptance -- and show that each informal diagnosis maps to a specific contract violation with a measurable signature. A kernel contract suite is a normative reference against which conformance can be graded, in the way that ISASecure grades industrial control systems against IEC 62443.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes a specification language for ML kernel contracts to formalize implicit agreements about kernel behavior across heterogeneous silicon. Each contract has eight parts (identifier, scope, precondition, postcondition, tolerance, reference oracle, measurement protocol, violation signature). It defines twelve contract classes covering precision, ordering, compiler-induced, and exceptional-value failure modes, each grounded in published empirical evidence. The framework requires three-state calibration (reference-conforming implementation plus at least one violating implementation that still passes basic functional tests) and demonstrates the approach by mapping three documented incidents (Huawei Ascend silent precision coercion, Sakana AI CUDA Engineer reward hacking, AMD out-of-bounds silent acceptance) to specific contract violations with measurable signatures. A kernel contract suite is positioned as a normative reference analogous to ISASecure grading against IEC 62443.

Significance. If the framework is realized with the required calibration, it would provide a concrete, auditable mechanism for specifying and arbitrating kernel correctness across platforms, directly addressing measured cross-silicon discrepancies in ML workloads. The explicit three-state calibration requirement and grounding of the twelve classes in empirical literature are strengths that distinguish this from purely informal or post-hoc debugging approaches.

major comments (1)
  1. [Application to three documented incidents] In the application of the framework to the three documented incidents: the mappings are presented as diagnostic illustrations, but the manuscript does not exhibit the three-state calibration that the paper itself requires. No reference-conforming implementation and no contract-violating implementation (that still passes basic functional tests) are identified or constructed for any of the three cases, and no measurement protocol is executed to produce a violation signature. This leaves the central claim—that the eight-part structure plus twelve classes can arbitrate real kernel disagreements—without the concrete demonstration needed to establish feasibility.
minor comments (1)
  1. The derivation of the twelve contract classes from the cited empirical studies could be made more explicit, for example by adding a summary table that lists each class, its failure mode, and the specific published evidence used to motivate it.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and the constructive major comment. We address it point by point below.

Point-by-point responses
  1. Referee: In the application of the framework to the three documented incidents: the mappings are presented as diagnostic illustrations, but the manuscript does not exhibit the three-state calibration that the paper itself requires. No reference-conforming implementation and no contract-violating implementation (that still passes basic functional tests) are identified or constructed for any of the three cases, and no measurement protocol is executed to produce a violation signature. This leaves the central claim—that the eight-part structure plus twelve classes can arbitrate real kernel disagreements—without the concrete demonstration needed to establish feasibility.

    Authors: We agree that the three applications are diagnostic mappings rather than full three-state calibrations. The manuscript defines the calibration requirement as part of the framework for practical deployment and verification of new contracts, but the historical incidents are used only to illustrate how the eight-part structure and twelve classes can capture documented cross-silicon discrepancies. Reconstructing reference-conforming and violating implementations for these past cases would require proprietary hardware, exact software stacks, and measurement environments that are no longer accessible. We will revise the manuscript to explicitly distinguish illustrative mappings (which demonstrate expressiveness and grounding in empirical literature) from the calibration step required for new contracts, and we will add a forward-looking statement that full calibration is necessary to operationalize the framework for arbitration. This revision preserves the central claim about the specification language while acknowledging the gap in empirical demonstration for the examples. revision: yes

Circularity Check

0 steps flagged

No circularity: purely definitional framework with no derivations or self-referential reductions.

full rationale

The paper introduces an eight-part contract structure and twelve contract classes as a specification language, grounded in external published empirical evidence rather than any derivation or fit performed in the present work. No equations, predictions, or first-principles results are claimed; the three-state calibration is presented as a normative requirement for contracts to be usable, not as a result derived from the framework itself. The mappings of three incidents to contract violations are diagnostic illustrations, not reductions that equate outputs to inputs by construction. No self-citations appear as load-bearing premises, and the central claim (that the language can arbitrate disagreements) does not reduce to a tautology or renamed input. The work is self-contained as a definitional proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The paper introduces a new definitional structure for contracts without free parameters or external derivations; it relies on domain assumptions about kernel behavior and invents the contract entity itself.

axioms (2)
  • domain assumption Every ML kernel ships with an implicit contract about what it computes that is rarely written down.
    Opening premise of the abstract establishing the problem.
  • domain assumption Disagreements across silicon platforms can be arbitrated by formal contracts with measurable signatures.
    Core premise enabling the mapping of incidents to contract violations.
invented entities (2)
  • Kernel contract with eight parts (identifier, scope, precondition, postcondition, tolerance, reference oracle, measurement protocol, violation signature) no independent evidence
    purpose: To provide a formal artifact for specifying and checking kernel correctness
    Newly defined structure introduced in the paper.
  • Twelve contract classes covering precision, ordering, compiler-induced, and exceptional-value failure modes no independent evidence
    purpose: To categorize and ground specific failure modes in empirical evidence
    Invented classification scheme as part of the framework.

pith-pipeline@v0.9.0 · 5549 in / 1565 out tokens · 63349 ms · 2026-05-09T22:27:45.251181+00:00 · methodology


Reference graph

Works this paper leans on

30 extracted references · 14 canonical work pages

  1. [1]

    E. Wen, S. Ma, E. Tempero, B. Sham, Y. Song, H. Jia, D. Luo, J. Hua, J. Dietrich, K. Zhao, and J. Shen. Mind the Gap: Revealing Inconsistencies Across Heterogeneous AI Accelerators. arXiv preprint arXiv:2511.11601, October 2025

  2. [2]

    A. H. Zahid, I. Laguna, and W. Le. Testing GPU Numerics: Finding Numerical Differences Between NVIDIA and AMD GPUs. SC’24 Workshops, October 2024. LLNL-CONF-868447. arXiv preprint arXiv:2410.09172

  3. [3]

    Introducing gpt-5.4. https://openai.com/index/introducing-gpt-5-4/, 2026.

    A. Ouyang, H. Mao, T. Tang, P. Liang, A. Ng, and the Stanford Scaling Intelligence Lab. KernelBench: Can LLMs write efficient GPU kernels? arXiv preprint arXiv:2502.10517, February 2025.

  4. [4]

    Z. Wen, Y. Zhang, Z. Li, Z. Liu, L. Xie, and T. Zhang. MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation.arXiv preprint arXiv:2507.17773, July 2025

  5. [5]

    R. T. Lange, Q. Sun, A. Prasad, M. Faldor, Y. Tang, and D. Ha. Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization.arXiv preprint arXiv:2509.14279, September 2025. Sakana AI

  6. [6]

    S. Shanmugavelu, M. Taillefumier, C. Culver, O. Hernandez, M. Coletti, and A. Sedova. Impacts of floating-point non-associativity on reproducibility for HPC and deep learning applications. arXiv preprint arXiv:2408.05148, August 2024. Oak Ridge National Laboratory and CSCS.

  7. [7]

    J. Ma, H. Pei, L. Lausen, and G. Karypis. Understanding silent data corruption in LLM training.arXiv preprint arXiv:2502.12340, February 2025. AWS

  8. [8]

    X. Qiang, H. Chen, S. Sun, J. Leng, X. Liu, and M. Guo. DASH: Deterministic attention scheduling for high-throughput reproducible LLM training. ICLR 2026. Also arXiv preprint arXiv:2601.21824. Shanghai Jiao Tong University and ByteDance Seed.

  9. [9]

    J. Yuan, H. Li, X. Ding, W. Xie, Y.-J. Li, W. Zhao, K. Wan, J. Shi, X. Hu, and Z. Liu. Understanding and mitigating numerical sources of nondeterminism in LLM inference. NeurIPS

  10. [10]

    Also arXiv preprint arXiv:2506.09501, October 2025.

  11. [11]

    H. D. Dixit, S. Pendharkar, M. Beadon, C. Mason, T. Chakravarthy, B. Muthiah, and S. Sankar. Silent data corruptions at scale.arXiv preprint arXiv:2102.11245, February 2021 (updated 2024). Meta

  12. [12]

    N. George et al. (authors from NVIDIA, AMD, ARM, Google, Intel, Meta, and Microsoft). Silent Data Corruption in AI. Open Compute Project Foundation whitepaper, August 26, 2025. https://www.opencompute.org/documents/sdc-in-ai-ocp-whitepaper-final-pdf

  13. [13]

    H. He and Thinking Machines Lab. Defeating nondeterminism in LLM inference. Blog post and GitHub repository thinking-machines-lab/batch-invariant-ops, September 10, 2025.

  14. [14]

    K. Dubey, B. Driscoll, A. Wei, N. Kayal, R. Sharma, and A. Aiken. Equivalence checking of ML GPU kernels. arXiv preprint arXiv:2511.12638v2, November 2025. Microsoft Research India and Stanford.

  15. [15]

    J. Demmel and H. D. Nguyen. Fast reproducible floating-point summation. IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2013. See also ReproBLAS project, UC Berkeley.

  16. [16]

    J. R. Shewchuk. Adaptive precision floating-point arithmetic and fast robust geometric predicates. Discrete & Computational Geometry 18:305–363, 1997. DOI: 10.1007/PL00009321.

  17. [17]

    R. Freivalds. Probabilistic machines can use less running time. Information Processing 77: Proceedings of IFIP Congress 77, B. Gilchrist (ed.), North-Holland, pp. 839–842, 1977.

  18. [18]

    J. Bates. Processing with compact arithmetic processing element. U.S. Patent No. 8,407,273 B2, issued March 26, 2013. Assignee: Singular Computing LLC

  19. [19]

    J. Bates. Processing with compact arithmetic processing element. U.S. Patent No. 11,327,715 B2, issued May 10, 2022. Assignee: Singular Computing LLC

  20. [20]

    International Organization for Standardization. ISO/PAS 8800:2024 Road vehicles—Safety and artificial intelligence. ISO, 2024.

  21. [21]

    Y. Ikarashi, K. Qian, S. Droubi, A. Reinking, G. Bernstein, and J. Ragan-Kelley. Exo 2: Growing a Scheduling Language. ASPLOS 2025. Also arXiv:2411.07211, November 2024.

  22. [22]

    J. Shah, G. Bikshandi, Y. Zhang, V. Thakkar, P. Ramani, and T. Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. NeurIPS 2024.

  23. [23]

    P. Tillet, H. T. Kung, and D. Cox. Triton: An intermediate language and compiler for tiled neural network computations. MAPL 2019, ACM, 2019.

  24. [24]

    T. Chen, T. Moreau, Z. Jiang, et al. TVM: An automated end-to-end optimizing compiler for deep learning. OSDI 2018, USENIX, 2018.

  25. [25]

    IEEE Standards Committee. IEEE Standard for Floating-Point Arithmetic. IEEE 754-2019, IEEE, 2019.

  26. [26]

    C. Veit. Good tech, messy spec: Evaluating whether coding agents follow document-grounded requirements. Ashiba Research Technical Report, April 2026.

  27. [27]

    J. Wang and J. Huang. Reward hacking as equilibrium under finite evaluation. arXiv preprint arXiv:2603.28063, March 2026.

  28. [28]

    MLCommons. MLPerf Training and Inference Benchmarks. mlcommons.org, 2018–present.

  29. [29]

    ISA Security Compliance Institute. ISASecure certification scheme for IEC 62443. isasecure.org, 2010–present.

  30. [30]

    International Organization for Standardization. ISO 26262: Road vehicles—Functional safety. ISO, 2018.