Recognition: unknown
Kernel Contracts: A Specification Language for ML Kernel Correctness Across Heterogeneous Silicon
Pith reviewed 2026-05-09 22:27 UTC · model grok-4.3
The pith
ML kernel disagreements across silicon can be arbitrated by writing explicit eight-part contracts that define expected behavior and measurable violations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present a specification language for kernel contracts. A contract has eight parts: identifier, scope, precondition, postcondition, tolerance, reference oracle, measurement protocol, and violation signature. We use it to state twelve contract classes covering precision, ordering, compiler-induced, and exceptional-value failure modes, each grounded in published empirical evidence. We require a three-state calibration: every contract must admit at least one reference-conforming implementation and at least one contract-violating implementation that passes basic functional tests. We apply the framework to three documented incidents and show that each informal diagnosis maps to a specific, fals
What carries the argument
The eight-part kernel contract, which combines a reference oracle, tolerance bounds, and a measurable violation signature to formalize what an ML kernel must compute on heterogeneous hardware.
If this is right
- Every kernel disagreement across hardware platforms can be diagnosed as a specific contract violation with a measurable signature.
- Conformance of any implementation can be graded against a contract suite using the reference oracle and violation signature.
- The twelve classes supply a catalog of common failure modes that is already anchored in empirical data from published studies.
- Kernel development and testing can shift from informal debugging to checking against explicit, arbitrable contracts.
Where Pith is reading between the lines
- Automated verification tools could check new kernels against contract suites during compilation or runtime.
- Hardware and software vendors might publish contract suites as part of their kernel libraries to certify compatibility.
- The same contract structure could be applied to numerical kernels outside machine learning, such as linear algebra routines in scientific computing.
Load-bearing premise
That disagreements between ML kernels on different chips can always be captured by these eight contract parts and twelve classes, and that both reference-conforming and violating implementations can be found for practical calibration.
What would settle it
A documented case of kernels producing different results on different silicon platforms that cannot be expressed as a violation of any of the twelve contract classes or for which no reference-conforming and violating implementations can be identified.
Figures
read the original abstract
Every ML kernel ships with an implicit contract about what it computes. People rarely write the contract down. When two kernels disagree -- when a matmul on AMD produces a different gradient than the same matmul on NVIDIA, when a fused attention kernel silently downcasts an accumulator, when an out-of-bounds access returns zero on one stack and garbage on another -- there is no formal artifact to arbitrate the dispute. Recent empirical work has measured the gap across silicon platforms, but none of it specifies the contract being violated. We present a specification language for kernel contracts. A contract has eight parts: identifier, scope, precondition, postcondition, tolerance, reference oracle, measurement protocol, and violation signature. We use it to state twelve contract classes covering precision, ordering, compiler-induced, and exceptional-value failure modes, each grounded in published empirical evidence. We require a three-state calibration: every contract must admit at least one reference-conforming implementation and at least one contract-violating implementation that passes basic functional tests. We apply the framework to three documented incidents -- Huawei Ascend silent precision coercion, Sakana AI CUDA Engineer reward hacking, AMD out-of-bounds silent acceptance -- and show that each informal diagnosis maps to a specific contract violation with a measurable signature. A kernel contract suite is a normative reference against which conformance can be graded, in the way that ISASecure grades industrial control systems against IEC 62443.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a specification language for ML kernel contracts to formalize implicit agreements about kernel behavior across heterogeneous silicon. Each contract has eight parts (identifier, scope, precondition, postcondition, tolerance, reference oracle, measurement protocol, violation signature). It defines twelve contract classes covering precision, ordering, compiler-induced, and exceptional-value failure modes, each grounded in published empirical evidence. The framework requires three-state calibration (reference-conforming implementation plus at least one violating implementation that still passes basic functional tests) and demonstrates the approach by mapping three documented incidents (Huawei Ascend silent precision coercion, Sakana AI CUDA Engineer reward hacking, AMD out-of-bounds silent acceptance) to specific contract violations with measurable signatures. A kernel contract suite is positioned as a normative reference analogous to ISASecure grading against IEC 62443.
Significance. If the framework is realized with the required calibration, it would provide a concrete, auditable mechanism for specifying and arbitrating kernel correctness across platforms, directly addressing measured cross-silicon discrepancies in ML workloads. The explicit three-state calibration requirement and grounding of the twelve classes in empirical literature are strengths that distinguish this from purely informal or post-hoc debugging approaches.
major comments (1)
- [Application to three documented incidents] In the application of the framework to the three documented incidents: the mappings are presented as diagnostic illustrations, but the manuscript does not exhibit the three-state calibration that the paper itself requires. No reference-conforming implementation and no contract-violating implementation (that still passes basic functional tests) are identified or constructed for any of the three cases, and no measurement protocol is executed to produce a violation signature. This leaves the central claim—that the eight-part structure plus twelve classes can arbitrate real kernel disagreements—without the concrete demonstration needed to establish feasibility.
minor comments (1)
- The derivation of the twelve contract classes from the cited empirical studies could be made more explicit, for example by adding a summary table that lists each class, its failure mode, and the specific published evidence used to motivate it.
Simulated Author's Rebuttal
We thank the referee for the careful reading and the constructive major comment. We address it point by point below.
read point-by-point responses
-
Referee: In the application of the framework to the three documented incidents: the mappings are presented as diagnostic illustrations, but the manuscript does not exhibit the three-state calibration that the paper itself requires. No reference-conforming implementation and no contract-violating implementation (that still passes basic functional tests) are identified or constructed for any of the three cases, and no measurement protocol is executed to produce a violation signature. This leaves the central claim—that the eight-part structure plus twelve classes can arbitrate real kernel disagreements—without the concrete demonstration needed to establish feasibility.
Authors: We agree that the three applications are diagnostic mappings rather than full three-state calibrations. The manuscript defines the calibration requirement as part of the framework for practical deployment and verification of new contracts, but the historical incidents are used only to illustrate how the eight-part structure and twelve classes can capture documented cross-silicon discrepancies. Reconstructing reference-conforming and violating implementations for these past cases would require proprietary hardware, exact software stacks, and measurement environments that are no longer accessible. We will revise the manuscript to explicitly distinguish illustrative mappings (which demonstrate expressiveness and grounding in empirical literature) from the calibration step required for new contracts, and we will add a forward-looking statement that full calibration is necessary to operationalize the framework for arbitration. This revision preserves the central claim about the specification language while acknowledging the gap in empirical demonstration for the examples. revision: yes
Circularity Check
No circularity: purely definitional framework with no derivations or self-referential reductions.
full rationale
The paper introduces an eight-part contract structure and twelve contract classes as a specification language, grounded in external published empirical evidence rather than any derivation or fit performed in the present work. No equations, predictions, or first-principles results are claimed; the three-state calibration is presented as a normative requirement for contracts to be usable, not as a result derived from the framework itself. The mappings of three incidents to contract violations are diagnostic illustrations, not reductions that equate outputs to inputs by construction. No self-citations appear as load-bearing premises, and the central claim (that the language can arbitrate disagreements) does not reduce to a tautology or renamed input. The work is self-contained as a definitional proposal.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Every ML kernel ships with an implicit contract about what it computes that is rarely written down.
- domain assumption Disagreements across silicon platforms can be arbitrated by formal contracts with measurable signatures.
invented entities (2)
-
Kernel contract with eight parts (identifier, scope, precondition, postcondition, tolerance, reference oracle, measurement protocol, violation signature)
no independent evidence
-
Twelve contract classes covering precision, ordering, compiler-induced, and exceptional-value failure modes
no independent evidence
Reference graph
Works this paper leans on
- [1]
- [2]
-
[3]
Introducing gpt-5.4.https://openai.com/index/introducing-gpt-5-4/, 2026a
A. Ouyang, H. Mao, T. Tang, P. Liang, A. Ng, and the Stanford Scaling Intelligence Lab. KernelBench: Can LLMs write efficient GPU kernels?arXiv preprint arXiv:2502.10517, February 2025
- [4]
- [5]
-
[6]
S. Shanmugavelu, M. Taillefumier, C. Culver, O. Hernandez, M. Coletti, and A. Sedova. Im- pacts of floating-point non-associativity on reproducibility for HPC and deep learning appli- cations.arXiv preprint arXiv:2408.05148, August 2024. Oak Ridge National Laboratory and CSCS
- [7]
- [8]
-
[9]
J. Yuan, H. Li, X. Ding, W. Xie, Y.-J. Li, W. Zhao, K. Wan, J. Shi, X. Hu, and Z. Liu. Un- derstanding and mitigating numerical sources of nondeterminism in LLM inference.NeurIPS
- [10]
- [11]
-
[12]
George et al
N. George et al. (authors from NVIDIA, AMD, ARM, Google, Intel, Meta, and Microsoft). Silent Data Corruption in AI. Open Compute Project Foundation whitepaper, August 26, 2025.https://www.opencompute.org/documents/sdc-in-ai-ocp-whitepaper-final-pdf 27
2025
-
[13]
He and Thinking Machines Lab
H. He and Thinking Machines Lab. Defeating nondeterminism in LLM inference. Blog post and GitHub repositorythinking-machines-lab/batch-invariant-ops, September 10, 2025
2025
- [14]
-
[15]
Demmel and H
J. Demmel and H. D. Nguyen. Fast reproducible floating-point summation.IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2013. See also ReproBLAS project, UC Berkeley
2013
-
[16]
J. R. Shewchuk. Adaptive precision floating-point arithmetic and fast robust geometric pred- icates.Discrete & Computational Geometry18:305–363, 1997. DOI: 10.1007/PL00009321
-
[17]
Freivalds
R. Freivalds. Probabilistic machines can use less running time.Information Processing 77: Proceedings of IFIP Congress 77, B. Gilchrist (ed.), North-Holland, pp. 839–842, 1977
1977
-
[18]
J. Bates. Processing with compact arithmetic processing element. U.S. Patent No. 8,407,273 B2, issued March 26, 2013. Assignee: Singular Computing LLC
2013
-
[19]
J. Bates. Processing with compact arithmetic processing element. U.S. Patent No. 11,327,715 B2, issued May 10, 2022. Assignee: Singular Computing LLC
2022
-
[20]
ISO/PAS 8800:2024 Road vehicles—Safety and artificial intelligence.ISO, 2024
International Organization for Standardization. ISO/PAS 8800:2024 Road vehicles—Safety and artificial intelligence.ISO, 2024
2024
-
[21]
Y. Ikarashi, K. Qian, S. Droubi, A. Reinking, G. Bernstein, and J. Ragan-Kelley. Exo 2: Growing a Scheduling Language.ASPLOS 2025. AlsoarXiv:2411.07211, November 2024
-
[22]
J. Shah, G. Bikshandi, Y. Zhang, V. Thakkar, P. Ramani, and T. Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision.NeurIPS 2024, 2024
2024
-
[23]
Tillet, H
P. Tillet, H. T. Kung, and D. Cox. Triton: An intermediate language and compiler for tiled neural network computations.MAPL 2019, ACM, 2019
2019
-
[24]
T. Chen, T. Moreau, Z. Jiang, et al. TVM: An automated end-to-end optimizing compiler for deep learning.OSDI 2018, USENIX, 2018
2018
-
[25]
IEEE Standard for Floating-Point Arithmetic.IEEE 754-2019, IEEE, 2019
IEEE Standards Committee. IEEE Standard for Floating-Point Arithmetic.IEEE 754-2019, IEEE, 2019
2019
-
[26]
C. Veit. Good tech, messy spec: Evaluating whether coding agents follow document-grounded requirements.Ashiba Research Technical Report, April 2026
2026
-
[27]
J. Wang and J. Huang. Reward hacking as equilibrium under finite evaluation.arXiv preprint arXiv:2603.28063, March 2026
-
[28]
MLPerf Training and Inference Benchmarks.mlcommons.org, 2018–present
MLCommons. MLPerf Training and Inference Benchmarks.mlcommons.org, 2018–present
2018
-
[29]
ISASecure certification scheme for IEC 62443.isase- cure.org, 2010–present
ISA Security Compliance Institute. ISASecure certification scheme for IEC 62443.isase- cure.org, 2010–present
2010
-
[30]
ISO 26262: Road vehicles—Functional safety
International Organization for Standardization. ISO 26262: Road vehicles—Functional safety. ISO, 2018. 28
2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.