pith. machine review for the scientific record.

arxiv: 2604.22032 · v1 · submitted 2026-04-23 · 💻 cs.LG · cs.PL


Kernel Contracts: A Specification Language for ML Kernel Correctness Across Heterogeneous Silicon


Pith reviewed 2026-05-09 22:27 UTC · model grok-4.3

classification 💻 cs.LG cs.PL
keywords kernel contracts · specification language · ML kernels · heterogeneous silicon · precision tolerance · conformance testing · compiler-induced failures · exceptional values

The pith

ML kernel disagreements across silicon can be arbitrated by writing explicit eight-part contracts that define expected behavior and measurable violations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a specification language for making the implicit contracts of machine learning kernels explicit and testable. Each contract is built from eight required parts: an identifier, scope, precondition, postcondition, tolerance for differences, a reference oracle, a measurement protocol, and a violation signature. Twelve reusable contract classes are defined to cover precision, ordering, compiler-induced, and exceptional-value problems, each drawn from published empirical measurements of real hardware differences. The approach is demonstrated by mapping three documented incidents on Huawei, Sakana AI, and AMD platforms onto specific, measurable contract violations. If the language works as described, kernel conformance could be graded against a contract suite in the same way industrial systems are certified against security standards.

Core claim

We present a specification language for kernel contracts. A contract has eight parts: identifier, scope, precondition, postcondition, tolerance, reference oracle, measurement protocol, and violation signature. We use it to state twelve contract classes covering precision, ordering, compiler-induced, and exceptional-value failure modes, each grounded in published empirical evidence. We require a three-state calibration: every contract must admit at least one reference-conforming implementation and at least one contract-violating implementation that passes basic functional tests. We apply the framework to three documented incidents and show that each informal diagnosis maps to a specific, falsifiable contract violation with a measurable signature.

What carries the argument

The eight-part kernel contract, which combines a reference oracle, tolerance bounds, and a measurable violation signature to formalize what an ML kernel must compute on heterogeneous hardware.
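One way to picture the eight-part structure is as a record with one field per part. The sketch below is a hypothetical rendering; the paper's concrete syntax is not reproduced on this page, so every field name and type here is an illustrative assumption:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class KernelContract:
    """Illustrative stand-in for the paper's eight-part contract."""
    identifier: str                          # unique name, e.g. "KC-PREC-01"
    scope: str                               # kernels and platforms covered
    precondition: Callable[..., bool]        # must hold on the inputs
    postcondition: Callable[..., bool]       # must hold on (inputs, outputs)
    tolerance: float                         # allowed deviation from the oracle
    reference_oracle: Callable[..., object]  # trusted reference computation
    measurement_protocol: str                # how deviation is measured
    violation_signature: str                 # observable pattern when violated
```

A contract suite would then be a collection of such records, each calibrated against at least one conforming and one violating implementation.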

If this is right

  • Every kernel disagreement across hardware platforms can be diagnosed as a specific contract violation with a measurable signature.
  • Conformance of any implementation can be graded against a contract suite using the reference oracle and violation signature.
  • The twelve classes supply a catalog of common failure modes that is already anchored in empirical data from published studies.
  • Kernel development and testing can shift from informal debugging to checking against explicit, arbitrable contracts.
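As a toy illustration of what grading against a contract could look like: run the reference oracle, measure the deviation, and emit the violation signature when the tolerance is exceeded. All names, the relative-error metric, and the deliberately sloppy kernel below are assumptions of this sketch, not the paper's own measurement protocol:

```python
def max_rel_error(candidate, reference, eps=1e-30):
    # Largest elementwise relative error of candidate vs. the oracle output.
    return max(abs(c - r) / max(abs(r), eps)
               for c, r in zip(candidate, reference))

def grade(kernel, oracle, inputs, tolerance, signature):
    # Grade one implementation against one contract: conforming if the
    # measured deviation is within tolerance, else report the signature.
    got = kernel(*inputs)
    want = oracle(*inputs)
    err = max_rel_error(got, want)
    return ("conforming", err) if err <= tolerance else (signature, err)

# Toy kernel that rounds its accumulator each step (simulated downcast).
def oracle_sum(xs):
    return [sum(xs)]

def sloppy_sum(xs):
    acc = 0.0
    for x in xs:
        acc = round(acc + x, 2)  # 0.001 always rounds back to 0.0
    return [acc]

verdict, err = grade(sloppy_sum, oracle_sum, ([0.001] * 1000,), 1e-6,
                     "KC-PREC: accumulator downcast")
# verdict == "KC-PREC: accumulator downcast", err == 1.0
```

The same `grade` call with a conforming implementation returns `("conforming", 0.0)`, which is the two-sided behavior the three-state calibration requires a contract to exhibit.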

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Automated verification tools could check new kernels against contract suites during compilation or runtime.
  • Hardware and software vendors might publish contract suites as part of their kernel libraries to certify compatibility.
  • The same contract structure could be applied to numerical kernels outside machine learning, such as linear algebra routines in scientific computing.

Load-bearing premise

That disagreements between ML kernels on different chips can always be captured by these eight contract parts and twelve classes, and that both reference-conforming and violating implementations can be found for practical calibration.

What would settle it

A documented case of kernels producing different results on different silicon platforms that cannot be expressed as a violation of any of the twelve contract classes or for which no reference-conforming and violating implementations can be identified.

Figures

Figures reproduced from arXiv: 2604.22032 by Cooper Veit.

Figure 1. Cross-silicon Freivalds verification overhead vs. matrix dimension, FP32.
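Figure 1 measures the overhead of Freivalds' probabilistic check, a standard way to verify a matmul result in O(n²) work per trial without recomputing the O(n³) product. A minimal pure-Python sketch follows; the paper's cross-silicon harness is not shown here, and the floating-point `tol` parameter is an assumption of this sketch, since the classical algorithm is exact over integers:

```python
import random

def matvec(M, v):
    # O(n^2) matrix-vector product over lists of lists.
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

def freivalds(A, B, C, trials=20, tol=1e-9):
    """Probabilistically check A @ B == C.

    Each trial draws a random 0/1 vector r and compares A @ (B @ r)
    with C @ r in O(n^2). A mismatch proves A @ B != C; agreement on
    all trials wrongly accepts with probability at most 2**-trials.
    """
    n = len(C[0])
    for _ in range(trials):
        r = [random.randint(0, 1) for _ in range(n)]
        lhs = matvec(A, matvec(B, r))
        rhs = matvec(C, r)
        if any(abs(x - y) > tol for x, y in zip(lhs, rhs)):
            return False  # definite violation
    return True  # conforming with high probability
```

This is the shape of check whose cost the figure plots against matrix dimension: per-trial work grows quadratically, so the relative overhead of verification shrinks as the kernel itself grows cubically.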
Original abstract

Every ML kernel ships with an implicit contract about what it computes. People rarely write the contract down. When two kernels disagree -- when a matmul on AMD produces a different gradient than the same matmul on NVIDIA, when a fused attention kernel silently downcasts an accumulator, when an out-of-bounds access returns zero on one stack and garbage on another -- there is no formal artifact to arbitrate the dispute. Recent empirical work has measured the gap across silicon platforms, but none of it specifies the contract being violated. We present a specification language for kernel contracts. A contract has eight parts: identifier, scope, precondition, postcondition, tolerance, reference oracle, measurement protocol, and violation signature. We use it to state twelve contract classes covering precision, ordering, compiler-induced, and exceptional-value failure modes, each grounded in published empirical evidence. We require a three-state calibration: every contract must admit at least one reference-conforming implementation and at least one contract-violating implementation that passes basic functional tests. We apply the framework to three documented incidents -- Huawei Ascend silent precision coercion, Sakana AI CUDA Engineer reward hacking, AMD out-of-bounds silent acceptance -- and show that each informal diagnosis maps to a specific contract violation with a measurable signature. A kernel contract suite is a normative reference against which conformance can be graded, in the way that ISASecure grades industrial control systems against IEC 62443.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes a specification language for ML kernel contracts to formalize implicit agreements about kernel behavior across heterogeneous silicon. Each contract has eight parts (identifier, scope, precondition, postcondition, tolerance, reference oracle, measurement protocol, violation signature). It defines twelve contract classes covering precision, ordering, compiler-induced, and exceptional-value failure modes, each grounded in published empirical evidence. The framework requires three-state calibration (reference-conforming implementation plus at least one violating implementation that still passes basic functional tests) and demonstrates the approach by mapping three documented incidents (Huawei Ascend silent precision coercion, Sakana AI CUDA Engineer reward hacking, AMD out-of-bounds silent acceptance) to specific contract violations with measurable signatures. A kernel contract suite is positioned as a normative reference analogous to ISASecure grading against IEC 62443.

Significance. If the framework is realized with the required calibration, it would provide a concrete, auditable mechanism for specifying and arbitrating kernel correctness across platforms, directly addressing measured cross-silicon discrepancies in ML workloads. The explicit three-state calibration requirement and grounding of the twelve classes in empirical literature are strengths that distinguish this from purely informal or post-hoc debugging approaches.

major comments (1)
  1. [Application to three documented incidents] In the application of the framework to the three documented incidents: the mappings are presented as diagnostic illustrations, but the manuscript does not exhibit the three-state calibration that the paper itself requires. No reference-conforming implementation and no contract-violating implementation (that still passes basic functional tests) are identified or constructed for any of the three cases, and no measurement protocol is executed to produce a violation signature. This leaves the central claim—that the eight-part structure plus twelve classes can arbitrate real kernel disagreements—without the concrete demonstration needed to establish feasibility.
minor comments (1)
  1. The derivation of the twelve contract classes from the cited empirical studies could be made more explicit, for example by adding a summary table that lists each class, its failure mode, and the specific published evidence used to motivate it.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and the constructive major comment. We address it point by point below.

Point-by-point responses
  1. Referee: In the application of the framework to the three documented incidents: the mappings are presented as diagnostic illustrations, but the manuscript does not exhibit the three-state calibration that the paper itself requires. No reference-conforming implementation and no contract-violating implementation (that still passes basic functional tests) are identified or constructed for any of the three cases, and no measurement protocol is executed to produce a violation signature. This leaves the central claim—that the eight-part structure plus twelve classes can arbitrate real kernel disagreements—without the concrete demonstration needed to establish feasibility.

    Authors: We agree that the three applications are diagnostic mappings rather than full three-state calibrations. The manuscript defines the calibration requirement as part of the framework for practical deployment and verification of new contracts, but the historical incidents are used only to illustrate how the eight-part structure and twelve classes can capture documented cross-silicon discrepancies. Reconstructing reference-conforming and violating implementations for these past cases would require proprietary hardware, exact software stacks, and measurement environments that are no longer accessible. We will revise the manuscript to explicitly distinguish illustrative mappings (which demonstrate expressiveness and grounding in empirical literature) from the calibration step required for new contracts, and we will add a forward-looking statement that full calibration is necessary to operationalize the framework for arbitration. This revision preserves the central claim about the specification language while acknowledging the gap in empirical demonstration for the examples. revision: yes

Circularity Check

0 steps flagged

No circularity: purely definitional framework with no derivations or self-referential reductions.

full rationale

The paper introduces an eight-part contract structure and twelve contract classes as a specification language, grounded in external published empirical evidence rather than any derivation or fit performed in the present work. No equations, predictions, or first-principles results are claimed; the three-state calibration is presented as a normative requirement for contracts to be usable, not as a result derived from the framework itself. The mappings of three incidents to contract violations are diagnostic illustrations, not reductions that equate outputs to inputs by construction. No self-citations appear as load-bearing premises, and the central claim (that the language can arbitrate disagreements) does not reduce to a tautology or renamed input. The work is self-contained as a definitional proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The paper introduces a new definitional structure for contracts without free parameters or external derivations; it relies on domain assumptions about kernel behavior and invents the contract entity itself.

axioms (2)
  • domain assumption Every ML kernel ships with an implicit contract about what it computes that is rarely written down.
    Opening premise of the abstract establishing the problem.
  • domain assumption Disagreements across silicon platforms can be arbitrated by formal contracts with measurable signatures.
    Core premise enabling the mapping of incidents to contract violations.
invented entities (2)
  • Kernel contract with eight parts (identifier, scope, precondition, postcondition, tolerance, reference oracle, measurement protocol, violation signature) no independent evidence
    purpose: To provide a formal artifact for specifying and checking kernel correctness
    Newly defined structure introduced in the paper.
  • Twelve contract classes covering precision, ordering, compiler-induced, and exceptional-value failure modes no independent evidence
    purpose: To categorize and ground specific failure modes in empirical evidence
    Invented classification scheme as part of the framework.

pith-pipeline@v0.9.0 · 5549 in / 1565 out tokens · 63349 ms · 2026-05-09T22:27:45.251181+00:00 · methodology


Reference graph

Works this paper leans on

30 extracted references · 14 canonical work pages

  1. [1]

    E. Wen, S. Ma, E. Tempero, B. Sham, Y. Song, H. Jia, D. Luo, J. Hua, J. Dietrich, K. Zhao, and J. Shen. Mind the Gap: Revealing Inconsistencies Across Heterogeneous AI Accelerators. arXiv preprint arXiv:2511.11601, October 2025

  2. [2]

    A. H. Zahid, I. Laguna, and W. Le. Testing GPU Numerics: Finding Numerical Differences Between NVIDIA and AMD GPUs. SC’24 Workshops, October 2024. LLNL-CONF-868447. arXiv preprint arXiv:2410.09172

  3. [3]

    Introducing gpt-5.4. https://openai.com/index/introducing-gpt-5-4/, 2026.

    A. Ouyang, H. Mao, T. Tang, P. Liang, A. Ng, and the Stanford Scaling Intelligence Lab. KernelBench: Can LLMs write efficient GPU kernels? arXiv preprint arXiv:2502.10517, February 2025.

  4. [4]

    Z. Wen, Y. Zhang, Z. Li, Z. Liu, L. Xie, and T. Zhang. MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation.arXiv preprint arXiv:2507.17773, July 2025

  5. [5]

    R. T. Lange, Q. Sun, A. Prasad, M. Faldor, Y. Tang, and D. Ha. Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization.arXiv preprint arXiv:2509.14279, September 2025. Sakana AI

  6. [6]

    S. Shanmugavelu, M. Taillefumier, C. Culver, O. Hernandez, M. Coletti, and A. Sedova. Impacts of floating-point non-associativity on reproducibility for HPC and deep learning applications. arXiv preprint arXiv:2408.05148, August 2024. Oak Ridge National Laboratory and CSCS.

  7. [7]

    J. Ma, H. Pei, L. Lausen, and G. Karypis. Understanding silent data corruption in LLM training.arXiv preprint arXiv:2502.12340, February 2025. AWS

  8. [8]

    X. Qiang, H. Chen, S. Sun, J. Leng, X. Liu, and M. Guo. DASH: Deterministic attention scheduling for high-throughput reproducible LLM training. ICLR 2026. Also arXiv preprint arXiv:2601.21824. Shanghai Jiao Tong University and ByteDance Seed.

  9. [9]

    J. Yuan, H. Li, X. Ding, W. Xie, Y.-J. Li, W. Zhao, K. Wan, J. Shi, X. Hu, and Z. Liu. Understanding and mitigating numerical sources of nondeterminism in LLM inference. NeurIPS

  10. [10]

    Also arXiv preprint arXiv:2506.09501, October 2025.

  11. [11]

    H. D. Dixit, S. Pendharkar, M. Beadon, C. Mason, T. Chakravarthy, B. Muthiah, and S. Sankar. Silent data corruptions at scale.arXiv preprint arXiv:2102.11245, February 2021 (updated 2024). Meta

  12. [12]

    N. George et al. (authors from NVIDIA, AMD, ARM, Google, Intel, Meta, and Microsoft). Silent Data Corruption in AI. Open Compute Project Foundation whitepaper, August 26, 2025. https://www.opencompute.org/documents/sdc-in-ai-ocp-whitepaper-final-pdf

  13. [13]

    H. He and Thinking Machines Lab. Defeating nondeterminism in LLM inference. Blog post and GitHub repository thinking-machines-lab/batch-invariant-ops, September 10, 2025.

  14. [14]

    K. Dubey, B. Driscoll, A. Wei, N. Kayal, R. Sharma, and A. Aiken. Equivalence checking of ML GPU kernels. arXiv preprint arXiv:2511.12638v2, November 2025. Microsoft Research India and Stanford.

  15. [15]

    J. Demmel and H. D. Nguyen. Fast reproducible floating-point summation. IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2013. See also ReproBLAS project, UC Berkeley.

  16. [16]

    J. R. Shewchuk. Adaptive precision floating-point arithmetic and fast robust geometric predicates. Discrete & Computational Geometry 18:305–363, 1997. DOI: 10.1007/PL00009321.

  17. [17]

    R. Freivalds. Probabilistic machines can use less running time. Information Processing 77: Proceedings of IFIP Congress 77, B. Gilchrist (ed.), North-Holland, pp. 839–842, 1977.

  18. [18]

    J. Bates. Processing with compact arithmetic processing element. U.S. Patent No. 8,407,273 B2, issued March 26, 2013. Assignee: Singular Computing LLC

  19. [19]

    J. Bates. Processing with compact arithmetic processing element. U.S. Patent No. 11,327,715 B2, issued May 10, 2022. Assignee: Singular Computing LLC

  20. [20]

    International Organization for Standardization. ISO/PAS 8800:2024 Road vehicles—Safety and artificial intelligence. ISO, 2024.

  21. [21]

    Y. Ikarashi, K. Qian, S. Droubi, A. Reinking, G. Bernstein, and J. Ragan-Kelley. Exo 2: Growing a Scheduling Language. ASPLOS 2025. Also arXiv:2411.07211, November 2024.

  22. [22]

    J. Shah, G. Bikshandi, Y. Zhang, V. Thakkar, P. Ramani, and T. Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. NeurIPS 2024.

  23. [23]

    P. Tillet, H. T. Kung, and D. Cox. Triton: An intermediate language and compiler for tiled neural network computations. MAPL 2019, ACM, 2019.

  24. [24]

    T. Chen, T. Moreau, Z. Jiang, et al. TVM: An automated end-to-end optimizing compiler for deep learning. OSDI 2018, USENIX, 2018.

  25. [25]

    IEEE Standards Committee. IEEE Standard for Floating-Point Arithmetic. IEEE 754-2019, IEEE, 2019.

  26. [26]

    C. Veit. Good tech, messy spec: Evaluating whether coding agents follow document-grounded requirements. Ashiba Research Technical Report, April 2026.

  27. [27]

    J. Wang and J. Huang. Reward hacking as equilibrium under finite evaluation. arXiv preprint arXiv:2603.28063, March 2026.

  28. [28]

    MLCommons. MLPerf Training and Inference Benchmarks. mlcommons.org, 2018–present.

  29. [29]

    ISA Security Compliance Institute. ISASecure certification scheme for IEC 62443. isasecure.org, 2010–present.

  30. [30]

    International Organization for Standardization. ISO 26262: Road vehicles—Functional safety. ISO, 2018.