The Correctness Illusion in LLM-Generated GPU Kernels

Dipankar Sarkar

arXiv: 2606.20128 · v1 · pith:HRXW3HW5new · submitted 2026-06-18 · 💻 cs.SE · cs.DC · cs.LG

Pith reviewed 2026-06-26 16:30 UTC · model grok-4.3

classification 💻 cs.SE · cs.DC · cs.LG

keywords LLM-generated kernels · correctness oracle · GPU kernels · fuzzing · Triton · transcription errors · benchmark evaluation

The pith

Fixed-shape allclose checks in LLM GPU kernel benchmarks pass transcription-error bugs that a fuzzing oracle detects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a controlled set of 15 correct Triton kernels and 9 variants seeded with documented LLM-style transcription errors. It shows that the allclose-on-one-shape oracle used by existing benchmarks certifies the buggy variants as correct, while an op-schema-aware seeded fuzzing method with fp64 reference and per-operation tolerances flags every seeded bug and keeps every control clean. The same pattern appears on five GPUs from consumer to datacenter class. A reader would care because the result implies that current benchmark scores systematically overstate how correct LLM-generated kernels actually are.

Core claim

Benchmarks that judge LLM-generated GPU kernels correct via fixed-shape, small-sample allclose checks certify kernels containing transcription errors as correct; an alternative oracle that applies op-schema-aware seeded fuzzing, a high-precision fp64 CPU reference, and per-(op, dtype) absolute tolerances detects all nine seeded bugs while passing fifteen controls, with identical verdicts on RTX 3060, A10, L40S, A100, and H100 hardware.

What carries the argument

Op-schema-aware seeded fuzzing, checked against an fp64 CPU reference with per-(op, dtype) absolute tolerances, in which every failure replays byte-for-byte from a stored seed.
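The contrast between the two oracles can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's implementation: `buggy_scale`, `BLOCK`, the shape range, and the tolerance are all hypothetical, and the bug shown (a dropped tail tile) is a stand-in chosen because it is easy to verify, not one of the paper's nine seeded errors. Python floats are fp64, so the reference here is high precision by construction.

```python
import random

BLOCK = 8  # hypothetical tile width of the stand-in kernel


def reference_scale(xs, alpha):
    """Reference result; Python floats are IEEE-754 double precision (fp64)."""
    return [alpha * x for x in xs]


def buggy_scale(xs, alpha):
    """Stand-in 'kernel' that only writes full tiles, leaving the tail at zero."""
    out = [0.0] * len(xs)
    full = (len(xs) // BLOCK) * BLOCK
    for i in range(full):
        out[i] = alpha * xs[i]
    return out


def max_abs_err(got, want):
    return max((abs(g - w) for g, w in zip(got, want)), default=0.0)


def fixed_shape_check(kernel, n=64, atol=1e-6):
    """One shape, one input draw: the style of oracle the paper critiques."""
    rng = random.Random(0)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return max_abs_err(kernel(xs, 2.0), reference_scale(xs, 2.0)) <= atol


def make_case(seed):
    """Derive shape and values from the seed alone, so a failure can be replayed."""
    rng = random.Random(seed)
    n = rng.randint(1, 257)
    return [rng.uniform(-4.0, 4.0) for _ in range(n)]


def seeded_fuzz(kernel, seeds, atol=1e-6):
    """Return the seeds whose generated case exceeds the absolute tolerance."""
    failing = []
    for seed in seeds:
        xs = make_case(seed)
        if max_abs_err(kernel(xs, 2.0), reference_scale(xs, 2.0)) > atol:
            failing.append(seed)
    return failing
```

Because `make_case` depends only on the seed, storing a failing seed is enough to regenerate the exact input later. The fixed-shape check uses a length that is a multiple of `BLOCK`, which is why it cannot see the dropped tail.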

If this is right

Existing benchmarks (KernelBench, TritonBench, GEAK) overestimate correctness rates for LLM kernels.
Transcription errors of the seeded kind survive single-shape allclose checks but are caught by shape- and input-diverse testing.
The illusion is independent of GPU architecture: the same ten failures and sixteen passes appear on every tested device.
Adding flash-attention to the corpus does not change the outcome.
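One way a value-level slip can survive a fixed check is that the check's inputs never reach the branch where the slip lives. The sketch below illustrates this with a leaky ReLU whose negative slope is wrong. The mechanism (a fixed oracle that draws only non-negative inputs), the slope values, and the tolerance are assumptions made for illustration; the paper's own fixed-shape oracle may miss this class of bug for a different reason.

```python
import random

CORRECT_SLOPE = 0.01  # assumed intended negative slope
WRONG_SLOPE = 0.1     # hypothetical transcription slip


def leaky_relu(x, slope):
    return x if x >= 0.0 else slope * x


def reference(xs):
    return [leaky_relu(x, CORRECT_SLOPE) for x in xs]


def buggy_kernel(xs):
    return [leaky_relu(x, WRONG_SLOPE) for x in xs]


def passes(kernel, xs, atol=1e-6):
    """Absolute-tolerance comparison against the fp64 reference."""
    return all(abs(k - r) <= atol for k, r in zip(kernel(xs), reference(xs)))


def narrow_inputs(seed, n=1024):
    """Assumed fixed-oracle inputs: uniform on [0, 1), so never negative."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


def diverse_inputs(seed, n=1024):
    """Signed random values plus a few explicit boundary values."""
    rng = random.Random(seed)
    edges = [-8.0, -1.0, 0.0, 1.0, 8.0]
    return edges + [rng.uniform(-8.0, 8.0) for _ in range(n)]
```

With `narrow_inputs` the buggy and correct kernels agree exactly, since the negative branch is never taken; `diverse_inputs` exercises that branch on its first element.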

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Benchmarks could adopt fuzzing oracles to produce more trustworthy scores for generated kernels.
The result points to a broader need for input-diverse testing whenever LLMs generate numerical or shape-sensitive code.
The same seeding technique could be applied to evaluate correctness oracles in other code-generation domains.

Load-bearing premise

The nine seeded transcription errors are representative of the mistakes LLMs actually make when writing GPU kernels.

What would settle it

An experiment that shows real LLM outputs for the same kernels never contain the seeded transcription errors, or that the flagged bugs never produce wrong results on any practical workload.

Figures

Figures reproduced from arXiv: 2606.20128 by Dipankar Sarkar.

**Figure 1.** Verdict per kernel on the full 26-op corpus, plotted from the RTX 3060 cross-GPU run. Green indicates correct controls that pass cleanly. Red indicates illusions (bench oracle pass, seeded oracle fail). The cross-GPU sweep in §4.1 confirms the same verdict on the four remaining GPU classes. Magnitude-uniform bugs (gelu missing 0.5, silu sigmoid(2x), leaky relu wrong α, rmsnorm and l2norm missing sqrt, atte…

**Figure 2.** Cross-GPU verdict consistency on the 26-op corpus. Each panel covers half the corpus; rows are the five GPU classes; cells are per-kernel fail rates. Controls stay green on every GPU. Illusions stay red on every GPU. The validator currently does same-dtype comparison: kernel-fp16 against reference-fp16 rounded from fp64. Cross-dtype comparison (kernel-fp16 against reference-fp64) is a noted future extensio…

Original abstract

Benchmarks for LLM-generated GPU kernels (KernelBench, TritonBench, GEAK) score correctness through fixed-shape, small-sample allclose-style checks. The number of inputs varies between benchmarks. The shape, dtype, and tolerance are fixed for each kernel. We test that oracle empirically. We construct a controlled corpus of 24 Triton and CPU stand-in kernels (15 correct controls and 9 LLM-style buggy variants seeded with documented transcription errors) and re-evaluate it under op-schema-aware seeded fuzzing with a high-precision (fp64) CPU reference and per-(op, dtype) absolute tolerances. The seeded oracle flags 9 of 9 buggy kernels and passes 15 of 15 correct controls, at zero precision cost on controls. We extend the corpus to 26 ops (adding a flash-attention pair) and re-run the same protocol on five GPU classes (RTX 3060, A10, L40S, A100 SXM4, H100 NVL). The verdicts are identical across all five GPUs: 10 of 10 illusions caught and 16 of 16 controls clean. The corpus result is about LLM-style transcription bugs that the allclose-on-one-shape oracle certifies as correct, not about the bug rate of any specific deployed LLM. Every flagged failure replays byte-for-byte from a stored seed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a clean empirical separation on a seeded corpus showing that fixed-shape allclose oracles miss certain transcription bugs while an fp64 seeded-fuzzing check catches them, with identical results across five GPUs.

The core result is straightforward. On a corpus of 24 kernels (15 clean controls, 9 with seeded transcription errors drawn from documented sources), the standard allclose-on-fixed-shape checks used in KernelBench-style evaluations pass every buggy variant, while the op-schema-aware fuzzing protocol with high-precision CPU references flags all nine and passes all fifteen controls. The same verdicts hold on RTX 3060 through H100 hardware. The protocol is fully seeded and replayable, with per-(op, dtype) tolerances and no post-hoc exclusions.

What the work actually adds is the controlled corpus plus the fuzzing protocol itself. Prior benchmarks are critiqued for their oracle design, and this paper supplies a concrete counter-example that demonstrates the gap on the exact error class it targets. The author is explicit that the paper does not measure LLM bug rates; it shows only what happens to these particular LLM-style transcription bugs under the two oracles.

The main limitation is the one the stress-test flags. The headline claim about a 'correctness illusion' for LLM-generated kernels rests on the assumption that the nine seeded transcription errors are representative of the mistakes LLMs actually make. The paper does not test that assumption and does not claim to. If LLMs more commonly produce shape mismatches, indexing errors, or algorithmic deviations rather than the specific transcription slips seeded here, the oracle failure on this corpus does not directly establish the broader illusion. That gap is real but narrow; the measurement on the corpus itself is internally sound.

This is worth a reading group for anyone working on LLM code generation benchmarks or GPU kernel testing. The experiment is small, reproducible, and directly addresses oracle quality. It deserves peer review because the empirical protocol is clear, the hardware replication is present, and the finding is falsifiable from the released seeds. Minor revisions could tighten the framing around representativeness, but the central measurement stands on its own.

Referee Report

0 major / 2 minor

Summary. The paper claims that fixed-shape allclose-style oracles in benchmarks for LLM-generated GPU kernels (e.g., KernelBench) can certify certain transcription-error bugs as correct, creating a 'correctness illusion.' This is shown via a controlled corpus of 24 Triton/CPU stand-in kernels (15 correct controls + 9 LLM-style buggy variants seeded with documented transcription errors) where a new op-schema-aware seeded-fuzzing oracle using fp64 CPU reference and per-(op, dtype) absolute tolerances flags all 9 bugs while passing all 15 controls at zero precision cost; the result replicates identically on an extended 26-op corpus across five GPU classes (RTX 3060 to H100), with all failures replayable from stored seeds. The claim is explicitly limited to LLM-style transcription bugs on the constructed corpus rather than measured LLM bug rates.

Significance. If the result holds, the work provides a reproducible empirical demonstration of a concrete weakness in current benchmark oracles, with perfect separation on the corpus, multi-GPU consistency, and seed-based replayability as notable strengths. This could directly inform improved evaluation protocols for LLM kernel generation without relying on fitted parameters or post-hoc exclusions.

minor comments (2)

[Abstract] Abstract: the qualifier 'LLM-style' is used consistently, but a single concrete example of one seeded transcription error (e.g., the specific dtype or indexing mistake) would help readers immediately grasp the bug class without needing the full corpus details.
[§3 (method)] The description of how per-(op, dtype) absolute tolerances are chosen could be expanded with one sentence on their derivation (e.g., from fp64 reference statistics or literature values) to make the protocol fully self-contained.
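To make the tolerance question concrete, here is a minimal sketch of what a per-(op, dtype) absolute-tolerance lookup with same-dtype comparison could look like. The table `ATOL`, its keys, and every numeric value in it are hypothetical placeholders, not the paper's calibrated tolerances; the fp16 rounding uses Python's `struct` half-precision format to mimic "reference-fp16 rounded from fp64".

```python
import struct

# Hypothetical tolerance table; the values are placeholders, not the paper's.
ATOL = {
    ("gelu", "fp16"): 2e-3,
    ("gelu", "fp32"): 1e-6,
    ("softmax", "fp16"): 1e-3,
    ("softmax", "fp32"): 1e-6,
}

# struct format codes for the storage dtypes handled in this sketch.
_FORMATS = {"fp16": "<e", "fp32": "<f"}


def round_to_dtype(x, dtype):
    """Round an fp64 value to the nearest value representable in `dtype`."""
    fmt = _FORMATS[dtype]
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def within_tolerance(op, dtype, kernel_out, reference_fp64):
    """Same-dtype comparison: round the fp64 reference, then apply the atol."""
    atol = ATOL[(op, dtype)]  # KeyError on a missing entry: no silent default
    ref = [round_to_dtype(v, dtype) for v in reference_fp64]
    return all(abs(k - r) <= atol for k, r in zip(kernel_out, ref))
```

Raising on a missing `(op, dtype)` key rather than falling back to a global default is a deliberate choice in this sketch: a silent default would reintroduce the one-tolerance-fits-all behaviour that per-operation tolerances are meant to avoid.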

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation to accept. The report correctly captures the scope and limitations of our work. There are no major comments to address.

Circularity Check

0 steps flagged

No circularity: empirical corpus evaluation stands on independent construction and measurement

The paper constructs an explicit corpus of 15 correct controls plus 9 seeded transcription-error variants, then measures oracle behavior under two protocols (standard allclose vs. fp64 seeded fuzzing). The reported outcome (standard oracle passes all 9 bugs; new oracle flags all 9 while preserving controls) follows directly from running the defined tests on the defined inputs. No parameters are fitted, no equations reduce the result to prior self-citations, and the paper explicitly disclaims any claim about actual LLM bug distributions. The derivation chain contains no self-definitional, fitted-prediction, or self-citation-load-bearing steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the representativeness of the seeded transcription errors and the assumption that the fp64 CPU reference serves as reliable ground truth. No free parameters are fitted to produce the reported verdicts.

axioms (2)

domain assumption: Seeded transcription errors represent the kinds of mistakes LLMs make when generating kernels.
The corpus is built from 9 variants seeded with documented transcription errors to simulate LLM outputs.
domain assumption: The high-precision fp64 CPU reference provides correct ground truth for detecting GPU kernel discrepancies.
Used as the reference implementation in the fuzzing protocol.

pith-pipeline@v0.9.1-grok · 5768 in / 1463 out tokens · 33805 ms · 2026-06-26T16:30:33.158717+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Test-Input Generation for Tensor Programs: What Actually Finds Kernel Bugs
cs.SE · 2026-06 · unverdicted · novelty 5.0

Boundary shape sampling for tensor kernel testing achieves 78% recall on seeded bugs with 0% false positives on correct kernels, while adversarial value sampling reaches 99% recall at the cost of 94% false positives.

Reference graph

Works this paper leans on

16 extracted references · 4 canonical work pages · cited by 1 Pith paper

[1] Ahmed, S., et al.: What every user should know about mixed precision training in PyTorch. PyTorch blog (2022), https://pytorch.org/blog/what-every-user-should-know-about-mixed-precision-training-in-pytorch/, updated November 2024

[2] Deng, Y., Yang, C., Wei, A., Zhang, L.: Fuzzing deep-learning libraries via automated relational API inference. In: Proc. 30th ACM Joint Eur. Softw. Eng. Conf. and Symp. Found. Softw. Eng. (ESEC/FSE) (2022). https://doi.org/10.1145/3540250.3549085

[3] Dong, S., Yang, Y., Liu, Y., Wang, H., Qi, Y., Tarokh, V., Rangadurai, K., Yang, Y.: STARK: Strategic team of agents for refining kernels. arXiv preprint (2025), https://arxiv.org/abs/2510.16996

[4] Kalamkar, D., Mudigere, D., Mellempudi, N., Das, D., Banerjee, K., Avancha, S., Vooturi, D.T., Jammalamadaka, N., Huang, J., Yuen, H., Yang, J., Park, J., Heinecke, A., Georganas, E., Srinivasan, S., Kundu, A., Smelyanskiy, M., Kaul, B., Dubey, P.: A study of BFLOAT16 for deep learning training. arXiv preprint (2019), https://arxiv.org/abs/1905.12322

[5] Li, Z., Lu, Y., Guo, H., Zhang, M., Wang, Y., Zhang, L.: GPU-Fuzz: Finding memory errors in deep learning frameworks. arXiv preprint (2026), https://arxiv.org/abs/2602.10478

[6] Ouyang, A., Guo, S., Arora, S., Zhang, A.L., Hu, W., Ré, C., Mirhoseini, A.: KernelBench: Can LLMs write efficient GPU kernels? arXiv preprint (2025), https://arxiv.org/abs/2502.10517

[7] Ran, H., Xie, S., Ji, H., Liu, Y., Wu, Y., Cao, H., Guo, A., Yu, Y., Li, L., Hu, W., Yang, D., Xie, T.: KernelBand: Steering LLM-based kernel optimization via hardware-aware multi-armed bandits. arXiv preprint (2025), https://arxiv.org/abs/2511.18868

[8] Sarkar, D.: Operator-aware mixed-precision tolerance calibration for tensor kernels (2026), manuscript in preparation

[9] Sarkar, D.: Test-input generation for tensor programs: What actually finds kernel bugs (2026), manuscript in preparation

[10] Shiri Harzevili, N., Pham, H.V., Wang, S.: Benchmarking deep learning fuzzers. arXiv preprint (2023), https://arxiv.org/abs/2310.06912

[11] Shiri Harzevili, N., Pham, H.V., Wang, S.: Evaluating API-level deep learning fuzzers: A comprehensive benchmarking study. ACM Trans. Softw. Eng. Methodol. (TOSEM) (2025). https://doi.org/10.1145/3729533, https://dl.acm.org/doi/10.1145/3729533

[12] Wang, H., Zhang, Y., Jiang, W., Wang, X., Chen, L., Zhu, Y.: KernelBench-X: A comprehensive benchmark for evaluating LLM-generated GPU kernels. arXiv preprint (2026), https://arxiv.org/abs/2605.04956

[13] Wang, J., Joshi, V., Majumder, S., Chao, K., Ding, Y., Liu, K., Brahma, P., Li, Y., Liu, J., Barsoum, E.: GEAK: Introducing Triton kernel AI agent & evaluation benchmarks. arXiv preprint (2025), https://arxiv.org/abs/2507.23194

[14] Wei, A., Deng, Y., Yang, C., Zhang, L.: Free lunch for testing: Fuzzing deep-learning libraries from open source. In: Proc. 44th Int. Conf. Software Engineering (ICSE) (2022), https://arxiv.org/abs/2201.06589

[15] Xie, D., Li, Y., Kim, M., Pham, H.V., Tan, L., Zhang, X., Godfrey, M.W.: DocTer: Documentation-guided fuzzing for testing deep learning API functions. In: Proc. 31st ACM SIGSOFT Int. Symp. Software Testing and Analysis (ISSTA) (2022). https://doi.org/10.1145/3533767.3534220, https://arxiv.org/abs/2109.01002

[16] Yang, C., Deng, Y., Yao, J., Tu, Y., Li, H., Zhang, L.: Fuzzing automatic differentiation in deep-learning libraries. In: Proc. 45th Int. Conf. Software Engineering (ICSE) (2023). https://doi.org/10.1109/ICSE48619.2023.00105, https://arxiv.org/abs/2302.04351
