pith. sign in

arxiv: 2606.27396 · v1 · pith:RWFRE7JInew · submitted 2026-06-23 · 💻 cs.SE · cs.LG

Test-Input Generation for Tensor Programs: What Actually Finds Kernel Bugs

Pith reviewed 2026-06-29 01:57 UTC · model grok-4.3

classification 💻 cs.SE cs.LG
keywords test input generationtensor kernelsbug detectionfuzzing strategiesshape samplingboundary testingfalse positive ratesoftmax bugs
0
0 comments X

The pith

Boundary-only shape sampling detects 78% of seeded tensor kernel bugs with zero false positives on correct kernels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper makes explicit the folkloric choices in generating test inputs for tensor kernels and measures their effectiveness at finding bugs. It compares seven strategies that differ in shape selection, data type choices, and input value distributions across a corpus of 16 correct operations and 10 buggy variants. Boundary-only shape sampling emerges as the safest approach because it catches most of the seeded bugs without flagging any correct kernels. Other strategies either miss entire classes of bugs, such as certain softmax tail-mask errors, or produce high false-positive rates when they introduce special values that validators flag indiscriminately. The work shows that the choice of shape sampling strategy drives the largest differences in what gets caught.

Core claim

Through systematic evaluation of test input generation strategies on a 26-operation corpus, boundary-only shape sampling achieves 78% recall on the 10 buggy kernels while maintaining a 0% false-positive rate on the 16 correct controls. Adversarial value sampling reaches 99% recall but inflates the false-positive rate to 94% because the validator's checks on NaN and Inf propagation fire on both buggy and correct kernels. On the two softmax tail-mask bugs, the strategy without boundary shapes catches zero instances, whereas boundary shapes raise recall to 100% and 62% respectively. The results concern which seeded transcription patterns each strategy detects rather than overall bug rates in de

What carries the argument

Comparison of seven test-generation strategies that vary the shape candidate set, dtype mix, and input value distribution, with performance measured by bug recall against controls and false-positive rate.

If this is right

  • Boundary-only shape sampling serves as an operationally safe default for test input generation that avoids false alarms while still detecting most seeded transcription errors.
  • Regular shape sampling without boundaries misses all instances of the two softmax tail-mask bugs examined.
  • Adversarial value sampling increases recall but produces high false-positive rates when validators check for NaN and Inf propagation.
  • Shape sampling choice has larger impact on detection of specific bug patterns than dtype mix or value distribution alone.
  • Test strategies must be evaluated jointly with the validator's treatment of special values to avoid inflated false-positive rates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Tensor framework maintainers could adopt boundary shape sampling as a default in their validation pipelines to reduce missed bugs without added noise from false positives.
  • The performance gap on softmax bugs indicates that boundary conditions are especially important for operations involving masks or tail elements.
  • Future work could test whether the same strategy rankings hold when the validator uses different methods for handling NaN and Inf values.
  • Extending the evaluation corpus beyond seeded transcription patterns to include bugs found in actual user reports would test how well the findings generalize.

Load-bearing premise

The 10 LLM-style buggy variants seeded with documented transcription patterns are representative of the bug types that matter in practice, and observed recall differences are driven by the test strategies rather than validator handling of NaN and Inf.

What would settle it

Applying the same seven strategies to a larger collection of real, reported tensor kernel bugs from production code and checking whether boundary-only sampling maintains 78% or higher recall with zero false positives on correct implementations.

Figures

Figures reproduced from arXiv: 2606.27396 by Dipankar Sarkar.

Figure 1
Figure 1. Figure 1: Bug recall (%) per (strategy, buggy kernel), 8 iters each. Strategy order matches [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
read the original abstract

Test-input generation for tensor kernels is folkloric. Most projects pick a representative shape and dtype, run a fixed-shape allclose-style check, and ship. We make the choices explicit and measure them. Using the gpuemu op-schema-aware seeded fuzzer (arXiv:2606.20128), we evaluate seven test-generation strategies across a 26-op corpus (16 correct controls and 10 LLM-style buggy variants seeded with documented transcription patterns) on an RTX 3060 GPU instance. Strategies vary the shape candidate set, the dtype mix, and the input value distribution. We report each strategy on two axes: bug recall and control false-positive (FP) rate. Boundary-only shape sampling is the operationally safe winner: 78% recall on the 10 buggy kernels with 0% FP on the 16 controls. Adversarial value sampling reaches higher recall (99%) but inflates control FP to 94% because the strategy injects NaN and Inf inputs and the validator's NaN check fires on every kernel that propagates them, not only on buggy kernels. On the two softmax tail-mask bugs the "regular" strategy (no boundary shapes) catches 0%, while boundary raises recall to 100% and 62% respectively. That gap is the clearest single signal in the data. The corpus result is about which seeded bug patterns each strategy catches, not about the bug rate of any specific deployed LLM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper evaluates seven test-input generation strategies for tensor kernels via the gpuemu seeded fuzzer on a 26-op corpus (16 correct controls + 10 LLM-style buggy variants created by seeding documented transcription patterns). Strategies differ in shape candidate sets, dtype mix, and value distributions. Results are reported as bug recall and control FP rate. Boundary-only shape sampling is presented as the operationally safe winner (78% recall, 0% FP), while adversarial sampling reaches 99% recall but 94% FP due to NaN/Inf propagation triggering the validator on controls. Specific gaps are noted for softmax tail-mask bugs where boundary shapes raise recall from 0% to 100%/62%. The manuscript explicitly states that the corpus results concern which seeded patterns each strategy catches, not real bug rates in deployed LLMs.

Significance. If the seeded patterns prove representative of practically relevant kernel bugs, the work supplies concrete, reproducible empirical data on the trade-offs among shape, dtype, and value strategies, including a clear demonstration that boundary shapes catch certain transcription bugs missed by regular sampling. The explicit two-axis measurement (recall vs. FP) and use of an op-schema-aware fuzzer constitute a strength that can guide testing practice even if the absolute numbers are corpus-specific.

major comments (2)
  1. [Abstract] Abstract: The headline claim that boundary-only shape sampling is the 'operationally safe winner' is load-bearing, yet the evaluation is performed exclusively on 10 artificially seeded LLM transcription bugs. The manuscript itself notes that results address which seeded patterns are caught rather than real bug rates; without additional evidence that these patterns are representative of the bug types that matter in practice, the operational-safety interpretation is not fully supported by the data.
  2. [Abstract] Abstract (and methods description of the validator): The 94% FP rate for adversarial sampling is attributed to the validator's NaN check firing on controls that propagate NaN/Inf. No details are supplied on the exact implementation of that NaN/Inf check, the precise shape candidate sets used for each strategy, or any statistical significance assessment of the recall differences (e.g., the softmax tail-mask cases). Because validator mechanics are already shown to affect the metrics, these omissions prevent readers from determining whether observed gaps are driven by the shape strategy itself.
minor comments (1)
  1. The manuscript would benefit from an explicit limitations subsection that restates the synthetic nature of the corpus and the dependence on the validator implementation, even if the current abstract caveat is retained.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, proposing revisions to clarify scope and add missing details where feasible. The manuscript already qualifies its claims as applying to the seeded patterns rather than real-world rates.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The headline claim that boundary-only shape sampling is the 'operationally safe winner' is load-bearing, yet the evaluation is performed exclusively on 10 artificially seeded LLM transcription bugs. The manuscript itself notes that results address which seeded patterns are caught rather than real bug rates; without additional evidence that these patterns are representative of the bug types that matter in practice, the operational-safety interpretation is not fully supported by the data.

    Authors: We agree the evaluation uses 10 seeded bugs derived from documented transcription patterns, and the manuscript already states explicitly that results concern which seeded patterns each strategy catches rather than real bug rates in deployed LLMs. The 'operationally safe winner' label rests on the empirical 78% recall paired with 0% FP rate on the 16 correct controls in this corpus. To prevent overinterpretation, we will revise the abstract to qualify the claim more explicitly as applying to these seeded patterns (representative of common LLM transcription errors) rather than implying general operational safety across all possible bugs. This revision maintains the data-driven observation while addressing the concern. revision: partial

  2. Referee: [Abstract] Abstract (and methods description of the validator): The 94% FP rate for adversarial sampling is attributed to the validator's NaN check firing on controls that propagate NaN/Inf. No details are supplied on the exact implementation of that NaN/Inf check, the precise shape candidate sets used for each strategy, or any statistical significance assessment of the recall differences (e.g., the softmax tail-mask cases). Because validator mechanics are already shown to affect the metrics, these omissions prevent readers from determining whether observed gaps are driven by the shape strategy itself.

    Authors: We will expand the methods section to describe the validator's NaN/Inf check (any output containing NaN or Inf is flagged) and provide the exact shape candidate sets for each strategy (boundary shapes use edge values such as 0, 1, and maximum dimensions per the op schema; regular shapes use typical schema values). For the recall differences, including the softmax tail-mask cases (0% to 100% and 62%), we report descriptive rates given the small corpus; no statistical significance tests were performed. These additions will allow readers to evaluate the contribution of each component. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical measurement on fixed corpus

full rationale

The paper reports direct measurements of bug recall and control FP rates across seven test-generation strategies on a corpus of 16 correct kernels and 10 seeded buggy variants. No equations, fitted parameters, predictions derived from inputs, or self-referential definitions appear in the provided text. The central claims are observational counts (78% recall, 0% FP for boundary sampling) obtained by running the strategies and validator on the corpus; these do not reduce to the inputs by construction. The cited prior arXiv:2606.20128 supplies the fuzzer tool but is not used to derive the reported metrics. This is a standard empirical evaluation study with no load-bearing self-citation chains or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The evaluation rests on the assumption that the seeded buggy variants model realistic LLM transcription errors and that the fuzzer plus validator behave as described in the prior arXiv work; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption The 10 seeded buggy variants with documented transcription patterns are representative of bugs that occur when LLMs generate tensor kernels
    Used as the ground truth for measuring bug recall across strategies.
  • domain assumption The validator's NaN/Inf check fires on propagation behavior rather than on kernel correctness
    Explains why adversarial sampling produces high false positives on correct controls.

pith-pipeline@v0.9.1-grok · 5783 in / 1481 out tokens · 58042 ms · 2026-06-29T01:57:35.463385+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

13 extracted references · 4 canonical work pages

  1. [1]

    In: Proc

    Deng, Y., Yang, C., Wei, A., Zhang, L.: Fuzzing deep-learning libraries via au- tomated relational API inference. In: Proc. 30th ACM Joint Eur. Softw. Eng. Conf. and Symp. Found. Softw. Eng. (ESEC/FSE). pp. 44–56 (2022). https: //doi.org/10.1145/3540250.3549085

  2. [2]

    arXiv preprint (2025), https://arxiv.org/abs/2510.16996

    Dong, J., Yang, Y., Liu, T., Wang, Y., Qi, F., Tarokh, V., Rangadurai, K., Yang, S.: STARK: Strategic team of agents for refining kernels. arXiv preprint (2025), https://arxiv.org/abs/2510.16996

  3. [3]

    Wiley (1979)

    Myers, G.J.: The Art of Software Testing. Wiley (1979)

  4. [4]

    Ouyang, A., Guo, S., Arora, S., Zhang, A.L., Hu, W., R´ e, C., Mirhoseini, A.: KernelBench: Can LLMs write efficient GPU kernels? arXiv preprint (2025), https: //arxiv.org/abs/2502.10517

  5. [5]

    arXiv preprint (2025),https://arxiv.org/abs/2509.14626

    Qin, F., Naziri, M.M.A., Ai, H., Dutta, S., d’Amorim, M.: Evaluating the effec- tiveness of coverage-guided fuzzing for testing deep learning library APIs. arXiv preprint (2025),https://arxiv.org/abs/2509.14626

  6. [6]

    arXiv preprint (2025), https://arxiv.org/ abs/2511.18868 8 D

    Ran, D., Xie, S., Ji, M., Liu, A., Wu, M., Cao, Y., Guo, Y., Yu, H., Li, L., Hu, Y., Yang, W., Xie, T.: KernelBand: Steering LLM-based kernel optimization via hardware-aware multi-armed bandits. arXiv preprint (2025), https://arxiv.org/ abs/2511.18868 8 D. Sarkar

  7. [7]

    arXiv preprint (2026),https://arxiv.org/abs/2606.20128

    Sarkar, D.: The correctness illusion in LLM-generated GPU kernels. arXiv preprint (2026),https://arxiv.org/abs/2606.20128

  8. [8]

    Draft source at https://github.com/sarkar-d ipankar/gpuemu-arxiv-paper/tree/main/p2

    Sarkar, D.: Operator-aware mixed-precision tolerance calibration for tensor kernels (2026), manuscript in preparation. Draft source at https://github.com/sarkar-d ipankar/gpuemu-arxiv-paper/tree/main/p2

  9. [9]

    arXiv preprint (2026),https://arxiv.org/abs/2605.04956

    Wang, H., Zhang, J., Jiang, K., Wang, H., Chen, J., Zhu, J.: KernelBenchX: A comprehensive benchmark for evaluating LLM-generated GPU kernels. arXiv preprint (2026),https://arxiv.org/abs/2605.04956

  10. [10]

    arXiv preprint (2025),https://arxiv.org/abs/2507.23194

    Wang, J., Joshi, V., Majumder, S., Xu Chao, Ding, B., Liu, Z., Brahma, P.P., Li, D., Liu, Z., Barsoum, E.: GEAK: Introducing Triton kernel AI agent & evaluation benchmarks. arXiv preprint (2025),https://arxiv.org/abs/2507.23194

  11. [11]

    In: Proc

    Wei, A., Deng, Y., Yang, C., Zhang, L.: Free lunch for testing: Fuzzing deep- learning libraries from open source. In: Proc. 44th Int. Conf. Software Engineering (ICSE). pp. 995–1007 (2022). https://doi.org/10.1145/3510003.3510041 , https://arxiv.org/abs/2201.06589

  12. [12]

    In: Proc

    Xie, D., Li, Y., Kim, M., Pham, H.V., Tan, L., Zhang, X., Godfrey, M.W.: DocTer: Documentation-guided fuzzing for testing deep learning API functions. In: Proc. 31st ACM SIGSOFT Int. Symp. Software Testing and Analysis (ISSTA). pp. 176– 188 (2022). https://doi.org/10.1145/3533767.3534220 , https://arxiv.org/ abs/2109.01002

  13. [13]

    Automated program repair in the era of large pre- trained language models

    Yang, C., Deng, Y., Yao, J., Tu, Y., Li, H., Zhang, L.: Fuzzing automatic differen- tiation in deep-learning libraries. In: Proc. 45th Int. Conf. Software Engineering (ICSE). pp. 1174–1186 (2023). https://doi.org/10.1109/ICSE48619.2023.00105, https://arxiv.org/abs/2302.04351