arxiv: 2605.19652 · v1 · pith:B7ZNLXZTnew · submitted 2026-05-19 · 💻 cs.SE

Characterizing Real-World Bugs in Tile Programs for Automated Bug Detection

Pith reviewed 2026-05-20 04:34 UTC · model grok-4.3

classification 💻 cs.SE
keywords tile programs · code generation bugs · GPU kernels · compiler testing · bug characterization · deep learning · scientific computing

The pith

Tile program code generation bugs follow patterns tied to input shapes and compilation stages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts the first systematic analysis of bugs that arise when tile-based frameworks generate GPU kernel code for deep learning and scientific computing. Researchers collected 401 GitHub bug reports, filtered them down to 301 relevant cases, and sorted the bugs by root cause, symptom, the input patterns that expose them, the test oracles that catch them, and the fixes that developers apply. A reader should care because these bugs often appear only as silent correctness or performance failures that ordinary compiler tests do not catch, and the multi-stage pipelines plus tile-specific language rules make the problems hard to diagnose. The resulting categories give concrete starting points for new debugging and repair tools aimed at tile compilers.

Core claim

This paper presents the first systematic study of tile-program code generation bugs. We curate 401 bug reports from GitHub and identify 301 tile-program codegen bugs for analysis, categorizing the root causes, symptoms, input patterns, test oracles that trigger these bugs, and the strategies used to fix bugs. Our study provides foundational insights for building debugging, testing, and repair tools tailored to tile-based compiler infrastructures.

What carries the argument

Manual curation and categorization of 301 tile-program bugs drawn from GitHub, organized by root causes in multi-stage compilation, symptoms, input shapes, data types, backend targets, and developer fix strategies.

If this is right

  • The identified input patterns can be used to generate more effective test cases for tile compilers.
  • Common symptoms point to places where silent errors are most likely to appear in production GPU kernels.
  • Fix strategies show that repair tools must incorporate knowledge of tile abstractions and pipeline stages.
  • Categorization of test oracles suggests concrete checkers that current general-purpose compiler testers lack.
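
One way to act on the first and last bullets is sketched below in Python. It is an illustration only, not a method taken from the paper: it assumes that lengths just below, at, and just above a tile multiple are worth testing, and it uses a differential oracle that compares a tiled computation against a plain reference. `TILE`, `boundary_sizes`, `tiled_sum`, `tiled_sum_buggy`, and `failing_sizes` are hypothetical names, and the dropped-tail bug is invented here to show a silent, shape-dependent failure.

```python
import math
from typing import Callable, Iterator, List

TILE = 4  # hypothetical tile width, chosen only for illustration


def boundary_sizes(tile: int, max_tiles: int = 3) -> Iterator[int]:
    """Yield lengths just below, at, and just above each tile multiple."""
    seen = set()
    for k in range(1, max_tiles + 1):
        for n in (k * tile - 1, k * tile, k * tile + 1):
            if n > 0 and n not in seen:
                seen.add(n)
                yield n


def tiled_sum(xs: List[float], tile: int = TILE) -> float:
    """Sum the input in tile-sized chunks; the last chunk may be partial."""
    total = 0.0
    for start in range(0, len(xs), tile):
        total += sum(xs[start:start + tile])
    return total


def tiled_sum_buggy(xs: List[float], tile: int = TILE) -> float:
    """Deliberately wrong variant: silently drops a partial trailing tile."""
    full = (len(xs) // tile) * tile
    return tiled_sum(xs[:full], tile)


def failing_sizes(kernel: Callable[[List[float]], float], tile: int = TILE) -> List[int]:
    """Differential oracle: report the lengths where kernel disagrees with sum()."""
    bad = []
    for n in boundary_sizes(tile):
        xs = [float(i + 1) for i in range(n)]
        if not math.isclose(kernel(xs), sum(xs), rel_tol=1e-9):
            bad.append(n)
    return bad


if __name__ == "__main__":
    print("correct kernel fails at:", failing_sizes(tiled_sum))
    print("buggy kernel fails at:  ", failing_sizes(tiled_sum_buggy))
```

Under these assumptions the buggy variant passes at every exact tile multiple and fails only at the ragged lengths, which is the kind of coupling to input shape that the abstract describes.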

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Automated detection systems could encode the reported root-cause patterns as static or dynamic checks inside tile compilers.
  • The same curation method could be applied to bugs in other high-performance DSLs that use multi-stage code generation.
  • The categories supply a benchmark set that future testing tools for tile programs can be measured against.

Load-bearing premise

The 401 GitHub bug reports are representative of real-world tile-program code generation bugs and the manual labels for root causes and symptoms are reliable even without full reproduction environments or original developer intent.

What would settle it

A new collection of tile-program bugs, drawn from additional repositories or production runs, that shows substantially different distributions of root causes or input triggers would undermine the claim that the reported categories generalize.
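
A replication could quantify "substantially different" in several ways. The Python sketch below uses total variation distance between two tables of category counts as one simple option. The metric is an assumption of this note, not something the paper proposes, the function names are hypothetical, and no counts from the paper are used.

```python
from typing import Dict


def normalize(counts: Dict[str, int]) -> Dict[str, float]:
    """Turn raw category counts into shares that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot normalize an empty sample")
    return {category: n / total for category, n in counts.items()}


def total_variation(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Total variation distance between two category distributions.

    Categories missing from one sample are treated as having share 0.
    """
    pa, pb = normalize(a), normalize(b)
    categories = set(pa) | set(pb)
    return 0.5 * sum(abs(pa.get(c, 0.0) - pb.get(c, 0.0)) for c in categories)


if __name__ == "__main__":
    # Placeholder counts for two hypothetical samples; not data from the paper.
    original = {"cause_A": 3, "cause_B": 1}
    replication = {"cause_A": 1, "cause_B": 3}
    print(total_variation(original, replication))
```

A distance of 0 means identical category shares and 1 means disjoint ones. Where to draw the line for "substantially different" remains a judgment call that this sketch does not settle.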

Figures

Figures reproduced from arXiv: 2605.19652 by Aaryaa Moharir, Nidhi Majoju, Ravishka Rathnasuriya, Tao Xie, Tingxi Li, Wei Yang, Zihe Song.

Figure 1: Conceptual comparison of compilation pipelines across traditional, tensor, and tile-based compilers.
Figure 2: An instance of a resource allocation bug triggered by premature warpgroup register deallocation in …
Figure 3: A tile-level argmax reduction triggering an operator-implementation bug in Triton [75].

… this process, the resulting mappings may become ill-defined even if the computation itself is type-correct. An illustrative case is reported in Apache TVM [5], where failures in transform_layout, transform_block_layout, and IndexMap::NonSurjectiveInverse are triggered by data-type mismatches. These transformations exist…
Original abstract

Tile-based programming frameworks are increasingly adopted to write high-performance GPU kernels in domains such as deep learning and scientific computing. While these frameworks enhance productivity and hardware utilization, their multi-stage compilation pipelines introduce distinct code generation bugs that are tightly coupled to input shapes, data types, and backend targets. These bugs often manifest as silent correctness or performance issues, making them difficult to detect using existing compiler testing tools. Additionally, the unique programming conventions of tile domain-specific languages complicate root cause identification, while fixing such bugs demands specialized knowledge of tile abstractions and compilation pipelines. Despite the growing adoption of tile-based systems, their code generation bugs remain largely unexplored. This paper presents the first systematic study of tile-program code generation bugs. We curate 401 bug reports from GitHub and identify 301 tile-program codegen bugs for analysis, categorizing the root causes, symptoms, input patterns, test oracles that trigger these bugs, and the strategies used to fix bugs. Our study provides foundational insights for building debugging, testing, and repair tools tailored to tile-based compiler infrastructures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents the first systematic study of code generation bugs in tile-based programming frameworks for high-performance GPU kernels. The authors curate 401 bug reports from GitHub, filter them to 301 tile-program codegen bugs, and categorize root causes, symptoms, the input patterns that trigger the bugs, the test oracles that detect them, and fix strategies, with the aim of informing specialized debugging, testing, and repair tools.

Significance. If the curation process and manual categorization prove reliable and representative, this empirical study would deliver valuable foundational data on an underexplored class of silent, shape- and backend-coupled bugs that evade standard compiler testing. It could directly inform tool-building for tile-based compiler infrastructures in deep learning and scientific computing. The work earns credit for grounding analysis in real GitHub reports rather than synthetic cases and for producing a multi-dimensional taxonomy (root causes through fixes) that is actionable for practitioners.

major comments (2)
  1. [Methodology / Data Collection] The description of curating 401 GitHub reports and identifying 301 codegen bugs provides no details on selection criteria, inclusion/exclusion rules, inter-rater agreement, or validation against reproduction environments. This is load-bearing for the central claim that the resulting taxonomy reflects real-world tile-program bugs, as GitHub issues are self-selected and often lack full context or developer intent.
  2. [Categorization and Analysis] Manual categorization of root causes, symptoms, and oracles from titles, descriptions, and comments alone risks misattribution (e.g., shape-dependent silent errors labeled as performance issues). Without reported measures of labeling reliability or access to reproduction scripts, the categories' correctness cannot be assessed, weakening the insights offered for automated bug detection.
minor comments (2)
  1. [Introduction] Clarify the exact definition of 'tile-program codegen bug' versus usage error or unrelated defect early in the paper to aid reader interpretation of the 301 cases.
  2. [Results] Consider adding a table or figure summarizing the distribution of categories (e.g., percentage of bugs per root cause) for quicker overview.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and outline the revisions we will make to strengthen the transparency of our methodology and analysis.

  1. Referee: [Methodology / Data Collection] The description of curating 401 GitHub reports and identifying 301 codegen bugs provides no details on selection criteria, inclusion/exclusion rules, inter-rater agreement, or validation against reproduction environments. This is load-bearing for the central claim that the resulting taxonomy reflects real-world tile-program bugs, as GitHub issues are self-selected and often lack full context or developer intent.

    Authors: We agree that greater transparency in the curation process is required. In the revised manuscript we will expand the Methodology section with explicit inclusion/exclusion criteria used to select the 301 tile-program codegen bugs from the initial 401 reports, along with inter-rater agreement statistics obtained when multiple authors independently reviewed the issues. We will also clarify the degree to which we could validate reports against available code snippets and developer comments, while acknowledging that complete reproduction environments are not provided in most GitHub issues. revision: yes

  2. Referee: [Categorization and Analysis] Manual categorization of root causes, symptoms, and oracles from titles, descriptions, and comments alone risks misattribution (e.g., shape-dependent silent errors labeled as performance issues). Without reported measures of labeling reliability or access to reproduction scripts, the categories' correctness cannot be assessed, weakening the insights offered for automated bug detection.

    Authors: We acknowledge the inherent limitations of text-based manual categorization. The revised version will include a new subsection that defines each category with concrete examples and reports inter-annotator agreement measures for the labeling process. We will explicitly discuss the unavailability of reproduction scripts for many reports as a limitation and describe how we reduced misattribution risk by cross-referencing issue comments and code fragments. These changes will allow readers to better evaluate the taxonomy's reliability for guiding automated bug detection tools. revision: yes
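
The rebuttal promises inter-annotator agreement measures without naming one. Cohen's kappa for two annotators is a common choice, so the Python sketch below computes it as an illustration. The choice of statistic, the function name `cohen_kappa`, and the label strings in the usage example are assumptions, not details from the paper.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical root-cause labels for four bug reports.
    rater_1 = ["layout", "layout", "sync", "sync"]
    rater_2 = ["layout", "sync", "layout", "sync"]
    print(cohen_kappa(rater_1, rater_2))
```

Kappa corrects raw agreement for chance, which matters when a few categories dominate the sample. A study with more than two annotators would need a different statistic, such as Fleiss' kappa.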

Circularity Check

0 steps flagged

No circularity in empirical bug categorization study

This paper is an empirical study that curates 401 GitHub bug reports, filters to 301 tile-program codegen cases, and performs manual categorization of root causes, symptoms, input patterns, oracles, and fixes. It contains no mathematical derivations, equations, fitted parameters, or self-referential definitions. All claims rest on external data sources and direct inspection rather than any internal reduction or self-citation chain, rendering the analysis self-contained with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the representativeness of GitHub-sourced bug reports and the validity of the manual categorization process as a basis for insights into tile-program bugs.

axioms (1)
  • domain assumption: GitHub bug reports provide a representative and unbiased sample of real-world tile-program code generation bugs
    The study begins by curating 401 bug reports from GitHub as the foundation for identifying and analyzing 301 bugs.

pith-pipeline@v0.9.0 · 5733 in / 1239 out tokens · 43741 ms · 2026-05-20T04:34:30.439994+00:00 · methodology

Reference graph

Works this paper leans on

95 extracted references · 95 canonical work pages · 2 internal anchors

  1. [1]

    S. M. Mojahidul Ahsan, Tamzidul Hoque, Md Sakib Hasan, Mrittika Chowdhury, and Anurag Dhungel. 2025. Hardware accelerators for artificial intelligence. In AI-Enabled Electronic Circuit and System Design: From Ideation to Utilization. Springer

  2. [2]

    AIKernelResearch. 2025. TileBug: project repository. https://github.com/AIKernelResearch/TileCodegenBugStudy.github.io

  3. [3]

    AIKernelResearch. 2025. TileCodegenBug: project website. https://aikernelresearch.github.io/TileCodegenBugStudy.github.io/

  4. [4]

    Apache. 2025. apache/TVM: open deep learning compiler stack for CPU, GPU and specialized accelerators. https://github.com/apache/tvm

  5. [5]

    Apache Authors. 2023. [Bug][MetaSchedule] failed to run apply_trace generated by print(sch.trace) for int8 conv2d workload #14112. https://github.com/apache/tvm/issues/14112

  6. [6]

    Maximilian Beck, Korbinian Pöppel, Phillip Lippe, and Sepp Hochreiter. 2025. Tiled flash linear attention: more efficient linear RNN and xLSTM kernels. arXiv preprint arXiv:2503.14376 (2025)

  7. [7]

    Lukas Bernhard, Nico Schiller, Moritz Schloegel, Nils Bars, and Thorsten Holz. 2024. DarthShader: fuzzing WebGPU shader translators & compilers. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security (CCS)

  8. [8]

    Adam Betts, Nathan Chong, Alastair Donaldson, Shaz Qadeer, and Paul Thomson. 2012. GPUVerify: a verifier for GPU kernels. In Proceedings of the 27th ACM International Conference on Object Oriented Programming, Systems, Languages, and Applications (OOPSLA)

  9. [9]

    Michael Boyer, Kevin Skadron, and Westley Weimer. 2008. Automated dynamic analysis of CUDA programs. In Proceedings of the Third Workshop on Software Tools for MultiCore Systems (STMCS)

  10. [10]

    Junjie Chen, Yihua Liang, Qingchao Shen, Jiajun Jiang, and Shuochuan Li. 2023. Toward Understanding Deep Learning Framework Bugs. ACM Transactions on Software Engineering and Methodology (TOSEM) (2023)

  11. [11]

    Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: an automated end-to-end optimizing compiler for deep learning. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI)

  12. [12]

    Edoardo Cittadini, Mauro Marinoni, and Giorgio Buttazzo. 2025. A hardware accelerator to support deep learning processor units in real-time image processing. Engineering Applications of Artificial Intelligence (2025)

  13. [13]

    Anthony Di Franco, Hui Guo, and Cindy Rubio-González. 2017. A comprehensive study of real-world numerical bug characteristics. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE)

  14. [14]

    Alastair F. Donaldson, Hugues Evrard, Andrei Lascu, and Paul Thomson. 2017. Automated testing of graphics shader compilers. Proceedings of the ACM on Programming Languages (PACMPL), OOPSLA (2017)

  15. [15]

    Ariel Eizenberg, Yuanfeng Peng, Toma Pigli, William Mansky, and Joseph Devietti. 2017. BARRACUDA: binary-level analysis of runtime races in CUDA programs. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI)

  16. [16]

    Karine Even-Mendoza, Arindam Sharma, Alastair F. Donaldson, and Cristian Cadar. 2023. GrayC: greybox fuzzing of compilers and analysers for C. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA)

  17. [17]

    Geoff Gerfin and Vyas Venkataraman. 2012. Debugging experience with CUDA-GDB and CUDA-Memcheck. In GPU Technology Conference (GTC)

  18. [18]

    GitHub, Inc. 2008. GitHub. https://github.com

  19. [19]

    Ganesh Gopalakrishnan, Ignacio Laguna, Ang Li, Pavel Panchekha, Cindy Rubio-González, and Zachary Tatlock. Guarding numerics amidst rising heterogeneity. In Proceedings of the 5th IEEE/ACM International Workshop on Software Correctness for HPC Applications (Correctness)

Vol. 1, No. 1, Article . Publication date: May 2026.

  21. [21]

    Qianyu Guo, Xiaofei Xie, Yi Li, Xiaoyu Zhang, Yang Liu, Xiaohong Li, and Chao Shen. 2020. Audee: Automated Testing for Deep Learning Frameworks. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (ASE)

  22. [22]

    Ahan Gupta, Yueming Yuan, Devansh Jain, Yuhao Ge, David Aponte, Yanqi Zhou, and Charith Mendis. 2025. SPLAT: a framework for optimised GPU code-generation for SParse reguLar ATtention. Proceedings of the ACM on Programming Languages (PACMPL), OOPSLA (2025)

  23. [23]

    Halide. 2025. halide/Halide: a language for fast, portable data-parallel computation. https://github.com/halide/Halide

  24. [24]

    Halide Authors. 2025. undef prunes select branch incorrectly #8667. https://github.com/halide/Halide/issues/8667

  25. [25]

    Md Johirul Islam, Giang Nguyen, Rangeet Pan, and Hridesh Rajan. 2019. A comprehensive study on deep learning bug characteristics. In Proceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE)

  26. [26]

    Mohammad Majharul Islam and Abdullah Muzahid. 2018. Bugaroo: exposing memory model bugs in many-core systems. In Proceedings of the 29th IEEE International Symposium on Software Reliability Engineering (ISSRE)

  27. [27]

    Bo Jiang, Xiaoyan Wang, Wing Kwong Chan, T. H. Tse, Na Li, Yongfeng Yin, and Zhenyu Zhang. 2020. CUDAsmith: a fuzzer for CUDA compilers. In Proceedings of the 44th IEEE Annual Computers, Software, and Applications Conference (COMPSAC)

  28. [28]

    Eliska Kloberdanz, Kyle G. Kloberdanz, and Wei Le. 2022. DeepStability: a study of unstable numerical methods and their solutions in deep learning. In Proceedings of the 44th IEEE/ACM International Conference on Software Engineering (ICSE)

  29. [29]

    Ignacio Laguna. 2019. FPChecker: detecting floating-point exceptions in GPU applications. In Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering (ASE)

  30. [30]

    Ignacio Laguna and Ganesh Gopalakrishnan. 2022. Finding inputs that trigger floating-point exceptions in GPUs via Bayesian optimization. In Proceedings of the IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC)

  31. [31]

    Ignacio Laguna, Xinyi Li, and Ganesh Gopalakrishnan. 2022. BinFPE: accurate floating-point exception detection for GPU applications. In Proceedings of the 11th ACM SIGPLAN International Workshop on the State of the Art in Program Analysis (SOAP)

  32. [32]

    Chris Lattner. 2008. LLVM and Clang: next generation compiler technology. In Proceedings of the BSD Conference (BSDCan)

  33. [33]

    Guodong Li, Peng Li, Geof Sawaya, Ganesh Gopalakrishnan, Indradeep Ghosh, and Sreeranga P. Rajan. 2012. GKLEE: concolic verification and test generation for GPUs. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)

  34. [34]

    Jianling Li, Shangzhan Li, Zhenye Gao, Qi Shi, Yuxuan Li, Zefan Wang, Jiacheng Huang, Haojie Wang, Jianrong Wang, Xu Han, Zhiyuan Liu, and Maosong Sun. 2025. TritonBench: benchmarking large language model capabilities for generating Triton operators. In Findings of the Association for Computational Linguistics (ACL Findings)

  35. [35]

    Jiashi Li and Shengyu Liu. 2025. FlashMLA: efficient multi-head latent attention kernels. https://github.com/deepseek-ai/FlashMLA

  36. [36]

    Wentao Li, Jianhua Sun, and Hao Chen. 2019. Detecting undefined behaviors in CUDA C. IEEE Access (2019)

  37. [37]

    Xinyi Li, Ignacio Laguna, Bo Fang, Katarzyna Swirydowicz, Ang Li, and Ganesh Gopalakrishnan. 2023. Design and evaluation of GPU-FPX: a low-overhead tool for floating-point exception detection in NVIDIA GPUs. In Proceedings of the 32nd ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC)

  38. [38]

    Christopher Lidbury, Andrei Lascu, Nathan Chong, and Alastair F. Donaldson. 2015. Many-core compiler fuzzing. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI)

  39. [39]

    Ben Limpanukorn, Jiyuan Wang, Hong Jin Kang, Eric Zitong Zhou, and Miryung Kim. 2025. Fuzzing MLIR compilers with custom mutation synthesis. In Proceedings of the 47th IEEE/ACM International Conference on Software Engineering (ICSE)

  40. [40]

    Vsevolod Livinskii, Dmitry Babokin, and John Regehr. 2020. Random testing for C and C++ compilers with YARPGen. Proceedings of the ACM on Programming Languages (PACMPL), OOPSLA (2020)

  41. [41]

    Haoyang Ma. 2023. A survey of modern compiler fuzzing. arXiv preprint arXiv:2306.06884 (2023)

  42. [42]

    Hasan Mohsin. 2022. WGSLsmith: a random generator of WebGPU shader programs. Master’s thesis, Imperial College London

  43. [43]

    John Nickolls. 2007. GPU parallel computing architecture and CUDA programming model. In Proceedings of the 19th IEEE Hot Chips Symposium (HCS)

  44. [44]

    NVIDIA. 2021. cuBLAS: basic linear algebra on NVIDIA GPUs. https://developer.nvidia.com/cublas

  45. [45]

    NVIDIA. 2021. NVIDIA cuDNN. https://developer.nvidia.com/cudnn

  46. [46]

    NVIDIA. 2025. NVIDIA graphics cards. https://www.nvidia.com/en-us/geforce/graphics-cards/

  47. [47]

    NVIDIA. 2025. NVIDIA/cuda-tile: an MLIR-based intermediate representation for tile-based CUDA kernel optimization. https://github.com/NVIDIA/cuda-tile

  48. [48]

    NVIDIA. 2025. NVIDIA/warp: a Python framework for accelerated simulation, data generation and spatial computing. https://github.com/NVIDIA/warp

  49. [49]

    NVIDIA. 2025. Triton Inference Server. https://github.com/triton-inference-server/server

  50. [50]

    NVIDIA Authors. 2025. [BUG] tile operations produce unexpected results #688. https://github.com/NVIDIA/warp/issues/688

  51. [51]

    OpenAI. 2021. Introducing Triton: open-source GPU programming for neural networks. https://openai.com/index/triton/

  52. [52]

    OpenXLA. 2025. openxla/XLA: a machine learning compiler for GPUs, CPUs, and ML accelerators. https://github.com/openxla/xla

  53. [53]

    Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. 2025. KernelBench: can LLMs write efficient GPU kernels? arXiv preprint arXiv:2502.10517 (2025)

  54. [54]

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: an imperative style, high-pe...

  55. [55]

    Hung Viet Pham, Thibaud Lutellier, Weizhen Qi, and Lin Tan. 2019. CRADLE: cross-backend validation to detect and localize bugs in deep learning libraries. In Proceedings of the 41st IEEE/ACM International Conference on Software Engineering (ICSE)

  56. [56]

    PyTorch. 2025. pytorch/pytorch: tensors and dynamic neural networks in Python with strong GPU acceleration. https://github.com/pytorch/pytorch

  57. [57]

    PyTorch Authors. 2024. Failure in generating a kernel with 3 tile groups #141121. https://github.com/pytorch/pytorch/issues/141121

  58. [58]

    PyTorch Authors. 2025. flex_attention + dynamic=True with large batch or heads causes Triton error [CUDA]: invalid argument #157018. https://github.com/pytorch/pytorch/issues/157018

  59. [59]

    Ravishka Rathnasuriya, Nidhi Majoju, Zihe Song, and Wei Yang. 2025. An investigation on numerical bugs in GPU programs towards automated bug detection. Proceedings of the ACM on Software Engineering (PACMSE), ISSTA (2025)

  60. [60]

    Jason Sanders and Edward Kandrot. 2010. CUDA by example: an introduction to general-purpose GPU programming. Addison-Wesley Professional

  61. [61]

    Qingchao Shen, Haoyang Ma, Junjie Chen, Yongqiang Tian, Shing-Chi Cheung, and Xiang Chen. 2021. A comprehensive study of deep learning compiler bugs. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE)

  62. [62]

    Cristina Silvano, Daniele Ielmini, Fabrizio Ferrandi, Leandro Fiorin, Serena Curzel, Luca Benini, Francesco Conti, Angelo Garofalo, Cristian Zambelli, Enrico Calore, Sebastiano Schifano, Maurizio Palesi, Giuseppe Ascia, Davide Patti, Nicola Petra, Davide De Caro, Luciano Lavagno, Teodoro Urso, Valeria Cardellini, Gian Carlo Cardarilli, Robert Birke, and S...

  63. [63]

    Tyler Sorensen and Alastair F. Donaldson. 2016. Exposing errors related to weak memory in GPU applications. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI)

  64. [64]

    SPCL. 2025. spcl/dace: DaCe – data centric parallel programming. https://github.com/spcl/dace

  65. [65]

    Benjamin F. Spector, Simran Arora, Aaryan Singhal, Daniel Y. Fu, and Christopher Ré. 2024. ThunderKittens: simple, fast, and adorable AI kernels. arXiv preprint arXiv:2410.20399 (2024)

  66. [66]

    Chenyao Suo, Jianrong Wang, Yongjia Wang, Jiajun Jiang, Qingchao Shen, and Junjie Chen. 2025. DESIL: detecting silent bugs in MLIR compiler infrastructure. Proceedings of the ACM on Programming Languages (PACMPL), OOPSLA (2025)

  67. [67]

    Tile-AI. 2025. tile-ai/tilelang: a domain-specific language for high-performance GPU/CPU/accelerator kernels. https://github.com/tile-ai/tilelang

  68. [68]

    Tile-AI Authors. 2025. [Bug] compilation error for mma on NVIDIA Hopper GPU #101. https://github.com/tile-ai/tilelang/issues/101

  69. [69]

    Tile-AI Authors. 2025. [Bug] compile/“cached” still not loading cached kernel for example in example_mha_bwd #313. https://github.com/tile-ai/tilelang/issues/313

  70. [70]

    Tile-AI Authors. 2025. [BUG] incorrect __sync_thread_partial placement in generated kernel code #1604. https://github.com/tile-ai/tilelang/issues/1604

  71. [71]

    Tile-AI Authors. 2025. [BUG Report] encounter dead lock when implementing deepgemm with 8 warps on Hopper #359. https://github.com/tile-ai/tilelang/issues/359

  72. [72]

    Philippe Tillet, Hsiang-Tsung Kung, and David Cox. 2019. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL)

  73. [73]

    Devesh Tiwari, Saurabh Gupta, James Rogers, Don Maxwell, Paolo Rech, Sudharshan Vazhkudai, Daniel Oliveira, Dave Londo, Nathan DeBardeleben, Philippe Navaux, Luigi Carro, and Arthur Bland. 2015. Understanding GPU errors on large-scale HPC systems and the implications for system design and operation. In Proceedings of the 21st IEEE International Symposium ...

  74. [74]

    Triton-Lang. 2025. triton-lang/triton: development repository for the Triton language and compiler. https://github.com/triton-lang/triton

  75. [75]

    Triton-Lang Authors. 2022. Segfault in dds_matmul #443. https://github.com/triton-lang/triton/issues/443

  76. [76]

    Triton-Lang Authors. 2023. Segmentation fault with matmul + argmax #1846. https://github.com/triton-lang/triton/issues/1846

  77. [77]

    Triton-Lang Authors. 2023. WSMaterialization generates invalid IR – modifies module’s num-warps field without modifying tensor layouts #2658. https://github.com/triton-lang/triton/issues/2658

  78. [78]

    Triton-Lang Authors. 2024. Assertion failure in linear layouts when num_warps = 8, but passes with num_warps = 4 #5265. https://github.com/triton-lang/triton/issues/5265

  79. [79]

    Triton-Lang Authors. 2025. AMD ReorderInstruction pass will reorder the global_load ahead of local_store and break the local_prefetch logic which will miss match TritonAMDGPULowerInstructionSchedHints::createLocalPrefetchSchedule code logic #6750. https://github.com/triton-lang/triton/issues/6750

  80. [80]

    Dewei Wang, Wei Zhu, Liyang Ling, Ettore Tiotto, Quintin Wang, Whitney Tsang, Julian Oppermann, and Jacky Deng. 2025. ML-Triton, a multi-level compilation and language extension to Triton GPU programming. arXiv preprint arXiv:2503.14985 (2025)

Showing first 80 references.