Characterizing Real-World Bugs in Tile Programs for Automated Bug Detection

Aaryaa Moharir; Nidhi Majoju; Ravishka Rathnasuriya; Tao Xie; Tingxi Li; Wei Yang; Zihe Song

arxiv: 2605.19652 · v1 · pith:B7ZNLXZTnew · submitted 2026-05-19 · 💻 cs.SE

Ravishka Rathnasuriya, Zihe Song, Nidhi Majoju, Aaryaa Moharir, Tingxi Li, Wei Yang, Tao Xie

Pith reviewed 2026-05-20 04:34 UTC · model grok-4.3

classification 💻 cs.SE

keywords: tile programs · code generation bugs · GPU kernels · compiler testing · bug characterization · deep learning · scientific computing

The pith

Tile program code generation bugs follow patterns tied to input shapes and compilation stages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts the first systematic analysis of bugs that arise when tile-based frameworks generate GPU kernel code for deep learning and scientific computing. Researchers collected 401 GitHub bug reports, filtered them down to 301 relevant cases, and sorted the bugs by root cause, symptom, the input patterns that expose them, the test oracles that catch them, and the fixes that developers apply. A reader should care because these bugs often appear only as silent correctness or performance failures that ordinary compiler tests do not catch, and the multi-stage pipelines plus tile-specific language rules make the problems hard to diagnose. The resulting categories give concrete starting points for new debugging and repair tools aimed at tile compilers.

Core claim

This paper presents the first systematic study of tile-program code generation bugs. We curate 401 bug reports from GitHub and identify 301 tile-program codegen bugs for analysis, categorizing the root causes, symptoms, input patterns, test oracles that trigger these bugs, and the strategies used to fix bugs. Our study provides foundational insights for building debugging, testing, and repair tools tailored to tile-based compiler infrastructures.

What carries the argument

Manual curation and categorization of 301 tile-program bugs drawn from GitHub, organized by root causes in multi-stage compilation, symptoms, input shapes, data types, backend targets, and developer fix strategies.

If this is right

The identified input patterns can be used to generate more effective test cases for tile compilers.
Common symptoms point to places where silent errors are most likely to appear in production GPU kernels.
Fix strategies show that repair tools must incorporate knowledge of tile abstractions and pipeline stages.
Categorization of test oracles suggests concrete checkers that current general-purpose compiler testers lack.
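As a rough illustration of how shape-related input patterns could drive test generation, the sketch below sweeps a grid of shapes and compares a tiled computation against an untiled reference. Everything in it is hypothetical: the row-sum kernel, the `TILE` size, the `mask_tail` switch that simulates a boundary-handling bug, and the helper names are stand-ins chosen for this example and do not come from the paper or from any tile framework.

```python
import itertools
import random

TILE = 4  # hypothetical tile size for this sketch


def reference_row_sum(matrix):
    """Oracle: plain row-wise sum with no tiling."""
    return [sum(row) for row in matrix]


def tiled_row_sum(matrix, tile=TILE, mask_tail=True):
    """Row-wise sum computed tile by tile along the column axis.

    With mask_tail=False the ragged last tile is skipped, which mimics a
    boundary-handling bug that only appears when the column count is not
    a multiple of the tile size.
    """
    out = []
    for row in matrix:
        n = len(row)
        full = (n // tile) * tile  # columns covered by complete tiles
        total = 0
        for start in range(0, full, tile):
            total += sum(row[start:start + tile])
        if mask_tail and full < n:
            total += sum(row[full:])  # handle the partial tile
        out.append(total)
    return out


def failing_shapes(kernel, rows_options, cols_options, seed=0):
    """Differential test: return the (rows, cols) shapes where the kernel
    disagrees with the reference oracle."""
    rng = random.Random(seed)
    failures = []
    for rows, cols in itertools.product(rows_options, cols_options):
        # Strictly positive entries, so a dropped tail always changes the sum.
        matrix = [[rng.randint(1, 9) for _ in range(cols)] for _ in range(rows)]
        if kernel(matrix) != reference_row_sum(matrix):
            failures.append((rows, cols))
    return failures
```

Sweeping column counts on both sides of a tile multiple (for example 3, 4, and 5 with a tile size of 4) is what exposes the simulated bug; shapes that divide evenly pass in both variants.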

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Automated detection systems could encode the reported root-cause patterns as static or dynamic checks inside tile compilers.
The same curation method could be applied to bugs in other high-performance DSLs that use multi-stage code generation.
The categories supply a benchmark set that future testing tools for tile programs can be measured against.

Load-bearing premise

The 401 GitHub bug reports are representative of real-world tile-program code generation bugs and the manual labels for root causes and symptoms are reliable even without full reproduction environments or original developer intent.

What would settle it

A new collection of tile-program bugs from additional repositories or production runs that shows substantially different distributions of root causes or input triggers would falsify the reported categories.
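One way to make "substantially different distributions" concrete is a distance between the category shares of two collections. The sketch below uses total variation distance, which is a choice made here for illustration and is not a method named in the paper. The only figure taken from this page is 147 of 301 bugs in the "Type and Operator Bugs" category; the second collection's counts are invented, and no threshold for "substantially different" is defined by the source.

```python
def normalize(counts):
    """Turn raw category counts into shares that sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must contain at least one bug")
    return {name: value / total for name, value in counts.items()}


def total_variation(counts_a, counts_b):
    """Total variation distance between two category distributions.

    0.0 means identical shares; 1.0 means no overlap at all.
    """
    p = normalize(counts_a)
    q = normalize(counts_b)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# 147 of 301 is reported on this page; "other" is simply the remainder.
studied = {"type_and_operator": 147, "other": 301 - 147}

# Hypothetical replication sample, invented for this example only.
replication = {"type_and_operator": 60, "other": 140}

distance = total_variation(studied, replication)
```

A replication study would still need to justify its own cutoff for `distance`, ideally alongside a significance test that accounts for sample size.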

Figures

Figures reproduced from arXiv: 2605.19652 by Aaryaa Moharir, Nidhi Majoju, Ravishka Rathnasuriya, Tao Xie, Tingxi Li, Wei Yang, Zihe Song.

**Figure 2.** An instance of a resource allocation bug triggered by premature warpgroup register deallocation in …

**Figure 3.** A tile-level argmax reduction triggering an operator-implementation bug in Triton [75].

Excerpt of the surrounding paper text: … this process, the resulting mappings may become ill-defined even if the computation itself is type-correct. An illustrative case is reported in Apache TVM [5], where failures in transform_layout, transform_block_layout, and IndexMap::NonSurjectiveInverse are triggered by data-type mismatches. These transformations exist…

Original abstract

Tile-based programming frameworks are increasingly adopted to write high-performance GPU kernels in domains such as deep learning and scientific computing. While these frameworks enhance productivity and hardware utilization, their multi-stage compilation pipelines introduce distinct code generation bugs that are tightly coupled to input shapes, data types, and backend targets. These bugs often manifest as silent correctness or performance issues, making them difficult to detect using existing compiler testing tools. Additionally, the unique programming conventions of tile domain-specific languages complicate root cause identification, while fixing such bugs demands specialized knowledge of tile abstractions and compilation pipelines. Despite the growing adoption of tile-based systems, their code generation bugs remain largely unexplored. This paper presents the first systematic study of tile-program code generation bugs. We curate 401 bug reports from GitHub and identify 301 tile-program codegen bugs for analysis, categorizing the root causes, symptoms, input patterns, test oracles that trigger these bugs, and the strategies used to fix bugs. Our study provides foundational insights for building debugging, testing, and repair tools tailored to tile-based compiler infrastructures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This is a useful first taxonomy of tile-program codegen bugs from real GitHub reports, though the curation and labeling steps need more detail to judge how solid the categories are.

The paper's main contribution is a dataset and a taxonomy: it pulls 401 GitHub reports, filters them to 301 tile-program code generation bugs, and sorts those by root cause, symptom, input pattern, oracle, and fix. That gives a concrete starting point for people building debuggers or testers for these frameworks, which are increasingly used for GPU kernels in deep learning and scientific computing. The bugs described are often silent, shape-dependent, and tied to the multi-stage compilation in tile DSLs, which explains why off-the-shelf compiler testing tools miss them. The categorization work is the part that could actually help tool builders pick better oracles or repair strategies.

What the authors do well is ground the study in actual reported cases instead of synthetic examples, and they make a clear case that tile programs have distinct bug patterns worth studying separately.

The soft spots sit in the data handling. GitHub issues are self-selected and often short on reproduction scripts or clear developer intent, so the 301 cases might over-weight noticeable problems and under-weight quiet ones. Manual labeling from titles, descriptions, and comments alone can confuse performance issues with correctness issues, especially when errors depend on specific shapes or backends. The abstract does not lay out the exact filtering rules or any checks for labeling consistency, so the reliability of the taxonomy depends on how carefully those steps are documented in the full methods.

This work is for software engineering researchers who care about compiler testing and domain-specific languages for accelerators. A reader looking for real-world examples to motivate new tools would get direct value from the categories and the discussion of fix strategies. It deserves a serious referee because the topic fills a gap and the source material is external bug reports, even if the methodology section will need tightening to make the claims fully convincing.

Referee Report

2 major / 2 minor

Summary. The paper presents the first systematic study of tile-program code generation bugs in tile-based programming frameworks for high-performance GPU kernels. The authors curate 401 bug reports from GitHub, filter to 301 tile-program codegen bugs, and categorize root causes, symptoms, input patterns, test oracles that trigger the bugs, and fix strategies to provide insights for building specialized debugging, testing, and repair tools.

Significance. If the curation process and manual categorization prove reliable and representative, this empirical study would deliver valuable foundational data on an underexplored class of silent, shape- and backend-coupled bugs that evade standard compiler testing. It could directly inform tool-building for tile-based compiler infrastructures in deep learning and scientific computing. The work earns credit for grounding analysis in real GitHub reports rather than synthetic cases and for producing a multi-dimensional taxonomy (root causes through fixes) that is actionable for practitioners.

major comments (2)

[Methodology / Data Collection] The description of curating 401 GitHub reports and identifying 301 codegen bugs provides no details on selection criteria, inclusion/exclusion rules, inter-rater agreement, or validation against reproduction environments. This is load-bearing for the central claim that the resulting taxonomy reflects real-world tile-program bugs, as GitHub issues are self-selected and often lack full context or developer intent.
[Categorization and Analysis] Manual categorization of root causes, symptoms, and oracles from titles, descriptions, and comments alone risks misattribution (e.g., shape-dependent silent errors labeled as performance issues). Without reported measures of labeling reliability or access to reproduction scripts, the categories' correctness cannot be assessed, weakening the insights offered for automated bug detection.

minor comments (2)

[Introduction] Clarify the exact definition of 'tile-program codegen bug' versus usage error or unrelated defect early in the paper to aid reader interpretation of the 301 cases.
[Results] Consider adding a table or figure summarizing the distribution of categories (e.g., percentage of bugs per root cause) for quicker overview.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and outline the revisions we will make to strengthen the transparency of our methodology and analysis.

Referee: [Methodology / Data Collection] The description of curating 401 GitHub reports and identifying 301 codegen bugs provides no details on selection criteria, inclusion/exclusion rules, inter-rater agreement, or validation against reproduction environments. This is load-bearing for the central claim that the resulting taxonomy reflects real-world tile-program bugs, as GitHub issues are self-selected and often lack full context or developer intent.

Authors: We agree that greater transparency in the curation process is required. In the revised manuscript we will expand the Methodology section with explicit inclusion/exclusion criteria used to select the 301 tile-program codegen bugs from the initial 401 reports, along with inter-rater agreement statistics obtained when multiple authors independently reviewed the issues. We will also clarify the degree to which we could validate reports against available code snippets and developer comments, while acknowledging that complete reproduction environments are not provided in most GitHub issues.

revision: yes

Referee: [Categorization and Analysis] Manual categorization of root causes, symptoms, and oracles from titles, descriptions, and comments alone risks misattribution (e.g., shape-dependent silent errors labeled as performance issues). Without reported measures of labeling reliability or access to reproduction scripts, the categories' correctness cannot be assessed, weakening the insights offered for automated bug detection.

Authors: We acknowledge the inherent limitations of text-based manual categorization. The revised version will include a new subsection that defines each category with concrete examples and reports inter-annotator agreement measures for the labeling process. We will explicitly discuss the unavailability of reproduction scripts for many reports as a limitation and describe how we reduced misattribution risk by cross-referencing issue comments and code fragments. These changes will allow readers to better evaluate the taxonomy's reliability for guiding automated bug detection tools.

revision: yes
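Neither the report nor the rebuttal names a specific agreement statistic. One common choice for two annotators assigning categorical labels is Cohen's kappa, sketched below under that assumption. The function name, the label vocabulary, and the six sample labels are all hypothetical and serve only to show the arithmetic.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    rate and p_e is the agreement expected by chance from each rater's
    own label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; agreement is total.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical symptom labels from two annotators for six bug reports.
rater_a = ["correctness", "correctness", "performance",
           "crash", "performance", "correctness"]
rater_b = ["correctness", "performance", "performance",
           "crash", "performance", "correctness"]

kappa = cohens_kappa(rater_a, rater_b)
```

In this toy sample the raters disagree on one report, the kind of correctness-versus-performance confusion the referee describes. With more than two annotators, a multi-rater statistic such as Fleiss' kappa would be the analogous choice.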

Circularity Check

0 steps flagged

No circularity in empirical bug categorization study

This paper is an empirical study that curates 401 GitHub bug reports, filters to 301 tile-program codegen cases, and performs manual categorization of root causes, symptoms, input patterns, oracles, and fixes. It contains no mathematical derivations, equations, fitted parameters, or self-referential definitions. All claims rest on external data sources and direct inspection rather than any internal reduction or self-citation chain, rendering the analysis self-contained with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the representativeness of GitHub-sourced bug reports and the validity of the manual categorization process as a basis for insights into tile-program bugs.

axioms (1)

Domain assumption: GitHub bug reports provide a representative and unbiased sample of real-world tile-program code generation bugs.
Basis: the study begins by curating 401 bug reports from GitHub as the foundation for identifying and analyzing 301 bugs.

pith-pipeline@v0.9.0 · 5733 in / 1239 out tokens · 43741 ms · 2026-05-20T04:34:30.439994+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation to the paper passage: unclear

Paper passage: We curate 401 bug reports from GitHub and identify 301 tile-program codegen bugs for analysis, categorizing the root causes, symptoms, input patterns, test oracles that trigger these bugs, and the strategies used to fix bugs.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · relation to the paper passage: unclear

Paper passage: Table 2. The taxonomy of bug causes … Type and Operator Bugs … 147 (48.84%)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

95 extracted references · 95 canonical work pages · 2 internal anchors

[1]

S. M. Mojahidul Ahsan, Tamzidul Hoque, Md Sakib Hasan, Mrittika Chowdhury, and Anurag Dhungel. 2025. Hardware accelerators for artificial intelligence. In AI-Enabled Electronic Circuit and System Design: From Ideation to Utilization. Springer.

[2]

AIKernelResearch. 2025. TileBug: project repository. https://github.com/AIKernelResearch/TileCodegenBugStudy.github.io

[3]

AIKernelResearch. 2025. TileCodegenBug: project website. https://aikernelresearch.github.io/TileCodegenBugStudy.github.io/

[4]

Apache. 2025. apache/TVM: open deep learning compiler stack for CPU, GPU and specialized accelerators. https://github.com/apache/tvm

[5]

Apache Authors. 2023. [Bug][MetaSchedule] failed to run apply_trace generated by print(sch.trace) for int8 conv2d workload #14112. https://github.com/apache/tvm/issues/14112

[6]

Maximilian Beck, Korbinian Pöppel, Phillip Lippe, and Sepp Hochreiter. 2025. Tiled flash linear attention: more efficient linear RNN and xLSTM kernels. arXiv preprint arXiv:2503.14376 (2025).

[7]

Lukas Bernhard, Nico Schiller, Moritz Schloegel, Nils Bars, and Thorsten Holz. 2024. DarthShader: fuzzing WebGPU shader translators & compilers. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security (CCS).

[8]

Adam Betts, Nathan Chong, Alastair Donaldson, Shaz Qadeer, and Paul Thomson. 2012. GPUVerify: a verifier for GPU kernels. In Proceedings of the 27th ACM International Conference on Object Oriented Programming, Systems, Languages, and Applications (OOPSLA).

[9]

Michael Boyer, Kevin Skadron, and Westley Weimer. 2008. Automated dynamic analysis of CUDA programs. In Proceedings of the Third Workshop on Software Tools for MultiCore Systems (STMCS).

[10]

Junjie Chen, Yihua Liang, Qingchao Shen, Jiajun Jiang, and Shuochuan Li. 2023. Toward Understanding Deep Learning Framework Bugs. ACM Transactions on Software Engineering and Methodology (TOSEM) (2023).

[11]

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: an automated end-to-end optimizing compiler for deep learning. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI).

[12]

Edoardo Cittadini, Mauro Marinoni, and Giorgio Buttazzo. 2025. A hardware accelerator to support deep learning processor units in real-time image processing. Engineering Applications of Artificial Intelligence (2025).

[13]

Anthony Di Franco, Hui Guo, and Cindy Rubio-González. 2017. A comprehensive study of real-world numerical bug characteristics. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE).

[14]

Alastair F. Donaldson, Hugues Evrard, Andrei Lascu, and Paul Thomson. 2017. Automated testing of graphics shader compilers. Proceedings of the ACM on Programming Languages (PACMPL), OOPSLA (2017).

[15]

Ariel Eizenberg, Yuanfeng Peng, Toma Pigli, William Mansky, and Joseph Devietti. 2017. BARRACUDA: binary-level analysis of runtime races in CUDA programs. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI).

[16]

Karine Even-Mendoza, Arindam Sharma, Alastair F. Donaldson, and Cristian Cadar. 2023. GrayC: greybox fuzzing of compilers and analysers for C. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA).

[17]

Geoff Gerfin and Vyas Venkataraman. 2012. Debugging experience with CUDA-GDB and CUDA-Memcheck. In GPU Technology Conference (GTC).

[18]

GitHub, Inc. 2008. GitHub. https://github.com

[19]

Ganesh Gopalakrishnan, Ignacio Laguna, Ang Li, Pavel Panchekha, Cindy Rubio-González, and Zachary Tatlock

work page
[20]

In Proceedings of the 5th IEEE/ACM International Workshop on Software Correctness for HPC Applications (Correctness)

Guarding numerics amidst rising heterogeneity. In Proceedings of the 5th IEEE/ACM International Workshop on Software Correctness for HPC Applications (Correctness). , Vol. 1, No. 1, Article . Publication date: May 2026. Characterizing Real-World Bugs in Tile Programs for Automated Bug Detection 21

work page 2026
[21]

Qianyu Guo, Xiaofei Xie, Yi Li, Xiaoyu Zhang, Yang Liu, Xiaohong Li, and Chao Shen. 2020. Audee: Automated Testing for Deep Learning Frameworks. InProceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (ASE)

work page 2020
[22]

Ahan Gupta, Yueming Yuan, Devansh Jain, Yuhao Ge, David Aponte, Yanqi Zhou, and Charith Mendis. 2025. SPLAT: a framework for optimised GPU code-generation for SParse reguLar ATtention. Proceedings of the ACM on Programming Languages (PACMPL), OOPSLA (2025)

work page 2025
[23]

Halide. 2025. halide/Halide: a language for fast, portable data-parallel computation. https://github.com/halide/Halide

work page 2025
[24]

Halide Authors. 2025. undef prunes select branch incorrectly #8667. https://github.com/halide/Halide/issues/8667

work page 2025
[25]

Md Johirul Islam, Giang Nguyen, Rangeet Pan, and Hridesh Rajan. 2019. A comprehensive study on deep learning bug characteristics. In Proceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE)

work page 2019
[26]

Mohammad Majharul Islam and Abdullah Muzahid. 2018. Bugaroo: exposing memory model bugs in many-core systems. In Proceedings of the 29th IEEE International Symposium on Software Reliability Engineering (ISSRE)

work page 2018
[27]

Bo Jiang, Xiaoyan Wang, Wing Kwong Chan, T. H. Tse, Na Li, Yongfeng Yin, and Zhenyu Zhang. 2020. CUDA- smith: a fuzzer for CUDA compilers. In Proceedings of the 44th IEEE Annual Computers, Software, and Applications Conference (COMPSAC)

work page 2020
[28]

Kloberdanz, and Wei Le

Eliska Kloberdanz, Kyle G. Kloberdanz, and Wei Le. 2022. DeepStability: a study of unstable numerical methods and their solutions in deep learning. In Proceedings of the 44th IEEE/ACM International Conference on Software Engineering (ICSE)

work page 2022
[29]

Ignacio Laguna. 2019. FPChecker: detecting floating-point exceptions in GPU applications. In Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering (ASE)

work page 2019
[30]

Ignacio Laguna and Ganesh Gopalakrishnan. 2022. Finding inputs that trigger floating-point exceptions in GPUs via Bayesian optimization. In Proceedings of the IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC)

work page 2022
[31]

Ignacio Laguna, Xinyi Li, and Ganesh Gopalakrishnan. 2022. BinFPE: accurate floating-point exception detection for GPU applications. In Proceedings of the 11th ACM SIGPLAN International Workshop on the State of the Art in Program Analysis (SOAP)

work page 2022
[32]

Chris Lattner. 2008. LLVM and Clang: next generation compiler technology. In Proceedings of the BSD Conference (BSDCan)

work page 2008
[33]

Guodong Li, Peng Li, Geof Sawaya, Ganesh Gopalakrishnan, Indradeep Ghosh, and Sreeranga P. Rajan. 2012. GKLEE: concolic verification and test generation for GPUs. InProceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)

work page 2012
[34]

Jianling Li, Shangzhan Li, Zhenye Gao, Qi Shi, Yuxuan Li, Zefan Wang, Jiacheng Huang, Haojie Wang, Jianrong Wang, Xu Han, Zhiyuan Liu, and Maosong Sun. 2025. TritonBench: benchmarking large language model capabilities for generating Triton operators. In Findings of the Association for Computational Linguistics (ACL Findings)

work page 2025
[35]

Jiashi Li and Shengyu Liu. 2025. FlashMLA: efficient multi-head latent attention kernels. https://github.com/deepseek- ai/FlashMLA

work page 2025
[36]

Wentao Li, Jianhua Sun, and Hao Chen. 2019. Detecting undefined behaviors in CUDA C. IEEE Access (2019)

work page 2019
[37]

Xinyi Li, Ignacio Laguna, Bo Fang, Katarzyna Swirydowicz, Ang Li, and Ganesh Gopalakrishnan. 2023. Design and evaluation of GPU-FPX: a low-overhead tool for floating-point exception detection in NVIDIA GPUs. In Proceedings of the 32nd ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC)

work page 2023
[38]

Donaldson

Christopher Lidbury, Andrei Lascu, Nathan Chong, and Alastair F. Donaldson. 2015. Many-core compiler fuzzing. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI)

work page 2015
[39]

Ben Limpanukorn, Jiyuan Wang, Hong Jin Kang, Eric Zitong Zhou, and Miryung Kim. 2025. Fuzzing MLIR com- pilers with custom mutation synthesis. In Proceedings of the 47th IEEE/ACM International Conference on Software Engineering (ICSE)

work page 2025
[40]

Vsevolod Livinskii, Dmitry Babokin, and John Regehr. 2020. Random testing for C and C++ compilers with YARPGen. Proceedings of the ACM on Programming Languages (PACMPL), OOPSLA (2020)

work page 2020
[41]

Haoyang Ma. 2023. A survey of modern compiler fuzzing. arXiv preprint arXiv:2306.06884 (2023)

work page arXiv 2023
[42]

Hasan Mohsin. 2022. WGSLsmith: a random generator of WebGPU shader programs. Master’s thesis, Imperial College London

work page 2022
[43]

John Nickolls. 2007. GPU parallel computing architecture and CUDA programming model. In Proceedings of the 19th IEEE Hot Chips Symposium (HCS)

work page 2007
[44]

NVIDIA. 2021. cuBLAS: basic linear algebra on NVIDIA GPUs. https://developer.nvidia.com/cublas

work page 2021
[45]

NVIDIA. 2021. NVIDIA cuDNN. https://developer.nvidia.com/cudnn

work page 2021
[46]

NVIDIA. 2025. NVIDIA graphics cards. https://www.nvidia.com/en-us/geforce/graphics-cards/ , Vol. 1, No. 1, Article . Publication date: May 2026. 22 Rathnasuriya and Song, et al

work page 2025
[47]

NVIDIA. 2025. NVIDIA/cuda-tile: an MLIR-based intermediate representation for tile-based CUDA kernel optimization. https://github.com/NVIDIA/cuda-tile

work page 2025
[48]

NVIDIA. 2025. NVIDIA/warp: a Python framework for accelerated simulation, data generation and spatial computing. https://github.com/NVIDIA/warp

work page 2025
[49]

NVIDIA. 2025. Triton Inference Server. https://github.com/triton-inference-server/server

work page 2025
[50]

NVIDIA Authors. 2025. [BUG] tile operations produce unexpected results #688. https://github.com/NVIDIA/warp/ issues/688

work page 2025
[51]

OpenAI. 2021. Introducing Triton: open-source GPU programming for neural networks. https://openai.com/index/ triton/

work page 2021
[52]

OpenXLA. 2025. openxla/XLA: a machine learning compiler for GPUs, CPUs, and ML accelerators. https://github. com/openxla/xla

work page 2025
[53]

KernelBench: Can LLMs Write Efficient GPU Kernels?

Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. 2025. KernelBench: can LLMs write efficient GPU kernels? arXiv preprint arXiv:2502.10517 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[54]

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: an imperative style, high-pe...

work page 2019
[55]

Hung Viet Pham, Thibaud Lutellier, Weizhen Qi, and Lin Tan. 2019. CRADLE: cross-backend validation to detect and localize bugs in deep learning libraries. In Proceedings of the 41st IEEE/ACM International Conference on Software Engineering (ICSE)

work page 2019
[56]

PyTorch. 2025. pytorch/pytorch: tensors and dynamic neural networks in Python with strong GPU acceleration. https://github.com/pytorch/pytorch

work page 2025
[57]

PyTorch Authors. 2024. Failure in generating a kernel with 3 tile groups #141121. https://github.com/pytorch/pytorch/ issues/141121

work page 2024
[58]

PyTorch Authors. 2025. flex_attention + dynamic=True with large batch or heads causes Triton error [CUDA]: invalid argument #157018. https://github.com/pytorch/pytorch/issues/157018

work page 2025
[59]

Ravishka Rathnasuriya, Nidhi Majoju, Zihe Song, and Wei Yang. 2025. An investigation on numerical bugs in GPU programs towards automated bug detection. Proceedings of the ACM on Software Engineering (PACMSE), ISSTA (2025)

work page 2025
[60]

Jason Sanders and Edward Kandrot. 2010. CUDA by example: an introduction to general-purpose GPU programming. Addison-Wesley Professional

work page 2010
[61]

Qingchao Shen, Haoyang Ma, Junjie Chen, Yongqiang Tian, Shing-Chi Cheung, and Xiang Chen. 2021. A compre- hensive study of deep learning compiler bugs. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE)

work page 2021
[62]

Cristina Silvano, Daniele Ielmini, Fabrizio Ferrandi, Leandro Fiorin, Serena Curzel, Luca Benini, Francesco Conti, Angelo Garofalo, Cristian Zambelli, Enrico Calore, Sebastiano Schifano, Maurizio Palesi, Giuseppe Ascia, Davide Patti, Nicola Petra, Davide De Caro, Luciano Lavagno, Teodoro Urso, Valeria Cardellini, Gian Carlo Cardarilli, Robert Birke, and S...

work page 2025
[63]

Donaldson

Tyler Sorensen and Alastair F. Donaldson. 2016. Exposing errors related to weak memory in GPU applications. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI)

work page 2016
[64]

SPCL. 2025. spcl/dace: DaCe – data centric parallel programming. https://github.com/spcl/dace

work page 2025
[65]

F., Arora, S., Singhal, A., Fu, D

Benjamin F. Spector, Simran Arora, Aaryan Singhal, Daniel Y. Fu, and Christopher Ré. 2024. ThunderKittens: simple, fast, and adorable AI kernels. arXiv preprint arXiv:2410.20399 (2024)

work page arXiv 2024
[66]

Chenyao Suo, Jianrong Wang, Yongjia Wang, Jiajun Jiang, Qingchao Shen, and Junjie Chen. 2025. DESIL: detecting silent bugs in MLIR compiler infrastructure. Proceedings of the ACM on Programming Languages (PACMPL), OOPSLA (2025)

work page 2025
[67]

Tile-AI. 2025. tile-ai/tilelang: a domain-specific language for high-performance GPU/CPU/accelerator kernels. https: //github.com/tile-ai/tilelang

work page 2025
[68]

Tile-AI Authors. 2025. [Bug] compilation error for mma on NVIDIA Hopper GPU #101. https://github.com/tile- ai/tilelang/issues/101

work page 2025
[69]

Tile-AI Authors. 2025. [Bug] compile/“cached” still not loading cached kernel for example in example_mha_bwd #313. https://github.com/tile-ai/tilelang/issues/313

work page 2025
[70]

Tile-AI Authors. 2025. [BUG] incorrect __sync_thread_partial placement in generated kernel code #1604. https: //github.com/tile-ai/tilelang/issues/1604

work page 2025
[71]

Tile-AI Authors. 2025. [BUG Report] encounter dead lock when implementing deepgemm with 8 warps on Hopper #359. https://github.com/tile-ai/tilelang/issues/359 , Vol. 1, No. 1, Article . Publication date: May 2026. Characterizing Real-World Bugs in Tile Programs for Automated Bug Detection 23

work page 2025
[72]

Philippe Tillet, Hsiang-Tsung Kung, and David Cox. 2019. Triton: an intermediate language and compiler for tiled neural network computations. InProceedings of the 3rd ACM SIGPLAN International Workshopon Machine Learning and Programming Languages (MAPL)

work page 2019
[73]

Devesh Tiwari, Saurabh Gupta, James Rogers, Don Maxwell, Paolo Rech, Sudharshan Vazhkudai, Daniel Oliveira, Dave Londo, Nathan DeBardeleben, Philippe Navaux, Luigi Carro, and Arthur Bland. 2015. Understanding GPU errors on large-scale HPC systems and the implications for system design and operation. In Proceedings of the 21st IEEE International Symposium ...

[74]

Triton-Lang. 2025. triton-lang/triton: development repository for the Triton language and compiler. https://github.com/triton-lang/triton

[75]

Triton-Lang Authors. 2022. Segfault in dds_matmul #443. https://github.com/triton-lang/triton/issues/443

[76]

Triton-Lang Authors. 2023. Segmentation fault with matmul + argmax #1846. https://github.com/triton-lang/triton/issues/1846

[77]

Triton-Lang Authors. 2023. WSMaterialization generates invalid IR – modifies module’s num-warps field without modifying tensor layouts #2658. https://github.com/triton-lang/triton/issues/2658

[78]

Triton-Lang Authors. 2024. Assertion failure in linear layouts when num_warps = 8, but passes with num_warps = 4 #5265. https://github.com/triton-lang/triton/issues/5265

[79]

Triton-Lang Authors. 2025. AMD ReorderInstruction pass will reorder the global_load ahead of local_store and break the local_prefetch logic which will miss match TritonAMDGPULowerInstructionSchedHints::createLocalPrefetchSchedule code logic #6750. https://github.com/triton-lang/triton/issues/6750

[80]

Dewei Wang, Wei Zhu, Liyang Ling, Ettore Tiotto, Quintin Wang, Whitney Tsang, Julian Oppermann, and Jacky Deng. 2025. ML-Triton, a multi-level compilation and language extension to Triton GPU programming. arXiv preprint arXiv:2503.14985 (2025)

