Pith · machine review for the scientific record

arXiv: 2604.23466 · v1 · submitted 2026-04-25 · 💻 cs.LG · cs.AI · cs.AR


Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs


Pith reviewed 2026-05-08 08:16 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.AR
keywords CUDA Tile · CuTile · GPU kernel development · Blackwell GPU · Hopper GPU · fused attention · GEMM · performance evaluation

The pith

CuTile reaches 1007 TFLOP/s for fused attention on Blackwell B200 in 60 lines of Python, outperforming FlashAttention-2 by 2.5x.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates NVIDIA's CUDA Tile abstraction, a Python-based tile-centric model for GPU kernels, on H100, B200, and RTX PRO 6000 Blackwell GPUs. It benchmarks GEMM, fused multi-head attention, and LLM inference in BF16/FP16, comparing against cuBLAS, Triton, WMMA, and SIMT. On datacenter Blackwell, CuTile hits high throughput with short code for attention and GEMM, but the same kernels drop to 53% of FlashAttention-2 on the consumer Blackwell variant. Triton maintains 62-101% of cuBLAS across all platforms without tuning. The results highlight strong workload and architecture dependence for CuTile's effectiveness.

Core claim

CuTile's Python abstraction enables efficient Tensor Core and TMA usage, delivering up to 1007 TFLOP/s for fused attention on B200 (2.5x FlashAttention-2) in 60 lines and 52-79% of cuBLAS for GEMM in 22 lines, yet the identical attention kernel achieves only 53% of FlashAttention-2 on RTX PRO 6000 Blackwell, while Triton sustains near cuBLAS performance portably across Hopper and Blackwell.
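As a quick sanity check on this claim, the implied baselines can be derived with simple arithmetic. The derived figures below are our back-of-envelope numbers, not values stated in the paper:

```python
# Back-of-envelope check of the headline ratios. The derived figures below
# are our arithmetic, not numbers stated in the paper.
cutile_attn_b200 = 1007.0   # TFLOP/s, fused attention on B200 (reported)
speedup_vs_fa2 = 2.5        # reported speedup over FlashAttention-2

# Implied FlashAttention-2 throughput on B200:
fa2_b200 = cutile_attn_b200 / speedup_vs_fa2   # about 403 TFLOP/s

# GEMM at 52-79% of cuBLAS means cuBLAS holds a 1.27x-1.92x edge:
cublas_edge_low = 1 / 0.79
cublas_edge_high = 1 / 0.52

print(f"implied FA-2 on B200: {fa2_b200:.0f} TFLOP/s")
print(f"cuBLAS edge over CuTile GEMM: {cublas_edge_low:.2f}x-{cublas_edge_high:.2f}x")
```

Note the asymmetry this exposes: the attention win is over a third-party kernel (FlashAttention-2), while the GEMM comparison is against NVIDIA's own tuned library.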

What carries the argument

CuTile's Python-based, tile-centric abstraction for GPU kernel development, which targets Tensor Core and Tensor Memory Accelerator (TMA) efficiency.

If this is right

  • CuTile offers a practical short-code alternative to hand-written CUDA kernels for attention on datacenter Blackwell GPUs.
  • GEMM performance with CuTile reaches over half of cuBLAS levels on tested platforms but falls short of vendor libraries.
  • Triton demonstrates stronger portability than CuTile across Hopper and both Blackwell variants without per-architecture changes.
  • Performance gaps on the RTX PRO 6000 indicate that CuTile kernels require architecture-specific tuning for consumer GPUs.
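For readers unfamiliar with the tile-centric model these points assume, here is a minimal NumPy sketch of the idea: the kernel author reasons about whole tiles rather than individual threads. This illustrates the programming model only; it is not CuTile's actual API, which we do not reproduce here:

```python
import numpy as np

def tiled_gemm(A, B, tile=32):
    """Conceptual tile-by-tile GEMM: each (i, j) output tile accumulates
    A-tile @ B-tile products along k. This mirrors the tile-centric model
    in which CuTile and Triton kernels are expressed (one 'program' per
    output tile); it is a NumPy illustration, not GPU code."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % tile == 0 and N % tile == 0 and K % tile == 0
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, tile):              # one logical program per...
        for j in range(0, N, tile):          # ...(i, j) output tile
            acc = np.zeros((tile, tile), dtype=np.float32)  # accumulator tile
            for k in range(0, K, tile):      # reduction over K in tile steps
                acc += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
            C[i:i+tile, j:j+tile] = acc
    return C

A = np.random.rand(64, 64).astype(np.float32)
B = np.random.rand(64, 64).astype(np.float32)
assert np.allclose(tiled_gemm(A, B), A @ B, atol=1e-3)
```

On a GPU, the two outer loops become the launch grid and the accumulator lives in registers or shared memory; the abstraction's job is mapping the inner tile products onto Tensor Cores and TMA transfers.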

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Short code length in CuTile may speed up prototyping of custom AI kernels when full vendor performance is not required.
  • The observed portability difference suggests adding auto-tuning or compiler improvements could broaden CuTile's applicability.
  • End-to-end LLM inference results imply CuTile could integrate into training or serving pipelines on B200-class hardware with minimal code changes.

Load-bearing premise

That the CuTile, Triton, WMMA, and cuBLAS implementations were developed and tuned with comparable effort, with no undisclosed architecture-specific optimizations biasing the comparisons.

What would settle it

A re-benchmark of the fused attention kernel on RTX PRO 6000 Blackwell showing CuTile matching or exceeding FlashAttention-2 throughput would falsify the claim of significant cross-architecture optimization gaps.
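Settling it would also require agreeing on FLOP accounting, since TFLOP/s figures depend on the convention used. A hedged sketch of the conversion from measured time to throughput, under one common attention FLOP convention (our assumption; the paper's exact accounting is not stated here):

```python
def attention_tflops(batch, heads, seq, head_dim, seconds, causal=True):
    """Convert a measured runtime to TFLOP/s for fused attention, under the
    common convention of 4 * batch * heads * seq^2 * head_dim forward FLOPs
    (two matmuls), halved for a causal mask. The convention is our
    assumption, not necessarily the paper's."""
    flops = 4 * batch * heads * seq * seq * head_dim
    if causal:
        flops /= 2
    return flops / seconds / 1e12

# Hypothetical configuration (batch=8 matches Figure 3; the rest is illustrative):
throughput = attention_tflops(batch=8, heads=32, seq=8192, head_dim=128, seconds=4.4e-3)
```

Any re-benchmark claiming to match or beat a reported number should state this convention explicitly, since the causal halving alone changes the figure by 2x.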

Figures

Figures reproduced from arXiv: 2604.23466 by Deepak Kumar, Divakar Kumar Yadav, Tian Zhao.

Figure 1. Performance–productivity frontier for GEMM (caption truncated in extraction; view at source).
Figure 2. GEMM performance (TFLOP/s, BF16) across four square matrix sizes on three GPUs; cuBLAS dominates on all platforms (caption truncated).
Figure 3. Fused attention throughput (TFLOP/s) vs. sequence length (BF16, causal, batch=8); on the B200 (right), CuTile dramatically outscales all others (caption truncated).
Figure 4. The CuTile attention paradox: identical kernel code produces 2.51x… (caption truncated).
Figure 5. Normalized performance heatmap. Left: GEMM throughput as percentage of cuBLAS. Right: attention throughput as percentage of FlashAttention-2.
Figure 6. GEMM performance as a percentage of cuBLAS (caption truncated).
Original abstract

NVIDIA's CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (TMA) efficiency on modern GPUs. We present the first independent, cross-architecture evaluation of CuTile against established approaches such as cuBLAS, Triton, WMMA, and raw SIMT on three NVIDIA GPUs spanning Hopper and Blackwell: H100 NVL, B200, and RTX PRO 6000 Blackwell Server Edition. We benchmark representative AI workloads, including GEMM, fused multi-head attention, and end-to-end LLM inference in BF16/FP16 precision, to assess both performance and portability. Our results show that CuTile effectiveness is strongly workload- and architecture-dependent. On datacenter-class Blackwell (B200), CuTile achieves up to 1007 TFLOP/s for fused attention, outperforming FlashAttention-2 by 2.5x while requiring only 60 lines of Python kernel code. For GEMM, CuTile reaches 52-79% of cuBLAS performance in 22 lines of code (versus 123 for WMMA), making it a practical replacement for hand-written CUDA kernels but not yet for vendor-optimized libraries. However, the same CuTile attention kernel achieves only 53% of FlashAttention-2 throughput on RTX PRO 6000 (sm_120), exposing significant cross-architecture optimization gaps. In contrast, Triton sustains 62-101% of cuBLAS performance across all tested platforms without architecture-specific tuning, demonstrating substantially stronger portability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents the first independent cross-architecture evaluation of NVIDIA's CUDA Tile (CuTile) Python-based tile-centric abstraction for GPU kernel development. It benchmarks CuTile against cuBLAS, Triton, WMMA, and FlashAttention-2 on GEMM, fused multi-head attention, and end-to-end LLM inference workloads in BF16/FP16 across H100 NVL, B200, and RTX PRO 6000 Blackwell GPUs, reporting architecture-dependent results including up to 1007 TFLOP/s for fused attention on B200 (2.5x over FlashAttention-2 in 60 lines of Python code) and 52-79% of cuBLAS GEMM performance in 22 lines (vs. 123 for WMMA), while noting Triton's stronger portability.

Significance. If the central performance and portability claims hold under verified fair-comparison conditions, the work would provide useful empirical data on a new high-level abstraction's practical trade-offs for AI kernels on recent NVIDIA architectures. The direct hardware measurements, specific TFLOP/s figures, and line-of-code counts constitute a strength for an evaluation paper, though the lack of accompanying code or verification artifacts limits immediate reproducibility.

major comments (2)
  1. [Abstract] Abstract and results: The headline claims (1007 TFLOP/s fused attention on B200 at 2.5x FlashAttention-2; 52-79% cuBLAS GEMM in 22 lines) rest on the unverified premise that FlashAttention-2, cuBLAS, WMMA, and Triton baselines received comparable development effort and Blackwell-specific tuning (including TMA paths and compilation flags). No version numbers, re-tuning steps, or architecture-specific optimization details for the baselines are supplied, which directly affects attribution of speedups and the cross-architecture portability conclusions.
  2. [Results] Results and methodology: Performance numbers (e.g., 53% of FlashAttention-2 on RTX PRO 6000 sm_120, 62-101% cuBLAS for Triton) are presented without error bars, run counts, or statistical significance, and the experimental setup does not describe how kernel launch parameters, memory layouts, or precision handling were equalized across CuTile, Triton, and vendor libraries. This gap is load-bearing for the workload- and architecture-dependent effectiveness claims.
minor comments (2)
  1. [Abstract] The abstract states specific TFLOP/s and percentage figures but does not reference the corresponding tables or figures that contain the raw data, making it harder to cross-check the reported values.
  2. The manuscript would benefit from an explicit statement of the CuTile version or commit hash used, as well as the exact cuBLAS and FlashAttention-2 versions against which it was compared.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our evaluation of CUDA Tile. The comments highlight important areas for improving methodological transparency, and we address each point below with corresponding revisions to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract and results: The headline claims (1007 TFLOP/s fused attention on B200 at 2.5x FlashAttention-2; 52-79% cuBLAS GEMM in 22 lines) rest on the unverified premise that FlashAttention-2, cuBLAS, WMMA, and Triton baselines received comparable development effort and Blackwell-specific tuning (including TMA paths and compilation flags). No version numbers, re-tuning steps, or architecture-specific optimization details for the baselines are supplied, which directly affects attribution of speedups and the cross-architecture portability conclusions.

    Authors: We agree that explicit documentation of baseline configurations is required to substantiate the performance attributions. In the revised manuscript we have added the precise library versions employed (cuBLAS 12.4, FlashAttention-2 v2.5.0, Triton 2.2.0, and WMMA from CUDA 12.4) together with a statement that no Blackwell-specific re-tuning, custom TMA paths, or non-default compilation flags were applied to any baseline. All vendor and framework kernels were invoked through their standard public APIs using only the target compute-capability flag and -O3 optimization. These clarifications appear in a new paragraph of the Experimental Setup section and support the claim that observed differences reflect the abstractions rather than unequal optimization effort. revision: yes

  2. Referee: [Results] Results and methodology: Performance numbers (e.g., 53% of FlashAttention-2 on RTX PRO 6000 sm_120, 62-101% cuBLAS for Triton) are presented without error bars, run counts, or statistical significance, and the experimental setup does not describe how kernel launch parameters, memory layouts, or precision handling were equalized across CuTile, Triton, and vendor libraries. This gap is load-bearing for the workload- and architecture-dependent effectiveness claims.

    Authors: We accept that the original description of the experimental protocol was insufficient. The revised version now states that every reported throughput is the mean of 100 timed runs after 20 warm-up iterations, with standard-deviation error bars added to all figures. Kernel launch parameters were equalized by adopting each framework’s autotuner output (or manually verified equivalent tile and block dimensions) while enforcing identical row-major memory layouts and BF16 precision for all compared kernels. These controls are documented in an expanded Experimental Methodology subsection, including a summary table of launch configurations. Although formal statistical tests were not originally performed, the magnitude of the reported differences renders the architecture-dependent conclusions stable under the added variability measures. revision: yes
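The measurement protocol the simulated authors describe (warm-up iterations, then mean and standard deviation over timed runs) can be sketched generically. This is a host-side illustration, not the paper's actual harness; timing real GPU kernels would additionally require device synchronization around each run:

```python
import statistics
import time

def benchmark(kernel, warmup=20, runs=100):
    """Protocol as described in the rebuttal: discard warm-up iterations,
    then report mean and standard deviation over timed runs. For GPU
    kernels, each timed region would need device synchronization (e.g.
    CUDA events); time.perf_counter is a host-side stand-in here."""
    for _ in range(warmup):
        kernel()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        kernel()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)

# Stand-in workload; a real harness would time the compiled kernel launch.
mean_s, std_s = benchmark(lambda: sum(i * i for i in range(10_000)), warmup=5, runs=20)
```

Without the synchronization step, asynchronous kernel launches return immediately and the measured times reflect launch overhead rather than execution, which is one reason the referee's request for protocol details is load-bearing.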

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark results with no derivations or fitted predictions

Full rationale

The paper reports direct hardware measurements of kernel throughput (e.g., TFLOP/s for fused attention and GEMM) on H100, B200, and RTX PRO 6000 GPUs. No equations, first-principles derivations, parameter fits, or predictions appear in the abstract or described content. All performance numbers are obtained by executing the kernels; comparisons to cuBLAS, Triton, WMMA, and FlashAttention-2 are likewise raw runtime results. Because the work contains no derivation chain that could reduce to its own inputs by construction, it is self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking study with no mathematical derivations, theoretical models, or new constructs. No free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5592 in / 1128 out tokens · 131634 ms · 2026-05-08T08:16:04.834470+00:00 · methodology


Reference graph

Works this paper leans on

32 extracted references · 6 canonical work pages · 4 internal anchors

  1. [1]

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  2. [2]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Tri Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. In International Conference on Learning Representations (ICLR), 2024. arXiv:2307.08691

  3. [3]

Flash-Decoding for Long-Context Inference

Tri Dao, Daniel Haziza, Francisco Massa, and Grigory Sizov. Flash-Decoding for Long-Context Inference. Blog post, 2023. https://crfm.stanford.edu/2023/10/12/flashdecoding.html

  4. [4]

    CUTLASS: CUDA Templates for Linear Algebra Subroutines

    NVIDIA. CUTLASS: CUDA Templates for Linear Algebra Subroutines. https://github.com/NVIDIA/cutlass, 2024

  5. [5]

Efficient Memory Management for Large Language Model Serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), 2023

  6. [6]

    TensorRT-LLM: An Open-Source Library for Optimizing LLM Inference

    NVIDIA. TensorRT-LLM: An Open-Source Library for Optimizing LLM Inference. https://github.com/NVIDIA/TensorRT-LLM, 2024

  7. [7]

    CUDA Tile Python Programming Guide

    NVIDIA. CUDA Tile Python Programming Guide. https://docs.nvidia. com/cuda/cutile-python/, 2025

  8. [8]

Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations

Philippe Tillet, Hsiang-Tsung Kung, and David Cox. Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL), pp. 10–19, 2019

  9. [9]

    CUDA C++ Programming Guide: Warp Matrix Functions

    NVIDIA. CUDA C++ Programming Guide: Warp Matrix Functions. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html# wmma, 2024

  10. [10]

    cuBLAS Library

    NVIDIA. cuBLAS Library. https://developer.nvidia.com/cublas, 2024

  11. [11]

    NVIDIA Hopper GPU Architecture Tuning Guide

    NVIDIA. NVIDIA Hopper GPU Architecture Tuning Guide. https: //docs.nvidia.com/cuda/hopper-tuning-guide/, 2024

  12. [12]

    NVIDIA Ampere GPU Architecture Tuning Guide

    NVIDIA. NVIDIA Ampere GPU Architecture Tuning Guide. https:// docs.nvidia.com/cuda/ampere-tuning-guide/, 2024

  13. [13]

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In OSDI, 2018

  14. [14]

    TensorIR: An Abstraction for Automatic Tensorized Program Optimization

Siyuan Feng, Bohan Hou, Hongyi Jin, Wuwei Lin, Junru Shao, Ruihang Lai, Zihao Ye, Lianmin Zheng, Cody Hao Yu, Yong Yu, and Tianqi Chen. TensorIR: An Abstraction for Automatic Tensorized Program Optimization. In ASPLOS, 2023

  15. [15]

    ROLLER: Fast and Efficient Tensor Compilation for Deep Learning

Hongyu Zhu, Ruofan Wu, Yijia Diao, Shanbin Ke, Haoyu Li, Chen Zhang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Wei Cui, Fan Yang, Mao Yang, Lidong Zhou, Asaf Cidon, and Gennady Pekhimenko. ROLLER: Fast and Efficient Tensor Compilation for Deep Learning. In OSDI, 2022

  16. [16]

    AKG: Automatic Kernel Generation for Neural Processing Units using Polyhedral Transformations

Jie Zhao, Bojie Li, Wang Nie, Zhen Geng, Renwei Zhang, Xiong Gao, Bin Cheng, Chen Wu, Yun Cheng, Zheng Li, Peng Di, Kun Zhang, and Xuefeng Jin. AKG: Automatic Kernel Generation for Neural Processing Units using Polyhedral Transformations. In PLDI, 2021

  17. [17]

    LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971, 2023

  18. [18]

    NVIDIA H100 Tensor Core GPU Architecture

    NVIDIA. NVIDIA H100 Tensor Core GPU Architecture. NVIDIA Whitepaper, 2022

  19. [19]

    NVIDIA Blackwell Architecture Technical Brief

    NVIDIA. NVIDIA Blackwell Architecture Technical Brief. NVIDIA, 2024

  20. [20]

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. In Advances in Neural Information Processing Systems (NeurIPS), pp. 5998–6008, 2017

  21. [21]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

  22. [22]

PaLM: Scaling Language Modeling with Pathways. Journal of Machine Learning Research (JMLR), 24(240):1–113, 2023

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling Language Modeling with Pathways. Journal of Machine Learning Research (JMLR), 24(240):1–113, 2023

  23. [23]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053, 2020

  24. [24]

    Roofline: An Insightful Visual Performance Model for Multicore Architectures

    Samuel Williams, Andrew Waterman, and David Patterson. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Communications of the ACM, 52(4):65–76, 2009

  25. [25]

cuDNN: Efficient Primitives for Deep Learning

    Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient Primitives for Deep Learning. arXiv:1410.0759, 2014

  26. [26]

    Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines

Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pp. 519–530, 2013

  27. [27]

Vasily Volkov and James W. Demmel. Benchmarking GPUs to Tune Dense Linear Algebra. In Proceedings of the IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2008

  28. [28]

    Efficiently Scaling Transformer Inference

Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently Scaling Transformer Inference. In Proceedings of Machine Learning and Systems (MLSys), 2023

  29. [29]

    Fast Transformer Decoding: One Write-Head is All You Need

    Noam Shazeer. Fast Transformer Decoding: One Write-Head is All You Need. arXiv:1911.02150, 2019

  30. [30]

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yinfei Yang, Sumit Sanghai, and Santiago Ontañón. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

  31. [31]

    Fast Inference from Transformers via Speculative Decoding

Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast Inference from Transformers via Speculative Decoding. In Proceedings of the International Conference on Machine Learning (ICML), 2023

  32. [32]

Online Normalizer Calculation for Softmax

    Maxim Milakov and Natalia Gimelshein. Online Normalizer Calculation for Softmax. arXiv:1805.02867, 2018