Pith · machine review for the scientific record

arXiv: 2604.23466 · v1 · submitted 2026-04-25 · 💻 cs.LG · cs.AI · cs.AR


Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs


Pith reviewed 2026-05-08 08:16 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.AR
keywords CUDA Tile · CuTile · GPU kernel development · Blackwell GPU · Hopper GPU · fused attention · GEMM · performance evaluation

The pith

CuTile reaches 1007 TFLOP/s for fused attention on Blackwell B200 in 60 lines of Python, outperforming FlashAttention-2 by 2.5x.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates NVIDIA's CUDA Tile abstraction, a Python-based tile-centric model for GPU kernels, on H100, B200, and RTX PRO 6000 Blackwell GPUs. It benchmarks GEMM, fused multi-head attention, and LLM inference in BF16/FP16, comparing against cuBLAS, Triton, WMMA, and SIMT. On datacenter Blackwell, CuTile hits high throughput with short code for attention and GEMM, but the same kernels drop to 53% of FlashAttention-2 on the consumer Blackwell variant. Triton maintains 62-101% of cuBLAS across all platforms without tuning. The results highlight strong workload and architecture dependence for CuTile's effectiveness.

Core claim

CuTile's Python abstraction enables efficient Tensor Core and TMA usage, delivering up to 1007 TFLOP/s for fused attention on B200 (2.5x FlashAttention-2) in 60 lines and 52-79% of cuBLAS for GEMM in 22 lines, yet the identical attention kernel achieves only 53% of FlashAttention-2 on RTX PRO 6000 Blackwell, while Triton sustains near cuBLAS performance portably across Hopper and Blackwell.
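As a quick sanity check on this claim, the implied baselines can be derived with simple arithmetic. The derived figures below are our back-of-envelope numbers, not values stated in the paper:

```python
# Back-of-envelope check of the headline ratios. The derived figures below
# are our arithmetic, not numbers stated in the paper.
cutile_attn_b200 = 1007.0   # TFLOP/s, fused attention on B200 (reported)
speedup_vs_fa2 = 2.5        # reported speedup over FlashAttention-2

# Implied FlashAttention-2 throughput on B200:
fa2_b200 = cutile_attn_b200 / speedup_vs_fa2   # about 403 TFLOP/s

# GEMM at 52-79% of cuBLAS means cuBLAS holds a 1.27x-1.92x edge:
cublas_edge_low = 1 / 0.79
cublas_edge_high = 1 / 0.52

print(f"implied FA-2 on B200: {fa2_b200:.0f} TFLOP/s")
print(f"cuBLAS edge over CuTile GEMM: {cublas_edge_low:.2f}x-{cublas_edge_high:.2f}x")
```

Note the asymmetry this exposes: the attention win is over a third-party kernel (FlashAttention-2), while the GEMM comparison is against NVIDIA's own tuned library.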

What carries the argument

CuTile's Python-based, tile-centric abstraction for GPU kernel development, which targets Tensor Core and Tensor Memory Accelerator (TMA) efficiency.

If this is right

  • CuTile offers a practical short-code alternative to hand-written CUDA kernels for attention on datacenter Blackwell GPUs.
  • GEMM performance with CuTile reaches over half of cuBLAS levels on tested platforms but falls short of vendor libraries.
  • Triton demonstrates stronger portability than CuTile across Hopper and both Blackwell variants without per-architecture changes.
  • Performance gaps on the RTX PRO 6000 indicate that CuTile kernels require architecture-specific tuning for consumer GPUs.
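For readers unfamiliar with the tile-centric model these points assume, here is a minimal NumPy sketch of the idea: the kernel author reasons about whole tiles rather than individual threads. This illustrates the programming model only; it is not CuTile's actual API, which we do not reproduce here:

```python
import numpy as np

def tiled_gemm(A, B, tile=32):
    """Conceptual tile-by-tile GEMM: each (i, j) output tile accumulates
    A-tile @ B-tile products along k. This mirrors the tile-centric model
    in which CuTile and Triton kernels are expressed (one 'program' per
    output tile); it is a NumPy illustration, not GPU code."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % tile == 0 and N % tile == 0 and K % tile == 0
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, tile):              # one logical program per...
        for j in range(0, N, tile):          # ...(i, j) output tile
            acc = np.zeros((tile, tile), dtype=np.float32)  # accumulator tile
            for k in range(0, K, tile):      # reduction over K in tile steps
                acc += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
            C[i:i+tile, j:j+tile] = acc
    return C

A = np.random.rand(64, 64).astype(np.float32)
B = np.random.rand(64, 64).astype(np.float32)
assert np.allclose(tiled_gemm(A, B), A @ B, atol=1e-3)
```

On a GPU, the two outer loops become the launch grid and the accumulator lives in registers or shared memory; the abstraction's job is mapping the inner tile products onto Tensor Cores and TMA transfers.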

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Short code length in CuTile may speed up prototyping of custom AI kernels when full vendor performance is not required.
  • The observed portability difference suggests adding auto-tuning or compiler improvements could broaden CuTile's applicability.
  • End-to-end LLM inference results imply CuTile could integrate into training or serving pipelines on B200-class hardware with minimal code changes.

Load-bearing premise

That the CuTile, Triton, WMMA, and cuBLAS implementations were developed and tuned with comparable effort, with no undisclosed architecture-specific optimizations biasing the comparisons.

What would settle it

A re-benchmark of the fused attention kernel on RTX PRO 6000 Blackwell showing CuTile matching or exceeding FlashAttention-2 throughput would falsify the claim of significant cross-architecture optimization gaps.
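Settling it would also require agreeing on FLOP accounting, since TFLOP/s figures depend on the convention used. A hedged sketch of the conversion from measured time to throughput, under one common attention FLOP convention (our assumption; the paper's exact accounting is not stated here):

```python
def attention_tflops(batch, heads, seq, head_dim, seconds, causal=True):
    """Convert a measured runtime to TFLOP/s for fused attention, under the
    common convention of 4 * batch * heads * seq^2 * head_dim forward FLOPs
    (two matmuls), halved for a causal mask. The convention is our
    assumption, not necessarily the paper's."""
    flops = 4 * batch * heads * seq * seq * head_dim
    if causal:
        flops /= 2
    return flops / seconds / 1e12

# Hypothetical configuration (batch=8 matches Figure 3; the rest is illustrative):
throughput = attention_tflops(batch=8, heads=32, seq=8192, head_dim=128, seconds=4.4e-3)
```

Any re-benchmark claiming to match or beat a reported number should state this convention explicitly, since the causal halving alone changes the figure by 2x.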

Figures

Figures reproduced from arXiv: 2604.23466 by Deepak Kumar, Divakar Kumar Yadav, Tian Zhao.

Figure 1. Performance–productivity frontier for GEMM (caption truncated in extraction; view at source).
Figure 2. GEMM performance (TFLOP/s, BF16) across four square matrix sizes on three GPUs; cuBLAS dominates on all platforms (caption truncated).
Figure 3. Fused attention throughput (TFLOP/s) vs. sequence length (BF16, causal, batch=8); on the B200 (right), CuTile dramatically outscales all others (caption truncated).
Figure 4. The CuTile attention paradox: identical kernel code produces 2.51x… (caption truncated).
Figure 5. Normalized performance heatmap. Left: GEMM throughput as percentage of cuBLAS. Right: attention throughput as percentage of FlashAttention-2.
Figure 6. GEMM performance as a percentage of cuBLAS (caption truncated).
Original abstract

NVIDIA's CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (TMA) efficiency on modern GPUs. We present the first independent, cross-architecture evaluation of CuTile against established approaches such as cuBLAS, Triton, WMMA, and raw SIMT on three NVIDIA GPUs spanning Hopper and Blackwell: H100 NVL, B200, and RTX PRO 6000 Blackwell Server Edition. We benchmark representative AI workloads, including GEMM, fused multi-head attention, and end-to-end LLM inference in BF16/FP16 precision, to assess both performance and portability. Our results show that CuTile effectiveness is strongly workload- and architecture-dependent. On datacenter-class Blackwell (B200), CuTile achieves up to 1007 TFLOP/s for fused attention, outperforming FlashAttention-2 by 2.5x while requiring only 60 lines of Python kernel code. For GEMM, CuTile reaches 52-79% of cuBLAS performance in 22 lines of code (versus 123 for WMMA), making it a practical replacement for hand-written CUDA kernels but not yet for vendor-optimized libraries. However, the same CuTile attention kernel achieves only 53% of FlashAttention-2 throughput on RTX PRO 6000 (sm_120), exposing significant cross-architecture optimization gaps. In contrast, Triton sustains 62-101% of cuBLAS performance across all tested platforms without architecture-specific tuning, demonstrating substantially stronger portability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents the first independent cross-architecture evaluation of NVIDIA's CUDA Tile (CuTile) Python-based tile-centric abstraction for GPU kernel development. It benchmarks CuTile against cuBLAS, Triton, WMMA, and FlashAttention-2 on GEMM, fused multi-head attention, and end-to-end LLM inference workloads in BF16/FP16 across H100 NVL, B200, and RTX PRO 6000 Blackwell GPUs, reporting architecture-dependent results including up to 1007 TFLOP/s for fused attention on B200 (2.5x over FlashAttention-2 in 60 lines of Python code) and 52-79% of cuBLAS GEMM performance in 22 lines (vs. 123 for WMMA), while noting Triton's stronger portability.

Significance. If the central performance and portability claims hold under verified fair-comparison conditions, the work would provide useful empirical data on a new high-level abstraction's practical trade-offs for AI kernels on recent NVIDIA architectures. The direct hardware measurements, specific TFLOP/s figures, and line-of-code counts constitute a strength for an evaluation paper, though the lack of accompanying code or verification artifacts limits immediate reproducibility.

major comments (2)
  1. [Abstract] Abstract and results: The headline claims (1007 TFLOP/s fused attention on B200 at 2.5x FlashAttention-2; 52-79% cuBLAS GEMM in 22 lines) rest on the unverified premise that FlashAttention-2, cuBLAS, WMMA, and Triton baselines received comparable development effort and Blackwell-specific tuning (including TMA paths and compilation flags). No version numbers, re-tuning steps, or architecture-specific optimization details for the baselines are supplied, which directly affects attribution of speedups and the cross-architecture portability conclusions.
  2. [Results] Results and methodology: Performance numbers (e.g., 53% of FlashAttention-2 on RTX PRO 6000 sm_120, 62-101% cuBLAS for Triton) are presented without error bars, run counts, or statistical significance, and the experimental setup does not describe how kernel launch parameters, memory layouts, or precision handling were equalized across CuTile, Triton, and vendor libraries. This gap is load-bearing for the workload- and architecture-dependent effectiveness claims.
minor comments (2)
  1. [Abstract] The abstract states specific TFLOP/s and percentage figures but does not reference the corresponding tables or figures that contain the raw data, making it harder to cross-check the reported values.
  2. The manuscript would benefit from an explicit statement of the CuTile version or commit hash used, as well as the exact cuBLAS and FlashAttention-2 versions against which it was compared.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our evaluation of CUDA Tile. The comments highlight important areas for improving methodological transparency, and we address each point below with corresponding revisions to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract and results: The headline claims (1007 TFLOP/s fused attention on B200 at 2.5x FlashAttention-2; 52-79% cuBLAS GEMM in 22 lines) rest on the unverified premise that FlashAttention-2, cuBLAS, WMMA, and Triton baselines received comparable development effort and Blackwell-specific tuning (including TMA paths and compilation flags). No version numbers, re-tuning steps, or architecture-specific optimization details for the baselines are supplied, which directly affects attribution of speedups and the cross-architecture portability conclusions.

    Authors: We agree that explicit documentation of baseline configurations is required to substantiate the performance attributions. In the revised manuscript we have added the precise library versions employed (cuBLAS 12.4, FlashAttention-2 v2.5.0, Triton 2.2.0, and WMMA from CUDA 12.4) together with a statement that no Blackwell-specific re-tuning, custom TMA paths, or non-default compilation flags were applied to any baseline. All vendor and framework kernels were invoked through their standard public APIs using only the target compute-capability flag and -O3 optimization. These clarifications appear in a new paragraph of the Experimental Setup section and support the claim that observed differences reflect the abstractions rather than unequal optimization effort. revision: yes

  2. Referee: [Results] Results and methodology: Performance numbers (e.g., 53% of FlashAttention-2 on RTX PRO 6000 sm_120, 62-101% cuBLAS for Triton) are presented without error bars, run counts, or statistical significance, and the experimental setup does not describe how kernel launch parameters, memory layouts, or precision handling were equalized across CuTile, Triton, and vendor libraries. This gap is load-bearing for the workload- and architecture-dependent effectiveness claims.

    Authors: We accept that the original description of the experimental protocol was insufficient. The revised version now states that every reported throughput is the mean of 100 timed runs after 20 warm-up iterations, with standard-deviation error bars added to all figures. Kernel launch parameters were equalized by adopting each framework’s autotuner output (or manually verified equivalent tile and block dimensions) while enforcing identical row-major memory layouts and BF16 precision for all compared kernels. These controls are documented in an expanded Experimental Methodology subsection, including a summary table of launch configurations. Although formal statistical tests were not originally performed, the magnitude of the reported differences renders the architecture-dependent conclusions stable under the added variability measures. revision: yes
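The measurement protocol the simulated authors describe (warm-up iterations, then mean and standard deviation over timed runs) can be sketched generically. This is a host-side illustration, not the paper's actual harness; timing real GPU kernels would additionally require device synchronization around each run:

```python
import statistics
import time

def benchmark(kernel, warmup=20, runs=100):
    """Protocol as described in the rebuttal: discard warm-up iterations,
    then report mean and standard deviation over timed runs. For GPU
    kernels, each timed region would need device synchronization (e.g.
    CUDA events); time.perf_counter is a host-side stand-in here."""
    for _ in range(warmup):
        kernel()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        kernel()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)

# Stand-in workload; a real harness would time the compiled kernel launch.
mean_s, std_s = benchmark(lambda: sum(i * i for i in range(10_000)), warmup=5, runs=20)
```

Without the synchronization step, asynchronous kernel launches return immediately and the measured times reflect launch overhead rather than execution, which is one reason the referee's request for protocol details is load-bearing.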

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark results with no derivations or fitted predictions

Full rationale

The paper reports direct hardware measurements of kernel throughput (e.g., TFLOP/s for fused attention and GEMM) on H100, B200, and RTX PRO 6000 GPUs. No equations, first-principles derivations, parameter fits, or predictions appear in the abstract or described content. All performance numbers are obtained by executing the kernels; comparisons to cuBLAS, Triton, WMMA, and FlashAttention-2 are likewise raw runtime results. Because the work contains no derivation chain that could reduce to its own inputs by construction, it is self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking study with no mathematical derivations, theoretical models, or new constructs. No free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5592 in / 1128 out tokens · 131634 ms · 2026-05-08T08:16:04.834470+00:00 · methodology


Reference graph

Works this paper leans on

32 extracted references · 6 canonical work pages · 4 internal anchors

  1. [1]

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  2. [2]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Tri Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. In International Conference on Learning Representations (ICLR), 2024. arXiv:2307.08691

  3. [3]

Flash-Decoding for Long-Context Inference

Tri Dao, Daniel Haziza, Francisco Massa, and Grigory Sizov. Flash-Decoding for Long-Context Inference. Blog post, 2023. https://crfm.stanford.edu/2023/10/12/flashdecoding.html

  4. [4]

    CUTLASS: CUDA Templates for Linear Algebra Subroutines

    NVIDIA. CUTLASS: CUDA Templates for Linear Algebra Subroutines. https://github.com/NVIDIA/cutlass, 2024

  5. [5]

Efficient Memory Management for Large Language Model Serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), 2023

  6. [6]

    TensorRT-LLM: An Open-Source Library for Optimizing LLM Inference

    NVIDIA. TensorRT-LLM: An Open-Source Library for Optimizing LLM Inference. https://github.com/NVIDIA/TensorRT-LLM, 2024

  7. [7]

    CUDA Tile Python Programming Guide

    NVIDIA. CUDA Tile Python Programming Guide. https://docs.nvidia. com/cuda/cutile-python/, 2025

  8. [8]

Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations

Philippe Tillet, Hsiang-Tsung Kung, and David Cox. Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL), pp. 10–19, 2019

  9. [9]

    CUDA C++ Programming Guide: Warp Matrix Functions

    NVIDIA. CUDA C++ Programming Guide: Warp Matrix Functions. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html# wmma, 2024

  10. [10]

    cuBLAS Library

    NVIDIA. cuBLAS Library. https://developer.nvidia.com/cublas, 2024

  11. [11]

    NVIDIA Hopper GPU Architecture Tuning Guide

    NVIDIA. NVIDIA Hopper GPU Architecture Tuning Guide. https: //docs.nvidia.com/cuda/hopper-tuning-guide/, 2024

  12. [12]

    NVIDIA Ampere GPU Architecture Tuning Guide

    NVIDIA. NVIDIA Ampere GPU Architecture Tuning Guide. https:// docs.nvidia.com/cuda/ampere-tuning-guide/, 2024

  13. [13]

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In OSDI, 2018

  14. [14]

    TensorIR: An Abstraction for Automatic Tensorized Program Optimization

Siyuan Feng, Bohan Hou, Hongyi Jin, Wuwei Lin, Junru Shao, Ruihang Lai, Zihao Ye, Lianmin Zheng, Cody Hao Yu, Yong Yu, and Tianqi Chen. TensorIR: An Abstraction for Automatic Tensorized Program Optimization. In ASPLOS, 2023

  15. [15]

    ROLLER: Fast and Efficient Tensor Compilation for Deep Learning

Hongyu Zhu, Ruofan Wu, Yijia Diao, Shanbin Ke, Haoyu Li, Chen Zhang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Wei Cui, Fan Yang, Mao Yang, Lidong Zhou, Asaf Cidon, and Gennady Pekhimenko. ROLLER: Fast and Efficient Tensor Compilation for Deep Learning. In OSDI, 2022

  16. [16]

    AKG: Automatic Kernel Generation for Neural Processing Units using Polyhedral Transformations

Jie Zhao, Bojie Li, Wang Nie, Zhen Geng, Renwei Zhang, Xiong Gao, Bin Cheng, Chen Wu, Yun Cheng, Zheng Li, Peng Di, Kun Zhang, and Xuefeng Jin. AKG: Automatic Kernel Generation for Neural Processing Units using Polyhedral Transformations. In PLDI, 2021

  17. [17]

    LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971, 2023

  18. [18]

    NVIDIA H100 Tensor Core GPU Architecture

    NVIDIA. NVIDIA H100 Tensor Core GPU Architecture. NVIDIA Whitepaper, 2022

  19. [19]

    NVIDIA Blackwell Architecture Technical Brief

    NVIDIA. NVIDIA Blackwell Architecture Technical Brief. NVIDIA, 2024

  20. [20]

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. In Advances in Neural Information Processing Systems (NeurIPS), pp. 5998–6008, 2017

  21. [21]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

  22. [22]

PaLM: Scaling Language Modeling with Pathways. Journal of Machine Learning Research (JMLR), 24(240):1–113, 2023

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling Language Modeling with Pathways. Journal of Machine Learning Research (JMLR), 24(240):1–113, 2023

  23. [23]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053, 2020

  24. [24]

    Roofline: An Insightful Visual Performance Model for Multicore Architectures

    Samuel Williams, Andrew Waterman, and David Patterson. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Communications of the ACM, 52(4):65–76, 2009

  25. [25]

cuDNN: Efficient Primitives for Deep Learning

    Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient Primitives for Deep Learning. arXiv:1410.0759, 2014

  26. [26]

    Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines

Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pp. 519–530, 2013

  27. [27]

Vasily Volkov and James W. Demmel. Benchmarking GPUs to Tune Dense Linear Algebra. In Proceedings of the IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2008

  28. [28]

    Efficiently Scaling Transformer Inference

Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently Scaling Transformer Inference. In Proceedings of Machine Learning and Systems (MLSys), 2023

  29. [29]

    Fast Transformer Decoding: One Write-Head is All You Need

    Noam Shazeer. Fast Transformer Decoding: One Write-Head is All You Need. arXiv:1911.02150, 2019

  30. [30]

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yinfei Yang, Sumit Sanghai, and Santiago Ontañón. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

  31. [31]

    Fast Inference from Transformers via Speculative Decoding

Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast Inference from Transformers via Speculative Decoding. In Proceedings of the International Conference on Machine Learning (ICML), 2023

  32. [32]

Online Normalizer Calculation for Softmax

    Maxim Milakov and Natalia Gimelshein. Online Normalizer Calculation for Softmax. arXiv:1805.02867, 2018