pith. machine review for the scientific record.

arxiv: 2604.10187 · v1 · submitted 2026-04-11 · 💻 cs.PF · cs.AR

Recognition: unknown

WaveTune: Wave-aware Bilinear Modeling for Efficient GPU Kernel Auto-tuning

Cheng Huang, Chutong Ding, Guangtao Xue, Guodong Yang, Jian Cao, Kaixuan Zhang, Liping Zhang, Luping Wang, Shiyou Qian

Pith reviewed 2026-05-10 15:28 UTC · model grok-4.3

classification 💻 cs.PF cs.AR
keywords GPU kernel auto-tuning · wave-aware modeling · bilinear latency prediction · runtime configuration · LLM inference efficiency · GEMM optimization · sparse sampling

The pith

WaveTune's wave-aware bilinear model selects near-optimal GPU kernel configurations at runtime with minimal overhead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to show that GPU kernel tuning can be made both fast and near-optimal by modeling latency with explicit awareness of how thread blocks execute in waves on the hardware. It first maps diverse inputs to a unified space and decomposes the high-dimensional configuration space, then builds an analytical bilinear predictor that accounts for wave interactions. A sparse sampling scheme based on these waves, plus a lightweight retrieval table, then picks a configuration without exhaustive search. If this holds, LLM inference can run kernels close to their fastest possible speed without the usual delays from tuning or the cost of suboptimal choices.
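The wave structure this pipeline is built on is not spelled out in the summary above; as context, here is a minimal sketch of the standard wave-quantization arithmetic for a tiled GEMM (following NVIDIA's public description of the effect; the helper name and the tile/SM numbers are illustrative, not taken from the paper):

```python
import math

def wave_count(M, N, tile_m, tile_n, num_sms, blocks_per_sm=1):
    """Scheduling waves for a tiled GEMM: thread blocks launch in rounds
    of (SMs x resident blocks per SM); a partially filled last wave is
    the "wave quantization" effect. Hypothetical helper, not the paper's.
    """
    tiles = math.ceil(M / tile_m) * math.ceil(N / tile_n)
    concurrent = num_sms * blocks_per_sm
    return math.ceil(tiles / concurrent)

# On an H100-like part with 132 SMs, a 4096x4096 output tiled 128x128
# launches 32 * 32 = 1024 blocks -> 8 waves, the last only ~76% full.
print(wave_count(4096, 4096, 128, 128, 132))  # -> 8
```

The tail-wave occupancy jump at each integer wave boundary is what makes the latency landscape non-convex in the tile sizes, and what the paper's sampling scheme exploits.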

Core claim

By decomposing kernel configurations according to wave structures and fitting a bilinear model to predict latency, WaveTune enables sparse sampling and lightweight retrieval that delivers near-optimal kernel performance across varied inputs and hardware while cutting decision time dramatically.

What carries the argument

The wave-aware bilinear model that predicts kernel latency from configuration parameters and hardware wave decompositions.
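Figure 6 lists per-macro-config coefficients θ = ⟨α, β, γ, δ⟩, which suggests a four-term bilinear form in the wave count and one other feature. The paper derives its coefficients analytically; purely as an illustration, here is a sketch that recovers such coefficients by least squares from sampled points (the choice of second feature x is an assumption here, not the paper's):

```python
import numpy as np

def fit_bilinear(w, x, latency):
    """Least-squares fit of latency ~ alpha + beta*w + gamma*x + delta*w*x.

    theta = (alpha, beta, gamma, delta) mirrors the per-macro-config
    coefficient table sketched in Figure 6; the feature x (e.g. per-wave
    work) is a stand-in, since the text shown gives no explicit equation.
    """
    A = np.stack([np.ones_like(w), w, x, w * x], axis=1)
    theta, *_ = np.linalg.lstsq(A, latency, rcond=None)
    return theta

def predict(theta, w, x):
    a, b, g, d = theta
    return a + b * w + g * x + d * w * x

# Synthetic check: data generated from a known bilinear law is recovered.
rng = np.random.default_rng(0)
w = rng.uniform(1, 16, 64)
x = rng.uniform(0.1, 1.0, 64)
lat = 5.0 + 2.0 * w + 3.0 * x + 0.5 * w * x
theta = fit_bilinear(w, x, lat)
print(np.round(theta, 3))  # ~ [5. 2. 3. 0.5]
```

Only four coefficients per macro config need to be determined, which is what makes sparse sampling viable.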

If this is right

  • Kernel performance reaches within a small margin of the absolute best possible for the given hardware.
  • Runtime decision overhead drops by five orders of magnitude compared to searching all options.
  • End-to-end time-to-first-token (TTFT) for LLMs improves by up to 1.33x.
  • Results hold for multiple representative kernels and across different GPU designs.
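The overhead drop in the second bullet comes from replacing per-request search with table retrieval, per Figure 6's dual-table design. A toy sketch of that idea (all names and the stand-in latency model are illustrative, not the paper's):

```python
# Offline: for each sampled (wave count, tail load) key, store the best
# micro config; online: one dictionary lookup replaces the search.

def build_micro_table(candidates, measure, keys):
    return {k: min(candidates, key=lambda c: measure(c, k)) for k in keys}

def retrieve(table, w, L):
    return table[(w, L)]  # O(1) Stage-II retrieval

candidates = [(64, 2), (128, 3), (256, 4)]      # toy (tile, stages) configs
def measure(c, key):                             # stand-in for profiling
    w, L = key
    tile, stages = c
    return abs(tile - 32 * w) + stages * L

table = build_micro_table(candidates, measure,
                          [(w, L) for w in (1, 2, 4, 8) for L in (0, 1)])
print(retrieve(table, 4, 1))  # -> (128, 3)
```

The table is built once offline; at serving time every request pays only the lookup, which is where an orders-of-magnitude overhead reduction versus exhaustive search would come from.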

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This modeling choice may allow similar sparse tuning in other domains where resource allocation follows periodic structures, like certain CPU or accelerator workloads.
  • Production systems could adopt it to dynamically adjust kernels for changing batch sizes without pre-computing tables for every scenario.
  • Further work might combine the model with online learning to adapt to new hardware features as they emerge.

Load-bearing premise

The bilinear model built from wave structures accurately predicts latency for unseen inputs and hardware without significant fitting errors or the need for post-adjustments.

What would settle it

Measuring actual runtimes on a held-out GPU or kernel input and observing that WaveTune's chosen configuration performs substantially worse than the true best found by exhaustive search.
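That falsification test can be phrased as a tiny harness; everything below is illustrative, with a callable standing in for real on-device timing:

```python
def settle(configs, true_latency, chosen):
    """One falsification trial: exhaustively time every config on a
    held-out input, then compare the tuner's pick against the true best.
    A gap of a few percent is consistent with "near-optimal"; a large
    gap refutes it. (`true_latency` stands in for on-device timing.)"""
    best = min(configs, key=true_latency)
    gap = true_latency(chosen) / true_latency(best) - 1.0
    return best, gap

# Toy example: three configs with known "latencies".
lat = {"c1": 1.00, "c2": 1.04, "c3": 1.80}
best, gap = settle(lat, lat.get, "c2")
print(best, f"{gap:.0%}")  # the pick "c2" is 4% off the true best "c1"
```

Run on a GPU or kernel shape excluded from any fitting, this is the cleanest way to separate genuine prediction from calibration.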

Figures

Figures reproduced from arXiv: 2604.10187 by Cheng Huang, Chutong Ding, Guangtao Xue, Guodong Yang, Jian Cao, Kaixuan Zhang, Liping Zhang, Luping Wang, Shiyou Qian.

Figure 1. Illustration of the Tiled GEMM execution model.
Figure 2. Illustration of the wave quantization effect. A total …
Figure 3. Profiling results on an H100 (132 SMs) for a GEMM kernel with fixed …
Figure 4. Discrete-event simulations of 132 SMs with a fixed mean block latency …
Figure 6. Overview of the WaveTune framework. For each macro config c_i, a corresponding dual-table DT(c_i) is constructed, consisting of a coefficient table indexed by wave count w with parameters θ = ⟨α, β, γ, δ⟩, and a micro-config table indexed by (w, L) storing optimal micro configs c*_{w,L}. At runtime, Stage I selects c*_macro via model-based kernel latency prediction, and Stage II retrieves c*_micro from the …
Figure 5. Profiling results on an H100 (132 SMs) for a …
Figure 7. Kernel-level geometric mean speedup over the default heuristic across five GPU architectures.
Figure 8. End-to-end TTFT speedup over the default heuristic during the prefill phase across varying input sequence lengths.
Figure 9. Trade-off between runtime decision overhead (log …
Figure 10. Ablation study on FlashAttention, comparing the …
Figure 11. Impact of profiling range (W, I) on performance across different sequence length ranges.
Original abstract

The rapid adoption of Large Language Models (LLMs) has made GPU inference efficiency an increasingly critical system concern. The runtime of LLM workloads is largely dominated by tile-based kernels, particularly General Matrix Multiplications (GEMMs). Although these kernels are highly optimized, their performance remains sensitive to a large space of runtime parameters, such as tile sizes and pipeline stages. The interaction between these parameters and hardware resources leads to a non-convex optimization landscape. Existing approaches to parameter configuration -- including search-based auto-tuning, heuristic rules, and learned cost models -- face a fundamental trade-off between performance optimality and runtime efficiency. In this paper, we present WaveTune, a wave-aware framework for runtime kernel auto-tuning. First, we introduce a unified mapping method to handle input diversity and decompose the configuration space to manage high dimensionality. Second, we develop an analytical wave-aware bilinear model that accurately predicts kernel latency. Third, we design a sparse sampling scheme based on wave structures and a lightweight dual-table retrieval mechanism to minimize runtime overhead. As a result, WaveTune enables precise and efficient runtime configuration for GPU kernels. Across three representative kernels and five GPU architectures, WaveTune consistently achieves near-optimal kernel performance, delivering up to 1.83x kernel-level speedup and up to 1.33x end-to-end TTFT reduction, while reducing runtime decision overhead by five orders of magnitude compared to exhaustive search. These results demonstrate that WaveTune effectively eliminates the traditional trade-off between configuration latency and execution optimality, providing a practical and robust solution for high-performance LLM inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents WaveTune, a wave-aware framework for runtime auto-tuning of GPU kernels in LLM inference workloads. It proposes a unified mapping to decompose configuration spaces for diverse inputs, an analytical wave-aware bilinear model to predict kernel latency, and a sparse sampling scheme leveraging wave structures combined with a lightweight dual-table retrieval mechanism. Experiments on three representative kernels across five GPU architectures report consistent near-optimal performance, with up to 1.83x kernel-level speedup, up to 1.33x end-to-end TTFT reduction, and five orders of magnitude lower runtime decision overhead versus exhaustive search.

Significance. If the bilinear model is verifiably analytical (with coefficients independent of the speedup measurement data) and the wave-based sparse sampling reliably identifies near-optimal points without post-hoc adjustments, WaveTune would meaningfully address the optimality-efficiency trade-off in GPU kernel tuning for LLMs. The multi-kernel, multi-architecture evaluation provides a reasonable breadth of evidence. The approach of combining structural wave awareness with lightweight retrieval is a constructive direction for low-overhead runtime systems.

major comments (2)
  1. Abstract: The central claim that the 'analytical wave-aware bilinear model accurately predicts kernel latency' and enables 'near-optimal' configurations with five-order-of-magnitude overhead reduction rests on unshown derivation details, fitting procedures, and prediction-error metrics on held-out data. Without explicit separation of any fitted coefficients from the wave-parameter derivation, the reported speedups risk circularity as noted in the stress-test.
  2. Experimental results (as summarized): No error bars, ablation studies on model components, or cross-validation across the five GPUs and held-out input/configurations are reported. This undermines confidence that the sparse sampling based on wave structures generalizes beyond the three kernels tested or that the 1.83x/1.33x gains are independent of any hidden calibration.
minor comments (1)
  1. Abstract: 'TTFT' is used without expansion on first occurrence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for acknowledging the potential impact of WaveTune on the optimality-efficiency trade-off in GPU kernel auto-tuning. We address each major comment point by point below, providing clarifications based on the manuscript and committing to revisions that strengthen the presentation without altering the core claims.

Point-by-point responses
  1. Referee: Abstract: The central claim that the 'analytical wave-aware bilinear model accurately predicts kernel latency' and enables 'near-optimal' configurations with five-order-of-magnitude overhead reduction rests on unshown derivation details, fitting procedures, and prediction-error metrics on held-out data. Without explicit separation of any fitted coefficients from the wave-parameter derivation, the reported speedups risk circularity as noted in the stress-test.

    Authors: The bilinear model is derived analytically from the GPU wave execution model: latency is expressed as a bilinear function of tile sizes and wave parameters (occupancy, scheduling) using closed-form expressions based on hardware specs such as memory bandwidth, compute throughput, and SM occupancy formulas. No coefficients are obtained by fitting to the latency or speedup measurements reported in the evaluation; they follow directly from the wave decomposition and resource bounds. We will add a dedicated subsection to the revised manuscript that presents the full analytical derivation, the exact procedure for computing coefficients from architecture parameters, and quantitative held-out prediction metrics (e.g., mean absolute percentage error across input sizes and configurations excluded from any tuning runs). The stress-test results will be included explicitly to demonstrate that performance gains remain consistent when the model is applied without access to the final measurement data. revision: yes

  2. Referee: Experimental results (as summarized): No error bars, ablation studies on model components, or cross-validation across the five GPUs and held-out input/configurations are reported. This undermines confidence that the sparse sampling based on wave structures generalizes beyond the three kernels tested or that the 1.83x/1.33x gains are independent of any hidden calibration.

    Authors: We agree that the current presentation of results would benefit from additional statistical and validation elements. Although the manuscript already evaluates three kernels across five GPU architectures and multiple input sizes, it does not report error bars, component ablations, or formal cross-validation. In the revision we will incorporate: (i) error bars computed from at least ten independent timing runs per configuration; (ii) ablation experiments that isolate the wave-aware bilinear predictor and the wave-structured sparse sampler; and (iii) cross-validation results showing performance on held-out input dimensions as well as transfer performance across the five GPUs without per-GPU recalibration of the model. These additions will directly address generalization of the sparse sampling scheme and confirm that the reported speedups do not rely on hidden per-experiment calibration. revision: yes
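Response (iii) amounts to a leave-one-GPU-out protocol. A schematic version, with toy data and stand-in fit/evaluate functions (the GPU labels and numbers are hypothetical):

```python
def leave_one_gpu_out(data_by_gpu, fit, evaluate):
    """Leave-one-GPU-out: fit on all architectures but one, score on the
    held-out one with no per-GPU recalibration. Sketch only."""
    gpus = sorted(data_by_gpu)
    scores = {}
    for held_out in gpus:
        train = [s for g in gpus if g != held_out for s in data_by_gpu[g]]
        model = fit(train)
        scores[held_out] = evaluate(model, data_by_gpu[held_out])
    return scores

# Toy run: the "model" is a mean, the score is mean absolute error.
data = {"A100": [1.0, 1.2], "H100": [0.8, 1.0], "L40S": [1.1, 0.9]}
fit = lambda xs: sum(xs) / len(xs)
evaluate = lambda m, xs: sum(abs(x - m) for x in xs) / len(xs)
scores = leave_one_gpu_out(data, fit, evaluate)
```

Reporting these held-out scores alongside the headline speedups would directly answer the referee's generalization concern.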

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

Full rationale

The abstract and provided excerpts present WaveTune as introducing a unified mapping, an analytical wave-aware bilinear model for latency prediction, and a sparse sampling scheme with dual-table retrieval. No equations, fitting procedures, or self-citations are exhibited that reduce the latency predictions or speedups to quantities defined by construction from the same evaluation data. The reported results (1.83x kernel speedup, 1.33x TTFT reduction, five-order overhead drop) are framed as empirical outcomes across three kernels and five GPUs without any quoted step where a 'prediction' is statistically forced by prior fitting on identical inputs or where a uniqueness theorem is imported from overlapping authors. The derivation chain therefore remains self-contained against the external benchmarks described.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on the unproven accuracy of the bilinear latency model and the sufficiency of wave structures for decomposition and sampling; these are domain assumptions rather than derived results.

free parameters (1)
  • Bilinear model coefficients
    The wave-aware bilinear latency predictor requires coefficients that are almost certainly calibrated or fitted to kernel measurements, even if not explicitly quantified in the abstract.
axioms (1)
  • domain assumption GPU kernel latency can be accurately represented by a bilinear function of configuration parameters and wave structures.
    This is the foundational modeling choice that enables the analytical prediction and sparse sampling.
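Written out, the assumed law (using the θ = ⟨α, β, γ, δ⟩ notation from the Figure 6 caption; pairing the wave count w with a second index ℓ is an inference from that caption, not an equation quoted from the paper) would read:

```latex
T_{\text{kernel}}(w,\ell) \;\approx\; \alpha + \beta\,w + \gamma\,\ell + \delta\,w\ell,
\qquad \boldsymbol{\theta} = \langle \alpha, \beta, \gamma, \delta \rangle .
```

The axiom is exactly the claim that this four-parameter form tracks real kernel latency closely enough to rank configurations correctly.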

pith-pipeline@v0.9.0 · 5608 in / 1483 out tokens · 74483 ms · 2026-05-10T15:28:35.418818+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

54 extracted references · 23 canonical work pages · 3 internal anchors
