WaveTune: Wave-aware Bilinear Modeling for Efficient GPU Kernel Auto-tuning
Pith reviewed 2026-05-10 15:28 UTC · model grok-4.3
The pith
WaveTune's wave-aware bilinear model selects near-optimal GPU kernel configurations at runtime with minimal overhead.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By decomposing kernel configurations according to wave structures and fitting a bilinear model to predict latency, WaveTune enables sparse sampling and lightweight retrieval that delivers near-optimal kernel performance across varied inputs and hardware while cutting decision time dramatically.
What carries the argument
The wave-aware bilinear model that predicts kernel latency from configuration parameters and hardware wave decompositions.
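As a rough illustration of what such a model could look like, here is a minimal sketch; the feature choices, helper names, and the coefficient matrix `A` are assumptions for illustration, not the paper's actual parameterization:

```python
import math
import numpy as np

def wave_features(m, n, tile_m, tile_n, num_sms, blocks_per_sm):
    # Decompose a tiled GEMM launch into waves: the grid of output tiles is
    # executed in rounds of at most num_sms * blocks_per_sm concurrent blocks
    # (cf. NVIDIA's "wave quantization" guidance).
    num_tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    wave_capacity = num_sms * blocks_per_sm
    full_waves = num_tiles // wave_capacity
    tail_fraction = (num_tiles % wave_capacity) / wave_capacity
    return np.array([1.0, float(full_waves), tail_fraction])

def predict_latency(config_feats, wave_feats, A):
    # Bilinear form: every configuration feature interacts with every
    # wave feature through the coefficient matrix A.
    return float(config_feats @ A @ wave_feats)

# Illustrative use on a 4096x4096 output with 128x128 tiles:
x = np.array([1.0, 128.0, 4.0])  # bias, tile_m, pipeline stages (assumed features)
y = wave_features(4096, 4096, 128, 128, num_sms=132, blocks_per_sm=2)
A = np.ones((3, 3))              # placeholder; the paper derives this analytically
print(predict_latency(x, y, A))
```

The appeal of this form is that once `A` is fixed, prediction is a constant-time matrix contraction, which is what makes runtime retrieval cheap.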
If this is right
- Kernel performance reaches within a small margin of the absolute best possible for the given hardware.
- Runtime decision overhead drops by five orders of magnitude compared to searching all options.
- End-to-end time-to-first-token (TTFT) for LLMs improves by up to 1.33x.
- Results hold for multiple representative kernels and across different GPU designs.
Where Pith is reading between the lines
- This modeling choice may allow similar sparse tuning in other domains where resource allocation follows periodic structures, like certain CPU or accelerator workloads.
- Production systems could adopt it to dynamically adjust kernels for changing batch sizes without pre-computing tables for every scenario.
- Further work might combine the model with online learning to adapt to new hardware features as they emerge.
Load-bearing premise
The bilinear model built from wave structures accurately predicts latency for unseen inputs and hardware without significant fitting errors or the need for post-adjustments.
What would settle it
Measuring actual runtimes on a held-out GPU or kernel input and observing that WaveTune's chosen configuration performs substantially worse than the true best found by exhaustive search.
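One concrete form of that test is sketched below; `time_kernel` and `wavetune_pick` are hypothetical stand-ins for a wall-clock measurement harness and the tuner's retrieval step:

```python
def max_regret(kernel, heldout_shapes, all_configs, wavetune_pick, time_kernel):
    # Compare WaveTune's chosen configuration against exhaustive search on
    # shapes (or a GPU) excluded from any tuning. A ratio far above 1.0
    # would be the falsifying observation described above.
    worst = 1.0
    for shape in heldout_shapes:
        best = min(time_kernel(kernel, shape, c) for c in all_configs)
        chosen = time_kernel(kernel, shape, wavetune_pick(kernel, shape))
        worst = max(worst, chosen / best)
    return worst
```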
Original abstract
The rapid adoption of Large Language Models (LLMs) has made GPU inference efficiency an increasingly critical system concern. The runtime of LLM workloads is largely dominated by tile-based kernels, particularly General Matrix Multiplications (GEMMs). Although these kernels are highly optimized, their performance remains sensitive to a large space of runtime parameters, such as tile sizes and pipeline stages. The interaction between these parameters and hardware resources leads to a non-convex optimization landscape. Existing approaches to parameter configuration -- including search-based auto-tuning, heuristic rules, and learned cost models -- face a fundamental trade-off between performance optimality and runtime efficiency. In this paper, we present WaveTune, a wave-aware framework for runtime kernel auto-tuning. First, we introduce a unified mapping method to handle input diversity and decompose the configuration space to manage high dimensionality. Second, we develop an analytical wave-aware bilinear model that accurately predicts kernel latency. Third, we design a sparse sampling scheme based on wave structures and a lightweight dual-table retrieval mechanism to minimize runtime overhead. As a result, WaveTune enables precise and efficient runtime configuration for GPU kernels. Across three representative kernels and five GPU architectures, WaveTune consistently achieves near-optimal kernel performance, delivering up to 1.83x kernel-level speedup and up to 1.33x end-to-end TTFT reduction, while reducing runtime decision overhead by five orders of magnitude compared to exhaustive search. These results demonstrate that WaveTune effectively eliminates the traditional trade-off between configuration latency and execution optimality, providing a practical and robust solution for high-performance LLM inference.
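The abstract names a "lightweight dual-table retrieval mechanism" without detailing it. One plausible reading, sketched with hypothetical names and bucketing, is a shape-to-bucket table chained to a bucket-to-configuration table:

```python
from typing import Dict, Tuple

ShapeKey = Tuple[int, int, int]   # (M, N, K) of a GEMM call
Config = Tuple[int, int, int]     # (tile_m, tile_n, num_stages)

def make_dual_table_pick(shape_to_bucket: Dict[ShapeKey, int],
                         bucket_to_config: Dict[int, Config],
                         default: Config):
    # Two O(1) lookups at runtime: the first table quantizes the input shape
    # into a wave bucket, the second maps the bucket to a pre-selected
    # configuration. Both the structure and the names here are assumptions.
    def pick(m: int, n: int, k: int) -> Config:
        bucket = shape_to_bucket.get((m, n, k))
        if bucket is None:
            return default
        return bucket_to_config.get(bucket, default)
    return pick
```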
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents WaveTune, a wave-aware framework for runtime auto-tuning of GPU kernels in LLM inference workloads. It proposes a unified mapping to decompose configuration spaces for diverse inputs, an analytical wave-aware bilinear model to predict kernel latency, and a sparse sampling scheme leveraging wave structures combined with a lightweight dual-table retrieval mechanism. Experiments on three representative kernels across five GPU architectures report consistent near-optimal performance, with up to 1.83x kernel-level speedup, up to 1.33x end-to-end TTFT reduction, and five orders of magnitude lower runtime decision overhead versus exhaustive search.
Significance. If the bilinear model is verifiably analytical (with coefficients independent of the speedup measurement data) and the wave-based sparse sampling reliably identifies near-optimal points without post-hoc adjustments, WaveTune would meaningfully address the optimality-efficiency trade-off in GPU kernel tuning for LLMs. The multi-kernel, multi-architecture evaluation provides a reasonable breadth of evidence. The approach of combining structural wave awareness with lightweight retrieval is a constructive direction for low-overhead runtime systems.
major comments (2)
- Abstract: The central claim that the 'analytical wave-aware bilinear model accurately predicts kernel latency' and enables 'near-optimal' configurations with five-order-of-magnitude overhead reduction rests on unshown derivation details, fitting procedures, and prediction-error metrics on held-out data. Without explicit separation of any fitted coefficients from the wave-parameter derivation, the reported speedups risk circularity as noted in the stress-test.
- Experimental results (as summarized): No error bars, ablation studies on model components, or cross-validation across the five GPUs and held-out input/configurations are reported. This undermines confidence that the sparse sampling based on wave structures generalizes beyond the three kernels tested or that the 1.83x/1.33x gains are independent of any hidden calibration.
minor comments (1)
- Abstract: 'TTFT' is used without expansion on first occurrence.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for acknowledging the potential impact of WaveTune on the optimality-efficiency trade-off in GPU kernel auto-tuning. We address each major comment point by point below, providing clarifications based on the manuscript and committing to revisions that strengthen the presentation without altering the core claims.
Point-by-point responses
Referee: Abstract: The central claim that the 'analytical wave-aware bilinear model accurately predicts kernel latency' and enables 'near-optimal' configurations with five-order-of-magnitude overhead reduction rests on unshown derivation details, fitting procedures, and prediction-error metrics on held-out data. Without explicit separation of any fitted coefficients from the wave-parameter derivation, the reported speedups risk circularity as noted in the stress-test.
Authors: The bilinear model is derived analytically from the GPU wave execution model: latency is expressed as a bilinear function of tile sizes and wave parameters (occupancy, scheduling) using closed-form expressions based on hardware specs such as memory bandwidth, compute throughput, and SM occupancy formulas. No coefficients are obtained by fitting to the latency or speedup measurements reported in the evaluation; they follow directly from the wave decomposition and resource bounds. We will add a dedicated subsection to the revised manuscript that presents the full analytical derivation, the exact procedure for computing coefficients from architecture parameters, and quantitative held-out prediction metrics (e.g., mean absolute percentage error across input sizes and configurations excluded from any tuning runs). The stress-test results will be included explicitly to demonstrate that performance gains remain consistent when the model is applied without access to the final measurement data.
Revision: yes.
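A minimal sketch of what "computing coefficients from architecture parameters" could look like, using a roofline-style bound; the formula, names, and parameters below are illustrative assumptions, not the authors' actual derivation:

```python
def per_wave_time(flops_per_tile, bytes_per_tile, wave_capacity,
                  peak_flops, peak_bandwidth):
    # Roofline-style closed-form estimate of one wave's duration. Every
    # input comes from the tile geometry or the hardware spec sheet, not
    # from fitting to measured speedups, mirroring the authors' claim that
    # the coefficients are analytical. The exact formula here is assumed.
    compute_time = wave_capacity * flops_per_tile / peak_flops
    memory_time = wave_capacity * bytes_per_tile / peak_bandwidth
    return max(compute_time, memory_time)  # bound by the slower resource
```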
Referee: Experimental results (as summarized): No error bars, ablation studies on model components, or cross-validation across the five GPUs and held-out input/configurations are reported. This undermines confidence that the sparse sampling based on wave structures generalizes beyond the three kernels tested or that the 1.83x/1.33x gains are independent of any hidden calibration.
Authors: We agree that the current presentation of results would benefit from additional statistical and validation elements. Although the manuscript already evaluates three kernels across five GPU architectures and multiple input sizes, it does not report error bars, component ablations, or formal cross-validation. In the revision we will incorporate: (i) error bars computed from at least ten independent timing runs per configuration; (ii) ablation experiments that isolate the wave-aware bilinear predictor and the wave-structured sparse sampler; and (iii) cross-validation results showing performance on held-out input dimensions as well as transfer performance across the five GPUs without per-GPU recalibration of the model. These additions will directly address generalization of the sparse sampling scheme and confirm that the reported speedups do not rely on hidden per-experiment calibration.
Revision: yes.
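A sketch of the cross-validation protocol in (iii); `fit_model` and `evaluate` are hypothetical hooks for model calibration and held-out error measurement, not WaveTune APIs:

```python
def leave_one_gpu_out(gpus, fit_model, evaluate):
    # Hold out each architecture in turn, derive/calibrate the model on the
    # rest, and report transfer error (e.g. MAPE of predicted vs. measured
    # latency) on the held-out GPU with no per-GPU recalibration.
    results = {}
    for held_out in gpus:
        train = [g for g in gpus if g != held_out]
        model = fit_model(train)
        results[held_out] = evaluate(model, held_out)
    return results
```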
Circularity Check
No significant circularity detected in derivation chain
Full rationale
The abstract and provided excerpts present WaveTune as introducing a unified mapping, an analytical wave-aware bilinear model for latency prediction, and a sparse sampling scheme with dual-table retrieval. No equations, fitting procedures, or self-citations are exhibited that reduce the latency predictions or speedups to quantities defined by construction from the same evaluation data. The reported results (1.83x kernel speedup, 1.33x TTFT reduction, five-orders-of-magnitude overhead reduction) are framed as empirical outcomes across three kernels and five GPUs, with no quoted step in which a 'prediction' is statistically forced by prior fitting on identical inputs or a key result is imported from prior work by overlapping authors. The derivation chain therefore remains self-contained against the external benchmarks described.
Axiom & Free-Parameter Ledger
free parameters (1)
- Bilinear model coefficients
axioms (1)
- Domain assumption: GPU kernel latency can be accurately represented by a bilinear function of configuration parameters and wave structures.
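In symbols, the assumed functional form reads as follows (notation ours, not taken from the paper):

```latex
% Assumed form of the wave-aware bilinear latency model: c collects
% configuration features (tile sizes, pipeline stages), w collects wave
% features (full-wave count, tail occupancy), and A is the coefficient
% matrix listed as the free parameter above.
L(c, w) \approx c^{\top} A \, w
```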