arxiv: 2605.31464 · v1 · pith:DKGHEJAXnew · submitted 2026-05-29 · 💻 cs.LG · cs.AI

GPU Forecasters: Language Models as Selective Surrogates for Kernel Runtime Optimization

Pith reviewed 2026-06-28 23:08 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords GPU kernels · language models · surrogate models · kernel optimization · runtime prediction · selective deferral · reinforcement learning

The pith

Language models forecast GPU kernel performance and selectively defer to hardware to let searches evaluate more candidates under fixed budgets, yielding faster kernels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs can predict relative runtimes of GPU kernels well enough to replace some hardware measurements during optimization searches. The models must also estimate their own uncertainty so they can defer uncertain cases back to the GPU rather than risking bad decisions. Experiments show that this selective surrogate approach allows a search to consider several times as many kernel candidates within the same GPU evaluation budget. Reinforcement learning further improves forecast accuracy and calibration. The result is discovery of faster kernels than an equal-budget search that always measures every candidate on the GPU.

Core claim

LLMs can accurately forecast relative kernel performance; when tuned via reinforcement learning for better calibration they function as selective surrogates that defer to GPU measurements on uncertain cases, so that a kernel search can consider several times as many candidates under the same GPU evaluation budget and recover faster kernels than an equal-budget baseline that always measures on hardware.

What carries the argument

The selective surrogate: an LLM that outputs both a relative performance forecast and a signal for deferral to the GPU when its prediction is likely unreliable.
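
The forecast-plus-deferral decision described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the bin edges in `BINS`, the `Forecast` type, the function name `selective_forecast`, and the `threshold` value are all hypothetical. The only element taken from the source is the use of the largest bin probability as the stated confidence.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical speedup bins as (low, high) ranges; the paper's real bin edges are not given here.
BINS = [(0.0, 0.8), (0.8, 1.0), (1.0, 1.25), (1.25, 2.0), (2.0, 4.0)]


@dataclass
class Forecast:
    predicted_bin: int    # index of the most probable speedup bin
    confidence: float     # probability mass on that bin
    defer_to_gpu: bool    # True when the forecast should be replaced by a real measurement


def selective_forecast(bin_probs: Sequence[float], threshold: float = 0.6) -> Forecast:
    """Turn a distribution over speedup bins into a forecast or a deferral.

    `threshold` is an assumed tuning knob: below it, the candidate is sent to the GPU.
    """
    if len(bin_probs) != len(BINS):
        raise ValueError("expected one probability per speedup bin")
    total = sum(bin_probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    probs = [p / total for p in bin_probs]   # renormalize defensively
    confidence = max(probs)                  # stated confidence: largest bin probability
    predicted_bin = probs.index(confidence)
    return Forecast(predicted_bin, confidence, defer_to_gpu=confidence < threshold)
```

A peaked distribution such as `[0.05, 0.05, 0.1, 0.7, 0.1]` is accepted as a forecast for bin 3, while a flat one such as `[0.2] * 5` is deferred to hardware.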

If this is right

  • Kernel searches can scale candidate volume without raising GPU measurement cost.
  • Faster final kernels are obtained for the same on-device evaluation budget.
  • Reinforcement learning can be used to raise both forecast accuracy and deferral calibration.
  • LLMs can serve as virtual models of GPU behavior in addition to code generators.
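
The budget arithmetic behind the first two consequences can be made concrete with a back-of-the-envelope sketch. It assumes a simplified model that is not taken from the paper: every candidate is either forecast by the surrogate at no GPU cost or deferred to the GPU, and the deferral rate is constant. The function name and the example numbers are illustrative only.

```python
import math


def candidates_considered(gpu_budget: int, defer_rate: float) -> int:
    """Number of candidates a search can look at when only a fraction is GPU-measured.

    gpu_budget: number of on-device measurements available.
    defer_rate: assumed fraction of candidates the surrogate defers to the GPU (0 < rate <= 1).
    """
    if gpu_budget < 0:
        raise ValueError("gpu_budget must be non-negative")
    if not 0.0 < defer_rate <= 1.0:
        raise ValueError("defer_rate must be in (0, 1]")
    return math.floor(gpu_budget / defer_rate)


# Always measuring (defer_rate = 1.0) covers exactly the budget;
# deferring one candidate in four covers four times as many.
baseline = candidates_considered(100, 1.0)
with_surrogate = candidates_considered(100, 0.25)
```

Under these assumptions a deferral rate of 0.25 turns 100 GPU measurements into 400 considered candidates, which is the kind of multiplier the summary refers to as "several times as many".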

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same selective-forecast pattern could be tested on other accelerators or non-kernel optimization tasks.
  • Combining the surrogate with coding-agent generators might further enlarge the effective search space.
  • If calibration holds across domains, the approach reduces reliance on repeated hardware runs in early design stages.

Load-bearing premise

LLM forecasts and uncertainty estimates stay sufficiently accurate across the kernels proposed during search so that selective deferral improves final kernel quality rather than introducing bias.
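
One way to probe this premise is to check that forecast error falls as stated confidence rises. The sketch below buckets examples by confidence and counts places where the mean error goes up instead of down. The bucket edges, the use of a per-example error value supplied by the caller, and the violation count are assumptions for illustration; the count is not the paper's monotonicity statistic.

```python
from typing import List, Sequence, Tuple


def error_by_confidence(
    confidences: Sequence[float],
    errors: Sequence[float],
    edges: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> List[Tuple[float, float]]:
    """Return (bucket lower edge, mean error) for each non-empty confidence bucket."""
    n_buckets = len(edges) - 1
    sums = [0.0] * n_buckets
    counts = [0] * n_buckets
    for q, e in zip(confidences, errors):
        idx = n_buckets - 1              # the last bucket includes its upper edge
        for k in range(n_buckets):
            if q < edges[k + 1]:
                idx = k
                break
        sums[idx] += e
        counts[idx] += 1
    return [(edges[k], sums[k] / counts[k]) for k in range(n_buckets) if counts[k] > 0]


def monotone_violations(curve: Sequence[Tuple[float, float]]) -> int:
    """Count adjacent bucket pairs where mean error increases with confidence."""
    means = [m for _, m in curve]
    return sum(1 for lower, higher in zip(means, means[1:]) if higher > lower)
```

A well-behaved surrogate yields zero violations; a surrogate whose most confident forecasts are its worst yields at least one, which is the situation in which selective deferral would introduce bias rather than remove it.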

What would settle it

A kernel search using the surrogate that returns a slower best kernel than a pure-GPU-measurement baseline under identical GPU budget would show the surrogate is not useful.
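
The equal-budget comparison can be sketched on toy data. The example assumes a simplified setting that is not the paper's search procedure: a fixed pool of candidates, a surrogate that supplies a predicted speedup for each, and a baseline that measures candidates in proposal order. All names and numbers are hypothetical.

```python
from typing import Sequence


def best_with_surrogate(
    true_speedups: Sequence[float],
    predicted_speedups: Sequence[float],
    gpu_budget: int,
) -> float:
    """Measure only the gpu_budget candidates the surrogate ranks highest."""
    if gpu_budget < 1 or gpu_budget > len(true_speedups):
        raise ValueError("gpu_budget must be between 1 and the pool size")
    order = sorted(
        range(len(true_speedups)),
        key=lambda i: predicted_speedups[i],
        reverse=True,
    )
    return max(true_speedups[i] for i in order[:gpu_budget])


def best_always_measure(true_speedups: Sequence[float], gpu_budget: int) -> float:
    """Baseline: measure candidates in proposal order until the budget is spent."""
    if gpu_budget < 1 or gpu_budget > len(true_speedups):
        raise ValueError("gpu_budget must be between 1 and the pool size")
    return max(true_speedups[:gpu_budget])


# Toy pool of six candidates (illustrative numbers, not results from the paper).
true = [0.9, 1.1, 0.7, 1.0, 2.3, 1.4]
good_forecast = [1.0, 1.2, 0.6, 0.9, 1.8, 1.5]   # roughly tracks the truth
bad_forecast = [1.0, 0.5, 2.0, 0.9, 0.4, 0.3]    # ranks the slowest kernel first
```

With a budget of two measurements, the baseline finds 1.1, the surrogate with `good_forecast` finds 2.3, and the surrogate with `bad_forecast` finds only 0.9. The last case is the failure pattern this test is designed to expose.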

Figures

Figures reproduced from arXiv: 2605.31464 by Elias Stengel-Eskin, Jaemin Cho, Justin Chih-Yao Chen, Mohit Bansal, Zaid Khan.

Figure 1: Left: LLM-driven kernel optimizers use physical GPU measurements as feedback for …
Figure 2: Our method casts kernel evaluation as selective forecasting over discrete speedup bins. Top: …
Figure 3: The base surrogate overestimates slow kernels and underestimates the fastest kernels, while RL changes where those errors go. Rows are measured speedup bins and columns are predicted bins for GPT-OSS-20B; each cell is the fraction of examples in that measured bin receiving that prediction. The middle and right panels show how training changes those row fractions relative to the base (% change): green cells…
Figure 4: Calibration rewards make confidence track forecast error, while correctness-only RL weakens this relationship. Examples are bucketed by stated confidence, $q_i = \max_b p_{i,b}$. The y-axis is mean forecast error within each bucket; lower and more decreasing curves indicate more reliable confidence. Shading marks violations of monotone decrease, summarized by $\Delta_{\mathrm{mono}}$ with 0 indicating no violation.
Figure 5: Brier calibration improves the speedup found under most measurement budgets; correctness-only and CRPS training do not. Budgets are the fraction of candidates GPU-measured after sorting by predicted expected speedup. Gray dashes show the GPT-OSS-20B base; colored bars show trained-minus-base speedup recovered, with green above and red below the base. Error bars show ±1σ across repeats, and ∆ is the area-av…
Figure 6: The search with the surrogate matches or improves on the baseline’s best speedup on four of the six tasks, often at substantially fewer GPU measurements, with gaps of 5% and 7% on the remaining two. Each panel is one of the six tasks of Section 3.2. The y-axis is the best measured kernel speedup so far. The x-axis is the cumulative number of GPU measurements consumed by the search. Without the surrogate, t…
Figure 7: Surrogates can identify discovery moments, the parent-to-child mutations where the child is much faster than its parent. Each panel fixes one definition of much faster (child at least 1.25×, 1.5×, or 2× faster than its parent). The surrogate ranks all pairs from most to least likely to be a discovery. Each curve shows precision and recall as we accept more of the ranking, starting from just the top most co…
Figure 8: Evaluations per hour on a single A100 for the three evaluation modes used in this paper. The two dashed lines are the kernel benchmark, fixed at one in flight because timing a kernel against the reference requires sole occupancy of the GPU. The solid line is the LLM surrogate (GPT-OSS-20B), which has no such constraint and serves multiple forecasts concurrently from the same GPU, so its throughput grows wi…
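
The ranking evaluation described in the Figure 7 caption can be sketched as a precision and recall computation over the top of a ranked list. This is an illustrative reading of the caption, not the paper's evaluation code: the function name, the score inputs, and the toy numbers are hypothetical, while the speedup thresholds (1.25×, 1.5×, 2×) come from the caption.

```python
from typing import Sequence, Tuple


def precision_recall_at_k(
    scores: Sequence[float],
    child_over_parent: Sequence[float],
    min_ratio: float,
    k: int,
) -> Tuple[float, float]:
    """Precision and recall when accepting the k pairs ranked most likely to be discoveries.

    scores: surrogate score per parent-to-child pair (higher means more likely a discovery).
    child_over_parent: measured speedup of the child relative to its parent.
    min_ratio: what counts as "much faster", e.g. 1.25, 1.5, or 2.0.
    """
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of pairs")
    is_discovery = [ratio >= min_ratio for ratio in child_over_parent]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(is_discovery[i] for i in order[:k])
    total = sum(is_discovery)
    precision = hits / k
    recall = hits / total if total else 0.0
    return precision, recall


# Toy example with five parent-to-child pairs (illustrative numbers only).
scores = [0.9, 0.2, 0.7, 0.4, 0.1]
ratios = [2.1, 1.0, 1.3, 1.6, 0.8]
```

With `min_ratio=1.5`, accepting only the top-ranked pair gives precision 1.0 and recall 0.5; accepting the top three gives precision 2/3 and recall 1.0. Sweeping `k` from 1 to the number of pairs traces out one curve of the kind the caption describes.
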
Original abstract

GPU kernels are the workhorse of modern deep learning, and optimizing them (via evolutionary search or coding agents) usually requires repeated measurement on target hardware. While these measurements provide the ground-truth signal necessary for kernel search, they are costly, because each evaluation of a kernel requires compilation and repeated execution on a GPU. As improvements in LLM inference reduce the cost of writing novel kernels and LLM-driven searches scale to large search budgets, on-device evaluation becomes a bottleneck. To address this, we study how LLMs can serve as selective GPU surrogates for kernel evaluation, by forecasting the performance of proposed kernels. A useful surrogate should be accurate, and it should be selective, by knowing when it could be wrong and deferring to the GPU. To evaluate surrogates, we measure whether their forecasts are accurate, calibrated, and practically useful for recovering fast kernels under limited GPU-measurement budgets. Next, we study whether reinforcement learning can improve forecast accuracy and confidence calibration. Our experiments demonstrate that LLMs can accurately forecast relative kernel performance, and that their utility can be improved through reinforcement learning. Used inside a kernel search, the surrogate lets the search consider several times as many candidates under the same GPU evaluation budget, which leads to finding faster kernels than an equal-budget baseline. These results suggest that LLMs can play a broader role in kernel optimization, by acting as virtual models of a GPU rather than solely as kernel generators for search.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper proposes using LLMs as selective surrogates to forecast relative runtime performance of GPU kernels, thereby reducing the number of expensive on-device evaluations needed during evolutionary or agent-driven kernel searches. It reports that LLMs produce accurate relative forecasts, that RL fine-tuning improves both accuracy and confidence calibration, and that deploying the surrogate inside a search loop permits evaluating several times more candidates under a fixed GPU budget, ultimately yielding faster kernels than an equal-budget baseline that relies solely on GPU measurements.

Significance. If the empirical results hold under the reported conditions, the work demonstrates a practical way to scale kernel optimization by treating LLMs as cheap, selective virtual models of GPU behavior rather than only as code generators. The combination of relative-accuracy evaluation, RL-driven calibration, and end-to-end search improvement supplies concrete evidence that selective deferral can increase search throughput without degrading final kernel quality. This approach could meaningfully lower the hardware cost of automated kernel tuning in deep-learning systems.

minor comments (3)
  1. [§4.2] §4.2 and Table 2: the description of the RL reward function does not explicitly state how the calibration term is normalized relative to the accuracy term; adding the exact weighting formula would remove ambiguity when reproducing the reported calibration gains.
  2. [Figure 3] Figure 3 caption: the x-axis label 'search budget (GPU evals)' should clarify whether the plotted points include or exclude the surrogate-only evaluations; this affects interpretation of the 'several times more candidates' claim.
  3. [§5.3] §5.3: the baseline search is described as 'equal-budget' but the text does not state whether the baseline also uses the same evolutionary operators and population size; a one-sentence clarification would strengthen the comparison.
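
To illustrate the kind of weighting formula the first comment asks for, the sketch below combines a correctness term with a Brier-based calibration term. It is purely hypothetical: the paper's actual reward, its normalization, and its weights are not stated in this review, and `calibration_weight` and the rescaling to [0, 1] are choices made here for illustration.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], true_bin: int) -> float:
    """Multi-class Brier score: squared distance to the one-hot truth, in [0, 2]."""
    return sum(
        (p - (1.0 if b == true_bin else 0.0)) ** 2
        for b, p in enumerate(probs)
    )


def combined_reward(
    probs: Sequence[float],
    true_bin: int,
    calibration_weight: float = 0.5,   # hypothetical mixing weight
) -> float:
    """Convex mix of a 0/1 correctness term and a rescaled Brier term, both in [0, 1]."""
    if not 0.0 <= calibration_weight <= 1.0:
        raise ValueError("calibration_weight must be in [0, 1]")
    predicted_bin = max(range(len(probs)), key=lambda b: probs[b])
    correctness = 1.0 if predicted_bin == true_bin else 0.0
    calibration = 1.0 - brier_score(probs, true_bin) / 2.0   # 1 is best, 0 is worst
    return (1.0 - calibration_weight) * correctness + calibration_weight * calibration
```

Putting both terms on the same [0, 1] scale before mixing makes the weight directly interpretable, which is the ambiguity the comment points at; setting `calibration_weight=0.0` recovers a correctness-only reward.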

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The referee's description of the work is accurate. No specific major comments were provided in the report, so we have no point-by-point responses. We will incorporate any minor revisions requested by the editor.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper describes an empirical workflow: LLMs are fine-tuned via reinforcement learning on kernel performance data, then evaluated for forecast accuracy, calibration, and end-to-end utility inside a search loop whose final kernels are measured on actual GPU hardware. All load-bearing claims (relative ranking accuracy, selective deferral benefit, and search improvement under fixed measurement budget) are validated by direct comparison to hardware baselines rather than by any equation that reduces a forecast to a quantity already fitted inside the same paper. No self-citation chain, uniqueness theorem, or ansatz is invoked to justify the surrogate; the derivation chain therefore remains self-contained against external hardware measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; all technical details are absent.

pith-pipeline@v0.9.1-grok · 5803 in / 1079 out tokens · 25447 ms · 2026-06-28T23:08:19.967540+00:00 · methodology


