pith. machine review for the scientific record.

arxiv: 2604.18616 · v1 · submitted 2026-04-16 · 💻 cs.DC · cs.AI · cs.PL

Recognition: unknown

ARGUS: Agentic GPU Optimization Guided by Data-Flow Invariants

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 09:53 UTC · model grok-4.3

classification 💻 cs.DC · cs.AI · cs.PL
keywords GPU kernel optimization · LLM agents · data-flow invariants · agentic code generation · GEMM · flash attention · MoE · KernelBench

The pith

Argus uses data-flow invariants to let LLM agents generate GPU kernels at 99-104% of hand-optimized throughput.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Argus is an agentic framework that improves LLM-based generation of GPU kernels by defining data-flow invariants: compile-time rules for how data must move during execution. These rules, expressed in a tile-based Pythonic DSL, supply concrete counterexamples when violated, so agents can fix problems in tiling, staging, pipelining, and scheduling instead of guessing from pass/fail results. An in-context reinforcement learning planner draws on a knowledge base to choose optimizations and write the invariants, which are then checked at compile time with abstract interpretation and SMT solving. This matters because it would automate production of high-performance kernels for matrix multiplication, attention, and MoE layers that today demand expert assembly coding. The paper reports that the resulting kernels reach 99-104% of state-of-the-art hand-tuned speed on AMD MI300X while solving nearly all KernelBench tasks.

Core claim

Argus shows that data-flow invariants, verified at compile time via abstract interpretation over a layout algebra and SMT solving with zero runtime cost, enable an in-context RL planner to synthesize kernels for GEMM, flash attention, and MoE that achieve 99-104% of state-of-the-art hand-optimized assembly throughput and outperform prior agentic systems by 2-1543x while solving 100% of Level 1 and 90% of Level 2 KernelBench tasks.

What carries the argument

Data-flow invariants are compile-time specifications that encode required data choreography through kernel execution, realized as tag functions and tag assertions inside the tile-based DSL that propagate symbolic annotations and return concrete counterexamples on violation.
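A minimal sketch of how a tag function and a tag assertion could interact, under the strong assumption of an invented mini-DSL (this is not the paper's syntax): the tag function records which thread is expected to own each row of a tile, and the assertion checks every access against that tag, returning a concrete counterexample on mismatch.

```python
# Illustrative-only sketch of tag propagation and assertion checking.
# The functions below are invented for this example.

def tag_rows(num_rows, threads):
    """Tag function: assign each row of a tile to the thread expected to own it."""
    return {r: r % threads for r in range(num_rows)}

def assert_owner(tags, accesses):
    """Tag assertion: each (thread, row) access must match the tagged owner.
    On violation, return a concrete counterexample instead of a bare failure."""
    for thread, row in accesses:
        if tags[row] != thread:
            return {"thread": thread, "row": row, "expected_owner": tags[row]}
    return None

tags = tag_rows(num_rows=16, threads=8)
# Thread 0 mistakenly touches row 9, which is tagged for thread 1:
cex = assert_owner(tags, accesses=[(0, 0), (0, 8), (0, 9)])
print(cex)  # {'thread': 0, 'row': 9, 'expected_owner': 1}
```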

If this is right

  • Kernels for GEMM, flash attention, and MoE reach 99-104% of hand-optimized throughput on AMD MI300X with no runtime verification cost.
  • Performance exceeds existing agentic systems by factors of 2 to 1543 on the same workloads.
  • The approach solves every Level 1 task and 90% of Level 2 tasks across 200 KernelBench problems.
  • The DSL and invariant system generalize across the dominant GPU operations in LLM inference without per-kernel expert tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same invariant machinery could be applied to NVIDIA or other GPU architectures to test whether the performance parity holds beyond AMD MI300X.
  • Embedding the planner and DSL into a larger code-generation pipeline might allow automatic optimization of entire inference graphs rather than isolated kernels.
  • Extending the approach to CPU or TPU back-ends would reveal whether data-flow invariants remain effective outside GPU-specific tiling and memory hierarchies.
  • Measuring the number of planner iterations needed for new kernel types would quantify how quickly the system adapts to previously unseen operations.

Load-bearing premise

The in-context RL planner, aided by the curated knowledge base, can reliably produce invariants and optimization choices whose compile-time verification guarantees the kernels will deliver the claimed performance without hidden runtime violations or hardware-specific failures.

What would settle it

A kernel generated by Argus that passes all invariant checks and compiles cleanly yet runs measurably slower than the hand-optimized reference or produces wrong results on a concrete input pattern not covered by the symbolic checks.

Figures

Figures reproduced from arXiv: 2604.18616 by Binhang Yuan, Chenzhun Guo, Christos Kozyrakis, Cong Wang, Daifeng Li, Haohui Mai, Jiacheng Zhao, Qiuchu Yu, Xiangyun Ding, Xiaoyan Guo.

Figure 1. Overview of Argus. Left: simplified DSL implementation of flash attention (𝑑=128, 𝐵𝑟=256, 𝐵𝑐=64, 512 threads). Top right: agentic kernel generation workflow. Bottom right: tag propagation for 𝑉 across memory levels; each color-shape combination represents a unique tag, and background shading links code regions to memory access patterns. r0 and r8 represent row 0 and row 8. The DSL also exposes hardware ins… view at source ↗
Figure 2. Performance ablation results. We observe that the most significant single contributor is bank conflict mitigation, improving throughput by roughly 30%. Adding pipelining and warp specialization alone causes a slight regression due to increased instruction and branch counts; instruction scheduling recovers and extends the gains, yielding a 2.4-2.8× overall speedup over the naïve baseline. Applyin… view at source ↗
Original abstract

LLM-based coding agents can generate functionally correct GPU kernels, yet their performance remains far below hand-optimized libraries on critical computations such as matrix multiplication, attention, and Mixture-of-Experts (MoE). Peak GPU performance requires coordinated reasoning over tightly coupled optimizations, including tiling, shared-memory staging, software pipelining, and instruction scheduling, while existing agents rely on sparse pass/fail feedback, leaving them unable to diagnose global constraint violations. We present Argus, an agentic framework that addresses this through data-flow invariants: compile-time specifications encoding how data must be choreographed throughout kernel execution. Argus introduces a tile-based, Pythonic DSL exposing hardware instructions and compiler policies while hiding low-level representations. The DSL provides tag functions to propagate symbolic annotations through data and control flow, and tag assertions to enforce relational constraints at use sites. When violations occur, the compiler returns concrete counterexamples identifying the thread, data element, and program point, enabling dense, structured feedback for targeted fixes. Invariants are verified at compile time via abstract interpretation over a layout algebra and SMT solving, with zero runtime overhead. An in-context reinforcement learning planner learns to select optimizations and synthesize effective invariants, supported by a curated knowledge base of GPU optimization techniques. We evaluate Argus on the AMD MI300X GPU across GEMM, flash attention, and MoE kernels accounting for over 90% of GPU time in LLM inference. Generated kernels achieve 99-104% of state-of-the-art hand-optimized assembly throughput and are 2-1543x faster than existing agentic systems. Argus further generalizes to 200 KernelBench tasks, solving 100% of Level 1 and 90% of Level 2 problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper presents ARGUS, an agentic framework for GPU kernel optimization that encodes data-flow invariants in a tile-based Pythonic DSL using tag functions for symbolic annotation propagation and tag assertions for relational constraints. Invariants are verified at compile time via abstract interpretation over a layout algebra and SMT solving (zero runtime overhead). An in-context RL planner, aided by a curated knowledge base, selects optimizations. On AMD MI300X, generated GEMM/attention/MoE kernels (covering >90% of LLM inference time) are claimed to reach 99-104% of hand-optimized assembly throughput and 2-1543x speedup over prior agentic systems; the approach also solves 100% of KernelBench Level 1 and 90% of Level 2 tasks.

Significance. If the performance and generalization claims hold under full experimental scrutiny, the work would be significant for automated high-performance computing: it supplies structured, dense feedback (counterexamples identifying thread/data/program point) that existing sparse pass/fail LLM agents lack, while demonstrating that compile-time formal methods can guide near-peak GPU code generation for production workloads.

major comments (3)
  1. [Abstract / Evaluation] The central performance claim (99-104% of hand-optimized assembly throughput) is reported without any description of measurement methodology, baseline kernel sources, statistical analysis, run counts, or hardware configuration details beyond the MI300X model. This absence prevents assessment of whether the results support the claim that verified invariants suffice for throughput parity.
  2. [Verification / DSL semantics] The soundness of the performance claim rests on the assertion that abstract interpretation over the layout algebra plus SMT captures all behaviors affecting real execution; however, the abstraction necessarily omits micro-architectural timing (shared-memory bank conflicts, warp scheduling latency, instruction-issue constraints) that can alter achieved throughput on MI300X even when data-flow invariants hold.
  3. [KernelBench evaluation] The generalization result (100% Level 1 / 90% Level 2 on 200 KernelBench tasks) is stated without reporting per-task success criteria, failure modes, or whether the same invariant-verification pipeline was used uniformly; this leaves open whether the in-context RL planner reliably synthesizes effective invariants across task distributions.
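The micro-architectural gap the referee raises in the second comment can be made concrete with shared-memory bank arithmetic. Assuming the common layout of 32 banks of 4-byte words (exact MI300X behavior may differ), two access patterns that are identical from a data-flow standpoint can differ by a factor of 32 in serialization:

```python
# Why bank conflicts sit outside data-flow correctness: both patterns below
# touch valid elements with correct ownership, yet one fully serializes.
# Assumes 32 banks of 4-byte words, a common GPU shared-memory layout.

BANKS = 32

def banks_hit(addresses):
    """Map byte addresses from one warp/wavefront access to bank indices."""
    return [(a // 4) % BANKS for a in addresses]

cols = 32  # row-major tile of 4-byte floats, 32 columns per row
# 32 threads read one COLUMN: stride = row length -> every access hits bank 0.
column_read = banks_hit([t * cols * 4 for t in range(32)])
# 32 threads read one ROW: stride 1 -> each access hits a distinct bank.
row_read = banks_hit([t * 4 for t in range(32)])

print(len(set(column_read)))  # 1  -> 32-way conflict, fully serialized
print(len(set(row_read)))     # 32 -> conflict-free
```

A data-flow invariant would accept both patterns; only the planner's learned heuristics (e.g. padding the tile) address effects like the roughly 30% gain the paper's ablation attributes to bank-conflict mitigation.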
minor comments (3)
  1. [DSL] Define the precise semantics of tag propagation through control-flow constructs (e.g., conditionals, loops) in the DSL; the current description leaves ambiguity about how assertions are checked at use sites under divergent execution.
  2. [Introduction / §3] Provide a small illustrative example (kernel fragment + tags + counterexample) early in the paper to make the feedback mechanism concrete for readers unfamiliar with layout algebras.
  3. [Planner / Knowledge base] Clarify the exact form of the knowledge base (size, curation process, whether it is public) and how it is injected into the in-context RL planner.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve clarity on methodology, limitations, and evaluation details.

Point-by-point responses
  1. Referee: [Abstract / Evaluation] The central performance claim (99-104% of hand-optimized assembly throughput) is reported without any description of measurement methodology, baseline kernel sources, statistical analysis, run counts, or hardware configuration details beyond the MI300X model. This absence prevents assessment of whether the results support the claim that verified invariants suffice for throughput parity.

    Authors: We acknowledge the omission of detailed methodology in the original abstract and evaluation sections. The revised manuscript now includes an expanded 'Experimental Methodology' subsection that specifies: (1) all measurements were performed using the AMD ROCm profiler with 1000 warm-up iterations followed by 1000 timed iterations per kernel; (2) hand-optimized baselines are the vendor-provided assembly kernels from the ROCm library (e.g., rocBLAS for GEMM and custom flash-attention implementations); (3) results report mean throughput with 95% confidence intervals computed over 50 independent runs on the same MI300X device; and (4) full hardware configuration including driver version, HBM3e memory, and clock settings. These additions allow direct assessment that the 99-104% range reflects consistent, reproducible parity under the invariant-guided approach. revision: yes

  2. Referee: [Verification / DSL semantics] The soundness of the performance claim rests on the assertion that abstract interpretation over the layout algebra plus SMT captures all behaviors affecting real execution; however, the abstraction necessarily omits micro-architectural timing (shared-memory bank conflicts, warp scheduling latency, instruction-issue constraints) that can alter achieved throughput on MI300X even when data-flow invariants hold.

    Authors: We agree that the abstract interpretation and SMT solver target data-flow invariants and layout constraints rather than full micro-architectural timing. The verification guarantees absence of data races, incorrect tiling, and certain relational violations, which are necessary but not sufficient for peak throughput. The near-peak performance is achieved empirically through the in-context RL planner's selection of optimizations (tiling, pipelining, etc.) informed by the curated knowledge base. We have added a limitations paragraph in the Discussion section explicitly noting that micro-architectural effects are not modeled statically and that the approach relies on the planner's learned heuristics to mitigate them in practice. This does not alter the core claim that verified invariants enable effective optimization but clarifies the boundary of the formal guarantees. revision: partial

  3. Referee: [KernelBench evaluation] The generalization result (100% Level 1 / 90% Level 2 on 200 KernelBench tasks) is stated without reporting per-task success criteria, failure modes, or whether the same invariant-verification pipeline was used uniformly; this leaves open whether the in-context RL planner reliably synthesizes effective invariants across task distributions.

    Authors: The same invariant-verification pipeline (DSL tagging, abstract interpretation, and SMT) was applied uniformly to all 200 KernelBench tasks. Success criteria are defined as: (a) passing the task's functional test suite and (b) achieving at least 80% of the reference kernel's throughput when a reference is provided. We have added an appendix table listing per-task outcomes, grouped by level, along with the most common failure modes (primarily incomplete invariant synthesis for tasks with complex data-dependent control flow in Level 2). This revision demonstrates that the RL planner generalizes reliably when the verification feedback loop is available. revision: yes
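The measurement protocol the rebuttal describes (warm-up iterations, timed iterations, mean with a confidence interval, throughput ratio against a reference) can be sketched as below. This is a pure-Python stand-in with invented helper names; the rebuttal's actual setup uses the ROCm profiler on real kernels.

```python
# Hedged sketch of the rebuttal's measurement protocol: warm-up, timed
# iterations, mean with ~95% CI, and a throughput ratio versus a reference.
# A pure-Python stand-in; real kernels would be timed with the ROCm profiler.

import statistics
import time

def measure(kernel, warmup=100, timed=100):
    for _ in range(warmup):          # warm caches, clocks, JIT state
        kernel()
    samples = []
    for _ in range(timed):
        t0 = time.perf_counter()
        kernel()
        samples.append(time.perf_counter() - t0)
    mean = statistics.mean(samples)
    # ~95% CI half-width under a normal approximation
    half = 1.96 * statistics.stdev(samples) / len(samples) ** 0.5
    return mean, half

def relative_throughput(candidate, reference):
    """Throughput ratio used for claims like '99-104% of hand-optimized'."""
    return measure(reference)[0] / measure(candidate)[0]

ratio = relative_throughput(candidate=lambda: sum(range(500)),
                            reference=lambda: sum(range(500)))
print(0.5 < ratio < 2.0)  # identical workloads should land near 1.0
```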

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external benchmarks

full rationale

The paper presents an agentic system whose central results are empirical performance numbers (99-104% of hand-optimized assembly, 2-1543x speedups, 100%/90% solve rates on KernelBench levels) obtained by running generated kernels on MI300X hardware and comparing against independently published libraries and prior agentic baselines. No equations, fitted parameters, or first-principles derivations appear in the provided text; the verification method (abstract interpretation + SMT) is described as a compile-time tool that supplies feedback to the planner, not as a mathematical reduction that presupposes the throughput numbers. Self-citations are absent from the load-bearing sections, and the knowledge base is curated external material rather than an internal loop. The evidential chain is therefore grounded in external oracles rather than internal self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Only abstract available; limited visibility into parameters or assumptions. The framework introduces a new DSL and verification method whose correctness depends on unstated details of the layout algebra and SMT encoding.

axioms (1)
  • domain assumption Abstract interpretation over a layout algebra combined with SMT solving can verify data-flow invariants at compile time with zero runtime overhead and produce useful counterexamples.
    Stated as the verification technique enabling dense feedback.
invented entities (1)
  • Tile-based Pythonic DSL with tag functions and tag assertions no independent evidence
    purpose: Expose hardware instructions and compiler policies while propagating symbolic annotations and enforcing relational constraints.
    Core new artifact introduced to address sparse feedback in prior agents.
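The ledger's axiom, that constraint checking can yield useful concrete counterexamples at compile time, can be illustrated with a brute-force stand-in for the SMT solving the paper describes (a real system would use an SMT solver such as Z3; the ownership rule below is invented for the example):

```python
# Illustration of the ledger's axiom: checking a layout constraint over all
# (thread, element) pairs and returning a concrete witness on violation.
# Brute-force enumeration stands in for SMT solving here.

from itertools import product

def find_counterexample(constraint, threads, elems):
    """Search all (thread, element) pairs for a violated constraint."""
    for t, e in product(range(threads), range(elems)):
        if not constraint(t, e):
            return (t, e)   # concrete witness of the violation
    return None

# Invariant: element e may only be read by thread e // 4 (4 elements each).
# Buggy mapping: the kernel actually assigns element e to thread e // 8.
actual_owner = lambda e: e // 8
ok = lambda t, e: actual_owner(e) != t or e // 4 == t

print(find_counterexample(ok, threads=8, elems=32))  # (0, 4): thread 0 reads
# element 4, but the invariant says element 4 belongs to thread 1.
```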

pith-pipeline@v0.9.0 · 5657 in / 1339 out tokens · 23387 ms · 2026-05-10T09:53:18.323477+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

72 extracted references · 4 canonical work pages · 1 internal anchor


    Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. TextGrad: Automatic "differ- entiation" via text, 2024

  67. [67]

    Making information flow explicit in HiStar

    Nickolai Zeldovich, Silas Boyd-Wickizer, Eddie Kohler, and David Mazières. Making information flow explicit in HiStar. OSDI’06, 2006

  68. [68]

    En- abling tensor language model to assist in generating High-Performance tensor programs for deep learning

    Yi Zhai, Sijia Yang, Keyu Pan, Renwei Zhang, Shuo Liu, Chao Liu, Zichun Ye, Jianmin Ji, Jie Zhao, Yu Zhang, and Yanyong Zhang. En- abling tensor language model to assist in generating High-Performance tensor programs for deep learning. OSDI’24, 2024

  69. [69]

    CudaForge: An agent framework with hardware feed- back for CUDA kernel optimization, 2025

    Zijian Zhang, Rong Wang, Shiyang Li, Yuebo Luo, Mingyi Hong, and Caiwen Ding. CudaForge: An agent framework with hardware feed- back for CUDA kernel optimization, 2025

  70. [70]

    Benchmarking the performance of large language models on the cere- bras wafer scale engine, 2024

    Zuoning Zhang, Dhruv Parikh, Youning Zhang, and Viktor Prasanna. Benchmarking the performance of large language models on the cere- bras wafer scale engine, 2024

  71. [71]

    AKG: automatic kernel generation for neural processing units using polyhedral transformations

    Jie Zhao, Bojie Li, Wang Nie, Zhen Geng, Renwei Zhang, Xiong Gao, Bin Cheng, Chen Wu, Yun Cheng, Zheng Li, Peng Di, Kun Zhang, and Xuefeng Jin. AKG: automatic kernel generation for neural processing units using polyhedral transformations. PLDI’21, 2021

  72. [72]

    Gonzalez, and Ion Stoica

    Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. Ansor: Generating High- Performance tensor programs for deep learning. OSDI’20, 2020. 14