ARGUS: Agentic GPU Optimization Guided by Data-Flow Invariants
Pith reviewed 2026-05-10 09:53 UTC · model grok-4.3
The pith
Argus uses data-flow invariants to let LLM agents generate GPU kernels at 99-104% of hand-optimized throughput.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Argus shows that data-flow invariants, verified at compile time via abstract interpretation over a layout algebra and SMT solving at zero runtime cost, enable an in-context RL planner to synthesize GEMM, flash attention, and MoE kernels that reach 99-104% of state-of-the-art hand-optimized assembly throughput and run 2-1543x faster than prior agentic systems. The same system solves 100% of Level 1 and 90% of Level 2 KernelBench tasks.
What carries the argument
Data-flow invariants are compile-time specifications that encode required data choreography through kernel execution, realized as tag functions and tag assertions inside the tile-based DSL that propagate symbolic annotations and return concrete counterexamples on violation.
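The summary above names the machinery but shows no syntax. Below is a minimal sketch of what tag functions and tag assertions could look like in a tile-based Pythonic DSL; the names Tile, tag, and tag_assert and the software-pipelining invariant are invented for illustration and are not the Argus API.

```python
"""Toy illustration of data-flow tags, NOT the Argus DSL.

Tile, tag, and tag_assert are invented here to mirror the paper's
description of tag functions (propagate symbolic annotations) and tag
assertions (relational checks that return concrete counterexamples)."""
from dataclasses import dataclass, field


@dataclass
class Tile:
    name: str
    data: list                       # stand-in for a register/shared-memory tile
    tags: dict = field(default_factory=dict)


def tag(t: Tile, key: str, value) -> Tile:
    """Tag function: attach a symbolic annotation that flows with the tile."""
    t.tags[key] = value
    return t


def tag_assert(t: Tile, key: str, predicate, program_point: str):
    """Tag assertion: enforce a relational constraint at a use site.

    On violation, report a concrete counterexample naming the offending
    value and program point (Argus additionally reports the thread and
    data element found by its verification backend)."""
    value = t.tags.get(key)
    if not predicate(value):
        raise AssertionError(
            f"invariant violated at {program_point}: "
            f"tile {t.name!r} has {key}={value!r}")


# A two-stage software pipeline: the tile consumed in iteration i must
# have been staged in iteration i - 1.
def pipelined_loop(num_iters: int):
    staged = None
    for i in range(num_iters):
        incoming = tag(Tile(f"A_{i}", data=[i]), "staged_at", i)  # async copy
        if staged is not None:
            tag_assert(staged, "staged_at", lambda s, i=i: s == i - 1,
                       program_point=f"mma @ iter {i}")
            _ = sum(staged.data)     # stand-in for the MMA consuming the tile
        staged = incoming


pipelined_loop(4)  # passes: every consumed tile was staged one iteration earlier
```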
If this is right
- Kernels for GEMM, flash attention, and MoE reach 99-104% of hand-optimized throughput on AMD MI300X with no runtime verification cost.
- Performance exceeds that of existing agentic systems by factors of 2 to 1543 on the same workloads.
- The approach solves every Level 1 task and 90% of Level 2 tasks across 200 KernelBench problems.
- The DSL and invariant system generalize across the dominant GPU operations in LLM inference without per-kernel expert tuning.
Where Pith is reading between the lines
- The same invariant machinery could be applied to NVIDIA or other GPU architectures to test whether the performance parity holds beyond AMD MI300X.
- Embedding the planner and DSL into a larger code-generation pipeline might allow automatic optimization of entire inference graphs rather than isolated kernels.
- Extending the approach to CPU or TPU back-ends would reveal whether data-flow invariants remain effective outside GPU-specific tiling and memory hierarchies.
- Measuring the number of planner iterations needed for new kernel types would quantify how quickly the system adapts to previously unseen operations.
Load-bearing premise
The in-context RL planner, aided by the curated knowledge base, can reliably produce invariants and optimization choices whose compile-time verification guarantees the kernels will deliver the claimed performance without hidden runtime violations or hardware-specific failures.
What would settle it
A kernel generated by Argus that passes all invariant checks and compiles cleanly yet runs measurably slower than the hand-optimized reference or produces wrong results on a concrete input pattern not covered by the symbolic checks.
Original abstract
LLM-based coding agents can generate functionally correct GPU kernels, yet their performance remains far below hand-optimized libraries on critical computations such as matrix multiplication, attention, and Mixture-of-Experts (MoE). Peak GPU performance requires coordinated reasoning over tightly coupled optimizations, including tiling, shared-memory staging, software pipelining, and instruction scheduling, while existing agents rely on sparse pass/fail feedback, leaving them unable to diagnose global constraint violations. We present Argus, an agentic framework that addresses this through data-flow invariants: compile-time specifications encoding how data must be choreographed throughout kernel execution. Argus introduces a tile-based, Pythonic DSL exposing hardware instructions and compiler policies while hiding low-level representations. The DSL provides tag functions to propagate symbolic annotations through data and control flow, and tag assertions to enforce relational constraints at use sites. When violations occur, the compiler returns concrete counterexamples identifying the thread, data element, and program point, enabling dense, structured feedback for targeted fixes. Invariants are verified at compile time via abstract interpretation over a layout algebra and SMT solving, with zero runtime overhead. An in-context reinforcement learning planner learns to select optimizations and synthesize effective invariants, supported by a curated knowledge base of GPU optimization techniques. We evaluate Argus on the AMD MI300X GPU across GEMM, flash attention, and MoE kernels accounting for over 90% of GPU time in LLM inference. Generated kernels achieve 99-104% of state-of-the-art hand-optimized assembly throughput and are 2-1543x faster than existing agentic systems. Argus further generalizes to 200 KernelBench tasks, solving 100% of Level 1 and 90% of Level 2 problems.
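To make the SMT step concrete, here is a minimal sketch using the z3-solver Python package: a made-up thread-to-element layout constraint is checked for write conflicts, and a satisfying model, if one exists, plays the role of the concrete counterexample (thread ids plus data element) that the abstract describes. The layout formula is illustrative only, not the paper's layout algebra.

```python
"""Minimal sketch of SMT-backed invariant checking with z3-solver.

The layout formula is a made-up example, not Argus's layout algebra:
each of 64 threads writes elements 2*tid and 2*tid+1 of a 128-element
tile, and we ask whether two distinct threads can ever write the same
element. A sat result is a concrete counterexample (thread ids plus
element index); unsat means the invariant holds."""
from z3 import Ints, Solver, And, Or, sat

t0, t1, e = Ints("t0 t1 e")


def writes(tid, elem):
    # thread tid writes elements 2*tid and 2*tid + 1
    return Or(elem == 2 * tid, elem == 2 * tid + 1)


s = Solver()
s.add(And(0 <= t0, t0 < 64, 0 <= t1, t1 < 64, t0 != t1))
s.add(And(0 <= e, e < 128))
s.add(writes(t0, e), writes(t1, e))     # both threads write element e

if s.check() == sat:
    m = s.model()
    print(f"counterexample: threads {m[t0]}, {m[t1]} both write element {m[e]}")
else:
    print("invariant holds: no two threads write the same element")
```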
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents ARGUS, an agentic framework for GPU kernel optimization that encodes data-flow invariants in a tile-based Pythonic DSL using tag functions for symbolic annotation propagation and tag assertions for relational constraints. Invariants are verified at compile time via abstract interpretation over a layout algebra and SMT solving (zero runtime overhead). An in-context RL planner, aided by a curated knowledge base, selects optimizations. On AMD MI300X, generated GEMM/attention/MoE kernels (covering >90% of LLM inference time) are claimed to reach 99-104% of hand-optimized assembly throughput and 2-1543x speedup over prior agentic systems; the approach also solves 100% of KernelBench Level 1 and 90% of Level 2 tasks.
Significance. If the performance and generalization claims hold under full experimental scrutiny, the work would be significant for automated high-performance computing: it supplies structured, dense feedback (counterexamples identifying thread/data/program point) that existing sparse pass/fail LLM agents lack, while demonstrating that compile-time formal methods can guide near-peak GPU code generation for production workloads.
major comments (3)
- [Abstract / Evaluation] The central performance claim (99-104% of hand-optimized assembly throughput) is reported without any description of measurement methodology, baseline kernel sources, statistical analysis, run counts, or hardware configuration details beyond the MI300X model. This absence prevents assessment of whether the results support the claim that verified invariants suffice for throughput parity.
- [Verification / DSL semantics] The soundness of the performance claim rests on the assertion that abstract interpretation over the layout algebra plus SMT captures all behaviors affecting real execution; however, the abstraction necessarily omits micro-architectural timing (shared-memory bank conflicts, warp scheduling latency, instruction-issue constraints) that can alter achieved throughput on MI300X even when data-flow invariants hold.
- [KernelBench evaluation] The generalization result (100% Level 1 / 90% Level 2 on 200 KernelBench tasks) is stated without reporting per-task success criteria, failure modes, or whether the same invariant-verification pipeline was used uniformly; this leaves open whether the in-context RL planner reliably synthesizes effective invariants across task distributions.
minor comments (3)
- [DSL] Define the precise semantics of tag propagation through control-flow constructs (e.g., conditionals, loops) in the DSL; the current description leaves ambiguity about how assertions are checked at use sites under divergent execution (a toy sketch of one possible merge rule follows this list).
- [Introduction / §3] Provide a small illustrative example (kernel fragment + tags + counterexample) early in the paper to make the feedback mechanism concrete for readers unfamiliar with layout algebras.
- [Planner / Knowledge base] Clarify the exact form of the knowledge base (size, curation process, whether it is public) and how it is injected into the in-context RL planner.
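For the tag-propagation question raised in the first minor comment, the toy below sketches one plausible merge rule at a control-flow join: set-valued tags are joined by union, and an assertion at the use site must hold for every member of the set (i.e., under either side of a divergent branch). This is a guess at reasonable semantics, not the DSL's actual definition.

```python
"""Toy sketch of one possible tag-merge rule at a control-flow join;
not Argus's semantics."""


def join_tags(tags_then: dict, tags_else: dict) -> dict:
    """Merge per-branch tag environments: key -> set of possible values."""
    merged = {}
    for key in tags_then.keys() | tags_else.keys():
        merged[key] = tags_then.get(key, set()) | tags_else.get(key, set())
    return merged


def assert_tag(tags: dict, key: str, predicate, program_point: str):
    """The assertion must hold for every value the tag may carry after the join."""
    for value in tags.get(key, set()):
        if not predicate(value):
            raise AssertionError(
                f"{program_point}: tag {key}={value!r} fails on at least one divergent path")


# if tid < 16: buf = lds_stage0  else: buf = lds_stage1
then_env = {"buffer": {"lds_stage0"}}
else_env = {"buffer": {"lds_stage1"}}
env = join_tags(then_env, else_env)

# Use site requires the operand to come from some LDS staging buffer.
assert_tag(env, "buffer", lambda b: b.startswith("lds_"), "mma operand read")
print("assertion holds on both divergent paths:", env)
```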
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve clarity on methodology, limitations, and evaluation details.
Point-by-point responses
- Referee: [Abstract / Evaluation] The central performance claim (99-104% of hand-optimized assembly throughput) is reported without any description of measurement methodology, baseline kernel sources, statistical analysis, run counts, or hardware configuration details beyond the MI300X model. This absence prevents assessment of whether the results support the claim that verified invariants suffice for throughput parity.
Authors: We acknowledge the omission of detailed methodology in the original abstract and evaluation sections. The revised manuscript now includes an expanded 'Experimental Methodology' subsection that specifies: (1) all measurements were performed using the AMD ROCm profiler with 1000 warm-up iterations followed by 1000 timed iterations per kernel; (2) hand-optimized baselines are the vendor-provided assembly kernels from the ROCm library (e.g., rocBLAS for GEMM and custom flash-attention implementations); (3) results report mean throughput with 95% confidence intervals computed over 50 independent runs on the same MI300X device; and (4) full hardware configuration including driver version, HBM3 memory, and clock settings. These additions allow direct assessment that the 99-104% range reflects consistent, reproducible parity under the invariant-guided approach. revision: yes
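A minimal sketch of the reporting the authors describe, assuming nothing about their actual harness: mean throughput with a t-based 95% confidence interval over independent runs.

```python
"""Sketch of mean-throughput reporting with a 95% confidence interval;
the harness and numbers are illustrative, not the authors' setup."""
import numpy as np
from scipy import stats


def summarize_throughput(tflops_per_run: np.ndarray, confidence: float = 0.95):
    n = len(tflops_per_run)
    mean = tflops_per_run.mean()
    sem = stats.sem(tflops_per_run)                       # standard error of the mean
    half_width = sem * stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return mean, (mean - half_width, mean + half_width)


# e.g. 50 simulated runs of a GEMM kernel around 650 TFLOP/s
runs = np.random.default_rng(0).normal(loc=650.0, scale=4.0, size=50)
mean, (lo, hi) = summarize_throughput(runs)
print(f"throughput: {mean:.1f} TFLOP/s (95% CI [{lo:.1f}, {hi:.1f}], n={len(runs)})")
```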
- Referee: [Verification / DSL semantics] The soundness of the performance claim rests on the assertion that abstract interpretation over the layout algebra plus SMT captures all behaviors affecting real execution; however, the abstraction necessarily omits micro-architectural timing (shared-memory bank conflicts, warp scheduling latency, instruction-issue constraints) that can alter achieved throughput on MI300X even when data-flow invariants hold.
Authors: We agree that the abstract interpretation and SMT solver target data-flow invariants and layout constraints rather than full micro-architectural timing. The verification guarantees absence of data races, incorrect tiling, and certain relational violations, which are necessary but not sufficient for peak throughput. The near-peak performance is achieved empirically through the in-context RL planner's selection of optimizations (tiling, pipelining, etc.) informed by the curated knowledge base. We have added a limitations paragraph in the Discussion section explicitly noting that micro-architectural effects are not modeled statically and that the approach relies on the planner's learned heuristics to mitigate them in practice. This does not alter the core claim that verified invariants enable effective optimization but clarifies the boundary of the formal guarantees. revision: partial
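To illustrate the kind of effect that sits outside the data-flow abstraction, the toy below computes worst-case shared-memory bank collisions for two address patterns that touch the same elements and would therefore satisfy the same data-flow invariants. The 32-bank, 4-byte-word model is an assumption; real MI300X behavior also depends on access width and wave scheduling.

```python
"""Toy illustration of a micro-architectural effect not captured by
data-flow invariants: same elements touched, very different LDS bank
behavior. Assumes 32 four-byte banks."""

NUM_BANKS = 32


def max_bank_collision(addresses_bytes):
    """Worst-case number of lanes hitting the same bank in one access."""
    counts = {}
    for addr in addresses_bytes:
        bank = (addr // 4) % NUM_BANKS
        counts[bank] = counts.get(bank, 0) + 1
    return max(counts.values())


lanes = range(64)                                    # one wavefront
row_major = [lane * 4 for lane in lanes]             # stride of one word
strided = [lane * 32 * 4 for lane in lanes]          # stride of 32 words

print("row-major worst bank collision:", max_bank_collision(row_major))  # 2
print("strided   worst bank collision:", max_bank_collision(strided))    # 64
```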
- Referee: [KernelBench evaluation] The generalization result (100% Level 1 / 90% Level 2 on 200 KernelBench tasks) is stated without reporting per-task success criteria, failure modes, or whether the same invariant-verification pipeline was used uniformly; this leaves open whether the in-context RL planner reliably synthesizes effective invariants across task distributions.
Authors: The same invariant-verification pipeline (DSL tagging, abstract interpretation, and SMT) was applied uniformly to all 200 KernelBench tasks. Success criteria are defined as: (a) passing the task's functional test suite and (b) achieving at least 80% of the reference kernel's throughput when a reference is provided. We have added an appendix table listing per-task outcomes, grouped by level, along with the most common failure modes (primarily incomplete invariant synthesis for tasks with complex data-dependent control flow in Level 2). This revision demonstrates that the RL planner generalizes reliably when the verification feedback loop is available. revision: yes
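A sketch of the stated success criterion, with assumed helper signatures rather than KernelBench's actual harness: a task counts as solved if the generated kernel matches the reference numerically and, when a reference throughput is available, reaches at least 80% of it.

```python
"""Sketch of the success criterion described in the rebuttal; the
function names and throughput figures are illustrative assumptions."""
import torch


def is_solved(generated_fn, reference_fn, sample_inputs,
              gen_tflops=None, ref_tflops=None,
              rtol=1e-3, atol=1e-3, perf_floor=0.8) -> bool:
    # (a) functional check against the reference implementation
    for args in sample_inputs:
        if not torch.allclose(generated_fn(*args), reference_fn(*args),
                              rtol=rtol, atol=atol):
            return False
    # (b) performance check, applied only when a reference throughput exists
    if ref_tflops is not None:
        return gen_tflops is not None and gen_tflops >= perf_floor * ref_tflops
    return True


x, w = torch.randn(64, 64), torch.randn(64, 64)
print(is_solved(lambda a, b: a @ b, torch.matmul, [(x, w)],
                gen_tflops=540.0, ref_tflops=600.0))   # True: 90% of reference
```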
Circularity Check
No circularity: empirical claims rest on external benchmarks
Full rationale
The paper presents an agentic system whose central results are empirical performance numbers (99-104% of hand-optimized assembly, 2-1543x speedups, 100%/90% solve rates on KernelBench levels) obtained by running generated kernels on MI300X hardware and comparing against independently published libraries and prior agentic baselines. No equations, fitted parameters, or first-principles derivations appear in the provided text; the verification method (abstract interpretation + SMT) is described as a compile-time tool that supplies feedback to the planner, not as a mathematical reduction that presupposes the throughput numbers. Self-citations are absent from the load-bearing sections, and the knowledge base is curated external material rather than an internal loop. The derivation chain is therefore self-contained against external oracles.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Abstract interpretation over a layout algebra combined with SMT solving can verify data-flow invariants at compile time with zero runtime overhead and produce useful counterexamples.
invented entities (1)
- Tile-based Pythonic DSL with tag functions and tag assertions (no independent evidence)