CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe

Anne Ouyang; Fan Long; Jikai Jason Li; Tara Saba; Xujie Si; Zhiyang Chen

arXiv: 2604.01489 · v2 · pith:NCKZNFL3new · submitted 2026-04-01 · 💻 cs.LG · cs.AI · cs.DC · cs.PF · cs.SE

Tara Saba, Zhiyang Chen, Jikai Jason Li, Anne Ouyang, Xujie Si, Fan Long

keywords: cutegen, kernel, agentic, cute, generation, kernels, framework, high-performance

High-performance GPU kernels are critical to modern machine learning systems, yet developing them remains a manual, expert-driven process. Recent work has explored using LLMs to automate kernel generation, but generated kernels still fall short of carefully tuned references on standardized benchmarks. We present CuTeGen, an agentic GPU kernel synthesis framework that treats kernel development as a structured generate-test-refine workflow over the CuTe abstraction layer. Two design choices distinguish CuTeGen from prior work: targeting CuTe rather than raw CUDA, which exposes performance-critical structures such as tiling and data movement while remaining stable enough for iterative refinement, and a delayed profiling schedule that withholds low-level performance feedback until the kernel's high-level structure has stabilized. On the 209 tasks of KernelBench Level-1 and Level-2, CuTeGen achieves an average speedup of 1.71$\times$ over PyTorch and outperforms the prior agentic baseline CudaForge (0.89$\times$) at comparable per-task generation cost. Code available at https://github.com/taratt/cutegen.git
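
The generate-test-refine workflow with delayed profiling described in the abstract can be pictured with a small Python sketch. This is an illustration under stated assumptions, not CuTeGen's implementation: the names `refine_loop`, `Feedback`, `generate`, `test`, `profile`, and `stable_after` are all hypothetical, and "the high-level structure has stabilized" is approximated here as "the kernel passed correctness tests for several consecutive rounds", which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feedback:
    correct: bool                  # did the candidate pass the correctness test?
    message: str                   # test output handed back to the generator
    profile: Optional[str] = None  # low-level profiler output; None while withheld


def refine_loop(generate, test, profile, max_iters=8, stable_after=2):
    """Generate-test-refine with delayed profiling (hypothetical sketch)."""
    feedback, best, history = None, None, []
    stable_rounds = 0  # consecutive rounds that passed the correctness test
    for _ in range(max_iters):
        kernel = generate(feedback)    # stand-in for an LLM call
        ok, message = test(kernel)     # stand-in for compile + correctness check
        stable_rounds = stable_rounds + 1 if ok else 0
        # Assumption: only reveal profiler detail once the kernel looks stable.
        show_profile = ok and stable_rounds >= stable_after
        feedback = Feedback(ok, message, profile(kernel) if show_profile else None)
        history.append(feedback)
        if ok:
            best = kernel              # keep the most recent correct candidate
    return best, history


def demo():
    """Run the loop with stub callables in place of an LLM, tester, and profiler."""
    counter = {"n": 0}

    def generate(feedback):
        counter["n"] += 1
        return f"kernel_v{counter['n']}"

    def test(kernel):
        ok = kernel != "kernel_v1"     # the first attempt fails in this toy setup
        return ok, "pass" if ok else "mismatch vs. reference"

    def profile(kernel):
        return f"profile of {kernel}"

    return refine_loop(generate, test, profile, max_iters=5, stable_after=2)
```

In this toy run the first candidate fails, the second passes but receives only correctness feedback, and profiler output reaches the generator from the third round onward. A real system would also need a performance-based selection rule for `best`, which this sketch leaves out.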

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SpecGen: Accelerating Agentic Kernel Optimization with Speculative Generation
cs.DC 2026-06 unverdicted novelty 6.0

SpecGen introduces speculative generation to fork non-reasoning kernel candidates during LLM reasoning traces, enabling early termination and parallel profiling to reduce end-to-end optimization time on H200 GPUs.

GPU Forecasters: Language Models as Selective Surrogates for Kernel Runtime Optimization
cs.LG 2026-05 unverdicted novelty 6.0

LLMs can forecast GPU kernel performance accurately enough to serve as selective surrogates, allowing kernel searches to consider more candidates and recover faster kernels under fixed GPU evaluation budgets.

Learning When to Optimize: Verified Optimization Skills from Expert GPU-Kernel Lineages
cs.AI 2026-05 unverdicted novelty 6.0

KLineage derives verified optimization skills from backward lineages of expert GPU kernels to guide LLM agents toward higher-quality and more efficient kernels than memory-based baselines.