CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe

· 2026 · cs.LG · arXiv 2604.01489

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

High-performance GPU kernels are critical to modern machine learning systems, yet developing them remains a manual, expert-driven process. Recent work has explored using LLMs to automate kernel generation, but generated kernels still fall short of carefully tuned references on standardized benchmarks. We present CuTeGen, an agentic GPU kernel synthesis framework that treats kernel development as a structured generate-test-refine workflow over the CuTe abstraction layer. Two design choices distinguish CuTeGen from prior work: targeting CuTe rather than raw CUDA, which exposes performance-critical structures such as tiling and data movement while remaining stable enough for iterative refinement, and a delayed profiling schedule that withholds low-level performance feedback until the kernel's high-level structure has stabilized. On the 209 tasks of KernelBench Level-1 and Level-2, CuTeGen achieves an average speedup of 1.71$\times$ over PyTorch and outperforms the prior agentic baseline CudaForge (0.89$\times$) at comparable per-task generation cost. Code available at https://github.com/taratt/cutegen.git
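The generate-test-refine workflow with delayed profiling described in the abstract can be sketched roughly as follows. This is a minimal illustration, not CuTeGen's actual implementation: the names `refine_loop`, `Feedback`, `generate`, `test`, `profile`, and `profile_after` are all hypothetical, and a fixed iteration threshold stands in for the paper's criterion that the kernel's high-level structure has stabilized.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Feedback:
    """What the generator sees about its previous candidate (hypothetical shape)."""
    correct: bool
    runtime_ms: float
    profile: Optional[str] = None  # low-level profiler notes; None while withheld


def refine_loop(
    generate: Callable[[Optional[Feedback]], str],
    test: Callable[[str], Tuple[bool, float]],
    profile: Callable[[str], str],
    max_iters: int = 8,
    profile_after: int = 4,
) -> Tuple[Optional[str], float]:
    """Run a generate-test-refine loop and return the fastest correct kernel.

    Assumption: profiling feedback is withheld for the first `profile_after`
    iterations, as a simple stand-in for "until the structure has stabilized".
    """
    best_src: Optional[str] = None
    best_ms = float("inf")
    feedback: Optional[Feedback] = None

    for step in range(max_iters):
        src = generate(feedback)            # e.g. an LLM call producing kernel source
        correct, runtime_ms = test(src)     # compile, check correctness, time it

        # Delayed profiling: only attach low-level feedback in later iterations,
        # and only for candidates that already pass the correctness check.
        notes = profile(src) if (correct and step >= profile_after) else None
        feedback = Feedback(correct, runtime_ms, notes)

        if correct and runtime_ms < best_ms:
            best_src, best_ms = src, runtime_ms

    return best_src, best_ms
```

In a real system, `generate` would wrap the model call, `test` would build and benchmark the kernel against a reference, and `profile` would summarize profiler output; here they are plain callables so the control flow can be exercised with stubs.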

representative citing papers

GPU Forecasters: Language Models as Selective Surrogates for Kernel Runtime Optimization

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

LLMs can forecast GPU kernel performance accurately enough to serve as selective surrogates, allowing kernel searches to consider more candidates and recover faster kernels under fixed GPU evaluation budgets.

Learning When to Optimize: Verified Optimization Skills from Expert GPU-Kernel Lineages

cs.AI · 2026-05-27 · unverdicted · novelty 6.0

KLineage derives verified optimization skills from backward lineages of expert GPU kernels to guide LLM agents toward higher-quality and more efficient kernels than memory-based baselines.

citing papers explorer

Showing 1 of 1 citing paper after filters.

GPU Forecasters: Language Models as Selective Surrogates for Kernel Runtime Optimization

cs.LG · 2026-05-29 · unverdicted · none · ref 32 · internal anchor

LLMs can forecast GPU kernel performance accurately enough to serve as selective surrogates, allowing kernel searches to consider more candidates and recover faster kernels under fixed GPU evaluation budgets.
