
arxiv: 2604.01489 · v2 · pith:NCKZNFL3new · submitted 2026-04-01 · 💻 cs.LG · cs.AI · cs.DC · cs.PF · cs.SE

CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe

classification 💻 cs.LG cs.AI cs.DC cs.PF cs.SE
keywords cutegen, kernel, agentic, cute, generation, kernels, framework, high-performance

High-performance GPU kernels are critical to modern machine learning systems, yet developing them remains a manual, expert-driven process. Recent work has explored using LLMs to automate kernel generation, but generated kernels still fall short of carefully tuned references on standardized benchmarks. We present CuTeGen, an agentic GPU kernel synthesis framework that treats kernel development as a structured generate-test-refine workflow over the CuTe abstraction layer. Two design choices distinguish CuTeGen from prior work: targeting CuTe rather than raw CUDA, which exposes performance-critical structures such as tiling and data movement while remaining stable enough for iterative refinement, and a delayed profiling schedule that withholds low-level performance feedback until the kernel's high-level structure has stabilized. On the 209 tasks of KernelBench Level-1 and Level-2, CuTeGen achieves an average speedup of 1.71$\times$ over PyTorch and outperforms the prior agentic baseline CudaForge (0.89$\times$) at comparable per-task generation cost. Code is available at https://github.com/taratt/cutegen.git.
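The generate-test-refine workflow with delayed profiling described in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the CuTeGen code: `refine_loop`, the `generate`, `test`, and `profile` callables, and the `profile_after` parameter are illustrative stand-ins. In particular, the sketch assumes a fixed iteration threshold as a proxy for "the kernel's high-level structure has stabilized"; the abstract does not say how stabilization is detected.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Attempt:
    """One correct kernel candidate and its measured runtime."""
    source: str
    runtime_ms: float


def refine_loop(
    generate: Callable[[str], str],                   # hypothetical: feedback -> kernel source
    test: Callable[[str], Tuple[bool, float, str]],   # hypothetical: source -> (correct, runtime_ms, message)
    profile: Callable[[str], str],                    # hypothetical: source -> low-level profiler report
    max_iters: int = 6,
    profile_after: int = 3,                           # assumption: fixed threshold stands in for "stabilized"
) -> Optional[Attempt]:
    """Run a generate-test-refine loop and return the fastest correct attempt."""
    best: Optional[Attempt] = None
    feedback = ""
    for step in range(max_iters):
        source = generate(feedback)
        correct, runtime_ms, message = test(source)
        if correct and (best is None or runtime_ms < best.runtime_ms):
            best = Attempt(source, runtime_ms)
        # Early iterations only see correctness and timing feedback.
        feedback = message
        # Delayed profiling: low-level feedback is withheld until later steps.
        if correct and step >= profile_after:
            feedback += "\n" + profile(source)
    return best
```

With stub callables standing in for the LLM, the test harness, and the profiler, the loop returns the fastest correct candidate and calls `profile` only from step `profile_after` onward.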



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SpecGen: Accelerating Agentic Kernel Optimization with Speculative Generation

    cs.DC 2026-06 unverdicted novelty 6.0

    SpecGen introduces speculative generation to fork non-reasoning kernel candidates during LLM reasoning traces, enabling early termination and parallel profiling to reduce end-to-end optimization time on H200 GPUs.

  2. GPU Forecasters: Language Models as Selective Surrogates for Kernel Runtime Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    LLMs can forecast GPU kernel performance accurately enough to serve as selective surrogates, allowing kernel searches to consider more candidates and recover faster kernels under fixed GPU evaluation budgets.

  3. Learning When to Optimize: Verified Optimization Skills from Expert GPU-Kernel Lineages

    cs.AI 2026-05 unverdicted novelty 6.0

    KLineage derives verified optimization skills from backward lineages of expert GPU kernels to guide LLM agents toward higher-quality and more efficient kernels than memory-based baselines.