CUDA Agent: Large-scale agentic RL for high-performance CUDA kernel generation. arXiv preprint arXiv:2602.24286
7 Pith papers cite this work, all from 2026. Polarity classification is still indexing, so all 7 are currently unverdicted.
Citing papers
- CUDAHercules: Benchmarking Hardware-Aware Expert-level CUDA Optimization for LLMs
  The CUDAHercules benchmark demonstrates that leading LLMs generate functional CUDA code but fail to recover the expert-level optimization strategies needed for peak performance on Ampere, Hopper, and Blackwell GPUs.
- FACT: Compositional Kernel Synthesis with a Three-Stage Agentic Workflow
  FACT is a three-stage agent-driven system that synthesizes and composes CUTLASS kernels from PyTorch modules, achieving up to 2.03x speedup on transformer blocks over PyTorch and competing optimizers.
- Evaluation-driven Scaling for Scientific Discovery
  SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines, with examples such as a 2x faster LASSO and new Erdős constructions.
- AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation
  AdaExplore improves the correctness and speed of Triton kernel generation by converting recurring failures into a memory of rules and organizing search as a tree that mixes local refinements with larger regenerations, yielding 3.12x and 1.72x speedups on KernelBench Level-2 and Level-3, respectively, within 100 steps.
- InCoder-32B-Thinking: Industrial Code World Model for Thinking
  InCoder-32B-Thinking uses thinking traces synthesized from error feedback, together with a code world model, to reach top open-source scores on general and industrial code benchmarks, including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.
- Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
  Kernel-Smith combines evolutionary search with RL post-training to generate optimized GPU kernels, achieving SOTA speedups on KernelBench that beat Gemini-3.0-pro and Claude-4.6-opus on NVIDIA Triton and generalize to MetaX MACA.
- KEET: Explaining Performance of GPU Kernels Using LLM Agents
  KEET uses LLM agents to generate data-grounded natural-language explanations of performance issues in GPU kernels from Nsight Compute profiles, and shows that these explanations improve downstream LLM-based optimization tasks.