FastKernels is a production-aligned benchmark covering 96.2% of HuggingFace Transformers that reveals state-of-the-art kernel agents deliver at most 0.94x aggregate speedup.
Cudabench: Benchmarking llms for text-to-cuda generation
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 5roles
background 1polarities
background 1representative citing papers
CUDAHercules benchmark demonstrates that leading LLMs generate functional CUDA code but fail to recover expert-level optimization strategies needed for peak performance on Ampere, Hopper, and Blackwell GPUs.
CodegenBench shows LLMs generate optimized code well for x86_64 but exhibit significant performance degradation on Sunway and Kunpeng due to limited documentation and training data.
CUDABEAVER benchmark and pass@k(M,C,A) metric show LLM CUDA debugging success drops by up to 40 percentage points under strict performance requirements.
Kernel-Smith combines evolutionary search with RL post-training to generate optimized GPU kernels, achieving SOTA speedups on KernelBench that beat Gemini-3.0-pro and Claude-4.6-opus on NVIDIA Triton and generalize to MetaX MACA.
citing papers explorer
-
FastKernels: Benchmarking GPU Kernel Generation in Production
FastKernels is a production-aligned benchmark covering 96.2% of HuggingFace Transformers that reveals state-of-the-art kernel agents deliver at most 0.94x aggregate speedup.
-
CUDAHercules: Benchmarking Hardware-Aware Expert-level CUDA Optimization for LLMs
CUDAHercules benchmark demonstrates that leading LLMs generate functional CUDA code but fail to recover expert-level optimization strategies needed for peak performance on Ampere, Hopper, and Blackwell GPUs.
-
CodegenBench: Can LLMs Write Efficient Code Across Architectures?
CodegenBench shows LLMs generate optimized code well for x86_64 but exhibit significant performance degradation on Sunway and Kunpeng due to limited documentation and training data.
-
CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging
CUDABEAVER benchmark and pass@k(M,C,A) metric show LLM CUDA debugging success drops by up to 40 percentage points under strict performance requirements.
-
Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
Kernel-Smith combines evolutionary search with RL post-training to generate optimized GPU kernels, achieving SOTA speedups on KernelBench that beat Gemini-3.0-pro and Claude-4.6-opus on NVIDIA Triton and generalize to MetaX MACA.