CUDAHercules: Benchmarking Hardware-Aware Expert-level CUDA Optimization for LLMs

The CUDAHercules benchmark demonstrates that leading LLMs generate functional CUDA code but fail to recover the expert-level optimization strategies needed for peak performance on Ampere, Hopper, and Blackwell GPUs.
C"linkage. •task.yaml — metadata including anti-cheat blocked patterns (e.g., #include

Citing papers (3 papers cite this work)

- CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging
  CUDABeaver shows that LLM-based CUDA debuggers often degenerate code so that it passes tests at the cost of speed, with protocol-aware metrics shifting measured success rates by up to 40 percentage points (see the evaluation sketch after this list).
- Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
  Kernel-Smith combines evolutionary search with RL post-training to generate optimized GPU kernels, achieving state-of-the-art speedups on KernelBench that beat Gemini-3.0-pro and Claude-4.6-opus on NVIDIA Triton and generalize to MetaX MACA hardware (see the search-loop sketch after this list).
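
To show how a protocol-aware metric can diverge from a test-pass-only one, here is a minimal self-contained harness. The kernels, the 10% latency tolerance, and the verdict logic are illustrative assumptions, not CUDABeaver's actual protocol: the candidate "repair" is numerically correct but under-parallelized, so it passes a pure output check while regressing speed, which is the degenerate behavior the paper flags.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>

// Hypothetical reference kernel and an LLM-"repaired" candidate.
__global__ void reference_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

__global__ void candidate_kernel(const float* in, float* out, int n) {
    // Grid-stride loop: correct for any launch shape, slow with one block.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        out[i] = 2.0f * in[i];
}

int main() {
    const int n = 1 << 22;
    float *in, *ref, *cand;
    cudaMallocManaged(&in,   n * sizeof(float));
    cudaMallocManaged(&ref,  n * sizeof(float));
    cudaMallocManaged(&cand, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);

    cudaEventRecord(t0);
    reference_kernel<<<(n + 255) / 256, 256>>>(in, ref, n);
    cudaEventRecord(t1); cudaEventSynchronize(t1);
    float ref_ms; cudaEventElapsedTime(&ref_ms, t0, t1);

    cudaEventRecord(t0);
    candidate_kernel<<<1, 64>>>(in, cand, n);  // the "fix" under test
    cudaEventRecord(t1); cudaEventSynchronize(t1);
    float cand_ms; cudaEventElapsedTime(&cand_ms, t0, t1);

    bool correct = true;
    for (int i = 0; i < n; ++i)
        if (fabsf(ref[i] - cand[i]) > 1e-5f) { correct = false; break; }

    // A test-pass-only metric accepts any correct candidate; a
    // protocol-aware metric also rejects fixes that regress speed.
    // The 10% tolerance is illustrative, not from the paper.
    bool pass_only      = correct;
    bool protocol_aware = correct && cand_ms <= 1.10f * ref_ms;
    printf("correct=%d ref=%.3f ms cand=%.3f ms pass_only=%d protocol_aware=%d\n",
           correct, ref_ms, cand_ms, pass_only, protocol_aware);
    return 0;
}
```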
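
And a minimal sketch of the evolutionary outer loop behind systems like Kernel-Smith, under heavy simplification: the "genome" here is just the launch block size of a fixed saxpy kernel, whereas Kernel-Smith mutates kernel source with an LLM and adds RL post-training. Function names and the population scheme are illustrative assumptions.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdio>
#include <random>
#include <utility>
#include <vector>

__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Fitness of one candidate configuration: measured kernel latency.
static float measure(int block, float* x, float* y, int n) {
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    int grid = (n + block - 1) / block;
    saxpy<<<grid, block>>>(2.0f, x, y, n);  // warm-up launch
    cudaEventRecord(t0);
    saxpy<<<grid, block>>>(2.0f, x, y, n);
    cudaEventRecord(t1); cudaEventSynchronize(t1);
    float ms; cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    return ms;
}

int main() {
    const int n = 1 << 24;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    std::mt19937 rng(42);
    std::vector<int> pop = {64, 128, 256, 512};  // initial population
    for (int gen = 0; gen < 5; ++gen) {
        // Evaluate every candidate and sort by latency (lower is fitter).
        std::vector<std::pair<float, int>> scored;
        for (int b : pop) scored.push_back({measure(b, x, y, n), b});
        std::sort(scored.begin(), scored.end());
        printf("gen %d best: block=%d %.3f ms\n",
               gen, scored[0].second, scored[0].first);
        // Keep the two fittest, refill by mutation (halve/double block size).
        pop = {scored[0].second, scored[1].second};
        while (pop.size() < 4) {
            int parent = pop[rng() % 2];
            int child = (rng() & 1) ? parent * 2 : parent / 2;
            pop.push_back(std::min(1024, std::max(32, child)));
        }
    }
    cudaFree(x); cudaFree(y);
    return 0;
}
```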