FastKernels is a production-aligned benchmark covering 96.2% of HuggingFace Transformers that reveals state-of-the-art kernel agents deliver at most 0.94x aggregate speedup.
KernelEvolve: Scaling agentic kernel coding for heterogeneous AI accelerators at meta
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 9roles
background 1polarities
background 1representative citing papers
PassNet provides a dataset of 18K graphs and PassBench for LLM-generated compiler passes, with fine-tuned models achieving 2.67x gains on long-tail tasks where TorchInductor underperforms.
A two-stage agent skill system enables autonomous end-to-end deployment of eight decoder-only LLMs on AMD XDNA 2 NPU with numerical correctness in 0.5-4 hours each, generalizing from a human-guided Llama-3.2-1B reference.
Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering up to 53% and average 34% gains over prior LLM serving systems under runtime dynamics.
Trellis treats agent experience graphs as first-class database state so that search patterns become queries, enabling crash recovery, scaling, and closed-loop training as architectural byproducts.
Evolution Fine-Tuning trains LLMs on 156K trajectories spanning 371 tasks to achieve 10.22% average improvement on 22 held-out optimization tasks and match SOTA on select circle-packing problems when combined with test-time RL.
Multi-agent LLM systems discover new Transformer and hybrid architectures that outperform Llama 3.2 at 1B scale and approach human SOTA on long-range benchmarks.
DITRON introduces a hierarchical multi-level tiling compiler for distributed tensor programs that matches or exceeds expert CUDA libraries with 6-30% speedups and has been deployed to improve training MFU by over 10% while saving hundreds of thousands of GPU hours monthly.
AscendOptimizer combines kernel rewinding for reusable experience with evolutionary search on hardware feedback to optimize Ascend NPU operators, delivering 1.21x geometric-mean speedup and faster performance on 53.47% of 101 tested operators versus baseline.
citing papers explorer
-
PassNet: Scaling Large Language Models for Graph Compiler Pass Generation
PassNet provides a dataset of 18K graphs and PassBench for LLM-generated compiler passes, with fine-tuned models achieving 2.67x gains on long-tail tasks where TorchInductor underperforms.
-
Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design
Multi-agent LLM systems discover new Transformer and hybrid architectures that outperform Llama 3.2 at 1B scale and approach human SOTA on long-range benchmarks.