Kerncap automates extraction of faithful, self-contained GPU kernel reproducers from AMD HIP and Triton workloads via HSA interception and address-space closure, delivering 13.6x faster isolated tuning.
Triton : an intermediate language and compiler for tiled neural network computations
5 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
FalconGEMM delivers a framework with deployment, group-parallel execution, and analytical decision modules that makes lower-complexity matrix multiplication practical, beating cuBLAS and similar libraries by 7.59-17.85% on LLM tasks.
RaMP uses a hardware-derived performance region analysis and a four-parameter wave cost model to select optimal polymorphic kernel configurations for MoE inference from runtime expert histograms, delivering 1.22x kernel and 1.30x end-to-end speedups with 0.93% mean regret after brief profiling.
fast-vollib delivers a high-performance open-source Python library for option pricing and implied volatility with multiple accelerated backends as a drop-in alternative to py_vollib.
citing papers explorer
-
Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs
Kerncap automates extraction of faithful, self-contained GPU kernel reproducers from AMD HIP and Triton workloads via HSA interception and address-space closure, delivering 13.6x faster isolated tuning.
-
FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication
FalconGEMM delivers a framework with deployment, group-parallel execution, and analytical decision modules that makes lower-complexity matrix multiplication practical, beating cuBLAS and similar libraries by 7.59-17.85% on LLM tasks.
-
RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts
RaMP uses a hardware-derived performance region analysis and a four-parameter wave cost model to select optimal polymorphic kernel configurations for MoE inference from runtime expert histograms, delivering 1.22x kernel and 1.30x end-to-end speedups with 0.93% mean regret after brief profiling.
-
Fast-Vollib: A Fast Implied Volatility Library for Pythonwith PyTorch, JAX, and CUDA Fused-Kernel Backends
fast-vollib delivers a high-performance open-source Python library for option pricing and implied volatility with multiple accelerated backends as a drop-in alternative to py_vollib.
- RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference