Triton: an intermediate language and compiler for tiled neural network computations

Philippe Tillet, Hsiang-Tsung Kung, David D · 2019

5 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

cs.SE · 2026-05-04 · conditional · novelty 7.0

Kerncap automates extraction of faithful, self-contained GPU kernel reproducers from AMD HIP and Triton workloads via HSA interception and address-space closure, delivering 13.6x faster isolated tuning.

FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication

cs.DC · 2026-05-07 · unverdicted · novelty 6.0

FalconGEMM delivers a framework with deployment, group-parallel execution, and analytical decision modules that makes lower-complexity matrix multiplication practical, beating cuBLAS and similar libraries by 7.59-17.85% on LLM tasks.

RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts

cs.LG · 2026-04-28 · unverdicted · novelty 6.0

RaMP uses a hardware-derived performance region analysis and a four-parameter wave cost model to select optimal polymorphic kernel configurations for MoE inference from runtime expert histograms, delivering 1.22x kernel and 1.30x end-to-end speedups with 0.93% mean regret after brief profiling.

Fast-Vollib: A Fast Implied Volatility Library for Python with PyTorch, JAX, and CUDA Fused-Kernel Backends

q-fin.CP · 2026-04-29 · unverdicted · novelty 5.0

fast-vollib delivers a high-performance open-source Python library for option pricing and implied volatility with multiple accelerated backends as a drop-in alternative to py_vollib.

RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference

cs.LG · 2025-05-05

citing papers explorer

Showing 5 of 5 citing papers.

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs cs.SE · 2026-05-04 · conditional · none · ref 17
FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication cs.DC · 2026-05-07 · unverdicted · none · ref 21
RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts cs.LG · 2026-04-28 · unverdicted · none · ref 15
Fast-Vollib: A Fast Implied Volatility Library for Python with PyTorch, JAX, and CUDA Fused-Kernel Backends q-fin.CP · 2026-04-29 · unverdicted · none · ref 14
RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference cs.LG · 2025-05-05 · unreviewed · ref 93
