Nvidia tensor core programmability, performance & precision
6 Pith papers cite this work. Polarity classification is still indexing.
Citation roles: background (1)
Citation polarities: background (1)
citing papers explorer
- Prism: Symbolic Superoptimization of Tensor Programs
  Prism is the first symbolic superoptimizer for tensor programs that uses sGraph for compact representation of program families, two-level search, e-graph equivalence checking, and auto-tuning to achieve up to 2.2x speedup over prior superoptimizers on LLM workloads.
- Bit-Accurate Modeling of GPU Matrix Multiply-Accumulate Units: Demystifying Numerical Discrepancy and Accuracy
  The authors derive the first bit-accurate arithmetic models for matrix multiply-accumulate operations on ten GPU architectures spanning NVIDIA Volta to Blackwell and AMD CDNA1 to CDNA3.
- Accelerating Locality-Driven Integration in Quantum Chemistry with Block-Structured Matrix Multiplication
  KerneLDI accelerates exchange-correlation integration in Kohn-Sham DFT by up to 10x through block-structured matrix multiplication that exploits spatial locality on GPUs while preserving accuracy.
- Matrix-Free 3D SIMP Topology Optimization with Fused Gather-GEMM-Scatter Kernels
  A fused gather-GEMM-scatter CUDA kernel achieves 4.6-7.3x end-to-end speedup and 3.2-4.9x lower energy for matrix-free 3D SIMP topology optimization on RTX 4090 compared to three-stage baselines.
- Mixed-precision iterative refinement for low-rank Lyapunov equations
  Develops mixed-precision iterative refinement for low-rank Lyapunov equations with rounding error analysis enabling reduced precision for moderately conditioned problems.
- Analysis of Floating-Point Matrix Multiplication Computed via Integer Arithmetic
  Error analysis and cost estimator for recasting floating-point matrix multiplication as accumulated integer products on mixed-precision hardware.