NVIDIA Tensor Core Programmability, Performance & Precision

2018 · arXiv 2018.00091

6 Pith papers cite this work. Polarity classification is still indexing.



citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Prism: Symbolic Superoptimization of Tensor Programs

cs.PL · 2026-04-16 · unverdicted · novelty 8.0

Prism is the first symbolic superoptimizer for tensor programs that uses sGraph for compact representation of program families, two-level search, e-graph equivalence checking, and auto-tuning to achieve up to 2.2x speedup over prior superoptimizers on LLM workloads.

Bit-Accurate Modeling of GPU Matrix Multiply-Accumulate Units: Demystifying Numerical Discrepancy and Accuracy

cs.AR · 2025-11-14 · accept · novelty 8.0

The authors derive the first bit-accurate arithmetic models for matrix multiply-accumulate operations on ten GPU architectures spanning NVIDIA Volta to Blackwell and AMD CDNA1 to CDNA3.

Accelerating Locality-Driven Integration in Quantum Chemistry with Block-Structured Matrix Multiplication

physics.comp-ph · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

KerneLDI accelerates exchange-correlation integration in Kohn-Sham DFT by up to 10x through block-structured matrix multiplication that exploits spatial locality on GPUs while preserving accuracy.

Matrix-Free 3D SIMP Topology Optimization with Fused Gather-GEMM-Scatter Kernels

cs.CE · 2026-04-20 · unverdicted · novelty 6.0

A fused gather-GEMM-scatter CUDA kernel achieves 4.6-7.3x end-to-end speedup and 3.2-4.9x lower energy for matrix-free 3D SIMP topology optimization on RTX 4090 compared to three-stage baselines.

Mixed-precision iterative refinement for low-rank Lyapunov equations

math.NA · 2025-10-02 · unverdicted · novelty 6.0

Develops mixed-precision iterative refinement for low-rank Lyapunov equations with rounding error analysis enabling reduced precision for moderately conditioned problems.

Analysis of Floating-Point Matrix Multiplication Computed via Integer Arithmetic

math.NA · 2025-06-12 · unverdicted · novelty 5.0

Error analysis and cost estimator for recasting floating-point matrix multiplication as accumulated integer products on mixed-precision hardware.

citing papers explorer


Prism: Symbolic Superoptimization of Tensor Programs
cs.PL · 2026-04-16 · unverdicted · none · ref 21

Bit-Accurate Modeling of GPU Matrix Multiply-Accumulate Units: Demystifying Numerical Discrepancy and Accuracy
cs.AR · 2025-11-14 · accept · none · ref 1

Accelerating Locality-Driven Integration in Quantum Chemistry with Block-Structured Matrix Multiplication
physics.comp-ph · 2026-05-11 · unverdicted · none · ref 35 · 2 links

Matrix-Free 3D SIMP Topology Optimization with Fused Gather-GEMM-Scatter Kernels
cs.CE · 2026-04-20 · unverdicted · none · ref 18

Mixed-precision iterative refinement for low-rank Lyapunov equations
math.NA · 2025-10-02 · unverdicted · none · ref 30

Analysis of Floating-Point Matrix Multiplication Computed via Integer Arithmetic
math.NA · 2025-06-12 · unverdicted · none · ref 30
