NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.
In: 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 7roles
background 1polarities
background 1representative citing papers
PackSELL packs delta-encoded indices and values into single words with tunable bit allocation, delivering up to 1.63x faster FP16 SpMV and FP32-accurate performance exceeding FP16 cuSPARSE while reducing memory traffic.
FP64 tensor cores accelerate high-order finite-element kernels in MFEM by up to 2x with 83% energy gains and near-perfect weak scaling on exascale hardware.
KerneLDI accelerates exchange-correlation integration in Kohn-Sham DFT by up to 10x through block-structured matrix multiplication that exploits spatial locality on GPUs while preserving accuracy.
UCCL-Zip adds lossless compression to GPU communication to reduce LLM bottlenecks while preserving exact numerical correctness.
Error analysis and cost estimator for recasting floating-point matrix multiplication as accumulated integer products on mixed-precision hardware.
RAMSES integrates commercial encryption technologies and OS hardening into an HPC platform to enable secure sensitive-data processing with limited performance overhead, as shown by biomedical-sector benchmarks.
citing papers explorer
-
NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding
NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.
-
PackSELL: A Sparse Matrix Format for Precision-Agnostic High-Performance SpMV
PackSELL packs delta-encoded indices and values into single words with tunable bit allocation, delivering up to 1.63x faster FP16 SpMV and FP32-accurate performance exceeding FP16 cuSPARSE while reducing memory traffic.
-
Accelerating High-Order Finite Element Simulations at Extreme Scale with FP64 Tensor Cores
FP64 tensor cores accelerate high-order finite-element kernels in MFEM by up to 2x with 83% energy gains and near-perfect weak scaling on exascale hardware.
-
Accelerating Locality-Driven Integration in Quantum Chemistry with Block-Structured Matrix Multiplication
KerneLDI accelerates exchange-correlation integration in Kohn-Sham DFT by up to 10x through block-structured matrix multiplication that exploits spatial locality on GPUs while preserving accuracy.
-
UCCL-Zip: Lossless Compression Supercharged GPU Communication
UCCL-Zip adds lossless compression to GPU communication to reduce LLM bottlenecks while preserving exact numerical correctness.
-
Analysis of Floating-Point Matrix Multiplication Computed via Integer Arithmetic
Error analysis and cost estimator for recasting floating-point matrix multiplication as accumulated integer products on mixed-precision hardware.
-
RAMSES: Secure high-performance computing for sensitive data
RAMSES integrates commercial encryption technologies and OS hardening into an HPC platform to enable secure sensitive-data processing with limited performance overhead, as shown by biomedical-sector benchmarks.