NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.
In: International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years: 2026 (4)
verdicts: UNVERDICTED (4)
roles: background (1)
polarities: background (1)
representative citing papers
GTaP delivers a GPU-resident fork-join task-parallel runtime with pragma support and EPAQ that outperforms CPU OpenMP on several irregular applications.
ANNS-AMP adapts distance-computation precision to vector-space regions via a lightweight cluster-level predictor and a bit-serial accelerator, delivering 163.76x/10.57x/2.06x average speedups and 1100x/39.41x/6.66x energy reductions versus CPU/GPU/custom baselines with <2.7% accuracy loss.
Neighbor-only work stealing for 2D-mesh satellite constellations yields growing per-attempt latency advantages and performs within 2.2% of global stealing on emulated workloads.
citing papers explorer
- NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding
  NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.
- GTaP: A GPU-Resident Fork-Join Task-Parallel Runtime with a Pragma-Based Interface
  GTaP delivers a GPU-resident fork-join task-parallel runtime with pragma support and EPAQ that outperforms CPU OpenMP on several irregular applications.
- ANNS-AMP: Accelerating Approximate Nearest Neighbor Search via Adaptive Mixed-Precision Computing
  ANNS-AMP adapts distance-computation precision to vector-space regions via a lightweight cluster-level predictor and a bit-serial accelerator, delivering 163.76x/10.57x/2.06x average speedups and 1100x/39.41x/6.66x energy reductions versus CPU/GPU/custom baselines with <2.7% accuracy loss.
- Work Stealing for the 2D-Mesh Topology of Satellite Constellations in Low Earth Orbit
  Neighbor-only work stealing for 2D-mesh satellite constellations yields growing per-attempt latency advantages and performs within 2.2% of global stealing on emulated workloads.