hub

InProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis(Atlanta, GA, USA)(SC ’24)

Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects · 2024 · arXiv 1406.2024

15 Pith papers cite this work. Polarity classification is still indexing.

15 Pith papers citing it

read on arXiv browse 15 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

representative citing papers

NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding

cs.DC · 2026-05-12 · unverdicted · novelty 7.0

NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.

Unfolding an Atomistic World: Atomistic Simulation of Reactor Pressure Vessel Steel Across Year-and-Meter Scales

cs.DC · 2026-04-27 · unverdicted · novelty 7.0 · 2 refs

AtomWorld enables the first direct atomistic simulation of RPV steel at year-and-meter scales, handling ten-quintillion-atom systems and simulating one service year in 1.71 days with 92-97% scaling efficiency on leadership supercomputers.

Mosaic: Cross-Modal Clustering for Efficient Video Understanding

cs.PF · 2026-04-11 · unverdicted · novelty 7.0 · 5 refs

Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.

Multi-Agent Orchestration for High-Throughput Materials Screening on a Leadership-Class System

cs.AI · 2026-04-09 · unverdicted · novelty 7.0

A planner-executor multi-agent system using gpt-oss-120b and Parsl orchestrates scalable high-throughput MOF screening on the Aurora supercomputer with low overhead.

NestPipe: Large-Scale Recommendation Training on 1,500+ Accelerators via Nested Pipelining

cs.DC · 2026-04-08 · unverdicted · novelty 7.0 · 2 refs

NestPipe achieves up to 3.06x speedup and 94.07% scaling efficiency on 1,536 workers via dual-buffer inter-batch and frozen-window intra-batch pipelining that overlaps communication with computation.

Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods

cs.DC · 2026-04-02 · unverdicted · novelty 7.0

Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

cs.LG · 2025-02-07 · unverdicted · novelty 7.0

A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

Fast MoE Inference via Predictive Prefetching and Expert Replication

cs.LG · 2026-05-12 · conditional · novelty 6.0

Dynamic replication of predicted overloaded experts in MoE models achieves near-100% GPU utilization and up to 3x faster inference while retaining 90-95% of baseline performance.

Stencil Computations on Cerebras Wafer-Scale Engine

cs.DC · 2026-05-08 · unverdicted · novelty 6.0

CStencil on the WSE-3 achieves up to 342x speedup for 2D stencils versus an adapted single-precision GPU solver and saturates both compute and on-chip memory bandwidth.

One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving

cs.DC · 2026-05-06 · unverdicted · novelty 6.0

HELM adaptively partitions HBM between EMB and KV caches via a three-layer PPO controller and EMB-KV-aware scheduling, reducing P99 latency by 24-38% while achieving 93.5-99.6% SLO satisfaction on production workloads.

Matrix-Free 3D SIMP Topology Optimization with Fused Gather-GEMM-Scatter Kernels

cs.CE · 2026-04-20 · unverdicted · novelty 6.0 · 2 refs

A fused gather-GEMM-scatter CUDA kernel achieves 4.6-7.3x end-to-end speedup and 3.2-4.9x lower energy for matrix-free 3D SIMP topology optimization on RTX 4090 compared to three-stage baselines.

Scalable and Adaptive Parallel Training of Graph Transformer on Large Graphs

cs.DC · 2026-04-17 · unverdicted · novelty 6.0

A new distributed framework for graph transformer training auto-selects parallel strategies and optimizes sparse operations to deliver up to 6x speedup on 8 GPUs and 78% memory reduction.

Preserving Clusters in Error-Bounded Lossy Compression of Particle Data

cs.LG · 2026-04-20 · unverdicted · novelty 5.0

A clustering-aware correction algorithm using spatial partitioning and projected gradient descent preserves single-linkage clusters in lossy-compressed particle data while keeping competitive compression ratios.

Measurement of Generative AI Workload Power Profiles for Whole-Facility Data Center Infrastructure Planning

eess.SY · 2026-04-08 · unverdicted · novelty 5.0

High-resolution power profiles for AI workloads on H100 GPUs are measured and scaled to whole-facility energy demand using a bottom-up model, with the dataset made public.

Making Room for AI: Multi-GPU Molecular Dynamics with Deep Potentials in GROMACS

cs.DC · 2026-04-08 · unverdicted · novelty 5.0

GROMACS now runs multi-GPU DeePMD inference for molecular dynamics, reaching 40-66% strong scaling efficiency up to 32 devices on a 15k-atom protein system with over 90% time in inference.

citing papers explorer

Showing 15 of 15 citing papers.

NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding cs.DC · 2026-05-12 · unverdicted · none · ref 7
NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.
Unfolding an Atomistic World: Atomistic Simulation of Reactor Pressure Vessel Steel Across Year-and-Meter Scales cs.DC · 2026-04-27 · unverdicted · none · ref 47 · 2 links
AtomWorld enables the first direct atomistic simulation of RPV steel at year-and-meter scales, handling ten-quintillion-atom systems and simulating one service year in 1.71 days with 92-97% scaling efficiency on leadership supercomputers.
Mosaic: Cross-Modal Clustering for Efficient Video Understanding cs.PF · 2026-04-11 · unverdicted · none · ref 40 · 5 links
Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.
Multi-Agent Orchestration for High-Throughput Materials Screening on a Leadership-Class System cs.AI · 2026-04-09 · unverdicted · none · ref 20
A planner-executor multi-agent system using gpt-oss-120b and Parsl orchestrates scalable high-throughput MOF screening on the Aurora supercomputer with low overhead.
NestPipe: Large-Scale Recommendation Training on 1,500+ Accelerators via Nested Pipelining cs.DC · 2026-04-08 · unverdicted · none · ref 18 · 2 links
NestPipe achieves up to 3.06x speedup and 94.07% scaling efficiency on 1,536 workers via dual-buffer inter-batch and frozen-window intra-batch pipelining that overlaps communication with computation.
Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods cs.DC · 2026-04-02 · unverdicted · none · ref 31
Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach cs.LG · 2025-02-07 · unverdicted · none · ref 142
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
Fast MoE Inference via Predictive Prefetching and Expert Replication cs.LG · 2026-05-12 · conditional · none · ref 23
Dynamic replication of predicted overloaded experts in MoE models achieves near-100% GPU utilization and up to 3x faster inference while retaining 90-95% of baseline performance.
Stencil Computations on Cerebras Wafer-Scale Engine cs.DC · 2026-05-08 · unverdicted · none · ref 18
CStencil on the WSE-3 achieves up to 342x speedup for 2D stencils versus an adapted single-precision GPU solver and saturates both compute and on-chip memory bandwidth.
One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving cs.DC · 2026-05-06 · unverdicted · none · ref 12
HELM adaptively partitions HBM between EMB and KV caches via a three-layer PPO controller and EMB-KV-aware scheduling, reducing P99 latency by 24-38% while achieving 93.5-99.6% SLO satisfaction on production workloads.
Matrix-Free 3D SIMP Topology Optimization with Fused Gather-GEMM-Scatter Kernels cs.CE · 2026-04-20 · unverdicted · none · ref 46 · 2 links
A fused gather-GEMM-scatter CUDA kernel achieves 4.6-7.3x end-to-end speedup and 3.2-4.9x lower energy for matrix-free 3D SIMP topology optimization on RTX 4090 compared to three-stage baselines.
Scalable and Adaptive Parallel Training of Graph Transformer on Large Graphs cs.DC · 2026-04-17 · unverdicted · none · ref 30
A new distributed framework for graph transformer training auto-selects parallel strategies and optimizes sparse operations to deliver up to 6x speedup on 8 GPUs and 78% memory reduction.
Preserving Clusters in Error-Bounded Lossy Compression of Particle Data cs.LG · 2026-04-20 · unverdicted · none · ref 27
A clustering-aware correction algorithm using spatial partitioning and projected gradient descent preserves single-linkage clusters in lossy-compressed particle data while keeping competitive compression ratios.
Measurement of Generative AI Workload Power Profiles for Whole-Facility Data Center Infrastructure Planning eess.SY · 2026-04-08 · unverdicted · none · ref 30
High-resolution power profiles for AI workloads on H100 GPUs are measured and scaled to whole-facility energy demand using a bottom-up model, with the dataset made public.
Making Room for AI: Multi-GPU Molecular Dynamics with Deep Potentials in GROMACS cs.DC · 2026-04-08 · unverdicted · none · ref 45
GROMACS now runs multi-GPU DeePMD inference for molecular dynamics, reaching 40-66% strong scaling efficiency up to 32 devices on a 15k-atom protein system with over 90% time in inference.

InProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis(Atlanta, GA, USA)(SC ’24)

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer