DeepSpeed-Inference: enabling efficient inference of transformer models at unprecedented scale

Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, et al · 2022

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving

cs.LG · 2026-05-03 · unverdicted · novelty 7.0 · 2 refs

MoE-Prefill achieves 1.35-1.59x higher throughput for prefill-only MoE serving by using asynchronous expert parallelism to overlap weight AllGather with computation and prefix-aware routing with true-FLOPs tracking.

MultiPath Memory Access: Breaking Host-GPU Bandwidth Bottlenecks in LLM Services

cs.DC · 2025-12-18 · unverdicted · novelty 7.0

MMA routes host-GPU transfers over multiple available paths to deliver 4.62x higher peak bandwidth and lower latencies in LLM serving without hardware or driver changes.

DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference

cs.DC · 2026-04-28 · unverdicted · novelty 6.0

DAK enables direct GPU access to remote memory for LLM inference via TMA repurposing and a greedy offloading algorithm, achieving up to 3x gains over prefetching baselines on NVLink-C2C and 1.8x on PCIe.

citing papers explorer

Showing 3 of 3 citing papers.

MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving
cs.LG · 2026-05-03 · unverdicted · none · ref 3 · 2 links

MultiPath Memory Access: Breaking Host-GPU Bandwidth Bottlenecks in LLM Services
cs.DC · 2025-12-18 · unverdicted · none · ref 4

DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference
cs.DC · 2026-04-28 · unverdicted · none · ref 4
