DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale

URL: https://arxiv · 2022 · arXiv 2207.00032

12 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3 baseline 1

citation-polarity summary

background 3 baseline 1

representative citing papers

Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

Visual token pruning in MLLMs fails on complex reasoning because of Relevant Visual Information Shift during decoding; the training-free DSTP framework fixes this across models.

Efficient Memory Management for Large Language Model Serving with PagedAttention

cs.LG · 2023-09-12 · conditional · novelty 7.0

PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.

AoiZora: Topology-Aware Auto-Parallel Optimization for Inference of Diffusion Transformers

cs.DC · 2026-06-16 · unverdicted · novelty 6.0

AoiZora adds topology-aware physical placement planning to auto-parallel compilation for diffusion transformer inference, cutting one-step denoising latency by up to 1.42x on TPU v5e sub-slices.

A Paired Testing Protocol for Batch-Conditioned Refusal Robustness in LLM Serving

cs.LG · 2026-05-26 · unverdicted · novelty 6.0

The paper introduces a paired testing protocol for batch-conditioned refusal robustness in LLM serving and reports low rates of genuine safety-label flips after adjudication, with a batch-invariant kernel ablation eliminating observed flips.

A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability

cs.DC · 2026-05-18 · unverdicted · novelty 6.0

RRFP introduces a readiness-driven runtime for pipeline parallelism that uses schedules as hints and ready-set arbitration to improve utilization under runtime variability, reporting up to 2.77x speedup on multimodal workloads.

ShardTensor: Domain Parallelism for Scientific Machine Learning

cs.DC · 2026-05-11 · unverdicted · novelty 6.0

ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.

SweetSpot: An Analytical Model for Predicting Energy Efficiency of LLM Inference

cs.AI · 2026-02-05 · unverdicted · novelty 6.0

SweetSpot is an analytical model derived from Transformer computational and memory complexity that identifies energy minima at short-to-moderate inputs and medium outputs, achieving 1.79% MAPE on H100 GPU measurements across multiple LLMs.

H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

cs.LG · 2023-06-24 · unverdicted · novelty 6.0

H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.

GSRQ: Gain-Shape Residual Quantization for Sub-1-bit KV Cache

cs.LG · 2026-07-01 · unverdicted · novelty 5.0

GSRQ applies a gain-shape variant of K-means inside residual quantization to improve directional fidelity, raising LongBench accuracy from 11.34 to 33.54 at 1-bit on LLaMA-3-8B.

Predict-then-Diffuse: Adaptive Response Length for Compute-Budgeted Inference in Diffusion LLMs

cs.LG · 2026-05-05 · unverdicted · novelty 5.0 · 2 refs

Predict-then-Diffuse predicts response length for diffusion LLMs before inference, cutting FLOPs with a data-driven safety buffer while preserving output quality.

Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference

cs.AR · 2025-09-11 · unverdicted · novelty 5.0

PLENA introduces a co-designed system with three optimization pathways for long-context agentic LLM inference, claiming up to 2.23x throughput over A100 and 4.04x energy efficiency.

Optimizing Teacher-Student Partitioning for Scalable Knowledge Distillation on HPC Systems

cs.DC · 2026-06-26 · unverdicted · novelty 3.0

The paper introduces an HPC-aware teacher-student partitioning strategy for knowledge distillation that combines vertical and horizontal splits and reports up to 67% higher throughput than the symmetric TRL baseline.

citing papers explorer

Showing 5 of 5 citing papers after filters.

Efficient Memory Management for Large Language Model Serving with PagedAttention

cs.LG · 2023-09-12 · conditional · none · ref 1

PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.

A Paired Testing Protocol for Batch-Conditioned Refusal Robustness in LLM Serving

cs.LG · 2026-05-26 · unverdicted · none · ref 3

The paper introduces a paired testing protocol for batch-conditioned refusal robustness in LLM serving and reports low rates of genuine safety-label flips after adjudication, with a batch-invariant kernel ablation eliminating observed flips.

H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

cs.LG · 2023-06-24 · unverdicted · none · ref 17

H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.

GSRQ: Gain-Shape Residual Quantization for Sub-1-bit KV Cache

cs.LG · 2026-07-01 · unverdicted · none · ref 9

GSRQ applies a gain-shape variant of K-means inside residual quantization to improve directional fidelity, raising LongBench accuracy from 11.34 to 33.54 at 1-bit on LLaMA-3-8B.

Predict-then-Diffuse: Adaptive Response Length for Compute-Budgeted Inference in Diffusion LLMs

cs.LG · 2026-05-05 · unverdicted · none · ref 14 · 2 links

Predict-then-Diffuse predicts response length for diffusion LLMs before inference, cutting FLOPs with a data-driven safety buffer while preserving output quality.
