DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
4 Pith papers cite this work.
Citing papers
- Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
  Visual token pruning in MLLMs fails on complex reasoning due to a Relevant Visual Information Shift during decoding; the training-free DSTP framework mitigates this shift across models.
- Efficient Memory Management for Large Language Model Serving with PagedAttention
  PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems (a toy sketch of the paging idea follows this list).
- ShardTensor: Domain Parallelism for Scientific Machine Learning
  ShardTensor is a domain-parallelism system for SciML that scales to extreme-resolution spatial data by sharding a single sample's spatial domain across devices, removing the usual minimum of one full sample per device (a toy decomposition sketch follows this list).
- Predict-then-Diffuse: Adaptive Response Length for Compute-Budgeted Inference in Diffusion LLMs
  Predict-then-Diffuse uses an auxiliary model to predict each response's length for diffusion LLMs, adding a safety buffer to the estimate, which reduces wasted FLOPs while preserving output quality (a toy budgeting sketch follows this list).
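On the PagedAttention entry: a minimal sketch of the block-table bookkeeping behind the near-zero-waste claim, under stated assumptions. Names like KVBlockTable and BLOCK_SIZE are invented for illustration, not vLLM's actual API; the point is that fixed-size blocks are allocated on demand, so at most one block per sequence is ever partially full.

```python
# Minimal sketch of paged KV-cache bookkeeping. Illustrative only:
# names like KVBlockTable are invented here, not vLLM's actual API.
BLOCK_SIZE = 16  # tokens per physical block

class KVBlockTable:
    def __init__(self, num_blocks=1024):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}                     # seq_id -> list of block ids

    def append_token(self, seq_id, pos):
        """Map token position `pos` of a sequence to (block, offset)."""
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:            # current block is full: allocate
            table.append(self.free.pop())
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def release(self, seq_id):
        """Recycle a finished sequence's blocks; only the last block
        of each sequence was ever partially full, so waste stays tiny."""
        self.free.extend(self.tables.pop(seq_id, []))
```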
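On the ShardTensor entry: a toy NumPy illustration of the general domain-parallelism idea, assuming a row-slab decomposition with halo padding (a common pattern; ShardTensor's real API and decomposition may differ). One high-resolution sample is split into spatial slabs, one per device, each padded with neighbor rows so local stencil operations remain valid at slab edges.

```python
import numpy as np

def shard_with_halo(field, n_devices, halo=1):
    """Split one H x W sample into row slabs, one per device, each
    padded with `halo` neighbor rows so local stencils stay valid
    at slab boundaries (toy sketch, not ShardTensor's actual API)."""
    h = field.shape[0]
    bounds = np.linspace(0, h, n_devices + 1, dtype=int)
    return [field[max(lo - halo, 0):min(hi + halo, h)]
            for lo, hi in zip(bounds[:-1], bounds[1:])]

# One 4096x4096 sample split across 4 "devices": each device holds a
# quarter of a sample, i.e. an effective per-device batch size below 1.
sample = np.random.rand(4096, 4096).astype(np.float32)
print([s.shape for s in shard_with_halo(sample, n_devices=4)])
```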
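On the Predict-then-Diffuse entry: a minimal sketch of the predict-then-generate control flow, assuming hypothetical callables length_predictor and diffusion_decode (the paper's actual interfaces are not shown here). The auxiliary model's length estimate, inflated by a safety buffer, caps how many token positions the diffusion pass denoises.

```python
# Toy sketch of the predict-then-generate control flow. Names such as
# length_predictor and diffusion_decode are hypothetical stand-ins.
def budgeted_generate(prompt, length_predictor, diffusion_decode,
                      buffer_frac=0.2, max_len=2048):
    predicted = length_predictor(prompt)  # auxiliary model's length estimate
    # Safety buffer: over-allocate a little so truncation stays rare.
    budget = min(int(predicted * (1 + buffer_frac)), max_len)
    # Denoise only `budget` positions instead of a fixed max_len,
    # which is where the FLOP savings come from.
    return diffusion_decode(prompt, num_tokens=budget)

# Stand-in callables just to show the flow end to end:
print(budgeted_generate("Q: ...", lambda p: 100,
                        lambda p, num_tokens: f"<{num_tokens} slots>"))
```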