Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
Deepspeed inference: Enabling efficient inference of transformer models at unprecedented scale
12 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.
AoiZora adds topology-aware physical placement planning to auto-parallel compilation for diffusion transformer inference, cutting one-step denoising latency by up to 1.42x on TPU v5e sub-slices.
The paper introduces a paired testing protocol for batch-conditioned refusal robustness in LLM serving and reports low rates of genuine safety-label flips after adjudication, with a batch-invariant kernel ablation eliminating observed flips.
RRFP introduces a readiness-driven runtime for pipeline parallelism that uses schedules as hints and ready-set arbitration to improve utilization under runtime variability, reporting up to 2.77x speedup on multimodal workloads.
ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.
SweetSpot is an analytical model from Transformer computational and memory complexity that identifies energy minima at short-to-moderate inputs and medium outputs, achieving 1.79% MAPE on H100 GPU measurements across multiple LLMs.
H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.
GSRQ applies a gain-shape variant of K-means inside residual quantization to improve directional fidelity, raising LongBench accuracy from 11.34 to 33.54 at 1-bit on LLaMA-3-8B.
Predict-then-Diffuse predicts response length for diffusion LLMs before inference, cutting FLOPs with a data-driven safety buffer while preserving output quality.
PLENA introduces a co-designed system with three optimization pathways for long-context agentic LLM inference, claiming up to 2.23x throughput over A100 and 4.04x energy efficiency.
The paper introduces an HPC-aware teacher-student partitioning strategy for knowledge distillation that combines vertical and horizontal splits and reports up to 67% higher throughput than the symmetric TRL baseline.
citing papers explorer
-
Efficient Memory Management for Large Language Model Serving with PagedAttention
PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.
-
A Paired Testing Protocol for Batch-Conditioned Refusal Robustness in LLM Serving
The paper introduces a paired testing protocol for batch-conditioned refusal robustness in LLM serving and reports low rates of genuine safety-label flips after adjudication, with a batch-invariant kernel ablation eliminating observed flips.
-
H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.
-
GSRQ: Gain-Shape Residual Quantization for Sub-1-bit KV Cache
GSRQ applies a gain-shape variant of K-means inside residual quantization to improve directional fidelity, raising LongBench accuracy from 11.34 to 33.54 at 1-bit on LLaMA-3-8B.
-
Predict-then-Diffuse: Adaptive Response Length for Compute-Budgeted Inference in Diffusion LLMs
Predict-then-Diffuse predicts response length for diffusion LLMs before inference, cutting FLOPs with a data-driven safety buffer while preserving output quality.