DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
4 Pith papers cite this work.
Citing papers
- Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
  Visual token pruning in MLLMs fails on complex reasoning due to a Relevant Visual Information Shift during decoding; the training-free DSTP framework mitigates this shift across models.
- Efficient Memory Management for Large Language Model Serving with PagedAttention
  PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems (a toy sketch of the paging idea follows this list).
- ShardTensor: Domain Parallelism for Scientific Machine Learning
  ShardTensor is a domain-parallelism system for SciML that scales to extreme-resolution spatial data by sharding a single sample's spatial domain across devices, removing the usual minimum of one full sample per device (a toy decomposition sketch follows this list).
- Predict-then-Diffuse: Adaptive Response Length for Compute-Budgeted Inference in Diffusion LLMs
  Predict-then-Diffuse uses an auxiliary model to predict each response's length for diffusion LLMs, adding a safety buffer to the estimate, which reduces wasted FLOPs while preserving output quality (a toy budgeting sketch follows this list).
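On the PagedAttention entry: a minimal sketch of the block-table bookkeeping behind the near-zero-waste claim, under stated assumptions. Names like KVBlockTable and BLOCK_SIZE are invented for illustration, not vLLM's actual API; the point is that fixed-size blocks are allocated on demand, so at most one block per sequence is ever partially full.

```python
# Minimal sketch of paged KV-cache bookkeeping. Illustrative only:
# names like KVBlockTable are invented here, not vLLM's actual API.
BLOCK_SIZE = 16  # tokens per physical block

class KVBlockTable:
    def __init__(self, num_blocks=1024):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}                     # seq_id -> list of block ids

    def append_token(self, seq_id, pos):
        """Map token position `pos` of a sequence to (block, offset)."""
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:            # current block is full: allocate
            table.append(self.free.pop())
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def release(self, seq_id):
        """Recycle a finished sequence's blocks; only the last block
        of each sequence was ever partially full, so waste stays tiny."""
        self.free.extend(self.tables.pop(seq_id, []))
```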
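On the ShardTensor entry: a toy NumPy illustration of the general domain-parallelism idea, assuming a row-slab decomposition with halo padding (a common pattern; ShardTensor's real API and decomposition may differ). One high-resolution sample is split into spatial slabs, one per device, each padded with neighbor rows so local stencil operations remain valid at slab edges.

```python
import numpy as np

def shard_with_halo(field, n_devices, halo=1):
    """Split one H x W sample into row slabs, one per device, each
    padded with `halo` neighbor rows so local stencils stay valid
    at slab boundaries (toy sketch, not ShardTensor's actual API)."""
    h = field.shape[0]
    bounds = np.linspace(0, h, n_devices + 1, dtype=int)
    return [field[max(lo - halo, 0):min(hi + halo, h)]
            for lo, hi in zip(bounds[:-1], bounds[1:])]

# One 4096x4096 sample split across 4 "devices": each device holds a
# quarter of a sample, i.e. an effective per-device batch size below 1.
sample = np.random.rand(4096, 4096).astype(np.float32)
print([s.shape for s in shard_with_halo(sample, n_devices=4)])
```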
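On the Predict-then-Diffuse entry: a minimal sketch of the predict-then-generate control flow, assuming hypothetical callables length_predictor and diffusion_decode (the paper's actual interfaces are not shown here). The auxiliary model's length estimate, inflated by a safety buffer, caps how many token positions the diffusion pass denoises.

```python
# Toy sketch of the predict-then-generate control flow. Names such as
# length_predictor and diffusion_decode are hypothetical stand-ins.
def budgeted_generate(prompt, length_predictor, diffusion_decode,
                      buffer_frac=0.2, max_len=2048):
    predicted = length_predictor(prompt)  # auxiliary model's length estimate
    # Safety buffer: over-allocate a little so truncation stays rare.
    budget = min(int(predicted * (1 + buffer_frac)), max_len)
    # Denoise only `budget` positions instead of a fixed max_len,
    # which is where the FLOP savings come from.
    return diffusion_decode(prompt, num_tokens=budget)

# Stand-in callables just to show the flow end to end:
print(budgeted_generate("Q: ...", lambda p: 100,
                        lambda p, num_tokens: f"<{num_tokens} slots>"))
```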