LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2x faster inference, by correcting attention sinks in vision encoders and using unbiased middle-layer attention in LLMs.
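As a rough illustration of the middle-layer attention scoring this summary alludes to, the sketch below ranks visual tokens by the text-to-vision attention they receive in one decoder layer and keeps the top 5.5%. It is a generic attention-pruning sketch under assumed tensor layouts and names, not LearnPruner's actual method; in particular, the paper's sink correction and attention unbiasing are not reproduced here.

```python
# Generic sketch of middle-layer attention-based token pruning (assumed
# interface; LearnPruner's sink fix and unbiased scoring are NOT implemented).
import torch

def prune_visual_tokens(attn, vis_idx, keep_ratio=0.055):
    """attn:    (heads, seq, seq) attention weights from one middle LLM layer
       vis_idx: 1-D LongTensor of positions of the visual tokens"""
    head_avg = attn.mean(dim=0)                             # average over heads
    text_mask = torch.ones(attn.shape[-1], dtype=torch.bool)
    text_mask[vis_idx] = False                              # rows = text queries
    scores = head_avg[text_mask][:, vis_idx].mean(dim=0)    # (num_vis,) relevance
    k = max(1, int(keep_ratio * vis_idx.numel()))           # 5.5% budget
    keep = scores.topk(k).indices
    return vis_idx[keep.sort().values]                      # keep original order
```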
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
18 Pith papers cite this work. Polarity classification is still indexing.
2026: 18 representative citing papers
VisPCO uses continuous relaxation, straight-through estimators, and budget-aware Pareto-frontier learning to automatically discover optimal visual token pruning configurations that approximate grid-search results across VLMs and benchmarks.
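The straight-through estimator named here is a standard trick for learning hard keep/drop decisions; a minimal sketch of it is below. The class and function names are illustrative assumptions, not VisPCO's API.

```python
# Minimal straight-through gate: hard 0/1 decision forward, identity backward.
import torch

class STEGate(torch.autograd.Function):
    @staticmethod
    def forward(ctx, probs):
        return (probs > 0.5).float()          # hard keep/drop per token

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out                       # gradient flows straight through

def gate_tokens(tokens, keep_logits):
    """tokens: (n, d) visual tokens; keep_logits: (n,) learned per-token logits."""
    keep = STEGate.apply(torch.sigmoid(keep_logits))   # (n,) values in {0, 1}
    return tokens * keep.unsqueeze(-1)                 # zero out pruned tokens
```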
Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding; the training-free DSTP framework corrects this across models.
DualComp uses a lightweight router to split visual token compression into a semantic stream with size-adaptive clustering and a geometric stream with path-tracing recovery, enabling low-cost high-fidelity UHR remote sensing interpretation.
AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.
COAST prunes 77.8% of visual tokens in LVLMs with a 2.15x speedup while keeping 98.64% of original performance by adaptively routing semantic and spatial context via contrastive scores.
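One plausible reading of a "contrastive score" for token selection is relevance to the actual query minus relevance to a content-free query, so generically salient tokens do not dominate. The sketch below implements that generic idea only; whether COAST computes its scores this way is an assumption, and all names are illustrative.

```python
# Hedged sketch of a contrastive token score (generic idea, not COAST's code).
import torch
import torch.nn.functional as F

def contrastive_scores(vis, txt, null_txt):
    """vis: (n, d) visual tokens; txt / null_txt: (m, d) embeddings of the
    real prompt and of a content-free prompt (e.g., an empty string)."""
    v = F.normalize(vis, dim=-1)
    rel = (v @ F.normalize(txt, dim=-1).T).max(dim=-1).values       # query-specific
    base = (v @ F.normalize(null_txt, dim=-1).T).max(dim=-1).values # generic
    return rel - base   # positive = token matters for *this* query
```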
LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and VQA benchmarks.
VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.
Geo3DPruner uses geometry-aware global attention and two-stage voxel pruning to remove 90% of visual tokens from spatial videos while keeping over 90% of original performance on 3D scene benchmarks.
QUOTA jointly optimizes low-bit quantization and visual token pruning for VLMs by deriving pruning decisions from quantized operators, achieving 95.65% average performance retention with only 30% of visual tokens versus 94.3% for stage-wise baselines.
POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40th to 1/10th of the tokens, and supports streaming via a detachable KV-cache.
DeSAP uses decoupled cross-modal similarity plus visual saliency to prune visual tokens in LVLMs, retaining 11.1% tokens for 10x FLOPs reduction and 98.1% performance on LLaVA-1.5-7B.
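A sketch of fusing a cross-modal similarity score with a text-free saliency score, as this summary describes at a high level, is below. The specific scoring functions and the 50/50 fusion weight are assumptions; only the 11.1% budget comes from the summary.

```python
# Hedged sketch: rank visual tokens by (text similarity + visual saliency).
import torch
import torch.nn.functional as F

def _standardize(x):
    return (x - x.mean()) / (x.std() + 1e-6)

def rank_tokens(vis, txt, alpha=0.5, keep_ratio=0.111):
    """vis: (n, d) visual tokens; txt: (m, d) text tokens in the same space."""
    sim = F.normalize(vis, dim=-1) @ F.normalize(txt, dim=-1).T     # (n, m)
    relevance = _standardize(sim.max(dim=-1).values)                # best text match
    saliency = _standardize((vis - vis.mean(dim=0)).norm(dim=-1))   # stands out
    score = alpha * relevance + (1 - alpha) * saliency
    k = max(1, int(keep_ratio * vis.shape[0]))                      # 11.1% budget
    return score.topk(k).indices.sort().values
```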
HAWK is a training-free method that prunes over 80% of visual tokens in MLLMs while retaining 96% accuracy by using head importance weights and text-guided attention to select task-relevant tokens.
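How head importance weights can reshape text-guided attention scoring is easy to show generically: in the sketch below, each head's text-to-vision attention is weighted before aggregation. Where the head weights come from (e.g., a calibration set) is left open, and the whole interface is an assumption rather than HAWK's actual code.

```python
# Sketch: head-importance-weighted, text-guided scoring of visual tokens.
import torch

def score_tokens(attn, head_w, text_rows, vis_cols):
    """attn:      (heads, seq, seq) attention from one decoder layer
       head_w:    (heads,) importance weight per attention head (assumed given)
       text_rows: LongTensor positions of text (query) tokens
       vis_cols:  LongTensor positions of visual (key) tokens"""
    t2v = attn[:, text_rows][:, :, vis_cols]         # (heads, txt, vis)
    per_head = t2v.mean(dim=1)                       # average over text queries
    w = torch.softmax(head_w, dim=0)                 # normalize head importances
    return (w.unsqueeze(-1) * per_head).sum(dim=0)   # (vis,) per-token scores
```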
CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baselines with 0-8% F1 drop.
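The compressed-domain idea is sketchable without the paper's pipeline: if per-patch motion or residual energy can be read from the bitstream, near-static patches can be skipped before the vision transformer. The `motion_energy` input and the threshold below are assumptions, not CodecSight's interface.

```python
# Loose sketch: skip patches whose codec-derived motion energy is low.
import numpy as np

def select_active_patches(motion_energy, rel_thresh=0.1):
    """motion_energy: (H, W) per-patch energy, e.g. from codec motion vectors.
    Returns flat indices of patches worth re-encoding this frame."""
    flat = motion_energy.reshape(-1)
    return np.nonzero(flat > rel_thresh * flat.max())[0]
```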
OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.
EvoComp compresses visual tokens in MLLMs by 3x while retaining 99.3% accuracy via an evolutionary labeling strategy that searches for low-loss, semantically diverse token subsets.
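A toy version of evolutionary search over fixed-size token subsets follows. The population scheme, mutation rule, and stand-in fitness are all illustrative; a real system would score subsets by reconstruction loss or downstream accuracy rather than the placeholder used here.

```python
# Toy evolutionary search for a fixed-size subset of token indices.
import random

def evolve_subset(n_tokens, keep, fitness, gens=50, pop=20, seed=0):
    rng = random.Random(seed)

    def mutate(ind):
        child = set(ind)
        child.discard(rng.choice(sorted(child)))   # drop one token...
        while len(child) < keep:                   # ...and add fresh ones back
            child.add(rng.randrange(n_tokens))
        return frozenset(child)

    popn = [frozenset(rng.sample(range(n_tokens), keep)) for _ in range(pop)]
    for _ in range(gens):
        popn.sort(key=fitness, reverse=True)
        elite = popn[: pop // 2]                   # keep the best half
        popn = elite + [mutate(rng.choice(elite)) for _ in elite]
    return max(popn, key=fitness)

# Placeholder fitness: prefer subsets spanning the whole 576-token grid
# (keep 192 of 576 = 3x compression). A real fitness would measure model loss.
best = evolve_subset(576, keep=192, fitness=lambda s: max(s) - min(s))
```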
The paper surveys and taxonomizes inference optimization methods for large vision-language models across four categories while noting limitations and open problems.