LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2x faster inference by correcting the attention sink in the vision encoder and using unbiased middle-layer attention in the LLM (a minimal sketch of this style of attention-based pruning follows the representative citing papers below).
citation dossier
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
why this work matters in Pith
Pith has found this work cited in 17 reviewed papers. Its strongest current cluster is cs.CV (14 papers). The largest review-status bucket among citing papers is UNVERDICTED (15 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.
years
2026: 17
representative citing papers
VisPCO uses continuous relaxation, straight-through estimators, and budget-aware Pareto-frontier learning to automatically discover optimal visual token pruning configurations that approximate grid-search results across VLMs and benchmarks.
Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding; the training-free DSTP framework corrects for it across models.
DualComp uses a lightweight router to split visual token compression into a semantic stream with size-adaptive clustering and a geometric stream with path-tracing recovery, enabling low-cost, high-fidelity ultra-high-resolution (UHR) remote sensing interpretation.
AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.
COAST prunes 77.8% of visual tokens in LVLMs with a 2.15x speedup while keeping 98.64% of original performance by adaptively routing semantic and spatial context via contrastive scores.
LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and VQA benchmarks.
VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% token reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.
Geo3DPruner uses geometry-aware global attention and two-stage voxel pruning to remove 90% of visual tokens from spatial videos while keeping over 90% of original performance on 3D scene benchmarks.
QUOTA jointly optimizes low-bit quantization and visual token pruning for VLMs by deriving pruning decisions from quantized operators, achieving 95.65% average performance retention with only 30% of visual tokens versus 94.3% for stage-wise baselines.
POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10 of the tokens, and it supports streaming via a detachable KV-cache.
DeSAP uses decoupled cross-modal similarity plus visual saliency to prune visual tokens in LVLMs, retaining 11.1% of the tokens for a 10x FLOPs reduction while preserving 98.1% of performance on LLaVA-1.5-7B.
HAWK is a training-free method that prunes over 80% of visual tokens in MLLMs while retaining 96% accuracy by using head importance weights and text-guided attention to select task-relevant tokens.
CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baselines with 0-8% F1 drop.
OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.
EvoComp compresses visual tokens in MLLMs by 3x while retaining 99.3% accuracy via an evolutionary labeling strategy that searches for low-loss, semantically diverse token subsets.
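Several of the papers above, LearnPruner and HAWK among them, share one core move: score each visual token by the attention it receives from text tokens at some LLM layer, down-weight known attention-sink positions, and keep only the top-scoring fraction. The sketch below illustrates that shared pattern only; the function name, tensor shapes, keep ratio, and single-sink assumption are illustrative and are not taken from any of the cited implementations.

```python
from typing import Optional

import torch


def prune_visual_tokens(vision_tokens: torch.Tensor,
                        attn_weights: torch.Tensor,
                        keep_ratio: float = 0.055,
                        sink_index: Optional[int] = 0) -> torch.Tensor:
    """Keep the visual tokens that receive the most text-to-vision attention.

    vision_tokens: (batch, num_visual, dim) visual hidden states fed to the LLM.
    attn_weights:  (batch, heads, num_text, num_visual) attention from text query
                   positions to visual key positions, taken from one middle layer.
    sink_index:    position of a known attention-sink token whose inflated score
                   would otherwise dominate the ranking; None disables the fix.
    """
    # One relevance score per visual token: average over heads and text queries.
    scores = attn_weights.mean(dim=(1, 2))                     # (batch, num_visual)
    if sink_index is not None:
        scores[:, sink_index] = float("-inf")                  # ignore the sink token
    k = max(1, int(keep_ratio * vision_tokens.shape[1]))
    keep = scores.topk(k, dim=-1).indices.sort(dim=-1).values  # keep original order
    batch_idx = torch.arange(vision_tokens.shape[0]).unsqueeze(-1)
    return vision_tokens[batch_idx, keep]                      # (batch, k, dim)


if __name__ == "__main__":
    x = torch.randn(2, 576, 1024)                      # e.g. 576 ViT patch tokens
    attn = torch.rand(2, 16, 32, 576).softmax(dim=-1)
    print(prune_visual_tokens(x, attn).shape)          # torch.Size([2, 31, 1024])
```

The default keep ratio of 0.055 mirrors the 5.5% figure quoted for LearnPruner above, but any budget can be passed in.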
citing papers explorer
- LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2x faster inference by correcting the attention sink in the vision encoder and using unbiased middle-layer attention in the LLM.
- VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models
VisPCO uses continuous relaxation, straight-through estimators, and budget-aware Pareto-frontier learning to automatically discover optimal visual token pruning configurations that approximate grid-search results across VLMs and benchmarks (see the sketch at the end of this list).
- Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding; the training-free DSTP framework corrects for it across models.
- Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding
DualComp uses a lightweight router to split visual token compression into a semantic stream with size-adaptive clustering and a geometric stream with path-tracing recovery, enabling low-cost high-fidelity UHR remote sensing interpretation.
- AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.
- Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models
COAST prunes 77.8% of visual tokens in LVLMs with a 2.15x speedup while keeping 98.64% of original performance by adaptively routing semantic and spatial context via contrastive scores.
- LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and VQA benchmarks.
- VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% token reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.
- Compared to What? Baselines and Metrics for Counterfactual Prompting
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.
- Geometry-Guided 3D Visual Token Pruning for Video-Language Models
Geo3DPruner uses geometry-aware global attention and two-stage voxel pruning to remove 90% of visual tokens from spatial videos while keeping over 90% of original performance on 3D scene benchmarks.
- Towards Joint Quantization and Token Pruning of Vision-Language Models
QUOTA jointly optimizes low-bit quantization and visual token pruning for VLMs by deriving pruning decisions from quantized operators, achieving 95.65% average performance retention with only 30% of visual tokens versus 94.3% for stage-wise baselines.
- POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10 of the tokens, and it supports streaming via a detachable KV-cache.
- Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models
DeSAP uses decoupled cross-modal similarity plus visual saliency to prune visual tokens in LVLMs, retaining 11.1% of the tokens for a 10x FLOPs reduction while preserving 98.1% of performance on LLaVA-1.5-7B.
- HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models
HAWK is a training-free method that prunes over 80% of visual tokens in MLLMs while retaining 96% accuracy by using head importance weights and text-guided attention to select task-relevant tokens.
- CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference
CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baselines with 0-8% F1 drop.
- OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models
OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.
- EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling
EvoComp compresses visual tokens in MLLMs by 3x while retaining 99.3% accuracy via an evolutionary labeling strategy that searches for low-loss, semantically diverse token subsets.
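For the straight-through estimator mentioned in the VisPCO entry, the sketch below shows the generic trick in isolation: a binary keep/drop mask over visual tokens whose forward pass is hard while gradients flow through a sigmoid relaxation, plus a simple budget penalty. The module, the 0.5 threshold, and the penalty form are illustrative assumptions, not VisPCO's budget-aware Pareto-frontier objective.

```python
import torch
import torch.nn as nn


class TokenMaskSTE(nn.Module):
    """Learnable hard keep/drop mask over a fixed number of visual tokens."""

    def __init__(self, num_tokens: int, budget_ratio: float = 0.25):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_tokens))  # one logit per token
        self.budget_ratio = budget_ratio                      # fraction allowed to survive

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, num_tokens, dim)
        soft = torch.sigmoid(self.logits)        # continuous relaxation in (0, 1)
        hard = (soft > 0.5).float()              # discrete keep/drop decision
        # Straight-through estimator: the forward pass uses the hard mask, while
        # gradients flow back through the soft relaxation.
        mask = hard + soft - soft.detach()
        kept = tokens * mask.view(1, -1, 1)
        # Penalize keeping more than the allowed fraction of tokens.
        budget_penalty = torch.relu(soft.mean() - self.budget_ratio)
        return kept, budget_penalty


if __name__ == "__main__":
    x, target = torch.randn(2, 576, 64), torch.randn(2, 576, 64)
    masker = TokenMaskSTE(num_tokens=576)
    opt = torch.optim.Adam(masker.parameters(), lr=1e-2)
    kept, penalty = masker(x)
    loss = nn.functional.mse_loss(kept, target) + penalty  # toy stand-in for a task loss
    loss.backward()
    opt.step()
    print(float(loss))
```

In practice such a mask would be trained jointly with the task loss of the host VLM; the toy MSE target here only stands in for that loss.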