See what you are told: Visual attention sink in large multimodal models
24 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models
Reroute turns irreversible visual-token pruning into recoverable routing that reuses existing attention scores, improving grounding performance under aggressive reduction on LLaVA-1.5 and Qwen while preserving TFLOPs and KV-cache budgets.
- From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs
AVLLMs route audio-visual information sequentially in video tasks and via parallel streams for interleaved items, allowing early token discard with little performance loss across models and scales.
- Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning
Rea2Seg turns image segmentation into candidate mask discovery from MLLM attention followed by MLLM-based comparative scoring and selection, plus a new multi-dimensional reasoning benchmark ReasonSeg-SGDR.
- Mechanistic Insights into Functional Sparsity in Multimodal LLMs via CoRe Heads
Multimodal LLMs exhibit functional sparsity where a small number of CoRe attention heads handle cross-modal retrieval, with ablation of the top 5% degrading performance while others have little effect.
- When Graph Tokens Sink: A Mechanistic Analysis of Graph Language Models
Mechanistic analysis of GLMs shows graph sink tokens have high activation but low importance for predictions, indicating decoupling between saliency and graph-semantic utility.
- ADAPT: Attention Dynamics Alignment with Preference Tuning for Faithful MLLMs
ADAPT reduces MLLM hallucinations by 40-60% by aligning cross-attention dynamics via visual anchors, supervised inference, and preference tuning while preserving general capabilities.
- VisReflect: Latent Visual Reflection for Fine-Grained Perception in Long Visual Context
VisReflect generates continuous latent visual reflections to emphasize relevant visual features and guide attention in LVLMs, yielding 4.1% gains on image benchmarks and 1.8% on video benchmarks with 44% less inference time than zooming methods.
- Inference Time Optimization with Confidence Dynamics
Correct reasoning traces exhibit positive confidence gain while incorrect traces show declining confidence, enabling CDG-based voting that boosts performance on AIME, HMMT and BRUMO benchmarks across multiple LLM architectures.
- MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues
MLLMs know event timing during prefill via sparse Temporal Grounding Heads but lose it in autoregressive decoding; restricting visual context to the high-attention interval at inference time improves VTG performance on three benchmarks.
- Large Vision-Language Models Get Lost in Attention
In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.
- Latent Denoising Improves Visual Alignment in Large Multimodal Models
A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.
- The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models
Centroid erasure shows language representations overshadow vision in multimodal models, and text-centroid contrastive decoding recovers substantial accuracy on visual reasoning tasks.
- Counting to Four is still a Chore for VLMs
VLMs fail at counting because visual evidence degrades in later language layers, and a lightweight Modality Attention Share intervention can encourage better use of image information during answer generation.
- Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models
Attention dispersion during extended reasoning impairs MLLM perception on images, and a training-free VRGA framework mitigates it by selecting and reweighting visual attention heads using an entropy-focus criterion.
- EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models
EAGLE achieves up to 94.4% anomaly detection accuracy on MVTec-AD and 88.1% on VisA by guiding frozen MLLMs with expert-derived thresholds and confidence-aware attention without parameter updates.
- Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions
VLMs show systematic drops in counting accuracy as visual and linguistic complexity rise, with modest gains from targeted attention reweighting in the decoder.
- Last But Not Least: Boundary Attention CalibratiON for Multimodal KV Cache Compression
BACON calibrates observation-window attention using last-query evidence and coherence filters to raise average multimodal KV compression performance by 7.5% (up to 30.9%) under aggressive budgets.
- Addressing Exacerbated Attention Sink for Source-Free Cross-Domain Few-Shot Learning
Proposes dynamic token re-weighting during target-domain fine-tuning to mitigate exacerbated attention sink in source-free CDFSL, achieving SOTA on four benchmarks.
- RAVE: Re-Allocating Visual Attention in Large Multimodal Models
RAVE is a lightweight pair-gating addition to self-attention that improves visual token allocation in LMMs and delivers an average 3-point gain on multimodal benchmarks, largest on perception-heavy tasks.
- Combating Visual Neglect and Semantic Drift in Large Multimodal Models for Enhanced Cross-Modal Retrieval
SSA-ME uses saliency-aware modeling to reduce visual neglect and semantic drift, achieving SOTA results on the MMEB benchmark for multimodal retrieval.
- Information-Regularized Attention for Visual-Centric Reasoning
IRA is a stochastic attention mechanism that regulates visual information injection in VLMs to yield smoother embedding trajectories and reduced attention sinks.
- Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs
- MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
- HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling