See what you are told: Visual attention sink in large multimodal models

Seil Kang, Jinyeong Kim, Junhyeok Kim, Seong Jae Hwang · 2025 · arXiv 2503.03321

24 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

method: 2 · background: 1

citation-polarity summary

use method: 2 · background: 1

representative citing papers

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

cs.CV · 2026-06-10 · conditional · novelty 7.0

Reroute turns irreversible visual-token pruning into recoverable routing that reuses existing attention scores, improving grounding performance under aggressive reduction on LLaVA-1.5 and Qwen while preserving TFLOPs and KV-cache budgets.

From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs

cs.AI · 2026-06-08 · unverdicted · novelty 7.0

AVLLMs route audio-visual information sequentially in video tasks and via parallel streams for interleaved items, allowing early token discard with little performance loss across models and scales.

Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning

cs.CV · 2026-06-08 · unverdicted · novelty 7.0

Rea2Seg turns image segmentation into candidate mask discovery from MLLM attention followed by MLLM-based comparative scoring and selection, plus a new multi-dimensional reasoning benchmark ReasonSeg-SGDR.

Mechanistic Insights into Functional Sparsity in Multimodal LLMs via CoRe Heads

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

Multimodal LLMs exhibit functional sparsity where a small number of CoRe attention heads handle cross-modal retrieval, with ablation of the top 5% degrading performance while others have little effect.

When Graph Tokens Sink: A Mechanistic Analysis of Graph Language Models

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

Mechanistic analysis of GLMs shows graph sink tokens have high activation but low importance for predictions, indicating decoupling between saliency and graph-semantic utility.

ADAPT: Attention Dynamics Alignment with Preference Tuning for Faithful MLLMs

cs.CV · 2026-06-30 · unverdicted · novelty 6.0

ADAPT reduces MLLM hallucinations 40-60% by aligning cross-attention dynamics via visual anchors, supervised inference, and preference tuning while preserving general capabilities.

VisReflect: Latent Visual Reflection for Fine-Grained Perception in Long Visual Context

cs.CV · 2026-06-29 · unverdicted · novelty 6.0

VisReflect generates continuous latent visual reflections to emphasize relevant visual features and guide attention in LVLMs, yielding 4.1% gains on image benchmarks and 1.8% on video benchmarks with 44% less inference time than zooming methods.

Inference Time Optimization with Confidence Dynamics

cs.CL · 2026-05-24 · unverdicted · novelty 6.0

Correct reasoning traces exhibit positive confidence gain while incorrect traces show declining confidence, enabling CDG-based voting that boosts performance on AIME, HMMT and BRUMO benchmarks across multiple LLM architectures.

MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

MLLMs know event timing during prefill via sparse Temporal Grounding Heads but lose it in autoregressive decoding; restricting visual context to the high-attention interval at inference time improves VTG performance on three benchmarks.

Large Vision-Language Models Get Lost in Attention

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.

Latent Denoising Improves Visual Alignment in Large Multimodal Models

cs.CV · 2026-04-23 · unverdicted · novelty 6.0

A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.

The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models

cs.CL · 2026-04-15 · unverdicted · novelty 6.0

Centroid erasure shows language representations overshadow vision in multimodal models, and text-centroid contrastive decoding recovers substantial accuracy on visual reasoning tasks.

Counting to Four is still a Chore for VLMs

cs.CV · 2026-04-11 · unverdicted · novelty 6.0

VLMs fail at counting because visual evidence degrades in later language layers, and a lightweight Modality Attention Share intervention can encourage better use of image information during answer generation.

Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models

cs.CV · 2026-03-15 · unverdicted · novelty 6.0

Attention dispersion during extended reasoning impairs MLLM perception on images, and a training-free VRGA framework mitigates it by selecting and reweighting visual attention heads using an entropy-focus criterion.

EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models

cs.CV · 2026-02-19 · unverdicted · novelty 6.0

EAGLE achieves up to 94.4% anomaly detection accuracy on MVTec-AD and 88.1% on VisA by guiding frozen MLLMs with expert-derived thresholds and confidence-aware attention without parameter updates.

Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions

cs.CV · 2025-11-21 · conditional · novelty 6.0

VLMs show systematic drops in counting accuracy as visual and linguistic complexity rise, with modest gains from targeted attention reweighting in the decoder.

Last But Not Least: Boundary Attention CalibratiON for Multimodal KV Cache Compression

cs.CV · 2026-06-10 · unverdicted · novelty 5.0

BACON calibrates observation-window attention using last-query evidence and coherence filters to raise average multimodal KV compression performance by 7.5% (up to 30.9%) under aggressive budgets.

Addressing Exacerbated Attention Sink for Source-Free Cross-Domain Few-Shot Learning

cs.CV · 2026-05-25 · unverdicted · novelty 5.0

Proposes dynamic token re-weighting during target-domain fine-tuning to mitigate exacerbated attention sink in source-free CDFSL, achieving SOTA on four benchmarks.

RAVE: Re-Allocating Visual Attention in Large Multimodal Models

cs.CV · 2026-05-18 · unverdicted · novelty 5.0 · 2 refs

RAVE is a lightweight pair-gating addition to self-attention that improves visual token allocation in LMMs and delivers an average 3-point gain on multimodal benchmarks, largest on perception-heavy tasks.

Combating Visual Neglect and Semantic Drift in Large Multimodal Models for Enhanced Cross-Modal Retrieval

cs.CV · 2026-04-28 · unverdicted · novelty 5.0

SSA-ME uses saliency-aware modeling to reduce visual neglect and semantic drift, achieving SOTA results on the MMEB benchmark for multimodal retrieval.

Information-Regularized Attention for Visual-Centric Reasoning

cs.CV · 2026-07-01 · unverdicted · novelty 4.0

IRA is a stochastic attention mechanism that regulates visual information injection in VLMs to yield smoother embedding trajectories and reduced attention sinks.

Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs

cs.LG · 2026-04-22

MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

cs.CV · 2025-11-18

HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling

cs.CV · 2025-09-28

citing papers explorer

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models · cs.CV · 2026-06-10 · conditional · none · ref 30
From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs · cs.AI · 2026-06-08 · unverdicted · none · ref 20
Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning · cs.CV · 2026-06-08 · unverdicted · none · ref 26
Mechanistic Insights into Functional Sparsity in Multimodal LLMs via CoRe Heads · cs.CL · 2026-06-04 · unverdicted · none · ref 7
When Graph Tokens Sink: A Mechanistic Analysis of Graph Language Models · cs.LG · 2026-06-02 · unverdicted · none · ref 14
ADAPT: Attention Dynamics Alignment with Preference Tuning for Faithful MLLMs · cs.CV · 2026-06-30 · unverdicted · none · ref 10
VisReflect: Latent Visual Reflection for Fine-Grained Perception in Long Visual Context · cs.CV · 2026-06-29 · unverdicted · none · ref 14
Inference Time Optimization with Confidence Dynamics · cs.CL · 2026-05-24 · unverdicted · none · ref 7
MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues · cs.CV · 2026-05-21 · unverdicted · none · ref 12
Large Vision-Language Models Get Lost in Attention · cs.AI · 2026-05-07 · unverdicted · none · ref 27
Latent Denoising Improves Visual Alignment in Large Multimodal Models · cs.CV · 2026-04-23 · unverdicted · none · ref 40
The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models · cs.CL · 2026-04-15 · unverdicted · none · ref 8
Counting to Four is still a Chore for VLMs · cs.CV · 2026-04-11 · unverdicted · none · ref 17
Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models · cs.CV · 2026-03-15 · unverdicted · none · ref 8
EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models · cs.CV · 2026-02-19 · unverdicted · none · ref 18
Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions · cs.CV · 2025-11-21 · conditional · none · ref 12
Last But Not Least: Boundary Attention CalibratiON for Multimodal KV Cache Compression · cs.CV · 2026-06-10 · unverdicted · none · ref 38
Addressing Exacerbated Attention Sink for Source-Free Cross-Domain Few-Shot Learning · cs.CV · 2026-05-25 · unverdicted · none · ref 15
RAVE: Re-Allocating Visual Attention in Large Multimodal Models · cs.CV · 2026-05-18 · unverdicted · none · ref 3 · 2 links
Combating Visual Neglect and Semantic Drift in Large Multimodal Models for Enhanced Cross-Modal Retrieval · cs.CV · 2026-04-28 · unverdicted · none · ref 17
Information-Regularized Attention for Visual-Centric Reasoning · cs.CV · 2026-07-01 · unverdicted · none · ref 7
Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs · cs.LG · 2026-04-22 · unreviewed · ref 17
MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs · cs.CV · 2025-11-18 · unreviewed · ref 27
HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling · cs.CV · 2025-09-28 · unreviewed · ref 4
