See what you are told: Visual attention sink in large multimodal models

Kang, S · 2025 · arXiv 2503.03321

16 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

method: 2 · background: 1

citation-polarity summary

use method 2 background 1

representative citing papers

VisReflect: Latent Visual Reflection for Fine-Grained Perception in Long Visual Context

cs.CV · 2026-06-29 · unverdicted · novelty 6.0

VisReflect generates continuous latent visual reflections to emphasize relevant visual features and guide attention in LVLMs, yielding 4.1% gains on image benchmarks and 1.8% on video benchmarks with 44% less inference time than zooming methods.

Inference Time Optimization with Confidence Dynamics

cs.CL · 2026-05-24 · unverdicted · novelty 6.0

Correct reasoning traces exhibit positive confidence gain while incorrect traces show declining confidence, enabling CDG-based voting that boosts performance on AIME, HMMT and BRUMO benchmarks across multiple LLM architectures.

MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

MLLMs know event timing during prefill via sparse Temporal Grounding Heads but lose it in autoregressive decoding; restricting visual context to the high-attention interval at inference time improves VTG performance on three benchmarks.

Large Vision-Language Models Get Lost in Attention

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.

Latent Denoising Improves Visual Alignment in Large Multimodal Models

cs.CV · 2026-04-23 · unverdicted · novelty 6.0

A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.

The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models

cs.CL · 2026-04-15 · unverdicted · novelty 6.0

Centroid erasure shows language representations overshadow vision in multimodal models, and text-centroid contrastive decoding recovers substantial accuracy on visual reasoning tasks.

Counting to Four is still a Chore for VLMs

cs.CV · 2026-04-11 · unverdicted · novelty 6.0

VLMs fail at counting because visual evidence degrades in later language layers, and a lightweight Modality Attention Share intervention can encourage better use of image information during answer generation.

Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models

cs.CV · 2026-03-15 · unverdicted · novelty 6.0

Attention dispersion during extended reasoning impairs MLLM perception on images, and a training-free VRGA framework mitigates it by selecting and reweighting visual attention heads using an entropy-focus criterion.

EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models

cs.CV · 2026-02-19 · unverdicted · novelty 6.0

EAGLE achieves up to 94.4% anomaly detection accuracy on MVTec-AD and 88.1% on VisA by guiding frozen MLLMs with expert-derived thresholds and confidence-aware attention without parameter updates.

Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions

cs.CV · 2025-11-21 · conditional · novelty 6.0

VLMs show systematic drops in counting accuracy as visual and linguistic complexity rise, with modest gains from targeted attention reweighting in the decoder.

Addressing Exacerbated Attention Sink for Source-Free Cross-Domain Few-Shot Learning

cs.CV · 2026-05-25 · unverdicted · novelty 5.0

Proposes dynamic token re-weighting during target-domain fine-tuning to mitigate exacerbated attention sink in source-free CDFSL, achieving SOTA on four benchmarks.

Combating Visual Neglect and Semantic Drift in Large Multimodal Models for Enhanced Cross-Modal Retrieval

cs.CV · 2026-04-28 · unverdicted · novelty 5.0

SSA-ME uses saliency-aware modeling to reduce visual neglect and semantic drift, achieving SOTA results on the MMEB benchmark for multimodal retrieval.

RAVE: Re-Allocating Visual Attention in Large Multimodal Models

cs.CV · 2026-05-18

Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs

cs.LG · 2026-04-22

MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

cs.CV · 2025-11-18

HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling

cs.CV · 2025-09-28

citing papers explorer


VisReflect: Latent Visual Reflection for Fine-Grained Perception in Long Visual Context

cs.CV · 2026-06-29 · unverdicted · none · ref 14

Inference Time Optimization with Confidence Dynamics

cs.CL · 2026-05-24 · unverdicted · none · ref 7

MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues

cs.CV · 2026-05-21 · unverdicted · none · ref 12

Large Vision-Language Models Get Lost in Attention

cs.AI · 2026-05-07 · unverdicted · none · ref 27

Latent Denoising Improves Visual Alignment in Large Multimodal Models

cs.CV · 2026-04-23 · unverdicted · none · ref 40

The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models

cs.CL · 2026-04-15 · unverdicted · none · ref 8

Counting to Four is still a Chore for VLMs

cs.CV · 2026-04-11 · unverdicted · none · ref 17

Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models

cs.CV · 2026-03-15 · unverdicted · none · ref 8

EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models

cs.CV · 2026-02-19 · unverdicted · none · ref 18

Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions

cs.CV · 2025-11-21 · conditional · none · ref 12

Addressing Exacerbated Attention Sink for Source-Free Cross-Domain Few-Shot Learning

cs.CV · 2026-05-25 · unverdicted · none · ref 15

Combating Visual Neglect and Semantic Drift in Large Multimodal Models for Enhanced Cross-Modal Retrieval

cs.CV · 2026-04-28 · unverdicted · none · ref 17

RAVE: Re-Allocating Visual Attention in Large Multimodal Models

cs.CV · 2026-05-18 · unreviewed · ref 3

Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs

cs.LG · 2026-04-22 · unreviewed · ref 17

MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

cs.CV · 2025-11-18 · unreviewed · ref 27

HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling

cs.CV · 2025-09-28 · unreviewed · ref 4
