{"total":16,"items":[{"citing_arxiv_id":"2606.30288","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VisReflect: Latent Visual Reflection for Fine-Grained Perception in Long Visual Context","primary_cat":"cs.CV","submitted_at":"2026-06-29T13:30:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VisReflect generates continuous latent visual reflections to emphasize relevant visual features and guide attention in LVLMs, yielding 4.1% gains on image benchmarks and 1.8% on video benchmarks with 44% less inference time than zooming methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25799","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Addressing Exacerbated Attention Sink for Source-Free Cross-Domain Few-Shot Learning","primary_cat":"cs.CV","submitted_at":"2026-05-25T12:49:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Proposes dynamic token re-weighting during target-domain fine-tuning to mitigate exacerbated attention sink in source-free CDFSL, achieving SOTA on four benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25244","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Inference Time Optimization with Confidence Dynamics","primary_cat":"cs.CL","submitted_at":"2026-05-24T20:04:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Correct reasoning traces exhibit positive confidence gain while incorrect traces show declining confidence, enabling CDG-based voting that boosts performance on AIME, HMMT and BRUMO benchmarks across multiple LLM 
architectures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21954","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues","primary_cat":"cs.CV","submitted_at":"2026-05-21T03:40:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MLLMs know event timing during prefill via sparse Temporal Grounding Heads but lose it in autoregressive decoding; restricting visual context to the high-attention interval at inference time improves VTG performance on three benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18359","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RAVE: Re-Allocating Visual Attention in Large Multimodal Models","primary_cat":"cs.CV","submitted_at":"2026-05-18T13:12:50+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05668","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large Vision-Language Models Get Lost in Attention","primary_cat":"cs.AI","submitted_at":"2026-05-07T04:45:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Benchmark results under different SAP modes. 
We bold the best results and underline the runner-ups within each model. Model / Variant Affected Layers POPE RWQA 3dSRBench MMMU VMCBench HallusionBench MathVista Qwen-2.5-VL-3B / 86.13 59.35 53.46 47.78 72.31 66.97 61.5 + Vis. Attn. 87.58 61.38 53.94 48.29 72.67 68.66 61.6 + Patch Comp. 87.47 61.62 54.14 47.88 72.59 69.19 61.7 + Noise [1, 27] 87.40 60.52 53.85 48.29 72.66 69.09 61.6 Qwen-2.5-VL-7B / 86.54 65.75 55.63 51.77 74.34 69.19 63.3 + Vis. Attn. 87.62 66.14 56.60 51.18 74.77 70.98 63.1 + Patch Comp. 87.73 66.54 56.74 51.32 74.80 71.40 63.1 + Noise [1, 27] 87.51 66.54 56.56 51.76 74.76 70.35 62.9 LLaVA-1.5-7B / 74.38 47.71 47.53 34.12 48.71 41.63 21.9 + Vis. Attn. 75.79 50.20 48.65 34."},{"citing_arxiv_id":"2604.25273","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Combating Visual Neglect and Semantic Drift in Large Multimodal Models for Enhanced Cross-Modal Retrieval","primary_cat":"cs.CV","submitted_at":"2026-04-28T06:29:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SSA-ME uses saliency-aware modeling to reduce visual neglect and semantic drift, achieving SOTA results on the MMEB benchmark for multimodal retrieval.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21343","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Latent Denoising Improves Visual Alignment in Large Multimodal Models","primary_cat":"cs.CV","submitted_at":"2026-04-23T06:58:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal 
models.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"produces measurably better internal visual representations within the LLM. We extract mean-pooled visual hidden states from every layer of the 32-layer Vicuna backbone on a 5,000-image subset of ImageNet-1K validation, using the LLaVA+CLIP baseline and latent denoising models. CKA with reference models. We compute linear centered kernel alignment (CKA) [40] between each LLM layer and two frozen reference models: the same CLIP ViT-L/14 encoder used as the LMM's vision backbone, and DINOv2 ViT-L/14, a self-supervised vision model. Figure 5(a) shows both: latent denoising maintains higher CKA with CLIP at early-to-mid layers (layers 0-15, +0.02 to +0.04) and dramatically improves final-layer alignment (+0."},{"citing_arxiv_id":"2604.20937","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs","primary_cat":"cs.LG","submitted_at":"2026-04-22T13:28:53+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"distinct approaches: Hard Pruning: these top K% tokens are deliberately discarded, after which we apply VisionZip, where tokens with high attention are selected. Attention Redistribution: Since mitigation of sink tokens was explored by prior work, we attempt to adopt previous work for visual token pruning to see if it is effective. Therefore, following [17], we redistribute the attention weights from these K tokens to the remaining tokens before selection. Fig. 7 illustrates the performance across various K (5% to 20%), from which we draw the following observations: 1) Hard Pruning consistently outperforms the original VisionZip. 
This confirms our hypothesis that the most attended tokens are often semantically sparse sink tokens."},{"citing_arxiv_id":"2604.14363","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-15T19:26:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Centroid erasure shows language representations overshadow vision in multimodal models, and text-centroid contrastive decoding recovers substantial accuracy on visual reasoning tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10039","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Counting to Four is still a Chore for VLMs","primary_cat":"cs.CV","submitted_at":"2026-04-11T05:23:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLMs fail at counting because visual evidence degrades in later language layers, and a lightweight Modality Attention Share intervention can encourage better use of image information during answer generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.14184","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-03-15T02:21:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Attention dispersion during extended reasoning impairs MLLM perception on images, and a training-free VRGA framework 
mitigates it by selecting and reweighting visual attention heads using an entropy-focus criterion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.17419","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-02-19T14:50:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EAGLE achieves up to 94.4% anomaly detection accuracy on MVTec-AD and 88.1% on VisA by guiding frozen MLLMs with expert-derived thresholds and confidence-aware attention without parameter updates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.17722","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Can Vision-Language Models Count? 
A Synthetic Benchmark and Analysis of Attention-Based Interventions","primary_cat":"cs.CV","submitted_at":"2025-11-21T19:18:41+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLMs show systematic drops in counting accuracy as visual and linguistic complexity rise, with modest gains from targeted attention reweighting in the decoder.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.14159","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs","primary_cat":"cs.CV","submitted_at":"2025-11-18T05:48:08+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.00054","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling","primary_cat":"cs.CV","submitted_at":"2025-09-28T08:31:48+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}