pith. machine review for the scientific record.

arxiv: 2605.10676 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.LG

Recognition: unknown

Not Blind but Silenced: Rebalancing Vision and Language via Adversarial Counter-Commonsense Equilibrium

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:43 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords multimodal large language models · hallucinations · vision-language imbalance · adversarial perturbation · training-free decoding · attention mechanisms · equilibrium restoration

The pith

Hallucinations in multimodal models stem from linguistic priors overpowering visual signals, which a training-free perturbation method can rebalance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that during multimodal large language model decoding, attention abnormally concentrates on irrelevant image tokens because linguistic priors dominate visual information. Adopting a decoding-as-game view, it frames this as an equilibrium imbalance rather than mere noise. The proposed Adversarial Counter-Commonsense Equilibrium (ACE) perturbs the visual context with counter-commonsense patches. Because authentic visual features stay stable under such changes while hallucinated responses fluctuate, the method can suppress the sensitive priors and reinforce the stable visual signals. This plug-and-play strategy improves trustworthiness during inference at almost no added cost.

Core claim

Hallucinations stem from an equilibrium imbalance between linguistic priors and visual information. The Adversarial Counter-Commonsense Equilibrium (ACE) is a training-free framework that perturbs visual context via counter-commonsense patches. Leveraging the fact that authentic visual features remain stable under perturbation while hallucinations fluctuate, ACE implements a dynamic game decoding strategy that precisely suppresses perturbation-sensitive priors while compensating for stable visual signals to restore balance.

What carries the argument

Adversarial Counter-Commonsense Equilibrium (ACE), a training-free framework that perturbs visual context with counter-commonsense patches and applies dynamic game decoding to suppress unstable priors.
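To make the dynamic game decoding concrete, here is a minimal numpy sketch of the contrast-and-rebalance idea as the summary describes it. The function names, the penalty and bonus terms, and the alpha and beta weights are illustrative assumptions, not the paper's equations.

    import numpy as np

    def rebalanced_logits(logits_orig, logits_cf, alpha=1.0, beta=0.5):
        # Perturbation sensitivity per token: large when the counter-commonsense
        # image shifts the token's logit, i.e. prior-driven behavior.
        sensitivity = np.abs(logits_orig - logits_cf)
        # Stability in (0, 1]: high for tokens whose evidence survives the patch.
        stability = np.exp(-sensitivity)
        # Suppress perturbation-sensitive tokens, compensate stable ones.
        return logits_orig - alpha * sensitivity + beta * stability

    def softmax(x):
        z = np.exp(x - x.max())
        return z / z.sum()

    # Toy vocabulary ["dog", "frisbee", "car"]: suppose "frisbee" is a
    # context-driven hallucination whose logit collapses under perturbation.
    logits_orig = np.array([3.0, 2.8, 0.5])
    logits_cf = np.array([2.9, 0.4, 0.6])
    print(softmax(logits_orig))                                # hallucination competitive
    print(softmax(rebalanced_logits(logits_orig, logits_cf)))  # hallucination suppressed

Under these toy numbers the stable "dog" token gains probability mass while the fluctuating "frisbee" token is pushed down, which is the qualitative behavior the core claim requires.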

If this is right

  • Enhances trustworthiness of MLLMs as a plug-and-play strategy
  • Requires negligible inference overhead
  • Suppresses perturbation-sensitive linguistic priors
  • Compensates for stable visual signals to restore vision-language balance

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The equilibrium framing could guide inference-time fixes for other types of multimodal over-reliance on priors.
  • Patch selection might be automated or adapted to specific image domains for broader use.
  • The approach suggests attention redirection can occur without retraining by exploiting response stability differences.

Load-bearing premise

Authentic visual features remain stable under perturbation while hallucinations fluctuate.

What would settle it

If counter-commonsense patches either fail to reduce hallucinations or cause authentic visual features to change substantially, the claim that the method restores balance through differential stability would be falsified.
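A direct test could proceed as in the sketch below: measure per-token visual-feature drift under a counter-commonsense patch and compare verified against hallucinated tokens. The helper names, shapes, and synthetic data are assumptions; the paper's own protocol may differ.

    import numpy as np

    def cosine_rows(a, b):
        # Row-wise cosine similarity between two (num_tokens, dim) matrices.
        num = (a * b).sum(axis=-1)
        den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
        return num / den

    def differential_stability(feats_orig, feats_cf, verified_idx, halluc_idx):
        # The premise predicts: similarity stays high for verified tokens and
        # drops for hallucinated ones. A vanishing gap would falsify it.
        sims = cosine_rows(feats_orig, feats_cf)
        return sims[verified_idx].mean(), sims[halluc_idx].mean()

    # Synthetic check: verified features drift little, hallucinated ones a lot.
    rng = np.random.default_rng(0)
    feats_orig = rng.normal(size=(6, 16))
    feats_cf = feats_orig + np.vstack([0.05 * rng.normal(size=(3, 16)),
                                       1.50 * rng.normal(size=(3, 16))])
    print(differential_stability(feats_orig, feats_cf, [0, 1, 2], [3, 4, 5]))

On real model features, two means that came out close, or a substantial shift in verified-token features, would break the differential-stability premise and with it the rebalancing claim.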

Figures

Figures reproduced from arXiv: 2605.10676 by Lingwei Dang, Peilin Zhao, Qingxin Xiao, Qingyao Wu, Yangyang Zhao.

Figure 1
Figure 1: We use a solid-color control to isolate the nature of Attention Sinks. Their persistence without visual content identifies them as structural artifacts independent of semantics, termed the VSB. Logit Lens analysis reveals a distinct contrast: the VSB remains stationary in control settings but fluctuates markedly in normal imagery. This implies visual signals are structurally suppressed rather than discarded… view at source ↗
Figure 2
Figure 2: Overview of the ACE framework. Starting with the Adaptive Adversarial Selection, ACE retrieves and pastes counter-commonsense patches (e.g., pasting a “car” into a “soccer field”) to construct the CIS. By contrasting the CIS with the OIS via cosine similarity in the deep feature space, ACE employs a soft gating mechanism to decouple environment-invariant features, forming the DVS. Finally, by implementing … view at source ↗
Figure 3
Figure 3: Performance comparison on the MME hallucination subset. We evaluate ACE against baseline decoding strategies on (a) LLaVA-1.5 and (b) InstructBLIP across four categories: Existence, Count, Position, and Color. …precision via two metrics: CHAIR_S (sentence-level) and CHAIR_I (instance-level). As shown in Table 2, ACE consistently outperforms existing decoding strategies across all mainstream model variant… view at source ↗
Figure 5
Figure 5: Impact of intervention depth and injection ratio α on LLaVA-1.5. Mid-layer rectification serves as the optimal intervention window, whereas early and late injections suffer from feature mismatch and semantic conflict, respectively, as determined via grid search. view at source ↗
Figure 6
Figure 6: Partial samples from each domain. Category Distribution and Composition. The library is meticulously curated to span 15 diverse sub-categories, grouped into three primary orthogonal domains. view at source ↗
Figure 7
Figure 7: Visualizing the breakdown of False Equilibrium via ACE streams. We visualize the attention heatmaps and generated captions for an original image (Left) and its counter-commonsense perturbed counterpart (Right). (1) Heatmaps: The high-attention sinks, typically aggregated on non-salient background regions, exhibit a drastic topological reorganization triggered by the introduction of the perturbation (the sk… view at source ↗
Figure 8
Figure 8: Qualitative comparison of hallucination mitigation strategies. The left column displays the original input I_raw and the counter-commonsense perturbed image I_cf (with overlaid objects) used for inference. The right column contrasts the generated descriptions: [Regular] Succumbing to narrative inertia, the baseline model fabricates context-consistent but non-existent objects (highlighted in red), such as “a … view at source ↗
Figure 9
Figure 9: Visualizing the Game-Theoretic Dynamics on LLaVA-NeXT. The plots track the Logit Rank evolution (a proxy for Agent Utility) of the Factual Token (Agent V) vs. Hallucinated Token (Agent L). Regular Decoding: The gray Entanglement Zone highlights the Game of Attrition, where Agent L eventually exploits narrative inertia to overpower Agent V, leading to a False Equilibrium. ACE Decoding: (1) Control Validatio… view at source ↗
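As a rough illustration of the soft-gating step in Figure 2's caption, the sketch below contrasts the original image stream (OIS) with the counterfactual image stream (CIS) via per-token cosine similarity and attenuates environment-sensitive features to form the decoupled visual stream (DVS). The sigmoid gate, its temperature tau, and the 0.5 midpoint are assumptions; the caption specifies only cosine similarity in deep feature space plus a soft gate.

    import numpy as np

    def soft_gate_dvs(ois_feats, cis_feats, tau=10.0, mid=0.5):
        # Per-token cosine similarity: high when a visual feature survives
        # the counter-commonsense patch (environment-invariant).
        num = (ois_feats * cis_feats).sum(axis=-1)
        den = (np.linalg.norm(ois_feats, axis=-1)
               * np.linalg.norm(cis_feats, axis=-1) + 1e-8)
        cos_sim = num / den
        # Soft gate in (0, 1): keeps invariant features, attenuates the rest.
        gate = 1.0 / (1.0 + np.exp(-tau * (cos_sim - mid)))
        return gate[:, None] * ois_feats  # decoupled visual stream (DVS)

    # Toy usage: one stable image token, one disrupted by the patch.
    rng = np.random.default_rng(1)
    ois = rng.normal(size=(2, 8))
    cis = np.vstack([ois[0] + 0.01 * rng.normal(size=8),  # stable token
                     rng.normal(size=8)])                 # disrupted token
    print(soft_gate_dvs(ois, cis))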
read the original abstract

During MLLM decoding, attention often abnormally concentrates on irrelevant image tokens. While existing research dismisses this as invalid noise and forcibly redirects attention to compel focusing on key image information, we argue these tokens are critical carriers of visual and narrative logic, and such coercive corrections exacerbate visual-language imbalance. Adopting a "decoding-as-game" perspective, we reveal that hallucinations stem from an equilibrium imbalance between linguistic priors and visual information. We propose Adversarial Counter-Commonsense Equilibrium (ACE), a training-free framework that perturbs visual context via counter-commonsense patches. Leveraging the fact that authentic visual features remain stable under perturbation while hallucinations fluctuate, ACE implements a dynamic game decoding strategy. This approach precisely suppresses perturbation-sensitive priors while compensating for stable visual signals to restore balance. Extensive experiments demonstrate that ACE, as a plug-and-play strategy, enhances model trustworthiness with negligible inference overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that hallucinations in multimodal large language models (MLLMs) arise during decoding from an equilibrium imbalance between linguistic priors and visual information, manifested as abnormal attention concentration on irrelevant image tokens. It proposes Adversarial Counter-Commonsense Equilibrium (ACE), a training-free plug-and-play framework that perturbs visual context with counter-commonsense patches. The method exploits the asserted differential that authentic visual features remain stable under perturbation while hallucinations fluctuate, implementing a dynamic game decoding strategy to suppress perturbation-sensitive priors and compensate for stable visual signals, thereby restoring balance. Extensive experiments are claimed to show improved trustworthiness with negligible inference overhead.

Significance. If the differential stability assumption holds and is empirically isolated, ACE would offer a practical, training-free advance for mitigating hallucinations in MLLMs by reframing decoding as a game that selectively rebalances vision and language. The absence of retraining or extra data requirements is a notable strength for deployment. However, the current presentation provides insufficient quantitative grounding (baselines, metrics, error bars) to assess whether the claimed restoration of equilibrium delivers meaningful gains over existing attention-redirection or decoding-correction approaches.

major comments (1)
  1. [ACE framework description (method section)] The core operating principle (authentic visual features remain stable under perturbation while hallucinations fluctuate) is stated as a fact to be leveraged in the ACE framework description, yet no derivation, controlled measurement (e.g., cosine similarity or attention-map deltas on verified vs. hallucinated tokens), or ablation isolating the selectivity of counter-commonsense patches is provided. This assumption is load-bearing for the rebalancing claim: if patches induce global shifts in visual embeddings or attention, the dynamic game decoding cannot selectively suppress priors while compensating stable signals.
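For concreteness, a controlled measurement of the kind this comment asks for might look like the sketch below; the array names and shapes are assumptions, not anything the paper or report provides.

    import numpy as np

    def attention_deltas(attn_orig, attn_cf):
        # attn_orig, attn_cf: (num_generated_tokens, num_image_tokens) attention
        # over image tokens before and after the counter-commonsense patch.
        # Per-token L1 delta: the load-bearing premise predicts small deltas
        # for grounded tokens and large deltas for hallucinated ones.
        return np.abs(attn_orig - attn_cf).sum(axis=-1)

Reporting these deltas separately for verified and hallucinated tokens, with error bars across images, would isolate the selectivity the comment questions.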
minor comments (2)
  1. [Abstract] The abstract asserts 'extensive experiments' demonstrating effectiveness but supplies no quantitative results, specific metrics, baselines, datasets, or error bars, which weakens the ability to evaluate the equilibrium-restoration claim from the summary alone.
  2. [Method] The term 'counter-commonsense patches' and the precise mechanics of the 'dynamic game decoding strategy' would benefit from an earlier formal definition or pseudocode to clarify how suppression and compensation are implemented.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment regarding the justification of the core operating principle in the ACE framework below.

read point-by-point responses
  1. Referee: [ACE framework description (method section)] The core operating principle (authentic visual features remain stable under perturbation while hallucinations fluctuate) is stated as a fact to be leveraged in the ACE framework description, yet no derivation, controlled measurement (e.g., cosine similarity or attention-map deltas on verified vs. hallucinated tokens), or ablation isolating the selectivity of counter-commonsense patches is provided. This assumption is load-bearing for the rebalancing claim: if patches induce global shifts in visual embeddings or attention, the dynamic game decoding cannot selectively suppress priors while compensating stable signals.

    Authors: We acknowledge that the manuscript introduces the differential stability between authentic visual features and hallucinated linguistic priors as an observed property without providing explicit derivations or isolated measurements in the method section. While the full paper includes extensive experiments showing ACE's effectiveness in reducing hallucinations, we agree that this load-bearing assumption requires stronger empirical grounding to rule out global embedding shifts. In the revision, we will add a new subsection with controlled measurements, including cosine similarity and attention-map delta analyses comparing verified versus hallucinated tokens under counter-commonsense perturbations, plus targeted ablations on patch selectivity. These will quantify the differential stability and confirm that the dynamic game decoding selectively targets perturbation-sensitive priors. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected.

full rationale

The paper frames hallucinations via a decoding-as-game perspective and an equilibrium imbalance between linguistic priors and visual signals, then introduces ACE as a training-free perturbation method that leverages the stated differential stability of authentic visual features versus fluctuating hallucinated content. This stability is presented as an empirical fact to be exploited rather than a quantity derived from or defined by the method itself. No equations reduce the framework's output to its inputs by construction, no parameters are fitted and then relabeled as predictions, and no self-citations or uniqueness theorems carry the central load. The derivation remains self-contained, with success claims resting on experimental validation rather than tautological redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on one unverified domain assumption about differential stability under perturbation; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Authentic visual features remain stable under perturbation while hallucinations fluctuate
    This premise is directly leveraged to implement the suppression of perturbation-sensitive priors.

pith-pipeline@v0.9.0 · 5463 in / 1238 out tokens · 40549 ms · 2026-05-12T04:43:08.539341+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 7 internal anchors
