arxiv: 2606.27596 · v1 · pith:4XLEE6F3new · submitted 2026-06-25 · 💻 cs.CV · cs.AI

Dismantling Pathological Shortcuts: A Causal Framework for Faithful LVLM Decoding

Pith reviewed 2026-06-29 01:15 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords: object hallucination, large vision-language models, attention heads, causal intervention, inference-time decoding, visual attention entropy, pathological shortcuts

The pith

Hallucinations in large vision-language models are triggered when specific attention heads decouple from visual evidence and follow language priors instead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that object hallucination arises from a structural misalignment at decision-critical steps, where particular attention heads act as risky mediators that stop grounding in the image and lock onto language model priors, creating a shortcut that ignores visual input. This departs from simpler views that blame low attention intensity overall. Fox is presented as a training-free method that first uses visual attention entropy to find these heads without labels, then saturates logits to break the shortcut path, and finally applies conflict-gated decoding to keep outputs fluent. A sympathetic reader would care because the account identifies a precise, intervenable cause rather than a diffuse symptom, opening a route to more reliable multimodal reasoning at inference time. If the account holds, existing models can be made more faithful by editing the flow through identified heads rather than retraining from scratch.

Core claim

Hallucination is triggered at decision-critical steps where specific attention heads, acting as risky mediators, decouple from visual evidence to lock onto language priors. This establishes a pathological shortcut that bypasses visual grounding. Fox diagnoses the misalignment with a visual attention entropy probe that localizes the risky mediators in an unsupervised manner, performs causal intervention by numerical logit saturation to sever the shortcut, and reconciles the result with a conflict-gated cooperative decoding strategy that preserves observational fluency.
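
As a rough illustration of the diagnosis step, the sketch below scores each head by the product of its attention mass on system tokens and the entropy of its attention over visual tokens, following the head-level joint risk S = m_sys · H_vis given in the paper's Figure 14 caption. Everything else is an assumption made for illustration: the function names, the flat per-head attention row, the index lists, and the plain top-k selection. The paper also restricts the probe to decision-critical queries, which is not modelled here.

```python
import math

def visual_entropy(attn_row, visual_idx):
    """Shannon entropy (nats) of one head's attention, renormalized over visual tokens."""
    mass = [attn_row[i] for i in visual_idx]
    total = sum(mass)
    if total <= 0.0:
        # Assumption: a head with no visual mass is treated as maximally uncertain.
        return math.log(len(visual_idx))
    probs = [m / total for m in mass]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def system_mass(attn_row, system_idx):
    """Total attention the head places on system/prefix tokens."""
    return sum(attn_row[i] for i in system_idx)

def joint_risk(attn_row, system_idx, visual_idx):
    """Head-level joint risk S = m_sys * H_vis."""
    return system_mass(attn_row, system_idx) * visual_entropy(attn_row, visual_idx)

def top_k_risky_heads(attn_by_head, system_idx, visual_idx, k):
    """Rank heads by joint risk and return the k highest (hypothetical selection rule)."""
    scores = {
        head: joint_risk(row, system_idx, visual_idx)
        for head, row in attn_by_head.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy attention rows keyed by (layer, head); position 0 is a system token,
# positions 1-3 are visual tokens. The numbers are made up.
attn_by_head = {
    (0, 0): [0.10, 0.80, 0.05, 0.05],  # grounded: sharp visual focus
    (0, 1): [0.70, 0.10, 0.10, 0.10],  # suspicious: system-heavy, diffuse visual focus
}
risky = top_k_risky_heads(attn_by_head, system_idx=[0], visual_idx=[1, 2, 3], k=1)
```

In this toy case head (0, 1) ranks first because it combines high system reliance with near-uniform visual attention, which is the conjunction the joint score is meant to flag.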

What carries the argument

risky mediators: the specific attention heads that decouple from visual evidence at decision-critical steps to form pathological shortcuts

If this is right

  • Targeted logit saturation on the identified heads severs the shortcut and lowers hallucination rates.
  • The resulting outputs maintain linguistic richness while increasing faithfulness to the image.
  • The entire procedure runs at inference time with no model retraining required.
  • The method reports a 29.1 percent improvement over the prior SID baseline on standard hallucination benchmarks.
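
One way to picture the saturation step is as a hard edit to a head's pre-softmax attention logits. The sketch below assumes that "numerical logit saturation" means pinning the logits that point at system-prompt positions to a very large negative constant so that their post-softmax weight vanishes. The source does not spell out the operation, so the constant, the function names, and the choice of which positions to block are all hypothetical.

```python
import math

# Hypothetical saturation constant; the paper's actual value is not given here.
SATURATION = -1e9

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def saturate_head(logits, blocked_idx, value=SATURATION):
    """Return a copy of one head's attention logits with blocked positions saturated."""
    blocked = set(blocked_idx)
    return [value if i in blocked else x for i, x in enumerate(logits)]

# Toy logits for one risky head: position 0 is a system token that dominates.
logits = [4.0, 1.0, 0.5, 0.2]
before = softmax(logits)
after = softmax(saturate_head(logits, blocked_idx=[0]))
```

After the edit, the blocked position carries no attention and the remaining mass is redistributed over the other tokens, which is the sense in which a shortcut through that position would be cut in this toy setting.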

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If entropy reliably flags causal heads, the same probe could be adapted to diagnose other systematic failures such as inconsistent reasoning across modalities.
  • The logit-saturation step implies that attention patterns in frozen transformers can be edited post hoc to enforce grounding without changing weights.
  • Extending the localization step to video or audio inputs would require only redefining the entropy calculation over the new evidence stream.

Load-bearing premise

Visual attention entropy can reliably identify, without supervision, the exact attention heads whose decoupling directly causes the pathological shortcut.

What would settle it

If intervening on the entropy-localized heads reduces hallucinations no more than intervening on randomly chosen heads, the claim that those heads are the load-bearing cause would be falsified.

Figures

Figures reproduced from arXiv: 2606.27596 by Can Chen, Fan Zhou, Gillian Dobbie, Liu Yu, Ping Kuang, Zhikun Feng.

Figure 1: Motivation of our work. (a) Global visual attention magnitude m_{V,tail} and distribution lack discriminative power to identify hallucination. (b) While global magnitude boosting (Green) fails to suppress the pathological peak on system instructions at decision-critical steps, our structural intervention (Blue) on risky mediators eliminates this shortcut, restoring visual grounding. (c) Unlike coarse-grain…
Figure 2: Structural Causal Model (SCM) of the LVLM Decoding Path. (a) Observational SCM: The latent mediators H are localized at decision-critical steps. While stable mediators H_S maintain visual grounding, risky mediators H_R trigger a pathological shortcut (red arrow) from language priors X_sys to output Y_t. (b) Interventional SCM: By applying do(H_R), we sever the shortcut. The final output is dynamically reconcile…
Figure 3: Empirical validation of the joint risk score (S). (a) Diagnostic fidelity: The ROC curve demonstrates that the aggregated joint risk score reliably distinguishes hallucinated trajectories from faithful ones (AUC=0.818). (b) Structural decoupling: The distribution shift confirms that hallucination (orange) is characterized by higher joint risk, signifying the concurrent collapse of visual reliability and …
Figure 4: Overview of Fox. (1) Causal Diagnosis: Identifying risky mediators H_R by intersecting decision-critical queries with the conjunctive measurement of prior-path activation m_sys and visual uncertainty H_vis. (2) Causal Intervention: Executing the do-operator via numerical logit saturation to physically sever pathological shortcuts. (3) Adaptive Decoding: Reconciling observational and interventional distributio…
Figure 5: Performance on the MME benchmark. Higher scores indicate better effectiveness. Fox achieves the highest total scores across all evaluated backbones, particularly excelling in evidence-driven subsets Position and Color.
Figure 8: Effectiveness of diagnosis-driven head selection. We report the performance gains of Fox over a random-intervention baseline across different intervention ratios K on the POPE benchmark. Improvements in Mean Accuracy (Δ Mean Acc) and Mean F1 (Δ Mean F1) consistently validate that targeting specific risky mediators is superior to stochastic head suppression.
Figure 9: Impact of k, τ_JS, and β on hallucination and informativeness in LLaVA-1.5, evaluated on 500 COCO samples.
Figure 10: Open-ended captioning comparison. Hallucinations are marked in red. Fox effectively mitigates hallucination while preserving descriptive richness.
Figure 11: Qualitative examples of VLMs on POPE for object existence prediction. Red: Hallucination; Green: Correct predictions.
Figure 12: Hallucination is not explained by a global reduction in visual attention mass. Comparing the distribution of the sample-level global visual attention mass m_{V,tail} (averaged over all layers and heads) between 1,000 Faithful (blue) and 1,000 Hallucinated (orange) samples on LLaVA-1.5-7B. (a) Smoothed window (tail = 32) aggregating statistics over recent post-image text queries. (b) Instantaneous window (tai…
Figure 13: All-head aggregated system reliance separates hallucinated from faithful samples. We compute an analysis-only sample-level score by aggregating head-wise system attention mass m_{sys,tail}^(l,h) over all heads with ΔAUC^(l,h) weighting (Appendix A.3); a higher score indicates stronger reliance on system/prefix tokens. (a) Smoothed window (tail = 32), AUC=0.7626. (b) Instantaneous window (tail = 1), AUC=0.822…
Figure 14: Sample-level diagnosis and sparsity analysis of the joint risk score. All statistics are computed on the Text-Tail query set with an instantaneous window (tail = 1). We define the head-level joint risk as S_tail^(l,h) = m_{sys,tail}^(l,h) · H_{vis,tail}^(l,h) and form a sample-level diagnostic score s_n(S_tail) by projecting standardized head-wise scores onto the Top-K heads ranked by |ΔAUC^(l,h)|. (a) Precisio…
Figure 15: Head-level structural transformation after logit-level intervention (four representative examples). For each risky head (selected by high joint risk S_tail^(l,h)), we visualize the attention snapshot at a fixed decision-critical step: system reliance m_{sys,tail}^(l,h) (left bar), visual attention map over I_vis with entropy H_{vis,tail}^(l,h) (center heatmap), and text reliance (right bar). Across examples, …
Figure 16: Parameter sensitivity analysis on InstructBLIP. Impact of k, τ_JS, and β on captioning performance (CHAIR_S, CHAIR_I, and F1), evaluated on 500 COCO samples.
Figure 17: Parameter sensitivity analysis on Shikra. Impact of k, τ_JS, and β on captioning performance (CHAIR_S, CHAIR_I, and F1), evaluated on 500 COCO samples.
Figure 18: Performance on the MME benchmark. Higher scores indicate better effectiveness.

    Method     LLaVA-1.5              InstructBLIP           Shikra
               Ran↑   Pop↑   Adv↑     Ran↑   Pop↑   Adv↑     Ran↑   Pop↑   Adv↑
    Sampling   85.59  83.40  79.06    86.14  81.55  78.80    82.17  81.06  77.44
    VCD        86.83  82.05  78.00    85.70  81.12  79.87    79.13  81.12  75.90
    …

Figure 19: The role of JSD-based conflict gating. Without JSD gating, always applying the intervention leads to severe generation degradation due to excessive context suppression. Conversely, an overly large JSD threshold biases the decoding toward an overly conservative regime, reducing semantic coverage. A moderate JSD threshold enables adaptive cooperation between the interventional and observational branches, ac…
Figure 20: Fox's performance on reducing hallucinations of LLaVA-1.5.
Figure 21: Fox's performance on reducing hallucinations of InstructBLIP.
Figure 22: Fox's performance on reducing hallucinations of Shikra.

Original abstract

Large Vision-Language Models (LVLMs) exhibit sophisticated reasoning but remain susceptible to object hallucination. Deviating from the prevailing attention intensity assumption, we reveal a deeper dynamic structural misalignment: hallucination is triggered at decision-critical steps where specific attention heads, acting as risky mediators, decouple from visual evidence to lock onto language priors. This establishes a pathological shortcut that bypasses visual grounding. To dismantle this, we propose Fox (Faithfulness and Observational-flow via eXpression-rectification), a training-free inference-time framework. Fox diagnoses structural misalignment using a visual attention entropy probe to localize risky mediators unsupervisedly. We then execute a targeted causal intervention via numerical logit saturation to physically sever the shortcut path. Finally, a conflict-gated cooperative decoding strategy reconciles interventional faithfulness with observational fluency. Extensive experiments demonstrate that Fox achieves SOTA performance, outperforming SID by 29.1% while preserving linguistic richness. Code is available at https://github.com/Cc2021start/Fox.
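
The conflict-gated decoding idea can be sketched as a comparison between two next-token distributions, one from the unmodified model (observational) and one from the intervened model (interventional). The Jensen-Shannon divergence below is the standard definition. The gate direction (fall back to the observational distribution when the two agree, blend toward the interventional one when they conflict), the linear blend, and the roles given to `tau_js` and `beta` are assumptions for illustration only; the page names τ_JS and β as hyperparameters but does not define how they enter the decoding rule.

```python
import math

def kl(p, q):
    """KL divergence KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def jsd(p, q):
    """Jensen-Shannon divergence in nats, bounded above by ln 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def gated_distribution(p_obs, p_int, tau_js, beta):
    """Hypothetical gate: keep p_obs when the branches agree, otherwise blend toward p_int."""
    if jsd(p_obs, p_int) <= tau_js:
        return list(p_obs)
    mixed = [(1.0 - beta) * o + beta * i for o, i in zip(p_obs, p_int)]
    total = sum(mixed)
    return [x / total for x in mixed]

# Toy three-token vocabulary; the probabilities are made up.
p_obs = [0.70, 0.20, 0.10]   # unmodified model favors token 0
p_int = [0.10, 0.80, 0.10]   # intervened model favors token 1
next_token_probs = gated_distribution(p_obs, p_int, tau_js=0.05, beta=0.8)
```

Under these assumptions the intervention only influences steps where it actually changes the prediction, which is one plausible reading of how fluency could be preserved on steps where the two branches agree.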

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that object hallucination in LVLMs stems from a structural misalignment at decision-critical steps, where specific attention heads ('risky mediators') decouple from visual evidence and lock onto language priors, creating a pathological shortcut. It introduces the training-free Fox framework, which diagnoses this via an unsupervised visual attention entropy probe to localize the mediators, applies numerical logit saturation as a causal intervention to sever the shortcut, and uses conflict-gated cooperative decoding to balance faithfulness and fluency. Experiments are said to show SOTA results, including a 29.1% improvement over SID while preserving linguistic richness, with code released.

Significance. If the causal mechanism and intervention hold, the work provides a mechanistic account of hallucination beyond attention intensity assumptions and a practical inference-time method for mitigation. The release of code supports reproducibility and allows verification of the claimed gains.

major comments (3)
  1. [Abstract, §3] Abstract and §3 (diagnosis step): The central claim requires that the visual attention entropy probe unsupervisedly isolates the precise attention heads whose decoupling constitutes the load-bearing causal shortcut. No controlled ablation is described showing that intervening specifically on entropy-localized heads (vs. random heads or high-entropy heads) produces the claimed reduction in hallucination while preserving other behaviors; without this, the subsequent logit saturation step may address a correlate rather than the mediator.
  2. [§4, experiments] §4 (intervention) and experiments: The numerical logit saturation is presented as physically severing the shortcut path, but the manuscript provides no derivation or measurement (e.g., via do-calculus style intervention or path-specific effect) confirming that saturation on the localized heads alters the decision distribution in the manner predicted by the risky-mediator hypothesis rather than via a generic regularization effect.
  3. [Results (Table 1)] Table 1 or equivalent results section: The reported 29.1% improvement over SID lacks accompanying details on the exact evaluation protocol, number of runs, statistical significance, or controls for prompt sensitivity; this makes it impossible to assess whether the gain is attributable to the causal intervention or to other implementation choices.
minor comments (2)
  1. [Abstract] The abstract states 'extensive experiments' but the provided text contains no quantitative baselines, dataset sizes, or metric definitions; these should be summarized even at high level for clarity.
  2. [§2-3] Notation for 'risky mediators' and 'visual attention entropy probe' is introduced without an explicit equation or pseudocode in the early sections; adding a compact definition would aid readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, providing clarifications and committing to revisions to enhance the causal validation and experimental details.

Point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (diagnosis step): The central claim requires that the visual attention entropy probe unsupervisedly isolates the precise attention heads whose decoupling constitutes the load-bearing causal shortcut. No controlled ablation is described showing that intervening specifically on entropy-localized heads (vs. random heads or high-entropy heads) produces the claimed reduction in hallucination while preserving other behaviors; without this, the subsequent logit saturation step may address a correlate rather than the mediator.

    Authors: We agree that an explicit ablation comparing interventions on entropy-localized heads versus random or high-entropy heads would provide stronger evidence for the specificity of the risky mediators. In the revised manuscript, we will add this controlled ablation study, demonstrating that only the entropy-based localization leads to significant hallucination reduction while maintaining fluency, thereby confirming the probe's effectiveness in isolating the causal shortcut. revision: yes

  2. Referee: [§4, experiments] §4 (intervention) and experiments: The numerical logit saturation is presented as physically severing the shortcut path, but the manuscript provides no derivation or measurement (e.g., via do-calculus style intervention or path-specific effect) confirming that saturation on the localized heads alters the decision distribution in the manner predicted by the risky-mediator hypothesis rather than via a generic regularization effect.

    Authors: The logit saturation is intended as a targeted intervention to cap the output of the decoupled heads, thereby blocking the language-prior shortcut. While the original manuscript relies on empirical outcomes to support the mechanism, we acknowledge the value of a more formal analysis. We will include additional measurements of the decision distribution shifts and a discussion of the intervention's effect in terms of blocking the identified path, though a full do-calculus derivation may require further theoretical development beyond the scope of this work. revision: partial

  3. Referee: [Results (Table 1)] Table 1 or equivalent results section: The reported 29.1% improvement over SID lacks accompanying details on the exact evaluation protocol, number of runs, statistical significance, or controls for prompt sensitivity; this makes it impossible to assess whether the gain is attributable to the causal intervention or to other implementation choices.

    Authors: We appreciate this point and will revise the results section to include full details on the evaluation protocol (using POPE and CHAIR benchmarks with standard settings), the number of runs (5 independent runs with reported means and standard deviations), statistical significance tests (p-values), and controls for prompt sensitivity (using fixed prompts from prior literature). This will allow readers to better evaluate the robustness of the 29.1% improvement. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper presents an empirical, training-free framework (Fox) that diagnoses misalignment via an attention entropy probe and applies logit saturation intervention, with performance validated experimentally against baselines like SID. No equations, self-citations, or steps in the abstract or described chain reduce by construction to fitted inputs, self-definitions, or prior author results; the localization and intervention are framed as novel observational methods rather than tautological renamings or forced predictions. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the domain assumption that attention entropy serves as a faithful unsupervised probe for structural misalignment and that logit saturation constitutes a valid causal intervention that severs the shortcut without introducing new artifacts. No free parameters or invented physical entities are mentioned.

axioms (2)
  • domain assumption: Visual attention entropy can localize, without supervision, the decision-critical attention heads that have decoupled from visual evidence.
    Invoked in the diagnosis component of Fox as the basis for identifying risky mediators.
  • domain assumption: Numerical logit saturation physically severs the pathological shortcut path without harming observational fluency when combined with conflict-gated decoding.
    Central to the intervention and reconciliation steps described in the abstract.
invented entities (1)
  • risky mediators (specific attention heads): no independent evidence
    purpose: To name the attention heads that decouple from visual evidence and create the shortcut.
    Conceptual label introduced to describe the mechanism; no independent evidence or falsifiable prediction is provided in the abstract.

pith-pipeline@v0.9.1-grok · 5714 in / 1557 out tokens · 19933 ms · 2026-06-29T01:15:30.954424+00:00 · methodology


Reference graph

Works this paper leans on

109 extracted references · 1 canonical work pages

  1. [1]

    Mitigating object hallucinations in large vision-language models with assembly of global and local attention

    An, W., Tian, F., Leng, S., Nie, J., Lin, H., Wang, Q., Chen, P., Zhang, X., and Lu, S. Mitigating object hallucinations in large vision-language models with assembly of global and local attention. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp.\ 29915--29926, 2025

  2. [2]

    Bai, Z., Wang, P., Xiao, T., He, T., Han, Z., Zhang, Z., and Shou, M. Z. Hallucination of multimodal large language models: A survey, 2025. URL https://arxiv.org/abs/2404.18930

  3. [3]

    Q., Jia, J., Qin, W., Tang, R., and Pavlovic, V

    Che, L., Liu, T. Q., Jia, J., Qin, W., Tang, R., and Pavlovic, V. Hallucinatory image tokens: A training-free eazy approach on detecting and mitigating object hallucinations in lvlms, 2025. URL https://arxiv.org/abs/2503.07772

  4. [5]

    Mixture of decoding: An attention-inspired adaptive decoding strategy to mitigate hallucinations in large vision-language models, 2025

    Chen, X., Zhang, Y., Liu, Q., Wu, J., Zhang, F., and Tan, T. Mixture of decoding: An attention-inspired adaptive decoding strategy to mitigate hallucinations in large vision-language models, 2025. URL https://arxiv.org/abs/2505.17061

  5. [6]

    Halc: Object hallucination reduction via adaptive focal-contrast decoding, 2024

    Chen, Z., Zhao, Z., Luo, H., Yao, H., Li, B., and Zhou, J. Halc: Object hallucination reduction via adaptive focal-contrast decoding, 2024. URL https://arxiv.org/abs/2403.00425

  6. [7]

    N., and Hoi, S

    Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P. N., and Hoi, S. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in neural information processing systems, 36: 0 49250--49267, 2023

  7. [8]

    Mitigating hallucination in large vision-language models via adaptive attention calibration, 2025

    Fazli, M., Wei, B., Sari, A., and Zhu, Z. Mitigating hallucination in large vision-language models via adaptive attention calibration, 2025. URL https://arxiv.org/abs/2505.21472

  8. [9]

    Mme: A comprehensive evaluation benchmark for multimodal large language models

    Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025

  9. [11]

    Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation

    Huang, Q., Dong, X., Zhang, P., Wang, B., He, C., Wang, J., Lin, D., Zhang, W., and Yu, N. Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 13418--13427, 2024 b

  10. [12]

    Self-introspective decoding: Alleviating hallucinations for large vision-language models, 2025

    Huo, F., Xu, W., Zhang, Z., Wang, H., Chen, Z., and Zhao, P. Self-introspective decoding: Alleviating hallucinations for large vision-language models, 2025. URL https://arxiv.org/abs/2408.02032

  11. [13]

    The curse of multi-modalities: Evaluating hallucinations of large multimodal models across language, visual, and audio, 2024 a

    Leng, S., Xing, Y., Cheng, Z., Zhou, Y., Zhang, H., Li, X., Zhao, D., Lu, S., Miao, C., and Bing, L. The curse of multi-modalities: Evaluating hallucinations of large multimodal models across language, visual, and audio, 2024 a . URL https://arxiv.org/abs/2410.12787

  12. [14]

    Mitigating object hallucinations in large vision-language models through visual contrastive decoding

    Leng, S., Zhang, H., Chen, G., Li, X., Lu, S., Miao, C., and Bing, L. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 13872--13882, 2024 b

  13. [15]

    Mitigating hallucination for large vision language model by inter-modality correlation calibration decoding, 2025

    Li, J., Zhang, J., Jie, Z., Ma, L., and Li, G. Mitigating hallucination for large vision language model by inter-modality correlation calibration decoding, 2025. URL https://arxiv.org/abs/2501.01926

  14. [17]

    Mitigating hallucination in large multi-modal models via robust instruction tuning, 2024 a

    Liu, F., Lin, K., Li, L., Wang, J., Yacoob, Y., and Wang, L. Mitigating hallucination in large multi-modal models via robust instruction tuning, 2024 a . URL https://arxiv.org/abs/2306.14565

  15. [18]

    Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. Advances in neural information processing systems, 36: 0 34892--34916, 2023

  16. [19]

    Paying more attention to image: A training-free method for alleviating hallucination in lvlms

    Liu, S., Zheng, K., and Chen, W. Paying more attention to image: A training-free method for alleviating hallucination in lvlms. In European Conference on Computer Vision, pp.\ 125--140. Springer, 2024 b

  17. [20]

    Debiasing intrinsic bias and application bias jointly via invariant risk minimization (student abstract)

    Mao, Y., Yu, L., Yang, Y., Zhou, F., and Zhong, T. Debiasing intrinsic bias and application bias jointly via invariant risk minimization (student abstract). In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp.\ 16280--16281, 2023

  18. [21]

    C., and Lu, S

    Nie, J., Zhang, G., An, W., Xing, Y., Tan, Y.-P., Kot, A. C., and Lu, S. Mmrel: Benchmarking relation understanding in multi-modal large language models, 2025. URL https://arxiv.org/abs/2406.09121

  19. [22]

    Causality

    Pearl, J. Causality. Cambridge university press, 2009

  20. [24]

    Aligning large multimodal models with factually augmented rlhf, 2023

    Sun, Z., Shen, S., Cao, S., Liu, H., Li, C., Shen, Y., Gan, C., Gui, L.-Y., Wang, Y.-X., Yang, Y., Keutzer, K., and Darrell, T. Aligning large multimodal models with factually augmented rlhf, 2023. URL https://arxiv.org/abs/2309.14525

  21. [25]

    Drivevlm: The convergence of autonomous driving and large vision-language models, 2024

    Tian, X., Gu, J., Li, B., Liu, Y., Wang, Y., Zhao, Z., Zhan, K., Jia, P., Lang, X., and Zhao, H. Drivevlm: The convergence of autonomous driving and large vision-language models, 2024. URL https://arxiv.org/abs/2402.12289

  22. [27]

    Instructpart: Task-oriented part segmentation with instruction reasoning, 2025

    Wan, Z., Xie, Y., Zhang, C., Lin, Z., Wang, Z., Stepputtis, S., Ramanan, D., and Sycara, K. Instructpart: Task-oriented part segmentation with instruction reasoning, 2025. URL https://arxiv.org/abs/2505.18291

  23. [28]

    Chatcad: Interactive computer-aided diagnosis on medical image using large language models, 2023

    Wang, S., Zhao, Z., Ouyang, X., Wang, Q., and Shen, D. Chatcad: Interactive computer-aided diagnosis on medical image using large language models, 2023. URL https://arxiv.org/abs/2302.07257

  24. [30]

    The dawn of lmms: Preliminary explorations with gpt-4v(ision), 2023

    Yang, Z., Li, L., Lin, K., Wang, J., Lin, C.-C., Liu, Z., and Wang, L. The dawn of lmms: Preliminary explorations with gpt-4v(ision), 2023. URL https://arxiv.org/abs/2309.17421

  25. [31]

    Mixup-based unified framework to overcome gender bias resurgence

    Yu, L., Mao, Y., Wu, J., and Zhou, F. Mixup-based unified framework to overcome gender bias resurgence. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp.\ 1755--1759, 2023

  26. [32]

    Biases mitigation and expressiveness preservation in language models: A comprehensive pipeline (student abstract)

    Yu, L., Guo, L., Kuang, P., and Zhou, F. Biases mitigation and expressiveness preservation in language models: A comprehensive pipeline (student abstract). In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp.\ 23701--23702, 2024

  27. [33]

    Bridging the fairness gap: Enhancing pre-trained models with llm-generated sentences

    Yu, L., Guo, L., Kuang, P., and Zhou, F. Bridging the fairness gap: Enhancing pre-trained models with llm-generated sentences. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.\ 1--5. IEEE, 2025 a

  28. [34]

    Bimodal debiasing for text-to-image diffusion: Adaptive guidance in textual and visual spaces

    Yu, L., Sun, J., Kuang, P., Zhou, R., Zhou, F., and Feng, Z. Bimodal debiasing for text-to-image diffusion: Adaptive guidance in textual and visual spaces. In Proceedings of the 33rd ACM International Conference on Multimedia, pp.\ 11249--11258, 2025 b

  29. [35]

    Amplifying commonsense knowledge via bi-directional relation integrated graph-based contrastive pre-training from large language models

    Yu, L., Tian, F., Kuang, P., and Zhou, F. Amplifying commonsense knowledge via bi-directional relation integrated graph-based contrastive pre-training from large language models. Information Processing & Management, 62 0 (3): 0 104068, 2025 c

  30. [36]

    Causally-grounded dual-path attention intervention for object hallucination mitigation in lvlms

    Yu, L., Chen, Z., Kuang, P., Feng, Z., Zhou, F., Wang, L., and Dobbie, G. Causally-grounded dual-path attention intervention for object hallucination mitigation in lvlms. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pp.\ 36021--36029, 2026

  31. [37]

    Q., Stepputtis, S., Ramanan, D., Salakhutdinov, R., Morency, L.-P., Sycara, K., and Xie, Y

    Zhang, C., Wan, Z., Kan, Z., Ma, M. Q., Stepputtis, S., Ramanan, D., Salakhutdinov, R., Morency, L.-P., Sycara, K., and Xie, Y. Self-correcting decoding with generative feedback for mitigating hallucinations in large vision-language models, 2025. URL https://arxiv.org/abs/2502.06130

  32. [38]

    Mitigating hallucination in large vision-language models through aligning attention distribution to information flow, 2025

    Zhao, J., Zhang, F., Sun, X., and Feng, C. Mitigating hallucination in large vision-language models through aligning attention distribution to information flow, 2025. URL https://arxiv.org/abs/2505.14257

  33. [39]

    Causal-debias: Unifying debiasing in pretrained language models and fine-tuning via causal invariant learning

    Zhou, F., Mao, Y., Yu, L., Yang, Y., and Zhong, T. Causal-debias: Unifying debiasing in pretrained language models and fine-tuning via causal invariant learning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 4227--4241, 2023

  34. [40]

    Mitigating modality prior-induced hallucinations in multimodal large language models via deciphering attention causality

    Zhou, G., Yan, Y., Zou, X., Wang, K., Liu, A., and Hu, X. Mitigating modality prior-induced hallucinations in multimodal large language models via deciphering attention causality. In ICLR, 2025

  35. [41]

    Analyzing and mitigating object hallucination in large vision-language models

    Zhou, Y., Cui, C., Yoon, J., Zhang, L., Deng, Z., Finn, C., Bansal, M., and Yao, H. Analyzing and mitigating object hallucination in large vision-language models, 2024. URL https://arxiv.org/abs/2310.00754

  36. [42]

    Alleviating hallucinations in large language models through multi-model contrastive decoding and dynamic hallucination detection

    Zhu, C., Liu, Y., Zhang, H., Wang, A., Chen, G., Wang, L., Luo, W., and Zhang, K. Alleviating hallucinations in large language models through multi-model contrastive decoding and dynamic hallucination detection. Advances in Neural Information Processing Systems, 38: 165364--165388, 2026

  37. [44]

    Minigpt-4: Enhancing vision-language understanding with advanced large language models

    Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592

  38. [45]

    Visual instruction tuning

    Visual instruction tuning. Advances in neural information processing systems

  39. [46]

    Towards deconfounded image-text matching with causal inference

    Towards deconfounded image-text matching with causal inference. In Proceedings of the 31st ACM international conference on multimedia

  40. [47]

    Contrastive decoding: Open-ended text generation as optimization

    Contrastive decoding: Open-ended text generation as optimization. arXiv preprint arXiv:2210.15097

  41. [48]

    Omnimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm

    Omnimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  42. [49]

    Mitigating modality prior-induced hallucinations in multimodal large language models via deciphering attention causality

    Mitigating modality prior-induced hallucinations in multimodal large language models via deciphering attention causality. In ICLR

  43. [50]

    Instructblip: Towards general-purpose vision-language models with instruction tuning

    Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in neural information processing systems

  44. [51]

    Mitigating object hallucinations in large vision-language models through visual contrastive decoding

    Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  45. [52]

    General-Purpose Robotic Navigation via LVLM-Orchestrated Perception, Reasoning, and Acting

    General-Purpose Robotic Navigation via LVLM-Orchestrated Perception, Reasoning, and Acting. arXiv preprint arXiv:2506.17462

  46. [53]

    Shikra: Unleashing multimodal llm's referential dialogue magic

    Shikra: Unleashing multimodal llm's referential dialogue magic. arXiv preprint arXiv:2306.15195

  47. [54]

    MultiModal-GPT: A Vision and Language Model for Dialogue with Humans

    MultiModal-GPT: A Vision and Language Model for Dialogue with Humans, 2023

  48. [55]

    Stanford alpaca: An instruction-following llama model

    Stanford alpaca: An instruction-following llama model, 2023

  49. [56]

    ChatGPT outperforms crowd workers for text-annotation tasks

    Gilardi, F., Alizadeh, M., and Kubli, M. ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences. doi:10.1073/pnas.2305016120

  50. [57]

    Qwen-vl: A frontier large vision-language model with versatile abilities

    Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966

  51. [58]

    Otter: A Multi-Modal Model with In-Context Instruction Tuning

    Otter: A Multi-Modal Model with In-Context Instruction Tuning, 2025

  52. [59]

    International conference on machine learning

    International conference on machine learning. Transactions on machine learning research

  53. [60]

    VisualBERT: A Simple and Performant Baseline for Vision and Language

    VisualBERT: A Simple and Performant Baseline for Vision and Language, 2019

  54. [61]

    Aligning Large Multimodal Models with Factually Augmented RLHF

    Aligning Large Multimodal Models with Factually Augmented RLHF, 2023

  55. [62]

    Analyzing and Mitigating Object Hallucination in Large Vision-Language Models

    Analyzing and Mitigating Object Hallucination in Large Vision-Language Models, 2024

  56. [63]

    Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models

    Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models, 2024

  57. [64]

    Evaluation and Analysis of Hallucination in Large Vision-Language Models

    Evaluation and Analysis of Hallucination in Large Vision-Language Models, 2023

  58. [65]

    Survey of hallucination in natural language generation

    Survey of hallucination in natural language generation. ACM computing surveys, 2023

  59. [66]

    REPLUG: Retrieval-Augmented Black-Box Language Models

    REPLUG: Retrieval-Augmented Black-Box Language Models, 2023

  60. [67]

    Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

    Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models, 2025

  61. [68]

    Towards causal vqa: Revealing and reducing spurious correlations by invariant and covariant semantic editing

    Towards causal vqa: Revealing and reducing spurious correlations by invariant and covariant semantic editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  62. [69]

    Analyzing the Behavior of Visual Question Answering Models

    Analyzing the Behavior of Visual Question Answering Models, 2016

  63. [70]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition

  64. [71]

    Visual Perturbation-aware Collaborative Learning for Overcoming the Language Prior Problem

    Visual Perturbation-aware Collaborative Learning for Overcoming the Language Prior Problem, 2022

  65. [72]

    Volcano: Mitigating Multimodal Hallucination through Self-Feedback Guided Revision

    Volcano: Mitigating Multimodal Hallucination through Self-Feedback Guided Revision, 2024

  66. [73]

    IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding

    IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding, 2024

  67. [74]

    Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback

    Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  68. [75]

    DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models

    DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models, 2024

  69. [76]

    Causality

    Causality, 2009

  70. [77]

    Object hallucination in image captioning

    Object hallucination in image captioning. arXiv preprint arXiv:1809.02156

  71. [78]

    Mitigating Hallucination in Multimodal LLMs with Layer Contrastive Decoding

    Mitigating Hallucination in Multimodal LLMs with Layer Contrastive Decoding. arXiv preprint arXiv:2509.25177

  72. [79]

    Alleviating hallucinations in large language models through multi-model contrastive decoding and dynamic hallucination detection

    Alleviating hallucinations in large language models through multi-model contrastive decoding and dynamic hallucination detection. Advances in Neural Information Processing Systems

  73. [80]

    Who Brings the Frisbee: Probing Hidden Hallucination Factors in Large Vision-Language Model via Causality Analysis

    Who Brings the Frisbee: Probing Hidden Hallucination Factors in Large Vision-Language Model via Causality Analysis. arXiv preprint arXiv:2412.02946

  74. [81]

    Divergence measures based on the Shannon entropy

    Divergence measures based on the Shannon entropy. IEEE Transactions on Information theory, 2002

  75. [82]

    Mme: A comprehensive evaluation benchmark for multimodal large language models

    Mme: A comprehensive evaluation benchmark for multimodal large language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track

  76. [83]

    Evaluating object hallucination in large vision-language models

    Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355

  77. [84]

    Mitigating hallucinations in large vision-language models with instruction contrastive decoding

    Mitigating hallucinations in large vision-language models with instruction contrastive decoding. arXiv preprint arXiv:2403.18715

  78. [85]

    Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models

    Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models, 2025

  79. [86]

    Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation

    Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  80. [87]

    Mitigating object hallucinations in large vision-language models with assembly of global and local attention

    Mitigating object hallucinations in large vision-language models with assembly of global and local attention. In Proceedings of the Computer Vision and Pattern Recognition Conference
