See Only When Needed: Context-Aware Attention Intervention for Mitigating Hallucinations in LVLMs

Cees G.M. Snoek; Ling Shao; Wenbo Lyu; Xiantong Zhen; Yingjun Du; Yuqing Lei

arxiv: 2606.29847 · v1 · pith:AVV73EN2new · submitted 2026-06-29 · 💻 cs.CV

Yuqing Lei, Wenbo Lyu, Yingjun Du, Xiantong Zhen, Cees G.M. Snoek, Ling Shao

Pith reviewed 2026-06-30 06:44 UTC · model grok-4.3

classification 💻 cs.CV

keywords: hallucination mitigation, large vision-language models, attention intervention, context-aware selectivity, training-free inference, object hallucinations, visual grounding

The pith

Context-aware attention intervention reduces object hallucinations in vision-language models by intervening selectively on uncertain tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Context-aware Attention Intervention (CAI), a training-free method for mitigating hallucinations in large vision-language models. It enforces a "see only when needed" principle through two-axis selectivity, which decides both where to focus visual attention and when to intervene during decoding. Token-specific relevance drawn from early layers identifies the aligned image regions, and an entropy- and depth-gated tilt adjusts attention only for tokens whose uncertainty spikes in deeper layers. Confident tokens and irrelevant regions are left untouched, which avoids introducing spurious evidence or degrading fluency. Experiments across backbones and benchmarks report state-of-the-art gains even without the optional contrastive decoding module.

Core claim

CAI derives token-specific visual relevance from early-layer representations to localize semantically aligned regions, then applies a conservative entropy- and depth-gated attention tilt only to uncertainty-spiking tokens in deeper layers. This see-only-when-needed selectivity is claimed to strengthen grounding while preserving fluency and to achieve state-of-the-art hallucination mitigation. The paper characterizes the intervention as a KL-minimal reweighting with bounded interference under inactive gates or small tilts.

What carries the argument

Context-aware Attention Intervention (CAI), the two-axis selectivity mechanism that uses early-layer visual relevance for localization and applies gated attention reweighting only on uncertain deeper-layer tokens.
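
To make the two gates concrete, here is a minimal sketch of an entropy- and depth-gated attention tilt. It is an illustration under stated assumptions, not the paper's Eq. 5 or Eq. 6: the exponential form of the tilt, the tilt strength `lam`, and the function names are hypothetical. Only the starting layer and the entropy threshold γ are named in the source (Figure 6).

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (nats) of a probability vector."""
    return float(-np.sum(p * np.log(p + eps)))

def gated_tilt(attn, relevance, token_probs, layer, start_layer, gamma, lam):
    """Tilt attention toward relevant positions only when both gates are open.

    attn        : attention weights over positions, summing to 1
    relevance   : per-position relevance in [0, 1] (assumed to come from an
                  early-layer token-patch similarity; 0 for non-image positions)
    token_probs : next-token distribution read out at this layer
    start_layer : depth gate (first layer where intervention is allowed)
    gamma       : entropy threshold (uncertainty gate)
    lam         : tilt strength (hypothetical parameter)
    """
    gate_open = layer >= start_layer and entropy(token_probs) > gamma
    if not gate_open:
        # Confident token or shallow layer: leave attention untouched.
        return attn.copy(), False
    tilted = attn * np.exp(lam * relevance)  # up-weight relevant positions
    return tilted / tilted.sum(), True

attn = np.array([0.1, 0.2, 0.3, 0.4])
relevance = np.array([1.0, 0.0, 0.0, 0.0])
uncertain = np.full(4, 0.25)
new_attn, applied = gated_tilt(attn, relevance, uncertain,
                               layer=20, start_layer=16, gamma=1.0, lam=0.5)
```

With a closed gate the function returns the attention unchanged, which is the behaviour the "inactive gates" part of the bounded-interference claim relies on.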

If this is right

The method yields consistent hallucination reductions across multiple LVLM backbones without any training.
Linguistic fluency is maintained because confident tokens and irrelevant regions receive no change.
Contrastive decoding remains optional rather than required for the gains.
The intervention is characterized as KL-minimal reweighting with interference bounded by inactive gates or small tilts.
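
One standard way to read "KL-minimal reweighting" is as an exponential tilt. The derivation below is a sketch under assumptions, not the paper's own analysis: $a$ denotes the original attention distribution, $s_i \in [0,1]$ a per-position relevance score, $c$ a target expected relevance, and $\lambda \ge 0$ the tilt strength (the Lagrange multiplier of the constraint). All of these symbols are introduced here for illustration.

```latex
\begin{aligned}
\tilde{a} &= \arg\min_{q \in \Delta} \; \mathrm{KL}(q \,\|\, a)
  \quad \text{s.t.} \quad \sum_i q_i s_i = c, \\
\tilde{a}_i &= \frac{a_i \, e^{\lambda s_i}}{\sum_j a_j \, e^{\lambda s_j}}, \\
\mathrm{KL}(\tilde{a} \,\|\, a)
  &= \lambda \sum_i \tilde{a}_i s_i - \log \sum_j a_j \, e^{\lambda s_j}
  \;\le\; \lambda \Big( \sum_i \tilde{a}_i s_i - \sum_i a_i s_i \Big) \;\le\; \lambda .
\end{aligned}
```

The first inequality follows from Jensen's inequality applied to the log-partition term, and the second from $s_i \in [0,1]$. Under this reading, $\lambda = 0$ (an inactive gate) gives $\tilde{a} = a$ and zero divergence, and a small $\lambda$ keeps the divergence small.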

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach implies that visual grounding errors concentrate in deeper layers specifically around high-entropy tokens rather than uniformly.
It could extend to other generation flaws in multimodal models where selective correction might avoid side effects.
The early-to-late layer distinction suggests a natural division of labor that future architectures might exploit directly.
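
The first extension above concerns how token entropy evolves with depth. A minimal sketch of that measurement follows, assuming per-layer next-token logits are available (for example by projecting each layer's hidden state through the output head, which is an assumption about the readout and not something this page specifies). The function names are hypothetical.

```python
import numpy as np

def layerwise_entropy(layer_logits):
    """Entropy (nats) of the next-token distribution at every layer.

    layer_logits: array of shape (num_layers, vocab_size).
    """
    # Numerically stable log-softmax along the vocabulary axis.
    z = layer_logits - layer_logits.max(axis=-1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    p = np.exp(log_p)
    return -(p * log_p).sum(axis=-1)

def flagged_layers(entropies, start_layer, gamma):
    """Indices of layers at or beyond start_layer whose entropy exceeds gamma."""
    return [l for l, h in enumerate(entropies) if l >= start_layer and h > gamma]

# Toy example: one confident layer followed by two maximally uncertain ones.
logits = np.array([[10.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0]])
H = layerwise_entropy(logits)
layers = flagged_layers(H, start_layer=1, gamma=1.0)
```

Counting the flagged layers per token is one way to check whether uncertainty concentrates in deeper layers or is spread uniformly.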

Load-bearing premise

Early-layer representations accurately identify which image regions matter for each specific token, and selectively tilting attention on uncertain tokens strengthens correct grounding without adding new errors or reducing output quality.

What would settle it

A controlled test on standard benchmarks where the attention tilt is applied only to tokens flagged as uncertain shows no reduction in hallucination rates relative to the unmodified model.

Figures

Figures reproduced from arXiv: 2606.29847 by Cees G.M. Snoek, Ling Shao, Wenbo Lyu, Xiantong Zhen, Yingjun Du, Yuqing Lei.

**Figure 2.** Given visual input (a) and the query “Please describe the image in detail”, (b) shows the evolution of token entropy across decoding layers. In deeper layers, hallucination-prone tokens (e.g., “traffic light”) exhibit markedly higher entropy than grounded tokens (e.g., “man”), whereas tokens dominated by language priors (e.g., “The”, “a”, “at”) remain low-entropy.

**Figure 3.** Overview of CAI. At each decoding step, the lowest layer evaluates the similarity between the generated token and each patch token (Eq. 4), yielding a visual grounding signal. Attention interventions are applied to deep decoding layers with high entropy (Eq. 6), redirecting attention toward patch tokens in proportion to their similarity (Eq. 5). Integrated with contrastive decoding (Eq. 7), CAI mitigates r…

**Figure 5.** Case study: failure mode of confident hallucination. Given the prompt “Please describe this image in detail”, the model generates the hallucinated token “plane”, which is not present in (a). Its predictive entropy remains below the threshold across multiple layers, illustrating a failure mode in which confident hallucinations evade entropy-based intervention.

**Figure 6.** Ablation study on the starting intervention layer ls and the entropy threshold γ under the random setting of AOKVQA in POPE.

**Figure 8.** Visualization of token-image similarity across decoding layers.

**Figure 9.** Case study of visual input (a) with the query “Is there only one zipper in the picture?”. The intervention layers are shown in (b), and token-image similarity in (c) establishes grounding between the response token and the visual input. The attention maps from the 0-th head in layer 31, before and after the similarity-guided intervention, are visualized in (d) and (e).

**Figure 10.** Number of tokens requiring intervention at each layer.

**Figure 11.** Case study of CHAIR in long text generation.

read the original abstract

Large Vision-Language Models (LVLMs) excel at multimodal tasks but remain prone to object hallucinations. Prior training-free remedies often uniformly strengthen visual signals, which may also amplify irrelevant regions and introduce spurious evidence, harming fluency. We propose Context-aware Attention Intervention (CAI), a training-free inference-time mechanism that enforces a see only when needed principle via two-axis selectivity: where to look and when to intervene. At each decoding step, CAI derives token-specific visual relevance from early-layer representations to localize semantically aligned regions, and applies a conservative, entropy- and depth-gated attention tilt only for uncertainty-spiking tokens in deeper layers where visual grounding degrades, leaving confident tokens and irrelevant regions largely unchanged. This targeted intervention strengthens visual grounding while preserving linguistic fluency, and it yields consistent improvements even without contrastive decoding, which remains optional as an auxiliary bias-suppression module. Extensive experiments across multiple LVLM backbones and benchmarks show that CAI achieves state-of-the-art hallucination mitigation, and our analysis characterizes CAI as a KL-minimal attention reweighting with bounded interference under inactive gates or small tilts. Code is available at https://github.com/Iris1946/CAI.
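
The abstract treats contrastive decoding as an optional bias-suppression module. The paper's own formulation (Eq. 7) is not reproduced on this page, so the sketch below shows the generic visual contrastive decoding form instead: next-token scores from the full input are contrasted against scores from a prior-dominated pass (for example, one with a distorted or absent image). The parameters `alpha` and `beta` and the function names are assumptions for illustration.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax of a 1-D array."""
    z = x - x.max()
    return z - np.log(np.exp(z).sum())

def contrastive_logits(logits_full, logits_contrast, alpha=1.0, beta=0.1):
    """Generic contrastive decoding scores (not the paper's exact Eq. 7).

    logits_full     : next-token logits with the real image
    logits_contrast : logits from a pass dominated by language priors
    alpha           : contrast strength
    beta            : plausibility cutoff relative to the most likely token
    """
    lp_full = log_softmax(logits_full)
    lp_con = log_softmax(logits_contrast)
    scores = (1 + alpha) * lp_full - alpha * lp_con
    # Keep only tokens that the full-input model itself finds plausible.
    keep = lp_full >= np.log(beta) + lp_full.max()
    return np.where(keep, scores, -np.inf)

full = np.array([1.9, 2.0, -5.0])      # token 1 narrowly preferred
contrast = np.array([0.0, 3.0, 0.0])   # token 1 is favoured by the prior alone
scores = contrastive_logits(full, contrast)
choice = int(np.argmax(scores))
```

In this toy case the token that the prior-only pass strongly favours is penalised, so the contrastive choice differs from the plain argmax of `full`, while the implausible third token is masked out.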

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CAI adds a selective entropy-and-depth gated attention tilt that targets uncertain tokens, but the early-layer localization step that justifies the bounded-interference claim lacks direct validation.

The paper's main new element is the two-axis selectivity: early-layer features pick the regions to attend, while entropy and depth gates decide whether to apply a small attention tilt only on high-uncertainty tokens in later layers. This is a tighter combination than the uniform visual boosts in earlier training-free work.

It does a few things cleanly. The method stays training-free, treats contrastive decoding as optional, and reports gains across multiple LVLMs and benchmarks. The KL-minimal reweighting framing with the inactive-gate bound is a useful way to describe why fluency should hold.

The soft spot is exactly the one the stress-test flags. Token-specific visual relevance comes from early layers to localize aligned regions, yet the abstract and available description give no quantitative overlap with ground-truth objects or ablation that isolates localization quality. If those early cues are noisy, the selectivity guarantee weakens and the SOTA claim rests on an untested premise. That is the load-bearing assumption.

The work is aimed at people who need practical inference-time fixes for LVLM reliability. It shows enough structure and cross-model results to deserve referee time, even if the localization step needs tighter evidence in revision.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Context-aware Attention Intervention (CAI), a training-free inference-time mechanism for mitigating object hallucinations in LVLMs. It enforces a 'see only when needed' principle through two-axis selectivity: deriving token-specific visual relevance from early-layer representations to localize semantically aligned regions, then applying entropy- and depth-gated attention tilts only to uncertainty-spiking tokens in deeper layers. The method claims consistent SOTA improvements across multiple LVLM backbones and benchmarks, even without contrastive decoding (which is optional), and characterizes the intervention as KL-minimal reweighting with bounded interference under inactive gates or small tilts. Code is released publicly.

Significance. If the empirical results and the KL-minimal characterization hold under the stated assumptions, this would represent a meaningful advance by providing a targeted, training-free intervention that avoids the fluency degradation and spurious evidence risks of uniform visual strengthening approaches. The public code release supports reproducibility and enables direct follow-up work on LVLM reliability.

major comments (2)

[Experiments] The central claim that two-axis selectivity strengthens grounding without amplifying irrelevant regions or harming fluency rests on the accuracy of token-specific visual relevance derived from early-layer representations. No quantitative validation of this localization (e.g., overlap metrics with ground-truth object regions or ablation isolating localization quality from the gating) appears in the experiments, leaving the bounded-interference guarantee unsupported.
[Analysis] The analysis section characterizes CAI as KL-minimal attention reweighting, but the abstract provides no derivation, closed-form expression, or controlled measurement showing that the reweighting is minimal or that interference remains bounded when gates are inactive; this property is load-bearing for the 'conservative' and 'parameter-free' framing of the intervention.
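
The overlap metric requested in the first major comment could be computed along the following lines. This is a hedged sketch: it assumes a per-patch relevance map in [0, 1] and a binary ground-truth object mask on the same patch grid, and the threshold value and function name are hypothetical choices, not details from the paper.

```python
import numpy as np

def relevance_iou(relevance, gt_mask, threshold=0.5):
    """IoU between a thresholded relevance map and a ground-truth mask.

    relevance : array (H, W) of per-patch relevance scores in [0, 1]
    gt_mask   : array (H, W), nonzero where the object is present
    """
    pred = relevance >= threshold
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks empty: treated here as perfect agreement (a convention).
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)

relevance = np.array([[0.9, 0.8],
                      [0.1, 0.2]])
gt_mask = np.array([[1, 0],
                    [0, 0]])
iou = relevance_iou(relevance, gt_mask)
```

Averaging this score over tokens that name annotated objects would give a mean IoU, which separates localization quality from the effect of the entropy and depth gates.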

minor comments (1)

[Abstract] The abstract states 'extensive experiments across multiple LVLM backbones and benchmarks' but does not name the specific models or datasets in the opening paragraph; moving this detail forward would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below with clarifications and indicate planned revisions to strengthen the empirical and analytical support.

Referee: [Experiments] The central claim that two-axis selectivity strengthens grounding without amplifying irrelevant regions or harming fluency rests on the accuracy of token-specific visual relevance derived from early-layer representations. No quantitative validation of this localization (e.g., overlap metrics with ground-truth object regions or ablation isolating localization quality from the gating) appears in the experiments, leaving the bounded-interference guarantee unsupported.

Authors: We acknowledge the value of quantitative validation for the early-layer localization. The current manuscript provides qualitative attention visualizations demonstrating semantic alignment, but we agree that overlap metrics and isolating ablations would offer stronger evidence. In the revised version, we will add mean IoU overlap with ground-truth object regions on a benchmark subset and an ablation separating localization quality from the entropy/depth gates to better support the bounded-interference property. revision: yes
Referee: [Analysis] The analysis section characterizes CAI as KL-minimal attention reweighting, but the abstract provides no derivation, closed-form expression, or controlled measurement showing that the reweighting is minimal or that interference remains bounded when gates are inactive; this property is load-bearing for the 'conservative' and 'parameter-free' framing of the intervention.

Authors: The analysis section derives the KL-minimal characterization and states the bounded-interference property under inactive gates or small tilts. To make this more explicit as requested, the revision will include a dedicated subsection with the closed-form KL divergence expression between original and intervened distributions, plus controlled empirical measurements of KL values across gate states and tilt sizes. This will reinforce the conservative framing without altering the core method. revision: yes
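
The controlled measurement described in this response could look like the sketch below, which sweeps a tilt strength and records the KL divergence between the intervened and original attention for open and closed gates. The exponential tilt, the `lam` values, and the function names are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def tilt(attn, relevance, lam):
    """Exponentially tilt attention toward relevant positions."""
    w = attn * np.exp(lam * relevance)
    return w / w.sum()

def kl(q, p, eps=1e-12):
    """KL(q || p) in nats for two probability vectors."""
    return float(np.sum(q * (np.log(q + eps) - np.log(p + eps))))

def kl_sweep(attn, relevance, lams, gate_open):
    """Return (lam, KL) pairs; a closed gate leaves attention unchanged."""
    rows = []
    for lam in lams:
        q = tilt(attn, relevance, lam) if gate_open else attn
        rows.append((lam, kl(q, attn)))
    return rows

attn = np.array([0.1, 0.2, 0.3, 0.4])
relevance = np.array([1.0, 0.5, 0.0, 0.0])
lams = [0.0, 0.1, 0.5, 1.0]
closed = kl_sweep(attn, relevance, lams, gate_open=False)
opened = kl_sweep(attn, relevance, lams, gate_open=True)
```

Under these assumptions the closed-gate divergence is zero for every tilt size, and the open-gate divergence grows with `lam` while staying below `lam` when relevance scores lie in [0, 1].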

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained empirical intervention

The paper introduces CAI as a training-free inference-time mechanism that derives token-specific visual relevance from early-layer representations and applies entropy- and depth-gated attention tilt. The central result (SOTA hallucination mitigation with KL-minimal reweighting characterization) is supported by experiments across multiple LVLM backbones and benchmarks, with the characterization presented as an analysis of the proposed method rather than a reduction to fitted inputs or self-citations. No load-bearing steps reduce by construction to the paper's own definitions, predictions, or prior self-citations. The method remains independent of its inputs and externally falsifiable via the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard transformer attention behavior plus the unverified effectiveness of the proposed early-layer relevance extraction and gated tilt; no free parameters, new entities, or ad-hoc axioms are explicitly introduced in the abstract.

axioms (1)

standard math: Standard multi-head attention and layer-wise representation properties in transformer-based LVLMs.
The method assumes early layers encode usable visual relevance and deeper layers suffer grounding degradation, which are background properties of the architecture.

Reference graph

Works this paper leans on

53 extracted references · 19 canonical work pages · 14 internal anchors

[1]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

An, W., Tian, F., Leng, S., Nie, J., Lin, H., Wang, Q., Chen, P., Zhang, X., Lu, S.: Mitigating object hallucinations in large vision-language models with assembly of global and local attention. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 29915–29926 (2025)

2025
[2]

Qwen Technical Report

Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Hallucination of Multimodal Large Language Models: A Survey

Bai, Z., Wang, P., Xiao, T., He, T., Han, Z., Zhang, Z., Shou, M.Z.: Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Advances in Neural Information Processing Systems37, 7400–7426 (2024)

Basu, S., Grayson, M., Morrison, C., Nushi, B., Feizi, S., Massiceti, D.: Under- standing information storage and transfer in multi-modal large language models. Advances in Neural Information Processing Systems37, 7400–7426 (2024)

2024
[5]

Advances in neural information processing systems33, 1877–1901 (2020)

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Nee- lakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems33, 1877–1901 (2020)

1901
[6]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Chen, J., Zhang, T., Huang, S., Niu, Y., Zhang, L., Wen, L., Hu, X.: Ict: Image- object cross-level trusted intervention for mitigating object hallucination in large vision-language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 4209–4221 (2025)

2025
[7]

arXiv preprint arXiv:2311.16479 (2023) 16 Lei et al

Chen, Z., Zhu, Y., Zhan, Y., Li, Z., Zhao, C., Wang, J., Tang, M.: Mitigating hallucination in visual language models with visual supervision. arXiv preprint arXiv:2311.16479 (2023) 16 Lei et al

work page arXiv 2023
[8]

Journal of Machine Learning Research24(240), 1–113 (2023)

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton, C., Gehrmann, S., et al.: Palm: Scaling lan- guage modeling with pathways. Journal of Machine Learning Research24(240), 1–113 (2023)

2023
[9]

In: International Conference on Learning Representations

Chuang, Y.S., Xie, Y., Luo, H., Kim, Y., Glass, J.R., He, P.: Dola: Decoding by contrasting layers improves factuality in large language models. In: International Conference on Learning Representations. vol. 2024, pp. 54158–54183 (2024)

2024
[10]

Advances in neural information processing systems36, 49250–49267 (2023)

Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in neural information processing systems36, 49250–49267 (2023)

2023
[11]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Favero, A., Zancato, L., Trager, M., Choudhary, S., Perera, P., Achille, A., Swami- nathan, A., Soatto, S.: Multi-modal hallucination control by visual information grounding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14303–14312 (2024)

2024
[12]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Guan, T., Liu, F., Wu, X., Xian, R., Li, Z., Liu, X., Wang, X., Chen, L., Huang, F., Yacoob, Y., et al.: Hallusionbench: an advanced diagnostic suite for entan- gled language hallucination and visual illusion in large vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14375–14385 (2024)

2024
[13]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Huang, W., Jia, B., Zhai, Z., Cao, S., Ye, Z., Zhao, F., Xu, Z., Hu, Y., Lin, S.: Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

In: Proceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition

Hudson, D.A., Manning, C.D.: Gqa: A new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition. pp. 6700–6709 (2019)

2019
[15]

arXiv preprint arXiv:2408.02032 (2024)

Huo, F., Xu, W., Zhang, Z., Wang, H., Chen, Z., Zhao, P.: Self-introspective de- coding: Alleviating hallucinations for large vision-language models. arXiv preprint arXiv:2408.02032 (2024)

work page arXiv 2024
[16]

Jiang, C., Xu, H., Dong, M., Chen, J., Ye, W., Yan, M., Ye, Q., Zhang, J., Huang, F., Zhang, S.: Hallucination augmented contrastive learning for multimodal large languagemodel.In:ProceedingsoftheIEEE/CVFConferenceonComputerVision and Pattern Recognition. pp. 27036–27046 (2024)

2024
[17]

Proceedings of the 2023 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (2023)

Lee, S., Park, S.H., Jo, Y., Seo, M.: Volcano: mitigating multimodal hallucination through self-feedback guided revision. Proceedings of the 2023 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (2023)

2023
[18]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Leng, S., Zhang, H., Chen, G., Li, X., Lu, S., Miao, C., Bing, L.: Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13872–13882 (2024)

2024
[19]

In: International conference on machine learning

Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. In: International conference on machine learning. pp. 19730–19742. PMLR (2023)

2023
[20]

VisualBERT: A Simple and Performant Baseline for Vision and Language

Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J., Chang, K.W.: Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1908
[21]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (2023) Context-Aware Attention Intervention for Mitigating Hallucinations 17

Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision-language models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (2023) Context-Aware Attention Intervention for Mitigating Hallucinations 17

2023
[22]

In: European conference on computer vision

Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)

2014
[23]

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Liu, F., Lin, K., Li, L., Wang, J., Yacoob, Y., Wang, L.: Mitigating hallucina- tion in large multi-modal models via robust instruction tuning. arXiv preprint arXiv:2306.14565 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

A Survey on Hallucination in Large Vision-Language Models

Liu, H., Xue, W., Chen, Y., Chen, D., Zhao, X., Wang, K., Hou, L., Li, R., Peng, W.: A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tun- ing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26296–26306 (2024)

2024
[26]

Advances in neural information processing systems36, 34892–34916 (2023)

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023)

2023
[27]

In: The Thirteenth International Conference on Learning Representations (2025)

Liu, S., Ye, H., Zou, J.: Reducing hallucinations in large vision-language models via latent space steering. In: The Thirteenth International Conference on Learning Representations (2025)

2025
[28]

In: European Conference on Com- puter Vision

Liu, S., Zheng, K., Chen, W.: Paying more attention to image: A training-free method for alleviating hallucination in lvlms. In: European Conference on Com- puter Vision. pp. 125–140. Springer (2024)

2024
[29]

Visual-RFT: Visual Reinforcement Fine-Tuning

Liu, Z., Sun, Z., Zang, Y., Dong, X., Cao, Y., Duan, H., Lin, D., Wang, J.: Visual- rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

In: The Thirteenth International Conference on Learning Representations (2025)

Neo, C., Ong, L., Torr, P., Geva, M., Krueger, D., Barez, F.: Towards interpret- ing visual information processing in vision-language models. In: The Thirteenth International Conference on Learning Representations (2025)

2025
[31]

Advances in neural information processing sys- tems35, 27730–27744 (2022)

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. Advances in neural information processing sys- tems35, 27730–27744 (2022)

2022
[32]

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing p

Rohrbach, A., Hendricks, L.A., Burns, K., Darrell, T., Saenko, K.: Object hallu- cination in image captioning. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing p. 4035–4045 (2018)

2018
[33]

In: European conference on computer vision

Schwenk, D., Khandelwal, A., Clark, C., Marino, K., Mottaghi, R.: A-okvqa: A benchmark for visual question answering using world knowledge. In: European conference on computer vision. pp. 146–162. Springer (2022)

2022
[34]

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Shen, H., Liu, P., Li, J., Fang, C., Ma, Y., Liao, J., Shen, Q., Zhang, Z., Zhao, K., Zhang, Q., et al.: Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

In: Proceedings of the IEEE/CVF international conference on computer vision

Sun,C.,Myers,A.,Vondrick,C.,Murphy,K.,Schmid,C.:Videobert:Ajointmodel for video and language representation learning. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 7464–7473 (2019)

2019
[36]

Aligning Large Multimodal Models with Factually Augmented RLHF

Sun, Z., Shen, S., Cao, S., Liu, H., Li, C., Shen, Y., Gan, C., Gui, L.Y., Wang, Y.X., Yang, Y., et al.: Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

arXiv preprint arXiv:2503.20752 (2025)

Tan, H., Ji, Y., Hao, X., Lin, M., Wang, P., Wang, Z., Zhang, S.: Reason-rft: Reinforcement fine-tuning for visual reasoning. arXiv preprint arXiv:2503.20752 (2025)

work page arXiv 2025
[38]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Tang, F., Liu, C., Xu, Z., Hu, M., Huang, Z., Xue, H., Chen, Z., Peng, Z., Yang, Z., Zhou, S., et al.: Seeing far and clearly: Mitigating hallucinations in mllms with attention causal decoding. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26147–26159 (2025) 18 Lei et al

2025
[39]

Team, Q.: Qwen3.5: Accelerating productivity with native multimodal agents (February 2026),https://qwen.ai/blog?id=qwen3.5

2026
[40]

LLaMA: Open and Efficient Foundation Language Models

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[41]

Proceedings of the International Confer- ence on Computer Vision (2025)

Wan, Z., Zhang, C., Yong, S., Ma, M.Q., Stepputtis, S., Morency, L.P., Ramanan, D., Sycara, K., Xie, Y.: Only: One-layer intervention sufficiently mitigates halluci- nations in large vision-language models. Proceedings of the International Confer- ence on Computer Vision (2025)

2025
[42]

Proceedings of the 29th International Conference on Computational Linguistics p

Wu, Y., Zhao, Y., Zhao, S., Zhang, Y., Yuan, X., Zhao, G., Jiang, N.: Overcoming language priors in visual question answering via distinguishing superficially simi- lar instances. Proceedings of the 29th International Conference on Computational Linguistics p. 5721–5729 (2022)

2022
[43]

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Ye, Q., Xu, H., Xu, G., Ye, J., Yan, M., Zhou, Y., Wang, J., Hu, A., Shi, P., Shi, Y., et al.: mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44]

National Science Review11(12), nwae403 (2024)

Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., Chen, E.: A survey on multimodal large language models. National Science Review11(12), nwae403 (2024)

2024
[45]

In: ProceedingsoftheIEEE/CVFConferenceonComputerVisionandPatternRecog- nition

Yu, Q., Li, J., Wei, L., Pang, L., Ye, W., Qin, B., Tang, S., Tian, Q., Zhuang, Y.: Hallucidoctor: Mitigating hallucinatory toxicity in visual instruction data. In: ProceedingsoftheIEEE/CVFConferenceonComputerVisionandPatternRecog- nition. pp. 12944–12953 (2024)

2024
[46]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Yu, T., Yao, Y., Zhang, H., He, T., Han, Y., Cui, G., Hu, J., Liu, Z., Zheng, H.T., Sun, M., et al.: Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13807–13816 (2024)

2024
[47]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics pp

Yue, Z., Zhang, L., Jin, Q.: Less is more: Mitigating multimodal hallucination from an eos decision perspective. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics pp. 11766–11781 (2024)

2024
[48]

arXiv preprint arXiv:2601.20279 (2026)

Zhang, X., Zhu, Y., Gu, C., Yuan, X., Zhao, Q., Cao, J., Tang, F., Fan, S., Shen, Y., Shen, C., et al.: Hallucination begins where saliency drops. arXiv preprint arXiv:2601.20279 (2026)

work page arXiv 2026
[49]

Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization

Zhao, Z., Wang, B., Ouyang, L., Dong, X., Wang, J., He, C.: Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization. arXiv preprint arXiv:2311.16839 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[50]

R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model

Zhou, H., Li, X., Wang, R., Cheng, M., Zhou, T., Hsieh, C.J.: R1-zero’s" aha mo- ment" in visual reasoning on a 2b non-sft model. arXiv preprint arXiv:2503.05132 (2025)

[51] Zhou, Y., Cui, C., Yoon, J., Zhang, L., Deng, Z., Finn, C., Bansal, M., Yao, H.: Analyzing and mitigating object hallucination in large vision-language models. The Eleventh International Conference on Learning Representations (2023)

[52] Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)

[53] Zou, X., Wang, Y., Yan, Y., Lyu, Y., Zheng, K., Huang, S., Chen, J., Jiang, P., Liu, J., Tang, C., Hu, X.: Look twice before you answer: Memory-space visual retracing for hallucination mitigation in multimodal large language models. The Forty-second International Conference on Machine Learning (ICML) (2025)


[1] An, W., Tian, F., Leng, S., Nie, J., Lin, H., Wang, Q., Chen, P., Zhang, X., Lu, S.: Mitigating object hallucinations in large vision-language models with assembly of global and local attention. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 29915–29926 (2025)

[2] Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)

[3] Bai, Z., Wang, P., Xiao, T., He, T., Han, Z., Zhang, Z., Shou, M.Z.: Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930 (2024)

[4] Basu, S., Grayson, M., Morrison, C., Nushi, B., Feizi, S., Massiceti, D.: Understanding information storage and transfer in multi-modal large language models. Advances in Neural Information Processing Systems 37, 7400–7426 (2024)

[5] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)

[6] Chen, J., Zhang, T., Huang, S., Niu, Y., Zhang, L., Wen, L., Hu, X.: ICT: Image-object cross-level trusted intervention for mitigating object hallucination in large vision-language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 4209–4221 (2025)

[7] Chen, Z., Zhu, Y., Zhan, Y., Li, Z., Zhao, C., Wang, J., Tang, M.: Mitigating hallucination in visual language models with visual supervision. arXiv preprint arXiv:2311.16479 (2023)

[8] Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton, C., Gehrmann, S., et al.: PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240), 1–113 (2023)

[9] Chuang, Y.S., Xie, Y., Luo, H., Kim, Y., Glass, J.R., He, P.: DoLa: Decoding by contrasting layers improves factuality in large language models. In: International Conference on Learning Representations. vol. 2024, pp. 54158–54183 (2024)

[10] Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: InstructBLIP: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems 36, 49250–49267 (2023)

[11] Favero, A., Zancato, L., Trager, M., Choudhary, S., Perera, P., Achille, A., Swaminathan, A., Soatto, S.: Multi-modal hallucination control by visual information grounding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14303–14312 (2024)

[12] Guan, T., Liu, F., Wu, X., Xian, R., Li, Z., Liu, X., Wang, X., Chen, L., Huang, F., Yacoob, Y., et al.: HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14375–14385 (2024)

[13] Huang, W., Jia, B., Zhai, Z., Cao, S., Ye, Z., Zhao, F., Xu, Z., Hu, Y., Lin, S.: Vision-R1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749 (2025)

[14] Hudson, D.A., Manning, C.D.: GQA: A new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6700–6709 (2019)

[15] Huo, F., Xu, W., Zhang, Z., Wang, H., Chen, Z., Zhao, P.: Self-introspective decoding: Alleviating hallucinations for large vision-language models. arXiv preprint arXiv:2408.02032 (2024)

[16] Jiang, C., Xu, H., Dong, M., Chen, J., Ye, W., Yan, M., Ye, Q., Zhang, J., Huang, F., Zhang, S.: Hallucination augmented contrastive learning for multimodal large language model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 27036–27046 (2024)

[17] Lee, S., Park, S.H., Jo, Y., Seo, M.: Volcano: Mitigating multimodal hallucination through self-feedback guided revision. Proceedings of the 2023 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (2023)

[18] Leng, S., Zhang, H., Chen, G., Li, X., Lu, S., Miao, C., Bing, L.: Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13872–13882 (2024)

[19] Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: International Conference on Machine Learning. pp. 19730–19742. PMLR (2023)

[20] Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J., Chang, K.W.: VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)

[21] Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision-language models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (2023)

[22] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision. pp. 740–755. Springer (2014)

[23] Liu, F., Lin, K., Li, L., Wang, J., Yacoob, Y., Wang, L.: Mitigating hallucination in large multi-modal models via robust instruction tuning. arXiv preprint arXiv:2306.14565 (2023)

[24] Liu, H., Xue, W., Chen, Y., Chen, D., Zhao, X., Wang, K., Hou, L., Li, R., Peng, W.: A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253 (2024)

[25] Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 26296–26306 (2024)

[26] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36, 34892–34916 (2023)

[27] Liu, S., Ye, H., Zou, J.: Reducing hallucinations in large vision-language models via latent space steering. In: The Thirteenth International Conference on Learning Representations (2025)

[28] Liu, S., Zheng, K., Chen, W.: Paying more attention to image: A training-free method for alleviating hallucination in LVLMs. In: European Conference on Computer Vision. pp. 125–140. Springer (2024)

[29] Liu, Z., Sun, Z., Zang, Y., Dong, X., Cao, Y., Duan, H., Lin, D., Wang, J.: Visual-RFT: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785 (2025)

[30] Neo, C., Ong, L., Torr, P., Geva, M., Krueger, D., Barez, F.: Towards interpreting visual information processing in vision-language models. In: The Thirteenth International Conference on Learning Representations (2025)

[31] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, 27730–27744 (2022)

[32] Rohrbach, A., Hendricks, L.A., Burns, K., Darrell, T., Saenko, K.: Object hallucination in image captioning. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4035–4045 (2018)

[33] Schwenk, D., Khandelwal, A., Clark, C., Marino, K., Mottaghi, R.: A-OKVQA: A benchmark for visual question answering using world knowledge. In: European Conference on Computer Vision. pp. 146–162. Springer (2022)

[34] Shen, H., Liu, P., Li, J., Fang, C., Ma, Y., Liao, J., Shen, Q., Zhang, Z., Zhao, K., Zhang, Q., et al.: VLM-R1: A stable and generalizable R1-style large vision-language model. arXiv preprint arXiv:2504.07615 (2025)

[35] Sun, C., Myers, A., Vondrick, C., Murphy, K., Schmid, C.: VideoBERT: A joint model for video and language representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7464–7473 (2019)

[36] Sun, Z., Shen, S., Cao, S., Liu, H., Li, C., Shen, Y., Gan, C., Gui, L.Y., Wang, Y.X., Yang, Y., et al.: Aligning large multimodal models with factually augmented RLHF. arXiv preprint arXiv:2309.14525 (2023)

[37] Tan, H., Ji, Y., Hao, X., Lin, M., Wang, P., Wang, Z., Zhang, S.: Reason-RFT: Reinforcement fine-tuning for visual reasoning. arXiv preprint arXiv:2503.20752 (2025)

[38] Tang, F., Liu, C., Xu, Z., Hu, M., Huang, Z., Xue, H., Chen, Z., Peng, Z., Yang, Z., Zhou, S., et al.: Seeing far and clearly: Mitigating hallucinations in MLLMs with attention causal decoding. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26147–26159 (2025)

[39] Team, Q.: Qwen3.5: Accelerating productivity with native multimodal agents (February 2026), https://qwen.ai/blog?id=qwen3.5

[40] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

[41] Wan, Z., Zhang, C., Yong, S., Ma, M.Q., Stepputtis, S., Morency, L.P., Ramanan, D., Sycara, K., Xie, Y.: ONLY: One-layer intervention sufficiently mitigates hallucinations in large vision-language models. Proceedings of the International Conference on Computer Vision (2025)

[42] Wu, Y., Zhao, Y., Zhao, S., Zhang, Y., Yuan, X., Zhao, G., Jiang, N.: Overcoming language priors in visual question answering via distinguishing superficially similar instances. Proceedings of the 29th International Conference on Computational Linguistics, pp. 5721–5729 (2022)

[43] Ye, Q., Xu, H., Xu, G., Ye, J., Yan, M., Zhou, Y., Wang, J., Hu, A., Shi, P., Shi, Y., et al.: mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178 (2023)

[44] Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., Chen, E.: A survey on multimodal large language models. National Science Review 11(12), nwae403 (2024)
