What Makes LVLMs Hallucinate Less? Unveiling the Architectural Factors Behind Hallucination Robustness

Jiancheng Lv; Jizhe Zhou; Jun Luo; Xia Du; Yusheng He; Zheng Lin

arxiv: 2605.30911 · v1 · pith:HYP4EW57new · submitted 2026-05-29 · 💻 cs.CV · cs.AI

What Makes LVLMs Hallucinate Less? Unveiling the Architectural Factors Behind Hallucination Robustness

Yusheng He , Jizhe Zhou , Xia Du , Zheng Lin , Jun Luo , Jiancheng Lv This is my paper

Pith reviewed 2026-06-28 23:20 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords large vision-language modelshallucinationsarchitectural factorslinguistic foundationvisual representationsemantic alignmentCoSimUE benchmark

0 comments

The pith

Architectural design in LVLMs affects three hallucination types differently, with parameter scaling showing only limited benefits overall.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that hallucinations in large vision-language models arise from how the architecture is structured rather than from scale alone. It divides model design into linguistic foundation, visual representation, and semantic alignment, then splits hallucinations into co-occurrence, similarity, and uncertainty categories. Using a new benchmark with controlled perturbations, experiments across seven design choices map specific factors to specific error types. If correct, this means developers can target improvements more precisely instead of relying primarily on bigger models.

Core claim

Factorizing LVLM architecture into Linguistic Foundation, Visual Representation, and Semantic Alignment dimensions, and hallucinations into Co-occurrence, Similarity, and Uncertainty types, reveals that parameter scaling has limited effect on all three hallucination kinds; larger and better-trained language foundations reduce co-occurrence errors; stronger visual encoders and higher resolutions reduce similarity errors; effective alignment strategies reduce uncertainty errors; and jointly improving visual fidelity with alignment produces the broadest gains.

What carries the argument

The three-dimensional factorization of architecture (Linguistic Foundation, Visual Representation, Semantic Alignment) together with the CoSimUE benchmark that uses controlled textual and random perturbations to isolate hallucination behaviors.

If this is right

Developers should prioritize stronger language foundations when targeting co-occurrence hallucinations rather than uniform scaling.
Investments in visual encoder strength and input resolution will primarily address similarity-based errors.
Alignment techniques become the main lever for reducing uncertainty hallucinations.
Simultaneous upgrades to visual representation and alignment deliver improvements across more hallucination categories than single-dimension changes.
Future LVLM design can move away from blanket parameter increases toward dimension-specific optimizations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same factorization approach could be applied to other reliability issues such as bias or safety failures in multimodal models.
Training curricula might be redesigned to emphasize the dimension that most affects the dominant error type in a given application domain.
Hardware or data efficiency trade-offs could be re-evaluated once the relative value of each dimension is quantified.

Load-bearing premise

The benchmark's perturbations successfully isolate the effects of each architectural dimension on each hallucination type without major interference from training data or unmodeled interactions.

What would settle it

A controlled test in which a new LVLM varies only one dimension (for example, alignment strategy) while holding the others fixed and shows no corresponding change in the mapped hallucination type on the benchmark.

Figures

Figures reproduced from arXiv: 2605.30911 by Jiancheng Lv, Jizhe Zhou, Jun Luo, Xia Du, Yusheng He, Zheng Lin.

**Figure 2.** Figure 2: Overview of the CoSimUE pipeline. errors from weak grounding or visual similarity confusion. However, these taxonomies largely overlook uncertainty hallucinations. Existing works that relate hallucinations to model uncertainty primarily focus on predicting hallucinations based on model uncertainty and do not directly address the uncertainty inherent in the question itself [37], [38]. We therefore introdu… view at source ↗

**Figure 3.** Figure 3: Examples of different question categories evaluated across various baseline models. The text segments highlighted in red indicate specific instances [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Double-column width hallucination analysis. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

read the original abstract

Hallucination remains one of the key challenges undermining the reliability of Large Vision-Language Models (LVLMs). But what makes an LVLM hallucinate less? Many existing efforts focus on improving internal components of the model. We argue that hallucination fundamentally stems from how the model architecture is designed. To investigate this, we factor the architecture design into three dimensions: Linguistic Foundation (LF), Visual Representation (VR), and Semantic Alignment (SA), and categorize hallucinations into Co-occurrence, Similarity, and previously overlooked Uncertainty types. Building on this formulation, we propose CoSimUE, a benchmark that creates fine-grained hallucination scenarios through controlled textual perturbations and random perturbations, enabling mapping between design choices and hallucination behaviors. Experiments across 7 design aspects show that: 1) the widely emphasized scaling of model parameters has only limited impact on reducing all three types of hallucinations; 2) larger and better-trained language foundations can reduce co-occurrence hallucinations; 3) stronger visual encoders and higher resolutions mitigate similarity errors; 4) effective alignment strategies alleviate uncertainty hallucinations. 5) Furthermore, cross-dimensional analysis reveals that jointly enhancing visual fidelity and alignment quality yields the most comprehensive improvements. This study provides the first systematic exploration linking architecture-level design to hallucination robustness, offering practical guidance for developing reliable and efficient LVLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's three-way architecture split and new benchmark are reasonable ideas, but comparing off-the-shelf models without tight controls leaves the claimed mappings to hallucination types under-supported.

read the letter

The main thing to know is that this work factors LVLM design into Linguistic Foundation, Visual Representation, and Semantic Alignment, then ties those to co-occurrence, similarity, and uncertainty hallucinations through a perturbation-based benchmark called CoSimUE. The headline results say parameter scaling helps little overall, while stronger language foundations cut co-occurrence errors, better visual encoders and resolution cut similarity errors, and alignment cuts uncertainty errors, with the best gains from improving visuals and alignment together.

What is new is the explicit three-way factorization and the addition of uncertainty as a distinct hallucination category. The benchmark itself, built on controlled textual and random perturbations, is a concrete attempt to create test cases that link design choices to error types. That framing moves the discussion past blanket scaling and toward more targeted fixes, which is a step forward from many prior hallucination papers.

The soft spots sit in the experimental setup. The claims rest on differences observed across existing models, yet those models vary simultaneously in data, training procedures, and multiple architectural pieces. Without ablations that hold everything else fixed or clear documentation of model selection and statistical controls, it is difficult to attribute reductions cleanly to one dimension. The stress-test concern about residual correlations and perturbation artifacts is fair on the evidence shown; the abstract gives no details on how the perturbations were validated or whether they avoid introducing new failure modes. That makes the specific mappings suggestive rather than conclusive.

This paper is aimed at researchers working on LVLM reliability who want practical architecture guidance. A reader already running similar benchmarks could extract useful hypotheses, but would treat the current results as preliminary. It deserves peer review because the topic is relevant and the proposed framework is coherent, even though the current experiments need tighter controls and more transparent reporting to carry the weight of the conclusions.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that hallucination in LVLMs fundamentally stems from architectural design, which it factors into three dimensions—Linguistic Foundation (LF), Visual Representation (VR), and Semantic Alignment (SA)—while categorizing hallucinations into Co-occurrence, Similarity, and Uncertainty types. It introduces the CoSimUE benchmark, which generates fine-grained scenarios via controlled textual perturbations and random perturbations to map design choices to hallucination behaviors. Experiments across 7 design aspects conclude that parameter scaling has limited impact on all three hallucination types, larger/better-trained LFs reduce co-occurrence hallucinations, stronger visual encoders and higher resolutions mitigate similarity errors, effective alignment strategies alleviate uncertainty hallucinations, and jointly enhancing VR and SA yields the most comprehensive improvements. The work positions itself as the first systematic exploration linking architecture-level design to hallucination robustness.

Significance. If the isolation of effects holds, the result would be significant for the field: it supplies the first explicit mapping from the three architectural dimensions to distinct hallucination types and offers concrete, actionable guidance on design choices (e.g., prioritizing visual fidelity plus alignment over raw parameter scaling). The CoSimUE benchmark and the three-way hallucination taxonomy are reusable contributions that could shape subsequent LVLM robustness studies.

major comments (2)

[Abstract / experimental results section] Abstract and experimental results section: the headline attributions (larger LF reduces co-occurrence hallucinations; stronger VR mitigates similarity errors; effective SA alleviates uncertainty hallucinations) rest on the premise that the 7 design aspects and CoSimUE perturbations cleanly vary only the targeted factor. The manuscript provides no details on model selection criteria, statistical tests, controls for training-data or optimization confounds, or exact perturbation implementations, leaving open the possibility that observed differences arise from correlated model differences rather than the claimed architectural dimensions.
[CoSimUE benchmark description (likely §3)] CoSimUE benchmark description (likely §3): the controlled textual and random perturbations are asserted to isolate LF/VR/SA effects on the three hallucination types. Because publicly available LVLMs differ simultaneously in multiple components, training corpora, and optimization procedures, residual correlations may remain; without explicit ablation or sensitivity analysis showing that perturbation artifacts do not introduce new failure modes absent from natural inputs, the mapping from design dimension to hallucination type is not yet load-bearing.

minor comments (1)

[Introduction / taxonomy section] The three hallucination categories are introduced without a formal definition or illustrative examples in the abstract; a short table or figure in the main text would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the two major comments below by clarifying our experimental controls and committing to additions that strengthen the isolation of effects.

read point-by-point responses

Referee: [Abstract / experimental results section] Abstract and experimental results section: the headline attributions (larger LF reduces co-occurrence hallucinations; stronger VR mitigates similarity errors; effective SA alleviates uncertainty hallucinations) rest on the premise that the 7 design aspects and CoSimUE perturbations cleanly vary only the targeted factor. The manuscript provides no details on model selection criteria, statistical tests, controls for training-data or optimization confounds, or exact perturbation implementations, leaving open the possibility that observed differences arise from correlated model differences rather than the claimed architectural dimensions.

Authors: We agree that the manuscript would benefit from greater transparency on these points. In the revision we will add a new subsection detailing model selection criteria, including the specific checkpoints chosen to hold VR and SA fixed while varying LF (and analogously for the other dimensions). We will also report statistical tests (paired t-tests and effect sizes) for all headline comparisons. On training-data and optimization confounds, we will explicitly discuss that the selected public models share comparable pre-training corpora and that we controlled for parameter count where possible; any residual correlations will be acknowledged as a limitation. Exact perturbation code and hyper-parameters will be moved to the appendix with pseudocode. revision: yes
Referee: [CoSimUE benchmark description (likely §3)] CoSimUE benchmark description (likely §3): the controlled textual and random perturbations are asserted to isolate LF/VR/SA effects on the three hallucination types. Because publicly available LVLMs differ simultaneously in multiple components, training corpora, and optimization procedures, residual correlations may remain; without explicit ablation or sensitivity analysis showing that perturbation artifacts do not introduce new failure modes absent from natural inputs, the mapping from design dimension to hallucination type is not yet load-bearing.

Authors: We acknowledge the difficulty of perfect isolation with public checkpoints. In the revision we will insert an ablation subsection that (i) compares CoSimUE outputs against unmodified natural-image captions to quantify any introduced failure modes and (ii) performs sensitivity sweeps over perturbation strength. These analyses will be used to argue that the observed mappings are not artifacts of the benchmark construction. While we cannot retroactively retrain models to eliminate all correlations, the added controls and ablations will make the current evidence load-bearing. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmarking of existing models

full rationale

This is a purely empirical study that selects publicly available LVLMs, varies their architectural dimensions (LF, VR, SA) via existing checkpoints, applies the CoSimUE benchmark, and reports observed differences in hallucination rates. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing claims. The mapping from design choices to hallucination types rests on experimental variation rather than any definitional or self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claims rest on the validity of the introduced categorization of hallucinations and architecture dimensions plus the assumption that the new benchmark isolates causal effects of those dimensions.

axioms (2)

domain assumption LVLMs can be factored into Linguistic Foundation (LF), Visual Representation (VR), and Semantic Alignment (SA) dimensions.
This factorization structures the entire analysis of design choices.
domain assumption Hallucinations in LVLMs fall into Co-occurrence, Similarity, and Uncertainty types.
This categorization enables mapping design choices to specific behaviors.

invented entities (1)

CoSimUE benchmark no independent evidence
purpose: Creates fine-grained hallucination scenarios through controlled textual perturbations and random perturbations to enable mapping between design choices and hallucination behaviors.
Newly proposed evaluation tool central to the experiments.

pith-pipeline@v0.9.1-grok · 5784 in / 1399 out tokens · 33426 ms · 2026-06-28T23:20:21.951857+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

51 extracted references · 15 canonical work pages · 9 internal anchors

[1]

Unveiling hallucination in text, image, video, and audio foundation models: A comprehensive survey,

P. Sahoo, P. Meharia, A. Ghosh, S. Saha, V . Jain, and A. Chadha, “Unveiling hallucination in text, image, video, and audio foundation models: A comprehensive survey,”CoRR, vol. abs/2405.09589, 2024

work page arXiv 2024
[2]

Palm: Scal- ing language modeling with pathways,

A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmannet al., “Palm: Scal- ing language modeling with pathways,”Journal of Machine Learning Research, pp. 1–113, 2023

2023
[3]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azharet al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Language mod- els are few-shot learners,

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language mod- els are few-shot learners,”Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

1901
[5]

Visual instruction tuning,

H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,” Advances in neural information processing systems, vol. 36, pp. 34 892– 34 916, 2023

2023
[6]

Llava-next: Improved reasoning, ocr, and world knowledge,

H. Liu, C. Li, Y . Li, B. Li, Y . Zhang, S. Shen, and Y . J. Lee, “Llava-next: Improved reasoning, ocr, and world knowledge,” January 2024

2024
[7]

BLIP-2: bootstrapping language- image pre-training with frozen image encoders and large language models,

J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: bootstrapping language- image pre-training with frozen image encoders and large language models,” inICML, 2023

2023
[8]

Instructblip: Towards general-purpose vision- language models with instruction tuning,

W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi, “Instructblip: Towards general-purpose vision- language models with instruction tuning,” 2023

2023
[9]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond,”arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,”arXiv preprint arXiv:2304.10592, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

Minigpt-5: Interleaved vision-and- language generation via generative vokens,

K. Zheng, X. He, and X. E. Wang, “Minigpt-5: Interleaved vision-and- language generation via generative vokens,” 2023

2023
[12]

mplug-owl: Modularization empowers large language models with multimodality,

Q. Ye, H. Xu, G. Xu, J. Ye, M. Yan, Y . Zhou, J. Wang, A. Hu, P. Shi, Y . Shi, C. Jiang, C. Li, Y . Xu, H. Chen, J. Tian, Q. Qi, J. Zhang, and F. Huang, “mplug-owl: Modularization empowers large language models with multimodality,” 2023

2023
[13]

mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration,

Q. Ye, H. Xu, J. Ye, M. Yan, A. Hu, H. Liu, Q. Qian, J. Zhang, F. Huang, and J. Zhou, “mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration,” 2023

2023
[14]

mplug-owl3: Towards long image-sequence understanding in multi-modal large language models,

J. Ye, H. Xu, H. Liu, A. Hu, M. Yan, Q. Qian, J. Zhang, F. Huang, and J. Zhou, “mplug-owl3: Towards long image-sequence understanding in multi-modal large language models,” 2024

2024
[15]

Qwen3 Technical Report

Q. Team, “Qwen3 technical report,”arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Introducing gpt-5,

OpenAI, “Introducing gpt-5,” 2025

2025
[17]

Introducing claude sonnet 4.5,

anthropic Team, “Introducing claude sonnet 4.5,” 2025

2025
[18]

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,

G. Gemini Team, “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” 2025

2025
[19]

Glm-4.5: Agentic, reasoning, and coding (arc) foundation models,

G.-. Team, “Glm-4.5: Agentic, reasoning, and coding (arc) foundation models,” 2025

2025
[20]

Nullu: Mitigating object hallucinations in large vision-language models via halluspace projection,

L. Yang, Z. Zheng, B. Chen, Z. Zhao, C. Lin, and C. Shen, “Nullu: Mitigating object hallucinations in large vision-language models via halluspace projection,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025, pp. 14 635–14 645

2025
[21]

Opera: Alleviating hallucination in multi- modal large language models via over-trust penalty and retrospection- allocation,

Q. Huang, X. Dong, P. Zhang, B. Wang, C. He, J. Wang, D. Lin, W. Zhang, and N. Yu, “Opera: Alleviating hallucination in multi- modal large language models via over-trust penalty and retrospection- allocation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 13 418–13 427

2024
[22]

Agla: Mitigating object hallucinations in large vision-language models with assembly of global and local attention,

W. An, F. Tian, S. Leng, J. Nie, H. Lin, Q. Wang, G. Dai, P. Chen, and S. Lu, “Agla: Mitigating object hallucinations in large vision-language models with assembly of global and local attention,”arXiv preprint arXiv:2406.12718, 2024

work page arXiv 2024
[23]

R. S. Pressman,Software engineering: a practitioner’s approach. Pal- grave macmillan, 2005

2005
[24]

A review on large language models: Architectures, applications, taxonomies, open issues and challenges,

M. A. K. Raiaan, M. S. H. Mukta, K. Fatema, N. M. Fahad, S. Sakib, M. M. J. Mim, J. Ahmad, M. E. Ali, and S. Azam, “A review on large language models: Architectures, applications, taxonomies, open issues and challenges,”IEEE access, vol. 12, pp. 26 839–26 874, 2024

2024
[25]

Scaling Laws for Neural Language Models

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,”arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[26]

Vision-language models for vision tasks: A survey,

J. Zhang, J. Huang, S. Jin, and S. Lu, “Vision-language models for vision tasks: A survey,”IEEE transactions on pattern analysis and machine intelligence, pp. 5625–5644, 2024

2024
[27]

Improved baselines with visual instruction tuning,

H. Liu, C. Li, Y . Li, and Y . J. Lee, “Improved baselines with visual instruction tuning,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 26 296–26 306

2024
[28]

Mit- igating object hallucinations in large vision-language models through visual contrastive decoding,

S. Leng, H. Zhang, G. Chen, X. Li, S. Lu, C. Miao, and L. Bing, “Mit- igating object hallucinations in large vision-language models through visual contrastive decoding,”arXiv preprint arXiv:2311.16922, 2023

work page arXiv 2023
[29]

Why language models hallucinate,

A. T. Kalai, O. Nachum, S. S. Vempala, and E. Zhang, “Why language models hallucinate,” 2025

2025
[30]

Analyzing and mitigating object hallucination in large vision- language models,

Y . Zhou, C. Cui, J. Yoon, L. Zhang, Z. Deng, C. Finn, M. Bansal, and H. Yao, “Analyzing and mitigating object hallucination in large vision- language models,” inThe Twelfth International Conference on Learning Representations, 2024

2024
[31]

Evaluating object hallucination in large vision-language models,

Y . Li, Y . Du, K. Zhou, J. Wang, W. X. Zhao, and J.-R. Wen, “Evaluating object hallucination in large vision-language models,” 2023

2023
[32]

Detecting and preventing hallucinations in large vision language models,

A. Gunjal, J. Yin, and E. Bas, “Detecting and preventing hallucinations in large vision language models,” inProceedings of the AAAI Conference on Artificial Intelligence, 2024, pp. 18 135–18 143

2024
[33]

Hallucidoctor: Mitigating hallucinatory toxicity in visual instruction data,

Q. Yu, J. Li, L. Wei, L. Pang, W. Ye, B. Qin, S. Tang, Q. Tian, and Y . Zhuang, “Hallucidoctor: Mitigating hallucinatory toxicity in visual instruction data,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 12 944–12 953

2024
[34]

Mitigating object hallucination in large vision-language models via image-grounded guidance,

L. Zhao, Y . Deng, W. Zhang, and Q. Gu, “Mitigating object hallucination in large vision-language models via image-grounded guidance,”arXiv preprint arXiv:2402.08680, 2024. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 11

work page arXiv 2024
[35]

Eyes wide shut? exploring the visual shortcomings of multimodal llms,

S. Tong, Z. Liu, Y . Zhai, Y . Ma, Y . LeCun, and S. Xie, “Eyes wide shut? exploring the visual shortcomings of multimodal llms,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9568–9578

2024
[36]

Why Language Models Hallucinate

A. T. Kalai, O. Nachum, S. S. Vempala, and E. Zhang, “Why language models hallucinate,”arXiv preprint arXiv:2509.04664, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Estimating the hallucination rate of generative ai,

A. Jesson, N. Beltran-Velez, Q. Chu, S. Karlekar, J. Kossen, Y . Gal, J. P. Cunningham, and D. Blei, “Estimating the hallucination rate of generative ai,”Advances in Neural Information Processing Systems, vol. 37, pp. 31 154–31 201, 2024

2024
[38]

Deep compositional captioning: Describing novel object categories without paired training data,

L. A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, and T. Darrell, “Deep compositional captioning: Describing novel object categories without paired training data,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1–10

2016
[39]

Object Hallucination in Image Captioning

A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, and K. Saenko, “Object hallucination in image captioning,”arXiv preprint arXiv:1809.02156, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[40]

Mme: A comprehensive evaluation benchmark for multimodal large language models,

C. Fu, P. Chen, Y . Shen, Y . Qin, M. Zhang, X. Lin, J. Yang, X. Zheng, K. Li, X. Sun, Y . Wu, and R. Ji, “Mme: A comprehensive evaluation benchmark for multimodal large language models,” 2024

2024
[41]

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

F. Liu, K. Lin, L. Li, J. Wang, Y . Yacoob, and L. Wang, “Mitigating hallucination in large multi-modal models via robust instruction tuning,” arXiv preprint arXiv:2306.14565, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

Visual hallucinations of multi-modal large language models,

W. Huang, H. Liu, M. Guo, and N. Z. Gong, “Visual hallucinations of multi-modal large language models,”arXiv preprint arXiv:2402.14683, 2024

work page arXiv 2024
[43]

Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models,

T. Guan, F. Liu, X. Wu, R. Xian, Z. Li, X. Liu, X. Wang, L. Chen, F. Huang, Y . Yacoobet al., “Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14 375–14 385

2024
[44]

Autohallusion: Automatic generation of hallucination benchmarks for vision-language models,

X. Wu, T. Guan, D. Li, S. Huang, X. Liu, X. Wang, R. Xian, A. Shrivastava, F. Huang, J. L. Boyd-Graber, T. Zhou, and D. Manocha, “Autohallusion: Automatic generation of hallucination benchmarks for vision-language models,” 2024

2024
[45]

Holistic analysis of hallucination in gpt-4v (ision): Bias and interference chal- lenges,

C. Cui, Y . Zhou, X. Yang, S. Wu, L. Zhang, J. Zou, and H. Yao, “Holistic analysis of hallucination in gpt-4v (ision): Bias and interference chal- lenges,”arXiv preprint arXiv:2311.03287, 2023

work page arXiv 2023
[46]

Microsoft coco: Common objects in con- text,

T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll´ar, and C. L. Zitnick, “Microsoft coco: Common objects in con- text,” inComputer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer International Publishing, 2014, pp. 740–755

2014
[47]

B. F. Labs, “Flux,” 2024

2024
[48]

Gans trained by a two time-scale update rule converge to a local nash equilibrium,

M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,”Advances in neural information processing systems, vol. 30, 2017

2017
[49]

Judging llm-as-a-judge with mt-bench and chatbot arena,

L. Zheng, W.-L. Chiang, Y . Sheng, S. Zhuang, Z. Wu, Y . Zhuang, Z. Lin, Z. Li, D. Li, E. Xinget al., “Judging llm-as-a-judge with mt-bench and chatbot arena,”Advances in neural information processing systems, vol. 36, pp. 46 595–46 623, 2023

2023
[50]

Mistral 7B

A. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. Chaplot, D. de Las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnieret al., “Mistral 7b. corr, abs/2310.06825, 2023. doi: 10.48550,”arXiv preprint ARXIV .2310.06825, vol. 10, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[51]

Learning transferable visual models from natural language supervi- sion,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervi- sion,” 2021

2021

[1] [1]

Unveiling hallucination in text, image, video, and audio foundation models: A comprehensive survey,

P. Sahoo, P. Meharia, A. Ghosh, S. Saha, V . Jain, and A. Chadha, “Unveiling hallucination in text, image, video, and audio foundation models: A comprehensive survey,”CoRR, vol. abs/2405.09589, 2024

work page arXiv 2024

[2] [2]

Palm: Scal- ing language modeling with pathways,

A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmannet al., “Palm: Scal- ing language modeling with pathways,”Journal of Machine Learning Research, pp. 1–113, 2023

2023

[3] [3]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azharet al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

Language mod- els are few-shot learners,

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language mod- els are few-shot learners,”Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

1901

[5] [5]

Visual instruction tuning,

H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,” Advances in neural information processing systems, vol. 36, pp. 34 892– 34 916, 2023

2023

[6] [6]

Llava-next: Improved reasoning, ocr, and world knowledge,

H. Liu, C. Li, Y . Li, B. Li, Y . Zhang, S. Shen, and Y . J. Lee, “Llava-next: Improved reasoning, ocr, and world knowledge,” January 2024

2024

[7] [7]

BLIP-2: bootstrapping language- image pre-training with frozen image encoders and large language models,

J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: bootstrapping language- image pre-training with frozen image encoders and large language models,” inICML, 2023

2023

[8] [8]

Instructblip: Towards general-purpose vision- language models with instruction tuning,

W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi, “Instructblip: Towards general-purpose vision- language models with instruction tuning,” 2023

2023

[9] [9]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond,”arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[10] [10]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,”arXiv preprint arXiv:2304.10592, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[11] [11]

Minigpt-5: Interleaved vision-and- language generation via generative vokens,

K. Zheng, X. He, and X. E. Wang, “Minigpt-5: Interleaved vision-and- language generation via generative vokens,” 2023

2023

[12] [12]

mplug-owl: Modularization empowers large language models with multimodality,

Q. Ye, H. Xu, G. Xu, J. Ye, M. Yan, Y . Zhou, J. Wang, A. Hu, P. Shi, Y . Shi, C. Jiang, C. Li, Y . Xu, H. Chen, J. Tian, Q. Qi, J. Zhang, and F. Huang, “mplug-owl: Modularization empowers large language models with multimodality,” 2023

2023

[13] [13]

mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration,

Q. Ye, H. Xu, J. Ye, M. Yan, A. Hu, H. Liu, Q. Qian, J. Zhang, F. Huang, and J. Zhou, “mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration,” 2023

2023

[14] [14]

mplug-owl3: Towards long image-sequence understanding in multi-modal large language models,

J. Ye, H. Xu, H. Liu, A. Hu, M. Yan, Q. Qian, J. Zhang, F. Huang, and J. Zhou, “mplug-owl3: Towards long image-sequence understanding in multi-modal large language models,” 2024

2024

[15] [15]

Qwen3 Technical Report

Q. Team, “Qwen3 technical report,”arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[16] [16]

Introducing gpt-5,

OpenAI, “Introducing gpt-5,” 2025

2025

[17] [17]

Introducing claude sonnet 4.5,

anthropic Team, “Introducing claude sonnet 4.5,” 2025

2025

[18] [18]

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,

G. Gemini Team, “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” 2025

2025

[19] [19]

Glm-4.5: Agentic, reasoning, and coding (arc) foundation models,

G.-. Team, “Glm-4.5: Agentic, reasoning, and coding (arc) foundation models,” 2025

2025

[20] [20]

Nullu: Mitigating object hallucinations in large vision-language models via halluspace projection,

L. Yang, Z. Zheng, B. Chen, Z. Zhao, C. Lin, and C. Shen, “Nullu: Mitigating object hallucinations in large vision-language models via halluspace projection,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025, pp. 14 635–14 645

2025

[21] [21]

Opera: Alleviating hallucination in multi- modal large language models via over-trust penalty and retrospection- allocation,

Q. Huang, X. Dong, P. Zhang, B. Wang, C. He, J. Wang, D. Lin, W. Zhang, and N. Yu, “Opera: Alleviating hallucination in multi- modal large language models via over-trust penalty and retrospection- allocation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 13 418–13 427

2024

[22] [22]

Agla: Mitigating object hallucinations in large vision-language models with assembly of global and local attention,

W. An, F. Tian, S. Leng, J. Nie, H. Lin, Q. Wang, G. Dai, P. Chen, and S. Lu, “Agla: Mitigating object hallucinations in large vision-language models with assembly of global and local attention,”arXiv preprint arXiv:2406.12718, 2024

work page arXiv 2024

[23] [23]

R. S. Pressman,Software engineering: a practitioner’s approach. Pal- grave macmillan, 2005

2005

[24] [24]

A review on large language models: Architectures, applications, taxonomies, open issues and challenges,

M. A. K. Raiaan, M. S. H. Mukta, K. Fatema, N. M. Fahad, S. Sakib, M. M. J. Mim, J. Ahmad, M. E. Ali, and S. Azam, “A review on large language models: Architectures, applications, taxonomies, open issues and challenges,”IEEE access, vol. 12, pp. 26 839–26 874, 2024

2024

[25] [25]

Scaling Laws for Neural Language Models

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,”arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001

[26] [26]

Vision-language models for vision tasks: A survey,

J. Zhang, J. Huang, S. Jin, and S. Lu, “Vision-language models for vision tasks: A survey,”IEEE transactions on pattern analysis and machine intelligence, pp. 5625–5644, 2024

2024

[27] [27]

Improved baselines with visual instruction tuning,

H. Liu, C. Li, Y . Li, and Y . J. Lee, “Improved baselines with visual instruction tuning,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 26 296–26 306

2024

[28] [28]

Mit- igating object hallucinations in large vision-language models through visual contrastive decoding,

S. Leng, H. Zhang, G. Chen, X. Li, S. Lu, C. Miao, and L. Bing, “Mit- igating object hallucinations in large vision-language models through visual contrastive decoding,”arXiv preprint arXiv:2311.16922, 2023

work page arXiv 2023

[29] [29]

Why language models hallucinate,

A. T. Kalai, O. Nachum, S. S. Vempala, and E. Zhang, “Why language models hallucinate,” 2025

2025

[30] [30]

Analyzing and mitigating object hallucination in large vision- language models,

Y . Zhou, C. Cui, J. Yoon, L. Zhang, Z. Deng, C. Finn, M. Bansal, and H. Yao, “Analyzing and mitigating object hallucination in large vision- language models,” inThe Twelfth International Conference on Learning Representations, 2024

2024

[31] [31]

Evaluating object hallucination in large vision-language models,

Y . Li, Y . Du, K. Zhou, J. Wang, W. X. Zhao, and J.-R. Wen, “Evaluating object hallucination in large vision-language models,” 2023

2023

[32] [32]

Detecting and preventing hallucinations in large vision language models,

A. Gunjal, J. Yin, and E. Bas, “Detecting and preventing hallucinations in large vision language models,” inProceedings of the AAAI Conference on Artificial Intelligence, 2024, pp. 18 135–18 143

2024

[33] [33]

Hallucidoctor: Mitigating hallucinatory toxicity in visual instruction data,

Q. Yu, J. Li, L. Wei, L. Pang, W. Ye, B. Qin, S. Tang, Q. Tian, and Y . Zhuang, “Hallucidoctor: Mitigating hallucinatory toxicity in visual instruction data,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 12 944–12 953

2024

[34] [34]

Mitigating object hallucination in large vision-language models via image-grounded guidance,

L. Zhao, Y . Deng, W. Zhang, and Q. Gu, “Mitigating object hallucination in large vision-language models via image-grounded guidance,”arXiv preprint arXiv:2402.08680, 2024. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 11

work page arXiv 2024

[35] [35]

Eyes wide shut? exploring the visual shortcomings of multimodal llms,

S. Tong, Z. Liu, Y . Zhai, Y . Ma, Y . LeCun, and S. Xie, “Eyes wide shut? exploring the visual shortcomings of multimodal llms,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9568–9578

2024

[36] [36]

Why Language Models Hallucinate

A. T. Kalai, O. Nachum, S. S. Vempala, and E. Zhang, “Why language models hallucinate,”arXiv preprint arXiv:2509.04664, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [37]

Estimating the hallucination rate of generative ai,

A. Jesson, N. Beltran-Velez, Q. Chu, S. Karlekar, J. Kossen, Y . Gal, J. P. Cunningham, and D. Blei, “Estimating the hallucination rate of generative ai,”Advances in Neural Information Processing Systems, vol. 37, pp. 31 154–31 201, 2024

2024

[38] [38]

Deep compositional captioning: Describing novel object categories without paired training data,

L. A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, and T. Darrell, “Deep compositional captioning: Describing novel object categories without paired training data,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1–10

2016

[39] [39]

Object Hallucination in Image Captioning

A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, and K. Saenko, “Object hallucination in image captioning,”arXiv preprint arXiv:1809.02156, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[40] [40]

Mme: A comprehensive evaluation benchmark for multimodal large language models,

C. Fu, P. Chen, Y . Shen, Y . Qin, M. Zhang, X. Lin, J. Yang, X. Zheng, K. Li, X. Sun, Y . Wu, and R. Ji, “Mme: A comprehensive evaluation benchmark for multimodal large language models,” 2024

2024

[41] [41]

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

F. Liu, K. Lin, L. Li, J. Wang, Y . Yacoob, and L. Wang, “Mitigating hallucination in large multi-modal models via robust instruction tuning,” arXiv preprint arXiv:2306.14565, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[42] [42]

Visual hallucinations of multi-modal large language models,

W. Huang, H. Liu, M. Guo, and N. Z. Gong, “Visual hallucinations of multi-modal large language models,”arXiv preprint arXiv:2402.14683, 2024

work page arXiv 2024

[43] [43]

Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models,

T. Guan, F. Liu, X. Wu, R. Xian, Z. Li, X. Liu, X. Wang, L. Chen, F. Huang, Y . Yacoobet al., “Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14 375–14 385

2024

[44] [44]

Autohallusion: Automatic generation of hallucination benchmarks for vision-language models,

X. Wu, T. Guan, D. Li, S. Huang, X. Liu, X. Wang, R. Xian, A. Shrivastava, F. Huang, J. L. Boyd-Graber, T. Zhou, and D. Manocha, “Autohallusion: Automatic generation of hallucination benchmarks for vision-language models,” 2024

2024

[45] [45]

Holistic analysis of hallucination in gpt-4v (ision): Bias and interference chal- lenges,

C. Cui, Y . Zhou, X. Yang, S. Wu, L. Zhang, J. Zou, and H. Yao, “Holistic analysis of hallucination in gpt-4v (ision): Bias and interference chal- lenges,”arXiv preprint arXiv:2311.03287, 2023

work page arXiv 2023

[46] [46]

Microsoft coco: Common objects in con- text,

T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll´ar, and C. L. Zitnick, “Microsoft coco: Common objects in con- text,” inComputer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer International Publishing, 2014, pp. 740–755

2014

[47] [47]

B. F. Labs, “Flux,” 2024

2024

[48] [48]

Gans trained by a two time-scale update rule converge to a local nash equilibrium,

M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,”Advances in neural information processing systems, vol. 30, 2017

2017

[49] [49]

Judging llm-as-a-judge with mt-bench and chatbot arena,

L. Zheng, W.-L. Chiang, Y . Sheng, S. Zhuang, Z. Wu, Y . Zhuang, Z. Lin, Z. Li, D. Li, E. Xinget al., “Judging llm-as-a-judge with mt-bench and chatbot arena,”Advances in neural information processing systems, vol. 36, pp. 46 595–46 623, 2023

2023

[50] [50]

Mistral 7B

A. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. Chaplot, D. de Las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnieret al., “Mistral 7b. corr, abs/2310.06825, 2023. doi: 10.48550,”arXiv preprint ARXIV .2310.06825, vol. 10, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[51] [51]

Learning transferable visual models from natural language supervi- sion,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervi- sion,” 2021

2021