pith. machine review for the scientific record.

arxiv: 2604.17982 · v1 · submitted 2026-04-20 · 💻 cs.CV · cs.CL

Recognition: unknown

Mitigating Multimodal Hallucination via Phase-wise Self-reward

Chuyang Sun, Kehai Chen, Min Zhang, Xuefeng Bai, Yang Xiang, Yu Zhang

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:06 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords multimodal hallucination · self-reward decoding · vision-language models · inference-time mitigation · phase-wise patterns · LLaVA

The pith

Phase-wise self-reward decoding halves hallucination rates in large vision-language models at inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that hallucinations in vision-language models follow distinct patterns, spiking at the beginning of each new semantic phase in the generated response. The method uses the model itself to generate reward signals that detect these spikes, then intervenes during decoding to correct them on the fly, with no need for extra training data or external models. The result is a substantial reduction in errors across multiple benchmarks and models.

Core claim

The authors establish that visual hallucinations exhibit phase-wise dynamic patterns peaking at the onset of each semantic phase, and they introduce PSRD, which distills self-reward signals into a lightweight model to guide precise interventions during decoding, achieving online hallucination correction without external supervision.

What carries the argument

Phase-wise self-reward decoding, which distills hallucination guidance from the LVLM into a lightweight reward model for on-the-fly targeted intervention at semantic phase onsets.
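
The section above names the machinery without spelling out the distillation step. As a hedged illustration only: assuming the LVLM's slow self-evaluation yields a scalar pseudo-label per generated segment, a small head over pooled hidden states could be regressed onto those labels. Every name, dimension, and loss below is an editorial placeholder, not the authors' design.

```python
# Editorial sketch of self-reward distillation; not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightRewardHead(nn.Module):
    """Tiny scorer standing in for the paper's 'lightweight reward model'."""
    def __init__(self, hidden_dim: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim) pooled LVLM hidden states for a segment
        return self.net(h).squeeze(-1)  # (batch,) scalar rewards

def distill_step(head, optimizer, hidden_states, self_reward_labels):
    """One distillation step: regress the cheap head onto pseudo-labels
    produced by the LVLM's expensive self-evaluation."""
    loss = F.mse_loss(head(hidden_states), self_reward_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Once trained, the head replaces repeated self-evaluation calls during decoding, which is where the abstract's efficiency claim comes from.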

If this is right

  • Reduces the hallucination rate of LLaVA-1.5-7B by 50.0%.
  • Outperforms existing post-hoc methods on five benchmarks for four different LVLMs.
  • Effectively mitigates hallucination propagation during generation.
  • Achieves controllable trade-off between performance and inference efficiency.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar phase detection could apply to text-only large language models to reduce factual errors.
  • Integrating this with fine-tuning might further improve base model reliability.
  • Testing on more diverse visual inputs could reveal if the phase patterns hold across domains.

Load-bearing premise

Hallucinations in the model's outputs follow identifiable patterns linked to the start of new semantic phases, and the model can generate reliable self-reward signals to detect and fix them without outside help.
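
The simulated rebuttal below says phase boundaries are identified from the model's internal generation dynamics (e.g., semantic shift indicators from token probabilities). A minimal sketch of one such detector, assuming an entropy-spike heuristic; the z-score threshold is an editorial free parameter, not the paper's criterion.

```python
# Editorial sketch: flag semantic phase onsets as predictive-entropy spikes.
import torch
import torch.nn.functional as F

def entropy_spike_onsets(step_logits: torch.Tensor, z_thresh: float = 1.5):
    """step_logits: (T, vocab) next-token logits collected while decoding.
    Returns the step indices whose entropy z-score exceeds the threshold,
    a crude proxy for the start of a new semantic phase."""
    probs = F.softmax(step_logits, dim=-1)
    ent = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)  # (T,)
    mu, sigma = ent.mean(), ent.std().clamp_min(1e-6)
    return [t for t in range(len(ent)) if (ent[t] - mu) / sigma > z_thresh]
```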

What would settle it

Measuring hallucination rates on the same benchmarks after applying PSRD and finding no significant reduction relative to baselines would disprove the effectiveness of the phase-wise intervention.
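
That measurement is mechanically simple. Below is a hedged sketch of a CHAIR-style object hallucination rate, the kind of metric such benchmarks report; synonym handling, multi-word objects, and each benchmark's exact protocol are omitted.

```python
# Editorial sketch of a CHAIR-style hallucination rate, not a benchmark's
# official scorer: hallucinated object mentions / all object mentions.
def hallucination_rate(captions, gt_objects, vocabulary):
    """captions   : list of generated strings, one per image
    gt_objects : list of sets of ground-truth object names per image
    vocabulary : set of object names to scan for in captions"""
    mentioned = hallucinated = 0
    for text, truth in zip(captions, gt_objects):
        words = set(text.lower().split())
        for obj in vocabulary & words:
            mentioned += 1
            if obj not in truth:
                hallucinated += 1
    return hallucinated / max(mentioned, 1)
```

Comparing this rate for baseline decoding versus PSRD decoding on identical images, with a significance test over images, is the settling experiment.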

Figures

Figures reproduced from arXiv: 2604.17982 by Chuyang Sun, Kehai Chen, Min Zhang, Xuefeng Bai, Yang Xiang, Yu Zhang.

Figure 1. Illustration of the proposed PSRD framework. PSRD first activates the intrinsic hallucination discrimination capacity of LVLMs through the uncertainty signals to train a lightweight phase-wise reward model. Then the reward model monitors the response online to provide on-the-fly reward signals, enabling dynamic, targeted intervention during the decoding process.
Figure 2. Characterization of dynamic hallucination patterns.
Figure 3. Quantitative results of phase-specific hallucination.
Figure 4. Distribution of reward model output scores under …
Figure 5. Trade-off between hallucination mitigation effec…
Figure 7. Relative fluency / perplexity change under different …
Figure 8. Justification of the default loss weights by magnitude balancing at the early training stage. The three curves correspond …
Figure 9. Distribution of pseudo-label confidence scores pro…
Original abstract

Large Vision-Language Models (LVLMs) still struggle with vision hallucination, where generated responses are inconsistent with the visual input. Existing methods either rely on large-scale annotated data for fine-tuning, which incurs massive computational overhead, or employ static post-hoc strategies that overlook the dynamic nature of hallucination emergence. To address these, we introduce a new self-rewarding framework, enabling dynamic hallucination mitigation at inference time without external supervision. On the empirical side, we reveal that visual hallucination exhibits phase-wise dynamic patterns, peaking at the onset of each semantic phase. Drawing on these insights, we propose PSRD (Phase-wise Self-Reward Decoding) for online hallucination correction guided by phase-wise self-reward signals. To reduce the cost of repeated self-evaluation during decoding, we distill the hallucination guidance signal from LVLMs into a lightweight reward model. The reward model subsequently provides on-the-fly guidance for targeted intervention during the decoding process, enabling precise hallucination suppression. The proposed PSRD significantly reduces the hallucination rate of LLaVA-1.5-7B by 50.0% and consistently outperforms existing post-hoc methods across five hallucination evaluation benchmarks for four LVLMs. Further analysis confirms that PSRD effectively mitigates hallucination propagation and achieves a highly controllable trade-off between strong performance and inference efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces PSRD, a phase-wise self-reward decoding framework for mitigating visual hallucinations in Large Vision-Language Models (LVLMs) during inference without external supervision. It observes that hallucinations exhibit phase-wise dynamic patterns peaking at semantic phase onsets, distills self-reward signals into a lightweight reward model for efficient online correction, and reports a 50% reduction in hallucination rate for LLaVA-1.5-7B along with consistent outperformance over post-hoc methods on five benchmarks for four LVLMs.

Significance. If the empirical results hold under rigorous validation, this would represent a meaningful advance by offering an unsupervised inference-time approach that exploits the dynamic emergence of hallucinations rather than relying on static post-hoc fixes or expensive fine-tuning. The distillation to a lightweight reward model is a practical strength for deployment efficiency, and the phase-wise analysis could stimulate further work on temporal patterns in multimodal generation errors.

major comments (3)
  1. Abstract: The central claim of a 50.0% reduction in hallucination rate for LLaVA-1.5-7B (and outperformance across five benchmarks) is reported without the baseline hallucination rate, exact evaluation protocol, statistical significance tests, or controls for potential data leakage in phase boundary detection, leaving the magnitude and reliability of the gains under-supported.
  2. Method section (self-reward signal derivation): The self-reward is generated by the same LVLM whose outputs are being corrected; the manuscript provides no independent verification (e.g., correlation analysis against ground-truth visual inconsistencies or external image features) that the signal reliably flags hallucinations rather than model-internal biases or overconfidence.
  3. Experiments (phase-wise patterns and distillation): The phase boundary detection threshold and reward model distillation hyperparameters are free parameters; without sensitivity analysis or ablations showing that the reported 50% reduction and benchmark wins are robust to their settings, the central claim risks being an artifact of the specific evaluation setup.
minor comments (2)
  1. Abstract: The phrase 'highly controllable trade-off between strong performance and inference efficiency' is stated without specifying the control parameters or quantitative efficiency metrics used to demonstrate controllability.
  2. Overall: The manuscript would benefit from explicit pseudocode or a diagram for the online intervention step during decoding to clarify how the distilled reward model intervenes at phase onsets.
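
On minor comment 2: as an editorial illustration only, not the authors' algorithm, the online intervention step might take roughly the following shape, reranking the top-k candidates by LM log-probability plus the distilled reward whenever a phase onset is flagged. The HuggingFace-style `model(ids).logits` interface, the `phase_onset` callable, and the weight `alpha` are all assumptions.

```python
# Editorial sketch of a reward-guided intervention step; not PSRD itself.
import torch
import torch.nn.functional as F

@torch.no_grad()
def decode_with_intervention(model, reward_model, input_ids, phase_onset,
                             max_new_tokens=128, top_k=5, alpha=1.0):
    """Greedy decoding, except at flagged semantic phase onsets, where the
    top-k candidates are reranked by log-prob plus the distilled reward.
    `alpha` is one plausible knob behind the performance/efficiency
    trade-off the paper claims."""
    ids = input_ids
    for _ in range(max_new_tokens):
        logits = model(ids).logits[:, -1, :]  # (1, vocab) next-token logits
        if phase_onset(ids):
            logp, cand = F.log_softmax(logits, dim=-1).topk(top_k, dim=-1)
            scores = []
            for j in range(top_k):
                cand_ids = torch.cat([ids, cand[:, j:j + 1]], dim=-1)
                scores.append(logp[0, j] + alpha * reward_model(cand_ids))
            best = int(torch.stack(scores).argmax())
            next_tok = cand[:, best:best + 1]
        else:
            next_tok = logits.argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_tok], dim=-1)
    return ids
```

In a real system `reward_model` would consume cached hidden states rather than re-encoding the prefix; that detail, and stopping at EOS, are elided here.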

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We have carefully reviewed each major point and provide point-by-point responses below, indicating where revisions have been made to the manuscript to improve clarity, support for claims, and robustness.

point-by-point responses
  1. Referee: Abstract: The central claim of a 50.0% reduction in hallucination rate for LLaVA-1.5-7B (and outperformance across five benchmarks) is reported without the baseline hallucination rate, exact evaluation protocol, statistical significance tests, or controls for potential data leakage in phase boundary detection, leaving the magnitude and reliability of the gains under-supported.

    Authors: We agree that the abstract would benefit from greater specificity to better substantiate the central claims. In the revised manuscript, we have updated the abstract to include the baseline hallucination rate for LLaVA-1.5-7B, a concise description of the evaluation protocol (including the five benchmarks and metrics), reference to the statistical significance tests performed, and an explicit clarification that phase boundaries are identified using only the model's internal generation dynamics (e.g., semantic shift indicators from token probabilities) with no access to or leakage from any test data. revision: yes

  2. Referee: Method section (self-reward signal derivation): The self-reward is generated by the same LVLM whose outputs are being corrected; the manuscript provides no independent verification (e.g., correlation analysis against ground-truth visual inconsistencies or external image features) that the signal reliably flags hallucinations rather than model-internal biases or overconfidence.

    Authors: This is a fair methodological concern. While the framework is intentionally unsupervised and relies on the LVLM's own signals, we have added to the revised manuscript a new correlation analysis between the self-reward scores and ground-truth hallucination annotations from the evaluation benchmarks. The results show a positive correlation, providing empirical evidence that the signal aligns with actual visual inconsistencies rather than solely reflecting internal biases. We acknowledge that this is not a fully external verification (e.g., via separate image encoders) and discuss it as a limitation and avenue for future work. (An editorial sketch of such a correlation check follows this list.) revision: partial

  3. Referee: Experiments (phase-wise patterns and distillation): The phase boundary detection threshold and reward model distillation hyperparameters are free parameters; without sensitivity analysis or ablations showing that the reported 50% reduction and benchmark wins are robust to their settings, the central claim risks being an artifact of the specific evaluation setup.

    Authors: We appreciate the emphasis on robustness. In the revised manuscript, we have incorporated sensitivity analyses that vary the phase boundary detection threshold across multiple values and test alternative settings for the distillation hyperparameters (including reward model capacity and training details). These ablations confirm that the reported hallucination reductions and benchmark improvements remain consistent and are not artifacts of particular hyperparameter choices. The results are presented in an expanded experiments section with additional figures and tables in the supplementary material. (An editorial sketch of such a sweep also follows this list.) revision: yes
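
Response 2 promises a correlation analysis between self-reward scores and ground-truth hallucination annotations. A minimal sketch of that check, assuming binary annotations (Pearson correlation with a 0/1 variable is the point-biserial correlation); the arrays are placeholders.

```python
# Editorial sketch of the verification response 2 describes.
import numpy as np

def reward_label_correlation(self_reward_scores, is_hallucinated):
    """self_reward_scores: one float score per generated segment.
    is_hallucinated    : 0/1 per segment from human annotation.
    Under this encoding (1 = hallucinated), a strongly negative
    coefficient would support the load-bearing premise."""
    s = np.asarray(self_reward_scores, dtype=float)
    y = np.asarray(is_hallucinated, dtype=float)
    return float(np.corrcoef(s, y)[0, 1])
```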
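
Response 3 promises sensitivity analyses over the phase-onset threshold and distillation hyperparameters. The shape of such a sweep is simple; `run_psrd_eval` is a hypothetical callable wrapping decoding plus one benchmark's hallucination metric.

```python
# Editorial sketch of a one-dimensional sensitivity sweep.
def threshold_sweep(run_psrd_eval, thresholds=(0.5, 1.0, 1.5, 2.0, 2.5)):
    """Returns the hallucination rate at each threshold plus its spread.
    A spread that is small relative to the baseline-vs-PSRD gap would
    indicate the reported reduction is not a threshold artifact."""
    results = {z: run_psrd_eval(z_thresh=z) for z in thresholds}
    spread = max(results.values()) - min(results.values())
    return results, spread
```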

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces PSRD as an empirical method: it observes phase-wise hallucination patterns, distills a self-reward signal from the LVLM into a lightweight model, and applies it for online correction during decoding. All central claims (50% hallucination reduction on LLaVA-1.5-7B, outperformance on five benchmarks) are presented as experimental outcomes rather than derived predictions. No equations, uniqueness theorems, or self-citations are invoked to force the result by construction. The self-reward mechanism is a design choice whose effectiveness is tested externally on benchmarks; it does not reduce to tautological re-labeling of inputs. The framework's claims stand or fall on independent evaluation, and it does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on an empirical observation of phase-wise hallucination patterns and the assumption that self-generated rewards can serve as reliable guidance without external labels; no explicit free parameters are named, but phase boundaries and reward thresholds are implicit.

free parameters (2)
  • phase boundary detection threshold
    Implicit in the phase-wise pattern detection; value not stated in abstract.
  • reward model distillation hyperparameters
    Required to train the lightweight model but unspecified.
axioms (2)
  • domain assumption: Visual hallucination exhibits phase-wise dynamic patterns peaking at semantic phase onsets
    Stated as an empirical revelation that underpins the entire PSRD design.
  • domain assumption: Self-reward signals from the LVLM can detect and correct hallucinations without external supervision
    Core premise enabling inference-time intervention.
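
To make the ledger concrete, the two implicit free parameters (plus the decode-time weight they imply) could be gathered into one configuration object. Every default below is an editorial placeholder; the abstract states none of these values.

```python
# Hypothetical configuration collecting the ledger's implicit free parameters.
from dataclasses import dataclass

@dataclass
class PSRDConfig:
    phase_onset_z_thresh: float = 1.5  # phase boundary detection threshold
    reward_hidden_dim: int = 256       # distillation: reward head width
    reward_lr: float = 1e-4            # distillation: learning rate
    reward_weight_alpha: float = 1.0   # decode-time reward weight (trade-off knob)
```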

pith-pipeline@v0.9.0 · 5567 in / 1443 out tokens · 35753 ms · 2026-05-10T05:06:47.031073+00:00 · methodology

