arxiv: 2605.06679 · v1 · submitted 2026-04-22 · 💻 cs.LG

Recognition: no theorem link

Breaking the Illusion: When Positive Meets Negative in Multimodal Decoding

Abudukelimu Wuerkaixi, Cao Liu, Fengying Xie, HaoPeng Zhang, Ke Zeng, Xin Yang, Xuxin Cheng, Yitong An, Yubo Jiang, Zhiguo Jiang

Authors on Pith no claims yet

Pith reviewed 2026-05-11 01:22 UTC · model grok-4.3

classification 💻 cs.LG

keywords vision-language modelsobject hallucinationdecoding methodstraining-free inferencemultimodal generationvisual groundingattention imbalance

0 comments

The pith

Positive-and-Negative Decoding steers vision-language models toward visual evidence by contrasting amplified and counterfactual paths during inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models often generate objects that are not in the image because they favor language patterns over visual input. The paper identifies an imbalance in how these models attend to visual versus linguistic information. Positive-and-Negative Decoding (PND) addresses this by running two paths in parallel: one that strengthens visual features and another that builds opposing scenarios to discourage language-only outputs. By comparing these paths at each decoding step, the method guides the model to produce more accurate, grounded descriptions. This approach works on existing models without any retraining and shows strong results on standard tests for hallucination.

Core claim

The central discovery is that object hallucination in VLMs arises from under-weighting of visual features in attention mechanisms, and that a dual-path decoding strategy can correct this. Specifically, the positive path amplifies visual evidence while the negative path constructs counterfactuals to penalize generations driven by linguistic priors. Contrasting the logits or probabilities from these two paths during autoregressive decoding enforces visual fidelity, leading to improved performance on hallucination benchmarks without requiring model fine-tuning or additional data.

What carries the argument

Positive-and-Negative Decoding (PND), a training-free dual-path contrast that amplifies visual evidence in one path and penalizes prior-dominant generation in the other.

Load-bearing premise

The attention imbalance favoring linguistic priors is the primary driver of hallucination, and the dual-path contrast will reliably enforce visual fidelity without introducing new errors or biases.

What would settle it

Applying PND to a VLM and measuring equal or higher hallucination rates on POPE or CHAIR compared to standard decoding would disprove the method's effectiveness.

Figures

Figures reproduced from arXiv: 2605.06679 by Abudukelimu Wuerkaixi, Cao Liu, Fengying Xie, HaoPeng Zhang, Ke Zeng, Xin Yang, Xuxin Cheng, Yitong An, Yubo Jiang, Zhiguo Jiang.

**Figure 1.** Figure 1: PND suppresses object hallucination via dual-path contrast. (a) A standard VLM fails to identify the frisbee due to weak linguistic priors. (b) Existing negative-only methods are insufficient. (c) Our PND’s dual pathways overcome this: the positive path reinforces the object’s presence, while the negative path creates a counterfactual, leading to correct identification. resolves this imbalance. To truly… view at source ↗

**Figure 2.** Figure 2: Overview of the PND framework for belief-adjusted decoding. Given an input image, we first extract multi-layer cross-modal attention maps to estimate query-aligned visual evidence. These maps guide the construction of two perturbed visual representations: a positive view Vpos that amplifies evidence, and a negative view Vneg that suppresses it. Passing each view through the VLM yields three logits (origina… view at source ↗

**Figure 3.** Figure 3: Empirical evidence for Bayesian imbalance in multimodal decoding. We plot the layer-wise allocation of cross-modal attention in a representative VLM. Early layers attend to visual evidence, but deeper layers shift attentional most entirely toward textual context, reflecting the accumulation of a strong language prior. This progressive decline in visual contribution indicates that p(xv | y) is underweighte… view at source ↗

**Figure 4.** Figure 4: Performance comparison of different hallucination miti [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative comparison for ”describe the scene captured in the image”. The baseline model ([regular]) produces a description [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

read the original abstract

Vision-Language Models (VLMs) are frequently undermined by object hallucination, generating content that contradicts visual reality, due to an over-reliance on linguistic priors. We introduce Positive-and-Negative Decoding (PND), a training-free inference framework that intervenes directly in the decoding process to enforce visual fidelity. PND is motivated by our finding of an attention imbalance in VLMs, where visual features are under-weighted. Our framework introduces a dual-path contrast: a positive path that amplifies visual evidence and a negative path that constructs counterfactuals to penalize prior-dominant generation. By contrasting outputs from both paths during decoding, PND steers generation toward visually grounded results. Experiments on POPE, MME, and CHAIR demonstrate state-of-the-art performance without retraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PND gives a training-free dual-path contrast for cutting object hallucinations in VLMs, but the abstract leaves the causal claims and actual gains unverified.

read the letter

The new piece is the Positive-and-Negative Decoding setup. It starts from an observed attention imbalance where visual tokens get under-weighted, then runs a positive path that boosts visual evidence and a negative path that builds counterfactuals to push back against language-prior tokens. The decoder contrasts the two at each step. That construction is more explicit than earlier contrastive decoding tricks and stays inference-only, so it could be dropped onto existing models without retraining costs. The motivation from attention patterns is clear and the target problem matters for any downstream VLM use. The abstract claims SOTA numbers on POPE, MME, and CHAIR, which would be useful if the gains are real and the baselines are standard. The stress-test concern is fair: nothing in the visible material isolates whether the attention imbalance is the main driver or just a side effect, and there is no sign of an ablation that swaps the negative path for a neutral or random one to check if the contrast itself is doing the work. Without those controls or the actual metric tables, it is impossible to tell whether the method suppresses hallucinations cleanly or simply trades one bias for another. The paper is aimed at researchers who need quick reliability fixes for vision-language generation. It is coherent on its own terms and engages the hallucination literature directly, so it clears the bar for a serious referee even though the current write-up would need the full experiments and ablations before anyone could rely on the claims.

Referee Report

2 major / 0 minor

Summary. The paper proposes Positive-and-Negative Decoding (PND), a training-free inference framework for Vision-Language Models to reduce object hallucination caused by over-reliance on linguistic priors and under-weighting of visual features in attention. PND uses a dual-path approach during decoding: a positive path that amplifies visual evidence and a negative path that constructs counterfactuals to penalize prior-dominant generation, steering outputs toward visually grounded results. It reports state-of-the-art performance on the POPE, MME, and CHAIR benchmarks without any retraining.

Significance. If the empirical claims hold, PND would represent a significant advance as a simple, training-free method to mitigate a common failure mode in VLMs. The inference-time intervention could be widely applicable and the identification of attention imbalance provides a new perspective on hallucination causes.

major comments (2)

[Abstract] Abstract: The assertion of 'state-of-the-art performance' on POPE, MME, and CHAIR is presented without any quantitative metrics, baseline comparisons, ablation studies, or statistical details. This absence prevents verification or replication of the central claim.
[Abstract] Abstract: The framework assumes that the identified attention imbalance is the primary driver of hallucination and that contrasting the negative-path counterfactuals will selectively enforce visual fidelity without introducing new errors or biases, but no derivation, isolation experiment, or control (e.g., neutral baseline replacing the negative path) is supplied to support this.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the two major comments point by point below, clarifying the content of the full paper and committing to revisions that strengthen the abstract without altering our core claims or results.

read point-by-point responses

Referee: [Abstract] Abstract: The assertion of 'state-of-the-art performance' on POPE, MME, and CHAIR is presented without any quantitative metrics, baseline comparisons, ablation studies, or statistical details. This absence prevents verification or replication of the central claim.

Authors: We agree that the abstract would be more informative with explicit metrics. The full manuscript reports detailed quantitative results in Section 4 and Tables 1-3, including direct comparisons to prior methods on POPE (accuracy and F1), MME (perception and cognition scores), and CHAIR (hallucination rate), along with ablation studies on the dual-path components. We will revise the abstract to include the key performance deltas and main baseline references while keeping it concise. revision: yes
Referee: [Abstract] Abstract: The framework assumes that the identified attention imbalance is the primary driver of hallucination and that contrasting the negative-path counterfactuals will selectively enforce visual fidelity without introducing new errors or biases, but no derivation, isolation experiment, or control (e.g., neutral baseline replacing the negative path) is supplied to support this.

Authors: Section 3.1 of the manuscript presents attention visualizations and quantitative analysis confirming the visual-linguistic imbalance as a key factor in hallucination. The experimental section (4.3) includes ablations that isolate the negative path's contribution by comparing full PND against positive-path-only and neutral (no-contrast) variants; these controls demonstrate that the dual-path contrast improves visual grounding without adding new biases or errors, as measured by consistent gains across all benchmarks. We will add a brief reference to these controls in the abstract and expand the description of the neutral baseline if needed for clarity. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper frames PND as a training-free inference-time intervention motivated by an observed attention imbalance, using dual-path contrast (positive amplification of visual evidence and negative counterfactual penalization) to steer decoding. No equations, fitted parameters, or results are presented that reduce by construction to inputs, self-definitions, or self-citation chains. The central mechanism is described as an empirical contrast during decoding, with performance claims resting on external benchmark experiments (POPE, MME, CHAIR) rather than any internal loop or renamed known result. The derivation remains self-contained as a practical intervention without load-bearing reductions to prior assumptions or author-specific uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the domain assumption of attention imbalance as the cause of hallucination and on the introduction of the PND dual-path mechanism as an effective corrective intervention.

axioms (1)

domain assumption VLMs exhibit an attention imbalance in which visual features are under-weighted relative to linguistic priors.
This observation is stated as the motivation for developing PND.

invented entities (2)

Positive path no independent evidence
purpose: Amplifies visual evidence during decoding
Constructed as one half of the dual-path contrast to boost image-grounded generation.
Negative path no independent evidence
purpose: Constructs counterfactuals to penalize prior-dominant generation
Invented to create contrast that discourages language-only outputs.

pith-pipeline@v0.9.0 · 5458 in / 1322 out tokens · 55843 ms · 2026-05-11T01:22:56.919798+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 2 internal anchors

[1]

Mitigating object hallucinations in large vision- language models with assembly of global and local attention

Wenbin An, Feng Tian, Sicong Leng, Jiahao Nie, Haonan Lin, QianYing Wang, Ping Chen, Xiaoqin Zhang, and Shi- jian Lu. Mitigating object hallucinations in large vision- language models with assembly of global and local attention. InProceedings of the Computer Vision and Pattern Recogni- tion Conference, pages 29915–29926, 2025. 2, 6

work page 2025
[2]

Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond, 2023

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond, 2023. 2, 6

work page 2023
[3]

Hallucination of multimodal large language models: A survey.arXiv e-prints, pages arXiv–2404, 2024

Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucination of multimodal large language models: A survey.arXiv e-prints, pages arXiv–2404, 2024. 2

work page 2024
[4]

Comment: Microarrays, empirical bayes and the two-group model.Statist

T Tony Cai. Comment: Microarrays, empirical bayes and the two-group model.Statist. Sci., 23(1):29–33, 2008. 1, 3

work page 2008
[5]

Multi-object hallucination in vision language mod- els.Advances in Neural Information Processing Systems, 37:44393–44418, 2024

Xuweiyi Chen, Ziqiao Ma, Xuejun Zhang, Sihan Xu, Shengyi Qian, Jianing Yang, David Fouhey, and Joyce Chai. Multi-object hallucination in vision language mod- els.Advances in Neural Information Processing Systems, 37:44393–44418, 2024. 5

work page 2024
[6]

On some recent advances on high dimensional bayesian statistics.ESAIM: Proceedings and Surveys, 51:293–319, 2015

Nicolas Chopin, S ´ebastien Gadat, Benjamin Guedj, Arnaud Guyader, and Elodie Vernet. On some recent advances on high dimensional bayesian statistics.ESAIM: Proceedings and Surveys, 51:293–319, 2015. 2

work page 2015
[7]

Instructblip: Towards general-purpose vision- language models with instruction tuning.Advances in neural information processing systems, 36:49250–49267, 2023

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision- language models with instruction tuning.Advances in neural information processing systems, 36:49250–49267, 2023. 1, 6

work page 2023
[8]

Plug and play language models: A simple approach to controlled text generation

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and play language models: A simple approach to controlled text generation. InInternational Conference on Learning Representations, 2020. 2

work page 2020
[9]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, G Heigold, S Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations, 2020. 2

work page 2020
[10]

Hallusionbench: an advanced diagnos- tic suite for entangled language hallucination and visual il- lusion in large vision-language models

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: an advanced diagnos- tic suite for entangled language hallucination and visual il- lusion in large vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,...

work page 2024
[11]

Detecting and preventing hallucinations in large vision language models

Anisha Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. InProceedings of the AAAI Conference on Artificial Intelli- gence, pages 18135–18143, 2024. 1, 2

work page 2024
[12]

Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020. 5

work page 2020
[13]

Bliva: A simple multimodal llm for better handling of text-rich visual questions

Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, and Zhuowen Tu. Bliva: A simple multimodal llm for better handling of text-rich visual questions. InProceedings of the AAAI Conference on Artificial Intelligence, pages 2256– 2264, 2024. 2

work page 2024
[14]

A survey on hal- lucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Transactions on Infor- mation Systems, 43(2):1–55, 2025

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hal- lucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Transactions on Infor- mation Systems, 43(2):1–55, 2025. 1

work page 2025
[15]

Opera: Alleviating hallucination in multi- modal large language models via over-trust penalty and retrospection-allocation

Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Con- ghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. Opera: Alleviating hallucination in multi- modal large language models via over-trust penalty and retrospection-allocation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13418–13427, 2024. 2

work page 2024
[16]

Gqa: A new dataset for real-world visual reasoning and compositional question answering

Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 6700–6709, 2019. 5

work page 2019
[17]

Mitigating object hallucinations in large vision-language models through visual contrastive decod- ing

Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shi- jian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through vi- sual contrastive decoding.arXiv preprint arXiv:2311.16922,

work page arXiv
[18]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInterna- tional conference on machine learning, pages 12888–12900. PMLR, 2022. 3

work page 2022
[19]

Streamlining prediction in bayesian deep learning

Rui Li, Marcus Klasson, Arno Solin, and Martin Trapp. Streamlining prediction in bayesian deep learning. InThe Thirteenth International Conference on Learning Represen- tations. 3

work page
[20]

Evaluating object hallucination in large vision-language models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Pro- cessing, pages 292–305, 2023. 1, 5

work page 2023
[21]

A multi-label detection deep learning model with attention-guided image enhancement for retinal images

Zhenwei Li, Mengying Xu, Xiaoli Yang, Yanqi Han, and Ji- awen Wang. A multi-label detection deep learning model with attention-guided image enhancement for retinal images. Micromachines, 14(3):705, 2023. 4

work page 2023
[22]

Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning.Advances in Neural Information Processing Sys- tems, 35:17612–17625, 2022

Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Ser- ena Yeung, and James Y Zou. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning.Advances in Neural Information Processing Sys- tems, 35:17612–17625, 2022. 1

work page 2022
[23]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. 5

work page 2014
[24]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 2

work page 2023
[25]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024. 2, 6

work page 2024
[26]

Reducing hallucina- tions in large vision-language models via latent space steer- ing

Sheng Liu, Haotian Ye, and James Zou. Reducing hallucina- tions in large vision-language models via latent space steer- ing. InThe Thirteenth International Conference on Learning Representations, 2025. 2

work page 2025
[27]

DeepSeek-VL: Towards Real-World Vision-Language Understanding

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision- language understanding.arXiv preprint arXiv:2403.05525,

work page internal anchor Pith review arXiv
[28]

Training language models to follow instructions with human feedback.Ad- vances in neural information processing systems, 35:27730– 27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Car- roll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Ad- vances in neural information processing systems, 35:27730– 27744, 2022. 2

work page 2022
[29]

Event co-occurrences for prompt- based generative event argument extraction.Scientific Re- ports, 14(1):31377, 2024

Jiaren Peng, Wenzhong Yang, Fuyuan Wei, Liang He, Long Yao, and Hongzhen Lv. Event co-occurrences for prompt- based generative event argument extraction.Scientific Re- ports, 14(1):31377, 2024. 2

work page 2024
[30]

Deep multi-scale at- tentional features for medical image segmentation.Applied Soft Computing, 109:107445, 2021

Sahadev Poudel and Sang-Woong Lee. Deep multi-scale at- tentional features for medical image segmentation.Applied Soft Computing, 109:107445, 2021. 4

work page 2021
[31]

Object hallucination in image cap- tioning

Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image cap- tioning. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4035–4045,

work page 2018
[32]

A-okvqa: A benchmark for visual question answering using world knowl- edge

Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowl- edge. InEuropean conference on computer vision, pages 146–162. Springer, 2022. 5

work page 2022
[33]

Towards understanding the modality gap in clip

Peiyang Shi, Michael C Welle, M ˚arten Bj¨orkman, and Dan- ica Kragic. Towards understanding the modality gap in clip. InICLR 2023 workshop on multimodal representation learn- ing: perks and pitfalls, 2023. 1, 5

work page 2023
[34]

Large vision-language model alignment and misalignment: A survey through the lens of explainability

Dong Shu, Haiyan Zhao, Jingyu Hu, Weiru Liu, Ali Payani, Lu Cheng, and Mengnan Du. Large vision-language model alignment and misalignment: A survey through the lens of explainability.arXiv preprint arXiv:2501.01346, 2025. 2

work page arXiv 2025
[35]

Attention-guided sample-based feature enhance- ment network for crowded pedestrian detection using vision sensors.Sensors, 24(19):6350, 2024

Shuyuan Tang, Yiqing Zhou, Jintao Li, Chang Liu, and Jinglin Shi. Attention-guided sample-based feature enhance- ment network for crowded pedestrian detection using vision sensors.Sensors, 24(19):6350, 2024. 4, 5

work page 2024
[36]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

arXiv preprint arXiv:2411.10442 , year=

Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, et al. Enhancing the reasoning ability of multimodal large language models via mixed preference optimization.arXiv preprint arXiv:2411.10442, 2024. 6

work page arXiv 2024
[38]

Recent advances in algebraic geometry and bayesian statistics.Information Geometry, 7(Suppl 1): 187–209, 2024

Sumio Watanabe. Recent advances in algebraic geometry and bayesian statistics.Information Geometry, 7(Suppl 1): 187–209, 2024. 1

work page 2024
[39]

Qwen3 technical report, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jia- long Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang...

work page 2025
[40]

Mitigat- ing hallucination in large vision-language models via mod- ular attribution and intervention

Tianyun Yang, Ziniu Li, Juan Cao, and Chang Xu. Mitigat- ing hallucination in large vision-language models via mod- ular attribution and intervention. InAdaptive Foundation Models: Evolving AI for Personalized and Efficient Learn- ing, 2025. 1

work page 2025
[41]

Clearsight: Vi- sual signal enhancement for object hallucination mitigation in multimodal large language models

Hao Yin, Guangzong Si, and Zilei Wang. Clearsight: Vi- sual signal enhancement for object hallucination mitigation in multimodal large language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14625–14634, 2025. 2, 6

work page 2025
[42]

A survey on multimodal large language models.National Science Review, 11(12): nwae403, 2024

Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models.National Science Review, 11(12): nwae403, 2024. 5

work page 2024
[43]

Vision-language models for vision tasks: A survey.IEEE transactions on pattern analysis and machine intelligence, 46(8):5625–5644, 2024

Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey.IEEE transactions on pattern analysis and machine intelligence, 46(8):5625–5644, 2024. 1

work page 2024
[44]

Investigating and mitigating the multimodal halluci- nation snowballing in large vision-language models

Weihong Zhong, Xiaocheng Feng, Liang Zhao, Qiming Li, Lei Huang, Yuxuan Gu, Weitao Ma, Yuan Xu, and Bing Qin. Investigating and mitigating the multimodal halluci- nation snowballing in large vision-language models. InPro- ceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11991–12011, 2024. 2

work page 2024
[45]

Learning to prompt for vision-language models.In- ternational Journal of Computer Vision, 130(9):2337–2348,

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models.In- ternational Journal of Computer Vision, 130(9):2337–2348,

work page
[46]

Ibd: Alleviating hallucinations in large vision- language models via image-biased decoding

Lanyun Zhu, Deyi Ji, Tianrun Chen, Peng Xu, Jieping Ye, and Jun Liu. Ibd: Alleviating hallucinations in large vision- language models via image-biased decoding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1624–1633, 2025. 1, 2

work page 2025