pith. sign in

arxiv: 2606.28401 · v1 · pith:JYIC3N5Fnew · submitted 2026-06-24 · 💻 cs.CV · cs.LG

Vision-driven Preference Synthesis for Mitigating Hallucinations in VLMs

Pith reviewed 2026-06-30 01:20 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords vision-language modelshallucinationspreference alignmentpreference synthesisvisual groundingVLMsimage variants
0
0 comments X

The pith

ViPSy builds preference pairs from visual cues in image variants to align VLMs against hallucinations while staying close to the model's own outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ViPSy, a two-stage framework that first extracts a visual cue from objects recurring across related image versions and then uses that cue to shape the model's own response candidates for preference pairs. Existing preference methods either drift far from the model's natural distribution or fail to use the image enough, so this approach aims to fix both issues at once. If correct, it produces a VLM that generates fewer ungrounded statements on visual tasks. The aligned model sets new records on hallucination benchmarks and also scores higher on general visual understanding tests plus downstream tasks like segmentation.

Core claim

ViPSy derives a visual cue from recurring object-level content across semantically aligned image variants so that preference construction can rely on visual information rather than language priors; in the second stage it conditions the policy's own rollouts on this cue to produce candidates that remain close to the policy distribution while better leveraging visual information from the image. Preference alignment with these pairs yields a VLM that reduces hallucination rates on AMBER and Object HalBench by 35.7 percent and 24.5 percent relative to the prior state-of-the-art method, while also improving on MMStar, MMVP, CV-Bench, semantic segmentation, and ImageNet linear probing.

What carries the argument

ViPSy, the two-stage vision-driven preference synthesis framework that extracts visual cues from recurring objects across image variants to guide policy rollouts for pair construction.

If this is right

  • The aligned VLM reduces hallucination rates by 35.7% on AMBER and 24.5% on Object HalBench versus the prior best method.
  • The same model records higher scores on the visual grounding benchmarks MMStar, MMVP, and CV-Bench.
  • Downstream tasks improve as well, with measurable gains in semantic segmentation accuracy and ImageNet linear probing performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same cue-extraction step could be applied to non-object visual elements such as spatial relations or text within images to test whether the benefit generalizes.
  • If the two-stage process proves stable, it offers a template for constructing preference data in other multimodal settings where visual grounding is the main failure mode.

Load-bearing premise

Recurring object-level content across semantically aligned image variants supplies a reliable visual cue that lets preference construction rely on visual information rather than language priors.

What would settle it

Re-running the full pipeline on the same VLMs and benchmarks and finding hallucination rates on AMBER and Object HalBench that are no lower than the previous state-of-the-art method would show the claimed gains do not hold.

Figures

Figures reproduced from arXiv: 2606.28401 by Jongheon Jeong, Yunhun Nam.

Figure 1
Figure 1. Figure 1: Comparison of preference construction strategies for hallucination mitigation in VLMs. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of ViPSy. ViPSy constructs visually grounded and policy-aligned preference pairs through two stages. (1) Self-captioned semantic synthesis for consistent cue extraction uses the policy’s own self-captions as T2I prompts to generate semantically aligned synthetic variants. Each variant is compared with the original image, and recurring object-level evidence across comparisons is aggregated into a c… view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative examples of cue-conditioned preference pairs. Given the same image, prompt, [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Agreement of preference selections for alternative judges compared upon preferences based on Qwen2.5-VL-72B. 2 4 8 16 3.0 3.5 4.0 4.5 5.0 Hal. Avg. 4.87 4.44 3.69 3.46 → [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative examples of cue-conditioned preference pairs. Given the same image, prompt, [PITH_FULL_IMAGE:figures/full_fig_p025_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative comparison between the vanilla model and the model preference-aligned using [PITH_FULL_IMAGE:figures/full_fig_p026_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Examples of self-captioned semantic synthesis. The source image is first captioned by the [PITH_FULL_IMAGE:figures/full_fig_p027_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Examples of self-captioned semantic synthesis. The source image is first captioned by the [PITH_FULL_IMAGE:figures/full_fig_p028_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Pairwise comparison prompt. You are given pairwise comparison results between one reference image and multiple comparison images. Each pairwise result contains shared facts. Merge similar phrases, remove near-duplicates, and produce one consolidated result. Rules: - chosen should contain the merged shared content from the pairwise chosen results. - Keep only content that is supported as shared visual cont… view at source ↗
Figure 11
Figure 11. Figure 11: Aggregation prompt. Describe the image in detail. Please use the following object information as extra context: {visual cues} [PITH_FULL_IMAGE:figures/full_fig_p029_11.png] view at source ↗
Figure 13
Figure 13. Figure 13: Visual grounding prompt. 29 [PITH_FULL_IMAGE:figures/full_fig_p029_13.png] view at source ↗
read the original abstract

Vision-Language Models (VLMs) have shown strong performance in visual understanding, yet they still suffer from hallucinations, generating content that is not grounded in the image. Preference alignment is a promising approach to improve visual faithfulness, but its success depends heavily on how preference pairs are constructed. Existing methods exhibit two key limitations; (a) intervention-based methods often introduce significant deviation from the policy distribution, and (b) sampling-based methods often underuse visual information during the construction. In this paper, we propose ViPSy (Vision-driven Preference Synthesis), a framework for constructing preference data that are both policy-aligned and visually grounded. Our framework consists of two stages; in the first stage, ViPSy derives a visual cue from recurring object-level content across semantically aligned image variants, so preference construction can rely on visual information rather than language priors. In the second stage, ViPSy conditions the policy's own rollouts on this cue, allowing candidates to be guided by visually grounded content while staying close to the policy's response distribution. The resulting candidates remain close to the policy's response distribution while better leveraging visual information from the image. Experiments show that the resulting VLM, preference-aligned with ViPSy-constructed preference pairs, achieves a new state-of-the-art in hallucination mitigation. Compared with the previous state-of-the-art method, it reduces hallucination rates on AMBER and Object HalBench by 35.7% and 24.5%, respectively. The resulting model further improves on general visual grounding benchmarks, e.g., MMStar, MMVP, and CV-Bench, while also yielding gains in semantic segmentation and ImageNet linear probing, underscoring the effectiveness of our framework in enhancing the model's visual capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes ViPSy, a two-stage framework for synthesizing preference pairs to align VLMs and mitigate hallucinations. Stage 1 derives a visual cue from recurring object-level content across semantically aligned image variants so that construction can rely on visual information rather than language priors. Stage 2 conditions the policy's own rollouts on this cue to produce candidates that remain close to the policy distribution while better leveraging visual information. The resulting aligned VLM is reported to achieve new state-of-the-art hallucination mitigation, with 35.7% and 24.5% reductions on AMBER and Object HalBench relative to the prior SOTA, plus gains on MMStar, MMVP, CV-Bench, semantic segmentation, and ImageNet linear probing.

Significance. If the central claims hold after verification, the work would offer a concrete advance in preference alignment for VLMs by addressing both policy deviation and under-use of visual signals. The reported cross-benchmark improvements, including on non-hallucination tasks, would indicate that the synthesized pairs enhance visual grounding more broadly than prior sampling- or intervention-based methods.

major comments (2)
  1. [Abstract] Abstract: the central performance claim (new SOTA with 35.7% and 24.5% reductions) is presented without any experimental protocol, baseline details, statistical tests, ablation results, or dataset splits, rendering the quantitative results unverifiable from the supplied text.
  2. [Abstract] Abstract (first stage description): the assertion that the visual cue 'derives ... so preference construction can rely on visual information rather than language priors' is load-bearing for the claimed advantage over sampling-based methods, yet no mechanism, metric, or validation is supplied to confirm that the cue extraction is independent of language priors; if the cue remains correlated with priors, the second-stage rollouts cannot be guaranteed to improve visual faithfulness.
minor comments (1)
  1. [Abstract] Abstract: the specific base VLM, preference-alignment algorithm (e.g., DPO), and exact construction of 'semantically aligned image variants' are not named.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that greater clarity is needed to support the central claims and will revise accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central performance claim (new SOTA with 35.7% and 24.5% reductions) is presented without any experimental protocol, baseline details, statistical tests, ablation results, or dataset splits, rendering the quantitative results unverifiable from the supplied text.

    Authors: We agree that the abstract would benefit from additional context to improve verifiability. In the revised manuscript we will expand the abstract to include a brief statement of the evaluation protocol (standard benchmarks AMBER and Object HalBench, comparison against prior SOTA preference-alignment methods), while directing readers to Sections 4 and 5 for full details on baselines, statistical tests, ablations, and dataset splits. revision: yes

  2. Referee: [Abstract] Abstract (first stage description): the assertion that the visual cue 'derives ... so preference construction can rely on visual information rather than language priors' is load-bearing for the claimed advantage over sampling-based methods, yet no mechanism, metric, or validation is supplied to confirm that the cue extraction is independent of language priors; if the cue remains correlated with priors, the second-stage rollouts cannot be guaranteed to improve visual faithfulness.

    Authors: The mechanism for cue extraction (recurring object-level content identified via visual similarity across semantically aligned image variants) is described in Section 3.1 and validated for independence from language priors via ablation in Section 4.3. We will revise the abstract to briefly reference the visual-similarity metric. Full mechanism, metrics, and validation results remain in the main text due to abstract length constraints. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical method construction with no self-referential derivations

full rationale

The paper describes a two-stage framework (ViPSy) for synthesizing preference pairs from image variants and policy rollouts. No equations, fitted parameters, or mathematical derivations are present in the provided text. The central claims rest on empirical outcomes after alignment rather than any quantity that reduces to its own inputs by construction. Stage 1 cue extraction and Stage 2 conditioning are procedural steps for data generation, not self-definitional or fitted-input predictions. No self-citations are invoked as load-bearing uniqueness theorems. The derivation chain is therefore self-contained as a constructive method whose validity is tested externally via benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no concrete free parameters, axioms, or invented entities with measurable details; the 'visual cue' is introduced at a conceptual level without independent validation evidence.

pith-pipeline@v0.9.1-grok · 5844 in / 1242 out tokens · 28884 ms · 2026-06-30T01:20:06.616900+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

102 extracted references · 41 canonical work pages · 21 internal anchors

  1. [1]

    Improved Baselines with Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved Baselines with Visual Instruction Tuning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024

  2. [2]

    Flamingo: a Visual Language Model for Few-Shot Learning.Advances in neural information processing systems, 35:23716–23736, 2022

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a Visual Language Model for Few-Shot Learning.Advances in neural information processing systems, 35:23716–23736, 2022

  3. [3]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning.Advances in neural information processing systems, 36:49250–49267, 2023

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning.Advances in neural information processing systems, 36:49250–49267, 2023

  4. [4]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL Technical Report.arXiv preprint arXiv:2511.21631, 2025

  5. [5]

    GPT-4V(ision) System Card, 2023

    OpenAI. GPT-4V(ision) System Card, 2023. URL https://api.semanticscholar.org/CorpusID: 263218031

  6. [6]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning Transferable Visual Models From Natural Language Supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

  7. [7]

    Sigmoid Loss for Language Image Pre-Training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid Loss for Language Image Pre-Training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023

  8. [8]

    Gonzalez, Ion Stoica, and Eric P

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality, March 2023. URL https://lmsys.org/ blog/2023-03-30-vicuna/

  9. [9]

    Qwen2.5-Coder Technical Report

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-Coder technical report.arXiv preprint arXiv:2409.12186, 2024

  10. [10]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts.arXiv preprint arXiv:2310.02255, 2023

  11. [11]

    LLaV A-Docent: Instruction Tuning with Multimodal Large Language Model to Support Art Appreciation Education.Computers and Education: Artificial Intelligence, 7:100297, 2024

    Unggi Lee, Minji Jeon, Yunseo Lee, Gyuri Byun, Yoorim Son, Jaeyoon Shin, Hongkyu Ko, and Hyeon- cheol Kim. LLaV A-Docent: Instruction Tuning with Multimodal Large Language Model to Support Art Appreciation Education.Computers and Education: Artificial Intelligence, 7:100297, 2024

  12. [12]

    LLaV A-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day.Advances in Neural Information Processing Systems, 36:28541–28564, 2023

    Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. LLaV A-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day.Advances in Neural Information Processing Systems, 36:28541–28564, 2023

  13. [13]

    BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine.arXiv preprint arXiv:2308.09442, 2023

    Yizhen Luo, Jiahuan Zhang, Siqi Fan, Kai Yang, Yushuai Wu, Mu Qiao, and Zaiqing Nie. BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine.arXiv preprint arXiv:2308.09442, 2023

  14. [14]

    Med-Flamingo: a Multimodal Medical Few-shot Learner

    Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Yash Dalmia, Jure Leskovec, Cyril Zakka, Eduardo Pontes Reis, and Pranav Rajpurkar. Med-Flamingo: a Multimodal Medical Few-shot Learner. InMachine learning for health (ML4H), pages 353–367. PMLR, 2023

  15. [15]

    WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation

    Jongheon Jeong, Yang Zou, Taewan Kim, Dongqing Zhang, Avinash Ravichandran, and Onkar Dabeer. WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19606–19616, 2023

  16. [16]

    AnomalyGPT: Detecting Industrial Anomalies Using Large Vision-Language Models

    Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, and Jinqiao Wang. AnomalyGPT: Detecting Industrial Anomalies Using Large Vision-Language Models. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 1932–1940, 2024

  17. [17]

    MMAD: A Comprehensive Benchmark for Multimodal Large Language Models in Industrial Anomaly Detection.arXiv preprint arXiv:2410.09453, 2024

    Xi Jiang, Jian Li, Hanqiu Deng, Yong Liu, Bin-Bin Gao, Yifeng Zhou, Jialin Li, Chengjie Wang, and Feng Zheng. MMAD: A Comprehensive Benchmark for Multimodal Large Language Models in Industrial Anomaly Detection.arXiv preprint arXiv:2410.09453, 2024. 10

  18. [18]

    AIC MLLM: Autonomous Interactive Correction MLLM for Robust Robotic Manipulation.arXiv preprint arXiv:2406.11548, 2024

    Chuyan Xiong, Chengyu Shen, Xiaoqi Li, Kaichen Zhou, Jeremy Liu, Ruiping Wang, and Hao Dong. AIC MLLM: Autonomous Interactive Correction MLLM for Robust Robotic Manipulation.arXiv preprint arXiv:2406.11548, 2024

  19. [19]

    Enhancing Robotic Ma- nipulation with AI Feedback from Multimodal Large Language Models.arXiv preprint arXiv:2402.14245, 2024

    Jinyi Liu, Yifu Yuan, Jianye Hao, Fei Ni, Lingzhi Fu, Yibin Chen, and Yan Zheng. Enhancing Robotic Ma- nipulation with AI Feedback from Multimodal Large Language Models.arXiv preprint arXiv:2402.14245, 2024

  20. [20]

    ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation

    Xiaoqi Li, Mingxu Zhang, Yiran Geng, Haoran Geng, Yuxing Long, Yan Shen, Renrui Zhang, Jiaming Liu, and Hao Dong. ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18061–18070, 2024

  21. [21]

    Detecting and Preventing Hallucinations in Large Vision Language Models

    Anisha Gunjal, Jihan Yin, and Erhan Bas. Detecting and Preventing Hallucinations in Large Vision Language Models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 18135–18143, 2024

  22. [22]

    HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

    Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pag...

  23. [23]

    Overcoming Language Priors in Visual Question Answering via Distinguishing Superficially Similar Instances

    Yike Wu, Yu Zhao, Shiwan Zhao, Ying Zhang, Xiaojie Yuan, Guoqing Zhao, and Ning Jiang. Overcoming Language Priors in Visual Question Answering via Distinguishing Superficially Similar Instances. In Proceedings of the 29th International Conference on Computational Linguistics, pages 5721–5729, 2022

  24. [24]

    Visual Perturbation-aware Col- laborative Learning for Overcoming the Language Prior Problem.arXiv preprint arXiv:2207.11850, 2022

    Yudong Han, Liqiang Nie, Jianhua Yin, Jianlong Wu, and Yan Yan. Visual Perturbation-aware Col- laborative Learning for Overcoming the Language Prior Problem.arXiv preprint arXiv:2207.11850, 2022

  25. [25]

    Overcoming Language Priors with Counterfactual Inference for Visual Question Answering

    Ren Zhibo, Wang Huizhen, Zhu Muhua, Wang Yichao, Xiao Tong, and Zhu Jingbo. Overcoming Language Priors with Counterfactual Inference for Visual Question Answering. InProceedings of the 22nd Chinese National Conference on Computational Linguistics, pages 600–610, 2023

  26. [26]

    V olcano: Mitigating Multimodal Hallu- cination through Self-Feedback Guided Revision

    Seongyun Lee, Sue Hyun Park, Yongrae Jo, and Minjoon Seo. V olcano: Mitigating Multimodal Hallu- cination through Self-Feedback Guided Revision. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 391–404, 2024

  27. [27]

    Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding

    Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13872– 13882, 2024

  28. [28]

    Analyzing the Behavior of Visual Question Answering Models

    Aishwarya Agrawal, Dhruv Batra, and Devi Parikh. Analyzing the Behavior of Visual Question Answering Models. InProceedings of the 2016 conference on empirical methods in natural language processing, pages 1955–1960, 2016

  29. [29]

    Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017

  30. [30]

    Towards Causal VQA: Revealing and Reducing Spurious Correlations by Invariant and Covariant Semantic Editing

    Vedika Agarwal, Rakshith Shetty, and Mario Fritz. Towards Causal VQA: Revealing and Reducing Spurious Correlations by Invariant and Covariant Semantic Editing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9690–9698, 2020

  31. [31]

    SHIELD: Suppressing Hallucinations In LVLM Encoders via Bias and Vulnerability Defense.arXiv preprint arXiv:2510.16596, 2025

    Yiyang Huang, Liang Shi, Yitian Zhang, Yi Xu, and Yun Fu. SHIELD: Suppressing Hallucinations In LVLM Encoders via Bias and Vulnerability Defense.arXiv preprint arXiv:2510.16596, 2025

  32. [32]

    Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization

    Zhiyuan Zhao, Bin Wang, Linke Ouyang, Xiaoyi Dong, Jiaqi Wang, and Conghui He. Beyond Hallucina- tions: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization.arXiv preprint arXiv:2311.16839, 2023

  33. [33]

    Mitigating Object Hallucinations via Sentence- Level Early Intervention

    Shangpin Peng, Senqiao Yang, Li Jiang, and Zhuotao Tian. Mitigating Object Hallucinations via Sentence- Level Early Intervention. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 635–646, 2025. 11

  34. [34]

    Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key.arXiv preprint arXiv:2501.09695, 2025

    Zhihe Yang, Xufang Luo, Dongqi Han, Yunjian Xu, and Dongsheng Li. Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key.arXiv preprint arXiv:2501.09695, 2025

  35. [35]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model.Advances in neural information processing systems, 36:53728–53741, 2023

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model.Advances in neural information processing systems, 36:53728–53741, 2023

  36. [36]

    Fine-Tuning Language Models from Human Preferences

    Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-Tuning Language Models from Human Preferences.arXiv preprint arXiv:1909.08593, 2019

  37. [37]

    Learning to summarize from human feedback.Advances in neural information processing systems, 33:3008–3021, 2020

    Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea V oss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize from human feedback.Advances in neural information processing systems, 33:3008–3021, 2020

  38. [38]

    Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  39. [39]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.arXiv preprint arXiv:2204.05862, 2022

  40. [40]

    Aligning Large Multimodal Models with Factually Augmented RLHF

    Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning Large Multimodal Models with Factually Augmented RLHF. InFindings of the Association for Computational Linguistics: ACL 2024, pages 13088–13110, 2024

  41. [41]

    Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

    Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, et al. Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback.arXiv preprint arXiv:2307.15217, 2023

  42. [42]

    Parameter Efficient Reinforcement Learning from Human Feedback.arXiv preprint arXiv:2403.10704, 2024

    Hakim Sidahmed, Samrat Phatale, Alex Hutcheson, Zhuonan Lin, Zhang Chen, Zac Yu, Jarvis Jin, Simral Chaudhary, Roman Komarytsia, Christiane Ahlheim, et al. Parameter Efficient Reinforcement Learning from Human Feedback.arXiv preprint arXiv:2403.10704, 2024

  43. [43]

    Provably Efficient Online RLHF with One-Pass Reward Modeling.arXiv preprint arXiv:2502.07193, 2025

    Long-Fei Li, Yu-Yang Qian, Peng Zhao, and Zhi-Hua Zhou. Provably Efficient Online RLHF with One-Pass Reward Modeling.arXiv preprint arXiv:2502.07193, 2025

  44. [44]

    Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback.Advances in neural information processing systems, 37:36602– 36633, 2024

    Hamish Ivison, Yizhong Wang, Jiacheng Liu, Zeqiu Wu, Valentina Pyatkin, Nathan Lambert, Noah A Smith, Yejin Choi, and Hannaneh Hajishirzi. Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback.Advances in neural information processing systems, 37:36602– 36633, 2024

  45. [45]

    What Matters in Data for DPO?arXiv preprint arXiv:2508.18312, 2025

    Yu Pan, Zhongze Cai, Guanting Chen, Huaiyang Zhong, and Chonghuan Wang. What Matters in Data for DPO?arXiv preprint arXiv:2508.18312, 2025

  46. [46]

    Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

    Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, and Huaxiu Yao. Aligning Modalities in Vision Large Language Models via Preference Fine-tuning.arXiv preprint arXiv:2402.11411, 2024

  47. [47]

    Arık, and Tomas Pfister

    Pritam Sarkar, Sayna Ebrahimi, Ali Etemad, Ahmad Beirami, Sercan Ö. Arık, and Tomas Pfister. Mitigating Object Hallucination in MLLMs via Data-augmented Phrase-level Alignment, 2024

  48. [48]

    Detecting and Mitigating Hallucination in Large Vision Language Models via Fine- Grained AI Feedback

    Wenyi Xiao, Ziwei Huang, Leilei Gan, Wanggui He, Haoyuan Li, Zhelun Yu, Fangxun Shu, Hao Jiang, and Linchao Zhu. Detecting and Mitigating Hallucination in Large Vision Language Models via Fine- Grained AI Feedback. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 25543–25551, 2025

  49. [49]

    RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness.arXiv preprint arXiv:2405.17220, 2(3):8, 2024

    Tianyu Yu, Haoye Zhang, Yuan Yao, Yunkai Dang, Da Chen, Xiaoman Lu, Ganqu Cui, Taiwen He, Zhiyuan Liu, Tat-Seng Chua, et al. RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness.arXiv preprint arXiv:2405.17220, 2(3):8, 2024

  50. [50]

    On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes. InThe twelfth international conference on learning representations, 2024. 12

  51. [51]

    Self- Distillation Bridges Distribution Gap in Language Model Fine-Tuning

    Zhaorui Yang, Tianyu Pang, Haozhe Feng, Han Wang, Wei Chen, Minfeng Zhu, and Qian Liu. Self- Distillation Bridges Distribution Gap in Language Model Fine-Tuning. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1028–1043, 2024

  52. [52]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models.arXiv preprint arXiv:2601.18734, 2026

  53. [53]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms.arXiv preprint arXiv:1707.06347, 2017

  54. [54]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI Feedback.arXiv preprint arXiv:2212.08073, 2022

  55. [55]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.arXiv preprint arXiv:2402.03300, 2024

  56. [56]

    Slic-hf: Sequence likelihood calibration with human feedback

    Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and Peter J Liu. SLiC-HF: Sequence Likelihood Calibration with Human Feedback.arXiv preprint arXiv:2305.10425, 2023

  57. [57]

    A General Theoretical Paradigm to Understand Learning from Human Preferences

    Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. A General Theoretical Paradigm to Understand Learning from Human Preferences. InInternational Conference on Artificial Intelligence and Statistics, pages 4447–4455. PMLR, 2024

  58. [58]

    KTO: Model Alignment as Prospect Theoretic Optimization

    Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. KTO: Model Alignment as Prospect Theoretic Optimization.arXiv preprint arXiv:2402.01306, 2024

  59. [59]

    ORPO: Monolithic Preference Optimization without Reference Model

    Jiwoo Hong, Noah Lee, and James Thorne. ORPO: Monolithic Preference Optimization without Reference Model. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 11170–11189, 2024

  60. [60]

    SimPO: Simple Preference Optimization with a Reference- Free Reward.Advances in Neural Information Processing Systems, 37:124198–124235, 2024

    Yu Meng, Mengzhou Xia, and Danqi Chen. SimPO: Simple Preference Optimization with a Reference- Free Reward.Advances in Neural Information Processing Systems, 37:124198–124235, 2024

  61. [61]

    V-DPO: Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization

    Yuxi Xie, Guanzhen Li, Xiao Xu, and Min-Yen Kan. V-DPO: Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 13258–13273, 2024

  62. [62]

    Modality-Fair Preference Optimization for Trustworthy MLLM Alignment.arXiv preprint arXiv:2410.15334, 2024

    Songtao Jiang, Yan Zhang, Ruizhe Chen, Tianxiang Hu, Yeying Jin, Qinglin He, Yang Feng, Jian Wu, and Zuozhu Liu. Modality-Fair Preference Optimization for Trustworthy MLLM Alignment.arXiv preprint arXiv:2410.15334, 2024

  63. [63]

    mDPO: Conditional Preference Optimization for Multimodal Large Language Models

    Fei Wang, Wenxuan Zhou, James Y Huang, Nan Xu, Sheng Zhang, Hoifung Poon, and Muhao Chen. mDPO: Conditional Preference Optimization for Multimodal Large Language Models. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8078–8088, 2024

  64. [64]

    Mitigating Hallucination Through Theory-Consistent Symmetric Multimodal Preference Optimization.arXiv preprint arXiv:2506.11712, 2025

    Wenqi Liu, Xuemeng Song, Jiaxi Li, Yinwei Wei, Na Zheng, Jianhua Yin, and Liqiang Nie. Mitigating Hallucination Through Theory-Consistent Symmetric Multimodal Preference Optimization.arXiv preprint arXiv:2506.11712, 2025

  65. [65]

    RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

    Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, et al. RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13807–13816, 2024

  66. [66]

    Fu, X., Hu, Y ., Li, B., Feng, Y ., Wang, H., Lin, X., Roth, D., Smith, N

    Lehan He, Zeren Chen, Zhelun Shi, Tianyu Yu, Jing Shao, and Lu Sheng. Systematic Reward Gap Optimization for Mitigating VLM Hallucinations.arXiv preprint arXiv:2411.17265, 2024

  67. [67]

    Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization

    Renjie Pi, Tianyang Han, Wei Xiong, Jipeng Zhang, Runtao Liu, Rui Pan, and Tong Zhang. Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization. InEuropean Conference on Computer Vision, pages 382–398. Springer, 2024

  68. [68]

    Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. InEuropean conference on computer vision, pages 38–55. Springer, 2024. 13

  69. [69]

    YOLO-World: Real-Time Open-V ocabulary Object Detection

    Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. YOLO-World: Real-Time Open-V ocabulary Object Detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16901–16911, 2024

  70. [70]

    Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data.arXiv preprint arXiv:2404.14367, 2024

    Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, and Aviral Kumar. Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data.arXiv preprint arXiv:2404.14367, 2024

  71. [71]

    Self-instruct: Aligning language models with self-generated instructions

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 13484–13508, 2023

  72. [72]

    Self-Rewarding Language Models

    Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-Rewarding Language Models.arXiv preprint arXiv:2401.10020, 2024

  73. [73]

    Self-Play Preference Optimization for Language Model Alignment.arXiv preprint arXiv:2405.00675, 2024

    Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, and Quanquan Gu. Self-Play Preference Optimization for Language Model Alignment.arXiv preprint arXiv:2405.00675, 2024

  74. [74]

    Self-Boosting Large Language Models with Synthetic Preference Data.arXiv preprint arXiv:2410.06961, 2024

    Qingxiu Dong, Li Dong, Xingxing Zhang, Zhifang Sui, and Furu Wei. Self-Boosting Large Language Models with Synthetic Preference Data.arXiv preprint arXiv:2410.06961, 2024

  75. [75]

    DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models

    Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, and Pengcheng He. DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models.arXiv preprint arXiv:2309.03883, 2023

  76. [76]

    OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

    Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13418–13427, 2024

  77. [77]

    EFUF: Efficient Fine-grained Unlearning Framework for Mitigating Hallucinations in Multimodal Large Language Models

    Shangyu Xing, Fei Zhao, Zhen Wu, Tuo An, Weihao Chen, Chunhui Li, Jianbing Zhang, and Xinyu Dai. EFUF: Efficient Fine-grained Unlearning Framework for Mitigating Hallucinations in Multimodal Large Language Models. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1167–1181, 2024

  78. [78]

    Qwen2.5-VL, January 2025

    Qwen Team. Qwen2.5-VL, January 2025. URLhttps://qwenlm.github.io/blog/qwen2.5-vl/

  79. [79]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. InForty-first international conference on machine learning, 2024

  80. [80]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency.arXiv preprint arXiv:2508.18265, 2025

Showing first 80 references.