arxiv: 2507.12455 · v3 · pith:27KSWOPAnew · submitted 2025-07-16 · 💻 cs.CV

Mitigating Object Hallucinations via Sentence-Level Early Intervention

Pith reviewed 2026-05-25 08:29 UTC · model grok-4.3

classification 💻 cs.CV
keywords object hallucination · multimodal large language models · preference learning · sentence-level intervention · vision-language models · hallucination mitigation · direct preference optimization

The pith

Sentence-level preference training from detector-validated pairs cuts object hallucinations in multimodal models by over 90 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that object hallucinations in multimodal large language models arise mostly in the opening sentences of a response and then spread. It shows that these early errors can be targeted by automatically building preference data: the model generates candidate sentences, two open-vocabulary detectors check which mentioned objects are actually present, and each sentence is labeled hallucinated or accurate. These pairs are then used to train the model with a context-aware preference objective (C-DPO) that rewards correct early outputs and discourages fabricated objects. The paper reports that the resulting models improve on both hallucination benchmarks and general capability tests without human labels, and that training on the model's own outputs avoids the distribution mismatch between training data and model outputs.

Core claim

Hallucinations predominantly emerge at the early stages of text generation and propagate through subsequent outputs. By iteratively sampling model outputs, validating object existence through cross-checking with two open-vocabulary detectors, and classifying sentences into hallucinated or non-hallucinated categories, high-quality in-domain preference pairs can be bootstrapped without human annotations. Training with a context-aware preference loss (C-DPO) that emphasizes discriminative learning at the sentence level where hallucinations initially manifest then suppresses the fabrication of objects from the first sentence onward.
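The labeling step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name, the `Label` enum, and the rule for disagreements (treat the sentence as uncertain and leave it out) are assumptions made for the example, and the detector outputs are represented simply as sets of object names.

```python
from enum import Enum


class Label(Enum):
    NON_HALLUCINATED = "non_hallucinated"
    HALLUCINATED = "hallucinated"
    UNCERTAIN = "uncertain"


def label_sentence(objects, detector_a_found, detector_b_found):
    """Label one sentence from the objects it mentions.

    objects: object names extracted from the sentence (hypothetical input).
    detector_a_found / detector_b_found: sets of object names that each
    detector reports as present in the image.
    """
    if not objects:
        return Label.UNCERTAIN  # nothing to verify
    any_disputed = False
    for obj in objects:
        in_a = obj in detector_a_found
        in_b = obj in detector_b_found
        if not in_a and not in_b:
            # Assumed rule: both detectors miss the object -> hallucination.
            return Label.HALLUCINATED
        if in_a != in_b:
            # Assumed rule: detectors disagree -> do not trust either verdict.
            any_disputed = True
    return Label.UNCERTAIN if any_disputed else Label.NON_HALLUCINATED
```

Under these assumed rules, a sentence is used as a positive only when every mentioned object is confirmed by both detectors, which trades data volume for label precision.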

What carries the argument

The SENTINEL framework carries the argument: it creates context-coherent positive and hallucinated negative sentence pairs via detector cross-validation, then optimizes on them with context-aware direct preference optimization (C-DPO) focused on early generation steps.
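This page does not reproduce the C-DPO equation, so the following is only an orientation aid. The first line is the standard DPO loss; the second is an assumed context-conditioned form in which the prompt is the image $v$, the question $q$, and the previously generated context $c$, and the compared outputs are single sentences. The exact conditioning and any masking of context tokens in the paper may differ.

```latex
% Standard DPO loss for prompt x, preferred output y_w, rejected output y_l
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]

% Assumed sentence-level, context-conditioned variant with x = (v, q, c)
\mathcal{L}_{\mathrm{C\text{-}DPO}}
  = -\,\mathbb{E}_{(v,\,q,\,c,\,y_w^{+},\,y_l)}\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w^{+} \mid v, q, c)}{\pi_{\mathrm{ref}}(y_w^{+} \mid v, q, c)}
        - \beta \log \frac{\pi_\theta(y_l \mid v, q, c)}{\pi_{\mathrm{ref}}(y_l \mid v, q, c)}
      \right)
    \right]
```

Here $\pi_\theta$ is the model under training, $\pi_{\mathrm{ref}}$ is the frozen reference model, $\sigma$ is the logistic function, and $\beta$ is the usual DPO temperature.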

If this is right

  • Models trained with SENTINEL reduce hallucinations by more than 90 percent relative to the base model while outperforming prior state-of-the-art methods on both hallucination and general capability benchmarks.
  • No human-annotated preference data is required because the training pairs are generated automatically through detector validation.
  • Focusing the preference loss at the sentence level prevents errors from propagating through later parts of the response.
  • The iterative bootstrapping process produces progressively higher-quality preference data as the model improves.
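As a quick arithmetic check of the "more than 90 percent" figure, the relative reduction can be computed from the Object HalBench rates quoted in the table text attached to Figure 7 (52.7 to 4.3 at response level, 27.9 to 2.6 at mention level). The helper name is illustrative.

```python
def relative_reduction(base, new):
    """Fractional drop from base to new; 0.9 means a 90% reduction."""
    if base <= 0:
        raise ValueError("base rate must be positive")
    return (base - new) / base


# Object HalBench rates (percent) for LLaVA-v1.5-7B and the SENTINEL model.
resp = relative_reduction(52.7, 4.3)  # response-level hallucination rate
ment = relative_reduction(27.9, 2.6)  # mention-level hallucination rate
print(f"Resp.: {resp:.1%}, Ment.: {ment:.1%}")  # Resp.: 91.8%, Ment.: 90.7%
```

Both numbers are relative reductions on one benchmark for one base model; they say nothing by themselves about variance across runs.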

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same early-sentence labeling and preference approach could be applied to other hallucination types such as incorrect attributes or spatial relations.
  • If more accurate or specialized detectors become available, the quality of the automatically generated preference data would increase and further reduce residual hallucinations.
  • The method's independence from human labels makes it practical to retrain models periodically as new vision backbones improve detector reliability.

Load-bearing premise

Cross-checking model outputs against two open-vocabulary detectors accurately classifies sentences as hallucinated or non-hallucinated without introducing new systematic errors or distribution shifts.

What would settle it

A collection of generated sentences where the two detectors label an object as present but a careful human review finds it absent (or the reverse) at a high rate would show the classification step is unreliable.
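The check described above amounts to comparing detector verdicts with human verdicts on the same objects. A minimal sketch, assuming each verdict is a boolean meaning "object is present" and using hypothetical function and key names:

```python
def detector_error_rates(detector_present, human_present):
    """Compare detector verdicts with human verdicts for the same objects.

    Both arguments are equal-length lists of booleans, where True means
    "the object is present in the image". Human labels are treated as truth.
    """
    if len(detector_present) != len(human_present):
        raise ValueError("label lists must have the same length")
    if not human_present:
        raise ValueError("need at least one labeled object")
    pairs = list(zip(detector_present, human_present))
    false_present = sum(d and not h for d, h in pairs)  # detector says present, human says absent
    false_absent = sum(h and not d for d, h in pairs)   # detector says absent, human says present
    total = len(pairs)
    return {
        "false_present_rate": false_present / total,
        "false_absent_rate": false_absent / total,
        "disagreement_rate": (false_present + false_absent) / total,
    }
```

Reporting the two error directions separately matters here: false "present" verdicts let hallucinated sentences into the positive set, while false "absent" verdicts penalize correct descriptions.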

Figures

Figures reproduced from arXiv: 2507.12455 by Li Jiang, Senqiao Yang, Shangpin Peng, Zhuotao Tian.

Figure 1
Comparative analysis of data construction strategies for hallucination mitigation in MLLMs. Our proposed approach demonstrates superior efficiency and effectiveness in generating high-quality, domain-specific preference learning datasets, offering a robust solution for reducing hallucination in MLLMs.
Figure 2
Object position distribution in MLLM hallucination analysis. (a) illustrates the progressive deterioration of hallucination effects in Multimodal Large Language Models (MLLMs) with increasing description length in the image captioning task, while (b) demonstrates the effectiveness of early-stage intervention in mitigating the propagation of hallucination.
Figure 3
The overview of SENTINEL. The proposed SENTINEL takes six essential steps: (1) Generate multiple in-domain responses conditioned on the input image, prompt, and context c. (2) Identify and extract all mentioned objects from each generated sentence. (3) Use two object detectors to validate the existence of extracted objects through cross-referencing. (4) Categorize generated sentences into hallucinate…
Figure 4
Categories of in-domain candidates. The in-domain candidates fall into three types. Employing non-hallucinated, context-coherent descriptions ($y_w^{+}$) as positive samples, paired with hallucinated descriptions ($y_l$), enhances the model’s generalization performance and robustness.
Figure 7
Qualitative results of SENTINEL. Our method can effectively eliminate hallucinations in MLLMs while enhancing the model’s general capabilities.

Table extracted alongside the caption:
Method | Object HalBench Resp. ↓ | Object HalBench Ment. ↓ | AMBER Acc ↑ | AMBER F1 ↑ | MM-Vet Overall ↑
LLaVA-v1.5-7B | 52.7 | 27.9 | 71.5 | 74.1 | 31.1
Ours (8.6K $(y_w^{+}, y_l)$) | 4.3 | 2.6 | 76.1 | 79.3 | 32.6
Ours (8.6K rewritten $(y_w^{+}, y_l)$) | 4.8 (↑0.5) | 2.9 (↑0.3) | 75.0 (↓1.1) | 78.0 (↓1.3) | 31.3 (↓1.3)
Figure 8
Impact of training data quantity on hallucination rate in Object HalBench [55]. The results show that SENTINEL demonstrates better efficiency, effectiveness, and scalability, while effectively reducing hallucination rates across varying data scales.
Figure 9
Effect of intermediate hallucination mitigation on subsequent generations. The results show the effectiveness of early-stage intervention in mitigating the propagation of hallucinations.
Figure 10
Time cost analysis of decode-based methods. Decode-based early intervention increases inference time, primarily due to the additional generation steps required by MLLM sampling, whereas the object detector remains highly efficient.
Figure 11
Comparison between C-DPO and standard DPO during model training. The proposed C-DPO promotes more stable gradient updates, enhancing training stability.
Figure 12
Impact of rewriting on the training process. Training with rewritten data fails to achieve the same level of convergence, resulting in higher final loss and weaker differentiation between positive and negative samples, demonstrating the necessity of in-domain training data.
Figure 13
Comparing general image description results between SENTINEL and its base model LLaVA-v1.5-7B. Our method effectively mitigates hallucinations while enhancing the general performance of the base model, providing a more detailed description.
Figure 14
Comparing detailed image description results between SENTINEL and its base model LLaVA-v1.5-7B. Our method effectively mitigates hallucinations while enhancing the general performance of the base model, providing a more detailed description.
Figure 15
Comparing visual question answering results between SENTINEL and LLaVA-v1.5-7B. Our method effectively mitigates hallucinations while enhancing the general performance of the base model, leading to more accurate and detailed answers.
Figure 16
Comparing general image descriptions between SENTINEL and its base model LLaVA-v1.5-13B. Our method effectively mitigates hallucinations while enhancing the general performance of the base model, providing a more detailed description.
Figure 17
Comparing detailed image descriptions between SENTINEL and its base model LLaVA-v1.5-13B. Our method effectively mitigates hallucinations while enhancing the general performance of the base model, providing a more detailed description.
Figure 18
Comparing visual question answering between SENTINEL and its base model LLaVA-v1.5-13B. Our method effectively mitigates hallucinations while enhancing the general performance of the base model, leading to more accurate answers.
Original abstract

Multimodal large language models (MLLMs) have revolutionized cross-modal understanding but continue to struggle with hallucinations - fabricated content contradicting visual inputs. Existing hallucination mitigation methods either incur prohibitive computational costs or introduce distribution mismatches between training data and model outputs. We identify a critical insight: hallucinations predominantly emerge at the early stages of text generation and propagate through subsequent outputs. To address this, we propose SENTINEL (Sentence-level Early iNtervention Through IN-domain prEference Learning), a framework that eliminates dependency on human annotations. Specifically, we first bootstrap high-quality in-domain preference pairs by iteratively sampling model outputs, validating object existence through cross-checking with two open-vocabulary detectors, and classifying sentences into hallucinated/non-hallucinated categories. Subsequently, we use context-coherent positive samples and hallucinated negative samples to build context-aware preference data iteratively. Finally, we train models using a context-aware preference loss (C-DPO) that emphasizes discriminative learning at the sentence level where hallucinations initially manifest. Experimental results show that SENTINEL can reduce hallucinations by over 90% compared to the original model and outperforms the previous state-of-the-art method on both hallucination benchmarks and general capabilities benchmarks, demonstrating its superiority and generalization ability. The models, datasets, and code are available at https://github.com/pspdada/SENTINEL.
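The iterative, context-aware data construction in the abstract can be sketched as a loop over sentence positions. This is an illustration under stated assumptions, not the released implementation: `sample_sentences` and `label_fn` are hypothetical callables standing in for the MLLM sampler and the detector-based labeler, and extending the context with a verified sentence is one reading of "context-coherent positive samples".

```python
def build_preference_pairs(sample_sentences, label_fn, max_sentences=3):
    """Collect (context, chosen, rejected) triples sentence by sentence.

    sample_sentences(context) -> list of candidate next sentences (hypothetical).
    label_fn(sentence) -> "ok", "hallucinated", or "uncertain" (hypothetical).
    """
    context = []
    pairs = []
    for _ in range(max_sentences):
        candidates = sample_sentences(tuple(context))
        positives = [s for s in candidates if label_fn(s) == "ok"]
        negatives = [s for s in candidates if label_fn(s) == "hallucinated"]
        if not positives:
            break  # no verified sentence to extend the context with
        if negatives:
            pairs.append({
                "context": " ".join(context),
                "chosen": positives[0],
                "rejected": negatives[0],
            })
        # Assumption: continue from a verified sentence so the context
        # itself stays free of hallucinated objects.
        context.append(positives[0])
    return pairs
```

Because every candidate comes from the model's own sampler, the chosen and rejected sentences stay in the model's output distribution, which is the property the abstract contrasts with rewritten data.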

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SENTINEL, a framework to mitigate object hallucinations in MLLMs by intervening at the sentence level during early text generation. It bootstraps in-domain preference pairs without human annotations by iteratively sampling model outputs and classifying sentences as hallucinated or non-hallucinated via cross-checking against two open-vocabulary detectors; these pairs then train the model with a context-aware preference loss (C-DPO). The abstract claims this yields over 90% hallucination reduction versus the base model and outperforms prior SOTA on both hallucination and general capability benchmarks.

Significance. If the detector-based labeling accurately reflects true hallucinations, the method offers a scalable, annotation-free route to preference data that targets the early emergence of errors, with potential to improve MLLM reliability in grounded applications. The public release of models, datasets, and code at the cited GitHub repository is a clear strength for reproducibility and follow-on work.

major comments (2)
  1. [§3] §3 (bootstrapping procedure): The central pipeline classifies sentences using cross-checks against two open-vocabulary detectors, yet reports no inter-detector agreement statistics, human validation of labels, or ablation on detector choice. This is load-bearing for the >90% reduction claim, because systematic mislabeling (e.g., missing context-dependent fabrications or over-flagging valid descriptions) would cause C-DPO to optimize for detector agreement rather than visual grounding.
  2. [§4] Abstract and §4 (results): The reported 'over 90% reduction' and benchmark outperformance are presented without measurement-protocol details, statistical significance tests, variance across runs, or controls isolating detector-induced bias. Given that preference-pair quality directly determines the training signal, these omissions prevent assessment of whether the gains are robust or artifactual.
minor comments (2)
  1. The abstract refers to 'general capabilities benchmarks' without naming them; adding the specific datasets (e.g., VQAv2, GQA) would improve clarity.
  2. Notation for the C-DPO loss could be expanded with an explicit equation showing how sentence-level context is incorporated, to aid reproducibility.
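Major comment 1 asks for inter-detector agreement statistics. One common choice is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a plain-Python version for two equal-length label lists; the function name and the present/absent encoding are assumptions for illustration.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal rate, per category.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)
```

For example, with ten objects where the detectors agree on eight and each marks five as present, observed agreement is 0.8, chance agreement is 0.5, and kappa is 0.6.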

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will incorporate revisions to improve transparency and rigor.

Point-by-point responses
  1. Referee: [§3] §3 (bootstrapping procedure): The central pipeline classifies sentences using cross-checks against two open-vocabulary detectors, yet reports no inter-detector agreement statistics, human validation of labels, or ablation on detector choice. This is load-bearing for the >90% reduction claim, because systematic mislabeling (e.g., missing context-dependent fabrications or over-flagging valid descriptions) would cause C-DPO to optimize for detector agreement rather than visual grounding.

    Authors: We agree that inter-detector agreement statistics, human validation of labels, and ablations on detector choice are important for validating the bootstrapping pipeline. These elements were omitted from the original submission. In the revised manuscript we will add: (i) agreement metrics (e.g., percentage agreement and Cohen’s kappa) between the two detectors, (ii) human validation results on a random sample of 200 labeled sentences, and (iii) an ablation comparing SENTINEL performance when using each detector individually versus the cross-check. These additions will directly address concerns about potential systematic mislabeling. revision: yes

  2. Referee: [§4] Abstract and §4 (results): The reported 'over 90% reduction' and benchmark outperformance are presented without measurement-protocol details, statistical significance tests, variance across runs, or controls isolating detector-induced bias. Given that preference-pair quality directly determines the training signal, these omissions prevent assessment of whether the gains are robust or artifactual.

    Authors: We acknowledge the need for more complete reporting. The revised version will expand §4 to include: detailed measurement protocols for all benchmarks, statistical significance tests (paired t-tests with p-values) on the main results, standard deviations across at least three independent training runs, and a control experiment that trains on a human-labeled subset to quantify any detector-induced bias. These changes will allow readers to better evaluate the robustness of the reported gains. revision: yes
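The statistics promised in the second response (standard deviations across runs and paired t-tests) can be computed with the standard library. This is a minimal sketch with hypothetical function names; it returns the t statistic and degrees of freedom only, and turning those into a p-value requires the Student t distribution (for instance via SciPy's `scipy.stats.ttest_rel`).

```python
import math
import statistics


def summarize_runs(scores):
    """Mean and sample standard deviation of one metric across training runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(scores_a, scores_b):
    """t statistic and degrees of freedom for paired samples.

    scores_a[i] and scores_b[i] must come from the same item or run seed.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    n = len(diffs)
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

With only three training runs the test has two degrees of freedom, so even large effects yield wide confidence intervals; the pairing should therefore be done over benchmark items or matched seeds where possible.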

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper describes an empirical pipeline: iterative sampling of model outputs, labeling via two external open-vocabulary detectors to create preference pairs, and training with a context-aware C-DPO loss. No equations, self-citations, or uniqueness theorems are invoked that reduce the claimed >90% hallucination reduction or benchmark gains to fitted parameters or inputs by construction. Evaluation occurs on independent hallucination and capability benchmarks, keeping the central result externally falsifiable rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that hallucinations appear early and that detector validation is sufficiently accurate; no explicit free parameters or new invented entities are described in the abstract.

axioms (1)
  • Domain assumption: hallucinations predominantly emerge at the early stages of text generation and propagate through subsequent outputs.
    This insight, stated in the abstract, directly motivates the sentence-level intervention design.
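One way to probe this assumption on any labeled sample is to tabulate the hallucination rate by sentence position. The sketch below assumes each response has already been split into sentences and flagged (for example by the detector cross-check); the function name and input format are illustrative.

```python
def hallucination_rate_by_position(responses):
    """Fraction of hallucinated sentences at each sentence position.

    responses: list of responses, each a list of booleans where True means
    the sentence at that position contains a hallucinated object.
    """
    counts, totals = [], []
    for sentence_flags in responses:
        for i, flag in enumerate(sentence_flags):
            if i == len(totals):
                # First time this position is seen: start its counters.
                totals.append(0)
                counts.append(0)
            totals[i] += 1
            counts[i] += int(flag)
    return [c / t for c, t in zip(counts, totals)]
```

Later positions are covered by fewer responses, so their rates rest on smaller samples and should be read with that in mind.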


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI · 2026-04 · unverdicted · novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  2. Mitigating Multimodal Hallucination via Phase-wise Self-reward

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    PSRD mitigates visual hallucinations in LVLMs via phase-wise self-reward decoding, cutting rates by 50% on LLaVA-1.5-7B and outperforming prior methods on five benchmarks.

Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · cited by 2 Pith papers · 23 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

  2. [2]

    Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023.

  3. [3]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  4. [4]

    Hallucination of Multimodal Large Language Models: A Survey

    Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930, 2024.

  5. [5]

    Driving with LLMs: Fusing object-level vector modality for explainable autonomous driving

    Long Chen, Oleg Sinavski, Jan Hünermann, Alice Karnsund, Andrew James Willmott, Danny Birch, Daniel Maund, and Jamie Shotton. Driving with LLMs: Fusing object-level vector modality for explainable autonomous driving. In 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024.

  6. [6]

    HALC: Object hallucination reduction via adaptive focal-contrast decoding

    Zhaorun Chen, Zhuokai Zhao, Hongyin Luo, Huaxiu Yao, Bo Li, and Jiawei Zhou. HALC: Object hallucination reduction via adaptive focal-contrast decoding. arXiv preprint arXiv:2403.00425, 2024.

  7. [7]

    Yolo-world: Real-time open-vocabulary object detection

    Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. YOLO-World: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

  8. [8]

    DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models

    Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, and Pengcheng He. DoLa: Decoding by contrasting layers improves factuality in large language models. arXiv preprint arXiv:2309.03883, 2023.

  9. [9]

    InstructBLIP: Towards general-purpose vision-language models with instruction tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning.

  10. [10]

    Making the V in VQA matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

  11. [11]

    Mask-dpo: Generalizable fine-grained factuality alignment of llms

    Yuzhe Gu, Wenwei Zhang, Chengqi Lyu, Dahua Lin, and Kai Chen. Mask-DPO: Generalizable fine-grained factuality alignment of LLMs. arXiv preprint arXiv:2503.02846, 2025.

  12. [12]

    HallusionBench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

    Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. HallusionBench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition...

  13. [13]

    Detecting and preventing hallucinations in large vision language models

    Anisha Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024.

  14. [14]

    Visual perturbation-aware collaborative learning for overcoming the language prior problem

    Yudong Han, Liqiang Nie, Jianhua Yin, Jianlong Wu, and Yan Yan. Visual perturbation-aware collaborative learning for overcoming the language prior problem. arXiv preprint arXiv:2207.11850, 2022.

  15. [15]

    Skip \n: A simple method to reduce hallucination in large vision-language models

    Zongbo Han, Zechen Bai, Haiyang Mei, Qianli Xu, Changqing Zhang, and Mike Zheng Shou. Skip \n: A simple method to reduce hallucination in large vision-language models. arXiv preprint arXiv:2402.01345, 2024.

  16. [16]

    A topic-level self-correctional approach to mitigate hallucinations in MLLMs

    Lehan He, Zeren Chen, Zhelun Shi, Tianyu Yu, Jing Shao, and Lu Sheng. A topic-level self-correctional approach to mitigate hallucinations in MLLMs. arXiv preprint arXiv:2411.17265, 2024.

  17. [17]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR.

  18. [18]

    Advancing medical imaging with language models: A journey from n-grams to ChatGPT

    Mingzhe Hu, Shaoyan Pan, Yuheng Li, and Xiaofeng Yang. Advancing medical imaging with language models: A journey from n-grams to ChatGPT. arXiv preprint arXiv:2304.04920, 2023.

  19. [19]

    OPERA: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation

    Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. OPERA: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

  20. [20]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.

  21. [21]

    FGAIF: Aligning large vision-language models with fine-grained AI feedback

    Liqiang Jing and Xinya Du. FGAIF: Aligning large vision-language models with fine-grained AI feedback. arXiv preprint arXiv:2404.05046, 2024.

  22. [22]

    FaithScore: Fine-grained evaluations of hallucinations in large vision-language models

    Liqiang Jing, Ruosen Li, Yunmo Chen, and Xinya Du. FaithScore: Fine-grained evaluations of hallucinations in large vision-language models. arXiv preprint arXiv:2311.01477.

  23. [23]

    Visual genome: Connecting language and vision using crowdsourced dense image annotations

    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision.

  24. [24]

    Lisa: Reasoning segmentation via large language model

    Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. LISA: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024.

  25. [25]

    Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

    Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, and Jiaya Jia. Step-DPO: Step-wise preference optimization for long-chain reasoning of LLMs. arXiv preprint arXiv:2406.18629, 2024.

  26. [26]

    Volcano: mitigating multimodal hallucination through self-feedback guided revision

    Seongyun Lee, Sue Hyun Park, Yongrae Jo, and Minjoon Seo. Volcano: mitigating multimodal hallucination through self-feedback guided revision. arXiv preprint arXiv:2311.07362, 2023.

  27. [27]

    Mitigating object hallucinations in large vision-language models through visual contrastive decoding

    Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

  28. [28]

    Silkie: Preference distillation for large visual language models

    Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, and Lingpeng Kong. Silkie: Preference distillation for large visual language models. arXiv preprint arXiv:2312.10665, 2023.

  29. [29]

    Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

    Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-Gemini: Mining the potential of multi-modality vision language models. arXiv:2403.18814, 2023.

  30. [30]

    FACTUAL: A benchmark for faithful and consistent textual scene graph parsing

    Zhuang Li, Yuyang Chai, Terry Yue Zhuo, Lizhen Qu, Gholamreza Haffari, Fei Li, Donghong Ji, and Quan Hung Tran. FACTUAL: A benchmark for faithful and consistent textual scene graph parsing. arXiv preprint arXiv:2305.17497.

  31. [31]

    FLAME: Factuality-aware alignment for large language models

    Sheng-Chieh Lin, Luyu Gao, Barlas Oguz, Wenhan Xiong, Jimmy Lin, Scott Yih, and Xilun Chen. FLAME: Factuality-aware alignment for large language models. Advances in Neural Information Processing Systems, 2024.

  32. [32]

    Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Mitigating hallucination in large multi-modal models via robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023.

  33. [33]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 2023.

  34. [34]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

  35. [35]

    LLaVA-NeXT: Improved reasoning, OCR, and world knowledge

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024.

  36. [36]

    A Survey on Hallucination in Large Vision-Language Models

    Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253, 2024.

  37. [37]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In European Conference on Computer Vision, 2024.

  38. [38]

    Typicalness-aware learning for failure detection

    Yijun Liu, Jiequan Cui, Zhuotao Tian, Senqiao Yang, Qingdong He, Xiaoling Wang, and Jingyong Su. Typicalness-aware learning for failure detection. arXiv preprint arXiv:2411.01981, 2024.

  39. [39]

    NLTK: The Natural Language Toolkit

    Edward Loper and Steven Bird. NLTK: The natural language toolkit. arXiv preprint cs/0205028, 2002.

  40. [40]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

  41. [41]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 2022. 2, 6, 5, 9

  42. [42]

    Simpo: Simple preference optimization with a reference-free reward

    Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward. Advances in Neural Information Processing Systems, 2024. 9

  43. [43]

    Counterfactual vqa: A cause-effect look at language bias

    Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, and Ji-Rong Wen. Counterfactual vqa: A cause-effect look at language bias. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition,

  44. [44]

    GPT-4V(ision) system card, 2023

    OpenAI. GPT-4V(ision) system card, 2023. 1, 6, 9

  45. [45]

    Clip-dpo: Vision-language models as a source of preference for fixing hallucinations in lvlms

    Yassine Ouali, Adrian Bulat, Brais Martinez, and Georgios Tzimiropoulos. Clip-dpo: Vision-language models as a source of preference for fixing hallucinations in lvlms. In European Conference on Computer Vision, 2024. 6

  46. [46]

    Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

    Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, and Colin White. Smaug: Fixing failure modes of preference optimisation with dpo-positive. arXiv preprint arXiv:2402.13228, 2024. 9

  47. [47]

    Omni-dpo: A dual-perspective paradigm for dynamic preference learning of llms

    Shangpin Peng, Weinong Wang, Zhuotao Tian, Senqiao Yang, Xing Wu, Haotian Xu, Chengquan Zhang, Takashi Isobe, Baotian Hu, and Min Zhang. Omni-dpo: A dual-perspective paradigm for dynamic preference learning of llms. arXiv preprint arXiv:2506.10054, 2025. 9

  48. [48]

    Does your vision-language model get lost in the long video sampling dilemma?

    Tianyuan Qu, Longxiang Tang, Bohao Peng, Senqiao Yang, Bei Yu, and Jiaya Jia. Does your vision-language model get lost in the long video sampling dilemma? arXiv preprint arXiv:2503.12496, 2025. 9

  49. [49]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, 2021. 3

  50. [50]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 9

  51. [51]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 2023. 2, 4, 9

  52. [52]

    Zero: Memory optimizations toward training trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020. 3

  53. [53]

    A Survey of Hallucination in Large Foundation Models

    Vipula Rawte, Amit Sheth, and Amitava Das. A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922, 2023. 1, 2, 9

  54. [55]

    Object Hallucination in Image Captioning

    Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. arXiv preprint arXiv:1809.02156, 2018. 2, 6, 8, 4, 5, 7, 9

  55. [56]

    Data-augmented phrase-level alignment for mitigating object hallucination

    Pritam Sarkar, Sayna Ebrahimi, Ali Etemad, Ahmad Beirami, Sercan Ö Arık, and Tomas Pfister. Data-augmented phrase-level alignment for mitigating object hallucination. arXiv preprint arXiv:2405.18654, 2024. 6, 7

  56. [57]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. 2

  57. [58]

    Explore the potential of clip for training-free open vocabulary semantic segmentation

    Tong Shao, Zhuotao Tian, Hang Zhao, and Jingyong Su. Explore the potential of clip for training-free open vocabulary semantic segmentation. In European Conference on Computer Vision, pages 139–156. Springer, 2024. 9

  58. [59]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019. 2, 6, 8, 4, 5, 9

  59. [60]

    Fine-tuning language models for factuality

    Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D Manning, and Chelsea Finn. Fine-tuning language models for factuality. In The Twelfth International Conference on Learning Representations, 2023. 2

  60. [61]

    Learning shape-aware embedding for scene text detection

    Zhuotao Tian, Michelle Shu, Pengyuan Lyu, Ruiyu Li, Chao Zhou, Xiaoyong Shen, and Jiaya Jia. Learning shape-aware embedding for scene text detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4234–4243, 2019. 9

  61. [62]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. 2

  62. [63]

    AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

    Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, et al. Amber: An llm-free multi-dimensional benchmark for mllms hallucination evaluation. arXiv preprint arXiv:2311.07397, 2023. 2, 6, 7, 8, 5

  63. [64]

    Declip: Decoupled learning for open-vocabulary dense perception

    Junjie Wang, Bin Chen, Yulin Li, Bin Kang, Yichi Chen, and Zhuotao Tian. Declip: Decoupled learning for open-vocabulary dense perception. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14824–14834, 2025. 9

  64. [65]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 1, 9

  65. [66]

    Noiseboost: Alleviating hallucination with noise perturbation for multimodal large language models

    Kai Wu, Boyuan Jiang, Zhengkai Jiang, Qingdong He, Donghao Luo, Shengzhi Wang, Qingwen Liu, and Chengjie Wang. Noiseboost: Alleviating hallucination with noise perturbation for multimodal large language models. arXiv preprint arXiv:2405.20081, 2024. 6

  66. [67]

    Overcoming language priors in visual question answering via distinguishing superficially similar instances

    Yike Wu, Yu Zhao, Shiwan Zhao, Ying Zhang, Xiaojie Yuan, Guoqing Zhao, and Ning Jiang. Overcoming language priors in visual question answering via distinguishing superficially similar instances. In Proceedings of the 29th International Conference on Computational Linguistics, 2022. 9

  67. [68]

    Embodied task planning with large language models

    Zhenyu Wu, Ziwei Wang, Xiuwei Xu, Jiwen Lu, and Haibin Yan. Embodied task planning with large language models. arXiv preprint arXiv:2307.01848, 2023. 1

  68. [69]

    Detecting and mitigating hallucination in large vision language models via fine-grained ai feedback

    Wenyi Xiao, Ziwei Huang, Leilei Gan, Wanggui He, Haoyuan Li, Zhelun Yu, Fangxun Shu, Hao Jiang, and Linchao Zhu. Detecting and mitigating hallucination in large vision language models via fine-grained ai feedback. arXiv preprint arXiv:2404.14233, 2024. 2, 3, 6, 7, 9

  69. [70]

    Efuf: Efficient fine-grained unlearning framework for mitigating hallucinations in multimodal large language models

    Shangyu Xing, Fei Zhao, Zhen Wu, Tuo An, Weihao Chen, Chunhui Li, Jianbing Zhang, and Xinyu Dai. Efuf: Efficient fine-grained unlearning framework for mitigating hallucinations in multimodal large language models. arXiv preprint arXiv:2402.09801, 2024. 6, 7, 9

  70. [71]

    Lidar-llm: Exploring the potential of large language models for 3d lidar understanding

    Senqiao Yang, Jiaming Liu, Ray Zhang, Mingjie Pan, Zoey Guo, Xiaoqi Li, Zehui Chen, Peng Gao, Yandong Guo, and Shanghang Zhang. Lidar-llm: Exploring the potential of large language models for 3d lidar understanding. arXiv preprint arXiv:2312.14074, 2023. 9

  71. [72]

    An improved baseline for reasoning segmentation with large language model

    Senqiao Yang, Tianyuan Qu, Xin Lai, Zhuotao Tian, Bohao Peng, Shu Liu, and Jiaya Jia. An improved baseline for reasoning segmentation with large language model. arXiv preprint arXiv:2312.17240, 2023

  72. [73]

    Visionzip: Longer is better but not necessary in vision language models

    Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. arXiv preprint arXiv:2412.04467, 2024. 9

  73. [74]

    Unified language-driven zero-shot domain adaptation

    Senqiao Yang, Zhuotao Tian, Li Jiang, and Jiaya Jia. Unified language-driven zero-shot domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23407–23415, 2024. 9

  74. [75]

    Woodpecker: Hallucination correction for multimodal large language models

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, and Enhong Chen. Woodpecker: Hallucination correction for multimodal large language models. Science China Information Sciences,

  75. [76]

    Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback

    Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, et al. Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 2, 6, 7

  76. [77]

    Rlaif-v: Aligning mllms through open-source ai feedback for super gpt-4v trustworthiness

    Tianyu Yu, Haoye Zhang, Yuan Yao, Yunkai Dang, Da Chen, Xiaoman Lu, Ganqu Cui, Taiwen He, Zhiyuan Liu, Tat-Seng Chua, et al. Rlaif-v: Aligning mllms through open-source ai feedback for super gpt-4v trustworthiness. arXiv preprint arXiv:2405.17220, 2024. 3, 6, 5, 7, 9

  77. [78]

    MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

    Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023. 2, 6, 8, 4, 9

  78. [79]

    Less is more: Mitigating multimodal hallucination from an eos decision perspective

    Zihao Yue, Liang Zhang, and Qin Jin. Less is more: Mitigating multimodal hallucination from an eos decision perspective. arXiv preprint arXiv:2402.14545, 2024. 6, 1

  79. [80]

    Automated multi-level preference for mllms

    Mengxi Zhang, Wenhao Wu, Yu Lu, Yuxin Song, Kang Rong, Huanjin Yao, Jianbo Zhao, Fanglong Liu, Haocheng Feng, Jingdong Wang, et al. Automated multi-level preference for mllms. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 2, 7

  80. [81]

    Omdet: Large-scale vision-language multi-dataset pre-training with multimodal detection network

    Tiancheng Zhao, Peng Liu, and Kyusong Lee. Omdet: Large-scale vision-language multi-dataset pre-training with multimodal detection network. IET Computer Vision, 2024. 3

Showing first 80 references.