Mitigating Object Hallucinations via Sentence-Level Early Intervention

Li Jiang; Senqiao Yang; Shangpin Peng; Zhuotao Tian

arxiv: 2507.12455 · v3 · pith:27KSWOPAnew · submitted 2025-07-16 · 💻 cs.CV

Pith reviewed 2026-05-25 08:29 UTC · model grok-4.3

classification 💻 cs.CV

keywords: object hallucination · multimodal large language models · preference learning · sentence-level intervention · vision-language models · hallucination mitigation · direct preference optimization

The pith

Sentence-level preference training from detector-validated pairs cuts object hallucinations in multimodal models by over 90 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that object hallucinations in multimodal large language models arise mostly early in a response and then propagate to later sentences. It targets these early errors with automatically built preference data: the model samples candidate sentences, two open-vocabulary detectors cross-check which mentioned objects are actually present, and each sentence is labeled hallucinated or non-hallucinated. The resulting pairs train the model with a context-aware preference objective (C-DPO) that rewards grounded early sentences and penalizes fabricated objects. The authors report improvements on both hallucination benchmarks and general capability tests, using no human labels and training data drawn from the model's own output distribution.

Core claim

Hallucinations predominantly emerge at the early stages of text generation and propagate through subsequent outputs. By iteratively sampling model outputs, validating object existence through cross-checking with two open-vocabulary detectors, and classifying sentences into hallucinated or non-hallucinated categories, high-quality in-domain preference pairs can be bootstrapped without human annotations. Training with a context-aware preference loss (C-DPO) that emphasizes discriminative learning at the sentence level where hallucinations initially manifest then suppresses the fabrication of objects from the first sentence onward.
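The cross-checking step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the detector interface (a callable that returns `True` when an object is found), the three-way object label, and the rule that a sentence counts as hallucinated when any mentioned object is rejected by both detectors are assumptions made for the example.

```python
from typing import Callable, Dict, List, Set

# Hypothetical detector interface: (image_id, object_name) -> bool.
Detector = Callable[[str, str], bool]


def label_object(image_id: str, obj: str, det_a: Detector, det_b: Detector) -> str:
    """Cross-check one object mention against two detectors."""
    found_a, found_b = det_a(image_id, obj), det_b(image_id, obj)
    if found_a and found_b:
        return "present"
    if not found_a and not found_b:
        return "absent"
    return "uncertain"  # the detectors disagree


def label_sentence(image_id: str, objects: List[str],
                   det_a: Detector, det_b: Detector) -> str:
    """Assumed rule: any 'absent' object -> hallucinated; all 'present' -> non_hallucinated."""
    labels = [label_object(image_id, o, det_a, det_b) for o in objects]
    if "absent" in labels:
        return "hallucinated"
    if labels and all(label == "present" for label in labels):
        return "non_hallucinated"
    return "discard"  # no objects mentioned, or only uncertain evidence


def make_lookup_detector(found: Dict[str, Set[str]]) -> Detector:
    """Toy stand-in for a real open-vocabulary detector, backed by a dict."""
    return lambda image_id, obj: obj in found.get(image_id, set())


# Toy detections, invented for the example.
det_a = make_lookup_detector({"img1": {"dog", "frisbee", "tree"}})
det_b = make_lookup_detector({"img1": {"dog", "frisbee"}})

print(label_sentence("img1", ["dog", "frisbee"], det_a, det_b))
print(label_sentence("img1", ["dog", "car"], det_a, det_b))
print(label_sentence("img1", ["tree"], det_a, det_b))
```

Discarding sentences on which the detectors disagree trades data volume for label precision; whether the paper makes the same trade is not stated in this summary.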

What carries the argument

The argument rests on the SENTINEL framework: it builds context-coherent positive and hallucinated negative sentence pairs via detector cross-validation, then trains on them with a context-aware direct preference optimization loss (C-DPO) focused on the early generation steps.
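A minimal sketch of how such a sentence-level preference loss could be computed for one pair. It assumes C-DPO keeps the standard DPO form and differs only in that the shared context is part of the conditioning while its tokens are excluded from the scored log-probabilities; that reading is an assumption, not something this page confirms. The function names, the `is_context` mask, and `beta=0.1` are illustrative.

```python
import math
from typing import List


def sentence_logprob(token_logprobs: List[float], is_context: List[bool]) -> float:
    """Sum token log-probs over the candidate sentence only.

    Context tokens are conditioned on but (by assumption) not scored.
    """
    return sum(lp for lp, ctx in zip(token_logprobs, is_context) if not ctx)


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_w: float, policy_l: float,
             ref_w: float, ref_l: float, beta: float = 0.1) -> float:
    """Standard DPO loss for one (preferred, rejected) pair: -log sigmoid(margin)."""
    margin = beta * ((policy_w - ref_w) - (policy_l - ref_l))
    return softplus(-margin)


# Made-up token log-probs: one context token followed by two sentence tokens.
mask = [True, False, False]
pw = sentence_logprob([-0.3, -1.0, -0.5], mask)  # policy, preferred sentence
pl = sentence_logprob([-0.3, -2.0, -2.5], mask)  # policy, hallucinated sentence
rw = sentence_logprob([-0.3, -1.5, -1.0], mask)  # reference, preferred sentence
rl = sentence_logprob([-0.3, -1.5, -1.0], mask)  # reference, hallucinated sentence
print(dpo_loss(pw, pl, rw, rl))
```

The loss equals log 2 when the policy and the reference agree, and it falls as the policy raises the preferred sentence relative to the hallucinated one.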

If this is right

Models trained with SENTINEL reduce hallucinations by more than 90 percent relative to the base model while outperforming prior state-of-the-art methods on both hallucination and general capability benchmarks.
No human-annotated preference data is required because the training pairs are generated automatically through detector validation.
Focusing the preference loss at the sentence level prevents errors from propagating through later parts of the response.
The iterative bootstrapping process produces progressively higher-quality preference data as the model improves.
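The "more than 90 percent" figure can be checked against the Object HalBench numbers quoted in the table text next to Figure 7 on this page (LLaVA-v1.5-7B versus the 8.6K-pair SENTINEL model). The helper below is a small sketch; the function name is an illustrative choice, and the paper may compute its headline number on a different configuration.

```python
def relative_reduction(base: float, new: float) -> float:
    """Percentage reduction of `new` relative to `base`."""
    if base <= 0:
        raise ValueError("base must be positive")
    return (base - new) / base * 100.0


# Object HalBench rates from the table text quoted with Figure 7.
resp = relative_reduction(52.7, 4.3)  # response-level hallucination rate
ment = relative_reduction(27.9, 2.6)  # mention-level hallucination rate
print(f"response-level: {resp:.1f}%  mention-level: {ment:.1f}%")
```

Both values land just above 90 percent (about 91.8 and 90.7), which is consistent with the headline claim for this benchmark.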

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same early-sentence labeling and preference approach could be applied to other hallucination types such as incorrect attributes or spatial relations.
If more accurate or specialized detectors become available, the quality of the automatically generated preference data would increase and further reduce residual hallucinations.
The method's independence from human labels makes it practical to retrain models periodically as new vision backbones improve detector reliability.

Load-bearing premise

Cross-checking model outputs against two open-vocabulary detectors accurately classifies sentences as hallucinated or non-hallucinated without introducing new systematic errors or distribution shifts.

What would settle it

A collection of generated sentences where the two detectors label an object as present but a careful human review finds it absent (or the reverse) at a high rate would show the classification step is unreliable.
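Such an audit could be scored as below. This is a sketch under stated assumptions: each record pairs the detectors' joint verdict with a human verdict for one object mention, and the sample data are placeholders invented for the example, not results from the paper.

```python
from typing import Dict, List, Tuple


def audit_detector_labels(pairs: List[Tuple[bool, bool]]) -> Dict[str, float]:
    """pairs: (detectors_say_present, human_says_present) per object mention."""
    n_det_present = sum(1 for d, _ in pairs if d)
    n_det_absent = len(pairs) - n_det_present
    false_present = sum(1 for d, h in pairs if d and not h)
    false_absent = sum(1 for d, h in pairs if not d and h)
    return {
        # Share of "present" verdicts that the human reviewer rejects.
        "false_present_rate": false_present / n_det_present if n_det_present else 0.0,
        # Share of "absent" verdicts that the human reviewer rejects.
        "false_absent_rate": false_absent / n_det_absent if n_det_absent else 0.0,
    }


# Placeholder audit sample (not real data).
sample = [(True, True)] * 3 + [(True, False)] + [(False, False)] * 3 + [(False, True)]
print(audit_detector_labels(sample))
```

A high `false_absent_rate` would mean accurate sentences are being used as negatives, while a high `false_present_rate` would mean hallucinated sentences are being rewarded as positives.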

Figures

Figures reproduced from arXiv: 2507.12455 by Li Jiang, Senqiao Yang, Shangpin Peng, Zhuotao Tian.

**Figure 1.** Comparative analysis of data construction strategies for hallucination mitigation in MLLMs. Our proposed approach demonstrates superior efficiency and effectiveness in generating high-quality, domain-specific preference learning datasets, offering a robust solution for reducing hallucination in MLLMs.

**Figure 2.** Object position distribution in MLLM hallucination analysis. (a) illustrates the progressive deterioration of hallucination effects in Multimodal Large Language Models (MLLMs) with increasing description length in the image captioning task, while (b) demonstrates the effectiveness of early-stage intervention in mitigating the propagation of hallucination.

**Figure 3.** The overview of SENTINEL. The proposed SENTINEL takes six essential steps: (1) Generate multiple in-domain responses conditioned on the input image, prompt, and context c. (2) Identify and extract all mentioned objects from each generated sentence. (3) Use two object detectors to validate the existence of extracted objects through cross-referencing. (4) Categorize generated sentences into hallucinate…

**Figure 4.** Categories of in-domain candidates. The in-domain candidates fall into three types. Employing non-hallucinated, context-coherent descriptions ($y_w^+$) as positive samples, paired with hallucinated descriptions ($y_l$), enhances the model's generalization performance and robustness.

**Figure 7.** Qualitative results of SENTINEL. Our method can effectively eliminate hallucinations in MLLMs while enhancing the model's general capabilities.

Adjacent table from the paper:

| Method | Object HalBench Resp. ↓ | Object HalBench Ment. ↓ | AMBER Acc ↑ | AMBER F1 ↑ | MM-Vet Overall ↑ |
|---|---|---|---|---|---|
| LLaVA-v1.5-7B | 52.7 | 27.9 | 71.5 | 74.1 | 31.1 |
| Ours (8.6K ($y_w^+$, $y_l$)) | 4.3 | 2.6 | 76.1 | 79.3 | 32.6 |
| Ours (8.6K rewritten ($y_w^+$, $y_l$)) | 4.8 (↑0.5) | 2.9 (↑0.3) | 75.0 (↓1.1) | 78.0 (↓1.3) | 31.3 (↓1.3) |

**Figure 8.** Impact of training data quantity on hallucination rate in Object HalBench [55]. The results show that SENTINEL demonstrates better efficiency, effectiveness, and scalability, while effectively reducing hallucination rates across varying data scales.

Adjacent table from the paper (rows truncated in extraction). Columns: Object HalBench [55] Resp. ↓ and Ment. ↓; AMBER [63] Acc ↑ and F1 ↑; HallusionBench [12] Question Acc ↑; TextVQA [59] Acc ↑; MM-Vet [78] Overall ↑.

**Figure 9.** Effect of intermediate hallucination mitigation on subsequent generations, showing the effectiveness of early-stage intervention in mitigating the propagation of hallucinations.

Adjacent table from the paper (truncated in extraction):

| Model | Method | Object HalBench Resp. ↓ | Object HalBench Ment. ↓ | AMBER CHAIR ↓ | AMBER Hal ↓ | AMBER Cog ↓ |
|---|---|---|---|---|---|---|
| LLaVA-v1.5-7B [34] | baseline | 52.7 | 27.9 | 8.4 | 35.5 | 4.0 |
| | Woodpecker [75] | 39.6 | 26.4 | - | - | - |
| | VCD [27] | 52.7 | 27.3 | 9.1 | 39.8 | 4.2 |
| | OPERA [19] | 40.0 | 21.9 | 6.5 | 28.5 | 3.1 |
| | EOS [79] | 40… | | | | |

**Figure 10.** Time cost analysis of decode-based methods. Decode-based early intervention increases inference time, primarily due to the additional generation steps required by MLLM sampling, whereas the object detector remains highly efficient.

**Figure 11.** Comparison between C-DPO and standard DPO during model training. The proposed C-DPO promotes more stable gradient updates, enhancing training stability.

**Figure 12.** Impact of rewriting on the training process. Training with rewritten data fails to achieve the same level of convergence, resulting in higher final loss and weaker differentiation between positive and negative samples, demonstrating the necessity of in-domain training data.

**Figure 13.** Comparing general image description results between SENTINEL and its base model LLaVA-v1.5-7B. Our method effectively mitigates hallucinations while enhancing the general performance of the base model, providing a more detailed description.

**Figure 14.** Comparing detailed image description results between SENTINEL and its base model LLaVA-v1.5-7B. Our method effectively mitigates hallucinations while enhancing the general performance of the base model, providing a more detailed description.

**Figure 15.** Comparing visual question answering results between SENTINEL and LLaVA-v1.5-7B. Our method effectively mitigates hallucinations while enhancing the general performance of the base model, leading to more accurate and detailed answers.

**Figure 16.** Comparing general image descriptions between SENTINEL and its base model LLaVA-v1.5-13B. Our method effectively mitigates hallucinations while enhancing the general performance of the base model, providing a more detailed description.

**Figure 17.** Comparing detailed image descriptions between SENTINEL and its base model LLaVA-v1.5-13B. Our method effectively mitigates hallucinations while enhancing the general performance of the base model, providing a more detailed description.

**Figure 18.** Comparing visual question answering between SENTINEL and its base model LLaVA-v1.5-13B. Our method effectively mitigates hallucinations while enhancing the general performance of the base model, leading to more accurate answers.

Abstract

Multimodal large language models (MLLMs) have revolutionized cross-modal understanding but continue to struggle with hallucinations - fabricated content contradicting visual inputs. Existing hallucination mitigation methods either incur prohibitive computational costs or introduce distribution mismatches between training data and model outputs. We identify a critical insight: hallucinations predominantly emerge at the early stages of text generation and propagate through subsequent outputs. To address this, we propose SENTINEL (Sentence-level Early iNtervention Through IN-domain prEference Learning), a framework that eliminates dependency on human annotations. Specifically, we first bootstrap high-quality in-domain preference pairs by iteratively sampling model outputs, validating object existence through cross-checking with two open-vocabulary detectors, and classifying sentences into hallucinated/non-hallucinated categories. Subsequently, we use context-coherent positive samples and hallucinated negative samples to build context-aware preference data iteratively. Finally, we train models using a context-aware preference loss (C-DPO) that emphasizes discriminative learning at the sentence level where hallucinations initially manifest. Experimental results show that SENTINEL can reduce hallucinations by over 90% compared to the original model and outperforms the previous state-of-the-art method on both hallucination benchmarks and general capabilities benchmarks, demonstrating its superiority and generalization ability. The models, datasets, and code are available at https://github.com/pspdada/SENTINEL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SENTINEL, a framework to mitigate object hallucinations in MLLMs by intervening at the sentence level during early text generation. It bootstraps in-domain preference pairs without human annotations by iteratively sampling model outputs and classifying sentences as hallucinated or non-hallucinated via cross-checking against two open-vocabulary detectors; these pairs then train the model with a context-aware preference loss (C-DPO). The abstract claims this yields over 90% hallucination reduction versus the base model and outperforms prior SOTA on both hallucination and general capability benchmarks.

Significance. If the detector-based labeling accurately reflects true hallucinations, the method offers a scalable, annotation-free route to preference data that targets the early emergence of errors, with potential to improve MLLM reliability in grounded applications. The public release of models, datasets, and code at the cited GitHub repository is a clear strength for reproducibility and follow-on work.

major comments (2)

[§3] §3 (bootstrapping procedure): The central pipeline classifies sentences using cross-checks against two open-vocabulary detectors, yet reports no inter-detector agreement statistics, human validation of labels, or ablation on detector choice. This is load-bearing for the >90% reduction claim, because systematic mislabeling (e.g., missing context-dependent fabrications or over-flagging valid descriptions) would cause C-DPO to optimize for detector agreement rather than visual grounding.
[§4] Abstract and §4 (results): The reported 'over 90% reduction' and benchmark outperformance are presented without measurement-protocol details, statistical significance tests, variance across runs, or controls isolating detector-induced bias. Given that preference-pair quality directly determines the training signal, these omissions prevent assessment of whether the gains are robust or artifactual.

minor comments (2)

The abstract refers to 'general capabilities benchmarks' without naming them; adding the specific datasets (e.g., VQAv2, GQA) would improve clarity.
Notation for the C-DPO loss could be expanded with an explicit equation showing how sentence-level context is incorporated, to aid reproducibility.
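For orientation, one plausible explicit form is written out below. It assumes C-DPO is the standard DPO objective with the context $c$ added to the conditioning; this page does not confirm that reading. Here $v$ (image), $q$ (prompt), $\beta$, and $\pi_\theta$ are standard DPO notation introduced for the sketch, while $c$, $y_w^+$, and $y_l$ follow the figure captions.

```latex
\mathcal{L}_{\text{C-DPO}}
  = -\,\mathbb{E}_{(v,\,q,\,c,\,y_w^+,\,y_l)}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w^+ \mid v, q, c)}{\pi_{\text{ref}}(y_w^+ \mid v, q, c)}
        - \beta \log \frac{\pi_\theta(y_l \mid v, q, c)}{\pi_{\text{ref}}(y_l \mid v, q, c)}
      \right)
    \right]
```

Under this reading, the context enters only through the conditioning, so the gradient signal comes from the tokens of the candidate sentence rather than from the shared prefix.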

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will incorporate revisions to improve transparency and rigor.

Referee: [§3] §3 (bootstrapping procedure): The central pipeline classifies sentences using cross-checks against two open-vocabulary detectors, yet reports no inter-detector agreement statistics, human validation of labels, or ablation on detector choice. This is load-bearing for the >90% reduction claim, because systematic mislabeling (e.g., missing context-dependent fabrications or over-flagging valid descriptions) would cause C-DPO to optimize for detector agreement rather than visual grounding.

Authors: We agree that inter-detector agreement statistics, human validation of labels, and ablations on detector choice are important for validating the bootstrapping pipeline. These elements were omitted from the original submission. In the revised manuscript we will add: (i) agreement metrics (e.g., percentage agreement and Cohen’s kappa) between the two detectors, (ii) human validation results on a random sample of 200 labeled sentences, and (iii) an ablation comparing SENTINEL performance when using each detector individually versus the cross-check. These additions will directly address concerns about potential systematic mislabeling. revision: yes
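The agreement metrics named in this response can be computed as sketched below. The implementation of Cohen's kappa follows the usual definition; the two label sequences are placeholders invented for the example, not detector outputs from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def percent_agreement(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Fraction of items on which the two raters give the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    p_o = percent_agreement(a, b)
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters always use the same single label
    return (p_o - p_e) / (1.0 - p_e)


# Placeholder verdicts (1 = object present) from two hypothetical detectors.
det_a = [1, 1, 1, 0, 0, 0, 1, 0]
det_b = [1, 1, 0, 0, 0, 1, 1, 0]
print(percent_agreement(det_a, det_b), cohen_kappa(det_a, det_b))
```

Reporting kappa alongside raw agreement matters here because object-presence labels are often imbalanced, which inflates raw agreement.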
Referee: [§4] Abstract and §4 (results): The reported 'over 90% reduction' and benchmark outperformance are presented without measurement-protocol details, statistical significance tests, variance across runs, or controls isolating detector-induced bias. Given that preference-pair quality directly determines the training signal, these omissions prevent assessment of whether the gains are robust or artifactual.

Authors: We acknowledge the need for more complete reporting. The revised version will expand §4 to include: detailed measurement protocols for all benchmarks, statistical significance tests (paired t-tests with p-values) on the main results, standard deviations across at least three independent training runs, and a control experiment that trains on a human-labeled subset to quantify any detector-induced bias. These changes will allow readers to better evaluate the robustness of the reported gains. revision: yes
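The paired test mentioned in this response can be sketched with the standard library alone. The per-run scores below are placeholders invented for the example; real values would come from the promised independent training runs.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(x: Sequence[float], y: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees_of_freedom) for paired samples x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(x, y)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Placeholder hallucination rates over three seeds (not real results).
baseline_runs = [52.0, 53.0, 54.0]
treated_runs = [4.0, 6.0, 5.0]
t_stat, dof = paired_t_statistic(baseline_runs, treated_runs)
print(t_stat, dof)
```

A p-value follows from the t distribution with `dof` degrees of freedom, for example via `scipy.stats.ttest_rel` if SciPy is available. With only three runs the test has little power, so reporting the per-run values themselves is also worthwhile.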

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper describes an empirical pipeline: iterative sampling of model outputs, labeling via two external open-vocabulary detectors to create preference pairs, and training with a context-aware C-DPO loss. No equations, self-citations, or uniqueness theorems are invoked that reduce the claimed >90% hallucination reduction or benchmark gains to fitted parameters or inputs by construction. Evaluation occurs on independent hallucination and capability benchmarks, keeping the central result externally falsifiable rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that hallucinations appear early and that detector validation is sufficiently accurate; no explicit free parameters or new invented entities are described in the abstract.

axioms (1)

Domain assumption: hallucinations predominantly emerge at the early stages of text generation and propagate through subsequent outputs.
This insight, stated in the abstract, directly motivates the sentence-level intervention design.
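One way to probe this assumption empirically is to tabulate, over a set of labeled captions, the hallucination rate at each sentence position and the position of the first hallucinated sentence. The sketch below assumes per-sentence boolean labels are already available; the function names and the toy labels are invented for the example.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def hallucination_rate_by_position(captions: List[List[bool]]) -> Dict[int, float]:
    """captions[i][j] is True if sentence j of caption i is labeled hallucinated."""
    hits: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for caption in captions:
        for pos, is_hallucinated in enumerate(caption):
            totals[pos] += 1
            hits[pos] += int(is_hallucinated)
    return {pos: hits[pos] / totals[pos] for pos in sorted(totals)}


def first_hallucination_index(caption: List[bool]) -> Optional[int]:
    """Index of the first hallucinated sentence, or None if the caption is clean."""
    return next((i for i, flag in enumerate(caption) if flag), None)


# Toy labels (not real data): three captions of differing length.
toy = [[False, True, True], [False, False], [True, True, True]]
print(hallucination_rate_by_position(toy))
print([first_hallucination_index(c) for c in toy])
```

If the assumption holds, first-hallucination indices should cluster at small values and the rate should stay high in the sentences that follow a first error.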

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
cs.AI 2026-04 unverdicted novelty 6.0

HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
Mitigating Multimodal Hallucination via Phase-wise Self-reward
cs.CV 2026-04 unverdicted novelty 6.0

PSRD mitigates visual hallucinations in LVLMs via phase-wise self-reward decoding, cutting rates by 50% on LLaVA-1.5-7B and outperforming prior methods on five benchmarks.

Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · cited by 2 Pith papers · 23 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774,

[2]

Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023.

[3]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

[4]

Hallucination of Multimodal Large Language Models: A Survey

Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930, 2024.

[5]

Driving with llms: Fusing object-level vector modality for explainable autonomous driving

Long Chen, Oleg Sinavski, Jan Hünermann, Alice Karnsund, Andrew James Willmott, Danny Birch, Daniel Maund, and Jamie Shotton. Driving with llms: Fusing object-level vector modality for explainable autonomous driving. In 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024.

[6]

Halc: Object hallucination reduction via adaptive focal-contrast decoding

Zhaorun Chen, Zhuokai Zhao, Hongyin Luo, Huaxiu Yao, Bo Li, and Jiawei Zhou. Halc: Object hallucination reduction via adaptive focal-contrast decoding. arXiv preprint arXiv:2403.00425, 2024.

[7]

Yolo-world: Real-time open-vocabulary object detection

Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. Yolo-world: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[8]

DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models

Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, and Pengcheng He. Dola: Decoding by contrasting layers improves factuality in large language models. arXiv preprint arXiv:2309.03883, 2023.

[9]

Instructblip: Towards general-purpose vision-language models with instruction tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning,

[10]

Making the v in vqa matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2017.

[11]

Mask-dpo: Generalizable fine-grained factuality alignment of llms

Yuzhe Gu, Wenwei Zhang, Chengqi Lyu, Dahua Lin, and Kai Chen. Mask-dpo: Generalizable fine-grained factuality alignment of llms. arXiv preprint arXiv:2503.02846, 2025.

[12]

Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition...

[13]

Detecting and preventing hallucinations in large vision language models

Anisha Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024.

[14]

Visual perturbation-aware collaborative learning for overcoming the language prior problem

Yudong Han, Liqiang Nie, Jianhua Yin, Jianlong Wu, and Yan Yan. Visual perturbation-aware collaborative learning for overcoming the language prior problem. arXiv preprint arXiv:2207.11850, 2022.

[15]

Skip \n: A simple method to reduce hallucination in large vision-language models

Zongbo Han, Zechen Bai, Haiyang Mei, Qianli Xu, Changqing Zhang, and Mike Zheng Shou. Skip \n: A simple method to reduce hallucination in large vision-language models. arXiv preprint arXiv:2402.01345, 2024.

[16]

A topic-level self-correctional approach to mitigate hallucinations in mllms

Lehan He, Zeren Chen, Zhelun Shi, Tianyu Yu, Jing Shao, and Lu Sheng. A topic-level self-correctional approach to mitigate hallucinations in mllms. arXiv preprint arXiv:2411.17265, 2024.

[17]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR,

[18]

Advancing medical imaging with language models: A journey from n-grams to chatgpt

Mingzhe Hu, Shaoyan Pan, Yuheng Li, and Xiaofeng Yang. Advancing medical imaging with language models: A journey from n-grams to chatgpt. arXiv preprint arXiv:2304.04920, 2023.

[19]

Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation

Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,

[20]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.

[21]

Fgaif: Aligning large vision-language models with fine-grained ai feedback

Liqiang Jing and Xinya Du. Fgaif: Aligning large vision-language models with fine-grained ai feedback. arXiv preprint arXiv:2404.05046, 2024.

[22]

Faithscore: Fine-grained evaluations of hallucinations in large vision-language models

Liqiang Jing, Ruosen Li, Yunmo Chen, and Xinya Du. Faithscore: Fine-grained evaluations of hallucinations in large vision-language models. arXiv preprint arXiv:2311.01477,

[23]

Visual genome: Connecting language and vision using crowdsourced dense image annotations

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision,

[24]

Lisa: Reasoning segmentation via large language model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024.

[25]

Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, and Jiaya Jia. Step-dpo: Step-wise preference optimization for long-chain reasoning of llms. arXiv preprint arXiv:2406.18629, 2024.

[26]

Volcano: mitigating multimodal hallucination through self-feedback guided revision

Seongyun Lee, Sue Hyun Park, Yongrae Jo, and Minjoon Seo. Volcano: mitigating multimodal hallucination through self-feedback guided revision. arXiv preprint arXiv:2311.07362, 2023.

[27]

Mitigating object hallucinations in large vision-language models through visual contrastive decoding

Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[28]

Silkie: Preference distillation for large visual language models

Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, and Lingpeng Kong. Silkie: Preference distillation for large visual language models. arXiv preprint arXiv:2312.10665, 2023.

[29]

Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision language models. arXiv:2403.18814, 2023.

[30]

Factual: A benchmark for faithful and consistent textual scene graph parsing

Zhuang Li, Yuyang Chai, Terry Yue Zhuo, Lizhen Qu, Gholamreza Haffari, Fei Li, Donghong Ji, and Quan Hung Tran. Factual: A benchmark for faithful and consistent textual scene graph parsing. arXiv preprint arXiv:2305.17497,

[31]

Flame: Factuality-aware alignment for large language models

Sheng-Chieh Lin, Luyu Gao, Barlas Oguz, Wenhan Xiong, Jimmy Lin, Scott Yih, and Xilun Chen. Flame: Factuality-aware alignment for large language models. Advances in Neural Information Processing Systems, 2024.

[32]

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Mitigating hallucination in large multi-modal models via robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023.

[33]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 2023.

[34]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[35]

Llavanext: Improved reasoning, ocr, and world knowledge

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llavanext: Improved reasoning, ocr, and world knowledge, 2024.

[36]

A Survey on Hallucination in Large Vision-Language Models

Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253, 2024.

[37]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European Conference on Computer Vision, 2024.

[38]

Typicalness-aware learning for failure detection

Yijun Liu, Jiequan Cui, Zhuotao Tian, Senqiao Yang, Qingdong He, Xiaoling Wang, and Jingyong Su. Typicalness-aware learning for failure detection. arXiv preprint arXiv:2411.01981, 2024.

[39]

NLTK: The Natural Language Toolkit

Edward Loper and Steven Bird. NLTK: The natural language toolkit. arXiv preprint cs/0205028, 2002.

[40]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 6, 3

[41]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 2022. 2, 6, 5, 9

[42]

Simpo: Simple preference optimization with a reference-free reward

Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward. Advances in Neural Information Processing Systems, 2024. 9

[43]

Counterfactual vqa: A cause-effect look at language bias

Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, and Ji-Rong Wen. Counterfactual vqa: A cause-effect look at language bias. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition,

[44]

GPT-4V(ision) system card, 2023

OpenAI. GPT-4V(ision) system card, 2023. 1, 6, 9

[45]

Clip-dpo: Vision-language models as a source of preference for fixing hallucinations in lvlms

Yassine Ouali, Adrian Bulat, Brais Martinez, and Georgios Tzimiropoulos. Clip-dpo: Vision-language models as a source of preference for fixing hallucinations in lvlms. In European Conference on Computer Vision, 2024. 6

[46]

Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, and Colin White. Smaug: Fixing failure modes of preference optimisation with dpo-positive. arXiv preprint arXiv:2402.13228, 2024. 9

[47]

Omni-dpo: A dual-perspective paradigm for dynamic preference learning of llms

Shangpin Peng, Weinong Wang, Zhuotao Tian, Senqiao Yang, Xing Wu, Haotian Xu, Chengquan Zhang, Takashi Isobe, Baotian Hu, and Min Zhang. Omni-dpo: A dual-perspective paradigm for dynamic preference learning of llms. arXiv preprint arXiv:2506.10054, 2025. 9

[48]

Does your vision-language model get lost in the long video sampling dilemma?

Tianyuan Qu, Longxiang Tang, Bohao Peng, Senqiao Yang, Bei Yu, and Jiaya Jia. Does your vision-language model get lost in the long video sampling dilemma? arXiv preprint arXiv:2503.12496, 2025. 9

[49]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, 2021. 3

[50]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 9

[51]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 2023. 2, 4, 9

[52]

Zero: Memory optimizations toward training trillion parameter models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020. 3

[53]

A Survey of Hallucination in Large Foundation Models

Vipula Rawte, Amit Sheth, and Amitava Das. A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922, 2023. 1, 2, 9

[55]

Object Hallucination in Image Captioning

Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. arXiv preprint arXiv:1809.02156, 2018. 2, 6, 8, 4, 5, 7, 9

[56]

Data-augmented phrase-level alignment for mitigating object hallucination

Pritam Sarkar, Sayna Ebrahimi, Ali Etemad, Ahmad Beirami, Sercan Ö Arık, and Tomas Pfister. Data-augmented phrase-level alignment for mitigating object hallucination. arXiv preprint arXiv:2405.18654, 2024. 6, 7

[57]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. 2

[58]

Explore the potential of clip for training-free open vocabulary semantic segmentation

Tong Shao, Zhuotao Tian, Hang Zhao, and Jingyong Su. Explore the potential of clip for training-free open vocabulary semantic segmentation. In European Conference on Computer Vision, pages 139–156. Springer, 2024. 9

[59]

Towards vqa models that can read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019. 2, 6, 8, 4, 5, 9

[60]

Fine-tuning language models for factuality

Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D Manning, and Chelsea Finn. Fine-tuning language models for factuality. In The Twelfth International Conference on Learning Representations, 2023. 2

[61]

Learning shape-aware embedding for scene text detection

Zhuotao Tian, Michelle Shu, Pengyuan Lyu, Ruiyu Li, Chao Zhou, Xiaoyong Shen, and Jiaya Jia. Learning shape-aware embedding for scene text detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4234–4243, 2019. 9

[62]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. 2

[63]

AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, et al. Amber: An llm-free multi-dimensional benchmark for mllms hallucination evaluation. arXiv preprint arXiv:2311.07397, 2023. 2, 6, 7, 8, 5

[64]

Declip: Decoupled learning for open-vocabulary dense perception

Junjie Wang, Bin Chen, Yulin Li, Bin Kang, Yichi Chen, and Zhuotao Tian. Declip: Decoupled learning for open-vocabulary dense perception. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14824–14834, 2025. 9

[65]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 1, 9

[66]

Noiseboost: Alleviating hallucination with noise perturbation for multimodal large language models

Kai Wu, Boyuan Jiang, Zhengkai Jiang, Qingdong He, Donghao Luo, Shengzhi Wang, Qingwen Liu, and Chengjie Wang. Noiseboost: Alleviating hallucination with noise perturbation for multimodal large language models. arXiv preprint arXiv:2405.20081, 2024. 6

[67]

Overcoming language priors in visual question answering via distinguishing superficially similar instances

Yike Wu, Yu Zhao, Shiwan Zhao, Ying Zhang, Xiaojie Yuan, Guoqing Zhao, and Ning Jiang. Overcoming language priors in visual question answering via distinguishing superficially similar instances. In Proceedings of the 29th International Conference on Computational Linguistics, 2022. 9

[68]

Embodied task planning with large language models

Zhenyu Wu, Ziwei Wang, Xiuwei Xu, Jiwen Lu, and Haibin Yan. Embodied task planning with large language models. arXiv preprint arXiv:2307.01848, 2023. 1

[69]

Detecting and mitigating hallucination in large vision language models via fine-grained ai feedback

Wenyi Xiao, Ziwei Huang, Leilei Gan, Wanggui He, Haoyuan Li, Zhelun Yu, Fangxun Shu, Hao Jiang, and Linchao Zhu. Detecting and mitigating hallucination in large vision language models via fine-grained ai feedback. arXiv preprint arXiv:2404.14233, 2024. 2, 3, 6, 7, 9

[70]

Efuf: Efficient fine-grained unlearning framework for mitigating hallucinations in multimodal large language models

Shangyu Xing, Fei Zhao, Zhen Wu, Tuo An, Weihao Chen, Chunhui Li, Jianbing Zhang, and Xinyu Dai. Efuf: Efficient fine-grained unlearning framework for mitigating hallucinations in multimodal large language models. arXiv preprint arXiv:2402.09801, 2024. 6, 7, 9

[71]

Lidar-llm: Exploring the potential of large language models for 3d lidar understanding

Senqiao Yang, Jiaming Liu, Ray Zhang, Mingjie Pan, Zoey Guo, Xiaoqi Li, Zehui Chen, Peng Gao, Yandong Guo, and Shanghang Zhang. Lidar-llm: Exploring the potential of large language models for 3d lidar understanding. arXiv preprint arXiv:2312.14074, 2023. 9

[72]

An improved baseline for reasoning segmentation with large language model

Senqiao Yang, Tianyuan Qu, Xin Lai, Zhuotao Tian, Bohao Peng, Shu Liu, and Jiaya Jia. An improved baseline for reasoning segmentation with large language model. arXiv preprint arXiv:2312.17240, 2023

[73]

Visionzip: Longer is better but not necessary in vision language models

Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. arXiv preprint arXiv:2412.04467, 2024. 9

[74]

Unified language-driven zero-shot domain adaptation

Senqiao Yang, Zhuotao Tian, Li Jiang, and Jiaya Jia. Unified language-driven zero-shot domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23407–23415, 2024. 9

[75]

Woodpecker: Hallucination correction for multimodal large language models

Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, and Enhong Chen. Woodpecker: Hallucination correction for multimodal large language models. Science China Information Sciences,

[76]

Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback

Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, et al. Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 2, 6, 7

[77]

Rlaif-v: Aligning mllms through open-source ai feedback for super gpt-4v trustworthiness

Tianyu Yu, Haoye Zhang, Yuan Yao, Yunkai Dang, Da Chen, Xiaoman Lu, Ganqu Cui, Taiwen He, Zhiyuan Liu, Tat-Seng Chua, et al. Rlaif-v: Aligning mllms through open-source ai feedback for super gpt-4v trustworthiness. arXiv preprint arXiv:2405.17220, 2024. 3, 6, 5, 7, 9

[78]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023. 2, 6, 8, 4, 9

[79]

Less is more: Mitigating multimodal hallucination from an eos decision perspective

Zihao Yue, Liang Zhang, and Qin Jin. Less is more: Mitigating multimodal hallucination from an eos decision perspective. arXiv preprint arXiv:2402.14545, 2024. 6, 1

[80]

Automated multi-level preference for mllms

Mengxi Zhang, Wenhao Wu, Yu Lu, Yuxin Song, Kang Rong, Huanjin Yao, Jianbo Zhao, Fanglong Liu, Haocheng Feng, Jingdong Wang, et al. Automated multi-level preference for mllms. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 2, 7

[81]

Omdet: Large-scale vision-language multi-dataset pre-training with multimodal detection network

Tiancheng Zhao, Peng Liu, and Kyusong Lee. Omdet: Large-scale vision-language multi-dataset pre-training with multimodal detection network. IET Computer Vision, 2024. 3


[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 ,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Qwen-vl: A versatile vision-language model for un- derstanding, localization

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for un- derstanding, localization. Text Reading, and Beyond, 2023. 1, 9

work page 2023

[3] [3]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923, 2025. 8, 9

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Hallucination of Multimodal Large Language Models: A Survey

Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930, 2024. 1, 2, 9

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Driving with llms: Fusing object-level vec- tor modality for explainable autonomous driving

Long Chen, Oleg Sinavski, Jan H ¨unermann, Alice Karnsund, Andrew James Willmott, Danny Birch, Daniel Maund, and Jamie Shotton. Driving with llms: Fusing object-level vec- tor modality for explainable autonomous driving. In 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024. 1

work page 2024

[6] [6]

Halc: Object hallucination re- duction via adaptive focal-contrast decoding

Zhaorun Chen, Zhuokai Zhao, Hongyin Luo, Huaxiu Yao, Bo Li, and Jiawei Zhou. Halc: Object hallucination re- duction via adaptive focal-contrast decoding. arXiv preprint arXiv:2403.00425, 2024. 2, 9

work page arXiv 2024

[7] [7]

Yolo-world: Real-time open-vocabulary object detection

Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xing- gang Wang, and Ying Shan. Yolo-world: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 3, 4

work page 2024

[8] [8]

DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models

Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, and Pengcheng He. Dola: Decoding by con- trasting layers improves factuality in large language models. arXiv preprint arXiv:2309.03883, 2023. 1, 2, 6, 7, 9

work page internal anchor Pith review Pith/arXiv arXiv 2023

[9] [9]

Instructblip: Towards general- purpose vision-language models with instruction tuning,

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general- purpose vision-language models with instruction tuning,

work page

[10] [10]

Making the v in vqa matter: Elevating the role of image understanding in visual question answer- ing

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Ba- tra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answer- ing. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2017. 2, 6, 5, 9

work page 2017

[11] [11]

Mask-dpo: Generalizable fine-grained factuality alignment of llms

Yuzhe Gu, Wenwei Zhang, Chengqi Lyu, Dahua Lin, and Kai Chen. Mask-dpo: Generalizable fine-grained factuality alignment of llms. arXiv preprint arXiv:2503.02846, 2025. 4

work page arXiv 2025

[12] [12]

Hallusionbench: an advanced diagnos- tic suite for entangled language hallucination and visual il- lusion in large vision-language models

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: an advanced diagnos- tic suite for entangled language hallucination and visual il- lusion in large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition...

work page 2024

[13] [13]

Detecting and preventing hallucinations in large vision language models

Anisha Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. In Proceedings of the AAAI Conference on Artificial Intelli- gence, 2024. 2, 7

work page 2024

[14] [14]

Visual perturbation-aware collaborative learning for overcoming the language prior problem

Yudong Han, Liqiang Nie, Jianhua Yin, Jianlong Wu, and Yan Yan. Visual perturbation-aware collaborative learning for overcoming the language prior problem. arXiv preprint arXiv:2207.11850, 2022. 9

work page arXiv 2022

[15] [15]

Skip \n: A sim- ple method to reduce hallucination in large vision-language models

Zongbo Han, Zechen Bai, Haiyang Mei, Qianli Xu, Changqing Zhang, and Mike Zheng Shou. Skip \n: A sim- ple method to reduce hallucination in large vision-language models. arXiv preprint arXiv:2402.01345, 2024. 5

work page arXiv 2024

[16] [16]

A topic-level self-correctional ap- proach to mitigate hallucinations in mllms

Lehan He, Zeren Chen, Zhelun Shi, Tianyu Yu, Jing Shao, and Lu Sheng. A topic-level self-correctional ap- proach to mitigate hallucinations in mllms. arXiv preprint arXiv:2411.17265, 2024. 3, 6, 7

work page arXiv 2024

[17] [17]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR,

work page

[18] [18]

Advancing medical imaging with language mod- els: A journey from n-grams to chatgpt

Mingzhe Hu, Shaoyan Pan, Yuheng Li, and Xiaofeng Yang. Advancing medical imaging with language mod- els: A journey from n-grams to chatgpt. arXiv preprint arXiv:2304.04920, 2023. 1

work page arXiv 2023

[19] [19]

Opera: Alleviating hallucination in multi- modal large language models via over-trust penalty and retrospection-allocation

Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Con- ghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. Opera: Alleviating hallucination in multi- modal large language models via over-trust penalty and retrospection-allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ,

work page

[20] [20]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 7, 8

work page internal anchor Pith review Pith/arXiv arXiv 2024

[21] [21]

Fgaif: Aligning large vision- language models with fine-grained ai feedback

Liqiang Jing and Xinya Du. Fgaif: Aligning large vision- language models with fine-grained ai feedback. arXiv preprint arXiv:2404.05046, 2024. 2, 7

work page arXiv 2024

[22] [22]

Faith- score: Fine-grained evaluations of hallucinations in large vision-language models

Liqiang Jing, Ruosen Li, Yunmo Chen, and Xinya Du. Faith- score: Fine-grained evaluations of hallucinations in large vision-language models. arXiv preprint arXiv:2311.01477,

work page arXiv

[23] [23]

Visual genome: Connecting language and vision using crowdsourced dense image annotations

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalan- tidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision,

work page

[24] [24]

Lisa: Reasoning segmentation via large language model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 9579–9589, 2024. 9

work page 2024

[25] [25]

Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xian- gru Peng, and Jiaya Jia. Step-dpo: Step-wise preference op- timization for long-chain reasoning of llms. arXiv preprint arXiv:2406.18629, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024

[26] [26]

V olcano: mitigating multimodal hallucina- tion through self-feedback guided revision

Seongyun Lee, Sue Hyun Park, Yongrae Jo, and Min- joon Seo. V olcano: mitigating multimodal hallucina- tion through self-feedback guided revision. arXiv preprint arXiv:2311.07362, 2023. 6

work page arXiv 2023

[27] [27]

Mitigating object hal- lucinations in large vision-language models through visual contrastive decoding

Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hal- lucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, 2024. 1, 2, 6, 7, 9

work page 2024

[28] [28]

Silkie: Preference distillation for large visual lan- guage models

Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, and Lingpeng Kong. Silkie: Preference distillation for large visual lan- guage models. arXiv preprint arXiv:2312.10665, 2023. 2

work page arXiv 2023

[29] [29]

Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision language models. arXiv:2403.18814, 2023. 9

work page internal anchor Pith review Pith/arXiv arXiv 2023

[30] [30]

Factual: A benchmark for faithful and consistent tex- tual scene graph parsing

Zhuang Li, Yuyang Chai, Terry Yue Zhuo, Lizhen Qu, Gholamreza Haffari, Fei Li, Donghong Ji, and Quan Hung Tran. Factual: A benchmark for faithful and consistent tex- tual scene graph parsing. arXiv preprint arXiv:2305.17497,

work page arXiv

[31] [31]

Flame: Factuality- aware alignment for large language models

Sheng-Chieh Lin, Luyu Gao, Barlas Oguz, Wenhan Xiong, Jimmy Lin, Scott Yih, and Xilun Chen. Flame: Factuality- aware alignment for large language models. Advances in Neural Information Processing Systems, 2024. 2

work page 2024

[32] [32]

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Ya- coob, and Lijuan Wang. Mitigating hallucination in large multi-modal models via robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[33] [33]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 2023. 1, 9

work page 2023

[34] [34]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 6, 1, 7

work page 2024

[35] [35]

Llavanext: Improved reasoning, ocr, and world knowledge, 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llavanext: Improved reasoning, ocr, and world knowledge, 2024. 1, 8, 9

work page 2024

[36] [36]

A Survey on Hallucination in Large Vision-Language Models

Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiu- tian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253, 2024. 1, 9

work page internal anchor Pith review Pith/arXiv arXiv 2024

[37] [37]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In Eu- ropean Conference on Computer Vision, 2024. 3, 4, 1

work page 2024

[38] [38]

Typicalness- aware learning for failure detection

Yijun Liu, Jiequan Cui, Zhuotao Tian, Senqiao Yang, Qing- dong He, Xiaoling Wang, and Jingyong Su. Typicalness- aware learning for failure detection. arXiv preprint arXiv:2411.01981, 2024. 9

work page arXiv 2024

[39] [39]

NLTK: The Natural Language Toolkit

Edward Loper and Steven Bird. Nltk: The natural language toolkit. arXiv preprint cs/0205028, 2002. 2

work page internal anchor Pith review Pith/arXiv arXiv 2002

[40] [40]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 6, 3

work page internal anchor Pith review Pith/arXiv arXiv 2017

[41] [41]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 2022. 2, 6, 5, 9

work page 2022

[42] [42]

Simpo: Simple preference optimization with a reference-free reward

Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward. Ad- vances in Neural Information Processing Systems, 2024. 9

work page 2024

[43] [43]

Counterfactual vqa: A cause- effect look at language bias

Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian- Sheng Hua, and Ji-Rong Wen. Counterfactual vqa: A cause- effect look at language bias. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition ,

work page

[44] [44]

GPT-4V(ision) system card, 2023

OpenAI. GPT-4V(ision) system card, 2023. 1, 6, 9

work page 2023

[45] [45]

Clip-dpo: Vision-language models as a source of preference for fixing hallucinations in lvlms

Yassine Ouali, Adrian Bulat, Brais Martinez, and Georgios Tzimiropoulos. Clip-dpo: Vision-language models as a source of preference for fixing hallucinations in lvlms. In European Conference on Computer Vision, 2024. 6

work page 2024

[46] [46]

Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, and Colin White. Smaug: Fixing failure modes of preference optimisation with dpo-positive. arXiv preprint arXiv:2402.13228, 2024. 9

work page internal anchor Pith review Pith/arXiv arXiv 2024

[47] [47]

Omni-dpo: A dual- perspective paradigm for dynamic preference learning of llms

Shangpin Peng, Weinong Wang, Zhuotao Tian, Senqiao Yang, Xing Wu, Haotian Xu, Chengquan Zhang, Takashi Isobe, Baotian Hu, and Min Zhang. Omni-dpo: A dual- perspective paradigm for dynamic preference learning of llms. arXiv preprint arXiv:2506.10054, 2025. 9

work page arXiv 2025

[48] [48]

Does your vision-language model get lost in the long video sampling dilemma? arXiv preprint arXiv:2503.12496, 2025

Tianyuan Qu, Longxiang Tang, Bohao Peng, Senqiao Yang, Bei Yu, and Jiaya Jia. Does your vision-language model get lost in the long video sampling dilemma? arXiv preprint arXiv:2503.12496, 2025. 9

work page arXiv 2025

[49] [49]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. In International conference on machine learning, 2021. 3

work page 2021

[50] [50]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. In International conference on machine learning, pages 8748–8763. PmLR, 2021. 9

work page 2021

[51] [51]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christo- pher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 2023. 2, 4, 9 10

work page 2023

[52] [52]

Zero: Memory optimizations toward training trillion parameter models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Confer- ence for High Performance Computing, Networking, Storage and Analysis, 2020. 3

work page 2020

[53] [53]

A Survey of Hallucination in Large Foundation Models

Vipula Rawte, Amit Sheth, and Amitava Das. A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922, 2023. 1, 2, 9

work page internal anchor Pith review Pith/arXiv arXiv 2023

[54] [55]

Object Hallucination in Image Captioning

Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image cap- tioning. arXiv preprint arXiv:1809.02156, 2018. 2, 6, 8, 4, 5, 7, 9

work page internal anchor Pith review Pith/arXiv arXiv 2018

[55] [56]

Data-augmented phrase-level alignment for mitigating object hallucination

Pritam Sarkar, Sayna Ebrahimi, Ali Etemad, Ahmad Beirami, Sercan ¨O Arık, and Tomas Pfister. Data-augmented phrase-level alignment for mitigating object hallucination. arXiv preprint arXiv:2405.18654, 2024. 6, 7

work page arXiv 2024

[56] [57]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Rad- ford, and Oleg Klimov. Proximal policy optimization algo- rithms. arXiv preprint arXiv:1707.06347, 2017. 2

work page internal anchor Pith review Pith/arXiv arXiv 2017

[57] [58]

Explore the potential of clip for training-free open vocabulary semantic segmentation

Tong Shao, Zhuotao Tian, Hang Zhao, and Jingyong Su. Explore the potential of clip for training-free open vocabulary semantic segmentation. In European Conference on Computer Vision, pages 139–156. Springer, 2024. 9


[58] [59]

Towards vqa models that can read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019. 2, 6, 8, 4, 5, 9


[59] [60]

Fine-tuning language models for factuality

Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D Manning, and Chelsea Finn. Fine-tuning language models for factuality. In The Twelfth International Conference on Learning Representations, 2023. 2


[60] [61]

Learning shape-aware embedding for scene text detection

Zhuotao Tian, Michelle Shu, Pengyuan Lyu, Ruiyu Li, Chao Zhou, Xiaoyong Shen, and Jiaya Jia. Learning shape-aware embedding for scene text detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4234–4243, 2019. 9


[61] [62]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. 2


[62] [63]

AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, et al. Amber: An llm-free multi-dimensional benchmark for mllms hallucination evaluation. arXiv preprint arXiv:2311.07397, 2023. 2, 6, 7, 8, 5


[63] [64]

Declip: Decoupled learning for open-vocabulary dense perception

Junjie Wang, Bin Chen, Yulin Li, Bin Kang, Yichi Chen, and Zhuotao Tian. Declip: Decoupled learning for open-vocabulary dense perception. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14824–14834, 2025. 9


[64] [65]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 1, 9


[65] [66]

Noiseboost: Alleviating hallucination with noise perturbation for multimodal large language models

Kai Wu, Boyuan Jiang, Zhengkai Jiang, Qingdong He, Donghao Luo, Shengzhi Wang, Qingwen Liu, and Chengjie Wang. Noiseboost: Alleviating hallucination with noise perturbation for multimodal large language models. arXiv preprint arXiv:2405.20081, 2024. 6


[66] [67]

Overcoming language priors in visual question answering via distinguishing superficially similar instances

Yike Wu, Yu Zhao, Shiwan Zhao, Ying Zhang, Xiaojie Yuan, Guoqing Zhao, and Ning Jiang. Overcoming language priors in visual question answering via distinguishing superficially similar instances. In Proceedings of the 29th International Conference on Computational Linguistics, 2022. 9


[67] [68]

Embodied task planning with large language models

Zhenyu Wu, Ziwei Wang, Xiuwei Xu, Jiwen Lu, and Haibin Yan. Embodied task planning with large language models. arXiv preprint arXiv:2307.01848, 2023. 1


[68] [69]

Detecting and mitigating hallucination in large vision language models via fine-grained ai feedback

Wenyi Xiao, Ziwei Huang, Leilei Gan, Wanggui He, Haoyuan Li, Zhelun Yu, Fangxun Shu, Hao Jiang, and Linchao Zhu. Detecting and mitigating hallucination in large vision language models via fine-grained ai feedback. arXiv preprint arXiv:2404.14233, 2024. 2, 3, 6, 7, 9


[69] [70]

Efuf: Efficient fine-grained unlearning framework for mitigating hallucinations in multimodal large language models

Shangyu Xing, Fei Zhao, Zhen Wu, Tuo An, Weihao Chen, Chunhui Li, Jianbing Zhang, and Xinyu Dai. Efuf: Efficient fine-grained unlearning framework for mitigating hallucinations in multimodal large language models. arXiv preprint arXiv:2402.09801, 2024. 6, 7, 9


[70] [71]

Lidar-llm: Exploring the potential of large language models for 3d lidar understanding

Senqiao Yang, Jiaming Liu, Ray Zhang, Mingjie Pan, Zoey Guo, Xiaoqi Li, Zehui Chen, Peng Gao, Yandong Guo, and Shanghang Zhang. Lidar-llm: Exploring the potential of large language models for 3d lidar understanding. arXiv preprint arXiv:2312.14074, 2023. 9


[71] [72]

An improved baseline for reasoning segmentation with large language model

Senqiao Yang, Tianyuan Qu, Xin Lai, Zhuotao Tian, Bohao Peng, Shu Liu, and Jiaya Jia. An improved baseline for reasoning segmentation with large language model. arXiv preprint arXiv:2312.17240, 2023


[72] [73]

Visionzip: Longer is better but not necessary in vision language models

Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. arXiv preprint arXiv:2412.04467, 2024. 9


[73] [74]

Unified language-driven zero-shot domain adaptation

Senqiao Yang, Zhuotao Tian, Li Jiang, and Jiaya Jia. Unified language-driven zero-shot domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23407–23415, 2024. 9


[74] [75]

Woodpecker: Hallucination correction for multimodal large language models

Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, and Enhong Chen. Woodpecker: Hallucination correction for multimodal large language models. Science China Information Sciences,


[75] [76]

Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback

Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, et al. Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 2, 6, 7


[76] [77]

Rlaif-v: Aligning mllms through open-source ai feedback for super gpt-4v trustworthiness

Tianyu Yu, Haoye Zhang, Yuan Yao, Yunkai Dang, Da Chen, Xiaoman Lu, Ganqu Cui, Taiwen He, Zhiyuan Liu, Tat-Seng Chua, et al. Rlaif-v: Aligning mllms through open-source ai feedback for super gpt-4v trustworthiness. arXiv preprint arXiv:2405.17220, 2024. 3, 6, 5, 7, 9


[77] [78]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023. 2, 6, 8, 4, 9


[78] [79]

Less is more: Mitigating multimodal hallucination from an eos decision perspective

Zihao Yue, Liang Zhang, and Qin Jin. Less is more: Mitigating multimodal hallucination from an eos decision perspective. arXiv preprint arXiv:2402.14545, 2024. 6, 1


[79] [80]

Automated multi-level preference for mllms

Mengxi Zhang, Wenhao Wu, Yu Lu, Yuxin Song, Kang Rong, Huanjin Yao, Jianbo Zhao, Fanglong Liu, Haocheng Feng, Jingdong Wang, et al. Automated multi-level preference for mllms. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 2, 7


[80] [81]

Omdet: Large-scale vision-language multi-dataset pre-training with multimodal detection network

Tiancheng Zhao, Peng Liu, and Kyusong Lee. Omdet: Large-scale vision-language multi-dataset pre-training with multimodal detection network. IET Computer Vision, 2024. 3
