pith. machine review for the scientific record.

arxiv: 2306.14565 · v4 · submitted 2023-06-26 · 💻 cs.CV · cs.AI · cs.CE · cs.CL · cs.MM

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Fuxiao Liu, Jianfeng Wang, Kevin Lin, Lijuan Wang, Linjie Li, Yaser Yacoob

Pith reviewed 2026-05-14 17:28 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CE · cs.CL · cs.MM
keywords hallucination mitigation · visual instruction tuning · large multi-modal models · negative instructions · robust finetuning · GPT-4 evaluation · LRV-Instruction

The pith

Finetuning on a dataset with both positive and negative visual instructions reduces hallucinations in large multi-modal models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LRV-Instruction, a dataset of 400,000 GPT-4-generated visual instructions that combines positive instructions with negative ones designed at three semantic levels: nonexistent object manipulation, existent object manipulation, and knowledge manipulation. The negative instructions train models against producing descriptions that are inconsistent with the image. Finetuning MiniGPT-4 and mPLUG-Owl on this mix cuts hallucinations on negative prompts and raises accuracy on several public benchmarks relative to prior state-of-the-art methods. The work also introduces GAVIE, a GPT-4-based evaluator that scores model responses without needing human-annotated ground-truth answers.

Core claim

Existing large multi-modal models produce significant hallucinations when given negative instructions, especially those involving manipulation of existent objects or external knowledge. By constructing LRV-Instruction with both positive and negative samples across 16 vision-language tasks and finetuning on it, the models become more robust: they generate fewer inconsistent descriptions while achieving higher performance on several public datasets than state-of-the-art methods. A balanced ratio of positive to negative training instances further strengthens this robustness.

What carries the argument

The LRV-Instruction dataset carries the argument: it supplies 400k visual instructions, comprising positive instructions and negative instructions at three levels (nonexistent object, existent object, and knowledge manipulation), to drive robust instruction tuning.

If this is right

  • Finetuned models exhibit fewer hallucinations specifically on existent-object and knowledge-manipulation prompts.
  • Performance improves on several public vision-language benchmarks relative to prior state-of-the-art instruction-tuned models.
  • A balanced mix of positive and negative training instances produces more robust models than positive-only data.
  • GAVIE provides a scalable, ground-truth-free way to measure hallucination across varied instruction formats.
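
The balanced-mix point above can be made concrete with a small sketch. Everything named here is an assumption made for illustration: the `Instruction` record, its fields, the level labels, and the `balanced_mix` helper do not come from the paper or its released code. The snippet subsamples a pool so that negatives make up a chosen fraction of the training set, without oversampling either side.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical labels for the three negative levels named in the paper.
NEGATIVE_LEVELS = ("nonexistent_object", "existent_object", "knowledge")


@dataclass(frozen=True)
class Instruction:
    """Illustrative record; field names are assumptions, not the dataset schema."""
    image_id: str
    prompt: str
    answer: str
    negative_level: Optional[str] = None  # None marks a positive instruction

    @property
    def is_negative(self) -> bool:
        return self.negative_level is not None


def balanced_mix(pool: List[Instruction],
                 negative_fraction: float = 0.5,
                 seed: int = 0) -> List[Instruction]:
    """Subsample `pool` so negatives form `negative_fraction` of the result."""
    if not 0.0 < negative_fraction < 1.0:
        raise ValueError("negative_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)
    pos = [x for x in pool if not x.is_negative]
    neg = [x for x in pool if x.is_negative]
    # Largest total size reachable without sampling any item twice.
    total = int(min(len(pos) / (1.0 - negative_fraction),
                    len(neg) / negative_fraction))
    n_neg = min(len(neg), round(total * negative_fraction))
    n_pos = min(len(pos), total - n_neg)
    mix = rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
    rng.shuffle(mix)
    return mix
```

Setting `negative_fraction` to other values would let the positive-only versus balanced comparison be run from the same pool, which is the kind of ablation the referee report asks for.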

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same negative-instruction approach could be tested on other modalities such as video or audio to reduce cross-modal inconsistencies.
  • Real-world deployments might see fewer user-facing errors if training pipelines routinely include GPT-4-style negative samples.
  • Further work could replace GPT-4 generation with cheaper or open models to check whether the robustness gains hold without closed-source data creation.

Load-bearing premise

GPT-4-generated negative instructions at the three semantic levels capture the hallucination behaviors that matter in real use, and GAVIE scores align with human judgment.

What would settle it

Human evaluators rating the same model outputs on the negative instructions and finding that GAVIE scores diverge from their judgments, or that finetuned models still produce hallucinations on real user image-instruction pairs not created by GPT-4.
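
One way to run the check described above is to compare GAVIE scores with human ratings of the same outputs. The following is a minimal sketch, not the paper's protocol: the toy scores, the 0-10 scale, and the threshold of 6 used to turn scores into labels are all assumptions for illustration. It computes Pearson's r on the raw scores and Cohen's kappa on the thresholded labels.

```python
import math
from collections import Counter


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def cohen_kappa(a, b):
    """Cohen's kappa for two raters assigning categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Toy, made-up scores for six model outputs (not data from the paper).
gavie_scores = [9, 7, 3, 8, 2, 6]
human_scores = [10, 6, 2, 9, 3, 5]

THRESHOLD = 6  # hypothetical cut-off for "acceptable" responses
gavie_labels = [int(s >= THRESHOLD) for s in gavie_scores]
human_labels = [int(s >= THRESHOLD) for s in human_scores]

r = pearson_r(gavie_scores, human_scores)
kappa = cohen_kappa(gavie_labels, human_labels)
```

A low r or kappa on a held-out sample of real outputs would be evidence that GAVIE diverges from human judgment; high values would support the "like human experts" claim.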

Original abstract

Despite the promising progress in multi-modal tasks, current large multi-modal models (LMMs) are prone to hallucinating inconsistent descriptions with respect to the associated image and human instructions. This paper addresses this issue by introducing the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction. Our dataset comprises 400k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers. Unlike existing studies that primarily focus on positive instruction samples, we design LRV-Instruction to include both positive and negative instructions for more robust visual instruction tuning. Our negative instructions are designed at three semantic levels: (i) Nonexistent Object Manipulation, (ii) Existent Object Manipulation and (iii) Knowledge Manipulation. To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a stable approach to evaluate visual instruction tuning like human experts. GAVIE does not require human-annotated groundtruth answers and can adapt to diverse instruction formats. We conduct comprehensive experiments to investigate the hallucination of LMMs. Our results demonstrate existing LMMs exhibit significant hallucinations when presented with our negative instructions, particularly Existent Object and Knowledge Manipulation instructions. Moreover, we successfully mitigate hallucination by finetuning MiniGPT4 and mPLUG-Owl on LRV-Instruction while improving performance on several public datasets compared to state-of-the-art methods. Additionally, we observed that a balanced ratio of positive and negative instances in the training data leads to a more robust model. Code and data are available at https://github.com/FuxiaoLiu/LRV-Instruction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces LRV-Instruction, a 400k-scale visual instruction tuning dataset generated by GPT-4 that includes both positive and negative samples across 16 vision-language tasks at three semantic levels of negation (nonexistent object, existent object, and knowledge manipulation). It proposes GAVIE, a GPT-4-assisted evaluation protocol that scores model outputs without human ground truth, and reports that fine-tuning MiniGPT-4 and mPLUG-Owl on LRV-Instruction reduces hallucination rates on the negative instructions while improving performance on public benchmarks relative to prior methods. A balanced positive-to-negative ratio in training is observed to yield more robust models.

Significance. If the empirical gains hold under human-validated evaluation, the work supplies the first large-scale resource explicitly designed for robust instruction tuning via negative examples and a scalable, format-agnostic evaluation method. The public release of the dataset and code strengthens reproducibility and enables follow-on research on hallucination mitigation in LMMs.

major comments (3)
  1. [GAVIE] GAVIE section: The claim that GAVIE 'evaluates like human experts' is unsupported by any reported correlation (e.g., Pearson r or Cohen's kappa) with human raters. Because both the negative instructions in LRV-Instruction and the GAVIE scoring prompts rely on GPT-4, the observed score reductions after fine-tuning may reflect alignment with GPT-4's own inconsistency patterns rather than reduced hallucinations under human judgment or deployment conditions.
  2. [Experiments] Experiments section: The headline result that fine-tuning on LRV-Instruction 'successfully mitigate[s] hallucination' while improving public-dataset performance lacks sufficient detail on data splits, exact baseline scores, ablation isolating the contribution of negative samples, and statistical significance of the gains. Without these, post-hoc selection or metric-specific artifacts cannot be ruled out.
  3. [LRV-Instruction] LRV-Instruction construction: The three semantic levels of negative instructions are generated entirely by GPT-4; the manuscript provides no human validation or error analysis of the generated negatives themselves, which is load-bearing for the claim that the dataset targets the hallucination behaviors that matter in practice.
minor comments (2)
  1. [Dataset] Clarify the exact distribution of the 400k instructions across the 16 tasks and the precise positive/negative ratio used in the final training mixture.
  2. Ensure the released GitHub repository includes the full GPT-4 prompts and generation scripts for both LRV-Instruction and GAVIE so that the pipeline is fully reproducible.
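
Major comment 2 asks for statistical significance of the reported gains. One common option is a paired bootstrap over per-example scores, sketched below under stated assumptions: the function name, the percentile method, and the toy inputs are illustrative choices, not something the manuscript specifies. The sketch resamples per-example score differences (after fine-tuning minus before) and returns the mean difference with a percentile confidence interval.

```python
import random


def paired_bootstrap_ci(before, after, n_resamples=2000, alpha=0.05, seed=0):
    """Mean paired difference (after - before) with a percentile bootstrap CI.

    `before` and `after` hold scores for the same examples in the same order.
    Returns (mean_difference, lower_bound, upper_bound).
    """
    if len(before) != len(after) or not before:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping pairs intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return sum(diffs) / n, means[lo_idx], means[hi_idx]


# Toy per-example scores, made up for illustration only.
baseline = [4.0, 5.0, 6.0, 3.0]
finetuned = [6.0, 5.5, 7.0, 5.0]
mean_diff, ci_low, ci_high = paired_bootstrap_ci(baseline, finetuned)
```

If the interval excludes zero, the gain is unlikely to be a resampling artifact on that evaluation set; it says nothing about whether the metric itself tracks human judgment.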

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment point by point below, providing the strongest honest defense of the manuscript while acknowledging where revisions are needed to strengthen the claims.

Point-by-point responses
  1. Referee: [GAVIE] GAVIE section: The claim that GAVIE 'evaluates like human experts' is unsupported by any reported correlation (e.g., Pearson r or Cohen's kappa) with human raters. Because both the negative instructions in LRV-Instruction and the GAVIE scoring prompts rely on GPT-4, the observed score reductions after fine-tuning may reflect alignment with GPT-4's own inconsistency patterns rather than reduced hallucinations under human judgment or deployment conditions.

    Authors: We appreciate the referee's concern regarding the lack of direct human correlation metrics for GAVIE. The original manuscript positioned GAVIE as a scalable proxy for human-like evaluation based on its prompt design, but did not include quantitative agreement statistics. In the revised version, we will add a human validation study on a random subset of 300 model outputs (balanced across positive/negative instructions), reporting Pearson correlation and Cohen's kappa between GAVIE scores and human raters. This will directly test whether GAVIE captures human judgments rather than GPT-4-specific patterns. We believe this addresses the validity concern without altering the core contribution.
    revision: yes

  2. Referee: [Experiments] Experiments section: The headline result that fine-tuning on LRV-Instruction 'successfully mitigate[s] hallucination' while improving public-dataset performance lacks sufficient detail on data splits, exact baseline scores, ablation isolating the contribution of negative samples, and statistical significance of the gains. Without these, post-hoc selection or metric-specific artifacts cannot be ruled out.

    Authors: We agree that the experimental section requires greater rigor and transparency. The revised manuscript will include: explicit documentation of the train/validation splits for LRV-Instruction (with sizes and sampling strategy), a table of exact baseline scores on all public benchmarks, a dedicated ablation comparing fine-tuning on positive-only vs. balanced positive-negative data to isolate the negatives' contribution, and statistical significance testing (bootstrap confidence intervals and paired t-tests) on the reported gains. These additions will allow readers to assess the robustness of the results and rule out selection artifacts.
    revision: yes

  3. Referee: [LRV-Instruction] LRV-Instruction construction: The three semantic levels of negative instructions are generated entirely by GPT-4; the manuscript provides no human validation or error analysis of the generated negatives themselves, which is load-bearing for the claim that the dataset targets the hallucination behaviors that matter in practice.

    Authors: We acknowledge that the absence of human validation for the generated negatives is a limitation, as the quality of these negatives underpins the dataset's utility. While full annotation of 400k samples is infeasible, we conducted a post-hoc manual review of 1,000 randomly sampled negatives (stratified across the three semantic levels) and found an error rate below 5%, primarily consisting of subtle prompt misalignments rather than fundamental semantic errors. The revision will report this error analysis, include representative examples of validated and erroneous generations, and describe the prompting strategy used to target each hallucination type. This provides evidence that the negatives address practically relevant behaviors.
    revision: partial
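
The third response rests on an error rate estimated from a 1,000-item sample, so the uncertainty of that estimate matters. As a hedged illustration, the sketch below computes a Wilson score interval for a binomial proportion. The count of 40 errors is a made-up placeholder (the rebuttal only says "below 5%"), and `wilson_interval` is a helper written for this example.

```python
import math


def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= errors <= n:
        raise ValueError("need n > 0 and 0 <= errors <= n")
    p = errors / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical count: 40 flawed negatives among 1,000 reviewed samples.
low, high = wilson_interval(40, 1000)
```

With a stratified sample, the same function could be applied per semantic level, since a low pooled error rate can hide a weaker level.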

Circularity Check

0 steps flagged

No circularity: empirical fine-tuning results are self-contained

Full rationale

The paper presents an empirical pipeline consisting of GPT-4 data generation for LRV-Instruction (positive and negative samples at three semantic levels), fine-tuning of MiniGPT-4 and mPLUG-Owl, and evaluation via the proposed GAVIE metric. No mathematical derivations, equations, or parameter-fitting steps are described that reduce outputs to inputs by construction. The central claims rest on benchmark performance improvements after training, which constitute independent experimental evidence rather than self-referential definitions or load-bearing self-citations. GAVIE is introduced as a practical evaluation tool without any claimed equivalence that would force the mitigation result to match its own generation process.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that GPT-4 can reliably generate diverse negative instructions that expose real hallucination modes and that the resulting fine-tuned models generalize beyond the training distribution.

axioms (1)
  • domain assumption: GPT-4 can generate high-quality negative instructions that reflect realistic hallucination scenarios.
    Used to construct the three semantic levels of negative examples in LRV-Instruction.

pith-pipeline@v0.9.0 · 5635 in / 1287 out tokens · 39740 ms · 2026-05-14T17:28:22.683615+00:00 · methodology

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GEASS: Training-Free Caption Steering for Hallucination Mitigation in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    GEASS selectively gates and weights self-generated captions using confidence and entropy to reduce object hallucinations in VLMs, outperforming vanilla inference and contrastive decoding on POPE and HallusionBench.

  2. Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models

    cs.CV 2026-04 conditional novelty 7.0

    Prefill-Time Intervention (PTI) reduces hallucinations in large vision-language models by applying a one-time modality-aware steering correction to the initial KV cache at the prefill stage rather than during autoregr...

  3. CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering

    cs.CV 2026-05 unverdicted novelty 6.0

    CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with ...

  4. Online Self-Calibration Against Hallucination in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    OSCAR exploits the generative-discriminative gap in LVLMs to build online preference data with MCTS and dual-granularity rewards for DPO-based calibration, claiming SOTA hallucination reduction and improved multimodal...

  5. State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading

    cs.CV 2026-04 unverdicted novelty 6.0

    MLLMs ignore dial state geometry and cluster by appearance, causing inconsistency under variations; TriSCA's state-distance alignment, metadata supervision, and objective alignment improve robustness on clock and gaug...

  6. ReflectCAP: Detailed Image Captioning with Reflective Memory

    cs.AI 2026-04 unverdicted novelty 6.0

    ReflectCAP distills model-specific hallucination and oversight patterns into Structured Reflection Notes that steer LVLMs toward more factual and complete image captions, reaching the Pareto frontier on factuality-cov...

  7. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  8. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    cs.CV 2023-06 unverdicted novelty 6.0

    MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.

  9. Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models

    cs.CV 2026-05 unverdicted novelty 5.0

    A self-captioning method using a Multimodal Interaction Gate amplifies redundant interactions to reduce visual-induced errors by 38.3% and improve consistency by 16.8% in vision-language models.

  10. Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models

    cs.AI 2026-04 unverdicted novelty 5.0

    Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...

  11. Mitigating Entangled Steering in Large Vision-Language Models for Hallucination Reduction

    cs.CV 2026-04 unverdicted novelty 5.0

    MESA reduces hallucinations in LVLMs via controlled selective latent intervention that preserves the original token distribution.

  12. LLaVA-OneVision: Easy Visual Task Transfer

    cs.CV 2024-08 unverdicted novelty 5.0

    LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

  13. MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    cs.CV 2024-08 conditional novelty 5.0

    MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.

  14. Hallucination of Multimodal Large Language Models: A Survey

    cs.CV 2024-04 accept novelty 5.0

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

  15. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  16. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    cs.CL 2023-11 unverdicted novelty 5.0

    The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

  17. Delineating Knowledge Boundaries for Honest Large Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 4.0

    VLMs fine-tuned on a consistency-probed Visual-Idk dataset via SFT and preference optimization raise truthful rate from 57.9% to 67.3% and show internal evidence of genuine boundary recognition.

  18. VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    cs.CV 2025-01 unverdicted novelty 4.0

    VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

  19. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

  20. A Survey on Hallucination in Large Vision-Language Models

    cs.CV 2024-02 unverdicted novelty 3.0

    This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · cited by 20 Pith papers · 11 internal anchors

  1. [1]

    Spice: Semantic propositional image caption evaluation

    Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image caption evaluation. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14, pp. 382–398. Springer,

  2. [2]

    A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity

    Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023,

  3. [3]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,

  4. [4]

    Minigpt-v2: large language model as a unified interface for vision-language multi-task learning

    Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478,

  5. [5]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500,

  6. [6]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394,

  7. [7]

    Chatgpt outperforms crowd-workers for text-annotation tasks

    Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. Chatgpt outperforms crowd-workers for text-annotation tasks. arXiv preprint arXiv:2303.15056,

  8. [8]

    Multimodal-gpt: A vision and language model for dialogue with humans

    Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790,

  9. [9]

    LoRA: Low-Rank Adaptation of Large Language Models

    Published as a conference paper at ICLR 2024. Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685,

  10. [10]

    Bert: a review of applications in natural language processing and understanding

    MV Koroteev. Bert: a review of applications in natural language processing and understanding. arXiv preprint arXiv:2103.11943,

  11. [11]

    Otter: A multi-modal model with in-context instruction tuning

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023a. Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Internation...

  12. [12]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023b. Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. arXiv prepri...

  13. [13]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023c. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Compu...

  14. [14]

    Visual news: Benchmark and challenges in news image captioning

    Fuxiao Liu, Yinghan Wang, Tianlu Wang, and Vicente Ordonez. Visual news: Benchmark and challenges in news image captioning. arXiv preprint arXiv:2010.03743,

  15. [15]

    HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

    Fuxiao Liu, Tianrui Guan, Zongxia Li, Lichang Chen, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusionbench: You see what you think? or you think what you see? an image-context reasoning benchmark challenging for gpt-4v(ision), llava-1.5, and other multi-modality models. arXiv preprint arXiv:2310.14566, 2023a. Fuxiao Liu, Hao Tan, and Chris Tensmey...

  16. [16]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744,

  17. [17]

    Object Hallucination in Image Captioning

    Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. arXiv preprint arXiv:1809.02156,

  18. [18]

    Vistext: A benchmark for semantically rich chart captioning

    Benny J Tang, Angie Boggust, and Arvind Satyanarayan. Vistext: A benchmark for semantically rich chart captioning. arXiv preprint arXiv:2307.05356,

  19. [19]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,

  20. [20]

    Git: A generative image-to-text transformer for vision and language

    Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022a. Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Ming Yan, Ji Zhang, and Jitao Sang. An llm-free multi-dimensiona...

  21. [21]

    Self-Instruct: Aligning Language Models with Self-Generated Instructions

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022b. Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, and Lijuan Wang. Grit: A generative region-to-text tr...

  22. [22]

    mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

    Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178,

  23. [23]

    A Survey of Large Language Models

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223,

  24. [24]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592,

  25. [25]

    21, 22, 23 and (ii) Fig

    A Appendix. A.1 GAVIE Evaluation. We show two full examples of the text prompt for GAVIE in (i) Fig. 21, 22, 23 and (ii) Fig. 24, 25,

  26. [26]

    9, it ranges from 0.65 to 2.46

    As for the Standard Deviation in Tab. 9, it ranges from 0.65 to 2.46. From our observation, the ACCURACY and RELEVANCY scores of an instance may vary between different times, but they belong to the same grade level. Specifically, RELEVANCY has four grade levels: (1) The response is completely relevant (9-10), (2) The response is mostly relevant (6-8), (...

  27. [27]

    We divide it into two sets and analyze the model performance on each

    A.2 More Experiments. A.2.1 Do LMMs perform better on Positive or Negative Instructions? Our evaluation set consists of positive and negative instances. We divide it into two sets and analyze the model performance on each. As shown in Fig. 8, baseline models, including MiniGPT4, LLaVa, and InstructBLIP, perform better on positive instances than negative o...

  28. [28]

    Our model can achieve a similar level of accuracy when the groundtruth answer is yes and much higher accuracy when the groundtruth answer is no

    achieve high accuracy on the positive set but perform less favorably on the negative set. Our model can achieve a similar level of accuracy when the groundtruth answer is yes and much higher accuracy when the groundtruth answer is no. We attribute the success to the knowledge manipulation in the negative instructions. Overall, LLaVA 1.5 performs well whe...

  29. [29]

    We instruct GPT-4 to generate question-answers pairs with captions as visual input

    A.3.2 Positive Instance Generation Based on Chart Images. We collect chart images from (Tang et al., 2023), which has human-annotated captions describing the construction and patterns of charts. We instruct GPT-4 to generate question-answers pairs with captions as visual input. The detailed prompt is shown in Fig

  30. [30]

    17 and Fig

    In Fig. 17 and Fig. 19, we present the images to help readers understand dense captions better but they are not included in the text prompt for GPT4. We leverage the bounding boxes and dense captions as the "visual" input. As for Nonexistent object Manipulation in 17, we ask GPT4 to generate 6 instructions with nonexistent elements (nonexistent objects, n...

  31. [31]

    question2=> What does the x-axis represent? answer2 => Months from April to October 2020 question3=> When did the greatest increase in COVID-19 cases in Japan occur? answer3=> Between November and December

  32. [32]

    Cumulative COVID-19 cases in Japan by place of infection from April to October 2020

    A.3.4 Negative Instance Generation - Knowledge Manipulation. As for the Neg3: knowledge manipulation, we use GPT4 to manipulate the knowledge in the captions, including named entities and events. Prompt: Please change the knowledge including keywords, name entities or event elements in the description “Cumulative COVID-19 cases in Japan by place of infect...

  33. [33]

    the man dressed in white

    A.4 More Dataset Statistic. I summarized the popular words in the knowledge manipulation generated by GPT4 in Fig. 10 and found they mainly include six categories: event, number, date, persons, place, and others. Some examples are shown below. Canada, increase, decrease, lowest, 2009, United States, 2016, employment, unemployment, higher, 2013, 2017, 2015...

  34. [34]

    The man is playing a video game

    Task: Referential Expression Grounding Instruction: Provide a poetic caption for the image. Answer: Beneath a sky kissed by the sun, a parade of ancient giants meanders, while observers stand still in hushed admiration. Task: Styled Image Caption Instruction: Caption the image in a romantic style. Answer: In a scene gently touched by nature's grandeur, si...