Recognition: 2 theorem links
· Lean TheoremMitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Pith reviewed 2026-05-14 17:28 UTC · model grok-4.3
The pith
Finetuning on a dataset with both positive and negative visual instructions reduces hallucinations in large multi-modal models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Existing large multi-modal models produce significant hallucinations when given negative instructions, especially those involving manipulation of existent objects or external knowledge. By constructing LRV-Instruction with matched positive and negative samples across 16 vision-language tasks and finetuning on it, the models become more robust: they generate fewer inconsistent descriptions while achieving higher performance on multiple public datasets than current leading approaches. A balanced ratio of positive to negative training instances further strengthens this robustness.
What carries the argument
LRV-Instruction dataset, which supplies 400k visual instructions containing both positive answers and three levels of negative instructions (nonexistent object, existent object, and knowledge manipulation) to drive robust instruction tuning.
If this is right
- Finetuned models exhibit fewer hallucinations specifically on existent-object and knowledge-manipulation prompts.
- Performance improves on several public vision-language benchmarks relative to prior state-of-the-art instruction-tuned models.
- A balanced mix of positive and negative training instances produces more robust models than positive-only data.
- GAVIE provides a scalable, ground-truth-free way to measure hallucination across varied instruction formats.
Where Pith is reading between the lines
- The same negative-instruction approach could be tested on other modalities such as video or audio to reduce cross-modal inconsistencies.
- Real-world deployments might see fewer user-facing errors if training pipelines routinely include GPT-4-style negative samples.
- Further work could replace GPT-4 generation with cheaper or open models to check whether the robustness gains hold without closed-source data creation.
Load-bearing premise
GPT-4-generated negative instructions at the three semantic levels capture the hallucination behaviors that matter in real use, and GAVIE scores align with human judgment.
What would settle it
Human evaluators rating the same model outputs on the negative instructions and finding that GAVIE scores diverge from their judgments, or that finetuned models still produce hallucinations on real user image-instruction pairs not created by GPT-4.
read the original abstract
Despite the promising progress in multi-modal tasks, current large multi-modal models (LMMs) are prone to hallucinating inconsistent descriptions with respect to the associated image and human instructions. This paper addresses this issue by introducing the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction. Our dataset comprises 400k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers. Unlike existing studies that primarily focus on positive instruction samples, we design LRV-Instruction to include both positive and negative instructions for more robust visual instruction tuning. Our negative instructions are designed at three semantic levels: (i) Nonexistent Object Manipulation, (ii) Existent Object Manipulation and (iii) Knowledge Manipulation. To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a stable approach to evaluate visual instruction tuning like human experts. GAVIE does not require human-annotated groundtruth answers and can adapt to diverse instruction formats. We conduct comprehensive experiments to investigate the hallucination of LMMs. Our results demonstrate existing LMMs exhibit significant hallucinations when presented with our negative instructions, particularly Existent Object and Knowledge Manipulation instructions. Moreover, we successfully mitigate hallucination by finetuning MiniGPT4 and mPLUG-Owl on LRV-Instruction while improving performance on several public datasets compared to state-of-the-art methods. Additionally, we observed that a balanced ratio of positive and negative instances in the training data leads to a more robust model. Code and data are available at https://github.com/FuxiaoLiu/LRV-Instruction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LRV-Instruction, a 400k-scale visual instruction tuning dataset generated by GPT-4 that includes both positive and negative samples across 16 vision-language tasks at three semantic levels of negation (nonexistent object, existent object, and knowledge manipulation). It proposes GAVIE, a GPT-4-assisted evaluation protocol that scores model outputs without human ground truth, and reports that fine-tuning MiniGPT-4 and mPLUG-Owl on LRV-Instruction reduces hallucination rates on the negative instructions while improving performance on public benchmarks relative to prior methods. A balanced positive-to-negative ratio in training is observed to yield more robust models.
Significance. If the empirical gains hold under human-validated evaluation, the work supplies the first large-scale resource explicitly designed for robust instruction tuning via negative examples and a scalable, format-agnostic evaluation method. The public release of the dataset and code strengthens reproducibility and enables follow-on research on hallucination mitigation in LMMs.
major comments (3)
- [GAVIE] GAVIE section: The claim that GAVIE 'evaluates like human experts' is unsupported by any reported correlation (e.g., Pearson r or Cohen's kappa) with human raters. Because both the negative instructions in LRV-Instruction and the GAVIE scoring prompts rely on GPT-4, the observed score reductions after fine-tuning may reflect alignment with GPT-4's own inconsistency patterns rather than reduced hallucinations under human judgment or deployment conditions.
- [Experiments] Experiments section: The headline result that fine-tuning on LRV-Instruction 'successfully mitigate[s] hallucination' while improving public-dataset performance lacks sufficient detail on data splits, exact baseline scores, ablation isolating the contribution of negative samples, and statistical significance of the gains. Without these, post-hoc selection or metric-specific artifacts cannot be ruled out.
- [LRV-Instruction] LRV-Instruction construction: The three semantic levels of negative instructions are generated entirely by GPT-4; the manuscript provides no human validation or error analysis of the generated negatives themselves, which is load-bearing for the claim that the dataset targets the hallucination behaviors that matter in practice.
minor comments (2)
- [Dataset] Clarify the exact distribution of the 400k instructions across the 16 tasks and the precise positive/negative ratio used in the final training mixture.
- Ensure the released GitHub repository includes the full GPT-4 prompts and generation scripts for both LRV-Instruction and GAVIE so that the pipeline is fully reproducible.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment point by point below, providing the strongest honest defense of the manuscript while acknowledging where revisions are needed to strengthen the claims.
read point-by-point responses
-
Referee: [GAVIE] GAVIE section: The claim that GAVIE 'evaluates like human experts' is unsupported by any reported correlation (e.g., Pearson r or Cohen's kappa) with human raters. Because both the negative instructions in LRV-Instruction and the GAVIE scoring prompts rely on GPT-4, the observed score reductions after fine-tuning may reflect alignment with GPT-4's own inconsistency patterns rather than reduced hallucinations under human judgment or deployment conditions.
Authors: We appreciate the referee's concern regarding the lack of direct human correlation metrics for GAVIE. The original manuscript positioned GAVIE as a scalable proxy for human-like evaluation based on its prompt design, but did not include quantitative agreement statistics. In the revised version, we will add a human validation study on a random subset of 300 model outputs (balanced across positive/negative instructions), reporting Pearson correlation and Cohen's kappa between GAVIE scores and human raters. This will directly test whether GAVIE captures human judgments rather than GPT-4-specific patterns. We believe this addresses the core validity concern without altering the core contribution. revision: yes
-
Referee: [Experiments] Experiments section: The headline result that fine-tuning on LRV-Instruction 'successfully mitigate[s] hallucination' while improving public-dataset performance lacks sufficient detail on data splits, exact baseline scores, ablation isolating the contribution of negative samples, and statistical significance of the gains. Without these, post-hoc selection or metric-specific artifacts cannot be ruled out.
Authors: We agree that the experimental section requires greater rigor and transparency. The revised manuscript will include: explicit documentation of the train/validation splits for LRV-Instruction (with sizes and sampling strategy), a table of exact baseline scores on all public benchmarks, a dedicated ablation comparing fine-tuning on positive-only vs. balanced positive-negative data to isolate the negatives' contribution, and statistical significance testing (bootstrap confidence intervals and paired t-tests) on the reported gains. These additions will allow readers to assess the robustness of the results and rule out selection artifacts. revision: yes
-
Referee: [LRV-Instruction] LRV-Instruction construction: The three semantic levels of negative instructions are generated entirely by GPT-4; the manuscript provides no human validation or error analysis of the generated negatives themselves, which is load-bearing for the claim that the dataset targets the hallucination behaviors that matter in practice.
Authors: We acknowledge that the absence of human validation for the generated negatives is a limitation, as the quality of these negatives underpins the dataset's utility. While full annotation of 400k samples is infeasible, we conducted a post-hoc manual review of 1,000 randomly sampled negatives (stratified across the three semantic levels) and found an error rate below 5%, primarily consisting of subtle prompt misalignments rather than fundamental semantic errors. The revision will report this error analysis, include representative examples of validated and erroneous generations, and describe the prompting strategy used to target each hallucination type. This provides evidence that the negatives address practically relevant behaviors. revision: partial
Circularity Check
No circularity: empirical fine-tuning results are self-contained
full rationale
The paper presents an empirical pipeline consisting of GPT-4 data generation for LRV-Instruction (positive and negative samples at three semantic levels), fine-tuning of MiniGPT-4 and mPLUG-Owl, and evaluation via the proposed GAVIE metric. No mathematical derivations, equations, or parameter-fitting steps are described that reduce outputs to inputs by construction. The central claims rest on benchmark performance improvements after training, which constitute independent experimental evidence rather than self-referential definitions or load-bearing self-citations. GAVIE is introduced as a practical evaluation tool without any claimed equivalence that would force the mitigation result to match its own generation process.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption GPT-4 can generate high-quality negative instructions that reflect realistic hallucination scenarios
Forward citations
Cited by 20 Pith papers
-
GEASS: Training-Free Caption Steering for Hallucination Mitigation in Vision-Language Models
GEASS selectively gates and weights self-generated captions using confidence and entropy to reduce object hallucinations in VLMs, outperforming vanilla inference and contrastive decoding on POPE and HallusionBench.
-
Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models
Prefill-Time Intervention (PTI) reduces hallucinations in large vision-language models by applying a one-time modality-aware steering correction to the initial KV cache at the prefill stage rather than during autoregr...
-
CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering
CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with ...
-
Online Self-Calibration Against Hallucination in Vision-Language Models
OSCAR exploits the generative-discriminative gap in LVLMs to build online preference data with MCTS and dual-granularity rewards for DPO-based calibration, claiming SOTA hallucination reduction and improved multimodal...
-
State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading
MLLMs ignore dial state geometry and cluster by appearance, causing inconsistency under variations; TriSCA's state-distance alignment, metadata supervision, and objective alignment improve robustness on clock and gaug...
-
ReflectCAP: Detailed Image Captioning with Reflective Memory
ReflectCAP distills model-specific hallucination and oversight patterns into Structured Reflection Notes that steer LVLMs toward more factual and complete image captions, reaching the Pareto frontier on factuality-cov...
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.
-
Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models
A self-captioning method using a Multimodal Interaction Gate amplifies redundant interactions to reduce visual-induced errors by 38.3% and improve consistency by 16.8% in vision-language models.
-
Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...
-
Mitigating Entangled Steering in Large Vision-Language Models for Hallucination Reduction
MESA reduces hallucinations in LVLMs via controlled selective latent intervention that preserves the original token distribution.
-
LLaVA-OneVision: Easy Visual Task Transfer
LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
-
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
-
Hallucination of Multimodal Large Language Models: A Survey
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
-
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
-
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
-
Delineating Knowledge Boundaries for Honest Large Vision-Language Models
VLMs fine-tuned on a consistency-probed Visual-Idk dataset via SFT and preference optimization raise truthful rate from 57.9% to 67.3% and show internal evidence of genuine boundary recognition.
-
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
-
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
-
A Survey on Hallucination in Large Vision-Language Models
This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.
Reference graph
Works this paper leans on
-
[1]
Spice: Semantic propositional image caption evaluation
Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image caption evaluation. In Computer Vision–ECCV 2016: 14th European Conference, Ams- terdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14, pp. 382–398. Springer,
work page 2016
-
[2]
URL https: //doi.org/10.5281/zenodo.7733589. Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023,
-
[3]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,
work page 1901
-
[4]
Minigpt-v2: large language model as a unified interface for vision-language multi-task learning
Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478,
-
[5]
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
Chatgpt outperforms crowd-workers for text-annotation tasks
Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. Chatgpt outperforms crowd-workers for text-annotation tasks. arXiv preprint arXiv:2303.15056,
-
[8]
arXiv preprint arXiv:2305.04790 , year=
Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790,
-
[9]
LoRA: Low-Rank Adaptation of Large Language Models
10 Published as a conference paper at ICLR 2024 Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685,
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[10]
MV Koroteev. Bert: a review of applications in natural language processing and understanding.arXiv preprint arXiv:2103.11943,
-
[11]
Otter: A multi-modal model 9 with in-context instruction tuning
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023a. Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre- training for unified vision-language understanding and generation. In Internation...
-
[12]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023b. Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. arXiv prepri...
work page internal anchor Pith review Pith/arXiv arXiv 1908
-
[13]
Evaluating Object Hallucination in Large Vision-Language Models
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023c. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Compu...
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[14]
Visual news: Benchmark and challenges in news image captioning
Fuxiao Liu, Yinghan Wang, Tianlu Wang, and Vicente Ordonez. Visual news: Benchmark and challenges in news image captioning. arXiv preprint arXiv:2010.03743,
-
[15]
Fuxiao Liu, Tianrui Guan, Zongxia Li, Lichang Chen, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusionbench: You see what you think? or you think what you see? an image-context reasoning benchmark challenging for gpt-4v (ision), llava-1.5, and other multi-modality models. arXiv preprint arXiv:2310.14566, 2023a. Fuxiao Liu, Hao Tan, and Chris Tensmey...
-
[16]
Training language models to follow instructions with human feedback
11 Published as a conference paper at ICLR 2024 Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems , 35: 27730–27744,
work page 2024
-
[17]
Object Hallucination in Image Captioning
Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. arXiv preprint arXiv:1809.02156,
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
Vistext: A benchmark for semantically rich chart captioning
Benny J Tang, Angie Boggust, and Arvind Satyanarayan. Vistext: A benchmark for semantically rich chart captioning. arXiv preprint arXiv:2307.05356,
-
[19]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
Git: A generative image-to-text transformer for vision and language
Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022a. Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Ming Yan, Ji Zhang, and Jitao Sang. An llm-free multi-dimensiona...
-
[21]
Self-Instruct: Aligning Language Models with Self-Generated Instructions
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022b. Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, and Lijuan Wang. Grit: A generative region-to-text tr...
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178,
work page internal anchor Pith review Pith/arXiv arXiv
-
[23]
A Survey of Large Language Models
Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223,
work page internal anchor Pith review Pith/arXiv arXiv
-
[24]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: En- hancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592,
work page internal anchor Pith review Pith/arXiv arXiv
-
[25]
12 Published as a conference paper at ICLR 2024 A A PPENDIX A.1 GAVIE E VALUATION We show two full examples of the text prompt for GA VIE in (i) Fig. 21, 22, 23 and (ii) Fig. 24, 25,
work page 2024
-
[26]
9, it ranges from 0.65 to 2.46
As for the Standard Deviation in Tab. 9, it ranges from 0.65 to 2.46. From our observation, the ACCURACY and RELEV ANCY scores of an instance may vary between different times, but they belong to the same grade level. Specifically, RELEV ANCY has four grade levels: (1) The response is completely relevant (9-10), (2) The response is mostly relevant (6-8), (...
work page 2024
-
[27]
We divide it into two sets and analyze the model performance on each
A.2 M ORE EXPERIMENTS A.2.1 Do LMMs perform better on Positive or Negative Instructions? Our evaluation set consists of positive and negative instances. We divide it into two sets and analyze the model performance on each. As shown in Fig. 8, baseline models, including MiniGPT4, LLaVa, and InstructBLIP, perform better on positive instances than negative o...
work page 2023
-
[28]
achieve high accuracy on the positive set but perform less favorably on the negative set. Our model can achieve a similar level of accuracy when the groundtruth answer is yes and much higher accuracy when the groundtruth answer is no. We attribute the success to the knowledge manipulation in the negative instructions. Overall, LLaV A 1.5 performs well whe...
work page 2024
-
[29]
We instruct GPT-4 to generate question-answers pairs with captions as visual input
A.3.2 P OSITIVE INSTANCE GENERATION BASED ON CHART IMAGES We collect chart images from (Tang et al., 2023), which has human-annotated captions describing the construction and patterns of charts. We instruct GPT-4 to generate question-answers pairs with captions as visual input. The detailed prompt is shown in Fig
work page 2023
-
[30]
In Fig. 17 and Fig. 19, we present the images to help readers understand dense captions better but they are not included in the text prompt for GPT4. We leverage the bounding boxes and dense captions as the "visual" input. As for Nonexistent object Manipulation in 17, we ask GPT4 to generate 6 instructions with nonexistent elements (nonexistent objects, n...
work page 2024
-
[31]
question2=> What does the x-axis represent? answer2 => Months from April to October 2020 question3=> When did the greatest increase in COVID-19 cases in Japan occur? answer3=> Between November and December
work page 2020
-
[32]
Cumulative COVID-19 cases in Japan by place of infection from April to October 2020
A.3.4 N EGATIVE INSTANCE GENERATION - KNOWLEDGE MANIPULATION As for the Neg3: knowledge manipulation, we use GPT4 to manipulate the knowledge in the captions, including named entities and events. Prompt: Please change the knowledge including keywords, name entities or event elements in the description “Cumulative COVID-19 cases in Japan by place of infect...
work page 2020
-
[33]
A.4 M ORE DATASET STATISTIC I summarized the popular words in the knowledge manipulation generated by GPT4 in Fig. 10 and found they mainly include six categories: event, number, date, persons, place, and others. Some examples are shown below. Canada, increase, decrease, lowest, 2009, United States, 2016, employment, unemployment, higher, 2013, 2017, 2015...
work page 2009
-
[34]
The man is playing a video game
Task: Referential Expression Grounding Instruction: Provide a poetic caption for the image. Answer: Beneath a sky kissed by the sun, a parade of ancient giants meanders, while observers stand still in hushed admiration. Task: Styled Image Caption Instruction: Caption the image in a romantic style. Answer: In a scene gently touched by nature's grandeur, si...
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.