Pith · machine review for the scientific record

arxiv: 2305.06500 · v2 · submitted 2023-05-11 · 💻 cs.CV · cs.LG

Recognition: 2 theorem links · Lean Theorem

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 02:07 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords vision-language models · instruction tuning · zero-shot generalization · multimodal learning · query transformer · general-purpose models · feature adaptation

The pith

Instruction tuning on diverse vision-language datasets creates models with strong zero-shot generalization to new tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how to extend the success of instruction tuning from language models to vision-language settings, where visual inputs add complexity and task variety. It assembles 26 public datasets across many capabilities, reformats them as instructions, and adds a module that conditions visual feature extraction on the specific instruction. Models trained this way on 13 datasets reach state-of-the-art zero-shot results on the remaining 13 held-out datasets while also excelling after task-specific fine-tuning. A reader would care because the work offers a concrete route to versatile multimodal systems that follow instructions across images and text without repeated full retraining.

Core claim

The central claim is that instruction tuning applied to pretrained vision-language models, using a broad set of tasks converted to instruction format together with an instruction-aware Query Transformer for tailored feature extraction, produces models with wide competence. Training on 13 held-in datasets yields state-of-the-art zero-shot performance on all 13 held-out datasets and leads to top results when the models are later fine-tuned on individual downstream tasks.
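
To make the conversion concrete, here is a minimal sketch of turning one VQA-style record into instruction format; the templates and field names are illustrative assumptions, not the paper's released templates.

```python
# Hedged sketch: hypothetical templates and field names, not InstructBLIP's released set.
import random

VQA_TEMPLATES = [
    "<Image> Question: {question} Short answer:",
    "<Image> {question} Answer the question using a single word or phrase.",
    "<Image> Based on the image, answer the following. {question}",
]

def to_instruction_format(sample: dict) -> dict:
    """Wrap one (image, question, answer) record as an instruction-response pair."""
    template = random.choice(VQA_TEMPLATES)  # templates are sampled so wording varies
    return {
        "image": sample["image"],                               # path or tensor, unchanged
        "text_input": template.format(question=sample["question"]),
        "text_output": sample["answer"],
    }

example = {"image": "coco/000000397133.jpg",
           "question": "What color is the bus?",
           "answer": "red"}
print(to_instruction_format(example))
```

The same wrapper, with task-appropriate templates, would cover captioning, visual reasoning, and the other held-in task types.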

What carries the argument

The instruction-aware Query Transformer, which adapts visual feature extraction to the given text instruction so that the model receives only the most relevant image information for the current task.
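
A minimal sketch of that mechanism, assuming standard transformer components: the learnable queries and the embedded instruction share a self-attention step, and only the instruction-conditioned queries then cross-attend to the frozen image features. Dimensions and names are illustrative; the released model builds this into BLIP-2's BERT-based Q-Former with multiple such layers rather than the single block shown here.

```python
# Illustrative sketch, not the released implementation: one instruction-aware Q-Former
# block. Instruction tokens join the learnable queries in self-attention, so the queries
# that cross-attend to the frozen image features are conditioned on the task.
import torch
import torch.nn as nn

class InstructionAwareQFormerBlock(nn.Module):
    def __init__(self, d_model=768, n_queries=32, n_heads=12):
        super().__init__()
        self.n_queries = n_queries
        self.queries = nn.Parameter(torch.randn(1, n_queries, d_model) * 0.02)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, image_feats, instr_embeds):
        # image_feats:  (B, n_patches, d) from the frozen image encoder
        # instr_embeds: (B, n_instr_tokens, d) embedded instruction text
        q = self.queries.expand(image_feats.size(0), -1, -1)
        x = torch.cat([q, instr_embeds], dim=1)                   # queries and instruction interact
        x = x + self.self_attn(x, x, x)[0]
        q = x[:, :self.n_queries]                                 # keep only the query slots
        q = q + self.cross_attn(q, image_feats, image_feats)[0]   # pull instruction-relevant visuals
        return q + self.ffn(q)                                    # (B, n_queries, d), fed to the LLM

feats = torch.randn(2, 257, 768)   # e.g. ViT patch features for two images
instr = torch.randn(2, 16, 768)    # e.g. embedded "Describe the unusual part of this image."
print(InstructionAwareQFormerBlock()(feats, instr).shape)   # torch.Size([2, 32, 768])
```

The contrast with the standard Q-Former is the concatenation step: without it, the queries extract the same visual summary regardless of what the instruction asks for.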

If this is right

  • Instruction tuning on a moderate number of datasets suffices to produce broad zero-shot ability across many vision-language tasks.
  • Further fine-tuning after instruction tuning delivers high accuracy on specific tasks such as visual question answering.
  • A single model can handle a wide range of multimodal instructions in a unified manner.
  • The resulting models support open use for additional applications and research.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same tuning recipe could be tried on other base vision-language models to improve their instruction-following without starting from scratch.
  • Testing the models on novel combinations of capabilities not seen together in training would reveal how far the generalization extends.
  • Such systems might support more fluid interactive applications that combine image understanding with natural language instructions.

Load-bearing premise

The held-out datasets contain no overlap or leakage with the held-in training data so that gains reflect genuine generalization from instruction tuning rather than memorization.

What would settle it

Showing substantial data overlap between any held-in and held-out dataset or demonstrating that the performance gains vanish when testing on a fresh vision-language task absent from the original collection of 26 datasets.
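
One form such a check could take, sketched under an assumed directory layout and exact-duplicate matching only (near-duplicate images and shared question text would need fuzzier matching than a byte hash):

```python
# Hedged sketch of a held-in/held-out leakage audit: hash every training image and
# report how many evaluation images reuse the same file. Paths are hypothetical.
import hashlib
from pathlib import Path

def image_hashes(root: str) -> set[str]:
    """MD5 over raw bytes catches exact file reuse across benchmarks (e.g. shared COCO images)."""
    return {hashlib.md5(p.read_bytes()).hexdigest()
            for p in Path(root).rglob("*.jpg")}

held_in = image_hashes("data/held_in")     # images used during instruction tuning
held_out = image_hashes("data/held_out")   # images used only for zero-shot evaluation
overlap = held_in & held_out
print(f"{len(overlap)} of {len(held_out)} held-out images "
      f"({100 * len(overlap) / max(len(held_out), 1):.1f}%) also appear in the training split")
```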

read the original abstract

Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces InstructBLIP by applying instruction tuning to pretrained BLIP-2 models. It collects and converts 26 public vision-language datasets into instruction format, trains on a 13-dataset held-in split, and reports state-of-the-art zero-shot results on the complementary 13 held-out datasets, outperforming BLIP-2 and larger Flamingo models. Additional gains are shown after task-specific fine-tuning (e.g., 90.7% on ScienceQA), and the models are open-sourced.

Significance. If the held-out evaluation is uncontaminated, the work supplies the first large-scale empirical demonstration that instruction tuning yields broad zero-shot generalization in vision-language models, analogous to its success in language-only models. The open release of code and checkpoints is a concrete contribution that enables follow-up research.

major comments (2)
  1. [Section 4 and Section 3.2] Section 4 (Experiments) and the dataset description in Section 3.2: the central zero-shot SOTA claim on the 13 held-out datasets is load-bearing for the paper's generalization narrative, yet no explicit check, deduplication step, or overlap statistics are reported for images, captions, or questions that may be shared across the held-in/held-out split (many constituent datasets draw from COCO, Flickr30k, and Visual Genome). Without such verification the reported gains over BLIP-2 could partly reflect leakage rather than instruction-tuning benefits.
  2. [Section 3.3] Section 3.3 (Instruction-aware Query Transformer): the architectural modification is presented as key to conditioning on instructions, but the ablation isolating its contribution versus standard Q-Former + instruction tuning is not shown; the performance delta could be driven primarily by the larger instruction-tuning corpus rather than the new module.
minor comments (2)
  1. [Table 2] Table 2 and the held-out dataset list: several datasets share visual sources; adding a column or footnote indicating the source image collections would improve transparency.
  2. [Section 3.2] The instruction templates used for each dataset are described only at a high level; releasing the exact templates (or a supplementary file) would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments identify important gaps in verification and ablation that we will address through targeted revisions to strengthen the claims on generalization and architectural contributions.

read point-by-point responses
  1. Referee: [Section 4 and Section 3.2] Section 4 (Experiments) and the dataset description in Section 3.2: the central zero-shot SOTA claim on the 13 held-out datasets is load-bearing for the paper's generalization narrative, yet no explicit check, deduplication step, or overlap statistics are reported for images, captions, or questions that may be shared across the held-in/held-out split (many constituent datasets draw from COCO, Flickr30k, and Visual Genome). Without such verification the reported gains over BLIP-2 could partly reflect leakage rather than instruction-tuning benefits.

    Authors: We acknowledge that explicit overlap verification was not reported. The held-in/held-out split was constructed at the dataset level to ensure the 13 held-out tasks were unseen during instruction tuning, even though some source image collections (e.g., COCO) are shared across multiple VL benchmarks. To address the concern directly, we will add a new paragraph to Section 3.2 that (1) lists all constituent datasets and their original sources, (2) reports the percentage of image overlap between the held-in and held-out collections, and (3) discusses why any residual overlap is unlikely to explain the observed zero-shot gains, given that the held-out evaluation uses entirely different instructions and question formats. If substantial leakage is found, we will also report results after removing overlapping images. revision: yes

  2. Referee: [Section 3.3] Section 3.3 (Instruction-aware Query Transformer): the architectural modification is presented as key to conditioning on instructions, but the ablation isolating its contribution versus standard Q-Former + instruction tuning is not shown; the performance delta could be driven primarily by the larger instruction-tuning corpus rather than the new module.

    Authors: We agree that an explicit ablation isolating the instruction-aware Q-Former from the effect of the larger instruction-tuning corpus would be valuable. The module was introduced precisely to make the visual queries instruction-dependent, which standard Q-Former does not do. In the revision we will add an ablation in Section 4 that trains a baseline using the original (non-instruction-aware) Q-Former architecture on the identical 13 held-in datasets and instruction format, then compares its zero-shot performance on the held-out sets against the full InstructBLIP model. This will clarify the incremental benefit of the architectural change. revision: yes
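
A sketch of what that controlled comparison might look like, with hypothetical train and evaluate callables standing in for the real pipeline; the only variable toggled is whether the Q-Former receives the instruction text.

```python
# Hedged sketch of the promised ablation: identical data, schedule, and LLM, with the
# instruction-aware conditioning switched on or off. train_fn and evaluate_fn are placeholders.
def run_ablation(train_fn, evaluate_fn, held_in_data, held_out_tasks):
    results = {}
    for variant in ("plain_qformer", "instruction_aware_qformer"):
        model = train_fn(held_in_data,
                         instruction_aware=(variant == "instruction_aware_qformer"))
        results[variant] = {task: evaluate_fn(model, task) for task in held_out_tasks}
    return results  # per-task deltas isolate the architectural contribution

# Dummy callables so the sketch runs end to end; swap in the real training/eval code.
demo = run_ablation(lambda data, instruction_aware: {"uses_instruction": instruction_aware},
                    lambda model, task: 0.0,
                    held_in_data=[],
                    held_out_tasks=["GQA", "VizWiz", "ScienceQA"])
print(demo)
```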

Circularity Check

0 steps flagged

No circularity: purely empirical training and held-out evaluation with no derivation chain

full rationale

The paper describes collecting 26 public datasets, splitting them into 13 held-in and 13 held-out, instruction-tuning BLIP-2 variants on the held-in set, and reporting zero-shot results on the held-out set. No equations, first-principles derivations, uniqueness theorems, or fitted parameters are presented as generating predictions. The central claim is an empirical performance comparison, not a reduction of outputs to inputs by construction. Any risk of image/task overlap is an experimental-validity issue, not a circularity in the reported chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on standard deep-learning training assumptions and the integrity of the held-in/held-out data split; no new theoretical axioms or invented entities are introduced.

axioms (1)
  • domain assumption The 13 held-in and 13 held-out datasets are disjoint with no data leakage.
    Central to the zero-shot generalization claim.

pith-pipeline@v0.9.0 · 5562 in / 1081 out tokens · 61837 ms · 2026-05-13T02:07:30.229468+00:00 · methodology

discussion (0)


Forward citations

Cited by 38 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    cs.CL 2023-11 unverdicted novelty 8.0

    MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

  2. Allegory of the Cave: Measurement-Grounded Vision-Language Learning

    cs.AI 2026-05 unverdicted novelty 7.0

    PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.

  3. Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 7.0

    A foresight-based local purification method using multi-persona simulations and recursive diagnosis reduces infectious jailbreak spread in multi-agent systems from over 95% to below 5.47% while matching benign perform...

  4. SycoPhantasy: Quantifying Sycophancy and Hallucination in Small Open Weight VLMs for Vision-Language Scoring of Fantasy Characters

    cs.CV 2026-04 unverdicted novelty 7.0

    Small VLMs show higher sycophancy (22.3% for 450M model) than larger ones (6.0% for 7B) when scoring image-text alignment on 173k fantasy portraits, quantified via a new Bluffing Coefficient metric.

  5. Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery

    cs.MM 2026-04 unverdicted novelty 7.0

    Geo2Sound generates geographically realistic soundscapes from satellite imagery via geospatial attribute modeling, semantic hypothesis expansion, and geo-acoustic alignment, achieving SOTA FAD of 1.765 on a new 20k-pa...

  6. TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables

    cs.AI 2026-04 conditional novelty 7.0

    TableVision benchmark shows explicit spatial grounding recovers MLLM reasoning on hierarchical tables, delivering 12.3% accuracy improvement through a decoupled perception-reasoning framework.

  7. Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    UCGP is a universal physical adversarial patch that compromises cross-modal semantic alignment in IR-VLMs through curved-grid parameterization and representation-space disruption.

  8. PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

    cs.CV 2024-10 accept novelty 7.0

    PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.

  9. SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    cs.CL 2023-07 unverdicted novelty 7.0

    SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.

  10. Evaluating Object Hallucination in Large Vision-Language Models

    cs.CV 2023-05 accept novelty 7.0

    Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.

  11. LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    cs.CV 2023-03 conditional novelty 7.0

    LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

  12. Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination

    cs.MM 2026-05 unverdicted novelty 6.0

    LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.

  13. Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    FLP uses multi-persona foresight simulation to detect infections via response diversity and applies local purification to reduce maximum cumulative infection rates in multi-agent systems from over 95% to below 5.47%.

  14. Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    A foresight-based local purification method simulates future agent interactions, detects infections via response diversity across personas, and applies targeted rollback or recursive diagnosis to cut maximum infection...

  15. Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Foveated Reasoner integrates foveation as stateful actions inside the autoregressive decoding loop of vision-language models, trained via cold-start supervision then reinforcement learning to achieve higher accuracy a...

  16. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  17. Mitigating Multimodal Hallucination via Phase-wise Self-reward

    cs.CV 2026-04 unverdicted novelty 6.0

    PSRD mitigates visual hallucinations in LVLMs via phase-wise self-reward decoding, cutting rates by 50% on LLaVA-1.5-7B and outperforming prior methods on five benchmarks.

  18. Counting to Four is still a Chore for VLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    VLMs fail at counting because visual evidence degrades in later language layers, and a lightweight Modality Attention Share intervention can encourage better use of image information during answer generation.

  19. See Fair, Speak Truth: Equitable Attention Improves Grounding and Reduces Hallucination in Vision-Language Alignment

    cs.CV 2026-04 conditional novelty 6.0

    Equitable attention via Dominant Object Penalty and Outlier Boost Coefficient reduces object hallucinations in multimodal LLMs without retraining.

  20. CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.

  21. SAM 3D: 3Dfy Anything in Images

    cs.CV 2025-11 unverdicted novelty 6.0

    SAM 3D reconstructs 3D objects from single images with geometry, texture, and pose using human-model annotated data at scale and synthetic-to-real training, achieving 5:1 human preference wins.

  22. OpenVLA: An Open-Source Vision-Language-Action Model

    cs.RO 2024-06 unverdicted novelty 6.0

    OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.

  23. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    cs.CV 2024-03 conditional novelty 6.0

    Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.

  24. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    cs.CV 2023-11 unverdicted novelty 6.0

    Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

  25. Aligning Large Multimodal Models with Factually Augmented RLHF

    cs.CV 2023-09 conditional novelty 6.0

    Factually Augmented RLHF aligns large multimodal models to reduce hallucinations, reaching 94% of GPT-4 on LLaVA-Bench and 60% improvement on the new MMHAL-BENCH.

  26. MMBench: Is Your Multi-modal Model an All-around Player?

    cs.CV 2023-07 accept novelty 6.0

    MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.

  27. Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    cs.CV 2023-06 accept novelty 6.0

    A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.

  28. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    cs.CV 2023-06 unverdicted novelty 6.0

    MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.

  29. Otter: A Multi-Modal Model with In-Context Instruction Tuning

    cs.CV 2023-05 unverdicted novelty 6.0

    Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.

  30. DOSE: Data Selection for Multi-Modal LLMs via Off-the-Shelf Models

    cs.CV 2026-04 unverdicted novelty 5.0

    Off-the-shelf models assess quality and alignment to select diverse multimodal training data, letting models trained on the filtered subset match or exceed full-dataset results on standard benchmarks.

  31. Qwen2.5-Omni Technical Report

    cs.CL 2025-03 conditional novelty 5.0

    Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text perfo...

  32. Hallucination of Multimodal Large Language Models: A Survey

    cs.CV 2024-04 accept novelty 5.0

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

  33. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  34. PaliGemma 2: A Family of Versatile VLMs for Transfer

    cs.CV 2024-12 unverdicted novelty 4.0

    PaliGemma 2 is a family of vision-language models that achieves state-of-the-art results on transfer tasks like table structure recognition and radiography report generation by combining SigLIP with Gemma 2 models at ...

  35. Improved Baselines with Visual Instruction Tuning

    cs.CV 2023-10 conditional novelty 4.0

    Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.

  36. The Rise and Potential of Large Language Model Based Agents: A Survey

    cs.AI 2023-09 accept novelty 4.0

    The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.

  37. A Survey on Hallucination in Large Vision-Language Models

    cs.CV 2024-02 unverdicted novelty 3.0

    This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.

  38. A Survey of Large Language Models

    cs.CL 2023-03 accept novelty 3.0

    This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 36 Pith papers · 5 internal anchors

  1. [1]

    https://openai.com/blog/chatgpt, 2023

    Chatgpt. https://openai.com/blog/chatgpt, 2023. 9

  2. [2]

    https://github.com/lm-sys/FastChat, 2023

    Vicuna. https://github.com/lm-sys/FastChat, 2023. 3, 6, 9

  3. [3]

    nocaps: novel object captioning at scale

    Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. nocaps: novel object captioning at scale. In ICCV, pages 8948–8957, 2019. 3, 16

  4. [4]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikołaj Bińk...

  5. [5]

    Language Models are Few-Shot Learners

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

  6. [6]

    Unifying vision-and-language tasks via text generation

    Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal. Unifying vision-and-language tasks via text generation. arXiv preprint arXiv:2102.02779, 2021. 1

  7. [7]

    Scaling Instruction-Finetuned Language Models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Y . Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jef...

  8. [8]

    Visual dialog

    Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jose M. F. Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In CVPR, 2017. 3, 16

  9. [9]

    Palm-e: An embodied multimodal language model

    Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodied ...

  10. [10]

    Eva: Exploring the limits of masked visual representation learning at scale

    Yuxin Fang, Wen Wang, Binhui Xie, Quan-Sen Sun, Ledell Yu Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. ArXiv, abs/2211.07636, 2022. 6

  11. [11]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, July 2017. 3, 16

  12. [12]

    Vizwiz grand challenge: Answering visual questions from blind people

    Danna Gurari, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In CVPR, 2018. 3, 16

  13. [13]

    Unnatural instructions: Tuning language models with (almost) no human labor

    Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor. ArXiv, abs/2212.09689, 2022. 9

  14. [14]

    Lora: Low-rank adaptation of large language models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In ICLR, 2022. 9

  15. [15]

    Promptcap: Prompt-guided task-aware image captioning

    Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A. Smith, and Jiebo Luo. Promptcap: Prompt-guided task-aware image captioning, 2023. 9

  16. [16]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A. Hudson and Christopher D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019. 3, 16

  17. [17]

    Deep visual-semantic alignments for generating image descriptions

    Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015. 16

  18. [18]

    The hateful memes challenge: Detecting hate speech in multimodal memes

    Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine. The hateful memes challenge: Detecting hate speech in multimodal memes. In NeurIPS, 2020. 3, 16

  19. [19]

    Lavis: A library for language-vision intelligence

    Dongxu Li, Junnan Li, Hung Le, Guangsen Wang, Silvio Savarese, and Steven C. H. Hoi. Lavis: A library for language-vision intelligence, 2022. 6

  20. [20]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023. 1, 3, 4, 6, 9, 16

  21. [21]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022. 5, 16

  22. [22]

    Align before fuse: Vision and language representation learning with momentum distillation

    Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. In NeurIPS, 2021. 5

  23. [23]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 3, 16

  24. [24]

    Visual spatial reasoning

    Fangyu Liu, Guy Edward Toh Emerson, and Nigel Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 2023. 3

  25. [25]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. 2023. 3, 7, 9, 16

  26. [26]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019. 6

  27. [27]

    12-in-1: Multi-task vision and language representation learning

    Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 12-in-1: Multi-task vision and language representation learning. In CVPR, 2020. 1

  28. [28]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS, 2022. 3, 16

  29. [29]

    Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning

    Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. In NeurIPS Track on Datasets and Benchmarks, 2021. 3, 16

  30. [30]

    Ok-vqa: A visual question answering benchmark requiring external knowledge

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In CVPR, 2019. 3, 16

  31. [31]

    Ocr-vqa: Visual question answering by reading text in images

    Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, 2019. 3, 16

  32. [32]

    Large-scale pretraining for visual dialog: A simple state-of-the-art baseline

    Vishvak Murahari, Dhruv Batra, Devi Parikh, and Abhishek Das. Large-scale pretraining for visual dialog: A simple state-of-the-art baseline. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, ECCV, 2020. 6, 7

  33. [33]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023. 7, 9

  34. [34]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 2020. 3, 6

  35. [35]

    Multitask prompted training enables zero-shot task generalization

    Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica...

  36. [36]

    A-okvqa: A benchmark for visual question answering using world knowledge

    Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, ECCV, 2022. 3, 16

  37. [37]

    Prompting large language models with answer heuristics for knowledge-based visual question answering

    Zhenwei Shao, Zhou Yu, Meng Wang, and Jun Yu. Prompting large language models with answer heuristics for knowledge-based visual question answering. Computer Vision and Pattern Recognition (CVPR), 2023. 9

  38. [38]

    Textcaps: a dataset for image captioning with reading comprehension

    Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension. 2020. 3, 16

  39. [39]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarjan, Meet Shah, Yu Jiang, Xinlei Chen, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In CVPR, pages 8317–8326, 2019. 3, 16

  40. [40]

    Stanford alpaca: An instruction-following llama model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023. 9

  41. [41]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 3, 6, 9

  42. [42]

    Cider: Consensus-based image description evaluation

    Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4566–4575, 2015. 6

  43. [43]

    Git: A generative image-to-text transformer for vision and language, 2022

    Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language, 2022. 9

  44. [44]

    Self-Instruct: Aligning Language Models with Self-Generated Instructions

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions. ArXiv, abs/2212.10560, 2022. 9

  45. [45]

    Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks

    Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Kuntal Kumar Pal, Maitreya Patel, Mehrad Moradshahi, Mihir Parma...

  46. [46]

    Finetuned language models are zero-shot learners

    Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. In ICLR, 2022. 1, 5, 8, 9

  47. [47]

    Video question answering via gradually refined attention over appearance and motion

    Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. In Proceedings of the 25th ACM International Conference on Multimedia, page 1645–1653, 2017. 3, 16

  48. [48]

    Multiinstruct: Improving multi-modal zero-shot learning via instruction tuning

    Zhiyang Xu, Ying Shen, and Lifu Huang. Multiinstruct: Improving multi-modal zero-shot learning via instruction tuning. ArXiv, abs/2212.10773, 2022. 9

  49. [49]

    Just ask: Learning to answer questions from millions of narrated videos

    Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid. Just ask: Learning to answer questions from millions of narrated videos. In ICCV, pages 1686–1697, 2021. 3, 6, 16

  50. [50]

    mplug-owl: Modularization empowers large language models with multimodality

    Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yi Zhou, Junyan Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qiang Qi, Ji Chao Zhang, and Feiyan Huang. mplug-owl: Modularization empowers large language models with multimodality. 2023. 9

  51. [51]

    From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions

    Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2, 2014. 3, 16

  52. [52]

    Minigpt-4: Enhancing vision-language understanding with advanced large language models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models, 2023. 7, 9