Pith · machine review for the scientific record

arxiv: 2305.03726 · v2 · submitted 2023-05-05 · 💻 cs.CV · cs.CL

Recognition: 2 theorem links · Lean Theorem

Otter: A Multi-Modal Model with In-Context Instruction Tuning

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:40 UTC · model grok-4.3

classification: 💻 cs.CV · cs.CL
keywords: Otter model · MIMIC-IT dataset · multi-modal instruction tuning · in-context learning · Flamingo architecture · video understanding · visual assistants · multi-image reasoning

The pith

Otter improves multi-modal instruction following by training on in-context examples from both text and images or videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Otter, a model that extends Flamingo by incorporating textual and visual in-context examples during instruction tuning. This setup uses the MIMIC-IT dataset of over three million multi-modal instruction-response pairs to train the model for general-purpose visual assistance. The central idea is that providing diverse in-context examples across images and videos helps the model converge faster and generalize to new tasks involving multiple images or dynamic video. If this holds, models could handle complex visual queries more reliably without relying solely on single-instruction training. The work targets the gap where prior models did not systematically use multi-modal contexts for better instruction adherence.

Core claim

Otter builds on the Flamingo Perceiver architecture and is instruction-tuned on the MIMIC-IT dataset, which supplies diverse in-context examples across text, multiple images, and video; this training substantially improves convergence speed and generalization on tasks that require understanding complex video content and multi-image scenarios.

What carries the argument

The MIMIC-IT dataset and its per-entry curation of in-context multi-modal examples, which the model uses during instruction tuning to process combined text, image, and video inputs.
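
To make the in-context structure concrete, below is a minimal sketch of what a MIMIC-IT-style training sample might look like once flattened into the interleaved stream a Flamingo-style model consumes. The field names, the "<image>" placeholder token, and the prompt template are illustrative assumptions, not the dataset's actual schema or Otter's released format.

```python
# Illustrative only: field names, the "<image>" placeholder, and the prompt
# template are assumptions, not the actual MIMIC-IT schema.
from dataclasses import dataclass
from typing import List

@dataclass
class ContextExample:
    image_ids: List[str]   # context images or sampled video frames
    instruction: str       # in-context instruction
    response: str          # in-context answer the model can imitate

@dataclass
class InContextSample:
    context: List[ContextExample]  # zero or more worked examples
    query_image_ids: List[str]     # visuals for the target query
    query_instruction: str         # instruction the model must answer
    target_response: str           # supervision applies to this span only

def render(sample: InContextSample) -> str:
    """Flatten a sample into one interleaved text stream; "<image>" marks
    where visual tokens would be spliced in by the vision pathway."""
    parts = []
    for ex in sample.context:
        parts.append("<image>" * len(ex.image_ids))
        parts.append(f"User: {ex.instruction} Assistant: {ex.response}")
    parts.append("<image>" * len(sample.query_image_ids))
    parts.append(f"User: {sample.query_instruction} Assistant: {sample.target_response}")
    return " ".join(parts)
```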

If this is right

  • Otter processes instructions that combine text with multiple images or video clips without separate handling steps.
  • Training convergence accelerates when in-context examples accompany each instruction.
  • Performance rises on video understanding and multi-image reasoning benchmarks.
  • The model supports a broader range of real-world visual assistant queries than single-instruction baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar in-context tuning could be applied to other base multi-modal architectures to test if the gains transfer.
  • Dataset curation that emphasizes scenario diversity may reduce the total volume of data needed for strong multi-modal performance.
  • Future work could explore whether the same approach scales to longer video sequences or more images per context.
  • This points toward instruction tuning that treats context as a first-class input rather than an optional add-on.

Load-bearing premise

The assumption that MIMIC-IT's diverse in-context examples produce real generalization gains instead of improvements limited to the dataset's own scenarios.

What would settle it

Measure whether a version of Otter trained without in-context examples matches or exceeds the full model's accuracy on a held-out set of multi-image and video instruction tasks drawn from sources outside MIMIC-IT.
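
A minimal sketch of that comparison, assuming both trained variants expose the same generate interface and the held-out tasks carry reference answers; the model objects, task fields, and matching rule are placeholders, not an existing evaluation harness.

```python
# Hypothetical ablation harness: compare a model tuned with in-context
# examples against one tuned without them on held-out multi-image/video
# instruction tasks drawn from outside MIMIC-IT. All names are placeholders.

def accuracy(model, tasks):
    """Fraction of tasks whose generated answer matches the reference."""
    hits = 0
    for task in tasks:
        prediction = model.generate(task["visuals"], task["instruction"])
        hits += int(prediction.strip().lower() == task["answer"].strip().lower())
    return hits / len(tasks)

def settle_it(otter_full, otter_no_context, held_out_tasks):
    """The claim survives only if the in-context-tuned model clearly wins."""
    return {
        "with_in_context": accuracy(otter_full, held_out_tasks),
        "without_in_context": accuracy(otter_no_context, held_out_tasks),
    }
```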

read the original abstract

Recent advances in Large Multimodal Models (LMMs) have unveiled great potential as visual assistants. However, most existing works focus on responding to individual instructions or using previous dialogues for contextual understanding. There is little discussion on employing both images and text as in-context examples to enhance the instruction following capability. To bridge this gap, we introduce the Otter model to leverage both textual and visual in-context examples for instruction tuning. Specifically, Otter builds upon Flamingo with Perceiver architecture, and has been instruction tuned for general purpose multi-modal assistant. Otter seamlessly processes multi-modal inputs, supporting modalities including text, multiple images, and dynamic video content. To support the training of Otter, we present the MIMIC-IT (MultI-Modal In-Context Instruction Tuning) dataset, which encompasses over 3 million multi-modal instruction-response pairs, including approximately 2.2 million unique instructions across a broad spectrum of images and videos. MIMIC-IT has been carefully curated to feature a diverse array of in-context examples for each entry. Comprehensive evaluations suggest that instruction tuning with these in-context examples substantially enhances model convergence and generalization capabilities. Notably, the extensive scenario coverage provided by the MIMIC-IT dataset empowers the Otter model to excel in tasks involving complex video and multi-image understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Otter model, built on the Flamingo Perceiver architecture, and the MIMIC-IT dataset of over 3 million multi-modal instruction-response pairs featuring diverse in-context examples across images and videos. It claims that instruction tuning with these in-context examples substantially enhances convergence and generalization, enabling strong performance on complex video and multi-image tasks.

Significance. If the empirical claims hold with proper validation, the work would advance multi-modal in-context learning by releasing a large-scale dataset with broad scenario coverage and demonstrating a model that handles variable multi-image and video inputs. The dataset curation and architectural integration represent potentially useful resources for general-purpose visual assistants.

major comments (3)
  1. [Abstract] The assertion that 'instruction tuning with these in-context examples substantially enhances model convergence and generalization capabilities' is unsupported by any quantitative metrics, baselines (e.g., Flamingo), error bars, or evaluation details, so the data-to-claim link cannot be assessed.
  2. [Experiments] No ablation isolates the contribution of in-context structure (multiple images/videos per instruction) versus equivalent data volume or scenario coverage in single-instruction format, leaving dataset scale as a plausible alternative explanation for any gains.
  3. [Model Architecture] Details are missing on how variable-length visual sequences from multiple images and dynamic videos are tokenized, embedded, and attended within the existing Flamingo Perceiver without hidden limitations or modifications.
minor comments (2)
  1. [Experiments] Add explicit comparison tables against Flamingo and other LMM baselines with standard metrics (e.g., accuracy, CIDEr) and statistical significance tests.
  2. [Method] Clarify notation for in-context example formatting and provide pseudocode for how multi-modal sequences are constructed during training.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications from the manuscript and indicating where revisions will be made to strengthen the presentation of results and technical details.

read point-by-point responses
  1. Referee: [Abstract] The assertion that 'instruction tuning with these in-context examples substantially enhances model convergence and generalization capabilities' is unsupported by any quantitative metrics, baselines (e.g., Flamingo), error bars, or evaluation details, so the data-to-claim link cannot be assessed.

    Authors: We appreciate this observation. The manuscript's Section 4 reports quantitative evaluations across multiple benchmarks, including direct comparisons to the base Flamingo model, with performance metrics on video and multi-image tasks and error bars in the tables. The abstract summarizes these findings at a high level. To make the link between data and claims explicit, we will revise the abstract to include specific quantitative improvements (e.g., accuracy gains on key tasks) supported by the experimental results. revision: yes

  2. Referee: [Experiments] No ablation isolates the contribution of in-context structure (multiple images/videos per instruction) versus equivalent data volume or scenario coverage in single-instruction format, leaving dataset scale as a plausible alternative explanation for any gains.

    Authors: This is a fair critique. The current experiments demonstrate gains from instruction tuning on MIMIC-IT but do not include an ablation that holds total data volume fixed while varying only the in-context (multi-image/video) structure versus single-instruction format. Creating a matched single-instruction control set at the same scale would require substantial additional curation. We will add a dedicated paragraph in the Experiments section acknowledging this limitation, discussing why the in-context design is central to the dataset, and noting it as an avenue for future work. revision: partial

  3. Referee: [Model Architecture] Details are missing on how variable-length visual sequences from multiple images and dynamic videos are tokenized, embedded, and attended within the existing Flamingo Perceiver without hidden limitations or modifications.

    Authors: We thank the referee for noting this gap. Otter uses the Flamingo Perceiver architecture without modifications to its core resampler or attention layers. Multiple images are encoded independently by the vision encoder and their visual tokens are concatenated into a single sequence; videos are processed by uniformly sampling frames and treating the resulting sequence as additional visual tokens. The Perceiver resampler then maps the variable-length visual sequence to a fixed set of latents for cross-attention with the language model. We will expand the Model Architecture section with a new subsection that explicitly describes the input preprocessing, tokenization, embedding, and attention flow for these variable-length cases. revision: yes
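
A minimal PyTorch-style sketch of the data flow this response describes: images encoded independently and concatenated, videos uniformly frame-sampled, and a Perceiver-style resampler mapping the variable-length visual token sequence to a fixed set of latents for cross-attention. Module names, dimensions, and the sampling policy are assumptions, not the released Otter code.

```python
# Sketch of the described flow under stated assumptions, not Otter's code.
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Maps a variable-length visual token sequence to a fixed set of latents
    via cross-attention (stand-in for Flamingo's Perceiver resampler)."""
    def __init__(self, dim=1024, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_tokens):             # (B, N_visual, dim), N varies
        b = visual_tokens.size(0)
        latents = self.latents.unsqueeze(0).expand(b, -1, -1)
        out, _ = self.attn(latents, visual_tokens, visual_tokens)
        return out                                 # (B, num_latents, dim) fixed

def encode_inputs(vision_encoder, resampler, images=None, video=None, num_frames=8):
    """Images are encoded independently and their tokens concatenated; videos
    are uniformly frame-sampled and treated as additional visual tokens."""
    token_chunks = []
    if images is not None:                         # (B, K, C, H, W)
        b = images.size(0)
        feats = vision_encoder(images.flatten(0, 1))       # (B*K, P, dim)
        token_chunks.append(feats.reshape(b, -1, feats.size(-1)))
    if video is not None:                          # (B, T, C, H, W)
        idx = torch.linspace(0, video.size(1) - 1, num_frames).long()
        frames = video[:, idx]
        b = frames.size(0)
        feats = vision_encoder(frames.flatten(0, 1))       # (B*F, P, dim)
        token_chunks.append(feats.reshape(b, -1, feats.size(-1)))
    assert token_chunks, "provide images and/or video"
    visual_tokens = torch.cat(token_chunks, dim=1)         # variable length
    return resampler(visual_tokens)                # fixed latents for cross-attn
```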

Circularity Check

0 steps flagged

No circularity: claims rest on new dataset and standard training

full rationale

The paper introduces Otter by extending the external Flamingo Perceiver architecture and trains it on the newly curated MIMIC-IT dataset containing multi-modal in-context examples. The central claim—that in-context instruction tuning enhances convergence and generalization—is presented as an empirical outcome of this training and subsequent evaluations, with no equations, fitted parameters, or self-citations that reduce the result to a definitional loop or renamed input. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the derivation chain; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim depends on the effectiveness of the newly curated MIMIC-IT dataset and the compatibility of the Flamingo base model with added in-context visual examples; these rest on unstated assumptions about data diversity and architecture suitability.

free parameters (2)
  • Number and selection of in-context examples per training instance
    Chosen during dataset curation and training but not quantified in the abstract.
  • Instruction tuning hyperparameters
    Standard training settings required for convergence but unspecified.
axioms (1)
  • domain assumption: The Perceiver architecture in the Flamingo base model enables seamless handling of text, multiple images, and dynamic video inputs.
    Invoked when stating that Otter processes multi-modal inputs without additional architectural changes.

pith-pipeline@v0.9.0 · 5579 in / 1213 out tokens · 31120 ms · 2026-05-15T02:40:09.150748+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    cs.CV 2024-09 accept novelty 8.0

    Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

  2. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    cs.CL 2024-09 accept novelty 8.0

    MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

  3. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    cs.CL 2023-11 unverdicted novelty 8.0

    MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

  4. AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation

    cs.CV 2026-04 unverdicted novelty 7.0

    AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.

  5. Why Multimodal In-Context Learning Lags Behind? Unveiling the Inner Mechanisms and Bottlenecks

    cs.CV 2026-04 unverdicted novelty 7.0

    Multimodal ICL lags text-only ICL in few-shot settings due to weak cross-modal reasoning alignment and unreliable task mapping transfer, with an inference-stage method proposed to strengthen transfer.

  6. MLVU: Benchmarking Multi-task Long Video Understanding

    cs.CV 2024-06 conditional novelty 7.0

    MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.

  7. SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    cs.CL 2023-07 unverdicted novelty 7.0

    SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.

  8. Evaluating Object Hallucination in Large Vision-Language Models

    cs.CV 2023-05 accept novelty 7.0

    Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.

  9. LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    cs.CV 2023-03 conditional novelty 7.0

    LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

  10. SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.

  11. R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs

    cs.CV 2026-04 conditional novelty 6.0

    R-CoV is a six-step region-aware chain-of-verification technique that elicits coordinate and description outputs from LVLMs themselves to detect and reduce object hallucinations without external models or retraining.

  12. CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.

  13. Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM

    cs.CV 2026-03 unverdicted novelty 6.0

    Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five...

  14. ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    cs.CV 2023-11 conditional novelty 6.0

    A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.

  15. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    cs.CV 2023-11 unverdicted novelty 6.0

    Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

  16. Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    cs.CV 2023-06 accept novelty 6.0

    A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.

  17. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    cs.CV 2023-06 unverdicted novelty 6.0

    MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.

  18. Hallucination of Multimodal Large Language Models: A Survey

    cs.CV 2024-04 accept novelty 5.0

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

  19. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  20. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

  21. OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    cs.CV 2023-08 unverdicted novelty 4.0

    OpenFlamingo provides open-source autoregressive vision-language models that achieve 80-89% of Flamingo performance on seven vision-language datasets.

Reference graph

Works this paper leans on

104 extracted references · 104 canonical work pages · cited by 21 Pith papers · 28 internal anchors

  1. [1]

    https://commoncrawl.org/

    Common crawl. https://commoncrawl.org/. Accessed: 2023-11-

  2. [2]

    What learning algorithm is in-context learning? investigations with linear models

    Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? investigations with linear models. arXiv preprint arXiv:2211.15661 ,

  3. [3]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022. 1

  4. [4]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022. 3

  5. [5]

    Vqa: Visual question answering

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision , pages 2425–2433, 2015. 8

  6. [7]

    OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. OpenFlamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint...

  7. [8]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966 , 2023. 1, 6

  8. [9]

    Introducing our multimodal models, 2023

    Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sa˘ gnak Ta¸ sırlar. Introducing our multimodal models, 2023. 12

  9. [10]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems , 33:1877–1901, 2020. 1

  10. [11]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems (NeuIPS), 33:1877–1901, 2020. 2, 3

  11. [12]

    Activitynet: A large-scale video benchmark for human activity understanding

    Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the ieee conference on computer vision and pattern recognition , pages 961–970, 2015. 5

  12. [13]

    Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 3558– 3568, 2021. 3

  13. [14]

    Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 3558– 3568, 2021. 5

  14. [15]

    Scanrefer: 3d object localization in rgb-d scans using natural language

    Dave Zhenyu Chen, Angel X Chang, and Matthias Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. 16th European Conference on Computer Vision (ECCV) ,

  15. [16]

    Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

    Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195 , 2023. 1

  16. [17]

    Making your first choice: To address cold start problem in vision active learning

    Liangyu Chen, Yutong Bai, Siyu Huang, Yongyi Lu, Bihan Wen, Alan L Yuille, and Zongwei Zhou. Making your first choice: To address cold start problem in vision active learning. arXiv preprint arXiv:2210.02442, 2022. 7

  17. [18]

    Sharegpt4v: Improving large multi-modal models with better captions

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. In European Conference on Computer Vision, pages 370–387. Springer, 2024. 3

  18. [19]

    Pali-x: On scaling up a multilingual vision and language model

    Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Good- man, Xiao Wang, Yi Tay, et al. Pali-x: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565 , 2023. 1

  19. [20]

    Pali-3 vision language models: Smaller, faster, stronger, 2023

    Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, and Radu Soricut. Pali-3 vision language models: Smaller, faster, stronger, 2023. 1

  20. [21]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015. 9

  21. [22]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. 1

  22. [23]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. 2

  23. [24]

    Scaling Instruction-Finetuned Language Models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 , 2022. 1, 2

  24. [25]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly- annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 5828–5839, 2017. 4

  25. [26]

    Why can gpt learn in-context? language models secretly perform gradient descent as meta optimizers

    Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Zhifang Sui, and Furu Wei. Why can gpt learn in-context? language models secretly perform gradient descent as meta optimizers. arXiv preprint arXiv:2212.10559, 2022. 2

  26. [28]

    Instructblip: Towards general-purpose vision-language models with instruction tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven C. H. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR, abs/2305.06500, 2023. 1, 3

  27. [29]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei- Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 248–255. Ieee, 2009. 8

  28. [30]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multi- modal large language models. arXiv preprint arXiv:2306.13394 ,

  29. [31]

    LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

    Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010 , 2023. 1

  30. [32]

    What can transformers learn in-context? a case study of simple function classes

    Shivam Garg, Dimitris Tsipras, Percy S Liang, and Gregory Valiant. What can transformers learn in-context? a case study of simple function classes. Advances in Neural Information Processing Systems , 35:30583–30598, 2022. 2

  31. [33]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 6904–6913, 2017. 3

  32. [34]

    Unnatural instructions: Tuning language models with (almost) no human labor

    Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689, 2022. 3

  33. [35]

    Visual storytelling

    Ting-Hao K. Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Aishwarya Agrawal, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. Visual storytelling. In 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2016) , 2016. 4

  34. [36]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 6700–6709, 2019. 8

  35. [37]

    Face Classification and Detection

    Wondong Hyeon. Face Classification and Detection. https:// github.com/wondonghyeon/face-classification, 2023. Accessed: 2023-11-25. 5

  36. [38]

    Learning to describe differences between pairs of similar images

    Harsh Jhamtani and Taylor Berg-Kirkpatrick. Learning to describe differences between pairs of similar images. In EMNLP, pages 4024–4034. Association for Computational Linguistics, 2018. 4

  37. [39]

    Referitgame: Referring to objects in photographs of natural scenes

    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) , pages 787–798, 2014. 10

  38. [40]

    A hierarchical approach for generating descriptive image paragraphs

    Jonathan Krause, Justin Johnson, Ranjay Krishna, and Li Fei- Fei. A hierarchical approach for generating descriptive image paragraphs. In Computer Vision and Patterm Recognition (CVPR) ,

  39. [41]

    Dense-captioning events in videos

    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision , pages 706–715,

  40. [42]

    Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. corr abs/1602.07332, 2016. 8

  41. [43]

    Visual genome: Connecting language and vision using crowdsourced dense image annotations

    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision , 123:32–73, 2017. 3

  42. [44]

    Visual genome: Connecting language and vision using crowdsourced dense image annotations

    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision , 123:32–73, 2017. 5

  43. [45]

    Obelics: An open web-scale filtered dataset of interleaved image-text documents

    Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks T rack, 2023. 1, 6

  44. [46]

    Tvr: A large- scale dataset for video-subtitle moment retrieval

    Jie Lei, Licheng Yu, Tamara L Berg, and Mohit Bansal. Tvr: A large- scale dataset for video-subtitle moment retrieval. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16 , pages 447–463. Springer,

  45. [47]

    Otterhd: A high-resolution multi-modality model

    Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, and Ziwei Liu. Otterhd: A high-resolution multi-modality model. arXiv preprint arXiv:2311.04219 , 2023. 1, 12

  46. [48]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597,

  47. [49]

    VideoChat: Chat-Centric Video Understanding

    KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat- centric video understanding. arXiv preprint arXiv:2305.06355 , 2023. 9

  48. [50]

    M3it: A large-scale dataset towards multi- modal multilingual instruction tuning

    Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun, et al. M3it: A large-scale dataset towards multi-modal multilingual instruction tuning. arXiv preprint arXiv:2306.04387 , 2023. 5, 8

  49. [51]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision- language models. arXiv preprint arXiv:2305.10355 , 2023. 9

  50. [52]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision– ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13 , pages 740–755. Springer, 2014. 3, 4, 5, 10

  51. [53]

    Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565 , 2023. 3

  52. [54]

    Mitigating hallucination in large multi-modal models via robust instruction tuning, 2023

    Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Mitigating hallucination in large multi-modal models via robust instruction tuning, 2023. 5

  53. [55]

    Improved baselines with visual instruction tuning, 2023

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023. 6, 8

  54. [57]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485 , 2023. 1

  55. [58]

    Prismer: A vision-language model with an ensemble of experts

    Shikun Liu, Linxi Fan, Edward Johns, Zhiding Yu, Chaowei Xiao, and Anima Anandkumar. Prismer: A vision-language model with an ensemble of experts. arXiv preprint arXiv:2303.02506 , 2023. 1

  56. [59]

    MMBench: Is Your Multi-modal Model an All-around Player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281 , 2023. 9, 12

  57. [60]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reason- ing of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023. 9

  58. [61]

    Ok-vqa: A visual question answering benchmark requiring external knowledge

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition , pages 3195– 3204, 2019. 8

  59. [62]

    Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

    Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837, 2022. 2

  60. [63]

    Ocr-vqa: Visual question answering by reading text in images

    Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, 2019. 8

  61. [64]

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. ArXiv 2306.05424, 2023. 3, 5, 9

  62. [65]

    Gpt-4 technical report

    OpenAI. Gpt-4 technical report. https://openai.com/research/ gpt-4, 2023. 1, 2, 3

  63. [66]

    Im2text: Describing images using 1 million captioned photographs

    Vicente Ordonez, Girish Kulkarni, and Tamara Berg. Im2text: De- scribing images using 1 million captioned photographs. Advances in neural information processing systems , 24, 2011. 3

  64. [67]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems , 35:27730–27744, 2022. 1

  65. [68]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems , 35:27730–27744, 2022. 2, 3

  66. [69]

    Instruction Tuning with GPT-4

    Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277, 2023. 1

  67. [70]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning , pages 8748–8763. PMLR, 2021. 3

  68. [71]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML) , pages 8748–8763. PMLR,

  69. [72]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. 1

  70. [73]

    Multitask Prompted Training Enables Zero-Shot Task Generalization

    Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lin- tang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training en- ables zero-shot task generalization. arXiv preprint arXiv:2110.08207,

  72. [75]

    Laion-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems , 35:25278–25294,

  73. [76]

    LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

    Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021. 3

  74. [77]

    A-okvqa: A benchmark for visual question answering using world knowledge

    Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Ken- neth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision , pages 146–162. Springer, 2022. 8

  75. [78]

    Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 2556–2565, 2018. 3

  76. [79]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 8317–8326, 2019. 8

  77. [80]

    Vistext: A benchmark for semantically rich chart captioning

    Benny J Tang, Angie Boggust, and Arvind Satyanarayan. Vistext: A benchmark for semantically rich chart captioning. arXiv preprint arXiv:2307.05356, 2023. 5

  78. [81]

    Hashimoto

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023. 1

  79. [82]

    Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023

    MosaicML NLP Team. Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023. Accessed: 2023-05-05. 3

  80. [84]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Na- man Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 3

Showing first 80 references.