pith. machine review for the scientific record.

arXiv: 2310.03744 · v2 · submitted 2023-10-05 · 💻 cs.CV · cs.AI · cs.CL · cs.LG

Recognition: 1 theorem link

Improved Baselines with Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee

Pith reviewed 2026-05-12 19:05 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL · cs.LG
keywords visual instruction tuning · large multimodal models · LLaVA · vision-language connector · VQA data · CLIP vision encoder · multimodal baselines · data-efficient training

The pith

Simple modifications to LLaVA produce stronger baselines that lead on 11 visual instruction benchmarks using only 1.2 million public examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the simple fully-connected vision-language connector in LLaVA works well and needs little data to reach high performance. By swapping in a higher-resolution CLIP vision encoder (CLIP-ViT-L-336px), replacing the linear connector with an MLP projection, and mixing in academic-task-oriented visual question-answering data with simple response-formatting prompts, the authors obtain improved checkpoints that reach state-of-the-art results on 11 benchmarks. The final 13-billion-parameter model trains in roughly one day on a single node of eight A100 GPUs and uses only about 1.2 million publicly available examples. This lowers the barrier for researchers who want to build capable multimodal models without massive resources or proprietary datasets.
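To make the effect of the higher-resolution encoder concrete, the short sketch below counts the patch tokens a ViT produces at two input resolutions. It assumes a patch size of 14 (the usual ViT-L/14 CLIP configuration) and a 224px baseline for comparison; neither number is stated on this page, and the function name is illustrative.

```python
def num_patch_tokens(image_size: int, patch_size: int = 14) -> int:
    """Number of non-overlapping square patches a ViT extracts from a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


low_res = num_patch_tokens(224)   # assumed baseline resolution
high_res = num_patch_tokens(336)  # resolution named in CLIP-ViT-L-336px
print(low_res, high_res, high_res / low_res)  # 256 576 2.25
```

Under these assumptions the language model sees 2.25 times as many visual tokens per image, which is one plausible reason the higher-resolution encoder helps on detail-sensitive tasks and also why it costs more compute per example.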

Core claim

The fully-connected vision-language cross-modal connector inside LLaVA is surprisingly powerful and data-efficient. Three straightforward changes (CLIP-ViT-L-336px as the vision encoder, an MLP projection as the connector, and academic-task-oriented VQA data formatted with simple response prompts) yield models that establish new state-of-the-art results on 11 benchmarks. The final 13B checkpoint trains on just 1.2 million publicly available examples and completes full training in about one day on a single 8-A100 node.
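As an illustration of what a simple response formatting prompt might look like, the sketch below appends a fixed instruction to a VQA question so the model knows which answer style is expected. The prompt wording, the answer-style labels, and the `format_vqa_prompt` helper are assumptions made for illustration; this page does not give the exact strings used in the paper.

```python
# Hypothetical format prompts keyed by answer style (wording is illustrative).
FORMAT_PROMPTS = {
    "short_answer": "Answer the question using a single word or phrase.",
    "multiple_choice": "Answer with the option's letter from the given choices directly.",
}


def format_vqa_prompt(question: str, answer_style: str = "short_answer") -> str:
    """Append a response-format instruction to a question, separated by a newline."""
    suffix = FORMAT_PROMPTS[answer_style]
    return f"{question.strip()}\n{suffix}"


print(format_vqa_prompt("What color is the bus?"))
```

The design point is that the desired answer length is stated explicitly in the input rather than left for the model to infer from the dataset, so short-answer academic data and long-form conversational data can be mixed in one training set.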

What carries the argument

The fully-connected vision-language cross-modal connector, implemented as an MLP projection layer that maps visual features from the CLIP encoder into the language model's embedding space.
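The connector described above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not the authors' implementation: the two-layer depth, the GELU activation, the initialization, and the toy dimensions are all assumptions, since this page does not specify the layer count or hidden sizes, and the class and helper names are hypothetical.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Compute W x + b for one vector x; weight has shape [out_dim][in_dim]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_linear(in_dim, out_dim, rng):
    """Small uniform random weights and zero bias (an arbitrary choice for the sketch)."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class MLPProjector:
    """Assumed design: Linear -> GELU -> Linear, applied to each visual token independently."""

    def __init__(self, vision_dim: int, llm_dim: int, seed: int = 0):
        rng = random.Random(seed)
        self.w1, self.b1 = init_linear(vision_dim, llm_dim, rng)
        self.w2, self.b2 = init_linear(llm_dim, llm_dim, rng)

    def __call__(self, tokens):
        projected = []
        for tok in tokens:
            hidden = [gelu(h) for h in linear(tok, self.w1, self.b1)]
            projected.append(linear(hidden, self.w2, self.b2))
        return projected


# Toy sizes for a quick check; real encoders and language models use far larger dimensions.
projector = MLPProjector(vision_dim=8, llm_dim=16)
visual_tokens = [[0.1 * (i + j) for j in range(8)] for i in range(4)]  # 4 tokens of width 8
projected = projector(visual_tokens)
print(len(projected), len(projected[0]))  # 4 16
```

The number of tokens is unchanged by the projection; only the feature width moves from the vision encoder's dimension to the language model's embedding dimension, so the projected tokens can be placed in the same sequence as text embeddings.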

Load-bearing premise

The reported gains are caused by the listed changes to the vision encoder, connector, and training data rather than by differences in training procedure, data cleaning steps, or evaluation details that are not described.

What would settle it

Retraining the original LLaVA architecture with the same new data mixture and formatting prompts but keeping the smaller CLIP encoder and linear connector, then measuring whether the performance gap largely disappears.

Original abstract

Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript claims that simple modifications to LLaVA—specifically, replacing the vision-language connector with an MLP and using the higher-resolution CLIP-ViT-L-336px encoder, plus supplementing training with academic VQA data under simple response formatting prompts—produce stronger baselines that achieve state-of-the-art results across 11 benchmarks. The final 13B model is trained on only 1.2M public examples and completes full training in roughly one day on a single 8-A100 node.

Significance. If the reported gains hold under controlled conditions, the work is significant for demonstrating that competitive LMM performance is achievable with modest public data and compute, thereby lowering barriers to entry. The public release of code and models is a clear strength that supports reproducibility and future baseline comparisons in visual instruction tuning.

major comments (1)
  1. [Experiments] The central attribution, that the listed modifications (CLIP-ViT-L-336px + MLP connector and added academic VQA data) are responsible for the SOTA results, requires isolating ablations. The experimental comparisons do not hold all other variables (optimizer schedule, data filtering, response formatting details beyond the stated prompts, or evaluation protocol) fixed while toggling only the listed changes, leaving open the possibility that unstated factors contribute to the lift.
minor comments (2)
  1. A consolidated table listing all 11 benchmarks with exact metrics for the proposed model versus prior baselines would improve clarity and allow direct verification of the SOTA claim.
  2. [Method] The MLP projection architecture (layer count, hidden dimensions) and the precise composition of the 1.2M training mixture should be specified in the method section for full reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful review and for recognizing the significance of our work in making strong LMM baselines more accessible. We provide a point-by-point response to the major comment below.

Point-by-point responses
  1. Referee: [Experiments] The central attribution, that the listed modifications (CLIP-ViT-L-336px + MLP connector and added academic VQA data) are responsible for the SOTA results, requires isolating ablations. The experimental comparisons do not hold all other variables (optimizer schedule, data filtering, response formatting details beyond the stated prompts, or evaluation protocol) fixed while toggling only the listed changes, leaving open the possibility that unstated factors contribute to the lift.

    Authors: We agree with the referee that isolating the effects of each modification through controlled ablations would provide stronger evidence for our claims. The original manuscript presents comparisons of the full improved model against prior work, but does not include exhaustive ablations that hold all other factors constant. In the revised version, we will add new experiments that fix the training recipe (optimizer, schedule, data filtering, prompts, and evaluation) and vary only the vision encoder resolution, the connector architecture, and the addition of academic VQA data. These ablations will clarify the contribution of each change.

    Revision: yes
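The ablation plan in this response amounts to a full factorial design over three binary factors with everything else held fixed. A minimal sketch of enumerating such a grid follows; the factor labels (including the 224px baseline name), the placeholder values in the fixed recipe, and the `ablation_grid` helper are hypothetical and are not taken from the paper.

```python
from itertools import product

# Hypothetical factor levels; the labels are illustrative only.
FACTORS = {
    "vision_encoder": ["CLIP-ViT-L-224px", "CLIP-ViT-L-336px"],
    "connector": ["linear", "mlp"],
    "academic_vqa_data": [False, True],
}

# Settings held constant across every run (placeholder values).
FIXED_RECIPE = {
    "optimizer": "fixed",
    "schedule": "fixed",
    "data_filtering": "fixed",
    "prompts": "fixed",
    "evaluation": "fixed",
}


def ablation_grid(factors, fixed):
    """Yield one run config per combination of factor levels (full factorial design)."""
    names = list(factors)
    for levels in product(*(factors[name] for name in names)):
        yield {**fixed, **dict(zip(names, levels))}


runs = list(ablation_grid(FACTORS, FIXED_RECIPE))
print(len(runs))  # 8
```

Comparing runs that differ in exactly one factor isolates that factor's contribution, while the full grid also exposes interactions, for example whether the MLP connector helps more once the extra VQA data is present.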

Circularity Check

0 steps flagged

No circularity: empirical baseline improvements rest on training runs and benchmarks

Full rationale

The paper reports results from training LLaVA variants with stated modifications (CLIP-ViT-L-336px + MLP connector, added academic VQA data, simple prompts) and evaluates them on 11 public benchmarks. No derivation chain, equations, or first-principles predictions exist that could reduce to self-defined quantities or fitted inputs by construction. Self-citations to prior LLaVA work supply background but do not serve as load-bearing uniqueness theorems or ansatzes; the central claims are new empirical numbers from explicit training procedures using 1.2M public data points. The work is self-contained against external benchmarks and does not rename known patterns or smuggle assumptions via citation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical engineering note with no new theoretical constructs; it relies on standard machine learning training assumptions and publicly available datasets.

axioms (1)
  • domain assumption: Standard machine learning assumptions on data representativeness and model generalization to benchmarks.
    The paper depends on typical supervised fine-tuning and evaluation practices in multimodal learning.

pith-pipeline@v0.9.0 · 5432 in / 1140 out tokens · 61727 ms · 2026-05-12T19:05:19.082536+00:00 · methodology


Forward citations

Cited by 39 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning

    cs.AI 2026-04 unverdicted novelty 8.0

    FeynmanBench is the first benchmark for evaluating multimodal LLMs on diagrammatic reasoning with Feynman diagrams, revealing systematic failures in enforcing physical constraints and global topology.

  2. Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 7.0

    A foresight-based local purification method using multi-persona simulations and recursive diagnosis reduces infectious jailbreak spread in multi-agent systems from over 95% to below 5.47% while matching benign perform...

  3. A Sanity Check on Composed Image Retrieval

    cs.CV 2026-04 unverdicted novelty 7.0

    The paper creates FISD, a controlled benchmark for composed image retrieval that removes query ambiguity via generative models, and proposes a multi-round agentic evaluation to assess models in interactive settings.

  4. Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment

    cs.CV 2026-04 unverdicted novelty 7.0

    Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.

  5. LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    cs.CV 2023-03 conditional novelty 7.0

    LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

  6. Revealing Interpretable Failure Modes of VLMs

    cs.AI 2026-05 unverdicted novelty 6.0

    REVELIO uncovers interpretable failure modes in VLMs by searching combinatorial concept spaces with diversity-aware beam search and Gaussian-process Thompson sampling, revealing vulnerabilities in autonomous driving a...

  7. When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    Layer-wise Laplacian energy of visual attention reveals hallucination emergence in MLLMs and enables LaSCD, a closed-form logit remapping strategy that mitigates hallucinations while preserving general performance.

  8. Text-Conditional JEPA for Learning Semantically Rich Visual Representations

    cs.LG 2026-05 unverdicted novelty 6.0

    TC-JEPA conditions masked feature prediction on text captions via sparse cross-attention to produce more semantically rich visual representations and outperforms contrastive methods on fine-grained tasks.

  9. Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    A foresight-based local purification method simulates future agent interactions, detects infections via response diversity across personas, and applies targeted rollback or recursive diagnosis to cut maximum infection...

  10. PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model

    cs.AI 2026-04 unverdicted novelty 6.0

    PhysNote lets VLMs externalize physical knowledge into hierarchical self-generated notes, stabilizing spatio-temporal reasoning and yielding 56.68% accuracy on PhysBench with a 4.96% gain over the best multi-agent baseline.

  11. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  12. Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs

    cs.AI 2026-04 unverdicted novelty 6.0

    MemJack achieves 71.48% attack success rate on unmodified COCO val2017 images against Qwen3-VL-Plus by coordinating agents to map visual entities to malicious intents, apply multi-angle camouflage, and filter refusals...

  13. Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

    cs.LG 2026-04 unverdicted novelty 6.0

    DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...

  14. Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Entropy-gradient grounding uses model uncertainty to retrieve evidence regions in VLMs, improving performance on detail-critical and compositional tasks across multiple architectures.

  15. Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM

    cs.CV 2026-03 unverdicted novelty 6.0

    Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five...

  16. SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

    cs.AI 2025-01 unverdicted novelty 6.0

    Reinforcement learning post-training enables generalization to unseen textual rule variants and visual changes in foundation models, while supervised fine-tuning primarily leads to memorization.

  17. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  18. OpenVLA: An Open-Source Vision-Language-Action Model

    cs.RO 2024-06 unverdicted novelty 6.0

    OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.

  19. Chameleon: Mixed-Modal Early-Fusion Foundation Models

    cs.CL 2024-05 unverdicted novelty 6.0

    Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro o...

  20. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    cs.CL 2024-04 accept novelty 6.0

    Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.

  21. Are We on the Right Way for Evaluating Large Vision-Language Models?

    cs.CV 2024-03 conditional novelty 6.0

    Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...

  22. SGLang: Efficient Execution of Structured Language Model Programs

    cs.AI 2023-12 conditional novelty 6.0

    SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.

  23. ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    cs.CV 2023-11 conditional novelty 6.0

    A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.

  24. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    cs.CV 2023-11 unverdicted novelty 6.0

    Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

  25. MMBench: Is Your Multi-modal Model an All-around Player?

    cs.CV 2023-07 accept novelty 6.0

    MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.

  26. Make Your LVLM KV Cache More Lightweight

    cs.CV 2026-05 unverdicted novelty 5.0

    LightKV compresses vision-token KV cache in LVLMs to 55% size via prompt-guided cross-modality aggregation, halving cache memory, cutting compute 40%, and maintaining performance on benchmarks.

  27. Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation

    cs.CV 2026-04 unverdicted novelty 5.0

    MPD reduces hallucinations in LVLMs by 23.4% while retaining 97.4% of general capability through semantic disentanglement and selective parameter updates.

  28. CoGR-MoE: Concept-Guided Expert Routing with Consistent Selection and Flexible Reasoning for Visual Question Answering

    cs.CV 2026-04 unverdicted novelty 5.0

    CoGR-MoE improves VQA by using concept-guided expert routing with option feature reweighting and contrastive learning to achieve consistent yet flexible reasoning across answer options.

  29. Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs

    cs.LG 2026-04 unverdicted novelty 5.0

    Pruning removes 'unsafe tickets' from LLMs via gradient-free attribution, reducing harmful outputs and jailbreak vulnerability with minimal utility loss.

  30. Qwen2.5-Omni Technical Report

    cs.CL 2025-03 conditional novelty 5.0

    Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text perfo...

  31. Hallucination of Multimodal Large Language Models: A Survey

    cs.CV 2024-04 accept novelty 5.0

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

  32. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  33. When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise

    cs.CV 2026-05 unverdicted novelty 4.0

    Mild rotations and noise significantly increase relation hallucinations in VLMs across models and datasets, with prompt augmentation and preprocessing offering only partial mitigation.

  34. When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise

    cs.CV 2026-05 unverdicted novelty 4.0

    Mild rotations and noise significantly increase relation hallucinations in VLMs across models and datasets, with prompt and preprocessing fixes providing only partial relief.

  35. Integration of Object Detection and Small VLMs for Construction Safety Hazard Identification

    cs.CV 2026-04 unverdicted novelty 4.0

    Detection-guided prompting raises small VLM hazard F1 from 34.5% to 50.6% and BERTScore from 0.61 to 0.82 on construction images with only 2.5 ms added latency.

  36. VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    cs.CV 2025-01 unverdicted novelty 4.0

    VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

  37. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

  38. Yi: Open Foundation Models by 01.AI

    cs.CL 2024-03 unverdicted novelty 4.0

    Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.

  39. A Survey on Hallucination in Large Vision-Language Models

    cs.CV 2024-02 unverdicted novelty 3.0

    This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · cited by 37 Pith papers · 19 internal anchors

  1. [1]

    Fuyu-8b: A multimodal architecture for ai agents

    Adept AI. Fuyu-8b: A multimodal architecture for ai agents. https://www.adept.ai/blog/fuyu-8b, 2024. 2

  2. [2]

    Flamingo: a Visual Language Model for Few-Shot Learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a vi- sual language model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022. 1

  3. [3]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023. 1, 2, 4, 5, 6, 13

  4. [4]

    Visit-bench: A benchmark for vision- language instruction following inspired by real-world use,

    Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wan- rong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, and Ludwig Schimdt. Visit-bench: A benchmark for vision- language instruction following inspired by real-world use,

  5. [5]

    Training Diffusion Models with Reinforcement Learning

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023. 1

  6. [6]

    Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447, 2023

    Nicholas Carlini, Milad Nasr, Christopher A Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer, et al. Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447, 2023. 1

  7. [7]

    Visual instruction tuning with polite flamingo

    Delong Chen, Jianfeng Liu, Wenliang Dai, and Baoyuan Wang. Visual instruction tuning with polite flamingo. arXiv preprint arXiv:2307.01003, 2023. 2, 3

  8. [8]

    Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

    Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195,

  9. [9]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geof- frey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020. 3

  10. [10]

    Improved Baselines with Momentum Contrastive Learning

    Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Im- proved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020. 3

  11. [11]

    Pali-x: On scaling up a multilingual vision and language model

    Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebas- tian Goodman, Xiao Wang, Yi Tay, et al. Pali-x: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565, 2023. 5

  12. [12]

    Gonzalez, Ion Stoica, and Eric P

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023. 9

  13. [13]

    Scaling Instruction-Finetuned Language Models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction- finetuned language models. arXiv preprint arXiv:2210.11416,

  14. [14]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision- language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023. 1, 2, 3, 5, 6, 8, 13

  15. [15]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 4

  16. [16]

    Eva: Exploring the limits of masked visual representation learning at scale

    Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358– 19369, 2023. 6

  17. [17]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Meng- dan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation bench- mark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 1, 4, 5

  18. [18]

    arXiv preprint arXiv:2305.04790 , year=

    Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790,

  19. [19]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Ba- tra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017. 3, 5, 6, 9

  20. [20]

    Vizwiz grand challenge: Answering visual questions from blind people

    Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617,

  21. [21]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019. 4, 5, 9

  22. [22]

    Introducing idefics: An open reproduction of state-of-the-art visual language model

    IDEFICS. Introducing idefics: An open reproduction of state-of-the-art visual language model. https : / / huggingface.co/blog/idefics, 2023. 5, 6

  23. [23]

    Openclip

    Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Ha- jishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip. 2021. If you use this software, please cite it as below. 6

  24. [24]

    Referitgame: Referring to objects in pho- tographs of natural scenes

    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in pho- tographs of natural scenes. In Proceedings of the 2014 con- ference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014. 4, 9

  25. [25]

    Visual genome: Connecting language and vision using crowdsourced dense image annotations

    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalan- tidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017. 4, 9

  26. [26]

    Lisa: Reasoning segmentation via large language model,

    Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692,

  27. [27]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yix- iao Ge, and Ying Shan. Seed-bench: Benchmarking mul- timodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023. 1, 5

  28. [28]

    Otterhd: A high-resolution multi-modality model, 2023

    Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, and Ziwei Liu. Otterhd: A high-resolution multi-modality model, 2023. 2

  29. [29]

    Li, and Ziwei Liu

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425, 2023. 1

  30. [30]

    Multimodal founda- tion models: From specialists to general-purpose assistants

    Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Lin- jie Li, Lijuan Wang, and Jianfeng Gao. Multimodal founda- tion models: From specialists to general-purpose assistants. arXiv preprint arXiv:2309.10020, 2023. 1

  31. [31]

    et al.: LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day (Jun 2023)

    Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and- vision assistant for biomedicine in one day. arXiv preprint arXiv:2306.00890, 2023. 1

  32. [32]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip- 2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023. 1, 2, 3, 4, 5, 6, 13

  33. [33]

    Prefix-Tuning: Optimizing Continuous Prompts for Generation

    Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimiz- ing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021. 3

  34. [34]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucina- tion in large vision-language models. arXiv preprint arXiv:2305.10355, 2023. 1, 5

  35. [35]

    Microsoft COCO: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. 2, 5

  36. [36]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. 1, 2, 3, 4, 5, 6, 8, 9

  37. [37]

    MMBench: Is Your Multi-modal Model an All-around Player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023. 1, 5, 7

  38. [38]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 2022. 5

  39. [39]

    An empirical study of scaling instruct-tuned large multimodal models

    Yadong Lu, Chunyuan Li, Haotian Liu, Jianwei Yang, Jianfeng Gao, and Yelong Shen. An empirical study of scaling instruct-tuned large multimodal models. arXiv preprint arXiv:2309.09958, 2023. 1, 4

  40. [40]

    Generation and comprehension of unambiguous object descriptions

    Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016. 4, 9

  41. [41]

    Ok-vqa: A visual question answering benchmark requiring external knowledge

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 4, 9

  42. [42]

    Ocr-vqa: Visual question answering by reading text in images

    Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In 2019 international conference on document analysis and recognition (ICDAR), pages 947–952. IEEE, 2019. 4, 9

  43. [43]

    Gpt-4v(ision) system card

    OpenAI. Gpt-4v(ision) system card. https://cdn.openai.com/papers/GPTV_System_Card.pdf,

  44. [44]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021. 6

  45. [45]

    A-okvqa: A benchmark for visual question answering using world knowledge

    Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–162. Springer, 2022. 4, 9

  46. [46]

    ShareGPT

    ShareGPT. https://sharegpt.com/, 2023. 4, 7, 8, 9

  47. [47]

    Textcaps: a dataset for image captioning with reading comprehension

    Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 742–758. Springer, 2020. 4, 9

  48. [48]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019. 5

  49. [49]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. 7

  50. [50]

    Visionllm: Large language model is also an open-ended decoder for vision-centric tasks

    Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint arXiv:2305.11175, 2023. 1

  51. [51]

    Finetuned Language Models Are Zero-Shot Learners

    Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021. 2

  52. [52]

    The dawn of lmms: Preliminary explorations with gpt-4v(ision)

    Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v(ision). arXiv preprint arXiv:2309.17421, 2023. 6

  53. [53]

    Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model

    Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, Qin Jin, Liang He, Xin Alex Lin, and Fei Huang. Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model, 2023. 2

  54. [54]

    mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

    Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023. 1

  55. [55]

    MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

    Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023. 1, 3, 4, 5, 8

  56. [56]

    Gpt4roi: Instruction tuning large language model on region-of-interest

    Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Kai Chen, and Ping Luo. Gpt4roi: Instruction tuning large language model on region-of-interest. arXiv preprint arXiv:2307.03601, 2023. 1

  57. [57]

    Llavar: Enhanced visual instruction tuning for text-rich image understanding

    Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, and Tong Sun. Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107, 2023. 2

  58. [58]

    Svit: Scaling up visual instruction tuning

    Bo Zhao, Boya Wu, and Tiejun Huang. Svit: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087,

  59. [59]

    On evaluating adversarial robustness of large vision-language models

    Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Cheung, and Min Lin. On evaluating adversarial robustness of large vision-language models. arXiv preprint arXiv:2305.16934, 2023. 1

  60. [60]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. 10

  61. [61]

    Lima: Less is more for alignment

    Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206, 2023. 2, 8

  62. [62]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023. 1, 2