arXiv: 2310.03744 · v2 · submitted 2023-10-05 · cs.CV · cs.AI · cs.CL · cs.LG

Improved Baselines with Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee

Pith reviewed 2026-05-12 19:05 UTC · model grok-4.3

classification cs.CV · cs.AI · cs.CL · cs.LG

keywords visual instruction tuning · large multimodal models · LLaVA · vision-language connector · VQA data · CLIP vision encoder · multimodal baselines · data-efficient training

The pith

Simple modifications to LLaVA produce stronger baselines that lead on 11 visual instruction benchmarks using only 1.2 million public examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the basic vision-language connector in LLaVA works well and needs little data to reach high performance. By swapping in a higher-resolution CLIP vision encoder, replacing the connector with a simple MLP projection, and mixing in more academic visual question-answering examples with clean response formats, the authors create improved checkpoints that reach state-of-the-art results on 11 benchmarks. A 13-billion-parameter model trained this way finishes in roughly one day on eight A100 GPUs and still uses only publicly available data. This lowers the barrier for researchers who want to build capable multimodal models without massive resources or proprietary datasets.
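To make the cost of the higher-resolution encoder concrete, the following minimal sketch counts the patch tokens a ViT-style encoder produces at a given input resolution. The patch size of 14 and the 224px comparison point are assumptions brought in for illustration; this page states neither, and `num_patch_tokens` is a hypothetical helper name.

```python
def num_patch_tokens(image_size: int, patch_size: int = 14) -> int:
    """Number of non-overlapping square patches for a square input image.

    patch_size=14 is an assumption (typical for a ViT-L/14 backbone),
    not a value taken from this page.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side * side


# Compare an assumed 224px baseline with the 336px encoder.
for resolution in (224, 336):
    print(resolution, num_patch_tokens(resolution))
# 224 256
# 336 576
```

Under these assumptions, moving from 224px to 336px raises the number of visual tokens passed to the language model by a factor of 2.25, which is the price paid for the extra visual detail.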

Core claim

The fully-connected vision-language cross-modal connector inside LLaVA is surprisingly powerful and data-efficient. With the straightforward substitutions of CLIP-ViT-L-336px as the vision encoder, an MLP projection layer as the connector, and the addition of academic-task-oriented VQA data formatted with simple response prompts, the resulting models establish new state-of-the-art numbers on eleven different benchmarks while training on just 1.2 million publicly available examples and completing full training in about one day on a single 8-A100 node.
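A response-formatting prompt of the kind mentioned above can be sketched as a short instruction appended to each academic VQA question. The exact wording of the two suffixes below is illustrative and should be checked against the paper, since this page does not quote them; `FORMAT_PROMPTS` and `format_vqa_prompt` are hypothetical names.

```python
# Illustrative response-format suffixes; the exact wording is an assumption.
FORMAT_PROMPTS = {
    "short_answer": "Answer the question using a single word or phrase.",
    "multiple_choice": "Answer with the option's letter from the given choices directly.",
}


def format_vqa_prompt(question: str, task: str = "short_answer") -> str:
    """Append a response-format instruction to a raw VQA question."""
    if task not in FORMAT_PROMPTS:
        raise ValueError(f"unknown task type: {task!r}")
    return f"{question.strip()}\n{FORMAT_PROMPTS[task]}"


print(format_vqa_prompt("What color is the bus?"))
```

The intent of such a suffix is to tell the model explicitly when a terse answer is expected, so that short-answer training data does not push it toward terse replies in open-ended conversation.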

What carries the argument

The fully-connected vision-language cross-modal connector, implemented as an MLP projection layer that maps visual features from the CLIP encoder into the language model's embedding space.
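As a rough illustration of such a connector, the sketch below applies a two-layer MLP with a GELU nonlinearity to each patch feature. The layer count, activation, initialization, and the toy dimensions are all assumptions, since this page does not specify the architecture; dependency-free Python is used only to keep the example self-contained.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def init_linear(n_in: int, n_out: int, rng: random.Random):
    """Random (n_out x n_in) weights and a zero bias; a hypothetical init."""
    scale = 1.0 / math.sqrt(n_in)
    weight = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weight, bias


def linear(x, weight, bias):
    """Affine map y = W x + b for a single feature vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def mlp_project(patch_features, layer1, layer2):
    """Map every visual patch feature into the language model's embedding space."""
    projected = []
    for feat in patch_features:
        hidden = [gelu(h) for h in linear(feat, *layer1)]
        projected.append(linear(hidden, *layer2))
    return projected


# Toy sizes only; real encoders and language models use far larger dimensions.
VISION_DIM, LLM_DIM, NUM_PATCHES = 8, 16, 4
rng = random.Random(0)
layer1 = init_linear(VISION_DIM, LLM_DIM, rng)
layer2 = init_linear(LLM_DIM, LLM_DIM, rng)
patches = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
visual_tokens = mlp_project(patches, layer1, layer2)
print(len(visual_tokens), len(visual_tokens[0]))  # 4 16
```

Each output vector has the language model's embedding width, so the projected patches can be placed in the input sequence alongside ordinary text token embeddings.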

Load-bearing premise

The reported gains are caused by the listed changes to the vision encoder, connector, and training data rather than by differences in training procedure, data cleaning steps, or evaluation details that are not described.

What would settle it

Retraining the original LLaVA architecture with the same new data mixture and formatting prompts but keeping the smaller CLIP encoder and linear connector, then measuring whether the performance gap largely disappears.

Original abstract

Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Simple tweaks to LLaVA deliver strong new baselines on 11 benchmarks with tiny data and compute, but the paper leaves the exact sources of the gains under-specified.

This note shows that swapping in CLIP-ViT-L-336px plus an MLP connector and folding in some academic VQA data with plain prompts can lift LLaVA-style models to state-of-the-art numbers across eleven standard vision-language tasks. The 13B checkpoint trains on just 1.2 million public examples in roughly a day on eight A100s, which is genuinely useful for anyone who wants a reproducible high-water mark without industrial-scale resources. The authors also state that code and models will be made publicly available, which would make the recipe directly reusable rather than just a claim on arXiv.

Referee Report

1 major / 2 minor

Summary. The manuscript claims that simple modifications to LLaVA—specifically, replacing the vision-language connector with an MLP and using the higher-resolution CLIP-ViT-L-336px encoder, plus supplementing training with academic VQA data under simple response formatting prompts—produce stronger baselines that achieve state-of-the-art results across 11 benchmarks. The final 13B model is trained on only 1.2M public examples and completes full training in roughly one day on a single 8-A100 node.

Significance. If the reported gains hold under controlled conditions, the work is significant for demonstrating that competitive LMM performance is achievable with modest public data and compute, thereby lowering barriers to entry. The promised public release of code and models is a clear strength that supports reproducibility and future baseline comparisons in visual instruction tuning.

major comments (1)

[Experiments] The central attribution—that the listed modifications (CLIP-ViT-L-336px + MLP connector and added academic VQA data) are responsible for the SOTA results—requires isolating ablations. The experimental comparisons do not hold all other variables (optimizer schedule, data filtering, response formatting details beyond the stated prompts, or evaluation protocol) fixed while toggling only these two changes, leaving open the possibility that unstated factors contribute to the lift.

minor comments (2)

A consolidated table listing all 11 benchmarks with exact metrics for the proposed model versus prior baselines would improve clarity and allow direct verification of the SOTA claim.
[Method] The MLP projection architecture (layer count, hidden dimensions) and the precise composition of the 1.2M training mixture should be specified in the method section for full reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful review and for recognizing the significance of our work in making strong LMM baselines more accessible. We provide a point-by-point response to the major comment below.

Referee: [Experiments] The central attribution—that the listed modifications (CLIP-ViT-L-336px + MLP connector and added academic VQA data) are responsible for the SOTA results—requires isolating ablations. The experimental comparisons do not hold all other variables (optimizer schedule, data filtering, response formatting details beyond the stated prompts, or evaluation protocol) fixed while toggling only these two changes, leaving open the possibility that unstated factors contribute to the lift.

Authors: We agree with the referee that isolating the effects of each modification through controlled ablations would provide stronger evidence for our claims. The original manuscript presents comparisons of the full improved model against prior work, but does not include exhaustive ablations that hold all other factors constant. In the revised version, we will add new experiments that fix the training recipe (optimizer, schedule, data filtering, prompts, and evaluation) and vary only the vision encoder resolution, the connector architecture, and the addition of academic VQA data. These ablations will clarify the contribution of each change.

Revision: yes
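The ablation plan described in this response amounts to a full factorial grid over three binary factors with the rest of the recipe held fixed. A minimal sketch of enumerating that grid follows; the factor labels and the `ablation_grid` helper are hypothetical and stand in for whatever configuration system a real training codebase uses.

```python
from itertools import product

# Three factors to toggle; labels are illustrative, not taken from the paper.
FACTORS = {
    "vision_encoder": ("baseline", "CLIP-ViT-L-336px"),
    "connector": ("linear", "mlp"),
    "academic_vqa_data": (False, True),
}


def ablation_grid(factors):
    """Return one config dict per combination of factor levels."""
    names = list(factors)
    return [
        dict(zip(names, combo))
        for combo in product(*(factors[name] for name in names))
    ]


runs = ablation_grid(FACTORS)
print(len(runs))  # 8
for run in runs:
    print(run)
```

Comparing runs that differ in exactly one factor isolates that factor's contribution. The first entry in the grid corresponds to the original configuration and the last to the full improved model.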

Circularity Check

0 steps flagged

No circularity: empirical baseline improvements rest on training runs and benchmarks

The paper reports results from training LLaVA variants with stated modifications (CLIP-ViT-L-336px + MLP connector, added academic VQA data, simple prompts) and evaluates them on 11 public benchmarks. No derivation chain, equations, or first-principles predictions exist that could reduce to self-defined quantities or fitted inputs by construction. Self-citations to prior LLaVA work supply background but do not serve as load-bearing uniqueness theorems or ansatzes; the central claims are new empirical numbers from explicit training procedures using 1.2M public data points. The work is self-contained against external benchmarks and does not rename known patterns or smuggle assumptions via citation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical engineering note with no new theoretical constructs; it relies on standard machine learning training assumptions and publicly available datasets.

axioms (1)

domain assumption: Standard machine learning assumptions on data representativeness and model generalization to benchmarks.
The paper depends on typical supervised fine-tuning and evaluation practices in multimodal learning.

Forward citations

Cited by 39 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning
cs.AI 2026-04 unverdicted novelty 8.0

FeynmanBench is the first benchmark for evaluating multimodal LLMs on diagrammatic reasoning with Feynman diagrams, revealing systematic failures in enforcing physical constraints and global topology.
Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
cs.AI 2026-05 unverdicted novelty 7.0

A foresight-based local purification method using multi-persona simulations and recursive diagnosis reduces infectious jailbreak spread in multi-agent systems from over 95% to below 5.47% while matching benign perform...
A Sanity Check on Composed Image Retrieval
cs.CV 2026-04 unverdicted novelty 7.0

The paper creates FISD, a controlled benchmark for composed image retrieval that removes query ambiguity via generative models, and proposes a multi-round agentic evaluation to assess models in interactive settings.
Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment
cs.CV 2026-04 unverdicted novelty 7.0

Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
cs.CV 2023-03 conditional novelty 7.0

LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
Revealing Interpretable Failure Modes of VLMs
cs.AI 2026-05 unverdicted novelty 6.0

REVELIO uncovers interpretable failure modes in VLMs by searching combinatorial concept spaces with diversity-aware beam search and Gaussian-process Thompson sampling, revealing vulnerabilities in autonomous driving a...
When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs
cs.CV 2026-05 unverdicted novelty 6.0

Layer-wise Laplacian energy of visual attention reveals hallucination emergence in MLLMs and enables LaSCD, a closed-form logit remapping strategy that mitigates hallucinations while preserving general performance.
Text-Conditional JEPA for Learning Semantically Rich Visual Representations
cs.LG 2026-05 unverdicted novelty 6.0

TC-JEPA conditions masked feature prediction on text captions via sparse cross-attention to produce more semantically rich visual representations and outperforms contrastive methods on fine-grained tasks.
Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
cs.AI 2026-05 unverdicted novelty 6.0

A foresight-based local purification method simulates future agent interactions, detects infections via response diversity across personas, and applies targeted rollback or recursive diagnosis to cut maximum infection...
PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model
cs.AI 2026-04 unverdicted novelty 6.0

PhysNote lets VLMs externalize physical knowledge into hierarchical self-generated notes, stabilizing spatio-temporal reasoning and yielding 56.68% accuracy on PhysBench with a 4.96% gain over the best multi-agent baseline.
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
cs.AI 2026-04 unverdicted novelty 6.0

HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs
cs.AI 2026-04 unverdicted novelty 6.0

MemJack achieves 71.48% attack success rate on unmodified COCO val2017 images against Qwen3-VL-Plus by coordinating agents to map visual entities to malicious intents, apply multi-angle camouflage, and filter refusals...
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
cs.LG 2026-04 unverdicted novelty 6.0

DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...
Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models
cs.CV 2026-04 unverdicted novelty 6.0

Entropy-gradient grounding uses model uncertainty to retrieve evidence regions in VLMs, improving performance on detail-critical and compositional tasks across multiple architectures.
Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM
cs.CV 2026-03 unverdicted novelty 6.0

Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five...
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
cs.AI 2025-01 unverdicted novelty 6.0

Reinforcement learning post-training enables generalization to unseen textual rule variants and visual changes in foundation models, while supervised fine-tuning primarily leads to memorization.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
cs.CV 2024-12 unverdicted novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
OpenVLA: An Open-Source Vision-Language-Action Model
cs.RO 2024-06 unverdicted novelty 6.0

OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.
Chameleon: Mixed-Modal Early-Fusion Foundation Models
cs.CL 2024-05 unverdicted novelty 6.0

Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro o...
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
cs.CL 2024-04 accept novelty 6.0

Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.
Are We on the Right Way for Evaluating Large Vision-Language Models?
cs.CV 2024-03 conditional novelty 6.0

Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
SGLang: Efficient Execution of Structured Language Model Programs
cs.AI 2023-12 conditional novelty 6.0

SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
cs.CV 2023-11 conditional novelty 6.0

A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
cs.CV 2023-11 unverdicted novelty 6.0

Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
MMBench: Is Your Multi-modal Model an All-around Player?
cs.CV 2023-07 accept novelty 6.0

MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.
Make Your LVLM KV Cache More Lightweight
cs.CV 2026-05 unverdicted novelty 5.0

LightKV compresses vision-token KV cache in LVLMs to 55% size via prompt-guided cross-modality aggregation, halving cache memory, cutting compute 40%, and maintaining performance on benchmarks.
Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation
cs.CV 2026-04 unverdicted novelty 5.0

MPD reduces hallucinations in LVLMs by 23.4% while retaining 97.4% of general capability through semantic disentanglement and selective parameter updates.
CoGR-MoE: Concept-Guided Expert Routing with Consistent Selection and Flexible Reasoning for Visual Question Answering
cs.CV 2026-04 unverdicted novelty 5.0

CoGR-MoE improves VQA by using concept-guided expert routing with option feature reweighting and contrastive learning to achieve consistent yet flexible reasoning across answer options.
Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs
cs.LG 2026-04 unverdicted novelty 5.0

Pruning removes 'unsafe tickets' from LLMs via gradient-free attribution, reducing harmful outputs and jailbreak vulnerability with minimal utility loss.
Qwen2.5-Omni Technical Report
cs.CL 2025-03 conditional novelty 5.0

Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text perfo...
Hallucination of Multimodal Large Language Models: A Survey
cs.CV 2024-04 accept novelty 5.0

The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
cs.CV 2023-12 unverdicted novelty 5.0

InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise
cs.CV 2026-05 unverdicted novelty 4.0

Mild rotations and noise significantly increase relation hallucinations in VLMs across models and datasets, with prompt augmentation and preprocessing offering only partial mitigation.
When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise
cs.CV 2026-05 unverdicted novelty 4.0

Mild rotations and noise significantly increase relation hallucinations in VLMs across models and datasets, with prompt and preprocessing fixes providing only partial relief.
Integration of Object Detection and Small VLMs for Construction Safety Hazard Identification
cs.CV 2026-04 unverdicted novelty 4.0

Detection-guided prompting raises small VLM hazard F1 from 34.5% to 50.6% and BERTScore from 0.61 to 0.82 on construction images with only 2.5 ms added latency.
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
cs.CV 2025-01 unverdicted novelty 4.0

VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
cs.CV 2024-04 unverdicted novelty 4.0

InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
Yi: Open Foundation Models by 01.AI
cs.CL 2024-03 unverdicted novelty 4.0

Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
A Survey on Hallucination in Large Vision-Language Models
cs.CV 2024-02 unverdicted novelty 3.0

This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · cited by 37 Pith papers · 19 internal anchors

[1]

Fuyu-8b: A multimodal architecture for ai agents

Adept AI. Fuyu-8b: A multimodal architecture for ai agents. https://www.adept.ai/blog/fuyu-8b, 2024. 2

work page 2024
[2]

Flamingo: a Visual Language Model for Few-Shot Learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a vi- sual language model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022. 1

work page internal anchor Pith review Pith/arXiv arXiv 2022
[3]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023. 1, 2, 4, 5, 6, 13

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Visit-bench: A benchmark for vision- language instruction following inspired by real-world use,

Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wan- rong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, and Ludwig Schimdt. Visit-bench: A benchmark for vision- language instruction following inspired by real-world use,

work page
[5]

Training Diffusion Models with Reinforcement Learning

Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447, 2023

Nicholas Carlini, Milad Nasr, Christopher A Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer, et al. Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447, 2023. 1

work page arXiv 2023
[7]

Visual instruction tuning with polite flamingo

Delong Chen, Jianfeng Liu, Wenliang Dai, and Baoyuan Wang. Visual instruction tuning with polite flamingo. arXiv preprint arXiv:2307.01003, 2023. 2, 3

work page arXiv 2023
[8]

Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

A simple framework for contrastive learning of visual representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geof- frey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020. 3

work page 2020
[10]

Improved Baselines with Momentum Contrastive Learning

Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Im- proved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020. 3

work page internal anchor Pith review arXiv 2003
[11]

Pali-x: On scaling up a multilingual vision and language model

Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebas- tian Goodman, Xiao Wang, Yi Tay, et al. Pali-x: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565, 2023. 5

work page arXiv 2023
[12]

Gonzalez, Ion Stoica, and Eric P

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023. 9

work page 2023
[13]

Scaling Instruction-Finetuned Language Models

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction- finetuned language models. arXiv preprint arXiv:2210.11416,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision- language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023. 1, 2, 3, 5, 6, 8, 13

work page internal anchor Pith review arXiv 2023
[15]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 4

work page internal anchor Pith review Pith/arXiv arXiv 2010
[16]

Eva: Exploring the limits of masked visual representation learning at scale

Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358– 19369, 2023. 6

work page 2023
[17]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Meng- dan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation bench- mark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 1, 4, 5

work page internal anchor Pith review Pith/arXiv arXiv 2023
[18]

arXiv preprint arXiv:2305.04790 , year=

Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790,

work page arXiv
[19]

Making the v in vqa matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Ba- tra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017. 3, 5, 6, 9

work page 2017
[20]

Vizwiz grand challenge: Answering visual questions from blind people

Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617,

work page
[21]

Gqa: A new dataset for real-world visual reasoning and compositional question answering

Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019. 4, 5, 9

work page 2019
[22]

Introducing idefics: An open reproduction of state-of-the-art visual language model

IDEFICS. Introducing idefics: An open reproduction of state-of-the-art visual language model. https : / / huggingface.co/blog/idefics, 2023. 5, 6

work page 2023
[23]

Openclip

Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Ha- jishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip. 2021. If you use this software, please cite it as below. 6

work page 2021
[24]

Referitgame: Referring to objects in pho- tographs of natural scenes

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in pho- tographs of natural scenes. In Proceedings of the 2014 con- ference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014. 4, 9

work page 2014
[25]

Visual genome: Connecting language and vision using crowdsourced dense image annotations

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalan- tidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017. 4, 9

work page 2017
[26]

Lisa: Reasoning segmentation via large language model,

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692,

work page arXiv
[27]

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yix- iao Ge, and Ying Shan. Seed-bench: Benchmarking mul- timodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023. 1, 5

work page internal anchor Pith review Pith/arXiv arXiv 2023
[28]

Otterhd: A high-resolution multi-modality model, 2023

Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, and Ziwei Liu. Otterhd: A high-resolution multi-modality model, 2023. 2

work page 2023
[29]

Li, and Ziwei Liu

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425, 2023. 1

work page arXiv 2023
[30]

Multimodal founda- tion models: From specialists to general-purpose assistants

Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Lin- jie Li, Lijuan Wang, and Jianfeng Gao. Multimodal founda- tion models: From specialists to general-purpose assistants. arXiv preprint arXiv:2309.10020, 2023. 1

work page arXiv 2023
[31]

et al.: LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day (Jun 2023)

Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and- vision assistant for biomedicine in one day. arXiv preprint arXiv:2306.00890, 2023. 1

work page arXiv 2023
[32]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip- 2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023. 1, 2, 3, 4, 5, 6, 13

[33]

Prefix-Tuning: Optimizing Continuous Prompts for Generation

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021. 3

[34]

Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023. 1, 5

[35]

Microsoft COCO: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. 2, 5

[36]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. 1, 2, 3, 4, 5, 6, 8, 9

[37]

MMBench: Is Your Multi-modal Model an All-around Player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023. 1, 5, 7

[38]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 2022. 5

[39]

An empirical study of scaling instruct-tuned large multimodal models

Yadong Lu, Chunyuan Li, Haotian Liu, Jianwei Yang, Jianfeng Gao, and Yelong Shen. An empirical study of scaling instruct-tuned large multimodal models. arXiv preprint arXiv:2309.09958, 2023. 1, 4

[40]

Generation and comprehension of unambiguous object descriptions

Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016. 4, 9

[41]

Ok-vqa: A visual question answering benchmark requiring external knowledge

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 4, 9

[42]

Ocr-vqa: Visual question answering by reading text in images

Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In 2019 international conference on document analysis and recognition (ICDAR), pages 947–952. IEEE, 2019. 4, 9

[43]

Gpt-4v(ision) system card

OpenAI. Gpt-4v(ision) system card. https://cdn.openai.com/papers/GPTV_System_Card.pdf,

[44]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021. 6

[45]

A-okvqa: A benchmark for visual question answering using world knowledge

Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–162. Springer, 2022. 4, 9

[46]

ShareGPT

ShareGPT. https://sharegpt.com/, 2023. 4, 7, 8, 9

[47]

Textcaps: a dataset for image captioning with reading comprehension

Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 742–758. Springer, 2020. 4, 9

[48]

Towards vqa models that can read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019. 5

[49]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. 7

[50]

Visionllm: Large language model is also an open-ended decoder for vision-centric tasks

Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint arXiv:2305.11175, 2023. 1

[51]

Finetuned Language Models Are Zero-Shot Learners

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021. 2

[52]

The dawn of lmms: Preliminary explorations with gpt-4v(ision)

Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v(ision). arXiv preprint arXiv:2309.17421, 2023. 6

[53]

Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model

Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, Qin Jin, Liang He, Xin Alex Lin, and Fei Huang. Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model, 2023. 2

[54]

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023. 1

[55]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023. 1, 3, 4, 5, 8

[56]

Gpt4roi: Instruction tuning large language model on region-of-interest

Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Kai Chen, and Ping Luo. Gpt4roi: Instruction tuning large language model on region-of-interest. arXiv preprint arXiv:2307.03601, 2023. 1

[57]

Llavar: Enhanced visual instruction tuning for text-rich image understanding

Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, and Tong Sun. Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107, 2023. 2

[58]

Svit: Scaling up visual instruction tuning

Bo Zhao, Boya Wu, and Tiejun Huang. Svit: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087,

[59]

On evaluating adversarial robustness of large vision-language models

Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Cheung, and Min Lin. On evaluating adversarial robustness of large vision-language models. arXiv preprint arXiv:2305.16934, 2023. 1

[60]

Judging llm-as-a-judge with mt-bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. 10

[61]

Lima: Less is more for alignment

Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206, 2023. 2, 8

[62]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023. 1, 2
