arxiv: 2308.02490 · v4 · submitted 2023-08-04 · 💻 cs.AI · cs.CL · cs.CV · cs.LG

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Jianfeng Wang, Kevin Lin, Lijuan Wang, Linjie Li, Weihao Yu, Xinchao Wang, Zhengyuan Yang, Zicheng Liu

Pith reviewed 2026-05-11 11:48 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.CV · cs.LG

keywords large multimodal models · evaluation benchmark · vision-language capabilities · multimodal tasks · LLM-based evaluator · integrated capabilities · open-ended evaluation

The pith

MM-Vet evaluates large multimodal models by testing integration of six core vision-language capabilities across sixteen combinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MM-Vet as an evaluation benchmark for large multimodal models on complicated tasks. It builds the benchmark around the observation that success on such tasks often requires a model to integrate multiple core vision-language capabilities rather than relying on any single one. MM-Vet therefore defines six core capabilities and examines the sixteen integrations that arise from their pairwise and higher-order combinations. An LLM-based evaluator is introduced to score open-ended model outputs uniformly, regardless of question type or answer style. The resulting scores on representative models yield insights into the strengths and limitations of different LMM architectures and training paradigms.

Core claim

MM-Vet is designed based on the insight that the intriguing ability to solve complicated tasks is often achieved by a generalist model being able to integrate different core vision-language (VL) capabilities. MM-Vet defines 6 core VL capabilities and examines the 16 integrations of interest derived from the capability combination. For evaluation metrics, it proposes an LLM-based evaluator for open-ended outputs that enables evaluation across different question types and answer styles, resulting in a unified scoring metric.
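
To make the unified-metric idea concrete, here is a minimal sketch of how an LLM-based evaluator of this kind could be wired up. It assumes a judge that replies with a number between 0 and 1 as text. The prompt wording, the `judge` callable, and every function name are illustrative assumptions, not the paper's actual prompt or code.

```python
import re
from typing import Callable

# Hypothetical prompt template; the paper's real prompt is not reproduced here.
PROMPT = (
    "Compare the ground truth and the prediction for the question below.\n"
    "Reply with a single correctness score between 0.0 and 1.0.\n"
    "Question: {question}\n"
    "Ground truth: {ground_truth}\n"
    "Prediction: {prediction}\n"
    "Score:"
)

def parse_score(reply: str) -> float:
    """Extract the first number in the judge's reply and clamp it to [0, 1]."""
    match = re.search(r"\d*\.?\d+", reply)
    if match is None:
        return 0.0  # assumption: an unparseable reply counts as incorrect
    return min(1.0, max(0.0, float(match.group())))

def score_sample(question: str, ground_truth: str, prediction: str,
                 judge: Callable[[str], str]) -> float:
    """Score one open-ended answer, whatever its question type or style."""
    prompt = PROMPT.format(
        question=question, ground_truth=ground_truth, prediction=prediction
    )
    return parse_score(judge(prompt))

def benchmark_score(samples, judge: Callable[[str], str]) -> float:
    """Unified metric in this sketch: the mean per-sample score."""
    scores = [
        score_sample(s["question"], s["ground_truth"], s["prediction"], judge)
        for s in samples
    ]
    return sum(scores) / len(scores)

def exact_match_judge(prompt: str) -> str:
    """Stand-in judge for demonstration only; a real setup would call an LLM."""
    truth = prompt.split("Ground truth: ")[1].split("\n")[0]
    pred = prompt.split("Prediction: ")[1].split("\n")[0]
    return "1.0" if truth.strip().lower() == pred.strip().lower() else "0.0"
```

Because the judge is passed in as a plain callable, the same scoring path can be exercised with a deterministic stub before any model is involved, which is also how the parsing and clamping rules can be checked.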

What carries the argument

MM-Vet benchmark organized around six core vision-language capabilities and the sixteen specific integrations obtained by combining them.
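
As a rough illustration of the size of the combination space, the sketch below enumerates every non-empty subset of six capabilities. The capability names are taken from the referee summary on this page. Which sixteen subsets the benchmark actually examines is not listed here, so the code only shows the candidate space that the sixteen are drawn from.

```python
from itertools import combinations
from math import comb

# Capability names as given in the referee summary on this page.
CAPABILITIES = [
    "recognition", "OCR", "knowledge",
    "language generation", "spatial awareness", "math",
]

def capability_subsets(capabilities, min_size=1):
    """Yield every subset of the capabilities with at least `min_size` members."""
    for size in range(min_size, len(capabilities) + 1):
        for subset in combinations(capabilities, size):
            yield frozenset(subset)

all_subsets = list(capability_subsets(CAPABILITIES))
by_size = {k: comb(len(CAPABILITIES), k) for k in range(1, len(CAPABILITIES) + 1)}

print(len(all_subsets))  # 63 non-empty subsets in total
print(by_size)           # {1: 6, 2: 15, 3: 20, 4: 15, 5: 6, 6: 1}
```

Sixteen integrations out of 63 candidate subsets means the selection itself is a design decision, which is the point the load-bearing premise turns on.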

If this is right

Provides a systematic structure for evaluating complicated multimodal tasks instead of ad-hoc question sets.
Delivers a unified scoring metric that works across question types and open-ended answer styles.
Yields comparative insights into the capabilities of different large multimodal model system paradigms and individual models.
Addresses the challenge of keeping evaluation benchmarks current as models advance rapidly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models could be trained or fine-tuned specifically on the sixteen integrations to improve performance on complex tasks.
The LLM-based evaluator could be reused or adapted for other multimodal benchmarks to maintain scoring consistency.
As new capabilities emerge in future models, the framework could be extended by adding further core capabilities and their combinations.

Load-bearing premise

The chosen six core vision-language capabilities and the sixteen derived integrations systematically cover the complicated multimodal tasks that current large multimodal models are expected to solve.

What would settle it

A model that scores highly on MM-Vet yet fails on a new collection of multimodal tasks whose capability combinations lie outside the sixteen integrations examined by the benchmark.
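
One way such a test could be set up is sketched below: tag each new task with the capabilities it needs, split the tasks by whether that combination is among the covered integrations, and compare mean scores on the two sides. The field names, the `covered` set, and the toy numbers are made up for illustration and do not come from the paper.

```python
from statistics import mean

def split_by_coverage(tasks, covered):
    """Partition tasks by whether their capability set is a covered integration."""
    inside = [t for t in tasks if frozenset(t["capabilities"]) in covered]
    outside = [t for t in tasks if frozenset(t["capabilities"]) not in covered]
    return inside, outside

def coverage_gap(tasks, covered):
    """Mean score on covered integrations minus mean score on uncovered ones."""
    inside, outside = split_by_coverage(tasks, covered)
    if not inside or not outside:
        return None  # the comparison needs tasks on both sides
    return mean(t["score"] for t in inside) - mean(t["score"] for t in outside)

# Toy, made-up inputs purely to exercise the functions.
covered = {frozenset({"recognition", "OCR"}), frozenset({"OCR", "math"})}
tasks = [
    {"capabilities": ["recognition", "OCR"], "score": 0.9},
    {"capabilities": ["OCR", "math"], "score": 0.7},
    {"capabilities": ["knowledge", "spatial awareness", "math"], "score": 0.2},
]
gap = coverage_gap(tasks, covered)
```

A large positive gap on real data would be the kind of evidence described above; a gap near zero would suggest the covered integrations generalize to the uncovered ones.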

Original abstract

We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks. Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes. Rapid model advancements pose challenges to evaluation benchmark development. Problems include: (1) How to systematically structure and evaluate the complicated multimodal tasks; (2) How to design evaluation metrics that work well across question and answer types; and (3) How to give model insights beyond a simple performance ranking. To this end, we present MM-Vet, designed based on the insight that the intriguing ability to solve complicated tasks is often achieved by a generalist model being able to integrate different core vision-language (VL) capabilities. MM-Vet defines 6 core VL capabilities and examines the 16 integrations of interest derived from the capability combination. For evaluation metrics, we propose an LLM-based evaluator for open-ended outputs. The evaluator enables the evaluation across different question types and answer styles, resulting in a unified scoring metric. We evaluate representative LMMs on MM-Vet, providing insights into the capabilities of different LMM system paradigms and models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MM-Vet, an evaluation benchmark for large multimodal models (LMMs) on complicated tasks. It is motivated by the observation that such tasks are solved via integration of core vision-language capabilities. The benchmark defines six core capabilities (recognition, OCR, knowledge, language generation, spatial awareness, math) and examines 16 derived integrations with example questions. It proposes an LLM-based evaluator to score open-ended outputs across question and answer types, yielding a unified metric, and reports evaluations of representative LMMs to compare system paradigms and models.

Significance. If the design choices hold, MM-Vet supplies a pragmatic, unified evaluation framework that moves beyond isolated VL tasks toward integrated capability assessment. The LLM-based scorer enables flexible, style-agnostic scoring, and the paradigm-level analysis offers actionable insights for LMM development. The benchmark is positioned as an evaluation tool rather than a completeness proof, which keeps its utility high even without exhaustive coverage claims.

major comments (2)

[§3.1] The six core capabilities and the specific 16 integrations are introduced as an organizing principle with example questions, but the manuscript provides no empirical validation, inter-annotator agreement, or coverage study showing that this set systematically captures the space of complicated multimodal tasks that LMMs are expected to solve.
[§4.2] The LLM-based evaluator is presented as enabling unified scoring across answer styles, yet no consistency metrics (e.g., agreement with human raters, self-consistency across prompt variations, or calibration on a held-out set) are reported; this directly affects the reliability of all quantitative results and the central claim of a robust unified metric.
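
The consistency metrics named in the second comment can be computed with very little machinery. The sketch below is one possible formulation, not a procedure from the paper: Pearson correlation and mean absolute gap for agreement with human raters, and the average per-sample standard deviation across repeated judge runs for self-consistency. All function names are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)

def mean_abs_gap(llm_scores, human_scores):
    """Average absolute difference between LLM and human scores per sample."""
    return mean(abs(a - b) for a, b in zip(llm_scores, human_scores))

def self_consistency(runs):
    """Average per-sample standard deviation across repeated judge runs.

    `runs[i][j]` is the score given to sample j in run i, where runs may
    differ by prompt variation or sampling seed. Lower means more stable.
    """
    return mean(pstdev(scores) for scores in zip(*runs))
```

Correlation alone can hide a constant offset between the judge and human raters, which is why the sketch pairs it with the absolute gap.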

minor comments (2)

[Table 2, Figure 3] The per-integration and per-model breakdowns would benefit from explicit error bars or variance estimates to clarify whether observed differences are statistically meaningful.
[§5] The discussion of LMM system paradigms could more explicitly link back to the 16 integrations so readers can see which capability combinations drive the reported differences.
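
Both minor comments can be addressed with the same small computation: group per-sample scores by capability combination, then attach a bootstrap interval to each group mean. The sketch below assumes per-sample scores in [0, 1] and a `capabilities` field on each sample. Both are illustrative conventions rather than the benchmark's actual data format, and the percentile bootstrap is one common choice among several.

```python
import random
from collections import defaultdict
from statistics import mean

def group_by_integration(samples):
    """Map each capability combination to the scores of samples that require it."""
    groups = defaultdict(list)
    for s in samples:
        groups[frozenset(s["capabilities"])].append(s["score"])
    return dict(groups)

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean score.

    Returns (mean, lower, upper). The seed is fixed so results are repeatable.
    """
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(scores), lower, upper

def report(samples):
    """Per-integration mean with a bootstrap interval, keyed by capability set."""
    return {
        combo: bootstrap_ci(scores)
        for combo, scores in group_by_integration(samples).items()
    }
```

Integrations backed by only a handful of samples will show wide intervals, which makes it visible when a difference between two models on that integration is not yet meaningful.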

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of MM-Vet and the recommendation for minor revision. We appreciate the constructive comments on the organizing principle and the evaluator's reliability. We address each major comment below and will update the manuscript accordingly.

Referee: [§3.1] The six core capabilities and the specific 16 integrations are introduced as an organizing principle with example questions, but the manuscript provides no empirical validation, inter-annotator agreement, or coverage study showing that this set systematically captures the space of complicated multimodal tasks that LMMs are expected to solve.

Authors: We thank the referee for highlighting this. The six core capabilities and 16 integrations were derived from a synthesis of capabilities observed across recent LMM literature (e.g., visual math, knowledge-based VQA, spatial reasoning) and the combinations needed for the complicated tasks highlighted in the introduction. The 16 integrations were selected to span pairwise and multi-capability combinations, with concrete example questions provided for each. We did not include a formal coverage study or inter-annotator agreement in the original manuscript. In the revision we will expand §3.1 to detail the literature-driven selection rationale, add references to supporting works, and state explicitly that the benchmark does not claim exhaustive coverage, along with the limitations that follow from this. This is consistent with the benchmark's framing as a practical evaluation tool rather than a complete taxonomy.

revision: partial

Referee: [§4.2] The LLM-based evaluator is presented as enabling unified scoring across answer styles, yet no consistency metrics (e.g., agreement with human raters, self-consistency across prompt variations, or calibration on a held-out set) are reported; this directly affects the reliability of all quantitative results and the central claim of a robust unified metric.

Authors: We agree that quantitative consistency metrics are necessary to substantiate the reliability of the LLM-based evaluator. The evaluator was developed with prompt engineering intended to produce scores aligned with human judgment across answer styles, but the original manuscript omitted explicit validation numbers. We will revise §4.2 to report (1) agreement rates between the LLM evaluator and human raters on a held-out sample of model outputs and (2) self-consistency results across prompt variations. These additions will be included in the revised version to strengthen the unified metric claim.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; benchmark is a definitional design choice

The paper presents MM-Vet as an evaluation benchmark whose structure is explicitly chosen by the authors around an organizing insight (integration of 6 core VL capabilities into 16 combinations). This is a pragmatic taxonomy for evaluation rather than a derived claim, prediction, or first-principles result. No equations, fitted parameters, self-citations as load-bearing premises, or reductions of outputs to inputs appear in the provided text. The 6 capabilities and 16 integrations are stated as author-defined categories with example questions; the LLM-based scorer is a proposed metric, not a self-referential fit. The work is self-contained as an evaluation tool without circular derivation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that complicated tasks reduce to integrations of a small set of core capabilities; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: Complicated multimodal tasks are solved by a generalist model that integrates different core vision-language capabilities.
This is explicitly stated as the design insight in the abstract.

Forward citations

Cited by 42 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models
cs.CV 2026-05 unverdicted novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.
TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
cs.CV 2026-05 unverdicted novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
cs.CV 2026-04 unverdicted novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs
cs.CV 2026-05 conditional novelty 7.0

Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
MirrorBench: Evaluating Self-centric Intelligence in MLLMs by Introducing a Mirror
cs.AI 2026-04 unverdicted novelty 7.0

MirrorBench reveals that leading MLLMs perform far below humans on tasks requiring self-referential perception and representation, even at the simplest level.
SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models
cs.AI 2026-05 unverdicted novelty 6.0

SafeSteer improves safety in multimodal large language models by up to 33.4% via a decoding probe and modal alignment vector without any fine-tuning.
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
cs.LG 2026-05 unverdicted novelty 6.0

Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.
LithoBench: Benchmarking Large Multimodal Models for Remote-Sensing Lithology Interpretation
cs.CV 2026-05 conditional novelty 6.0

LithoBench is a new multi-level benchmark showing that existing large multimodal models have substantial limitations in geological semantic understanding for remote sensing lithology interpretation.
Large Vision-Language Models Get Lost in Attention
cs.AI 2026-05 unverdicted novelty 6.0

In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.
Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits
cs.AI 2026-05 unverdicted novelty 6.0

Attention sharpness barely predicts VLM correctness while hidden-state probes and self-consistency strongly do, with late-fusion models showing fragile reliability bottlenecks unlike early-fusion ones.
Online Self-Calibration Against Hallucination in Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

OSCAR exploits the generative-discriminative gap in LVLMs to build online preference data with MCTS and dual-granularity rewards for DPO-based calibration, claiming SOTA hallucination reduction and improved multimodal...
MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
cs.LG 2026-04 unverdicted novelty 6.0

MACS improves MoE MLLM inference efficiency via entropy-weighted token loads and dynamic modality-adaptive expert capacity allocation.
MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
cs.LG 2026-04 unverdicted novelty 6.0

MACS improves inference speed in multimodal MoE models by entropy-weighted balancing of visual tokens and real-time modality-adaptive expert capacity allocation.
PivotMerge: Bridging Heterogeneous Multimodal Pre-training via Post-Alignment Model Merging
cs.CV 2026-04 unverdicted novelty 6.0

PivotMerge merges heterogeneous multimodal pre-trained models via shared-space decomposition to filter conflicts and layer-wise weights based on alignment contributions, outperforming baselines on multimodal benchmarks.
HTDC: Hesitation-Triggered Differential Calibration for Mitigating Hallucination in Large Vision-Language Models
cs.CV 2026-04 unverdicted novelty 6.0

HTDC mitigates hallucinations in LVLMs by triggering calibration only at hesitation-prone decoding steps via contrasts with visual-nullification and semantic-nullification probes.
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
cs.CV 2026-04 unverdicted novelty 6.0

POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
See Fair, Speak Truth: Equitable Attention Improves Grounding and Reduces Hallucination in Vision-Language Alignment
cs.CV 2026-04 conditional novelty 6.0

Equitable attention via Dominant Object Penalty and Outlier Boost Coefficient reduces object hallucinations in multimodal LLMs without retraining.
Precise Shield: Explaining and Aligning VLLM Safety via Neuron-Level Guidance
cs.CV 2026-04 unverdicted novelty 6.0

Precise Shield identifies safety neurons in VLLMs via activation contrasts and aligns only them with gradient masking, boosting safety, preserving generalization, and enabling zero-shot cross-lingual and cross-modal transfer.
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
cs.LG 2026-04 unverdicted novelty 6.0

DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...
CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models
cs.CV 2026-04 unverdicted novelty 6.0

CLEAR uses degradation-aware fine-tuning, a latent representation bridge, and interleaved reinforcement learning to connect generative and reasoning capabilities in multimodal models for better degraded image understanding.
DeepSeek-OCR: Contexts Optical Compression
cs.CV 2025-10 unverdicted novelty 6.0

DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
cs.RO 2025-10 unverdicted novelty 6.0

InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
cs.CV 2025-08 unverdicted novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
cs.CV 2025-04 conditional novelty 6.0

InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
cs.CV 2024-12 unverdicted novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
Emu3: Next-Token Prediction is All You Need
cs.CV 2024-09 unverdicted novelty 6.0

Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
Are We on the Right Way for Evaluating Large Vision-Language Models?
cs.CV 2024-03 conditional novelty 6.0

Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
cs.CV 2023-11 conditional novelty 6.0

A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
cs.CV 2023-11 unverdicted novelty 6.0

Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
Text-Guided Multi-Scale Frequency Representation Adaptation
cs.CV 2026-05 unverdicted novelty 5.0

FreqAdapter adapts multimodal models by text-guided multi-scale fine-tuning in the frequency domain, claiming better performance and efficiency than signal-space PEFT methods.
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
cs.CV 2026-04 unverdicted novelty 5.0

Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.
CoGR-MoE: Concept-Guided Expert Routing with Consistent Selection and Flexible Reasoning for Visual Question Answering
cs.CV 2026-04 unverdicted novelty 5.0

CoGR-MoE improves VQA by using concept-guided expert routing with option feature reweighting and contrastive learning to achieve consistent yet flexible reasoning across answer options.
Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
cs.AI 2026-04 unverdicted novelty 5.0

Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
cs.CV 2025-06 unverdicted novelty 5.0

UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
cs.CV 2025-05 conditional novelty 5.0

BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.
LLaVA-OneVision: Easy Visual Task Transfer
cs.CV 2024-08 unverdicted novelty 5.0

LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning
cs.CV 2026-05 unverdicted novelty 4.0

UnAC improves LMM performance on visual reasoning benchmarks by combining adaptive visual prompting, image abstraction, and gradual self-checking.
TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training
cs.AI 2026-04 unverdicted novelty 4.0

TorchUMM is the first unified codebase and benchmark suite for standardized evaluation of diverse unified multimodal models on understanding, generation, and editing tasks.
Seed1.5-VL Technical Report
cs.CV 2025-05 unverdicted novelty 4.0

Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
cs.CV 2024-04 unverdicted novelty 4.0

InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
Improved Baselines with Visual Instruction Tuning
cs.CV 2023-10 conditional novelty 4.0

Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
cs.AI 2025-01 conditional novelty 3.0

Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.
