ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
Pith reviewed 2026-05-13 17:03 UTC · model grok-4.3
The pith
Substituting standard captions with detailed GPT-4V ones boosts LMM performance on MME and MMBench.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The ShareGPT4V dataset, built from 100K high-quality captions produced by GPT-4V and expanded to 1.2M by a caption model trained on that seed set, improves LMMs when it replaces an equivalent quantity of detailed captions in existing SFT datasets: LLaVA-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B gain 222.8/22.0/22.3 points on MME and 2.7/1.3/1.5 points on MMBench. Incorporating the data into both pre-training and SFT yields ShareGPT4V-7B, a simple-architecture model that performs well across the majority of multi-modal benchmarks.
What carries the argument
The ShareGPT4V dataset of 1.2 million highly descriptive captions, generated first from 100K GPT-4V examples and then scaled by a trained caption model.
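The two-stage recipe described in the abstract (a curated 100K GPT-4V seed, then a caption model trained on that seed and run over a much larger image pool) can be summarized in a few lines. The sketch below is illustrative only; the function names (caption_with_gpt4v, train_caption_model) and data structures are placeholders, not the authors' released pipeline.

```python
# Minimal sketch of the two-stage caption-scaling recipe described in the paper.
# All function names and data layouts here are illustrative placeholders, not
# the authors' code.
from dataclasses import dataclass

@dataclass
class CaptionedImage:
    image_id: str
    caption: str

def build_sharegpt4v_style_corpus(seed_images, pool_images,
                                  caption_with_gpt4v, train_caption_model):
    """Stage 1: collect ~100K detailed captions from GPT-4V on a curated seed set.
    Stage 2: train a dedicated caption model on that seed and use it to caption a
    much larger image pool (~1.2M images in the paper)."""
    # Stage 1: expensive, high-quality seed annotation.
    seed_corpus = [CaptionedImage(img.image_id, caption_with_gpt4v(img))
                   for img in seed_images]

    # Stage 2: distil the seed into a cheaper caption model, then scale out.
    caption_model = train_caption_model(seed_corpus)
    expanded_corpus = [CaptionedImage(img.image_id, caption_model(img))
                       for img in pool_images]

    return seed_corpus + expanded_corpus
```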
If this is right
- LMMs obtain measurable gains on MME and MMBench when ShareGPT4V captions replace an equal volume of prior detailed captions in SFT.
- A simple LMM architecture reaches competitive results on most multi-modal benchmarks after using ShareGPT4V data in both pre-training and SFT.
- High-quality captions can be scaled from a 100K GPT-4V seed set to 1.2M examples via a trained caption model.
- The substitution approach improves existing models without requiring architecture changes.
Where Pith is reading between the lines
- If caption quality drives the gains, applying the same GPT-4V generation process to other image collections could produce further improvements.
- The dataset may support alignment methods beyond standard SFT, such as preference tuning or retrieval-augmented training.
- Future caption models could be iteratively refined on ShareGPT4V outputs to create even higher-quality training data.
Load-bearing premise
That the performance gains come from the superior quality of the GPT-4V captions rather than from uncontrolled differences in data selection or training procedure.
What would settle it
Retraining the same LMMs with the identical procedure but substituting captions of matched length and topic drawn from the original datasets instead of ShareGPT4V data, and finding no benchmark improvement.
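One concrete reading of that control: rerun SFT on the same images twice, once with ShareGPT4V captions and once with length-matched captions drawn from the original source datasets, changing nothing else. The sketch below illustrates that experimental design; the matching heuristic (whitespace token count within a fixed tolerance), the data layout, and all names are assumptions for illustration, and topic matching is omitted for brevity.

```python
# Sketch of the proposed control: substitute captions for the same images,
# either with ShareGPT4V text or with length-matched captions from the original
# datasets, keeping every other training choice identical. Illustrative only;
# not the authors' pipeline.
import random

def substitute_captions(sft_examples, replacement_lookup):
    """Return a copy of the SFT set where each example whose image_id has a
    replacement caption gets that caption; images, ordering, and all other
    fields are left untouched."""
    out = []
    for ex in sft_examples:
        new_ex = dict(ex)
        if ex["image_id"] in replacement_lookup:
            new_ex["caption"] = replacement_lookup[ex["image_id"]]
        out.append(new_ex)
    return out

def length_matched_control(sharegpt4v_captions, original_caption_pool, tolerance=5):
    """For each ShareGPT4V caption, pick an original-dataset caption of similar
    token length, so the control run changes caption provenance but not volume."""
    control = {}
    for image_id, cap in sharegpt4v_captions.items():
        target_len = len(cap.split())
        candidates = [c for c in original_caption_pool
                      if abs(len(c.split()) - target_len) <= tolerance]
        control[image_id] = random.choice(candidates) if candidates else cap
    return control
```

If the benchmark deltas vanish under the length-matched control but persist with ShareGPT4V captions, the caption-quality explanation is strongly supported.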
Original abstract
In the realm of large multi-modal models (LMMs), efficient modality alignment is crucial yet often constrained by the scarcity of high-quality image-text data. To address this bottleneck, we introduce the ShareGPT4V dataset, a pioneering large-scale resource featuring 1.2 million highly descriptive captions, which surpasses existing datasets in diversity and information content, covering world knowledge, object properties, spatial relationships, and aesthetic evaluations. Specifically, ShareGPT4V originates from a curated 100K high-quality captions collected from advanced GPT4-Vision and has been expanded to 1.2M with a superb caption model trained on this subset. ShareGPT4V first demonstrates its effectiveness for the Supervised Fine-Tuning (SFT) phase, by substituting an equivalent quantity of detailed captions in existing SFT datasets with a subset of our high-quality captions, significantly enhancing the LMMs like LLaVA-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B on the MME and MMBench benchmarks, with respective gains of 222.8/22.0/22.3 and 2.7/1.3/1.5. We further incorporate ShareGPT4V data into both the pre-training and SFT phases, obtaining ShareGPT4V-7B, a superior LMM based on a simple architecture that has remarkable performance across a majority of the multi-modal benchmarks. This project is available at https://ShareGPT4V.github.io to serve as a pivotal resource for advancing the LMMs community.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the ShareGPT4V dataset of 1.2 million high-quality image captions generated from a 100K GPT-4V seed set and expanded via a trained caption model. It claims that substituting an equivalent quantity of these captions into existing SFT datasets yields substantial gains for LLaVA-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B on MME (gains of 222.8/22.0/22.3) and MMBench (gains of 2.7/1.3/1.5), and further reports that incorporating ShareGPT4V data into both pre-training and SFT produces a competitive ShareGPT4V-7B model.
Significance. If the benchmark lifts are causally attributable to caption quality rather than uncontrolled variables, the dataset would constitute a useful public resource for improving modality alignment in LMMs. The empirical results on standard benchmarks are potentially impactful for the community, but the absence of explicit controls limits the strength of the causal claim.
Major comments (2)
- §4 (SFT substitution experiment): the reported gains (e.g., +222.8 on MME for LLaVA-7B) are presented as resulting from caption quality, yet the text provides no explicit statement that the replacement was performed on identical images while holding image sources, data ordering, optimizer schedule, and all other training hyperparameters fixed; without these controls the deltas cannot be isolated to caption quality.
- Experimental setup and §4: no ablation tables, statistical significance tests, or configuration logs are supplied to rule out confounding factors such as differences in image difficulty distribution or overlap with evaluation sets; this omission is load-bearing for the central claim that caption quality drives the observed improvements.
Minor comments (2)
- Abstract: the phrase 'equivalent quantity' is undefined (number of samples? total tokens?); this should be pinned down with a precise metric (see the sketch after this list).
- The expansion from 100K GPT-4V captions to 1.2M via the caption model lacks details on the training procedure, data splits, and validation of the caption model itself.
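For the first minor comment, the two most natural readings of "equivalent quantity" can be made explicit: equal sample count versus equal total caption tokens. The helper below is a minimal illustration of how either metric would be reported, not code from the paper.

```python
# Two readings of "equivalent quantity" the comment above asks to be pinned
# down: equal sample count vs. equal total caption tokens. Illustrative only.

def replacement_budget(original_captions, new_captions):
    """Report both candidate metrics for a caption swap."""
    return {
        "samples_replaced": len(new_captions),
        "original_tokens": sum(len(c.split()) for c in original_captions),
        "replacement_tokens": sum(len(c.split()) for c in new_captions),
    }

# Example: a swap can be sample-matched yet add far more tokens per caption,
# which matters if longer captions alone drive part of the gain.
print(replacement_budget(["a dog on grass"],
                         ["a golden retriever lying on freshly cut grass beside a red ball"]))
```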
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments point by point below, providing clarifications on the experimental controls and acknowledging areas where additional discussion will be added in revision.
Point-by-point responses
-
Referee: §4 (SFT substitution experiment): the reported gains (e.g., +222.8 on MME for LLaVA-7B) are presented as resulting from caption quality, yet the text provides no explicit statement that the replacement was performed on identical images while holding image sources, data ordering, optimizer schedule, and all other training hyperparameters fixed; without these controls the deltas cannot be isolated to caption quality.
Authors: We agree an explicit statement was missing. The SFT substitution was performed by replacing captions for the exact same images drawn from the original datasets (LAION, etc.), while preserving identical data ordering, optimizer schedule, batch sizes, learning rates, and all other hyperparameters. This isolates the effect to caption quality. We will add a clear paragraph in §4 stating these controls were held fixed. revision: yes
-
Referee: Experimental setup and §4: no ablation tables, statistical significance tests, or configuration logs are supplied to rule out confounding factors such as differences in image difficulty distribution or overlap with evaluation sets; this omission is load-bearing for the central claim that caption quality drives the observed improvements.
Authors: Because the substitution uses precisely the same images as the baseline runs, image difficulty distribution and any evaluation-set overlap are identical by construction and cannot explain the deltas. We did not include extra ablation tables or statistical significance tests in the original submission owing to space constraints and compute limits. The gains are consistent across three distinct model families, which provides supporting evidence. In revision we will add a dedicated paragraph in §4 discussing these points and noting that full configuration logs are available in the released code repository. revision: partial
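The "identical by construction" argument amounts to an invariant that can be checked mechanically: the baseline and substituted SFT sets contain the same image IDs in the same order and differ only in caption text. Below is a minimal sketch of such a check, assuming a simple list-of-dicts data layout rather than the authors' actual tooling.

```python
# Sanity check for the rebuttal's "identical by construction" claim: the two
# SFT sets must share image identity and ordering, differing only in captions.
# A sketch under an assumed data layout, not the authors' code.

def check_substitution_invariant(baseline_sft, substituted_sft):
    assert len(baseline_sft) == len(substituted_sft), "sample counts differ"
    changed = 0
    for base_ex, sub_ex in zip(baseline_sft, substituted_sft):
        assert base_ex["image_id"] == sub_ex["image_id"], "image order or identity differs"
        if base_ex["caption"] != sub_ex["caption"]:
            changed += 1
    return changed  # number of captions actually swapped

# If this invariant holds, image difficulty and evaluation-set overlap are the
# same in both runs, and only the caption text can explain the benchmark deltas.
```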
Circularity Check
No circularity: purely empirical dataset construction and benchmark results
Full rationale
The paper's core contribution is the creation of the ShareGPT4V dataset (1.2M captions bootstrapped from 100K GPT-4V examples), followed by empirical SFT substitution experiments that report benchmark deltas on MME and MMBench. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The reported gains are direct experimental outcomes rather than quantities that reduce by construction to inputs defined inside the paper, so the empirical case is self-contained and carries no load-bearing circular steps.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: GPT-4V produces higher-quality image captions than those found in existing datasets.
Forward citations
Cited by 26 Pith papers
-
Allegory of the Cave: Measurement-Grounded Vision-Language Learning
PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.
-
Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment
Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.
-
MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining
MixAtlas uses CLIP-based decomposition and Gaussian process optimization on small proxies to discover data mixtures that improve multimodal benchmark performance by up to 17.6% and transfer to larger models with faste...
-
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.
-
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.
-
BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning
BalCapRL applies balanced multi-objective RL with GDPO-style normalization and length-conditional masking to improve MLLM image captioning, reporting gains of up to +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena on...
-
WaferSAGE: Large Language Model-Powered Wafer Defect Analysis via Synthetic Data Generation and Rubric-Guided Reinforcement Learning
A 4B-parameter vision-language model trained on rubric-guided synthetic wafer defect data reaches 6.493 LLM-Judge score, nearly matching Gemini-3-Flash at 7.149 for on-premise industrial use.
-
State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading
MLLMs ignore dial state geometry and cluster by appearance, causing inconsistency under variations; TriSCA's state-distance alignment, metadata supervision, and objective alignment improve robustness on clock and gaug...
-
Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models
CoM-PT trains vision foundation models in ascending size order using inverse knowledge transfer, allowing larger models to achieve superior performance with significantly reduced overall computational cost compared to...
-
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
-
SmolVLM: Redefining small and efficient multimodal models
SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
Emu3: Next-Token Prediction is All You Need
Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
-
Are We on the Right Way for Evaluating Large Vision-Language Models?
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
-
MMBench: Is Your Multi-modal Model an All-around Player?
MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.
-
WaferSAGE: Large Language Model-Powered Wafer Defect Analysis via Synthetic Data Generation and Rubric-Guided Reinforcement Learning
A 4B Qwen3-VL model trained via rubric-guided synthetic data and Group Sequence Policy Optimization reaches an LLM-Judge score of 6.493 on wafer defect VQA, nearly matching Gemini-3-Flash while supporting full on-prem...
-
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.
-
LLaVA-OneVision: Easy Visual Task Transfer
LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
-
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
-
WaferSAGE: Large Language Model-Powered Wafer Defect Analysis via Synthetic Data Generation and Rubric-Guided Reinforcement Learning
A 4B-parameter Qwen3-VL model trained via synthetic VQA data and rubric-guided GSPO reinforcement learning reaches 6.493 LLM-Judge score on wafer defect analysis, approaching Gemini-3-Flash while supporting full on-pr...
-
Show-o2: Improved Native Unified Multimodal Models
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
-
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
-
PaliGemma: A versatile 3B VLM for transfer
PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.
-
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
-
DeepSeek-VL: Towards Real-World Vision-Language Understanding
DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder,...
-
Yi: Open Foundation Models by 01.AI
Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
Reference graph
Works this paper leans on
- [1]
- [2] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
- [3] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
- [4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- [5] Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023.
- [6] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal LLM's referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
- [7] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
- [8] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
- [9] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
- [10] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning.
- [11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- [12] Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. GLM: General language model pretraining with autoregressive blank infilling. arXiv preprint arXiv:2103.10360, 2021.
- [13] Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yonglong Tian. Improving CLIP training with language rewrites. arXiv preprint arXiv:2305.20088, 2023.
- [14] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369, 2023.
- [15] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
- [16] Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. DataComp: In search of the next generation of multimodal datasets. arXiv preprint arXiv:2304.14108, 2023.
- [17] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913, 2017.
- [18] Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. VizWiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3608–3617.
- [19] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR.
- [20] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798, 2014.
- [21] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
- [22] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123:32–73, 2017.
- [23] Zhengfeng Lai, Haotian Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, Zhe Gan, Jiulong Shan, Chen-Nee Chuah, Yinfei Yang, et al. From scarcity to efficiency: Improving CLIP training via visual-enriched captions. arXiv preprint arXiv:2310.07699, 2023.
- [24] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. SEED-Bench: Benchmarking multimodal LLMs with generative comprehension. arXiv preprint arXiv:2307.16125, 2023.
- [25] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023.
- [26] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
- [27] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
- [28] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022.
- [29] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, pages 740–755. Springer, 2014.
- [30] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
- [31] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485.
- [32] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
- [33] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023.
- [34] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521.
- [35] Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, and Rongrong Ji. Cheap and quick: Efficient vision-language instruction tuning for large language models. arXiv preprint arXiv:2305.15023, 2023.
- [36] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3195–3204, 2019.
- [37] Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. OCR-VQA: Visual question answering by reading text in images. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 947–.
- [38] Thao Nguyen, Samir Yitzhak Gadre, Gabriel Ilharco, Sewoong Oh, and Ludwig Schmidt. Improving multimodal datasets with image captioning. arXiv preprint arXiv:2307.10350, 2023.
- [39] OpenAI. ChatGPT. https://chat.openai.com/.
- [40]
- [41] Vicente Ordonez, Girish Kulkarni, and Tamara Berg. Im2Text: Describing images using 1 million captioned photographs. Advances in Neural Information Processing Systems, 24, 2011.
- [42] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- [43] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
- [44] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
- [45] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [46] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
- [47] Babak Saleh and Ahmed Elgammal. Large-scale classification of fine-art paintings: Learning the right metric on the right feature. arXiv preprint arXiv:1505.00855, 2015.
- [48] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
- [49] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-OKVQA: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–162. Springer, 2022.
- [50] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
- [51] Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. TextCaps: A dataset for image captioning with reading comprehension. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II, pages 742–758. Springer.
- [52] InternLM Team. InternLM: A multilingual language model with progressively enhanced capabilities, 2023.
- [53] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- [54] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- [55] Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Chunyi Li, Wenxiu Sun, Qiong Yan, Guangtao Zhai, et al. Q-Bench: A benchmark for general-purpose foundation models on low-level vision. arXiv preprint arXiv:2309.14181, 2023.
- [56] Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, Fan Yang, et al. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305, 2023.
- [57] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
- [58] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. MM-Vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
- [59] Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, and Jianfeng Gao. GLIPv2: Unifying localization and vision-language understanding. Advances in Neural Information Processing Systems, 35:36067–36080, 2022.
- [60] Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Shuangrui Ding, Songyang Zhang, Haodong Duan, Hang Yan, et al. InternLM-XComposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112, 2023.
- [61] Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023.
- [62] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.