ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
Pith reviewed 2026-05-13 17:03 UTC · model grok-4.3
The pith
Substituting standard captions with detailed GPT-4V ones boosts LMM performance on MME and MMBench.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The ShareGPT4V dataset, built from 100K high-quality captions produced by GPT-4V and expanded to 1.2M by a caption model trained on that seed set, improves LMMs when it replaces an equivalent quantity of detailed captions in existing SFT datasets: LLaVA-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B gain 222.8/22.0/22.3 points on MME and 2.7/1.3/1.5 points on MMBench. Incorporating the data into both pre-training and SFT yields ShareGPT4V-7B, a simple-architecture model that performs well across the majority of multi-modal benchmarks.
What carries the argument
The ShareGPT4V dataset of 1.2 million highly descriptive captions, generated first from 100K GPT-4V examples and then scaled by a trained caption model.
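The two-stage recipe described in the abstract (a curated 100K GPT-4V seed, then a caption model trained on that seed and run over a much larger image pool) can be summarized in a few lines. The sketch below is illustrative only; the function names (caption_with_gpt4v, train_caption_model) and data structures are placeholders, not the authors' released pipeline.

```python
# Minimal sketch of the two-stage caption-scaling recipe described in the paper.
# All function names and data layouts here are illustrative placeholders, not
# the authors' code.
from dataclasses import dataclass

@dataclass
class CaptionedImage:
    image_id: str
    caption: str

def build_sharegpt4v_style_corpus(seed_images, pool_images,
                                  caption_with_gpt4v, train_caption_model):
    """Stage 1: collect ~100K detailed captions from GPT-4V on a curated seed set.
    Stage 2: train a dedicated caption model on that seed and use it to caption a
    much larger image pool (~1.2M images in the paper)."""
    # Stage 1: expensive, high-quality seed annotation.
    seed_corpus = [CaptionedImage(img.image_id, caption_with_gpt4v(img))
                   for img in seed_images]

    # Stage 2: distil the seed into a cheaper caption model, then scale out.
    caption_model = train_caption_model(seed_corpus)
    expanded_corpus = [CaptionedImage(img.image_id, caption_model(img))
                       for img in pool_images]

    return seed_corpus + expanded_corpus
```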
If this is right
- LMMs obtain measurable gains on MME and MMBench when ShareGPT4V captions replace an equal volume of prior detailed captions in SFT.
- A simple LMM architecture reaches competitive results on most multi-modal benchmarks after using ShareGPT4V data in both pre-training and SFT.
- High-quality captions can be scaled from a 100K GPT-4V seed set to 1.2M examples via a trained caption model.
- The substitution approach improves existing models without requiring architecture changes.
Where Pith is reading between the lines
- If caption quality drives the gains, applying the same GPT-4V generation process to other image collections could produce further improvements.
- The dataset may support alignment methods beyond standard SFT, such as preference tuning or retrieval-augmented training.
- Future caption models could be iteratively refined on ShareGPT4V outputs to create even higher-quality training data.
Load-bearing premise
That the performance gains come from the superior quality of the GPT-4V captions rather than from uncontrolled differences in data selection or training procedure.
What would settle it
Retraining the same LMMs with the identical procedure but substituting captions of matched length and topic drawn from the original datasets instead of ShareGPT4V data, and finding no benchmark improvement.
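One concrete reading of that control: rerun SFT on the same images twice, once with ShareGPT4V captions and once with length-matched captions drawn from the original source datasets, changing nothing else. The sketch below illustrates that experimental design; the matching heuristic (whitespace token count within a fixed tolerance), the data layout, and all names are assumptions for illustration, and topic matching is omitted for brevity.

```python
# Sketch of the proposed control: substitute captions for the same images,
# either with ShareGPT4V text or with length-matched captions from the original
# datasets, keeping every other training choice identical. Illustrative only;
# not the authors' pipeline.
import random

def substitute_captions(sft_examples, replacement_lookup):
    """Return a copy of the SFT set where each example whose image_id has a
    replacement caption gets that caption; images, ordering, and all other
    fields are left untouched."""
    out = []
    for ex in sft_examples:
        new_ex = dict(ex)
        if ex["image_id"] in replacement_lookup:
            new_ex["caption"] = replacement_lookup[ex["image_id"]]
        out.append(new_ex)
    return out

def length_matched_control(sharegpt4v_captions, original_caption_pool, tolerance=5):
    """For each ShareGPT4V caption, pick an original-dataset caption of similar
    token length, so the control run changes caption provenance but not volume."""
    control = {}
    for image_id, cap in sharegpt4v_captions.items():
        target_len = len(cap.split())
        candidates = [c for c in original_caption_pool
                      if abs(len(c.split()) - target_len) <= tolerance]
        control[image_id] = random.choice(candidates) if candidates else cap
    return control
```

If the benchmark deltas vanish under the length-matched control but persist with ShareGPT4V captions, the caption-quality explanation is strongly supported.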
Original abstract
In the realm of large multi-modal models (LMMs), efficient modality alignment is crucial yet often constrained by the scarcity of high-quality image-text data. To address this bottleneck, we introduce the ShareGPT4V dataset, a pioneering large-scale resource featuring 1.2 million highly descriptive captions, which surpasses existing datasets in diversity and information content, covering world knowledge, object properties, spatial relationships, and aesthetic evaluations. Specifically, ShareGPT4V originates from a curated 100K high-quality captions collected from advanced GPT4-Vision and has been expanded to 1.2M with a superb caption model trained on this subset. ShareGPT4V first demonstrates its effectiveness for the Supervised Fine-Tuning (SFT) phase, by substituting an equivalent quantity of detailed captions in existing SFT datasets with a subset of our high-quality captions, significantly enhancing the LMMs like LLaVA-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B on the MME and MMBench benchmarks, with respective gains of 222.8/22.0/22.3 and 2.7/1.3/1.5. We further incorporate ShareGPT4V data into both the pre-training and SFT phases, obtaining ShareGPT4V-7B, a superior LMM based on a simple architecture that has remarkable performance across a majority of the multi-modal benchmarks. This project is available at https://ShareGPT4V.github.io to serve as a pivotal resource for advancing the LMMs community.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the ShareGPT4V dataset of 1.2 million high-quality image captions generated from a 100K GPT-4V seed set and expanded via a trained caption model. It claims that substituting an equivalent quantity of these captions into existing SFT datasets yields substantial gains for LLaVA-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B on MME (gains of 222.8/22.0/22.3) and MMBench (gains of 2.7/1.3/1.5), and further reports that incorporating ShareGPT4V data into both pre-training and SFT produces a competitive ShareGPT4V-7B model.
Significance. If the benchmark lifts are causally attributable to caption quality rather than uncontrolled variables, the dataset would constitute a useful public resource for improving modality alignment in LMMs. The empirical results on standard benchmarks are potentially impactful for the community, but the absence of explicit controls limits the strength of the causal claim.
Major comments (2)
- §4 (SFT substitution experiment): the reported gains (e.g., +222.8 on MME for LLaVA-7B) are presented as resulting from caption quality, yet the text provides no explicit statement that the replacement was performed on identical images while holding image sources, data ordering, optimizer schedule, and all other training hyperparameters fixed; without these controls the deltas cannot be isolated to caption quality.
- Experimental setup and §4: no ablation tables, statistical significance tests, or configuration logs are supplied to rule out confounding factors such as differences in image difficulty distribution or overlap with evaluation sets; this omission is load-bearing for the central claim that caption quality drives the observed improvements.
Minor comments (2)
- Abstract: the phrase 'equivalent quantity' is undefined (number of samples? total tokens?); this should be pinned down with a precise metric (see the sketch after this list).
- The expansion from 100K GPT-4V captions to 1.2M via the caption model lacks details on the training procedure, data splits, and validation of the caption model itself.
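For the first minor comment, the two most natural readings of "equivalent quantity" can be made explicit: equal sample count versus equal total caption tokens. The helper below is a minimal illustration of how either metric would be reported, not code from the paper.

```python
# Two readings of "equivalent quantity" the comment above asks to be pinned
# down: equal sample count vs. equal total caption tokens. Illustrative only.

def replacement_budget(original_captions, new_captions):
    """Report both candidate metrics for a caption swap."""
    return {
        "samples_replaced": len(new_captions),
        "original_tokens": sum(len(c.split()) for c in original_captions),
        "replacement_tokens": sum(len(c.split()) for c in new_captions),
    }

# Example: a swap can be sample-matched yet add far more tokens per caption,
# which matters if longer captions alone drive part of the gain.
print(replacement_budget(["a dog on grass"],
                         ["a golden retriever lying on freshly cut grass beside a red ball"]))
```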
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments point by point below, providing clarifications on the experimental controls and acknowledging areas where additional discussion will be added in revision.
Point-by-point responses
-
Referee: §4 (SFT substitution experiment): the reported gains (e.g., +222.8 on MME for LLaVA-7B) are presented as resulting from caption quality, yet the text provides no explicit statement that the replacement was performed on identical images while holding image sources, data ordering, optimizer schedule, and all other training hyperparameters fixed; without these controls the deltas cannot be isolated to caption quality.
Authors: We agree an explicit statement was missing. The SFT substitution was performed by replacing captions for the exact same images drawn from the original datasets (LAION, etc.), while preserving identical data ordering, optimizer schedule, batch sizes, learning rates, and all other hyperparameters. This isolates the effect to caption quality. We will add a clear paragraph in §4 stating these controls were held fixed. revision: yes
-
Referee: Experimental setup and §4: no ablation tables, statistical significance tests, or configuration logs are supplied to rule out confounding factors such as differences in image difficulty distribution or overlap with evaluation sets; this omission is load-bearing for the central claim that caption quality drives the observed improvements.
Authors: Because the substitution uses precisely the same images as the baseline runs, image difficulty distribution and any evaluation-set overlap are identical by construction and cannot explain the deltas. We did not include extra ablation tables or statistical significance tests in the original submission owing to space constraints and compute limits. The gains are consistent across three distinct model families, which provides supporting evidence. In revision we will add a dedicated paragraph in §4 discussing these points and noting that full configuration logs are available in the released code repository. revision: partial
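The "identical by construction" argument amounts to an invariant that can be checked mechanically: the baseline and substituted SFT sets contain the same image IDs in the same order and differ only in caption text. Below is a minimal sketch of such a check, assuming a simple list-of-dicts data layout rather than the authors' actual tooling.

```python
# Sanity check for the rebuttal's "identical by construction" claim: the two
# SFT sets must share image identity and ordering, differing only in captions.
# A sketch under an assumed data layout, not the authors' code.

def check_substitution_invariant(baseline_sft, substituted_sft):
    assert len(baseline_sft) == len(substituted_sft), "sample counts differ"
    changed = 0
    for base_ex, sub_ex in zip(baseline_sft, substituted_sft):
        assert base_ex["image_id"] == sub_ex["image_id"], "image order or identity differs"
        if base_ex["caption"] != sub_ex["caption"]:
            changed += 1
    return changed  # number of captions actually swapped

# If this invariant holds, image difficulty and evaluation-set overlap are the
# same in both runs, and only the caption text can explain the benchmark deltas.
```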
Circularity Check
No circularity: purely empirical dataset construction and benchmark results
Full rationale
The paper's core contribution is the creation of the ShareGPT4V dataset (1.2M captions bootstrapped from 100K GPT-4V examples), followed by empirical SFT substitution experiments that report benchmark deltas on MME and MMBench. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The reported gains are direct experimental outcomes rather than quantities that reduce by construction to inputs defined inside the paper, so the empirical case is self-contained and carries no load-bearing circular steps.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: GPT-4V produces higher-quality image captions than those found in existing datasets.
Forward citations
Cited by 26 Pith papers
-
Allegory of the Cave: Measurement-Grounded Vision-Language Learning
PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.
-
Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment
Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.
-
MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining
MixAtlas uses CLIP-based decomposition and Gaussian process optimization on small proxies to discover data mixtures that improve multimodal benchmark performance by up to 17.6% and transfer to larger models with faste...
-
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.
-
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.
-
BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning
BalCapRL applies balanced multi-objective RL with GDPO-style normalization and length-conditional masking to improve MLLM image captioning, reporting gains of up to +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena on...
-
WaferSAGE: Large Language Model-Powered Wafer Defect Analysis via Synthetic Data Generation and Rubric-Guided Reinforcement Learning
A 4B-parameter vision-language model trained on rubric-guided synthetic wafer defect data reaches 6.493 LLM-Judge score, nearly matching Gemini-3-Flash at 7.149 for on-premise industrial use.
-
State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading
MLLMs ignore dial state geometry and cluster by appearance, causing inconsistency under variations; TriSCA's state-distance alignment, metadata supervision, and objective alignment improve robustness on clock and gaug...
-
Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models
CoM-PT trains vision foundation models in ascending size order using inverse knowledge transfer, allowing larger models to achieve superior performance with significantly reduced overall computational cost compared to...
-
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
-
SmolVLM: Redefining small and efficient multimodal models
SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
Emu3: Next-Token Prediction is All You Need
Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
-
Are We on the Right Way for Evaluating Large Vision-Language Models?
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
-
MMBench: Is Your Multi-modal Model an All-around Player?
MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.
-
WaferSAGE: Large Language Model-Powered Wafer Defect Analysis via Synthetic Data Generation and Rubric-Guided Reinforcement Learning
A 4B Qwen3-VL model trained via rubric-guided synthetic data and Group Sequence Policy Optimization reaches an LLM-Judge score of 6.493 on wafer defect VQA, nearly matching Gemini-3-Flash while supporting full on-prem...
-
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.
-
LLaVA-OneVision: Easy Visual Task Transfer
LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
-
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
-
WaferSAGE: Large Language Model-Powered Wafer Defect Analysis via Synthetic Data Generation and Rubric-Guided Reinforcement Learning
A 4B-parameter Qwen3-VL model trained via synthetic VQA data and rubric-guided GSPO reinforcement learning reaches 6.493 LLM-Judge score on wafer defect analysis, approaching Gemini-3-Flash while supporting full on-pr...
-
Show-o2: Improved Native Unified Multimodal Models
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
-
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
-
PaliGemma: A versatile 3B VLM for transfer
PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.
-
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
-
DeepSeek-VL: Towards Real-World Vision-Language Understanding
DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder,...
-
Yi: Open Foundation Models by 01.AI
Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
Reference graph
Works this paper leans on
- [1]
- [2] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
- [3] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
- [4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- [5] Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023.
- [6] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal LLM's referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
- [7] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
- [8] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
- [9] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
- [10] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning.
- [11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- [12] Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. GLM: General language model pretraining with autoregressive blank infilling. arXiv preprint arXiv:2103.10360, 2021.
- [13] Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yonglong Tian. Improving CLIP training with language rewrites. arXiv preprint arXiv:2305.20088, 2023.
- [14] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369, 2023.
- [15] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
- [16] Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. DataComp: In search of the next generation of multimodal datasets. arXiv preprint arXiv:2304.14108, 2023.
- [17] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913, 2017.
- [18] Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. VizWiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3608–3617.
- [19] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR.
- [20] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798, 2014.
- [21] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
- [22] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123:32–73, 2017.
- [23] Zhengfeng Lai, Haotian Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, Zhe Gan, Jiulong Shan, Chen-Nee Chuah, Yinfei Yang, et al. From scarcity to efficiency: Improving CLIP training via visual-enriched captions. arXiv preprint arXiv:2310.07699, 2023.
- [24] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. SEED-Bench: Benchmarking multimodal LLMs with generative comprehension. arXiv preprint arXiv:2307.16125, 2023.
- [25] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023.
- [26] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
- [27] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
- [28] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022.
- [29] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, pages 740–755. Springer, 2014.
- [30] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
- [31] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485.
- [32] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
- [33] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023.
- [34] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521.
- [35] Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, and Rongrong Ji. Cheap and quick: Efficient vision-language instruction tuning for large language models. arXiv preprint arXiv:2305.15023, 2023.
- [36] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3195–3204, 2019.
- [37] Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. OCR-VQA: Visual question answering by reading text in images. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 947–.
- [38] Thao Nguyen, Samir Yitzhak Gadre, Gabriel Ilharco, Sewoong Oh, and Ludwig Schmidt. Improving multimodal datasets with image captioning. arXiv preprint arXiv:2307.10350, 2023.
- [39] OpenAI. ChatGPT. https://chat.openai.com/.
- [40]
- [41] Vicente Ordonez, Girish Kulkarni, and Tamara Berg. Im2Text: Describing images using 1 million captioned photographs. Advances in Neural Information Processing Systems, 24, 2011.
- [42] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- [43] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
- [44] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
- [45] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [46] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
- [47] Babak Saleh and Ahmed Elgammal. Large-scale classification of fine-art paintings: Learning the right metric on the right feature. arXiv preprint arXiv:1505.00855, 2015.
- [48] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
- [49] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-OKVQA: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–162. Springer, 2022.
- [50] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
- [51] Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. TextCaps: A dataset for image captioning with reading comprehension. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II, pages 742–758. Springer.
- [52] InternLM Team. InternLM: A multilingual language model with progressively enhanced capabilities, 2023.
- [53] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- [54] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- [55] Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Chunyi Li, Wenxiu Sun, Qiong Yan, Guangtao Zhai, et al. Q-Bench: A benchmark for general-purpose foundation models on low-level vision. arXiv preprint arXiv:2309.14181, 2023.
- [56] Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, Fan Yang, et al. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305, 2023.
- [57] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
- [58] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. MM-Vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
- [59] Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, and Jianfeng Gao. GLIPv2: Unifying localization and vision-language understanding. Advances in Neural Information Processing Systems, 35:36067–36080, 2022.
- [60] Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Shuangrui Ding, Songyang Zhang, Haodong Duan, Hang Yan, et al. InternLM-XComposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112, 2023.
- [61] Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023.
- [62] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.