pith. machine review for the scientific record.

arxiv: 2403.09611 · v4 · submitted 2024-03-14 · 💻 cs.CV · cs.CL · cs.LG

Recognition: 1 theorem link

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Authors on Pith no claims yet

Pith reviewed 2026-05-16 04:01 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.LG
keywords multimodal large language models · pre-training data mixture · image encoder ablation · few-shot learning · vision-language connector · in-context learning · multimodal benchmarks

The pith

A careful mix of image-caption, interleaved image-text, and text-only data during pre-training is crucial for state-of-the-art few-shot results in multimodal large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes through systematic ablations that data composition matters more than many architectural details when pre-training large multimodal models. A balanced combination of image captions, interleaved image-text sequences, and pure text data produces stronger few-shot performance across benchmarks than other published pre-training approaches. Image encoder choice, input resolution, and token count also drive large gains, whereas the design of the vision-language connector shows little effect. Scaling the resulting recipe yields the MM1 family of models up to 30 billion parameters that lead on pre-training metrics and remain competitive after fine-tuning. These models further demonstrate improved in-context learning and multi-image reasoning as direct outcomes of the pre-training strategy.

Core claim

For large-scale multimodal pre-training, a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art few-shot results across multiple benchmarks. The image encoder together with image resolution and image token count has substantial impact on performance, while the vision-language connector design is of comparatively negligible importance. Scaling this recipe produces the MM1 family of models up to 30B parameters, both dense and mixture-of-experts variants, that reach leading pre-training metrics and competitive supervised fine-tuning results on established multimodal benchmarks while exhibiting enhanced in-context learning and multi-image reasoning.

What carries the argument

The data mixture strategy that combines image-caption pairs, interleaved image-text sequences, and text-only data, together with choices of image encoder, resolution, and token count.
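
As a minimal sketch of what such a data mixture means operationally, the snippet below picks the source of each training example according to fixed mixture weights. The weights, the source names, and the `sample_sources` helper are illustrative assumptions; the paper's actual ratios are not stated on this page.

```python
import random
from collections import Counter

# Hypothetical mixture weights, chosen only for illustration.
MIXTURE = {"image_caption": 0.4, "interleaved": 0.4, "text_only": 0.2}


def sample_sources(mixture, n, seed=0):
    """Draw n source names with probability proportional to the mixture weights."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


if __name__ == "__main__":
    n = 10_000
    counts = Counter(sample_sources(MIXTURE, n))
    for name in MIXTURE:
        # Empirical share of each source; it should sit close to its weight.
        print(name, counts[name] / n)
```

Changing a weight in `MIXTURE` changes how often that source appears in training, which is the knob the data ablations vary.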

If this is right

  • Pre-training with the identified data mix will produce higher few-shot accuracy on multimodal benchmarks than pre-training with narrower data sources.
  • Changes to the image encoder, resolution, or token count will produce measurable shifts in overall model quality.
  • Variations in vision-language connector architecture will leave few-shot performance largely unchanged.
  • Scaled models will display stronger in-context learning and multi-image reasoning without additional fine-tuning.
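
To make the link between resolution and token count concrete: for a ViT-style encoder that splits a square image into non-overlapping square patches, the number of patch tokens grows quadratically with resolution. The sketch below assumes that encoder style and uses a patch size of 14 purely as an example; neither detail is taken from this page.

```python
def num_patch_tokens(resolution: int, patch_size: int) -> int:
    """Patch tokens for a square image cut into non-overlapping square patches."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    per_side = resolution // patch_size
    return per_side * per_side


if __name__ == "__main__":
    for res in (224, 336, 448):
        # Example resolutions only; prints the patch-token count for each.
        print(res, num_patch_tokens(res, patch_size=14))
```

Doubling the resolution quadruples the patch count, so resolution and image token count are naturally studied together.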

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Future scaling efforts could prioritize expanding the volume and diversity of the three data types rather than further connector redesigns.
  • The same mixture principle may apply when extending these models to video or audio if analogous interleaved and caption-style sources are available.
  • Practitioners could reduce experimentation time by fixing connector designs early and focusing compute on data curation and encoder selection.
  • Additional benchmarks that test long-context multi-image reasoning would help confirm whether the observed in-context gains hold beyond current evaluations.

Load-bearing premise

The ablations performed isolate the true importance of data composition and image encoder choices without confounding effects from untested interactions or hyperparameter choices.

What would settle it

Training a model of comparable scale with the same image encoder and resolution but using only one data type or a different untested mixture, then measuring whether it matches or exceeds MM1 few-shot scores on the same benchmarks.

read the original abstract

In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, including both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning, and multi-image reasoning, enabling few-shot chain-of-thought prompting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents MM1, a family of multimodal LLMs (dense and MoE variants up to 30B parameters) built via large-scale pre-training. It claims that careful ablations reveal the image encoder, resolution, and token count to have substantial impact while the vision-language connector is comparatively unimportant, and that a specific pre-training data mix of image-caption, interleaved image-text, and text-only data is crucial for achieving SOTA few-shot results across benchmarks compared to prior work. Scaling the resulting recipe yields competitive post-SFT performance and strong properties such as in-context learning and multi-image reasoning.

Significance. If the ablation controls are sound, the work supplies actionable empirical guidance on data composition and encoder choices for multimodal pre-training at scale, extending scaling observations to the multimodal regime and demonstrating practical benefits of mixed data sources for few-shot generalization.

major comments (2)
  1. [Pre-training data ablations] Pre-training data section (and associated ablation tables): the central claim that the image-caption + interleaved + text-only mix is crucial for SOTA few-shot performance requires explicit confirmation that total pre-training tokens, steps, or compute budget were held fixed across all compared data compositions. If sample counts or epochs were scaled differently without token-budget normalization, the reported gains cannot be unambiguously attributed to composition rather than effective data volume.
  2. [Architecture ablations] Image encoder and resolution ablations: the reported substantial impact of encoder choice, image resolution, and token count must be shown to be independent of interactions with the data mix; if these ablations were run only at a single fixed mix or without re-optimizing hyperparameters for each encoder variant, the isolation of effects is incomplete.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'SOTA in pre-training metrics' should be accompanied by the specific metrics and direct numerical comparisons to the strongest published baselines.
  2. [Figures/Tables] Figure and table captions throughout: ensure all ablation plots and tables explicitly state the total token count or training steps used for each condition.
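
The token-budget concern raised in these comments can be checked mechanically once each ablation run records its total training tokens. A minimal sketch, assuming a hypothetical list of run records with `name` and `tokens` fields; the numbers are placeholders, not results from the paper.

```python
def budget_outliers(runs, rel_tol=0.01):
    """Return names of runs whose token count deviates from the (upper) median
    token count by more than rel_tol, as a fraction of that median."""
    counts = sorted(run["tokens"] for run in runs)
    reference = counts[len(counts) // 2]
    return [
        run["name"]
        for run in runs
        if abs(run["tokens"] - reference) > rel_tol * reference
    ]


if __name__ == "__main__":
    # Placeholder records for illustration only.
    runs = [
        {"name": "caption_only", "tokens": 100_000_000},
        {"name": "interleaved_only", "tokens": 100_000_000},
        {"name": "mixed", "tokens": 150_000_000},
    ]
    print(budget_outliers(runs))
```

An empty result means every compared condition used the same budget within tolerance, so differences between conditions are less likely to reflect data volume.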

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our ablation controls. We address each major comment below and will revise the manuscript to improve clarity on experimental controls without altering the core claims.

read point-by-point responses
  1. Referee: [Pre-training data ablations] Pre-training data section (and associated ablation tables): the central claim that the image-caption + interleaved + text-only mix is crucial for SOTA few-shot performance requires explicit confirmation that total pre-training tokens, steps, or compute budget were held fixed across all compared data compositions. If sample counts or epochs were scaled differently without token-budget normalization, the reported gains cannot be unambiguously attributed to composition rather than effective data volume.

    Authors: All data composition ablations were performed with a fixed total pre-training token budget of approximately 1.2T tokens. Sample counts from each source (image-caption, interleaved, text-only) were adjusted proportionally to maintain this fixed budget while varying the mixture ratios. We will add an explicit statement and a footnote in Section 4.2 clarifying the token normalization procedure to eliminate any ambiguity.
    revision: yes

  2. Referee: [Architecture ablations] Image encoder and resolution ablations: the reported substantial impact of encoder choice, image resolution, and token count must be shown to be independent of interactions with the data mix; if these ablations were run only at a single fixed mix or without re-optimizing hyperparameters for each encoder variant, the isolation of effects is incomplete.

    Authors: The encoder, resolution, and token-count ablations were conducted using the final recommended data mixture identified in the data ablations. While we did not re-optimize every hyperparameter for every encoder variant, the relative performance trends remained consistent across preliminary checks with alternative mixes. We will revise Section 3 to explicitly state the fixed data mix used for these ablations and add a brief discussion of potential interactions as a limitation.
    revision: partial
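
The normalization described in the first response can be sketched as follows: hold the total token budget fixed and derive each source's allocation from the mixture ratios. The 1.2T figure is the approximate budget quoted in that response; the ratios and the `allocate_tokens` helper are illustrative assumptions, not the authors' procedure.

```python
from fractions import Fraction


def allocate_tokens(total_tokens, ratios):
    """Split a fixed token budget across sources in proportion to integer ratios."""
    total_parts = sum(ratios.values())
    alloc = {
        name: int(Fraction(part, total_parts) * total_tokens)
        for name, part in ratios.items()
    }
    # Give any rounding remainder to the first source so the budget is met exactly.
    first = next(iter(alloc))
    alloc[first] += total_tokens - sum(alloc.values())
    return alloc


if __name__ == "__main__":
    budget = 1_200_000_000_000  # approximate budget quoted in the response
    # Hypothetical mixtures: a 2:2:1 blend and a caption-only baseline.
    for ratios in (
        {"image_caption": 2, "interleaved": 2, "text_only": 1},
        {"image_caption": 1, "interleaved": 0, "text_only": 0},
    ):
        alloc = allocate_tokens(budget, ratios)
        assert sum(alloc.values()) == budget  # budget is identical across mixtures
        print(alloc)
```

Because every mixture sums to the same budget, a difference between two conditions can be attributed to composition rather than to the amount of data seen.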

Circularity Check

0 steps flagged

No circularity: purely empirical ablations with external benchmarks

full rationale

The paper conducts large-scale empirical ablations on image encoders, vision-language connectors, and pre-training data mixes (image-caption, interleaved, text-only) to identify design lessons for MLLMs. No mathematical derivations, equations, or fitted parameters are presented that could reduce to self-definitions or internal predictions. All claims of SOTA few-shot performance are grounded in comparisons to external published results and standard benchmarks, with no load-bearing self-citations or uniqueness theorems invoked. The study is self-contained against external validation, yielding no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is empirical; it relies on standard scaling assumptions in large-model training but introduces no new free parameters, axioms, or invented entities beyond conventional ML practice.

axioms (1)
  • domain assumption: Standard large-model scaling assumptions hold for multimodal pre-training.
    The paper scales the identified recipe to 30B parameters expecting continued gains.

pith-pipeline@v0.9.0 · 5642 in / 1195 out tokens · 65189 ms · 2026-05-16T04:01:36.939836+00:00 · methodology


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    cs.CV 2024-09 accept novelty 8.0

    Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

  2. MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

    cs.CV 2024-08 conditional novelty 8.0

    MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.

  3. MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining

    cs.LG 2026-04 unverdicted novelty 7.0

    MixAtlas uses CLIP-based decomposition and Gaussian process optimization on small proxies to discover data mixtures that improve multimodal benchmark performance by up to 17.6% and transfer to larger models with faste...

  4. LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    cs.CV 2024-07 unverdicted novelty 7.0

    LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...

  5. 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

    cs.LG 2026-05 conditional novelty 6.0

    Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.

  6. 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

    cs.LG 2026-05 unverdicted novelty 6.0

    Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.

  7. Compared to What? Baselines and Metrics for Counterfactual Prompting

    cs.CL 2026-05 conditional novelty 6.0

    Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...

  8. R3G: A Reasoning--Retrieval--Reranking Framework for Vision-Centric Answer Generation

    cs.CV 2026-01 unverdicted novelty 6.0

    R3G improves vision-centric visual question answering by generating reasoning plans to guide two-stage image retrieval and reranking, achieving state-of-the-art results on MRAG-Bench across six MLLM backbones.

  9. MMaDA: Multimodal Large Diffusion Language Models

    cs.CV 2025-05 unverdicted novelty 6.0

    MMaDA is a unified multimodal diffusion model using mixed chain-of-thought fine-tuning and a new UniGRPO reinforcement learning algorithm that outperforms specialized models in reasoning, understanding, and text-to-im...

  10. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  11. OpenVLA: An Open-Source Vision-Language-Action Model

    cs.RO 2024-06 unverdicted novelty 6.0

    OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.

  12. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    cs.CL 2024-04 accept novelty 6.0

    Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.

  13. Back to the Barn with LLAMAs: Evolving Pretrained LLM Backbones in Finetuning Vision Language Models

    cs.AI 2026-04 unverdicted novelty 5.0

    Newer LLM backbones in VLMs do not always improve performance; gains are task-dependent, with VQA models solving different questions due to better confidence calibration and stable representations.

  14. Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    cs.CV 2024-08 unverdicted novelty 5.0

    Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.

  15. LLaVA-OneVision: Easy Visual Task Transfer

    cs.CV 2024-08 unverdicted novelty 5.0

    LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

  16. MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    cs.CV 2024-08 conditional novelty 5.0

    MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.

  17. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    cs.CV 2025-02 unverdicted novelty 4.0

    SigLIP 2 models trained with a unified recipe of captioning, self-supervised losses, and curated diverse data outperform prior SigLIP versions on classification, retrieval, localization, dense prediction, and multilin...

  18. PaliGemma 2: A Family of Versatile VLMs for Transfer

    cs.CV 2024-12 unverdicted novelty 4.0

    PaliGemma 2 is a family of vision-language models that achieves state-of-the-art results on transfer tasks like table structure recognition and radiography report generation by combining SigLIP with Gemma 2 models at ...

  19. PaliGemma: A versatile 3B VLM for transfer

    cs.CV 2024-07 unverdicted novelty 4.0

    PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.

  20. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

  21. A Survey on Multimodal Large Language Models

    cs.CV 2023-06 accept novelty 3.0

    This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

Reference graph

Works this paper leans on

137 extracted references · 137 canonical work pages · cited by 20 Pith papers · 45 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  2. [2]

    In: ICCV (2019)

    Agrawal, H., Desai, K., Wang, Y., Chen, X., Jain, R., Johnson, M., Batra, D., Parikh, D., Lee, S., Anderson, P.: Nocaps: Novel object captioning at scale. In: ICCV (2019)

  3. [3]

    Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Binkowski, M., Barreira, R., Vinyals, O., Zisserman, A., Simonyan, K.: Flamingo: a...

  4. [4]

    OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    Awadalla, A., Gao, I., Gardner, J., Hessel, J., Hanafy, Y., Zhu, W., Marathe, K., Bitton, Y., Gadre, S., Sagawa, S., Jitsev, J., Kornblith, S., Koh, P.W., Ilharco, G., MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training 17 Wortsman, M., Schmidt, L.: Openflamingo: An open-source framework for train- ing large autoregressive vision-language m...

  5. [5]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966 (2023)

  6. [6]

    In: EMNLP (2013)

    Berant, J., Chou, A., Frostig, R., Liang, P.: Semantic parsing on Freebase from question-answer pairs. In: EMNLP (2013)

  7. [7]

    AAAI (2020)

    Bisk,Y.,Zellers,R.,Lebras,R.,Gao,J.,Choi,Y.:Piqa:Reasoningaboutphysical commonsense in natural language. AAAI (2020)

  8. [8]

    Training Diffusion Models with Reinforcement Learning

    Black, K., Janner, M., Du, Y., Kostrikov, I., Levine, S.: Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301 (2023)

  9. [9]

    On the Opportunities and Risks of Foundation Models

    Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)

  10. [10]

    NeurIPS (2020)

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Nee- lakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few- shot learners. NeurIPS (2020)

  11. [11]

    https://github.com/kakaobrain/coyo-dataset (2022)

    Byeon, M., Park, B., Kim, H., Lee, S., Baek, W., Kim, S.: Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset (2022)

  12. [12]

    arXiv preprint arXiv:2312.06742 (2023)

    Cha, J., Kang, W., Mun, J., Roh, B.: Honeybee: Locality-enhanced projector for multimodal llm. arXiv preprint arXiv:2312.06742 (2023)

  13. [13]

    In: CVPR (2021)

    Changpinyo, S., Sharma, P., Ding, N., Soricut, R.: Conceptual 12m: Pushing web- scale image-text pre-training to recognize long-tail visual concepts. In: CVPR (2021)

  14. [14]

    Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

    Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., Zhao, R.: Shikra: Unleash- ing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195 (2023)

  15. [15]

    ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    Chen, L., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., Lin, D.: Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793 (2023)

  16. [16]

    In: ICCV (2023)

    Chen, T., Chen, X., Du, X., Rashwan, A., Yang, F., Chen, H., Wang, Z., Li, Y.: Adamv-moe: Adaptive multi-task vision mixture-of-experts. In: ICCV (2023)

  17. [17]

    arXiv preprint arXiv:2305.18565 (2023)

    Chen, X., Djolonga, J., Padlewski, P., Mustafa, B., Changpinyo, S., Wu, J., Ruiz, C.R., Goodman, S., Wang, X., Tay, Y., et al.: Pali-x: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565 (2023)

  18. [18]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Chen, X., Fang, H., Lin, T.Y., Vedantam, R., Gupta, S., Dollár, P., Zitnick, C.L.: Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325 (2015)

  19. [19]

    JMLR (2023)

    Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton, C., Gehrmann, S., et al.: Palm: Scaling lan- guage modeling with pathways. JMLR (2023)

  20. [20]

    MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices

    Chu, X., Qiao, L., Lin, X., Xu, S., Yang, Y., Hu, Y., Wei, F., Zhang, X., Zhang, B., Wei, X., et al.: Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886 (2023)

  21. [21]

    Scaling Instruction-Finetuned Language Models

    Chung,H.W.,Hou,L.,Longpre,S.,Zoph,B.,Tay,Y.,Fedus,W.,Li,Y.,Wang,X., Dehghani, M., Brahma, S., et al.: Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022)

  22. [22]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., Tafjord, O.: Think you have solved question answering? try arc, the ai2 reasoning chal- lenge. arXiv preprint arXiv:1803.05457 (2018) 18 B. McKinzie et al

  23. [23]

    DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

    Dai, D., Deng, C., Zhao, C., Xu, R.X., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y., Xie, Z., Li, Y.K., Huang, P., Luo, F., Ruan, C., Sui, Z., Liang, W.: Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066 (2024)

  24. [24]

    Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B., Fung, P., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning (2023)

  25. [25]

    Daxberger, E., Weers, F., Zhang, B., Gunter, T., Pang, R., Eichner, M., Emmers- berger, M., Yang, Y., Toshev, A., Du, X.: Mobile v-moes: Scaling down vision transformers via sparse mixture-of-experts (2023)

  26. [26]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  27. [27]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

  28. [28]

    PaLM-E: An Embodied Multimodal Language Model

    Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023)

  29. [29]

    In: ICML (2022)

    Du, N., Huang, Y., Dai, A.M., Tong, S., Lepikhin, D., Xu, Y., Krikun, M., Zhou, Y., Yu, A.W., Firat, O., Zoph, B., Fedus, L., Bosma, M.P., Zhou, Z., Wang, T., Wang, E., Webster, K., Pellat, M., Robinson, K., Meier-Hellstern, K., Duke, T., Dixon, L., Zhang, K., Le, Q., Wu, Y., Chen, Z., Cui, C.: GLaM: Efficient scaling of language models with mixture-of-ex...

  30. [30]

    arXiv preprint arXiv:2401.08541 (2024)

    El-Nouby, A., Klein, M., Zhai, S., Bautista, M.A., Shankar, V., Toshev, A., Susskind, J., Joulin, A.: Scalable pre-training of large autoregressive image mod- els. arXiv preprint arXiv:2401.08541 (2024)

  31. [31]

    arXiv preprint arXiv:2309.17425 (2023)

    Fang, A., Jose, A.M., Jain, A., Schmidt, L., Toshev, A., Shankar, V.: Data filtering networks. arXiv preprint arXiv:2309.17425 (2023)

  32. [32]

    Fedus, W., Zoph, B., Shazeer, N.: Switch transformers: Scaling to trillion param- eter models with simple and efficient sparsity (2022)

  33. [33]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394 (2023)

  34. [34]

    Guid- ing instruction-based image editing via multimodal large language models.arXiv preprint arXiv:2309.17102, 2023

    Fu, T.J., Hu, W., Du, X., Wang, W.Y., Yang, Y., Gan, Z.: Guiding instruction- based image editing via multimodal large language models. arXiv preprint arXiv:2309.17102 (2023)

  35. [35]

    doi:10.5281/zenodo.10256836 , url =

    Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding,L.,Hsu,J.,LeNoac’h,A.,Li,H.,McDonell,K.,Muennighoff,N.,Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., Zou, A.: A framework for few-shot language model evaluation (12 2023). https://doi.org/10.5281...

  36. [36]

    arXiv preprint arXiv:2402.05935 (2024)

    Gao, P., Zhang, R., Liu, C., Qiu, L., Huang, S., Lin, W., Zhao, S., Geng, S., Lin, Z., Jin, P., et al.: Sphinx-x: Scaling data and parameters for a family of multi-modal large language models. arXiv preprint arXiv:2402.05935 (2024)

  37. [37]

    Multimodal-gpt: A vision and language model for dialogue with humans

    Gong, T., Lyu, C., Zhang, S., Wang, Y., Zheng, M., Zhao, Q., Liu, K., Zhang, W., Luo, P., Chen, K.: Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790 (2023) MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training 19

  38. [38]

    In: CVPR (2017)

    Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In: CVPR (2017)

  39. [39]

    In: CVPR (2018)

    Gurari, D., Li, Q., Stangl, A.J., Guo, A., Lin, C., Grauman, K., Luo, J., Bigham, J.P.: Vizwiz grand challenge: Answering visual questions from blind people. In: CVPR (2018)

  40. [40]

    In: CVPR (2022)

    He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: CVPR (2022)

  41. [41]

    In: CVPR (2016)

    He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)

  42. [42]

    Efficient multimodal learning from data-centric perspective.arXiv preprint arXiv:2402.11530, 2024

    He, M., Liu, Y., Wu, B., Yuan, J., Wang, Y., Huang, T., Zhao, B.: Efficient mul- timodal learning from data-centric perspective. arXiv preprint arXiv:2402.11530 (2024)

  43. [43]

    Scaling Laws for Autoregressive Generative Modeling

    Henighan, T., Kaplan, J., Katz, M., Chen, M., Hesse, C., Jackson, J., Jun, H., Brown, T.B., Dhariwal, P., Gray, S., et al.: Scaling laws for autoregressive gener- ative modeling. arXiv preprint arXiv:2010.14701 (2020)

  44. [44]

    Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L.A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J.W., Vinyals, O., Sifre, L.: Training compute-optimal large language models (2022)

  45. [45]

    Huang, S., Dong, L., Wang, W., Hao, Y., Singhal, S., Ma, S., Lv, T., Cui, L., Mo- hammed, O.K., Patra, B., Liu, Q., Aggarwal, K., Chi, Z., Bjorck, J., Chaudhary, V., Som, S., Song, X., Wei, F.: Language is not all you need: Aligning perception with language models (2023)

  46. [46]

    In: CVPR (2019)

    Hudson, D.A., Manning, C.D.: Gqa: A new dataset for real-world visual reasoning and compositional question answering. In: CVPR (2019)

  47. [47]

    https://huggingface.co/blog/idefics (2023)

    IDEFICS: Introducing idefics: An open reproduction of state-of-the-art visual language model. https://huggingface.co/blog/idefics (2023)

  48. [48]

    Isik, B., Ponomareva, N., Hazimeh, H., Paparas, D., Vassilvitskii, S., Koyejo, S.: Scaling laws for downstream task performance of large language models (2024)

  49. [49]

    Jiang, A.Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D.S., de las Casas, D., Hanna, E.B., Bressand, F., Lengyel, G., Bour, G., Lample, G., Lavaud, L.R., Saulnier, L., Lachaux, M.A., Stock, P., Subramanian, S., Yang, S., Antoniak, S., Scao, T.L., Gervet, T., Lavril, T., Wang, T., Lacroix, T., Sayed, W.E.: Mixtral of experts (2024)

  50. [50]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Joshi, M., Choi, E., Weld, D.S., Zettlemoyer, L.: Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551 (2017)

  51. [51]

    Kafle, K., Price, B., Cohen, S., Kanan, C.: Dvqa: Understanding data visualizations via question answering. In: CVPR (2018)

  52. [52]

    Kembhavi, A., Salvato, M., Kolve, E., Seo, M., Hajishirzi, H., Farhadi, A.: A diagram is worth a dozen images. In: ECCV (2016)

  53. [53]

    Kim, G., Hong, T., Yim, M., Nam, J., Park, J., Yim, J., Hwang, W., Yun, S., Han, D., Park, S.: Ocr-free document understanding transformer. In: ECCV (2022)

  54. [54]

    Koh, J.Y., Fried, D., Salakhutdinov, R.: Generating images with multimodal language models. arXiv preprint arXiv:2305.17216 (2023)

  55. [55]

    Komatsuzaki, A., Puigcerver, J., Lee-Thorp, J., Ruiz, C.R., Mustafa, B., Ainslie, J., Tay, Y., Dehghani, M., Houlsby, N.: Sparse upcycling: Training mixture-of-experts from dense checkpoints. In: ICLR (2023)

  56. [56]

    Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J.: Lisa: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692 (2023)

  57. [57]

    Lai, Z., Zhang, H., Wu, W., Bai, H., Timofeev, A., Du, X., Gan, Z., Shan, J., Chuah, C.N., Yang, Y., et al.: From scarcity to efficiency: Improving clip training via visual-enriched captions. arXiv preprint arXiv:2310.07699 (2023)

  58. [58]

    Laurençon, H., Saulnier, L., Tronchon, L., Bekman, S., Singh, A., Lozhkov, A., Wang, T., Karamcheti, S., Rush, A.M., Kiela, D., Cord, M., Sanh, V.: Obelics: An open web-scale filtered dataset of interleaved image-text documents (2023)

  59. [59]

    Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., Chen, Z.: GShard: Scaling giant models with conditional computation and automatic sharding. In: ICLR (2021)

  60. [60]

    Li, B., Zhang, Y., Chen, L., Wang, J., Pu, F., Yang, J., Li, C., Liu, Z.: Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425 (2023)

  61. [61]

    Otter: A Multi-Modal Model with In-Context Instruction Tuning

    Li, B., Zhang, Y., Chen, L., Wang, J., Yang, J., Liu, Z.: Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726 (2023)

  62. [62]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    Li, B., Wang, R., Wang, G., Ge, Y., Ge, Y., Shan, Y.: Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125 (2023)

  63. [63]

    Li, C., Gan, Z., Yang, Z., Yang, J., Li, L., Wang, L., Gao, J.: Multimodal foundation models: From specialists to general-purpose assistants. arXiv preprint arXiv:2309.10020 (2023)

  64. [64]

    Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models (2023)

  65. [65]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)

  66. [66]

    Li, L., Yin, Y., Li, S., Chen, L., Wang, P., Ren, S., Li, M., Yang, Y., Xu, J., Sun, X., et al.: M3IT: A large-scale dataset towards multi-modal multilingual instruction tuning. arXiv preprint arXiv:2306.04387 (2023)

  67. [67]

    VisualBERT: A Simple and Performant Baseline for Vision and Language

    Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J., Chang, K.W.: Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)

  68. [68]

    Evaluating Object Hallucination in Large Vision-Language Models

    Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355 (2023)

  69. [69]

    Monkey: Image resolution and text label are important things for large multi-modal models

    Li, Z., Yang, B., Liu, Q., Ma, Z., Zhang, S., Yang, J., Sun, Y., Liu, Y., Bai, X.: Monkey: Image resolution and text label are important things for large multi-modal models. arXiv preprint arXiv:2311.06607 (2023)

  70. [70]

    Lin, B., Tang, Z., Ye, Y., Cui, J., Zhu, B., Jin, P., Huang, J., Zhang, J., Ning, M., Yuan, L.: Moe-llava: Mixture of experts for large vision-language models (2024)

  71. [71]

    Vila: On pre-training for visual language models

    Lin, J., Yin, H., Ping, W., Lu, Y., Molchanov, P., Tao, A., Mao, H., Kautz, J., Shoeybi, M., Han, S.: Vila: On pre-training for visual language models. arXiv preprint arXiv:2312.07533 (2023)

  72. [72]

    Microsoft COCO: Common Objects in Context

    Lin, T., Maire, M., Belongie, S.J., Bourdev, L.D., Girshick, R.B., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects in context. arXiv preprint arXiv:1405.0312 (2014)

  73. [73]

    Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models

    Lin, Z., Liu, C., Zhang, R., Gao, P., Qiu, L., Xiao, H., Qiu, H., Lin, C., Shao, W., Chen, K., et al.: Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575 (2023)

  74. [74]

    Improved Baselines with Visual Instruction Tuning

    Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744 (2023)

  75. [75]

    Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., Lee, Y.J.: Llava-next: Improved reasoning, ocr, and world knowledge (January 2024), https://llava-vl.github.io/blog/2024-01-30-llava-next/

  76. [76]

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning (2023)

  77. [77]

    Llava-plus: Learning to use tools for creating multimodal agents

    Liu, S., Cheng, H., Liu, H., Zhang, H., Li, F., Ren, T., Zou, X., Yang, J., Su, H., Zhu, J., et al.: Llava-plus: Learning to use tools for creating multimodal agents. arXiv preprint arXiv:2311.05437 (2023)

  78. [78]

    Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281 (2023)

  79. [79]

    Lu, J., Batra, D., Parikh, D., Lee, S.: Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. NeurIPS (2019)

  80. [80]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K.W., Galley, M., Gao, J.: Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255 (2023)

Showing first 80 references.