arxiv: 2403.09611 · v4 · submitted 2024-03-14 · 💻 cs.CV · cs.CL · cs.LG


MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch, Alexander Toshev, Yinfei Yang


Pith reviewed 2026-05-16 04:01 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.LG

keywords multimodal large language models · pre-training data mixture · image encoder ablation · few-shot learning · vision-language connector · in-context learning · multimodal benchmarks


The pith

A careful mix of image-caption, interleaved image-text, and text-only data during pre-training is crucial for state-of-the-art few-shot results in multimodal large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes through systematic ablations that data composition matters more than many architectural details when pre-training large multimodal models. A balanced combination of image captions, interleaved image-text sequences, and pure text data produces stronger few-shot performance across benchmarks than other published pre-training approaches. Image encoder choice, input resolution, and token count also drive large gains, whereas the design of the vision-language connector shows little effect. Scaling the resulting recipe yields the MM1 family of models up to 30 billion parameters that lead on pre-training metrics and remain competitive after fine-tuning. These models further demonstrate improved in-context learning and multi-image reasoning as direct outcomes of the pre-training strategy.

Core claim

For large-scale multimodal pre-training, a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art few-shot results across multiple benchmarks. The image encoder together with image resolution and image token count has substantial impact on performance, while the vision-language connector design is of comparatively negligible importance. Scaling this recipe produces the MM1 family of models up to 30B parameters, both dense and mixture-of-experts variants, that reach leading pre-training metrics and competitive supervised fine-tuning results on established multimodal benchmarks while exhibiting enhanced in-context learning and multi-image reasoning.

What carries the argument

The data mixture strategy that combines image-caption pairs, interleaved image-text sequences, and text-only data, together with choices of image encoder, resolution, and token count.
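As a rough illustration of what a three-source data mixture means in practice, the sketch below draws a data source for each training batch in proportion to a mixture weight. The weights, the source names, and the `sample_sources` helper are hypothetical and chosen only for illustration; they are not the ratios or the pipeline used in the paper.

```python
import random
from collections import Counter

# Hypothetical mixture weights for illustration only; not values from the paper.
MIXTURE = {"image_caption": 0.4, "interleaved": 0.4, "text_only": 0.2}


def sample_sources(mixture, num_batches, seed=0):
    """Pick one data source per batch, in proportion to its mixture weight."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=num_batches)


schedule = sample_sources(MIXTURE, num_batches=10_000)
counts = Counter(schedule)
for name in MIXTURE:
    # Empirical share of batches drawn from each source.
    print(name, counts[name] / len(schedule))
```

Changing only the entries of `MIXTURE` while keeping `num_batches` fixed is one simple way to vary composition without varying the amount of training.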

If this is right

Pre-training with the identified data mix will produce higher few-shot accuracy on multimodal benchmarks than pre-training with narrower data sources.
Changes to the image encoder, resolution, or token count will produce measurable shifts in overall model quality.
Variations in vision-language connector architecture will leave few-shot performance largely unchanged.
Scaled models will display stronger in-context learning and multi-image reasoning without additional fine-tuning.
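Resolution and token count are linked: for a ViT-style encoder that splits the image into non-overlapping square patches, the number of patch tokens grows quadratically with the input side length. The sketch below shows that arithmetic. The patch size of 14 and the helper name `num_image_tokens` are assumptions for illustration, and a connector may further pool or resample these tokens before they reach the language model.

```python
def num_image_tokens(resolution: int, patch_size: int) -> int:
    """Patch tokens for a square image split into non-overlapping square patches."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    side = resolution // patch_size
    return side * side


# Assumed patch size of 14 (illustrative; not taken from the text above).
for res in (224, 336, 448):
    print(res, num_image_tokens(res, patch_size=14))
# 224 -> 256, 336 -> 576, 448 -> 1024
```

Doubling the side length quadruples the token count, which is why resolution changes also change the sequence length the language model has to process.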

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future scaling efforts could prioritize expanding the volume and diversity of the three data types rather than further connector redesigns.
The same mixture principle may apply when extending these models to video or audio if analogous interleaved and caption-style sources are available.
Practitioners could reduce experimentation time by fixing connector designs early and focusing compute on data curation and encoder selection.
Additional benchmarks that test long-context multi-image reasoning would help confirm whether the observed in-context gains hold beyond current evaluations.

Load-bearing premise

The ablations performed isolate the true importance of data composition and image encoder choices without confounding effects from untested interactions or hyperparameter choices.

What would settle it

Training a model of comparable scale with the same image encoder and resolution but using only one data type or a different untested mixture, then measuring whether it matches or exceeds MM1 few-shot scores on the same benchmarks.

Original abstract

In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, including both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning, and multi-image reasoning, enabling few-shot chain-of-thought prompting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MM1 gives a usable pre-training recipe where data mix beats connector design for few-shot multimodal performance, but the ablations need close checking on token budgets.


The paper's core finding is that mixing image captions, interleaved image-text, and plain text during pre-training delivers the best few-shot results across benchmarks, while the vision-language connector adds little once the image encoder and resolution are set. They back this with ablations and then scale the recipe to 30B dense and MoE models that hit strong pre-training metrics and hold up after supervised fine-tuning. The in-context and multi-image reasoning gains are presented as direct outcomes of that pre-training scale.

Referee Report

2 major / 2 minor

Summary. The paper presents MM1, a family of multimodal LLMs (dense and MoE variants up to 30B parameters) built via large-scale pre-training. It claims that careful ablations reveal the image encoder, resolution, and token count to have substantial impact while the vision-language connector is comparatively unimportant, and that a specific pre-training data mix of image-caption, interleaved image-text, and text-only data is crucial for achieving SOTA few-shot results across benchmarks compared to prior work. Scaling the resulting recipe yields competitive post-SFT performance and strong properties such as in-context learning and multi-image reasoning.

Significance. If the ablation controls are sound, the work supplies actionable empirical guidance on data composition and encoder choices for multimodal pre-training at scale, extending scaling observations to the multimodal regime and demonstrating practical benefits of mixed data sources for few-shot generalization.

major comments (2)

[Pre-training data ablations] Pre-training data section (and associated ablation tables): the central claim that the image-caption + interleaved + text-only mix is crucial for SOTA few-shot performance requires explicit confirmation that total pre-training tokens, steps, or compute budget were held fixed across all compared data compositions. If sample counts or epochs were scaled differently without token-budget normalization, the reported gains cannot be unambiguously attributed to composition rather than effective data volume.
[Architecture ablations] Image encoder and resolution ablations: the reported substantial impact of encoder choice, image resolution, and token count must be shown to be independent of interactions with the data mix; if these ablations were run only at a single fixed mix or without re-optimizing hyperparameters for each encoder variant, the isolation of effects is incomplete.

minor comments (2)

[Abstract] Abstract: the phrase 'SOTA in pre-training metrics' should be accompanied by the specific metrics and direct numerical comparisons to the strongest published baselines.
[Figures/Tables] Figure and table captions throughout: ensure all ablation plots and tables explicitly state the total token count or training steps used for each condition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our ablation controls. We address each major comment below and will revise the manuscript to improve clarity on experimental controls without altering the core claims.


Referee: [Pre-training data ablations] Pre-training data section (and associated ablation tables): the central claim that the image-caption + interleaved + text-only mix is crucial for SOTA few-shot performance requires explicit confirmation that total pre-training tokens, steps, or compute budget were held fixed across all compared data compositions. If sample counts or epochs were scaled differently without token-budget normalization, the reported gains cannot be unambiguously attributed to composition rather than effective data volume.

Authors: All data composition ablations were performed with a fixed total pre-training token budget of approximately 1.2T tokens. Sample counts from each source (image-caption, interleaved, text-only) were adjusted proportionally to maintain this fixed budget while varying the mixture ratios. We will add an explicit statement and a footnote in Section 4.2 clarifying the token normalization procedure to eliminate any ambiguity.
revision: yes

Referee: [Architecture ablations] Image encoder and resolution ablations: the reported substantial impact of encoder choice, image resolution, and token count must be shown to be independent of interactions with the data mix; if these ablations were run only at a single fixed mix or without re-optimizing hyperparameters for each encoder variant, the isolation of effects is incomplete.

Authors: The encoder, resolution, and token-count ablations were conducted using the final recommended data mixture identified in the data ablations. While we did not re-optimize every hyperparameter for every encoder variant, the relative performance trends remained consistent across preliminary checks with alternative mixes. We will revise Section 3 to explicitly state the fixed data mix used for these ablations and add a brief discussion of potential interactions as a limitation.
revision: partial
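
The token-budget normalization described in the first response can be sketched as a small allocation routine: the total budget stays fixed and only the split across sources changes. The 1.2T figure is the approximate budget quoted in the simulated rebuttal; the mixture parts and the `tokens_per_source` helper are hypothetical and shown only to make the control explicit.

```python
TOTAL_TOKENS = 1_200_000_000_000  # approximate budget quoted in the rebuttal above


def tokens_per_source(total_tokens: int, parts: dict[str, int]) -> dict[str, int]:
    """Split a fixed token budget across sources in proportion to integer parts."""
    total_parts = sum(parts.values())
    if total_parts <= 0:
        raise ValueError("mixture parts must sum to a positive number")
    alloc = {name: total_tokens * part // total_parts for name, part in parts.items()}
    # Give any rounding remainder to the largest source so the budget is met exactly.
    largest = max(parts, key=parts.get)
    alloc[largest] += total_tokens - sum(alloc.values())
    return alloc


# Hypothetical mixtures for illustration; not the ratios ablated in the paper.
mixed = tokens_per_source(TOTAL_TOKENS, {"image_caption": 2, "interleaved": 2, "text_only": 1})
caption_only = tokens_per_source(TOTAL_TOKENS, {"image_caption": 1, "interleaved": 0, "text_only": 0})

print(mixed)
print(caption_only)
```

Because both allocations sum to the same total, any difference in downstream scores between the two runs can be read as an effect of composition rather than of data volume, which is the control the referee asks for.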

Circularity Check

0 steps flagged

No circularity: purely empirical ablations with external benchmarks


The paper conducts large-scale empirical ablations on image encoders, vision-language connectors, and pre-training data mixes (image-caption, interleaved, text-only) to identify design lessons for MLLMs. No mathematical derivations, equations, or fitted parameters are presented that could reduce to self-definitions or internal predictions. All claims of SOTA few-shot performance are grounded in comparisons to external published results and standard benchmarks, with no load-bearing self-citations or uniqueness theorems invoked. The study is self-contained against external validation, yielding no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is empirical; it relies on standard scaling assumptions in large-model training but introduces no new free parameters, axioms, or invented entities beyond conventional ML practice.

axioms (1)

domain assumption: Standard large-model scaling assumptions hold for multimodal pre-training.
The paper scales the identified recipe to 30B parameters expecting continued gains.



Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
cs.CV 2024-09 accept novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
cs.CV 2024-08 conditional novelty 8.0

MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.

MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining
cs.LG 2026-04 unverdicted novelty 7.0

MixAtlas uses CLIP-based decomposition and Gaussian process optimization on small proxies to discover data mixtures that improve multimodal benchmark performance by up to 17.6% and transfer to larger models with faste...

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
cs.CV 2024-07 unverdicted novelty 7.0

LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...

20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
cs.LG 2026-05 conditional novelty 6.0

Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.

20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
cs.LG 2026-05 unverdicted novelty 6.0

Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.

Compared to What? Baselines and Metrics for Counterfactual Prompting
cs.CL 2026-05 conditional novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...

R3G: A Reasoning--Retrieval--Reranking Framework for Vision-Centric Answer Generation
cs.CV 2026-01 unverdicted novelty 6.0

R3G improves vision-centric visual question answering by generating reasoning plans to guide two-stage image retrieval and reranking, achieving state-of-the-art results on MRAG-Bench across six MLLM backbones.

MMaDA: Multimodal Large Diffusion Language Models
cs.CV 2025-05 unverdicted novelty 6.0

MMaDA is a unified multimodal diffusion model using mixed chain-of-thought fine-tuning and a new UniGRPO reinforcement learning algorithm that outperforms specialized models in reasoning, understanding, and text-to-im...

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
cs.CV 2024-12 unverdicted novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

OpenVLA: An Open-Source Vision-Language-Action Model
cs.RO 2024-06 unverdicted novelty 6.0

OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
cs.CL 2024-04 accept novelty 6.0

Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.

Back to the Barn with LLAMAs: Evolving Pretrained LLM Backbones in Finetuning Vision Language Models
cs.AI 2026-04 unverdicted novelty 5.0

Newer LLM backbones in VLMs do not always improve performance; gains are task-dependent, with VQA models solving different questions due to better confidence calibration and stable representations.

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
cs.CV 2024-08 unverdicted novelty 5.0

Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.

LLaVA-OneVision: Easy Visual Task Transfer
cs.CV 2024-08 unverdicted novelty 5.0

LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

MiniCPM-V: A GPT-4V Level MLLM on Your Phone
cs.CV 2024-08 conditional novelty 5.0

MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
cs.CV 2025-02 unverdicted novelty 4.0

SigLIP 2 models trained with a unified recipe of captioning, self-supervised losses, and curated diverse data outperform prior SigLIP versions on classification, retrieval, localization, dense prediction, and multilin...

PaliGemma 2: A Family of Versatile VLMs for Transfer
cs.CV 2024-12 unverdicted novelty 4.0

PaliGemma 2 is a family of vision-language models that achieves state-of-the-art results on transfer tasks like table structure recognition and radiography report generation by combining SigLIP with Gemma 2 models at ...

PaliGemma: A versatile 3B VLM for transfer
cs.CV 2024-07 unverdicted novelty 4.0

PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
cs.CV 2024-04 unverdicted novelty 4.0

InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

A Survey on Multimodal Large Language Models
cs.CV 2023-06 accept novelty 3.0

This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

Reference graph

Works this paper leans on

137 extracted references · 137 canonical work pages · cited by 20 Pith papers · 45 internal anchors

[1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

In: ICCV (2019)

Agrawal, H., Desai, K., Wang, Y., Chen, X., Jain, R., Johnson, M., Batra, D., Parikh, D., Lee, S., Anderson, P.: Nocaps: Novel object captioning at scale. In: ICCV (2019)

work page 2019
[3]

Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Binkowski, M., Barreira, R., Vinyals, O., Zisserman, A., Simonyan, K.: Flamingo: a...

work page 2022
[4]

OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

Awadalla, A., Gao, I., Gardner, J., Hessel, J., Hanafy, Y., Zhu, W., Marathe, K., Bitton, Y., Gadre, S., Sagawa, S., Jitsev, J., Kornblith, S., Koh, P.W., Ilharco, G., MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training 17 Wortsman, M., Schmidt, L.: Openflamingo: An open-source framework for train- ing large autoregressive vision-language m...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

In: EMNLP (2013)

Berant, J., Chou, A., Frostig, R., Liang, P.: Semantic parsing on Freebase from question-answer pairs. In: EMNLP (2013)

work page 2013
[7]

AAAI (2020)

Bisk,Y.,Zellers,R.,Lebras,R.,Gao,J.,Choi,Y.:Piqa:Reasoningaboutphysical commonsense in natural language. AAAI (2020)

work page 2020
[8]

Training Diffusion Models with Reinforcement Learning

Black, K., Janner, M., Du, Y., Kostrikov, I., Levine, S.: Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

On the Opportunities and Risks of Foundation Models

Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[10]

NeurIPS (2020)

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Nee- lakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few- shot learners. NeurIPS (2020)

work page 2020
[11]

https://github.com/kakaobrain/coyo-dataset (2022)

Byeon, M., Park, B., Kim, H., Lee, S., Baek, W., Kim, S.: Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset (2022)

work page 2022
[12]

arXiv preprint arXiv:2312.06742 (2023)

Cha, J., Kang, W., Mun, J., Roh, B.: Honeybee: Locality-enhanced projector for multimodal llm. arXiv preprint arXiv:2312.06742 (2023)

work page arXiv 2023
[13]

In: CVPR (2021)

Changpinyo, S., Sharma, P., Ding, N., Soricut, R.: Conceptual 12m: Pushing web- scale image-text pre-training to recognize long-tail visual concepts. In: CVPR (2021)

work page 2021
[14]

Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., Zhao, R.: Shikra: Unleash- ing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

Chen, L., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., Lin, D.: Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[16]

In: ICCV (2023)

Chen, T., Chen, X., Du, X., Rashwan, A., Yang, F., Chen, H., Wang, Z., Li, Y.: Adamv-moe: Adaptive multi-task vision mixture-of-experts. In: ICCV (2023)

work page 2023
[17]

arXiv preprint arXiv:2305.18565 (2023)

Chen, X., Djolonga, J., Padlewski, P., Mustafa, B., Changpinyo, S., Wu, J., Ruiz, C.R., Goodman, S., Wang, X., Tay, Y., et al.: Pali-x: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565 (2023)

work page arXiv 2023
[18]

Microsoft COCO Captions: Data Collection and Evaluation Server

Chen, X., Fang, H., Lin, T.Y., Vedantam, R., Gupta, S., Dollár, P., Zitnick, C.L.: Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325 (2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015
[19]

JMLR (2023)

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton, C., Gehrmann, S., et al.: Palm: Scaling lan- guage modeling with pathways. JMLR (2023)

work page 2023
[20]

MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices

Chu, X., Qiao, L., Lin, X., Xu, S., Yang, Y., Hu, Y., Wei, F., Zhang, X., Zhang, B., Wei, X., et al.: Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886 (2023)

work page internal anchor Pith review arXiv 2023
[21]

Scaling Instruction-Finetuned Language Models

Chung,H.W.,Hou,L.,Longpre,S.,Zoph,B.,Tay,Y.,Fedus,W.,Li,Y.,Wang,X., Dehghani, M., Brahma, S., et al.: Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[22]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., Tafjord, O.: Think you have solved question answering? try arc, the ai2 reasoning chal- lenge. arXiv preprint arXiv:1803.05457 (2018) 18 B. McKinzie et al

work page internal anchor Pith review Pith/arXiv arXiv 2018
[23]

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Dai, D., Deng, C., Zhao, C., Xu, R.X., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y., Xie, Z., Li, Y.K., Huang, P., Luo, F., Ruan, C., Sui, Z., Liang, W.: Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B., Fung, P., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning (2023)

work page 2023
[25]

Daxberger, E., Weers, F., Zhang, B., Gunter, T., Pang, R., Eichner, M., Emmers- berger, M., Yang, Y., Toshev, A., Du, X.: Mobile v-moes: Scaling down vision transformers via sparse mixture-of-experts (2023)

work page 2023
[26]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[27]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2010
[28]

PaLM-E: An Embodied Multimodal Language Model

Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[29]

In: ICML (2022)

Du, N., Huang, Y., Dai, A.M., Tong, S., Lepikhin, D., Xu, Y., Krikun, M., Zhou, Y., Yu, A.W., Firat, O., Zoph, B., Fedus, L., Bosma, M.P., Zhou, Z., Wang, T., Wang, E., Webster, K., Pellat, M., Robinson, K., Meier-Hellstern, K., Duke, T., Dixon, L., Zhang, K., Le, Q., Wu, Y., Chen, Z., Cui, C.: GLaM: Efficient scaling of language models with mixture-of-ex...

work page 2022
[30]

arXiv preprint arXiv:2401.08541 (2024)

El-Nouby, A., Klein, M., Zhai, S., Bautista, M.A., Shankar, V., Toshev, A., Susskind, J., Joulin, A.: Scalable pre-training of large autoregressive image mod- els. arXiv preprint arXiv:2401.08541 (2024)

work page arXiv 2024
[31]

arXiv preprint arXiv:2309.17425 (2023)

Fang, A., Jose, A.M., Jain, A., Schmidt, L., Toshev, A., Shankar, V.: Data filtering networks. arXiv preprint arXiv:2309.17425 (2023)

work page arXiv 2023
[32]

Fedus, W., Zoph, B., Shazeer, N.: Switch transformers: Scaling to trillion param- eter models with simple and efficient sparsity (2022)

work page 2022
[33]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Guid- ing instruction-based image editing via multimodal large language models.arXiv preprint arXiv:2309.17102, 2023

Fu, T.J., Hu, W., Du, X., Wang, W.Y., Yang, Y., Gan, Z.: Guiding instruction- based image editing via multimodal large language models. arXiv preprint arXiv:2309.17102 (2023)

work page arXiv 2023
[35]

doi:10.5281/zenodo.10256836 , url =

Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding,L.,Hsu,J.,LeNoac’h,A.,Li,H.,McDonell,K.,Muennighoff,N.,Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., Zou, A.: A framework for few-shot language model evaluation (12 2023). https://doi.org/10.5281...

work page doi:10.5281/zenodo.10256836 2023
[36]

arXiv preprint arXiv:2402.05935 (2024)

Gao, P., Zhang, R., Liu, C., Qiu, L., Huang, S., Lin, W., Zhao, S., Geng, S., Lin, Z., Jin, P., et al.: Sphinx-x: Scaling data and parameters for a family of multi-modal large language models. arXiv preprint arXiv:2402.05935 (2024)

work page arXiv 2024
[37]

Multimodal-gpt: A vision and language model for dialogue with humans

Gong, T., Lyu, C., Zhang, S., Wang, Y., Zheng, M., Zhao, Q., Liu, K., Zhang, W., Luo, P., Chen, K.: Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790 (2023) MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training 19

work page arXiv 2023
[38]

In: CVPR (2017)

Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In: CVPR (2017)

work page 2017
[39]

In: CVPR (2018)

Gurari, D., Li, Q., Stangl, A.J., Guo, A., Lin, C., Grauman, K., Luo, J., Bigham, J.P.: Vizwiz grand challenge: Answering visual questions from blind people. In: CVPR (2018)

work page 2018
[40]

In: CVPR (2022)

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: CVPR (2022)

work page 2022
[41]

In: CVPR (2016)

He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)

work page 2016
[42]

Efficient multimodal learning from data-centric perspective.arXiv preprint arXiv:2402.11530, 2024

He, M., Liu, Y., Wu, B., Yuan, J., Wang, Y., Huang, T., Zhao, B.: Efficient mul- timodal learning from data-centric perspective. arXiv preprint arXiv:2402.11530 (2024)

work page arXiv 2024
[43]

Scaling Laws for Autoregressive Generative Modeling

Henighan, T., Kaplan, J., Katz, M., Chen, M., Hesse, C., Jackson, J., Jun, H., Brown, T.B., Dhariwal, P., Gray, S., et al.: Scaling laws for autoregressive gener- ative modeling. arXiv preprint arXiv:2010.14701 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2010
[44]

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L.A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J.W., Vinyals, O., Sifre, L.: Training compute-optimal large language models (2022)

work page 2022
[45]

Huang, S., Dong, L., Wang, W., Hao, Y., Singhal, S., Ma, S., Lv, T., Cui, L., Mohammed, O.K., Patra, B., Liu, Q., Aggarwal, K., Chi, Z., Bjorck, J., Chaudhary, V., Som, S., Song, X., Wei, F.: Language is not all you need: Aligning perception with language models (2023)

work page 2023
[46]

In: CVPR (2019)

Hudson, D.A., Manning, C.D.: Gqa: A new dataset for real-world visual reasoning and compositional question answering. In: CVPR (2019)

work page 2019
[47]

https://huggingface.co/blog/idefics (2023)

IDEFICS: Introducing idefics: An open reproduction of state-of-the-art visual language model. https://huggingface.co/blog/idefics (2023)

work page 2023
[48]

Isik, B., Ponomareva, N., Hazimeh, H., Paparas, D., Vassilvitskii, S., Koyejo, S.: Scaling laws for downstream task performance of large language models (2024)

work page 2024
[49]

Jiang, A.Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D.S., de las Casas, D., Hanna, E.B., Bressand, F., Lengyel, G., Bour, G., Lample, G., Lavaud, L.R., Saulnier, L., Lachaux, M.A., Stock, P., Subramanian, S., Yang, S., Antoniak, S., Scao, T.L., Gervet, T., Lavril, T., Wang, T., Lacroix, T., Sayed, W.E.: Mixtral of experts (2024)

work page 2024
[50]

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

Joshi, M., Choi, E., Weld, D.S., Zettlemoyer, L.: Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[51]

In: CVPR (2018)

Kafle, K., Price, B., Cohen, S., Kanan, C.: Dvqa: Understanding data visualizations via question answering. In: CVPR (2018)

work page 2018
[52]

In: ECCV (2016)

Kembhavi, A., Salvato, M., Kolve, E., Seo, M., Hajishirzi, H., Farhadi, A.: A diagram is worth a dozen images. In: ECCV (2016)

work page 2016
[53]

In: ECCV (2022)

Kim, G., Hong, T., Yim, M., Nam, J., Park, J., Yim, J., Hwang, W., Yun, S., Han, D., Park, S.: Ocr-free document understanding transformer. In: ECCV (2022)

work page 2022
[54]

arXiv preprint arXiv:2305.17216 (2023)

Koh, J.Y., Fried, D., Salakhutdinov, R.: Generating images with multimodal language models. arXiv preprint arXiv:2305.17216 (2023)

work page arXiv 2023
[55]

In: ICLR (2023)

Komatsuzaki, A., Puigcerver, J., Lee-Thorp, J., Ruiz, C.R., Mustafa, B., Ainslie, J., Tay, Y., Dehghani, M., Houlsby, N.: Sparse upcycling: Training mixture-of-experts from dense checkpoints. In: ICLR (2023)

work page 2023
[56]

arXiv preprint arXiv:2308.00692 (2023)

Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J.: Lisa: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692 (2023)

work page arXiv 2023
[57]

arXiv preprint arXiv:2310.07699 (2023)

Lai, Z., Zhang, H., Wu, W., Bai, H., Timofeev, A., Du, X., Gan, Z., Shan, J., Chuah, C.N., Yang, Y., et al.: From scarcity to efficiency: Improving clip training via visual-enriched captions. arXiv preprint arXiv:2310.07699 (2023)

work page arXiv 2023
[58]

Laurençon, H., Saulnier, L., Tronchon, L., Bekman, S., Singh, A., Lozhkov, A., Wang, T., Karamcheti, S., Rush, A.M., Kiela, D., Cord, M., Sanh, V.: Obelics: An open web-scale filtered dataset of interleaved image-text documents (2023)

work page 2023
[59]

In: ICLR (2021)

Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., Chen, Z.: GShard: Scaling giant models with conditional computation and automatic sharding. In: ICLR (2021)

work page 2021
[60]

arXiv preprint arXiv:2306.05425 (2023)

Li, B., Zhang, Y., Chen, L., Wang, J., Pu, F., Yang, J., Li, C., Liu, Z.: Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425 (2023)

work page arXiv 2023
[61]

Otter: A Multi-Modal Model with In-Context Instruction Tuning

Li, B., Zhang, Y., Chen, L., Wang, J., Yang, J., Liu, Z.: Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[62]

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

Li, B., Wang, R., Wang, G., Ge, Y., Ge, Y., Shan, Y.: Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[63]

arXiv preprint arXiv:2309.10020 (2023)

Li, C., Gan, Z., Yang, Z., Yang, J., Li, L., Wang, L., Gao, J.: Multimodal foundation models: From specialists to general-purpose assistants. arXiv preprint arXiv:2309.10020 (2023)

work page arXiv 2023
[64]

Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models (2023)

work page 2023
[65]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[66]

arXiv preprint arXiv:2306.04387 (2023)

Li, L., Yin, Y., Li, S., Chen, L., Wang, P., Ren, S., Li, M., Yang, Y., Xu, J., Sun, X., et al.: M3it: A large-scale dataset towards multi-modal multilingual instruction tuning. arXiv preprint arXiv:2306.04387 (2023)

work page arXiv 2023
[67]

VisualBERT: A Simple and Performant Baseline for Vision and Language

Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J., Chang, K.W.: Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1908
[68]

Evaluating Object Hallucination in Large Vision-Language Models

Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[69]

Monkey: Image resolution and text label are important things for large multi-modal models

Li, Z., Yang, B., Liu, Q., Ma, Z., Zhang, S., Yang, J., Sun, Y., Liu, Y., Bai, X.: Monkey: Image resolution and text label are important things for large multi-modal models. arXiv preprint arXiv:2311.06607 (2023)

work page arXiv 2023
[70]

Lin, B., Tang, Z., Ye, Y., Cui, J., Zhu, B., Jin, P., Huang, J., Zhang, J., Ning, M., Yuan, L.: Moe-llava: Mixture of experts for large vision-language models (2024)

work page 2024
[71]

Vila: On pre-training for visual language models, 2024

Lin, J., Yin, H., Ping, W., Lu, Y., Molchanov, P., Tao, A., Mao, H., Kautz, J., Shoeybi, M., Han, S.: Vila: On pre-training for visual language models. arXiv preprint arXiv:2312.07533 (2023)

work page arXiv 2023
[72]

Microsoft COCO: Common Objects in Context

Lin, T., Maire, M., Belongie, S.J., Bourdev, L.D., Girshick, R.B., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects in context. arXiv preprint arXiv:1405.0312 (2014)

work page internal anchor Pith review Pith/arXiv arXiv 2014
[73]

Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models

Lin, Z., Liu, C., Zhang, R., Gao, P., Qiu, L., Xiao, H., Qiu, H., Lin, C., Shao, W., Chen, K., et al.: Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575 (2023)

work page arXiv 2023
[74]

Improved Baselines with Visual Instruction Tuning

Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[75]

https://llava-vl.github.io/blog/2024-01-30-llava-next/

Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., Lee, Y.J.: Llava-next: Improved reasoning, ocr, and world knowledge (January 2024), https://llava-vl.github.io/blog/2024-01-30-llava-next/

work page 2024
[76]

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning (2023)

work page 2023
[77]

Llava-plus: Learning to use tools for creating multimodal agents

Liu, S., Cheng, H., Liu, H., Zhang, H., Li, F., Ren, T., Zou, X., Yang, J., Su, H., Zhu, J., et al.: Llava-plus: Learning to use tools for creating multimodal agents. arXiv preprint arXiv:2311.05437 (2023)

work page arXiv 2023
[78]

Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[79]

NeurIPS (2019)

Lu, J., Batra, D., Parikh, D., Lee, S.: Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. NeurIPS (2019)

work page 2019
[80]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K.W., Galley, M., Gao, J.: Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

Showing first 80 references.