pith. machine review for the scientific record.

arxiv: 2306.13549 · v4 · submitted 2023-06-23 · 💻 cs.CV · cs.AI · cs.CL · cs.LG

Recognition: 2 theorem links · Lean Theorem

A Survey on Multimodal Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 02:48 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL · cs.LG
keywords multimodal large language models · MLLM · GPT-4V · emergent capabilities · multimodal reasoning · vision-language models · multimodal hallucination · artificial general intelligence
0 comments

The pith

Multimodal large language models use an LLM as a central brain to handle images and other inputs, displaying new emergent reasoning skills.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey reviews the fast rise of multimodal large language models that combine large language models with visual and other data sources. It covers their basic structure, training on mixed datasets, and evaluation on tasks such as image description and reasoning. The work examines extensions to finer details, more data types, languages, and real-world uses, plus problems like false outputs from images. It closes by listing current limits and open research directions in a field that may lead toward broader artificial intelligence systems.

Core claim

The paper claims that multimodal large language models, represented by GPT-4V, use powerful large language models as a brain to perform multimodal tasks and display surprising emergent capabilities, such as writing stories based on images and OCR-free math reasoning, that are rare in traditional multimodal methods. It summarizes their formulation, architecture, training strategy, data, and evaluation; extensions to finer granularity, more modalities, languages, and scenarios; multimodal hallucination; extended techniques including M-ICL, M-CoT, and LAVR; and open challenges and promising directions.

What carries the argument

The central object is the large language model used as a unifying brain to process and reason over combined multimodal inputs through shared architectures and joint training.
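To make the "LLM as a unifying brain" framing concrete, here is a minimal sketch of the three-part layout most MLLMs in the survey share: a vision encoder, a projector (connector), and an LLM backbone that reasons over the projected visual tokens together with text tokens. The module names, dimensions, and the toy transformer standing in for the LLM are illustrative assumptions, not components specified by the paper.

```python
# Minimal sketch of the LLM-as-brain architecture: a vision encoder, a
# lightweight projector, and an LLM backbone over a joint token sequence.
# All modules and sizes are toy stand-ins chosen for illustration.
import torch
import torch.nn as nn

class ToyMultimodalLM(nn.Module):
    def __init__(self, vision_dim=768, llm_dim=512, vocab_size=32000):
        super().__init__()
        # Stand-in for a pretrained vision encoder (e.g. a CLIP-style ViT).
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)
        # Projector mapping visual features into the LLM token space;
        # many MLLMs train mainly this connector while freezing the rest.
        self.projector = nn.Linear(vision_dim, llm_dim)
        # Stand-in for the LLM backbone (a decoder-only transformer in practice).
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image_feats, text_ids):
        # image_feats: (batch, n_patches, vision_dim); text_ids: (batch, seq)
        visual_tokens = self.projector(self.vision_encoder(image_feats))
        text_tokens = self.text_embed(text_ids)
        # Visual tokens are prepended to the text tokens, so the LLM
        # attends over both modalities in one shared context.
        joint = torch.cat([visual_tokens, text_tokens], dim=1)
        hidden = self.llm(joint)
        # Predict next tokens only over the text positions.
        return self.lm_head(hidden[:, visual_tokens.size(1):])

model = ToyMultimodalLM()
logits = model(torch.randn(1, 16, 768), torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # torch.Size([1, 8, 32000])
```

The joint sequence is the load-bearing step: once visual features live in the same token space as text, the LLM's existing reasoning machinery applies to both.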

If this is right

  • MLLMs can be extended to support finer granularity, additional modalities, more languages, and complex scenarios.
  • Techniques such as multimodal in-context learning, multimodal chain-of-thought reasoning, and LLM-aided visual reasoning improve performance on multimodal tasks (a prompt-construction sketch follows this list).
  • Tackling multimodal hallucination is required for dependable real-world applications.
  • Continued progress in this area may open a route toward artificial general intelligence.
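To ground the in-context learning and chain-of-thought bullets above, the sketch below assembles a multimodal in-context learning (M-ICL) prompt with a chain-of-thought (M-CoT) cue. The interleaved message schema, field names, and example file names are illustrative assumptions for a generic chat-style MLLM interface, not an API described in the survey.

```python
# Illustrative construction of an M-ICL prompt with an M-CoT cue.
# The message format is a generic interleaved image-text schema assumed
# for illustration; adapt it to whatever interface your model exposes.
def build_micl_prompt(demos, query_image, question):
    """demos: list of (image, question, reasoning, answer) few-shot examples."""
    messages = []
    for image, q, reasoning, answer in demos:
        messages.append({"role": "user", "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": q},
        ]})
        # Demonstrations include the reasoning chain, not just the answer,
        # nudging the model toward step-by-step multimodal reasoning.
        messages.append({"role": "assistant",
                         "content": f"{reasoning}\nAnswer: {answer}"})
    messages.append({"role": "user", "content": [
        {"type": "image", "image": query_image},
        {"type": "text", "text": question + " Let's think step by step."},
    ]})
    return messages

prompt = build_micl_prompt(
    demos=[("demo.jpg", "How many apples are on the table?",
            "There are two apples on the left and one on the right.", "3")],
    query_image="query.jpg",
    question="What is the total price shown on the receipt?",
)
print(len(prompt))  # 3 messages: one demonstration pair plus the query turn
```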

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Unified LLM-centered models may replace earlier separate-modality approaches in many vision-language settings.
  • Adding real-time video or audio streams could test whether current emergent skills scale to continuous inputs.
  • The linked repository underscores the value of living resources for tracking fast-changing research areas.

Load-bearing premise

The survey assumes that the cited literature and the associated GitHub repository together provide a sufficiently complete and up-to-date picture of the rapidly evolving MLLM field.

What would settle it

A new review identifying many important recent MLLM papers or key developments absent from this survey and its linked repository would show the summary is incomplete.

read the original abstract

Recently, Multimodal Large Language Model (MLLM) represented by GPT-4V has been a new rising research hotspot, which uses powerful Large Language Models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of MLLM, such as writing stories based on images and OCR-free math reasoning, are rare in traditional multimodal methods, suggesting a potential path to artificial general intelligence. To this end, both academia and industry have endeavored to develop MLLMs that can compete with or even better than GPT-4V, pushing the limit of research at a surprising speed. In this paper, we aim to trace and summarize the recent progress of MLLMs. First of all, we present the basic formulation of MLLM and delineate its related concepts, including architecture, training strategy and data, as well as evaluation. Then, we introduce research topics about how MLLMs can be extended to support more granularity, modalities, languages, and scenarios. We continue with multimodal hallucination and extended techniques, including Multimodal ICL (M-ICL), Multimodal CoT (M-CoT), and LLM-Aided Visual Reasoning (LAVR). To conclude the paper, we discuss existing challenges and point out promising research directions. In light of the fact that the era of MLLM has only just begun, we will keep updating this survey and hope it can inspire more research. An associated GitHub link collecting the latest papers is available at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper is a survey tracing recent progress on Multimodal Large Language Models (MLLMs). It begins with the basic formulation and related concepts of architecture, training strategy, data, and evaluation. It then covers extensions supporting greater granularity, additional modalities, languages, and scenarios, followed by multimodal hallucination and techniques including Multimodal In-Context Learning (M-ICL), Multimodal Chain-of-Thought (M-CoT), and LLM-Aided Visual Reasoning (LAVR). The survey concludes with challenges, promising directions, and an associated GitHub repository for updates.

Significance. If the coverage proves comprehensive, the survey supplies a useful organizational framework for the fast-moving MLLM field, explicitly crediting emergent capabilities such as image-based story writing and OCR-free math reasoning while pointing to an open GitHub repository that collects the latest papers. This combination of structured delineation and a living resource strengthens its value as a reference for researchers working on vision-language integration.

major comments (2)
  1. [Evaluation] The evaluation section does not quantify how well current benchmarks capture the emergent capabilities highlighted in the abstract (e.g., story writing from images); without such analysis the contrast with traditional multimodal methods remains qualitative and weakens the motivation for the survey's scope.
  2. [Training and Data] In the training and data section, the discussion of data curation omits explicit comparison of scale, filtering, and alignment procedures across representative models (LLaVA, MiniGPT-4, etc.), which is load-bearing for readers seeking to reproduce or extend the reported performance trends.
minor comments (3)
  1. [Abstract] The abstract repeats motivational phrasing about AGI that could be shortened without loss of clarity.
  2. [Architecture] Figure captions for architecture diagrams should explicitly label each component (vision encoder, projector, LLM backbone) to match the textual description.
  3. [Introduction] The GitHub repository is mentioned only in the abstract; a short dedicated paragraph in the introduction describing its maintenance policy and coverage criteria would improve usability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the encouraging assessment and the specific comments, which help clarify areas where the survey can be strengthened. We address each major comment below and outline the corresponding revisions.

read point-by-point responses
  1. Referee: [Evaluation] The evaluation section does not quantify how well current benchmarks capture the emergent capabilities highlighted in the abstract (e.g., story writing from images); without such analysis the contrast with traditional multimodal methods remains qualitative and weakens the motivation for the survey's scope.

    Authors: We acknowledge that the evaluation section primarily summarizes existing benchmarks and notes emergent capabilities without providing quantitative metrics on benchmark coverage. As this is a survey, we do not introduce new empirical evaluations; however, we will expand the section with a dedicated paragraph discussing the limitations of current benchmarks in capturing capabilities such as image-based story writing and OCR-free reasoning, referencing any available meta-analyses or studies that quantify these gaps. This addition will make the contrast with traditional methods more explicit while remaining within the survey's scope. revision: partial

  2. Referee: [Training and Data] In the training and data section, the discussion of data curation omits explicit comparison of scale, filtering, and alignment procedures across representative models (LLaVA, MiniGPT-4, etc.), which is load-bearing for readers seeking to reproduce or extend the reported performance trends.

    Authors: We agree that a side-by-side comparison would improve utility for readers. We will insert a new table in the training and data section that explicitly compares data scale, filtering strategies, and alignment procedures for representative models including LLaVA, MiniGPT-4, and others, based on details reported in their original papers. This table will directly address reproducibility needs. revision: yes

Circularity Check

0 steps flagged

No significant circularity; descriptive survey of external literature

full rationale

This paper is a literature survey with no original derivations, equations, quantitative predictions, or first-principles results. Its contribution is organizational: delineating architectures, training strategies, data, evaluations, extensions, hallucination, and techniques like M-ICL and M-CoT drawn from cited external works. The abstract's reference to emergent capabilities is presented as motivation from prior examples rather than a derived claim. No self-citations function as load-bearing justifications for novel results, and no steps reduce to fitted inputs or self-definitions by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey the paper introduces no free parameters, axioms, or invented entities; all technical content is drawn from the referenced prior literature.

pith-pipeline@v0.9.0 · 5606 in / 1054 out tokens · 62176 ms · 2026-05-16T02:48:44.972565+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Cross-Modal Backdoors in Multimodal Large Language Models

    cs.CR 2026-05 unverdicted novelty 8.0

    Poisoning a single connector in MLLMs establishes a reusable latent backdoor pathway that transfers across modalities with over 95% attack success rate under bounded perturbations.

  2. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    cs.CL 2024-09 accept novelty 8.0

    MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

  3. ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction

    cs.CV 2026-04 unverdicted novelty 7.0

    ShredBench shows state-of-the-art MLLMs perform well on intact documents but suffer sharp drops in restoration accuracy as fragmentation increases to 8-16 pieces, indicating insufficient cross-modal semantic reasoning...

  4. EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models

    cs.AI 2026-04 unverdicted novelty 7.0

    EmergentBridge improves zero-shot cross-modal transfer for unpaired modality pairs by learning noisy bridge anchors and enforcing proxy alignment only in the orthogonal subspace to preserve existing anchor alignments.

  5. When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    Layer-wise Laplacian energy of visual attention reveals hallucination emergence in MLLMs and enables LaSCD, a closed-form logit remapping strategy that mitigates hallucinations while preserving general performance.

  6. LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?

    cs.AI 2026-05 unverdicted novelty 6.0

    LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.

  7. OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models

    cs.MM 2026-04 unverdicted novelty 6.0

    OceanPile is a new multimodal corpus with unified data collection, instruction tuning set, and benchmark to train foundation models for ocean science.

  8. EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models

    cs.AI 2026-04 unverdicted novelty 6.0

    EmergentBridge enhances zero-shot cross-modal performance on unpaired modalities by learning noisy bridge anchors from existing alignments and enforcing proxy alignment only in the orthogonal subspace to avoid gradien...

  9. MMaDA: Multimodal Large Diffusion Language Models

    cs.CV 2025-05 unverdicted novelty 6.0

    MMaDA is a unified multimodal diffusion model using mixed chain-of-thought fine-tuning and a new UniGRPO reinforcement learning algorithm that outperforms specialized models in reasoning, understanding, and text-to-im...

  10. MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

    cs.CV 2024-01 conditional novelty 6.0

    MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.

  11. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    cs.CV 2023-11 unverdicted novelty 6.0

    Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

  12. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    cs.CV 2023-06 unverdicted novelty 6.0

    MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.

  13. ARGUS: Policy-Adaptive Ad Governance via Evolving Reinforcement with Adversarial Umpiring

    cs.CL 2026-05 unverdicted novelty 5.0

    ARGUS uses a Prosecutor-Defender-Umpire multi-agent setup plus RAG and chain-of-thought rewards to adapt ad policy enforcement to new regulations using minimal fresh labels.

  14. Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency

    cs.CL 2026-04 conditional novelty 5.0

    Widthwise pruning of LVLM language backbones combined with supervised finetuning and hidden-state distillation recovers over 95% performance using just 5% of data across 3B-7B models.

  15. SALLIE: Safeguarding Against Latent Language & Image Exploits

    cs.CR 2026-04 unverdicted novelty 5.0

    SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.

  16. LLaVA-OneVision: Easy Visual Task Transfer

    cs.CV 2024-08 unverdicted novelty 5.0

    LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

  17. Hallucination of Multimodal Large Language Models: A Survey

    cs.CV 2024-04 accept novelty 5.0

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

  18. The Rise and Potential of Large Language Model Based Agents: A Survey

    cs.AI 2023-09 accept novelty 4.0

    The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.

  19. A Survey on the Memory Mechanism of Large Language Model based Agents

    cs.AI 2024-04 accept novelty 3.0

    A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.

  20. A Survey on Hallucination in Large Vision-Language Models

    cs.CV 2024-02 unverdicted novelty 3.0

    This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.

Reference graph

Works this paper leans on

209 extracted references · 209 canonical work pages · cited by 19 Pith papers · 59 internal anchors
