pith. machine review for the scientific record.

arxiv: 2306.13549 · v4 · submitted 2023-06-23 · 💻 cs.CV · cs.AI · cs.CL · cs.LG

Recognition: 2 theorem links · Lean Theorem

A Survey on Multimodal Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 02:48 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL · cs.LG
keywords multimodal large language models · MLLM · GPT-4V · emergent capabilities · multimodal reasoning · vision-language models · multimodal hallucination · artificial general intelligence
0 comments

The pith

Multimodal large language models use an LLM as a central brain to handle images and other inputs, displaying new emergent reasoning skills.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey reviews the fast rise of multimodal large language models that combine large language models with visual and other data sources. It covers their basic structure, training on mixed datasets, and evaluation on tasks such as image description and reasoning. The work examines extensions to finer details, more data types, languages, and real-world uses, plus problems like false outputs from images. It closes by listing current limits and open research directions in a field that may lead toward broader artificial intelligence systems.

Core claim

The paper claims that multimodal large language models, represented by GPT-4V, use powerful large language models as a brain to perform multimodal tasks and display surprising emergent capabilities, such as writing stories based on images and OCR-free math reasoning, that are rare in traditional multimodal methods. It summarizes their formulation, architecture, training strategy, data, and evaluation; extensions to finer granularity, more modalities, languages, and scenarios; multimodal hallucination; extended techniques including M-ICL, M-CoT, and LAVR; and open challenges and promising directions.

What carries the argument

The central object is the large language model used as a unifying brain to process and reason over combined multimodal inputs through shared architectures and joint training.
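To make the "LLM as a unifying brain" framing concrete, here is a minimal sketch of the three-part layout most MLLMs in the survey share: a vision encoder, a projector (connector), and an LLM backbone that reasons over the projected visual tokens together with text tokens. The module names, dimensions, and the toy transformer standing in for the LLM are illustrative assumptions, not components specified by the paper.

```python
# Minimal sketch of the LLM-as-brain architecture: a vision encoder, a
# lightweight projector, and an LLM backbone over a joint token sequence.
# All modules and sizes are toy stand-ins chosen for illustration.
import torch
import torch.nn as nn

class ToyMultimodalLM(nn.Module):
    def __init__(self, vision_dim=768, llm_dim=512, vocab_size=32000):
        super().__init__()
        # Stand-in for a pretrained vision encoder (e.g. a CLIP-style ViT).
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)
        # Projector mapping visual features into the LLM token space;
        # many MLLMs train mainly this connector while freezing the rest.
        self.projector = nn.Linear(vision_dim, llm_dim)
        # Stand-in for the LLM backbone (a decoder-only transformer in practice).
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image_feats, text_ids):
        # image_feats: (batch, n_patches, vision_dim); text_ids: (batch, seq)
        visual_tokens = self.projector(self.vision_encoder(image_feats))
        text_tokens = self.text_embed(text_ids)
        # Visual tokens are prepended to the text tokens, so the LLM
        # attends over both modalities in one shared context.
        joint = torch.cat([visual_tokens, text_tokens], dim=1)
        hidden = self.llm(joint)
        # Predict next tokens only over the text positions.
        return self.lm_head(hidden[:, visual_tokens.size(1):])

model = ToyMultimodalLM()
logits = model(torch.randn(1, 16, 768), torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # torch.Size([1, 8, 32000])
```

The joint sequence is the load-bearing step: once visual features live in the same token space as text, the LLM's existing reasoning machinery applies to both.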

If this is right

  • MLLMs can be extended to support finer granularity, additional modalities, more languages, and complex scenarios.
  • Techniques such as multimodal in-context learning, multimodal chain-of-thought reasoning, and LLM-aided visual reasoning improve performance on multimodal tasks (a prompt-construction sketch follows this list).
  • Tackling multimodal hallucination is required for dependable real-world applications.
  • Continued progress in this area may open a route toward artificial general intelligence.
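To ground the in-context learning and chain-of-thought bullets above, the sketch below assembles a multimodal in-context learning (M-ICL) prompt with a chain-of-thought (M-CoT) cue. The interleaved message schema, field names, and example file names are illustrative assumptions for a generic chat-style MLLM interface, not an API described in the survey.

```python
# Illustrative construction of an M-ICL prompt with an M-CoT cue.
# The message format is a generic interleaved image-text schema assumed
# for illustration; adapt it to whatever interface your model exposes.
def build_micl_prompt(demos, query_image, question):
    """demos: list of (image, question, reasoning, answer) few-shot examples."""
    messages = []
    for image, q, reasoning, answer in demos:
        messages.append({"role": "user", "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": q},
        ]})
        # Demonstrations include the reasoning chain, not just the answer,
        # nudging the model toward step-by-step multimodal reasoning.
        messages.append({"role": "assistant",
                         "content": f"{reasoning}\nAnswer: {answer}"})
    messages.append({"role": "user", "content": [
        {"type": "image", "image": query_image},
        {"type": "text", "text": question + " Let's think step by step."},
    ]})
    return messages

prompt = build_micl_prompt(
    demos=[("demo.jpg", "How many apples are on the table?",
            "There are two apples on the left and one on the right.", "3")],
    query_image="query.jpg",
    question="What is the total price shown on the receipt?",
)
print(len(prompt))  # 3 messages: one demonstration pair plus the query turn
```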

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Unified LLM-centered models may replace earlier separate-modality approaches in many vision-language settings.
  • Adding real-time video or audio streams could test whether current emergent skills scale to continuous inputs.
  • The linked repository underscores the value of living resources for tracking fast-changing research areas.

Load-bearing premise

The survey assumes that the cited literature and the associated GitHub repository together provide a sufficiently complete and up-to-date picture of the rapidly evolving MLLM field.

What would settle it

A new review identifying many important recent MLLM papers or key developments absent from this survey and its linked repository would show the summary is incomplete.

read the original abstract

Recently, Multimodal Large Language Model (MLLM) represented by GPT-4V has been a new rising research hotspot, which uses powerful Large Language Models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of MLLM, such as writing stories based on images and OCR-free math reasoning, are rare in traditional multimodal methods, suggesting a potential path to artificial general intelligence. To this end, both academia and industry have endeavored to develop MLLMs that can compete with or even better than GPT-4V, pushing the limit of research at a surprising speed. In this paper, we aim to trace and summarize the recent progress of MLLMs. First of all, we present the basic formulation of MLLM and delineate its related concepts, including architecture, training strategy and data, as well as evaluation. Then, we introduce research topics about how MLLMs can be extended to support more granularity, modalities, languages, and scenarios. We continue with multimodal hallucination and extended techniques, including Multimodal ICL (M-ICL), Multimodal CoT (M-CoT), and LLM-Aided Visual Reasoning (LAVR). To conclude the paper, we discuss existing challenges and point out promising research directions. In light of the fact that the era of MLLM has only just begun, we will keep updating this survey and hope it can inspire more research. An associated GitHub link collecting the latest papers is available at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper is a survey tracing recent progress on Multimodal Large Language Models (MLLMs). It begins with the basic formulation and related concepts of architecture, training strategy, data, and evaluation. It then covers extensions supporting greater granularity, additional modalities, languages, and scenarios, followed by multimodal hallucination and techniques including Multimodal In-Context Learning (M-ICL), Multimodal Chain-of-Thought (M-CoT), and LLM-Aided Visual Reasoning (LAVR). The survey concludes with challenges, promising directions, and an associated GitHub repository for updates.

Significance. If the coverage proves comprehensive, the survey supplies a useful organizational framework for the fast-moving MLLM field, explicitly crediting emergent capabilities such as image-based story writing and OCR-free math reasoning while pointing to an open GitHub repository that collects the latest papers. This combination of structured delineation and a living resource strengthens its value as a reference for researchers working on vision-language integration.

major comments (2)
  1. [Evaluation] The evaluation section does not quantify how well current benchmarks capture the emergent capabilities highlighted in the abstract (e.g., story writing from images); without such analysis the contrast with traditional multimodal methods remains qualitative and weakens the motivation for the survey's scope.
  2. [Training and Data] In the training and data section, the discussion of data curation omits explicit comparison of scale, filtering, and alignment procedures across representative models (LLaVA, MiniGPT-4, etc.), which is load-bearing for readers seeking to reproduce or extend the reported performance trends.
minor comments (3)
  1. [Abstract] The abstract repeats motivational phrasing about AGI that could be shortened without loss of clarity.
  2. [Architecture] Figure captions for architecture diagrams should explicitly label each component (vision encoder, projector, LLM backbone) to match the textual description.
  3. [Introduction] The GitHub repository is mentioned only in the abstract; a short dedicated paragraph in the introduction describing its maintenance policy and coverage criteria would improve usability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the encouraging assessment and the specific comments, which help clarify areas where the survey can be strengthened. We address each major comment below and outline the corresponding revisions.

read point-by-point responses
  1. Referee: [Evaluation] The evaluation section does not quantify how well current benchmarks capture the emergent capabilities highlighted in the abstract (e.g., story writing from images); without such analysis the contrast with traditional multimodal methods remains qualitative and weakens the motivation for the survey's scope.

    Authors: We acknowledge that the evaluation section primarily summarizes existing benchmarks and notes emergent capabilities without providing quantitative metrics on benchmark coverage. As this is a survey, we do not introduce new empirical evaluations; however, we will expand the section with a dedicated paragraph discussing the limitations of current benchmarks in capturing capabilities such as image-based story writing and OCR-free reasoning, referencing any available meta-analyses or studies that quantify these gaps. This addition will make the contrast with traditional methods more explicit while remaining within the survey's scope. revision: partial

  2. Referee: [Training and Data] In the training and data section, the discussion of data curation omits explicit comparison of scale, filtering, and alignment procedures across representative models (LLaVA, MiniGPT-4, etc.), which is load-bearing for readers seeking to reproduce or extend the reported performance trends.

    Authors: We agree that a side-by-side comparison would improve utility for readers. We will insert a new table in the training and data section that explicitly compares data scale, filtering strategies, and alignment procedures for representative models including LLaVA, MiniGPT-4, and others, based on details reported in their original papers. This table will directly address reproducibility needs. revision: yes

Circularity Check

0 steps flagged

No significant circularity; descriptive survey of external literature

full rationale

This paper is a literature survey with no original derivations, equations, quantitative predictions, or first-principles results. Its contribution is organizational: delineating architectures, training strategies, data, evaluations, extensions, hallucination, and techniques like M-ICL and M-CoT drawn from cited external works. The abstract's reference to emergent capabilities is presented as motivation from prior examples rather than a derived claim. No self-citations function as load-bearing justifications for novel results, and no steps reduce to fitted inputs or self-definitions by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey the paper introduces no free parameters, axioms, or invented entities; all technical content is drawn from the referenced prior literature.

pith-pipeline@v0.9.0 · 5606 in / 1054 out tokens · 62176 ms · 2026-05-16T02:48:44.972565+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Cross-Modal Backdoors in Multimodal Large Language Models

    cs.CR 2026-05 unverdicted novelty 8.0

    Poisoning a single connector in MLLMs establishes a reusable latent backdoor pathway that transfers across modalities with over 95% attack success rate under bounded perturbations.

  2. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    cs.CL 2024-09 accept novelty 8.0

    MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

  3. ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction

    cs.CV 2026-04 unverdicted novelty 7.0

    ShredBench shows state-of-the-art MLLMs perform well on intact documents but suffer sharp drops in restoration accuracy as fragmentation increases to 8-16 pieces, indicating insufficient cross-modal semantic reasoning...

  4. EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models

    cs.AI 2026-04 unverdicted novelty 7.0

    EmergentBridge improves zero-shot cross-modal transfer for unpaired modality pairs by learning noisy bridge anchors and enforcing proxy alignment only in the orthogonal subspace to preserve existing anchor alignments.

  5. When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    Layer-wise Laplacian energy of visual attention reveals hallucination emergence in MLLMs and enables LaSCD, a closed-form logit remapping strategy that mitigates hallucinations while preserving general performance.

  6. LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?

    cs.AI 2026-05 unverdicted novelty 6.0

    LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.

  7. OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models

    cs.MM 2026-04 unverdicted novelty 6.0

    OceanPile is a new multimodal corpus with unified data collection, instruction tuning set, and benchmark to train foundation models for ocean science.

  8. EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models

    cs.AI 2026-04 unverdicted novelty 6.0

    EmergentBridge enhances zero-shot cross-modal performance on unpaired modalities by learning noisy bridge anchors from existing alignments and enforcing proxy alignment only in the orthogonal subspace to avoid gradien...

  9. MMaDA: Multimodal Large Diffusion Language Models

    cs.CV 2025-05 unverdicted novelty 6.0

    MMaDA is a unified multimodal diffusion model using mixed chain-of-thought fine-tuning and a new UniGRPO reinforcement learning algorithm that outperforms specialized models in reasoning, understanding, and text-to-im...

  10. MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

    cs.CV 2024-01 conditional novelty 6.0

    MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.

  11. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    cs.CV 2023-11 unverdicted novelty 6.0

    Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

  12. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    cs.CV 2023-06 unverdicted novelty 6.0

    MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.

  13. ARGUS: Policy-Adaptive Ad Governance via Evolving Reinforcement with Adversarial Umpiring

    cs.CL 2026-05 unverdicted novelty 5.0

    ARGUS uses a Prosecutor-Defender-Umpire multi-agent setup plus RAG and chain-of-thought rewards to adapt ad policy enforcement to new regulations using minimal fresh labels.

  14. Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency

    cs.CL 2026-04 conditional novelty 5.0

    Widthwise pruning of LVLM language backbones combined with supervised finetuning and hidden-state distillation recovers over 95% performance using just 5% of data across 3B-7B models.

  15. SALLIE: Safeguarding Against Latent Language & Image Exploits

    cs.CR 2026-04 unverdicted novelty 5.0

    SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.

  16. LLaVA-OneVision: Easy Visual Task Transfer

    cs.CV 2024-08 unverdicted novelty 5.0

    LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

  17. Hallucination of Multimodal Large Language Models: A Survey

    cs.CV 2024-04 accept novelty 5.0

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

  18. The Rise and Potential of Large Language Model Based Agents: A Survey

    cs.AI 2023-09 accept novelty 4.0

    The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.

  19. A Survey on the Memory Mechanism of Large Language Model based Agents

    cs.AI 2024-04 accept novelty 3.0

    A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.

  20. A Survey on Hallucination in Large Vision-Language Models

    cs.CV 2024-02 unverdicted novelty 3.0

    This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.

Reference graph

Works this paper leans on

209 extracted references · 209 canonical work pages · cited by 19 Pith papers · 59 internal anchors
