Toward Native Multimodal Modeling: A Roadmap

Biao Liu; Bo Ke; Difan Zou; Di Yin; Haopeng Wang; Junnan Dong; Junru Lu; Keyu Chen; Linhao Luo; Qiufeng Wang

arxiv: 2605.25343 · v1 · pith:B3BTNA67new · submitted 2026-05-25 · 💻 cs.CV

Toward Native Multimodal Modeling: A Roadmap

Siyu An , Junru Lu , Junnan Dong , Qiufeng Wang , Yinghui Li , Weizhi Fei , Zichao Yu , Zheng Yuan

show 13 more authors

Biao Liu Haopeng Wang Renzhao Liang Yixuan Yang Yunhang Shen Bo Ke Keyu Chen Linhao Luo Difan Zou Xiao Huang Di Yin Ruizhi Qiao Xing Sun

This is my paper

Pith reviewed 2026-06-29 22:56 UTC · model grok-4.3

classification 💻 cs.CV

keywords native multimodal modelingearly fusionmid fusionunified transformermultimodal generationcross-modal understandingroadmap

0 comments

The pith

Native multimodal modeling integrates modalities intrinsically within a unified transformer for superior performance over late-fusion methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper maps the shift from late-fusion multimodal systems that assemble separate encoders with frozen language models to native multimodal modeling that integrates modalities from the start. It formally defines architectural nativity by separating early-fusion and mid-fusion designs from non-native ones. Existing work is organized into three input-output categories: Multi-to-Text for comprehension with text output, Multi-to-Target for generating images or other media, and Multi-to-Multi for symmetric unified handling. The authors lay out an end-to-end industrial pipeline covering architecture coordination, data curation, training, inference, and evaluation to reach a state where understanding and generation coexist in one transformer.

Core claim

Native multimodal modeling with the intrinsic integration of modalities achieves superior multimodal performance compared to late-fusion approaches, enabling understanding and generation to coexist within a unified transformer paradigm.

What carries the argument

The formal definition of architectural nativity that distinguishes mid-fusion and early-fusion from non-native paradigms, together with the organization of models by input-output duality into the three categories Multi-to-Text, Multi-to-Target, and Multi-to-Multi.

If this is right

A single transformer architecture can handle both cross-modal comprehension and scenario-oriented generation without separate output heads.
Industrial-scale native models require coordinated design across data curation, full-stack training, and deployment rather than modular assembly.
Evaluation must shift from isolated modality tasks to unified tests that measure seamless understanding-generation coexistence.
Reorganizing prior models under the input-output lens reveals gaps that future native designs can target.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The roadmap's emphasis on early integration may reduce reliance on frozen unimodal encoders in large-scale systems.
If the three categories prove exhaustive, they could serve as a template for standardizing multimodal system taxonomies beyond this paper.
The transition described could connect to broader world-modeling goals by making generation and perception native to the same parameters.

Load-bearing premise

The proposed distinctions between mid-fusion, early-fusion, and non-native paradigms, along with the three input-output categories, sufficiently define and cover the design space of native architectures.

What would settle it

A controlled experiment in which a well-tuned late-fusion model matches or exceeds native models across a broad suite of multimodal understanding and generation benchmarks would falsify the claimed superiority.

Figures

Figures reproduced from arXiv: 2605.25343 by Biao Liu, Bo Ke, Difan Zou, Di Yin, Haopeng Wang, Junnan Dong, Junru Lu, Keyu Chen, Linhao Luo, Qiufeng Wang, Renzhao Liang, Ruizhi Qiao, Siyu An, Weizhi Fei, Xiao Huang, Xing Sun, Yinghui Li, Yixuan Yang, Yunhang Shen, Zheng Yuan, Zichao Yu.

**Figure 1.** Figure 1: A sketched overview of the evolutionary landscape in the paradigms of multimodal modeling. In the final stage, the model achieves a born-native state where all modalities are processed within a unified transformer space, facilitating symmetric multi-to-multi understanding and generation. 1 arXiv:2605.25343v1 [cs.CV] 25 May 2026 [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗

**Figure 2.** Figure 2: Evolutionary timeline and functional taxonomy of Native Multimodal Foundation Models (2023–2026). The upward trajectory charts the historical progression from early mid-fusion alignment to the early-fusion methods, i.e., born-native transformer. 2 Task Formalization 2.1 What is Native? Formalizing Cross-modal Fusion Nativity To establish a rigorous boundary for native multimodal modeling, we formalize the … view at source ↗

**Figure 3.** Figure 3: Illustrative examples of three primary NMM architectures considering input-output duality. Beyond the depth of architectural integration, the degree of native capability is inherently bounded by the inputoutput modality flow. We formalize this taxonomy from the perspective of modality duality and structural symmetry, mapping the native landscape into three progressive paradigms. Multi-to-Text (M2T) Unim… view at source ↗

**Figure 4.** Figure 4: A hierarchical taxonomy of the major technical challenges, core design axes, and representative NMM systems (as listed in [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Training × Fusion: a stage-by-regime grid. Rows = training stages (PT/SFT/RL/OPD); columns = fusion regimes (late/mid/early). Every rightward arrow is an architectural necessity, not a stylistic choice. 1 PT late→mid: differential LR becomes mandatory once the encoder receives gradients. 2 PT mid→early: z-loss and QK-Norm become preconditions; the modality-mixture schedule replaces differential LR. 3 RL mi… view at source ↗

read the original abstract

Multimodal modeling represents a vital step from modality-agnostic reasoning toward world modeling. While early approaches predominantly rely on late-fusion that assembles encoders and frozen language backbones with output heads, recent efforts have shifted the paradigm toward native multimodal modeling (NMM) with the intrinsic integration of modalities for superior multimodal performance. Despite its potential, the design space of native architectures remains insufficiently defined. In this paper, we present the community with a formalized roadmap for this transition. Specifically, we formally define the architectural nativity, distinguishing mid-fusion and early-fusion from non-native paradigms. We further organize the existing native models through the lens of input-output duality into three categories: (i) Multi-to-Text for cross-modal comprehension with text-only output; (ii) Multi-to-Target for scenario-oriented generation, e.g., image, audio and video generation, and (iii) Multi-to-Multi for unified modeling with symmetric input-output. We deliver a comprehensive and industrial-grade investigation into the transition toward the definitive NMM framework, where understanding and generation seamlessly coexist within a unified transformer paradigm. We systematically unpack the end-to-end pipeline from industrial perspectives from architectural coordination, massive data curation, to full-stack training recipes, inference & deployment, and the comprehensive evaluation for truly native modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Roadmap that defines native multimodal categories and pipeline but asserts performance gains without new evidence or comparisons.

read the letter

This paper mainly organizes existing multimodal work by defining architectural nativity (mid-fusion and early-fusion versus non-native) and sorting models into three input-output groups: Multi-to-Text, Multi-to-Target, and Multi-to-Multi. It also sketches an end-to-end industrial pipeline covering data, training, inference, and evaluation.

It does a reasonable job giving clear labels and a practical framing for the shift toward unified transformers that handle understanding and generation together. The pipeline discussion draws from real deployment concerns and could help teams structure their own efforts.

The soft spot is the central premise that native integration delivers superior performance. The text references recent efforts for this but supplies no new benchmarks, ablations, or aggregated metrics to show consistent gains over late-fusion baselines. That leaves the advantage as a starting assumption rather than a demonstrated result. The three categories are useful for organization but may leave edge cases unaddressed.

This is for researchers and engineers who want a high-level map of multimodal trends rather than novel methods or data. It shows straightforward thinking in laying out the space.

I would send it for peer review as a survey-style contribution so referees can assess the completeness of the references and whether the distinctions add lasting value.

Referee Report

2 major / 0 minor

Summary. The paper claims to provide a formalized roadmap for transitioning to native multimodal modeling (NMM), which features intrinsic integration of modalities in unified transformers. It defines architectural nativity by distinguishing mid-fusion and early-fusion from non-native late-fusion paradigms, organizes existing models into three input-output categories (Multi-to-Text, Multi-to-Target, Multi-to-Multi), and unpacks an end-to-end industrial pipeline covering architecture coordination, data curation, training recipes, inference, deployment, and evaluation, with the goal of enabling coexisting understanding and generation.

Significance. If the definitions and categorization hold, the roadmap could help standardize terminology and guide development of intrinsically integrated multimodal systems beyond late-fusion baselines. The industrial pipeline discussion adds practical value for scaling such models, though the motivating claim of superior performance remains unsubstantiated within the manuscript itself.

major comments (2)

[Abstract] Abstract: The assertion that recent efforts have shifted toward NMM 'with the intrinsic integration of modalities for superior multimodal performance' is presented as a premise without new empirical results, meta-analysis, or specific quantitative comparisons to late-fusion baselines; this is load-bearing for motivating the entire roadmap.
[Section on input-output categories] Section on input-output categories: The claim that the three categories (Multi-to-Text, Multi-to-Target, Multi-to-Multi) via input-output duality sufficiently organize and cover the design space of native architectures lacks explicit mapping criteria, boundary cases, or demonstration of exhaustiveness, which directly affects the roadmap's claimed comprehensiveness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our roadmap manuscript. Below we address each major comment point by point, indicating planned revisions where appropriate.

read point-by-point responses

Referee: [Abstract] Abstract: The assertion that recent efforts have shifted toward NMM 'with the intrinsic integration of modalities for superior multimodal performance' is presented as a premise without new empirical results, meta-analysis, or specific quantitative comparisons to late-fusion baselines; this is load-bearing for motivating the entire roadmap.

Authors: We acknowledge that the manuscript is a roadmap paper and does not contain new empirical results or meta-analyses comparing NMM to late-fusion approaches. The statement in the abstract reflects an observed trend in the literature we cite rather than a claim supported by new data within this work. To address the concern, we will revise the abstract to describe the shift as an emerging architectural direction without asserting unsubstantiated superiority, and we will add a clarifying sentence in the introduction noting that the roadmap is motivated by design trends rather than proven performance gains. revision: partial
Referee: [Section on input-output categories] Section on input-output categories: The claim that the three categories (Multi-to-Text, Multi-to-Target, Multi-to-Multi) via input-output duality sufficiently organize and cover the design space of native architectures lacks explicit mapping criteria, boundary cases, or demonstration of exhaustiveness, which directly affects the roadmap's claimed comprehensiveness.

Authors: We agree that the current presentation would benefit from greater explicitness. In the revised manuscript we will expand the relevant section to define explicit mapping criteria based on input-output modality combinations, include a discussion of boundary cases (e.g., models that straddle categories or hybrid designs), and provide a brief argument for exhaustiveness by illustrating how representative architectures map to the three categories. A summary table will be added to improve clarity and demonstrate coverage. revision: yes

Circularity Check

0 steps flagged

No circularity: roadmap defines categories and organizes models without derivations or self-referential reductions

full rationale

The paper is a definitional roadmap that introduces architectural nativity distinctions (mid-fusion/early-fusion vs non-native) and input-output categories (Multi-to-Text, Multi-to-Target, Multi-to-Multi) to organize existing work. It asserts superiority of native multimodal modeling based on 'recent efforts' but supplies no equations, fitted parameters, uniqueness theorems, or self-citation chains that reduce any claim to its own inputs by construction. The distinctions function as an organizing framework rather than a derived result; the central premise remains a motivating premise external to any internal reduction. No load-bearing circular steps exist.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract alone; the work focuses on definitions, categorization, and high-level pipeline description without mathematical or empirical components.

pith-pipeline@v0.9.1-grok · 5831 in / 1142 out tokens · 30106 ms · 2026-06-29T22:56:12.900946+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LLM can Read Spectrogram: Encoder-free Speech-Language Modeling
eess.AS 2026-06 unverdicted novelty 6.0

Mel-LLM shows that LLMs can process Mel spectrograms directly for competitive ASR performance without a dedicated speech encoder, with limited degradation versus encoder-based versions when using multimodal initializa...

Reference graph

Works this paper leans on

291 extracted references · 163 canonical work pages · cited by 1 Pith paper · 86 internal anchors

[1]

Youtu-llm: Unlocking the native agentic potential for lightweight large language models.arXiv preprint arXiv:2512.24618, 2025

Junru Lu, Jiarui Qin, Lingfeng Qiao, Yinghui Li, Xinyi Dai, Bo Ke, Jianfeng He, Ruizhi Qiao, Di Yin, Xing Sun, et al. Youtu-llm: Unlocking the native agentic potential for lightweight large language models.arXiv preprint arXiv:2512.24618, 2025

work page arXiv 2025
[2]

Clr-bench: Evaluating large language models in college-level reasoning.arXiv preprint arXiv:2410.17558, 2024

Junnan Dong, Zijin Hong, Yuanchen Bei, Feiran Huang, Xinrun Wang, and Xiao Huang. Clr-bench: Evaluating large language models in college-level reasoning.arXiv preprint arXiv:2410.17558, 2024

work page arXiv 2024
[3]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Qwen3-VL Technical Report

Qwen Team. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025. 33 NMM Roadmap

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Beyond language modeling: An exploration of multimodal pretraining

Shengbang Tong, David Fan, John Nguyen, Ellis Brown, Gaoyue Zhou, Shengyi Qian, Boyang Zheng, Théophane Vallaeys, Junlin Han, Rob Fergus, et al. Beyond language modeling: An exploration of multimodal pretraining. arXiv preprint arXiv:2603.03276, 2026

work page arXiv 2026
[7]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

The revolution of multimodal large language models: A survey.Findings of the association for computational linguistics: ACL 2024, pages 13590–13618, 2024

Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, and Rita Cucchiara. The revolution of multimodal large language models: A survey.Findings of the association for computational linguistics: ACL 2024, pages 13590–13618, 2024

2024
[9]

A survey on multimodal large language models.National Science Review, 11, 2023

Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models.National Science Review, 11, 2023

2023
[10]

Modality-aware integration with large language models for knowledge-based visual question answering

Junnan Dong, Qinggang Zhang, Huachi Zhou, Daochen Zha, Pai Zheng, and Xiao Huang. Modality-aware integration with large language models for knowledge-based visual question answering. InACL, pages 2417– 2429, 2024

2024
[12]

Emu3.5: Native Multimodal Models are World Learners

Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, Yueze Wang, Chengyuan Wang, Fan Zhang, Yingli Zhao, Ting Pan, Xianduo Li, Zecheng Hao, Wenxuan Ma, Zhuo Chen, Yulong Ao, Tiejun Huang, Zhongyuan Wang, and Xinlong Wang. Emu3.5: Native multimodal models are world learners.arXiv prepr...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

LLaVA-Video: Video Instruction Tuning With Synthetic Data

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Llava-video: Video instruction tuning with synthetic data.arXiv preprint arXiv:2410.02713, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

DeepSeek-VL: Towards Real-World Vision-Language Understanding

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhu- oshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision-language understanding.arXiv preprint arXiv:2403.05525, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Qwen-Image Technical Report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

Amala Sanjay Deshmukh, Kateryna Chumachenko, Tuomas Rintamaki, Matthieu Le, Tyler Poon, Danial Mohseni Taheri, Ilia Karmanov, Guilin Liu, Jarno Seppanen, Arushi Goel, et al. Nemotron 3 nano omni: Efficient and open multimodal intelligence.arXiv preprint arXiv:2604.24954, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[18]

Mimo-v2.5: Frontier agency and native multimodality, 2026

Xiaomi MiMo Team. Mimo-v2.5: Frontier agency and native multimodality, 2026. Accessed: 2026-04-29

2026
[19]

Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026

Qwen Team. Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026. URL https://qwen. ai/blog?id=qwen3.6-35b-a3b

2026
[20]

Gemma 4: Intelligence-per-parameter for advanced reasoning

Google DeepMind. Gemma 4: Intelligence-per-parameter for advanced reasoning. https://ai.google. dev/gemma/docs/core/model_card_4, 2026

2026
[21]

Kimi K2.5: Visual Agentic Intelligence

Kimi Team, Tongtong Bai, et al. Kimi K2.5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[22]

Glm-5v-turbo: Toward a native foundation model for multimodal agents

Wenyi Hong, Xiaotao Gu, Ziyang Pan, Zhen Yang, Yuting Wang, Yue Wang, Yuanchang Yue, Yu Wang, Yanling Wang, Yan Wang, et al. Glm-5v-turbo: Toward a native foundation model for multimodal agents. Technical report, ZhipuAI, 2026. 34 NMM Roadmap

2026
[23]

The Llama 4 herd: Architecture, training, evaluation, and deployment notes

Aaron Adcock, Aayushi Srivastava, Abhimanyu Dubey, et al. The Llama 4 herd: Architecture, training, evaluation, and deployment notes. Technical report, Meta AI, 2025. Includes Scout and Maverick series

2025
[24]

Qwen2.5-VL Technical Report

Qwen Team. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

CogVLM: Visual Expert for Pretrained Language Models

Weihan Wang et al. CogVLM: Visual expert for pretrained language models.arXiv preprint arXiv:2311.03079, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection.arXiv preprint arXiv:2311.10122, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models.arXiv preprint arXiv:2311.07919, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[28]

HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

Qi Cai, Jingwen Chen, Chengmin Gao, Zijian Gong, Yehao Li, Tao Mei, Yingwei Pan, Yi Peng, Zhaofan Qiu, Ting Yao, Kai Yu, Yiheng Zhang, et al. Hidream-o1-image: A natively unified image generative foundation model with pixel-level unified transformer.arXiv preprint arXiv:2605.11061, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[30]

LTX-2: Efficient Joint Audio-Visual Foundation Model

Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, Eitan Richardson, Guy Shiran, Itay Chachy, Jonathan Chetboun, Michael Finkelson, Michael Kupchick, Nir Zabari, Nitzan Guetta, Noa Kotler, Ofir Bibi, Ori Gordon, Poriya Panet, Roi Benita, Shahar Armon, V...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[31]

Ming-flash-omni: A sparse, unified architecture for multimodal perception and generation.arXiv preprint arXiv:2510.24821, 2026

Inclusion AI. Ming-flash-omni: A sparse, unified architecture for multimodal perception and generation.arXiv preprint arXiv:2510.24821, 2026

work page arXiv 2026
[32]

MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

Junbo Cui, Bokai Xu, Chongyi Wang, Tianyu Yu, Weiyue Sun, Yingjing Xu, Tianran Wang, Zhihui He, Wenshuo Ma, Tianchi Cai, et al. Minicpm-o 4.5: Towards real-time full-duplex omni-modal interaction.arXiv preprint arXiv:2604.27393, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[33]

Kling-Omni Technical Report

Kling Team. Kling-omni technical report.arXiv preprint arXiv:2512.16776, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

HunyuanVideo 1.5 Technical Report

Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, et al. Hunyuanvideo 1.5 technical report.arXiv preprint arXiv:2511.18870, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Wan 2.2: A high-fidelity video generation foundation model, 2025

Wan Team. Wan 2.2: A high-fidelity video generation foundation model, 2025

2025
[37]

Seedream 3.0 Technical Report

Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xuanda Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao, Zhonghua Zhai, Xinyu Zhang, Qi Zhang, Yuwei Zhang, Shijia Zhao, Jianchao Yang, and Weilin Hu...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

Lance: Unified Multimodal Modeling by Multi-Task Synergy

Fengyi Fu, Mengqi Huang, Shaojin Wu, Yunsheng Jiang, Yufei Huo, Hao Li, Yinghang Song, Fei Ding, Jianzhu Guo, Qian He, Zheren Fu, Zhendong Mao, and Yongdong Zhang. Lance: Unified multimodal modeling by multi-task synergy.arXiv preprint arXiv:2605.18678, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[39]

Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

Yangming Shi, Shixiang Zhu, Tao Shen, Zhimiao Yu, Dengsheng Chen, Taicai Chen, Yunfei Yang, Juan Zhou, Chen Cheng, Liang Ma, Xibin Wu, Benxuan Yan, Ge Li, Tuoyu Zhang, Dan Li, Chang Liu, and Zhenbang Sun. Mamoda2.5: Enhancing unified multimodal model with dit-moe.arXiv preprint arXiv:2605.02641, 2026. 35 NMM Roadmap

work page internal anchor Pith review Pith/arXiv arXiv 2026
[40]

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

Zhiheng Liu, Weiming Ren, Xiaoke Huang, Shoufa Chen, Tianhong Li, Mengzhao Chen, Yatai Ji, Sen He, Jonas Schult, Belinda Zeng, Tao Xiang, Wenhu Chen, Ping Luo, Luke Zettlemoyer, and Yuren Cong. Tuna-2: Pixel embeddings beat vision encoders for multimodal understanding and generation.arXiv preprint arXiv:2604.24763, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[41]

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

Haiwen Diao, Penghao Wu, Hanming Deng, Jiahao Wang, Shihao Bai, Silei Wu, Weichen Fan, Wenjie Ye, Wenwen Tong, Xiangyu Fan, et al. Sensenova-u1: Unifying multimodal understanding and generation with neo-unify architecture.arXiv preprint arXiv:2605.12500, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[42]

LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

Inclusion AI. Llada2.0-uni: Unifying multimodal understanding and generation with diffusion large language model.arXiv preprint arXiv:2604.20796, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[43]

Longcat-next: Lexicalizing modalities as discrete tokens.arXiv preprint arXiv:2603.27538, 2026

Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, et al. Longcat-next: Lexicalizing modalities as discrete tokens.arXiv preprint arXiv:2603.27538, 2026

work page arXiv 2026
[44]

Show-o2: Improved Native Unified Multimodal Models

Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

Onecat: Decoder-only auto-regressive model for unified understanding and generation.arXiv preprint arXiv:2509.03498, 2025

Han Li, Xinyu Peng, Yaoming Wang, Zelin Peng, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Wenrui Dai, and Hongkai Xiong. Onecat: Decoder-only auto-regressive model for unified understanding and generation.arXiv preprint arXiv:2509.03498, 2025

work page arXiv 2025
[47]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

DeepSeek-AI. Janus-Pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[49]

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model.arXiv preprint arXiv:2408.11039, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[50]

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Chameleon Team, Meta AI. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[51]

Anygpt: Unified multimodal llm with discrete sequence modeling.arXiv preprint arXiv:2402.12226, 2025

Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yu-Gang Jiang, and Xipeng Qiu. Anygpt: Unified multimodal llm with discrete sequence modeling.arXiv preprint arXiv:2402.12226, 2025

work page arXiv 2025
[52]

MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, et al. Minicpm-v 4.5: Cooking efficient mllms via architecture, data, and training recipe.arXiv preprint arXiv:2509.18154, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[53]

OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models

Han Zhu, Lingxuan Ye, Wei Kang, Zengwei Yao, Liyong Guo, Fangjun Kuang, Zhifeng Han, Weiji Zhuang, Long Lin, and Daniel Povey. Omnivoice: Towards omnilingual zero-shot text-to-speech with diffusion language models.arXiv preprint arXiv:2604.00688, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[54]

Moshi: a speech-text foundation model for real-time dialogue

Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue.arXiv preprint arXiv:2410.00037, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[55]

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, et al. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model.arXiv preprint arXiv:2401.16420, 2024. 36 NMM Roadmap

work page internal anchor Pith review Pith/arXiv arXiv 2024
[56]

Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

Keming Wu, Zuhao Yang, Kaichen Zhang, Shizun Wang, Haowei Zhu, Sicong Leng, Zhongyu Yang, Qijie Wang, Sudong Wang, Ziting Wang, Zili Wang, Hui Zhang, Haonan Wang, Hang Zhou, Yifan Pu, Xingxuan Li, Fangneng Zhan, Bo Li, Lidong Bing, Yuxin Song, Ziwei Liu, Wenhu Chen, Jingdong Wang, Xinchao Wang, Xiaojuan Qi, Shijian Lu, and Bin Wang. Visual generation in t...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[57]

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, et al. Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens.arXiv preprint arXiv:2407.05407, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[58]

Better & Faster Large Language Models via Multi-token Prediction

Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, and Gabriel Synnaeve. Better & faster large language models via multi-token prediction.arXiv preprint arXiv:2404.19737, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[59]

GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot

Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, and Jie Tang. Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot.arXiv preprint arXiv:2412.02612, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[60]

Robust speech recognition via large-scale weak supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational Conference on Machine Learning, 2022

2022
[61]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

2023
[62]

Mini-omni-reasoner: Token-level thinking-in-speaking in large speech models.arXiv preprint arXiv:2508.15827, 2025

Zhifei Xie, Ziyang Ma, Zihang Liu, Kaiyu Pang, Hongyu Li, Jialin Zhang, Yue Liao, Deheng Ye, Chunyan Miao, and Shuicheng Yan. Mini-omni-reasoner: Token-level thinking-in-speaking in large speech models.arXiv preprint arXiv:2508.15827, 2025

work page arXiv 2025
[63]

What about gravity in video generation? post-training newton’s laws with verifiable rewards.arXiv preprint arXiv:2512.00425, 2025

Minh-Quan Le, Yuanzhi Zhu, Vicky Kalogeiton, and Dimitris Samaras. What about gravity in video generation? post-training newton’s laws with verifiable rewards.arXiv preprint arXiv:2512.00425, 2025

work page arXiv 2025
[64]

Physrvg: Physics-aware unified reinforcement learning for video generative models

Qiyuan Zhang, Biao Gong, Shuai Tan, Zheng Zhang, Yujun Shen, Xing Zhu, Yuyuan Li, Kelu Yao, Chunhua Shen, and Changqing Zou. Physrvg: Physics-aware unified reinforcement learning for video generative models. arXiv preprint arXiv:2601.11087, 2026

work page arXiv 2026
[65]

Sam 2: Segment anything in images and videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. In International Conference on Learning Representations, volume 2025, pages 28085–28128, 2025

2025
[66]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.arXiv preprint arXiv:2305.18290, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[67]

U-net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. InInternational Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015

2015
[68]

A convnet for the 2020s

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986, 2022

2022
[69]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

2021
[70]

Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022. 37 NMM Roadmap

2022
[71]

Microsoft COCO Captions: Data Collection and Evaluation Server

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server.arXiv preprint arXiv:1504.00325, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[72]

Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. InProceedings of ACL, 2018

2018
[73]

Yfcc100m: The new data in multimedia research.Communications of the ACM, 59(2):64–73, 2016

Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: The new data in multimedia research.Communications of the ACM, 59(2):64–73, 2016

2016
[74]

Datacomp: In search of the next generation of multimodal datasets.Advances in Neural Information Processing Systems, 36:27092–27112, 2023

Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets.Advances in Neural Information Processing Systems, 36:27092–27112, 2023

2023
[75]

Vqa: Visual question answering

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. InProceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015

2015
[76]

Gqa: A new dataset for real-world visual reasoning and com- positional question answering

Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and com- positional question answering. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019

2019
[77]

Ok-vqa: A visual question answering benchmark requiring external knowledge

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. InProceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019

2019
[78]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in neural information processing systems, 35:2507–2521, 2022

2022
[79]

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.ArXiv, abs/2304.08485, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[80]

Instructblip: Towards general-purpose vision-language models with instruction tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in neural information processing systems, 36:49250–49267, 2023

2023
[81]

Multimodal c4: An open, billion-scale corpus of images interleaved with text.Advances in Neural Information Processing Systems, 36:8958–8974, 2023

Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal c4: An open, billion-scale corpus of images interleaved with text.Advances in Neural Information Processing Systems, 36:8958–8974, 2023

2023
[82]

Obelics: An open web-scale filtered dataset of interleaved image-text documents.Advances in Neural Information Processing Systems, 36:71683–71702, 2023

Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents.Advances in Neural Information Processing Systems, 36:71683–71702, 2023

2023
[83]

Omnicorpus: A unified multimodal corpus of 10 billion-level images interleaved with text

Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, et al. Omnicorpus: A unified multimodal corpus of 10 billion-level images interleaved with text. InInternational Conference on Learning Representations, volume 2025, pages 13647–13689, 2025

2025
[84]

Introducing mantis: a novel multi-domain information seeking dialogues dataset.arXiv preprint arXiv:1912.04639, 2019

Gustavo Penha, Alexandru Balan, and Claudia Hauff. Introducing mantis: a novel multi-domain information seeking dialogues dataset.arXiv preprint arXiv:1912.04639, 2019

work page arXiv 1912
[85]

A corpus for reasoning about natural language grounded in photographs

Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. InProceedings of the 57th annual meeting of the association for computational linguistics, pages 6418–6428, 2019

2019

Showing first 80 references.

[1] [1]

Youtu-llm: Unlocking the native agentic potential for lightweight large language models.arXiv preprint arXiv:2512.24618, 2025

Junru Lu, Jiarui Qin, Lingfeng Qiao, Yinghui Li, Xinyi Dai, Bo Ke, Jianfeng He, Ruizhi Qiao, Di Yin, Xing Sun, et al. Youtu-llm: Unlocking the native agentic potential for lightweight large language models.arXiv preprint arXiv:2512.24618, 2025

work page arXiv 2025

[2] [2]

Clr-bench: Evaluating large language models in college-level reasoning.arXiv preprint arXiv:2410.17558, 2024

Junnan Dong, Zijin Hong, Yuanchen Bei, Feiran Huang, Xinrun Wang, and Xiao Huang. Clr-bench: Evaluating large language models in college-level reasoning.arXiv preprint arXiv:2410.17558, 2024

work page arXiv 2024

[3] [3]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

Qwen3-VL Technical Report

Qwen Team. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025. 33 NMM Roadmap

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

Beyond language modeling: An exploration of multimodal pretraining

Shengbang Tong, David Fan, John Nguyen, Ellis Brown, Gaoyue Zhou, Shengyi Qian, Boyang Zheng, Théophane Vallaeys, Junlin Han, Rob Fergus, et al. Beyond language modeling: An exploration of multimodal pretraining. arXiv preprint arXiv:2603.03276, 2026

work page arXiv 2026

[7] [7]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

The revolution of multimodal large language models: A survey.Findings of the association for computational linguistics: ACL 2024, pages 13590–13618, 2024

Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, and Rita Cucchiara. The revolution of multimodal large language models: A survey.Findings of the association for computational linguistics: ACL 2024, pages 13590–13618, 2024

2024

[9] [9]

A survey on multimodal large language models.National Science Review, 11, 2023

Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models.National Science Review, 11, 2023

2023

[10] [10]

Modality-aware integration with large language models for knowledge-based visual question answering

Junnan Dong, Qinggang Zhang, Huachi Zhou, Daochen Zha, Pai Zheng, and Xiao Huang. Modality-aware integration with large language models for knowledge-based visual question answering. InACL, pages 2417– 2429, 2024

2024

[11] [12]

Emu3.5: Native Multimodal Models are World Learners

Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, Yueze Wang, Chengyuan Wang, Fan Zhang, Yingli Zhao, Ting Pan, Xianduo Li, Zecheng Hao, Wenxuan Ma, Zhuo Chen, Yulong Ao, Tiejun Huang, Zhongyuan Wang, and Xinlong Wang. Emu3.5: Native multimodal models are world learners.arXiv prepr...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [13]

LLaVA-Video: Video Instruction Tuning With Synthetic Data

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Llava-video: Video instruction tuning with synthetic data.arXiv preprint arXiv:2410.02713, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13] [14]

DeepSeek-VL: Towards Real-World Vision-Language Understanding

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhu- oshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision-language understanding.arXiv preprint arXiv:2403.05525, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [15]

Qwen-Image Technical Report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [17]

Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

Amala Sanjay Deshmukh, Kateryna Chumachenko, Tuomas Rintamaki, Matthieu Le, Tyler Poon, Danial Mohseni Taheri, Ilia Karmanov, Guilin Liu, Jarno Seppanen, Arushi Goel, et al. Nemotron 3 nano omni: Efficient and open multimodal intelligence.arXiv preprint arXiv:2604.24954, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[16] [18]

Mimo-v2.5: Frontier agency and native multimodality, 2026

Xiaomi MiMo Team. Mimo-v2.5: Frontier agency and native multimodality, 2026. Accessed: 2026-04-29

2026

[17] [19]

Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026

Qwen Team. Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026. URL https://qwen. ai/blog?id=qwen3.6-35b-a3b

2026

[18] [20]

Gemma 4: Intelligence-per-parameter for advanced reasoning

Google DeepMind. Gemma 4: Intelligence-per-parameter for advanced reasoning. https://ai.google. dev/gemma/docs/core/model_card_4, 2026

2026

[19] [21]

Kimi K2.5: Visual Agentic Intelligence

Kimi Team, Tongtong Bai, et al. Kimi K2.5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[20] [22]

Glm-5v-turbo: Toward a native foundation model for multimodal agents

Wenyi Hong, Xiaotao Gu, Ziyang Pan, Zhen Yang, Yuting Wang, Yue Wang, Yuanchang Yue, Yu Wang, Yanling Wang, Yan Wang, et al. Glm-5v-turbo: Toward a native foundation model for multimodal agents. Technical report, ZhipuAI, 2026. 34 NMM Roadmap

2026

[21] [23]

The Llama 4 herd: Architecture, training, evaluation, and deployment notes

Aaron Adcock, Aayushi Srivastava, Abhimanyu Dubey, et al. The Llama 4 herd: Architecture, training, evaluation, and deployment notes. Technical report, Meta AI, 2025. Includes Scout and Maverick series

2025

[22] [24]

Qwen2.5-VL Technical Report

Qwen Team. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[23] [25]

CogVLM: Visual Expert for Pretrained Language Models

Weihan Wang et al. CogVLM: Visual expert for pretrained language models.arXiv preprint arXiv:2311.03079, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[24] [26]

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection.arXiv preprint arXiv:2311.10122, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[25] [27]

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models.arXiv preprint arXiv:2311.07919, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[26] [28]

HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

Qi Cai, Jingwen Chen, Chengmin Gao, Zijian Gong, Yehao Li, Tao Mei, Yingwei Pan, Yi Peng, Zhaofan Qiu, Ting Yao, Kai Yu, Yiheng Zhang, et al. Hidream-o1-image: A natively unified image generative foundation model with pixel-level unified transformer.arXiv preprint arXiv:2605.11061, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[27] [30]

LTX-2: Efficient Joint Audio-Visual Foundation Model

Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, Eitan Richardson, Guy Shiran, Itay Chachy, Jonathan Chetboun, Michael Finkelson, Michael Kupchick, Nir Zabari, Nitzan Guetta, Noa Kotler, Ofir Bibi, Ori Gordon, Poriya Panet, Roi Benita, Shahar Armon, V...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[28] [31]

Ming-flash-omni: A sparse, unified architecture for multimodal perception and generation.arXiv preprint arXiv:2510.24821, 2026

Inclusion AI. Ming-flash-omni: A sparse, unified architecture for multimodal perception and generation.arXiv preprint arXiv:2510.24821, 2026

work page arXiv 2026

[29] [32]

MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

Junbo Cui, Bokai Xu, Chongyi Wang, Tianyu Yu, Weiyue Sun, Yingjing Xu, Tianran Wang, Zhihui He, Wenshuo Ma, Tianchi Cai, et al. Minicpm-o 4.5: Towards real-time full-duplex omni-modal interaction.arXiv preprint arXiv:2604.27393, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[30] [33]

Kling-Omni Technical Report

Kling Team. Kling-omni technical report.arXiv preprint arXiv:2512.16776, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[31] [34]

HunyuanVideo 1.5 Technical Report

Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, et al. Hunyuanvideo 1.5 technical report.arXiv preprint arXiv:2511.18870, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[32] [36]

Wan 2.2: A high-fidelity video generation foundation model, 2025

Wan Team. Wan 2.2: A high-fidelity video generation foundation model, 2025

2025

[33] [37]

Seedream 3.0 Technical Report

Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xuanda Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao, Zhonghua Zhai, Xinyu Zhang, Qi Zhang, Yuwei Zhang, Shijia Zhao, Jianchao Yang, and Weilin Hu...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[34] [38]

Lance: Unified Multimodal Modeling by Multi-Task Synergy

Fengyi Fu, Mengqi Huang, Shaojin Wu, Yunsheng Jiang, Yufei Huo, Hao Li, Yinghang Song, Fei Ding, Jianzhu Guo, Qian He, Zheren Fu, Zhendong Mao, and Yongdong Zhang. Lance: Unified multimodal modeling by multi-task synergy.arXiv preprint arXiv:2605.18678, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[35] [39]

Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

Yangming Shi, Shixiang Zhu, Tao Shen, Zhimiao Yu, Dengsheng Chen, Taicai Chen, Yunfei Yang, Juan Zhou, Chen Cheng, Liang Ma, Xibin Wu, Benxuan Yan, Ge Li, Tuoyu Zhang, Dan Li, Chang Liu, and Zhenbang Sun. Mamoda2.5: Enhancing unified multimodal model with dit-moe.arXiv preprint arXiv:2605.02641, 2026. 35 NMM Roadmap

work page internal anchor Pith review Pith/arXiv arXiv 2026

[36] [40]

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

Zhiheng Liu, Weiming Ren, Xiaoke Huang, Shoufa Chen, Tianhong Li, Mengzhao Chen, Yatai Ji, Sen He, Jonas Schult, Belinda Zeng, Tao Xiang, Wenhu Chen, Ping Luo, Luke Zettlemoyer, and Yuren Cong. Tuna-2: Pixel embeddings beat vision encoders for multimodal understanding and generation.arXiv preprint arXiv:2604.24763, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[37] [41]

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

Haiwen Diao, Penghao Wu, Hanming Deng, Jiahao Wang, Shihao Bai, Silei Wu, Weichen Fan, Wenjie Ye, Wenwen Tong, Xiangyu Fan, et al. Sensenova-u1: Unifying multimodal understanding and generation with neo-unify architecture.arXiv preprint arXiv:2605.12500, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[38] [42]

LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

Inclusion AI. Llada2.0-uni: Unifying multimodal understanding and generation with diffusion large language model.arXiv preprint arXiv:2604.20796, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[39] [43]

Longcat-next: Lexicalizing modalities as discrete tokens.arXiv preprint arXiv:2603.27538, 2026

Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, et al. Longcat-next: Lexicalizing modalities as discrete tokens.arXiv preprint arXiv:2603.27538, 2026

work page arXiv 2026

[40] [44]

Show-o2: Improved Native Unified Multimodal Models

Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [45]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[42] [46]

Onecat: Decoder-only auto-regressive model for unified understanding and generation.arXiv preprint arXiv:2509.03498, 2025

Han Li, Xinyu Peng, Yaoming Wang, Zelin Peng, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Wenrui Dai, and Hongkai Xiong. Onecat: Decoder-only auto-regressive model for unified understanding and generation.arXiv preprint arXiv:2509.03498, 2025

work page arXiv 2025

[43] [47]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

DeepSeek-AI. Janus-Pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[44] [49]

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model.arXiv preprint arXiv:2408.11039, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[45] [50]

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Chameleon Team, Meta AI. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[46] [51]

Anygpt: Unified multimodal llm with discrete sequence modeling.arXiv preprint arXiv:2402.12226, 2025

Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yu-Gang Jiang, and Xipeng Qiu. Anygpt: Unified multimodal llm with discrete sequence modeling.arXiv preprint arXiv:2402.12226, 2025

work page arXiv 2025

[47] [52]

MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, et al. Minicpm-v 4.5: Cooking efficient mllms via architecture, data, and training recipe.arXiv preprint arXiv:2509.18154, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[48] [53]

OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models

Han Zhu, Lingxuan Ye, Wei Kang, Zengwei Yao, Liyong Guo, Fangjun Kuang, Zhifeng Han, Weiji Zhuang, Long Lin, and Daniel Povey. Omnivoice: Towards omnilingual zero-shot text-to-speech with diffusion language models.arXiv preprint arXiv:2604.00688, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[49] [54]

Moshi: a speech-text foundation model for real-time dialogue

Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue.arXiv preprint arXiv:2410.00037, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[50] [55]

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, et al. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model.arXiv preprint arXiv:2401.16420, 2024. 36 NMM Roadmap

work page internal anchor Pith review Pith/arXiv arXiv 2024

[51] [56]

Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

Keming Wu, Zuhao Yang, Kaichen Zhang, Shizun Wang, Haowei Zhu, Sicong Leng, Zhongyu Yang, Qijie Wang, Sudong Wang, Ziting Wang, Zili Wang, Hui Zhang, Haonan Wang, Hang Zhou, Yifan Pu, Xingxuan Li, Fangneng Zhan, Bo Li, Lidong Bing, Yuxin Song, Ziwei Liu, Wenhu Chen, Jingdong Wang, Xinchao Wang, Xiaojuan Qi, Shijian Lu, and Bin Wang. Visual generation in t...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[52] [57]

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, et al. Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens.arXiv preprint arXiv:2407.05407, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[53] [58]

Better & Faster Large Language Models via Multi-token Prediction

Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, and Gabriel Synnaeve. Better & faster large language models via multi-token prediction.arXiv preprint arXiv:2404.19737, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[54] [59]

GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot

Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, and Jie Tang. Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot.arXiv preprint arXiv:2412.02612, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[55] [60]

Robust speech recognition via large-scale weak supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational Conference on Machine Learning, 2022

2022

[56] [61]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

2023

[57] [62]

Mini-omni-reasoner: Token-level thinking-in-speaking in large speech models.arXiv preprint arXiv:2508.15827, 2025

Zhifei Xie, Ziyang Ma, Zihang Liu, Kaiyu Pang, Hongyu Li, Jialin Zhang, Yue Liao, Deheng Ye, Chunyan Miao, and Shuicheng Yan. Mini-omni-reasoner: Token-level thinking-in-speaking in large speech models.arXiv preprint arXiv:2508.15827, 2025

work page arXiv 2025

[58] [63]

What about gravity in video generation? post-training newton’s laws with verifiable rewards.arXiv preprint arXiv:2512.00425, 2025

Minh-Quan Le, Yuanzhi Zhu, Vicky Kalogeiton, and Dimitris Samaras. What about gravity in video generation? post-training newton’s laws with verifiable rewards.arXiv preprint arXiv:2512.00425, 2025

work page arXiv 2025

[59] [64]

Physrvg: Physics-aware unified reinforcement learning for video generative models

Qiyuan Zhang, Biao Gong, Shuai Tan, Zheng Zhang, Yujun Shen, Xing Zhu, Yuyuan Li, Kelu Yao, Chunhua Shen, and Changqing Zou. Physrvg: Physics-aware unified reinforcement learning for video generative models. arXiv preprint arXiv:2601.11087, 2026

work page arXiv 2026

[60] [65]

Sam 2: Segment anything in images and videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. In International Conference on Learning Representations, volume 2025, pages 28085–28128, 2025

2025

[61] [66]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.arXiv preprint arXiv:2305.18290, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[62] [67]

U-net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. InInternational Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015

2015

[63] [68]

A convnet for the 2020s

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986, 2022

2022

[64] [69]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

2021

[65] [70]

Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022. 37 NMM Roadmap

2022

[66] [71]

Microsoft COCO Captions: Data Collection and Evaluation Server

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server.arXiv preprint arXiv:1504.00325, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[67] [72]

Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. InProceedings of ACL, 2018

2018

[68] [73]

Yfcc100m: The new data in multimedia research.Communications of the ACM, 59(2):64–73, 2016

Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: The new data in multimedia research.Communications of the ACM, 59(2):64–73, 2016

2016

[69] [74]

Datacomp: In search of the next generation of multimodal datasets.Advances in Neural Information Processing Systems, 36:27092–27112, 2023

Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets.Advances in Neural Information Processing Systems, 36:27092–27112, 2023

2023

[70] [75]

Vqa: Visual question answering

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. InProceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015

2015

[71] [76]

Gqa: A new dataset for real-world visual reasoning and com- positional question answering

Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and com- positional question answering. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019

2019

[72] [77]

Ok-vqa: A visual question answering benchmark requiring external knowledge

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. InProceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019

2019

[73] [78]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in neural information processing systems, 35:2507–2521, 2022

2022

[74] [79]

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.ArXiv, abs/2304.08485, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[75] [80]

Instructblip: Towards general-purpose vision-language models with instruction tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in neural information processing systems, 36:49250–49267, 2023

2023

[76] [81]

Multimodal c4: An open, billion-scale corpus of images interleaved with text.Advances in Neural Information Processing Systems, 36:8958–8974, 2023

Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal c4: An open, billion-scale corpus of images interleaved with text.Advances in Neural Information Processing Systems, 36:8958–8974, 2023

2023

[77] [82]

Obelics: An open web-scale filtered dataset of interleaved image-text documents.Advances in Neural Information Processing Systems, 36:71683–71702, 2023

Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents.Advances in Neural Information Processing Systems, 36:71683–71702, 2023

2023

[78] [83]

Omnicorpus: A unified multimodal corpus of 10 billion-level images interleaved with text

Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, et al. Omnicorpus: A unified multimodal corpus of 10 billion-level images interleaved with text. InInternational Conference on Learning Representations, volume 2025, pages 13647–13689, 2025

2025

[79] [84]

Introducing mantis: a novel multi-domain information seeking dialogues dataset.arXiv preprint arXiv:1912.04639, 2019

Gustavo Penha, Alexandru Balan, and Claudia Hauff. Introducing mantis: a novel multi-domain information seeking dialogues dataset.arXiv preprint arXiv:1912.04639, 2019

work page arXiv 1912

[80] [85]

A corpus for reasoning about natural language grounded in photographs

Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. InProceedings of the 57th annual meeting of the association for computational linguistics, pages 6418–6428, 2019

2019