pith. machine review for the scientific record.

arxiv: 2505.16933 · v2 · submitted 2025-05-22 · 💻 cs.LG · cs.CL · cs.CV

Recognition: 2 theorem links · Lean Theorem

LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 03:40 UTC · model grok-4.3

classification 💻 cs.LG cs.CL cs.CV
keywords multimodal large language models · diffusion models · visual instruction tuning · masked diffusion · multimodal understanding · vision-language models

The pith

A purely diffusion-based multimodal model matches autoregressive leaders on visual instruction tasks by adding a vision encoder to a language diffusion backbone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LLaDA-V as a departure from autoregressive multimodal large language models, instead using a masked diffusion process on a language diffusion model with added visual components. It shows that this setup delivers competitive multimodal performance even when the base language model lags on pure text tasks, and reaches state-of-the-art results among diffusion-based and hybrid MLLMs. The findings indicate that diffusion models can handle multimodal alignment and instruction following without sequential token generation. If correct, this suggests diffusion approaches could scale effectively for vision-language work and reduce reliance on autoregressive decoding constraints.

Core claim

LLaDA-V integrates a vision encoder and MLP connector into the LLaDA language diffusion model to project visual features into the embedding space, enabling masked diffusion training on visual instruction data. When trained on the same data as LLaMA3-V, it proves competitive across multimodal tasks and narrows the gap to Qwen2-VL while outperforming other diffusion-based and hybrid MLLMs, showing that large language diffusion models remain effective for multimodal understanding despite weaker standalone text performance.
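
To make the pathway concrete, the sketch below shows one way a vision encoder and MLP connector can feed a masked-diffusion language backbone; module names, dimensions, the mask-token id, and the choice to mask every text position are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch (not the authors' code) of the pathway the core claim describes:
# vision features -> MLP connector -> language-embedding space -> masked-diffusion loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualConnector(nn.Module):
    """Two-layer MLP that projects vision-encoder features into the LM embedding space."""
    def __init__(self, vis_dim=1152, lm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim)
        )

    def forward(self, vis_feats):            # (B, N_img, vis_dim)
        return self.proj(vis_feats)          # (B, N_img, lm_dim)

def masked_diffusion_loss(backbone, embed, vis_tokens, text_ids, mask_id):
    """One training step: mask a random fraction of text tokens, condition on the
    (never-masked) visual tokens, and train the backbone to recover the originals.
    backbone(inputs_embeds=...) is assumed to return per-position vocabulary logits."""
    B, L = text_ids.shape
    t = torch.rand(B, 1, device=text_ids.device).clamp_min(1e-3)   # masking ratio per sample
    masked = torch.rand(B, L, device=text_ids.device) < t
    noisy_ids = torch.where(masked, torch.full_like(text_ids, mask_id), text_ids)

    inputs = torch.cat([vis_tokens, embed(noisy_ids)], dim=1)        # prepend visual tokens
    logits = backbone(inputs_embeds=inputs)[:, vis_tokens.size(1):]  # keep text positions only

    # Cross-entropy on masked positions, reweighted by 1/t as in masked-diffusion objectives.
    per_token = F.cross_entropy(logits[masked], text_ids[masked], reduction="none")
    return (per_token / t.expand(B, L)[masked]).mean()
```

The backbone is treated here as a black box that maps input embeddings to per-position vocabulary logits, which is the only interface the masked-diffusion objective needs.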

What carries the argument

A masked diffusion process over language tokens, combined with a vision encoder and an MLP connector that maps visual features into the shared embedding space for joint denoising.
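
The carrying mechanism can be written out explicitly. As a reference point (our notation, not an equation quoted from the paper), the masked, absorbing-state diffusion bound used by LLaDA-family models takes the form

```latex
\mathcal{L}(\theta) \;=\;
-\,\mathbb{E}_{\,t \sim \mathcal{U}(0,1],\; x_0,\; x_t}\!
\left[ \frac{1}{t} \sum_{i \,:\, x_t^{i} = \texttt{[MASK]}}
\log p_\theta\!\left(x_0^{i} \,\middle|\, x_t,\; v\right) \right]
```

where x_0 is the clean text sequence, x_t independently replaces each token with [MASK] with probability t, and v denotes the visual tokens produced by the encoder and MLP connector; the loss is taken only over masked positions.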

If this is right

  • The architecture supports better data scalability on multimodal tasks than some autoregressive baselines under identical training conditions.
  • Performance advantages appear concentrated in multimodal understanding rather than pure language modeling.
  • Results encourage replacing or complementing autoregressive decoding with diffusion steps in future multimodal systems.
  • The model narrows the gap to strong autoregressive systems such as Qwen2-VL on shared instruction data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Parallel denoising in diffusion could enable faster or more flexible multimodal output generation than left-to-right token prediction (see the sketch after this list).
  • The approach may generalize to additional modalities if the same encoder-connector pattern is applied to audio or other inputs.
  • Longer context or higher-resolution visual inputs could be tested to measure whether diffusion maintains coherence better than autoregressive models under increased sequence length.
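
On the first extension above, the kind of parallel decoding that separates diffusion from left-to-right generation can be sketched as confidence-gated iterative unmasking; the threshold, step count, and model interfaces below are illustrative assumptions, not the paper's sampler.

```python
# Hedged sketch of confidence-gated parallel unmasking (contrast with one-token-at-a-time
# autoregression). backbone(inputs_embeds=...) is assumed to return per-position logits.
import torch

@torch.no_grad()
def parallel_denoise(backbone, embed, vis_tokens, prompt_ids, resp_len, mask_id,
                     steps=8, threshold=0.9):
    B = prompt_ids.size(0)
    resp = torch.full((B, resp_len), mask_id, device=prompt_ids.device, dtype=prompt_ids.dtype)
    for s in range(steps):
        ids = torch.cat([prompt_ids, resp], dim=1)
        inputs = torch.cat([vis_tokens, embed(ids)], dim=1)
        # Keep only the logits over the response slots.
        logits = backbone(inputs_embeds=inputs)[:, vis_tokens.size(1) + prompt_ids.size(1):]
        conf, pred = logits.softmax(-1).max(-1)          # per-position best token + confidence
        still_masked = resp.eq(mask_id)
        # Commit confident predictions in parallel; on the last step, commit everything left.
        commit = still_masked if s == steps - 1 else (still_masked & (conf > threshold))
        resp = torch.where(commit, pred, resp)
    return resp
```

Whether this parallel commitment preserves coherence for long multimodal outputs is the same limitation the referee's first minor comment asks the authors to discuss.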

Load-bearing premise

That empirical gains on the selected instruction-tuning datasets and benchmarks will hold beyond this specific mixture and that the diffusion process can preserve coherent multimodal alignment without autoregressive sequential constraints.

What would settle it

A clear drop in multimodal benchmark scores when the same model is tested on instruction data drawn from distributions outside the training mixture or when the vision encoder and connector are ablated while keeping the diffusion backbone fixed.

Original abstract

In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure from the autoregressive paradigms dominant in current multimodal approaches. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and MLP connector that projects visual features into the language embedding space, enabling effective multimodal alignment. Our empirical investigation reveals several intriguing results: First, LLaDA-V demonstrates promising multimodal performance despite its language model being weaker on purely textual tasks than counterparts like LLaMA3-8B and Qwen2-7B. When trained on the same instruction data, LLaDA-V is highly competitive to LLaMA3-V across multimodal tasks with better data scalability. It also narrows the performance gap to Qwen2-VL, suggesting the effectiveness of its architecture for multimodal tasks. Second, LLaDA-V achieves state-of-the-art performance in multimodal understanding compared to existing hybrid autoregressive-diffusion and purely diffusion-based MLLMs. Our findings suggest that large language diffusion models show promise in multimodal contexts and warrant further investigation in future research. Project page and codes: https://ml-gsai.github.io/LLaDA-V-demo/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LLaDA-V, a purely diffusion-based Multimodal Large Language Model extending the LLaDA language diffusion model via a vision encoder and MLP connector for visual instruction tuning. It reports multimodal performance competitive with LLaMA3-V when using identical instruction data, better data scalability, a narrowing of the gap to Qwen2-VL, and state-of-the-art results among existing hybrid autoregressive-diffusion and purely diffusion-based MLLMs.

Significance. If the performance claims are shown to hold under controlled comparisons of data volume and vision encoders, the work would establish that masked diffusion models can achieve effective multimodal alignment and instruction following without autoregressive decoding. The reported competitiveness despite a weaker base language model on text-only tasks, together with scalability observations, would provide evidence that diffusion paradigms merit further study as alternatives to dominant autoregressive MLLMs.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): The SOTA claim among hybrid AR-diffusion and pure diffusion MLLMs is load-bearing for the central contribution, yet the manuscript provides no explicit quantification of total training tokens, image-text pairs, or vision encoder parameters for the diffusion baselines. Without this, the reported gains cannot be isolated from potential differences in training mixture size or backbone strength (e.g., CLIP-style encoders).
  2. [§4.2] §4.2 (Main Results): The competitiveness to LLaMA3-V on identical instruction data and the narrowing gap to Qwen2-VL are presented without accompanying statistical significance tests, variance across runs, or details on evaluation data splits. This weakens assessment of whether the diffusion architecture itself drives the observed multimodal gains.
minor comments (2)
  1. [Conclusion] The manuscript could add a dedicated limitations paragraph discussing potential coherence issues in long multimodal sequences under the diffusion process.
  2. [§3] Notation for the masked diffusion objective and the MLP connector projection could be made more explicit with an equation reference in §3.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): The SOTA claim among hybrid AR-diffusion and pure diffusion MLLMs is load-bearing for the central contribution, yet the manuscript provides no explicit quantification of total training tokens, image-text pairs, or vision encoder parameters for the diffusion baselines. Without this, the reported gains cannot be isolated from potential differences in training mixture size or backbone strength (e.g., CLIP-style encoders).

    Authors: We agree that an explicit comparison of training resources would better support the SOTA claim and help isolate architectural effects. In the revised manuscript we will add a table in §4 summarizing the number of training tokens, image-text pairs, and vision-encoder parameters for LLaDA-V and the compared hybrid and pure-diffusion MLLMs, citing the original publications for the baselines. This addition will clarify the scale of the comparisons. revision: yes

  2. Referee: [§4.2] §4.2 (Main Results): The competitiveness to LLaMA3-V on identical instruction data and the narrowing gap to Qwen2-VL are presented without accompanying statistical significance tests, variance across runs, or details on evaluation data splits. This weakens assessment of whether the diffusion architecture itself drives the observed multimodal gains.

    Authors: We acknowledge the value of statistical reporting. The LLaMA3-V comparison uses exactly the same instruction data, which already controls for data volume. All evaluations follow the standard benchmark protocols and splits published with each dataset. Because of the substantial compute required for large-model training, we report single-run results. In revision we will (i) explicitly state the evaluation splits used, (ii) note the consistency of gains across tasks, and (iii) add a limitations paragraph acknowledging the lack of multiple runs and statistical tests. revision: partial

Circularity Check

0 steps flagged

No circularity in empirical performance claims

Full rationale

The paper introduces LLaDA-V by extending a prior diffusion language model with a vision encoder and reports benchmark results after instruction tuning. All central claims consist of observed performance numbers on public multimodal benchmarks rather than any derivation, prediction, or first-principles result that reduces to fitted parameters or self-citations by construction. No equations appear that would turn training objectives back into outputs, and comparisons to LLaMA3-V or Qwen2-VL are presented as empirical observations, not forced equivalences. Self-reference to the base LLaDA model is architectural background and carries no circular reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The architecture relies on standard vision-encoder pretraining and MLP projection assumptions common in MLLMs; no new physical entities or ad-hoc conserved quantities are introduced. Training hyperparameters and data mixtures are the main free parameters, but these are typical for the field.

free parameters (1)
  • MLP connector dimensions and training schedule
    Chosen to align visual features with the diffusion language embedding space; specific values not stated in abstract.
axioms (1)
  • domain assumption: Masked diffusion language modeling can be extended to multimodal inputs via a vision encoder and MLP projection without loss of coherence.
    Invoked when the authors state that the vision encoder and MLP enable effective multimodal alignment.

pith-pipeline@v0.9.0 · 5552 in / 1257 out tokens · 24721 ms · 2026-05-17T03:40:53.041337+00:00 · methodology


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Mask prior drift and positional attention collapse cause failures in LDVLMs for long generations, fixed by training-free Mask Prior Suppression and Monotonic RoPE Scaling.

  2. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.

  3. Relative Score Policy Optimization for Diffusion Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.

  4. GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization

    cs.CV 2026-05 unverdicted novelty 7.0

    GPO-V is a visual jailbreak framework that bypasses safety guardrails in diffusion VLMs by globally manipulating generative probabilities during denoising.

  5. GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization

    cs.CV 2026-05 unverdicted novelty 7.0

    GPO-V jailbreaks dVLMs by globally optimizing probabilities in the denoising process to bypass refusal patterns, achieving stealthy and transferable attacks.

  6. BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation

    cs.CV 2026-04 unverdicted novelty 7.0

    BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.

  7. ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

    cs.LG 2026-04 unverdicted novelty 7.0

    ECHO is a one-step block diffusion VLM for chest X-ray reports that improves RaTE and SemScore by over 60% while delivering 8x faster inference than autoregressive baselines.

  8. dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models

    cs.CV 2025-12 conditional novelty 7.0

    dMLLM-TTS delivers up to 6x more efficient test-time scaling for diffusion MLLMs via O(N+T) hierarchical search and self-verified feedback, improving generation quality on GenEval across three models.

  9. Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

    cs.CL 2025-05 conditional novelty 7.0

    Fast-dLLM adds reusable KV cache blocks and selective parallel decoding to diffusion LLMs, closing most of the speed gap with autoregressive models without retraining.

  10. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 6.0

    MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.

  11. ELF: Embedded Language Flows

    cs.CL 2026-05 unverdicted novelty 6.0

    ELF is a continuous embedding-space flow matching model for language that stays continuous until the last step and outperforms prior discrete and continuous diffusion language models with fewer sampling steps.

  12. Continuous Latent Diffusion Language Model

    cs.CL 2026-05 unverdicted novelty 6.0

    Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing l...

  13. One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

  14. Stability-Weighted Decoding for Diffusion Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    Stability-Weighted Decoding improves diffusion LLM accuracy by modulating token scores with temporal stability from KL divergence between prediction steps.

  15. Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Dataset-level metrics in diffusion language models mask substantial sample-level non-determinism that varies with model and system factors, which a new Factor Variance Attribution metric can decompose.

  16. LaDA-Band: Language Diffusion Models for Vocal-to-Accompaniment Generation

    cs.SD 2026-04 unverdicted novelty 6.0

    LaDA-Band applies discrete masked diffusion with dual-track conditioning and progressive training to generate vocal-to-accompaniment tracks that improve acoustic authenticity, global coherence, and dynamic orchestrati...

  17. Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.

  18. Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

    cs.CL 2025-12 unverdicted novelty 6.0

    Efficient-DLM converts AR models to dLMs via block-wise causal attention and position-dependent masking, yielding higher accuracy and 2.7-4.5x throughput than Dream 7B and Qwen3 4B.

  19. DMax: Aggressive Parallel Decoding for dLLMs

    cs.LG 2026-04 unverdicted novelty 5.0

    DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...

Reference graph

Works this paper leans on

125 extracted references · 125 canonical work pages · cited by 17 Pith papers · 39 internal anchors
