pith. machine review for the scientific record.

arxiv: 2604.20796 · v1 · submitted 2026-04-22 · 💻 cs.CV

Recognition: unknown

LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

Haoquan Li, Haoxing Chen, Haoyuan Wu, Hongjun Wang, Inclusion AI, Jianguo Li, Junbo Zhao, Kai Gan, Long Cui, Qi Qin, Tao Lin, Tieyuan Chen, Tiwei Bie, Xiaomei Wang, Yi Xin, Zhenglin Cheng, Zhenzhong Lan, Zhicheng Huang

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 00:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords: multimodal understanding · image generation · diffusion models · large language models · unified models · discrete diffusion · interleaved generation

The pith

LLaDA2.0-Uni unifies multimodal understanding and image generation inside one discrete diffusion large language model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LLaDA2.0-Uni as a single architecture that performs both multimodal understanding and image generation through discrete diffusion. It discretizes visual inputs so the language model backbone can apply masked diffusion to text and vision tokens alike, then uses a separate decoder to turn the resulting tokens back into images. The model is trained on large-scale data with a multi-stage pipeline and is reported to reach understanding performance comparable to specialized vision-language models while also producing strong generation and editing results. Native support for interleaved sequences of reasoning and generation is built in. A sympathetic reader would see this as evidence that one model can handle tasks previously split across separate systems.
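To make the division of labor concrete, below is a minimal sketch of the text-to-image direction as the summary describes it. Every class name, vocabulary size, and the confidence-based unmasking schedule here is an illustrative stand-in, not the paper's actual interface or decoding procedure.

    # Sketch of the pipeline shape: discrete tokenizer + masked-diffusion backbone + decoder.
    # All classes are stubs returning random tensors; names and sizes are assumptions.
    import torch

    VOCAB_TEXT = 32_000                      # hypothetical text vocabulary size
    VOCAB_VISUAL = 8_192                     # hypothetical SigLIP-VQ codebook size
    MASK_ID = VOCAB_TEXT + VOCAB_VISUAL      # shared [MASK] token id

    class SigLIPVQTokenizer:
        """Stub: map an image to discrete semantic token ids (assumed interface)."""
        def encode(self, image: torch.Tensor) -> torch.Tensor:
            n_patches = 256                  # e.g. a 16x16 patch grid, purely illustrative
            return torch.randint(VOCAB_TEXT, VOCAB_TEXT + VOCAB_VISUAL, (n_patches,))

    class MoEDiffusionLM:
        """Stub: backbone that scores every position of a mixed sequence in parallel."""
        def logits(self, tokens: torch.Tensor) -> torch.Tensor:
            return torch.randn(tokens.shape[0], MASK_ID + 1)

    class DiffusionDecoder:
        """Stub: render discrete visual tokens back into pixels (assumed interface)."""
        def decode(self, visual_tokens: torch.Tensor) -> torch.Tensor:
            return torch.rand(3, 256, 256)

    def generate_image(prompt_ids: torch.Tensor, n_visual: int = 256, steps: int = 8):
        """Text-to-image: append a fully masked visual block and iteratively unmask it."""
        lm, dec = MoEDiffusionLM(), DiffusionDecoder()
        seq = torch.cat([prompt_ids, torch.full((n_visual,), MASK_ID)])
        vis = slice(len(prompt_ids), len(seq))
        for step in range(steps):
            masked = seq == MASK_ID
            if not masked.any():
                break
            logits = lm.logits(seq)
            conf, pred = logits.softmax(-1).max(-1)
            # reveal the most confident masked positions this step (simple linear schedule)
            k = max(1, int(masked.sum()) // (steps - step))
            idx = torch.where(masked)[0]
            chosen = idx[conf[idx].topk(k).indices]
            seq[chosen] = pred[chosen]
        return dec.decode(seq[vis])

The understanding direction would instead run SigLIPVQTokenizer.encode on an input image and feed the resulting ids to the same backbone alongside the text tokens.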

Core claim

LLaDA2.0-Uni combines a semantic discrete tokenizer based on SigLIP-VQ, an MoE-based discrete diffusion LLM backbone that applies block-level masked diffusion to both text and vision inputs, and a diffusion decoder that reconstructs visual tokens into images. Supported by large-scale curated data and multi-stage training, the model matches specialized VLMs on multimodal understanding benchmarks and delivers strong performance on image generation and editing while enabling native interleaved generation and reasoning.

What carries the argument

Block-level masked diffusion inside the MoE dLLM backbone that treats discretized visual tokens identically to text tokens for joint understanding and generation tasks.
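At the objective level, "treats discretized visual tokens identically to text tokens" plausibly means a single masked-token cross-entropy over the interleaved sequence, with a per-example masking ratio as in standard masked (absorbing-state) diffusion training. The sketch below shows that generic objective, including the usual 1/t loss weighting; whether LLaDA2.0-Uni uses exactly this loss or noise schedule is an assumption.

    # Sketch of a joint masked-diffusion training step over mixed text/visual token ids.
    # model(noisy) is assumed to return per-position logits of shape (batch, seq, vocab).
    import torch
    import torch.nn.functional as F

    def masked_diffusion_loss(model, tokens: torch.Tensor, mask_id: int) -> torch.Tensor:
        """tokens: (batch, seq_len) interleaved text + visual token ids."""
        b, n = tokens.shape
        t = torch.rand(b, 1).clamp_min(1e-3)          # per-example masking ratio in (0, 1]
        masked = torch.rand(b, n) < t                 # each position masked with prob t
        masked[~masked.any(dim=1), 0] = True          # ensure at least one masked position
        noisy = torch.where(masked, torch.full_like(tokens, mask_id), tokens)
        logits = model(noisy)
        loss = F.cross_entropy(logits[masked], tokens[masked], reduction="none")
        # reweight by 1/t, as in the standard masked-diffusion likelihood bound
        weights = (1.0 / t).expand(b, n)[masked]
        return (weights * loss).mean()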

Load-bearing premise

Discretizing continuous visual inputs via SigLIP-VQ preserves enough semantic and perceptual information for both high-fidelity reconstruction and competitive understanding performance without task-specific trade-offs.
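For readers unfamiliar with the bottleneck this premise concerns, a generic nearest-codebook vector quantizer over patch features looks like the sketch below. The SigLIP-VQ design itself (codebook size, training losses, normalization) may differ; nothing here is taken from the paper.

    # Minimal nearest-neighbour vector quantization over patch features.
    # Codebook size, feature dimension, and the straight-through estimator are
    # generic VQ choices assumed for illustration only.
    import torch
    import torch.nn as nn

    class VectorQuantizer(nn.Module):
        def __init__(self, codebook_size: int = 8192, dim: int = 768):
            super().__init__()
            self.codebook = nn.Embedding(codebook_size, dim)

        def forward(self, feats: torch.Tensor):
            """feats: (n_patches, dim) continuous features, e.g. from a SigLIP-style encoder."""
            dists = torch.cdist(feats, self.codebook.weight)   # (n_patches, codebook_size)
            ids = dists.argmin(dim=-1)                          # discrete token ids
            quantized = self.codebook(ids)                      # re-embedded tokens
            # straight-through estimator: gradients flow back to the encoder features
            quantized = feats + (quantized - feats).detach()
            return ids, quantized

    vq = VectorQuantizer()
    patch_feats = torch.randn(256, 768)       # stand-in for encoder patch features
    token_ids, _ = vq(patch_feats)            # these ids would feed the dLLM backbone

The premise is that the loss of information at the argmin step is small enough that both the decoder's reconstructions and the backbone's understanding remain competitive.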

What would settle it

Compare LLaDA2.0-Uni scores on standard multimodal understanding benchmarks such as VQAv2 or POPE against leading specialized VLMs, and measure FID or human preference on image generation and editing tasks against dedicated diffusion models; consistent gaps in either direction would falsify the unification claim.
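One way to make "consistent gaps in either direction" operational is to tabulate signed per-benchmark deltas against the best specialist and require every delta to clear a tolerance. The helper below is bookkeeping only; it contains no scores from the paper, and the tolerance is a placeholder.

    # Sketch of the falsification check: does the unified model trail the best
    # specialist by more than `tol` on any benchmark? No real scores appear here.
    from statistics import mean

    def unification_gaps(unified: dict, specialists: dict, higher_is_better: dict) -> dict:
        """Return signed gaps per benchmark (positive = unified model ahead)."""
        gaps = {}
        for bench, score in unified.items():
            sign = 1.0 if higher_is_better.get(bench, True) else -1.0
            gaps[bench] = sign * (score - specialists[bench])
        return gaps

    def claim_holds(gaps: dict, tol: float = 1.0) -> bool:
        """The no-trade-off claim survives if no benchmark trails by more than tol
        and the average gap is not systematically negative."""
        return all(g >= -tol for g in gaps.values()) and mean(gaps.values()) >= -tol / 2

VQAv2, POPE, and most understanding benchmarks report accuracy (higher is better), while FID is lower-is-better, hence the per-benchmark sign map.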

Figures

Figures reproduced from arXiv: 2604.20796 by Haoquan Li, Haoxing Chen, Haoyuan Wu, Hongjun Wang, Inclusion AI, Jianguo Li, Junbo Zhao, Kai Gan, Long Cui, Qi Qin, Tao Lin, Tieyuan Chen, Tiwei Bie, Xiaomei Wang, Yi Xin, Zhenglin Cheng, Zhenzhong Lan, Zhicheng Huang.

Figure 1: Benchmark Performance of LLaDA2.0-Uni.
Figure 2: Showcases of LLaDA2.0-Uni in High-Fidelity Image Generation.
Figure 3: Showcases of LLaDA2.0-Uni in Single/Multi-Reference Editing, Interleaved Generation and …
Figure 4: Architecture Overview of LLaDA2.0-Uni. The framework integrates a SigLIP-VQ tokenizer and a …
Figure 5: Data Packing Strategy for Efficient Training. Multiple shorter samples are concatenated into …
Figure 6: Overview of InterGen Benchmark. A muscular man with a beard, wearing a black tank top and a black beanie, sits on a beige couch. He looks directly at the camera with a neutral expression, his right hand resting on his thigh. Behind him, a large window reveals a cityscape with tall buildings, and a framed picture hangs on the wall to his left. Next, the man shifts his gaze slightly to his right, his express…
Figure 7: Qualitative Results on Interleaved Generation Task.
Figure 8: Qualitative Results on Interleaved Reasoning Task.
Figure 9: Visual Comparison of the Decoder and the Distilled Version.
Original abstract

We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion for both text and vision inputs within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is enhanced beyond parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized VLMs in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Codes and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) for multimodal understanding and generation. It discretizes visual inputs using a SigLIP-VQ tokenizer, employs an MoE-based dLLM backbone for block-level masked diffusion on interleaved text and vision tokens, and uses a diffusion decoder to reconstruct high-fidelity images. Supported by large-scale curated data and multi-stage training, the model claims to match specialized VLMs on understanding benchmarks while achieving strong results on image generation and editing tasks, with native support for interleaved reasoning and generation. Code and models are released.

Significance. If the empirical results hold after verification, this work would advance unified multimodal foundation models by demonstrating that a single discrete diffusion framework can handle both dense understanding and high-fidelity generation without apparent task specialization, potentially simplifying architectures and enabling more flexible interleaved workflows. The open release of code and models strengthens reproducibility and follow-up research.

major comments (2)
  1. [Abstract and architecture description] The central claim that SigLIP-VQ discretization enables competitive understanding and high-fidelity generation without task trade-offs is load-bearing, yet the architecture description provides no quantitative bounds on reconstruction error, semantic preservation metrics, or ablation studies isolating the VQ step's impact on downstream performance (e.g., no comparison of continuous vs. discrete visual tokens on the same backbone).
  2. [Abstract] The abstract states that the model 'matches specialized VLMs in multimodal understanding' and delivers 'strong performance in image generation and editing,' but reports no specific benchmark scores, baselines, or error analysis; without these in the experiments section, the no-trade-off claim cannot be evaluated.
minor comments (2)
  1. [Method] Clarify the exact number of diffusion steps, distillation schedule, and MoE routing details in the methods to allow reproduction.
  2. [Training] Add a table summarizing key hyperparameters (e.g., expert count, training data mixtures) for the multi-stage pipeline.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on strengthening the presentation of our central claims. We address each major comment below and have made revisions to the manuscript to provide the requested quantitative details and clarifications.

Point-by-point responses
  1. Referee: [Abstract and architecture description] The central claim that SigLIP-VQ discretization enables competitive understanding and high-fidelity generation without task trade-offs is load-bearing, yet the architecture description provides no quantitative bounds on reconstruction error, semantic preservation metrics, or ablation studies isolating the VQ step's impact on downstream performance (e.g., no comparison of continuous vs. discrete visual tokens on the same backbone).

    Authors: We agree that additional quantitative support for the SigLIP-VQ tokenizer would strengthen the architecture description and better substantiate the no-trade-off claim. In the revised manuscript, we will add explicit metrics on reconstruction error (including FID and LPIPS for tokenized image reconstruction) and semantic preservation (such as downstream task accuracy using discrete tokens). We will also incorporate an ablation study comparing the discrete SigLIP-VQ approach against continuous visual token variants on the identical MoE dLLM backbone, evaluating impacts on both multimodal understanding and image generation/editing performance. revision: yes

  2. Referee: [Abstract] The abstract states that the model 'matches specialized VLMs in multimodal understanding' and delivers 'strong performance in image generation and editing,' but reports no specific benchmark scores, baselines, or error analysis; without these in the experiments section, the no-trade-off claim cannot be evaluated.

    Authors: We will revise the abstract to include representative quantitative results, such as key benchmark scores for understanding tasks (with main VLM baselines) and generation/editing metrics (e.g., FID scores), to allow immediate evaluation of the claims. We will also expand the experiments section with more detailed error analysis, explicit no-trade-off comparisons across tasks, and additional baseline results to ensure the supporting evidence is fully accessible and rigorous. revision: yes
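The first rebuttal point names FID and LPIPS for tokenizer round-trip reconstructions. A rough sketch of computing both appears below, assuming torchmetrics' FrechetInceptionDistance and LearnedPerceptualImagePatchSimilarity are available in the installed version (with the normalize=True convention for float images in [0, 1]); the sketch asserts nothing about the paper's actual values.

    # Sketch: reconstruction quality of a discrete tokenizer, measured with FID and LPIPS.
    # Assumes torchmetrics with image extras is installed; metric weights are downloaded
    # on first use. Inputs are float tensors in [0, 1] of shape (N, 3, H, W), with N >= 2
    # so the FID covariance estimate is defined.
    import torch
    from torchmetrics.image import FrechetInceptionDistance, LearnedPerceptualImagePatchSimilarity

    def reconstruction_metrics(originals: torch.Tensor, reconstructions: torch.Tensor) -> dict:
        fid = FrechetInceptionDistance(feature=2048, normalize=True)
        lpips = LearnedPerceptualImagePatchSimilarity(net_type="vgg", normalize=True)
        fid.update(originals, real=True)
        fid.update(reconstructions, real=False)
        lpips.update(reconstructions, originals)
        return {"fid": float(fid.compute()), "lpips": float(lpips.compute())}

A continuous-vs-discrete ablation of the kind the referee requests would report these numbers alongside downstream understanding accuracy for both visual-token variants on the same backbone.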

Circularity Check

0 steps flagged

No circularity: empirical architecture and training results are self-contained

full rationale

The paper presents LLaDA2.0-Uni as a model architecture (SigLIP-VQ tokenizer + MoE dLLM backbone + diffusion decoder) trained via multi-stage pipeline on curated data. All performance claims (matching VLMs on understanding, strong generation/editing, interleaved support) are stated as outcomes of that training and evaluation, with no derivation chain, equations, or 'predictions' that reduce by construction to fitted inputs or self-citations. The discretization step is an explicit design choice whose information-preservation properties are left to empirical verification rather than assumed via prior self-referential results. This is a standard empirical model paper with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 2 invented entities

The central claim rests on standard deep-learning assumptions plus several engineering choices whose justification is not visible in the abstract.

free parameters (3)
  • MoE expert count and routing parameters
    Chosen to balance capacity and efficiency in the dLLM backbone; values not specified in abstract.
  • Diffusion timestep schedule and distillation steps
    Fitted or selected to achieve few-step inference while preserving quality.
  • Multi-stage training data mixture ratios
    Curated large-scale data ratios are hand-tuned for the unified objective.
axioms (2)
  • domain assumption: Discrete tokenization via SigLIP-VQ preserves both semantic and reconstructible visual information
    Invoked when stating that block-level masked diffusion works for vision inputs.
  • standard math: Standard transformer/MoE scaling laws apply to the discrete diffusion setting
    Implicit in expecting competitive performance from large-scale training.
invented entities (2)
  • SigLIP-VQ tokenizer (no independent evidence)
    purpose: Convert continuous images into discrete semantic tokens compatible with the dLLM backbone
    New integration point; no independent evidence provided beyond the claim of high-fidelity reconstruction.
  • Diffusion decoder (no independent evidence)
    purpose: Reconstruct visual tokens into images after backbone processing
    Component introduced to close the generation loop; evidence is performance claims only.

pith-pipeline@v0.9.0 · 5544 in / 1453 out tokens · 22810 ms · 2026-05-10T00:46:12.674919+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Relative Score Policy Optimization for Diffusion Language Models

cs.CL · 2026-05 · unverdicted · novelty 7.0

    RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.
