pith. machine review for the scientific record.

arxiv: 2604.24763 · v1 · submitted 2026-04-27 · 💻 cs.CV

Recognition: unknown

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 04:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords pixel embeddings · multimodal models · image generation · vision encoders · unified modeling · patch embeddings · end-to-end learning

The pith

Tuna-2 shows that simple pixel patch embeddings can replace pretrained vision encoders for unified multimodal understanding and generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Tuna-2 as a unified multimodal model that handles both visual understanding and image generation directly from raw pixels. It uses only basic patch embedding layers and removes the pretrained vision encoders and separate latent representations common in other models. Experiments demonstrate state-of-the-art results on multimodal benchmarks, with the encoder-free version performing especially well on fine-grained perception tasks at scale. This indicates that end-to-end pixel-space training provides a competitive and scalable alternative to encoder-based designs.

Core claim

Tuna-2 performs visual understanding and generation directly from pixel embeddings, using simple patch embedding layers to encode visual input and discarding modular vision-encoder designs such as the VAE or the representation encoder entirely. It achieves state-of-the-art performance on multimodal benchmarks and shows that the encoder-free design reaches stronger understanding at scale on tasks requiring fine-grained visual perception.

What carries the argument

Simple patch embedding layers applied directly to raw pixels for encoding visual input in a unified model.

Load-bearing premise

Simple patch embedding layers applied directly to pixels can extract sufficient visual features for both high-quality generation and fine-grained understanding without the inductive biases or pretraining from dedicated vision encoders.
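
To make the premise concrete: a "simple patch embedding layer" in this sense is essentially a ViT-style projection, a single strided convolution (equivalently, a linear map over flattened pixel patches) that turns raw pixels into token embeddings with no pretrained encoder in the loop. The sketch below illustrates the idea in PyTorch; the patch size, embedding dimension, and class name are illustrative assumptions, not Tuna-2's reported configuration.

```python
# Minimal sketch of a ViT-style patch embedding applied directly to raw pixels.
# Hyperparameters (patch_size=16, embed_dim=1024) are illustrative, not Tuna-2's.
import torch
import torch.nn as nn

class PixelPatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and linearly project each one."""

    def __init__(self, patch_size: int = 16, in_channels: int = 3, embed_dim: int = 1024):
        super().__init__()
        # A Conv2d with kernel_size == stride == patch_size is equivalent to
        # flattening each patch and applying a shared linear layer.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, pixels: torch.Tensor) -> torch.Tensor:
        # pixels: (batch, 3, H, W) raw image tensor, no pretrained encoder involved.
        x = self.proj(pixels)                # (batch, embed_dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)  # (batch, num_patches, embed_dim)

tokens = PixelPatchEmbedding()(torch.randn(2, 3, 256, 256))
print(tokens.shape)  # torch.Size([2, 256, 1024])
```

In an encoder-based unified model, the visual tokens at this point would instead come from a frozen pretrained module (for example, a SigLIP-style representation encoder for understanding or a VAE latent for generation); the paper's claim is that this extra machinery can be dropped.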

What would settle it

A large-scale experiment in which an encoder-based multimodal model significantly outperforms the encoder-free Tuna-2 on fine-grained visual perception benchmarks after equivalent training.

Original abstract

Unified multimodal models typically rely on pretrained vision encoders and use separate visual representations for understanding and generation, creating misalignment between the two tasks and preventing fully end-to-end optimization from raw pixels. We introduce Tuna-2, a native unified multimodal model that performs visual understanding and generation directly based on pixel embeddings. Tuna-2 drastically simplifies the model architecture by employing simple patch embedding layers to encode visual input, completely discarding the modular vision encoder designs such as the VAE or the representation encoder. Experiments show that Tuna-2 achieves state-of-the-art performance in multimodal benchmarks, demonstrating that unified pixel-space modelling can fully compete with latent-space approaches for high-quality image generation. Moreover, while the encoder-based variant converges faster in early pretraining, Tuna-2's encoder-free design achieves stronger multimodal understanding at scale, particularly on tasks requiring fine-grained visual perception. These results show that pretrained vision encoders are not necessary for multimodal modelling, and end-to-end pixel-space learning offers a scalable path toward stronger visual representations for both generation and perception.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Tuna-2, a unified multimodal model that encodes visual inputs using simple patch embedding layers directly on raw pixels for both understanding and generation tasks, completely discarding pretrained vision encoders such as VAEs or representation encoders. It claims state-of-the-art performance across multimodal benchmarks, shows that unified pixel-space modeling competes with latent-space approaches for high-quality image generation, and reports that the encoder-free design achieves stronger multimodal understanding at scale (especially on fine-grained perception tasks) despite slower early convergence compared to encoder-based variants.

Significance. If the results hold under controlled comparisons, the work would be significant for demonstrating that end-to-end pixel-space learning can match or exceed modular designs relying on pretrained encoders, simplifying architectures and enabling fully joint optimization from raw pixels. This challenges the necessity of inductive biases from vision encoders and suggests a scalable path for unified multimodal models, with potential implications for both generation quality and fine-grained visual reasoning.

major comments (2)
  1. [Experiments and Ablations] The central claim that the encoder-free Tuna-2 design achieves stronger multimodal understanding at scale (abstract) rests on the assumption of fair comparisons. The manuscript must explicitly report and match parameter counts, training compute/FLOPs, data schedules, and optimization details between Tuna-2, its encoder-based ablations, and external baselines; without these controls, performance differences cannot be isolated to the removal of the vision encoder rather than capacity or regime differences. This is load-bearing for the abstract's assertion that 'pretrained vision encoders are not necessary'. (A minimal sketch of this kind of accounting appears after these comments.)
  2. [Abstract] The abstract asserts SOTA results and superiority on fine-grained tasks but supplies no quantitative numbers, named benchmarks, baselines, or error bars. While the full manuscript presumably contains tables, the absence of even summary metrics or key result highlights undermines immediate verifiability of the claims that pixel embeddings 'fully compete' and 'beat' encoders.
minor comments (2)
  1. [Introduction] Clarify early in the introduction how 'simple patch embedding layers' differ from standard ViT-style patch embeddings and what (if any) additional processing is applied before feeding into the transformer.
  2. [Results] Ensure all experimental tables include standard deviations or confidence intervals and clearly label whether comparisons use the same training budget.
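
On major comment 1, the following is a minimal sketch, assuming PyTorch models, of the requested accounting: tabulate trainable-parameter counts (and, where measurable, training FLOPs) for the encoder-free and encoder-based variants before attributing benchmark gaps to the architecture. The helper names, toy modules, and the 5% tolerance are illustrative, not taken from the paper.

```python
# Hedged sketch: compare trainable-parameter counts of two model variants so
# that benchmark differences are not confounded by capacity. Toy modules only.
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total trainable parameters in a module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def report_match(variants: dict[str, nn.Module], tolerance: float = 0.05) -> None:
    counts = {name: count_parameters(m) for name, m in variants.items()}
    reference = max(counts.values())
    for name, n in counts.items():
        gap = (reference - n) / reference
        status = "matched" if gap <= tolerance else "MISMATCH"
        print(f"{name:>14}: {n / 1e6:7.1f}M params ({gap:.1%} below largest, {status})")

# Stand-ins for the real encoder-free and encoder-based models, which would be
# trained under the same data schedule, optimizer, and compute budget.
report_match({
    "encoder-free": nn.Linear(4096, 4096),
    "encoder-based": nn.Sequential(nn.Linear(4096, 3900), nn.Linear(3900, 96)),
})
```

The same table would also log FLOPs per training step and total tokens seen, so that "stronger at scale" can be read off at matched compute rather than matched wall-clock time.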

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on experimental controls and abstract clarity. We address each major point below and have revised the manuscript to strengthen the presentation of our results.

Point-by-point responses
  1. Referee: [Experiments and Ablations] The central claim that the encoder-free Tuna-2 design achieves stronger multimodal understanding at scale (abstract) rests on the assumption of fair comparisons. The manuscript must explicitly report and match parameter counts, training compute/FLOPs, data schedules, and optimization details between Tuna-2, its encoder-based ablations, and external baselines; without these controls, performance differences cannot be isolated to the removal of the vision encoder rather than capacity or regime differences. This is load-bearing for the abstract's assertion that 'pretrained vision encoders are not necessary'.

    Authors: We agree that rigorous matching of these factors is essential to isolate the contribution of the pixel-embedding design. The revised manuscript now includes an expanded experimental section with a dedicated table that reports parameter counts (Tuna-2 and its encoder-based ablation are matched within 5%), training FLOPs, data schedules, and full optimization hyperparameters for both internal ablations and external baselines. Where exact matching was not feasible due to differences in public baseline implementations, we explicitly note the discrepancies and their potential impact. These additions confirm that the observed advantages in fine-grained understanding at scale are attributable to the encoder-free approach rather than capacity or regime differences. revision: yes

  2. Referee: [Abstract] The abstract asserts SOTA results and superiority on fine-grained tasks but supplies no quantitative numbers, named benchmarks, baselines, or error bars. While the full manuscript presumably contains tables, the absence of even summary metrics or key result highlights undermines immediate verifiability of the claims that pixel embeddings 'fully compete' and 'beat' encoders.

    Authors: We acknowledge that the original abstract was too high-level for immediate verification. In the revision, we have incorporated concise quantitative highlights drawn from the main results, including named benchmarks and performance margins on both understanding and generation tasks. This improves verifiability while respecting abstract length constraints; full tables, error bars, and detailed baselines remain in the body of the paper. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical model comparisons rest on external benchmarks, not self-referential definitions or fitted inputs.

Full rationale

The paper introduces Tuna-2 as an architecture using simple patch embeddings on raw pixels, then reports experimental results on multimodal benchmarks showing competitive or superior performance to encoder-based models. No derivation chain exists that reduces a claimed result to its own inputs by construction (e.g., no parameter fitted on a subset then relabeled as a prediction, no uniqueness theorem imported from self-citation, no ansatz smuggled via prior work). The central claim that 'pretrained vision encoders are not necessary' follows from the reported ablation and SOTA numbers rather than being presupposed by the model definition. This is a standard empirical architecture paper; its claims stand or fall against external benchmarks rather than against quantities the model defines for itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not mention any free parameters, axioms, or invented entities. The approach relies on standard patch embedding, a technique already established in the vision transformer literature.

pith-pipeline@v0.9.0 · 5527 in / 1161 out tokens · 47486 ms · 2026-05-08T04:27:47.732435+00:00 · methodology

discussion (0)


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...

  2. STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.

  3. SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

    cs.CV · 2026-05 · unverdicted · novelty 5.0

    SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

Reference graph

Works this paper leans on

58 extracted references · 57 canonical work pages · cited by 3 Pith papers · 24 internal anchors
