pith. machine review for the scientific record.

arxiv: 2604.24763 · v1 · submitted 2026-04-27 · 💻 cs.CV

Recognition: unknown

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 04:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords pixel embeddings · multimodal models · image generation · vision encoders · unified modeling · patch embeddings · end-to-end learning

The pith

Tuna-2 shows that simple pixel patch embeddings can replace pretrained vision encoders for unified multimodal understanding and generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Tuna-2 as a unified multimodal model that handles both visual understanding and image generation directly from raw pixels. It uses only basic patch embedding layers and removes the pretrained vision encoders and separate latent representations common in other models. Experiments demonstrate state-of-the-art results on multimodal benchmarks, with the encoder-free version performing especially well on fine-grained perception tasks at scale. This indicates that end-to-end pixel-space training provides a competitive and scalable alternative to encoder-based designs.

Core claim

Tuna-2 performs visual understanding and generation directly from pixel embeddings, using simple patch embedding layers to encode visual input and discarding modular vision-encoder designs such as the VAE or the representation encoder entirely. It achieves state-of-the-art performance on multimodal benchmarks and shows that the encoder-free design reaches stronger understanding at scale on tasks requiring fine-grained visual perception.

What carries the argument

Simple patch embedding layers applied directly to raw pixels for encoding visual input in a unified model.

Load-bearing premise

Simple patch embedding layers applied directly to pixels can extract sufficient visual features for both high-quality generation and fine-grained understanding without the inductive biases or pretraining from dedicated vision encoders.
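
To make the premise concrete: a "simple patch embedding layer" in this sense is essentially a ViT-style projection, a single strided convolution (equivalently, a linear map over flattened pixel patches) that turns raw pixels into token embeddings with no pretrained encoder in the loop. The sketch below illustrates the idea in PyTorch; the patch size, embedding dimension, and class name are illustrative assumptions, not Tuna-2's reported configuration.

```python
# Minimal sketch of a ViT-style patch embedding applied directly to raw pixels.
# Hyperparameters (patch_size=16, embed_dim=1024) are illustrative, not Tuna-2's.
import torch
import torch.nn as nn

class PixelPatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and linearly project each one."""

    def __init__(self, patch_size: int = 16, in_channels: int = 3, embed_dim: int = 1024):
        super().__init__()
        # A Conv2d with kernel_size == stride == patch_size is equivalent to
        # flattening each patch and applying a shared linear layer.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, pixels: torch.Tensor) -> torch.Tensor:
        # pixels: (batch, 3, H, W) raw image tensor, no pretrained encoder involved.
        x = self.proj(pixels)                # (batch, embed_dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)  # (batch, num_patches, embed_dim)

tokens = PixelPatchEmbedding()(torch.randn(2, 3, 256, 256))
print(tokens.shape)  # torch.Size([2, 256, 1024])
```

In an encoder-based unified model, the visual tokens at this point would instead come from a frozen pretrained module (for example, a SigLIP-style representation encoder for understanding or a VAE latent for generation); the paper's claim is that this extra machinery can be dropped.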

What would settle it

A large-scale experiment in which an encoder-based multimodal model significantly outperforms the encoder-free Tuna-2 on fine-grained visual perception benchmarks after equivalent training.

Original abstract

Unified multimodal models typically rely on pretrained vision encoders and use separate visual representations for understanding and generation, creating misalignment between the two tasks and preventing fully end-to-end optimization from raw pixels. We introduce Tuna-2, a native unified multimodal model that performs visual understanding and generation directly based on pixel embeddings. Tuna-2 drastically simplifies the model architecture by employing simple patch embedding layers to encode visual input, completely discarding the modular vision encoder designs such as the VAE or the representation encoder. Experiments show that Tuna-2 achieves state-of-the-art performance in multimodal benchmarks, demonstrating that unified pixel-space modelling can fully compete with latent-space approaches for high-quality image generation. Moreover, while the encoder-based variant converges faster in early pretraining, Tuna-2's encoder-free design achieves stronger multimodal understanding at scale, particularly on tasks requiring fine-grained visual perception. These results show that pretrained vision encoders are not necessary for multimodal modelling, and end-to-end pixel-space learning offers a scalable path toward stronger visual representations for both generation and perception.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Tuna-2, a unified multimodal model that encodes visual inputs using simple patch embedding layers directly on raw pixels for both understanding and generation tasks, completely discarding pretrained vision encoders such as VAEs or representation encoders. It claims state-of-the-art performance across multimodal benchmarks, shows that unified pixel-space modeling competes with latent-space approaches for high-quality image generation, and reports that the encoder-free design achieves stronger multimodal understanding at scale (especially on fine-grained perception tasks) despite slower early convergence compared to encoder-based variants.

Significance. If the results hold under controlled comparisons, the work would be significant for demonstrating that end-to-end pixel-space learning can match or exceed modular designs relying on pretrained encoders, simplifying architectures and enabling fully joint optimization from raw pixels. This challenges the necessity of inductive biases from vision encoders and suggests a scalable path for unified multimodal models, with potential implications for both generation quality and fine-grained visual reasoning.

major comments (2)
  1. [Experiments and Ablations] The central claim that the encoder-free Tuna-2 design achieves stronger multimodal understanding at scale (abstract) rests on the assumption of fair comparisons. The manuscript must explicitly report and match parameter counts, training compute/FLOPs, data schedules, and optimization details between Tuna-2, its encoder-based ablations, and external baselines; without these controls, performance differences cannot be isolated to the removal of the vision encoder rather than capacity or regime differences. This is load-bearing for the abstract's assertion that 'pretrained vision encoders are not necessary'. (A minimal sketch of this kind of accounting appears after these comments.)
  2. [Abstract] The abstract asserts SOTA results and superiority on fine-grained tasks but supplies no quantitative numbers, named benchmarks, baselines, or error bars. While the full manuscript presumably contains tables, the absence of even summary metrics or key result highlights undermines immediate verifiability of the claims that pixel embeddings 'fully compete' and 'beat' encoders.
minor comments (2)
  1. [Introduction] Clarify early in the introduction how 'simple patch embedding layers' differ from standard ViT-style patch embeddings and what (if any) additional processing is applied before feeding into the transformer.
  2. [Results] Ensure all experimental tables include standard deviations or confidence intervals and clearly label whether comparisons use the same training budget.
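
On major comment 1, the following is a minimal sketch, assuming PyTorch models, of the requested accounting: tabulate trainable-parameter counts (and, where measurable, training FLOPs) for the encoder-free and encoder-based variants before attributing benchmark gaps to the architecture. The helper names, toy modules, and the 5% tolerance are illustrative, not taken from the paper.

```python
# Hedged sketch: compare trainable-parameter counts of two model variants so
# that benchmark differences are not confounded by capacity. Toy modules only.
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total trainable parameters in a module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def report_match(variants: dict[str, nn.Module], tolerance: float = 0.05) -> None:
    counts = {name: count_parameters(m) for name, m in variants.items()}
    reference = max(counts.values())
    for name, n in counts.items():
        gap = (reference - n) / reference
        status = "matched" if gap <= tolerance else "MISMATCH"
        print(f"{name:>14}: {n / 1e6:7.1f}M params ({gap:.1%} below largest, {status})")

# Stand-ins for the real encoder-free and encoder-based models, which would be
# trained under the same data schedule, optimizer, and compute budget.
report_match({
    "encoder-free": nn.Linear(4096, 4096),
    "encoder-based": nn.Sequential(nn.Linear(4096, 3900), nn.Linear(3900, 96)),
})
```

The same table would also log FLOPs per training step and total tokens seen, so that "stronger at scale" can be read off at matched compute rather than matched wall-clock time.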

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on experimental controls and abstract clarity. We address each major point below and have revised the manuscript to strengthen the presentation of our results.

Point-by-point responses
  1. Referee: [Experiments and Ablations] The central claim that the encoder-free Tuna-2 design achieves stronger multimodal understanding at scale (abstract) rests on the assumption of fair comparisons. The manuscript must explicitly report and match parameter counts, training compute/FLOPs, data schedules, and optimization details between Tuna-2, its encoder-based ablations, and external baselines; without these controls, performance differences cannot be isolated to the removal of the vision encoder rather than capacity or regime differences. This is load-bearing for the abstract's assertion that 'pretrained vision encoders are not necessary'.

    Authors: We agree that rigorous matching of these factors is essential to isolate the contribution of the pixel-embedding design. The revised manuscript now includes an expanded experimental section with a dedicated table that reports parameter counts (Tuna-2 and its encoder-based ablation are matched within 5%), training FLOPs, data schedules, and full optimization hyperparameters for both internal ablations and external baselines. Where exact matching was not feasible due to differences in public baseline implementations, we explicitly note the discrepancies and their potential impact. These additions confirm that the observed advantages in fine-grained understanding at scale are attributable to the encoder-free approach rather than capacity or regime differences. revision: yes

  2. Referee: [Abstract] The abstract asserts SOTA results and superiority on fine-grained tasks but supplies no quantitative numbers, named benchmarks, baselines, or error bars. While the full manuscript presumably contains tables, the absence of even summary metrics or key result highlights undermines immediate verifiability of the claims that pixel embeddings 'fully compete' and 'beat' encoders.

    Authors: We acknowledge that the original abstract was too high-level for immediate verification. In the revision, we have incorporated concise quantitative highlights drawn from the main results, including named benchmarks and performance margins on both understanding and generation tasks. This improves verifiability while respecting abstract length constraints; full tables, error bars, and detailed baselines remain in the body of the paper. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical model comparisons rest on external benchmarks, not self-referential definitions or fitted inputs.

Full rationale

The paper introduces Tuna-2 as an architecture using simple patch embeddings on raw pixels, then reports experimental results on multimodal benchmarks showing competitive or superior performance to encoder-based models. No derivation chain exists that reduces a claimed result to its own inputs by construction (e.g., no parameter fitted on a subset then relabeled as a prediction, no uniqueness theorem imported from self-citation, no ansatz smuggled via prior work). The central claim that 'pretrained vision encoders are not necessary' follows from the reported ablation and SOTA numbers rather than being presupposed by the model definition. This is a standard empirical architecture paper; its claims stand or fall against external benchmarks rather than against quantities the model defines for itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not mention any free parameters, axioms, or invented entities. The approach relies on standard patch embedding, a technique already established in the vision transformer literature.

pith-pipeline@v0.9.0 · 5527 in / 1161 out tokens · 47486 ms · 2026-05-08T04:27:47.732435+00:00 · methodology

discussion (0)


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...

  2. STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.

  3. SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

    cs.CV · 2026-05 · unverdicted · novelty 5.0

    SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

Reference graph

Works this paper leans on

58 extracted references · 57 canonical work pages · cited by 3 Pith papers · 24 internal anchors
