Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

Belinda Zeng; Jonas Schult; Luke Zettlemoyer; Mengzhao Chen; Ping Luo; Sen He; Shoufa Chen; Tao Xiang; Tianhong Li; Weiming Ren

arxiv: 2604.24763 · v2 · pith:XLMUT3UJnew · submitted 2026-04-27 · 💻 cs.CV

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

Zhiheng Liu , Weiming Ren , Xiaoke Huang , Shoufa Chen , Tianhong Li , Mengzhao Chen , Yatai Ji , Sen He

show 7 more authors

Jonas Schult Belinda Zeng Tao Xiang Wenhu Chen Ping Luo Luke Zettlemoyer Yuren Cong

This is my paper

Pith reviewed 2026-05-20 23:37 UTC · model grok-4.3

classification 💻 cs.CV

keywords multimodal modelspixel embeddingsvision encodersimage generationvisual understandingend-to-end learningpatch embeddings

0 comments

The pith

Tuna-2 shows that simple pixel patch embeddings can replace pretrained vision encoders for state-of-the-art multimodal understanding and generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Tuna-2 as a unified multimodal model that encodes visual inputs using only simple patch embedding layers and performs both understanding and generation directly in pixel space. It claims this encoder-free design reaches state-of-the-art results on multimodal benchmarks while outperforming encoder-based versions at scale, especially on fine-grained perception tasks. The work argues that separate pretrained vision encoders create unnecessary misalignment and that end-to-end pixel-space learning offers a simpler and more scalable route to strong visual representations. A sympathetic reader would care because the approach removes reliance on modular pretrained components and allows fully joint optimization from raw pixels.

Core claim

Tuna-2 performs visual understanding and generation directly based on pixel embeddings by employing simple patch embedding layers to encode visual input, completely discarding the modular vision encoder designs such as the VAE or the representation encoder. Experiments show that Tuna-2 achieves state-of-the-art performance in multimodal benchmarks, demonstrating that unified pixel-space modelling can fully compete with latent-space approaches for high-quality image generation. Moreover, while the encoder-based variant converges faster in early pretraining, Tuna-2's encoder-free design achieves stronger multimodal understanding at scale, particularly on tasks requiring fine-grained visual感知.

What carries the argument

simple patch embedding layers that encode visual input directly from raw pixels

If this is right

Unified pixel-space modelling can fully compete with latent-space approaches for high-quality image generation.
The encoder-free design achieves stronger multimodal understanding at scale, particularly on tasks requiring fine-grained visual perception.
Pretrained vision encoders are not necessary for multimodal modelling.
End-to-end pixel-space learning offers a scalable path toward stronger visual representations for both generation and perception.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach could reduce training complexity by removing the need to maintain and align separate vision encoder modules.
Models built this way might show better consistency across understanding and generation because both tasks share the exact same pixel-derived representations.
If the result holds under further scaling, it would suggest that visual representation learning benefits more from joint optimization than from frozen pretrained features.

Load-bearing premise

That the training data, scale, and optimization setup allow simple patch embeddings to acquire visual representations competitive with those from large-scale pretrained vision encoders without inheriting their inductive biases.

What would settle it

A controlled experiment training both an encoder-based variant and the encoder-free Tuna-2 on identical data and compute, then measuring whether the encoder-free version falls short on fine-grained visual perception benchmarks.

read the original abstract

Unified multimodal models typically rely on pretrained vision encoders and use separate visual representations for understanding and generation, creating misalignment between the two tasks and preventing fully end-to-end optimization from raw pixels. We introduce Tuna-2, a native unified multimodal model that performs visual understanding and generation directly based on pixel embeddings. Tuna-2 drastically simplifies the model architecture by employing simple patch embedding layers to encode visual input, completely discarding the modular vision encoder designs such as the VAE or the representation encoder. Experiments show that Tuna-2 achieves state-of-the-art performance in multimodal benchmarks, demonstrating that unified pixel-space modelling can fully compete with latent-space approaches for high-quality image generation. Moreover, while the encoder-based variant converges faster in early pretraining, Tuna-2's encoder-free design achieves stronger multimodal understanding at scale, particularly on tasks requiring fine-grained visual perception. These results show that pretrained vision encoders are not necessary for multimodal modelling, and end-to-end pixel-space learning offers a scalable path toward stronger visual representations for both generation and perception.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Tuna-2 shows pixel patch embeddings can replace vision encoders in unified multimodal models at scale, but the comparisons need tighter controls on training parity.

read the letter

Tuna-2 claims that pixel embeddings alone can beat traditional vision encoders in multimodal understanding and generation when scaled properly. What is new is the application of this simple patch embedding approach in a large unified model that does both tasks without any pretrained vision module or VAE. The paper does well in outlining how this enables fully joint optimization and in observing the different convergence behaviors between the two designs. Where it is soft is on the validation side. The strongest claims about outperforming at scale and achieving SOTA rest on experiments that are not detailed in the abstract. To trust that the pixel embeddings are the key, the training setups must be equivalent in data, compute, and hyperparameters. The stress-test concern is on point because without those controls, alternative explanations remain possible. This work is for multimodal researchers looking to simplify their pipelines. It offers a path toward more integrated models, so a reader in that area would find it useful to consider. The paper shows clear thinking on the architecture trade-offs, so it deserves a serious referee to dig into the results. I recommend putting it through peer review with attention to the experimental controls.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces Tuna-2, a unified multimodal model that performs visual understanding and generation directly from pixel embeddings using simple patch embedding layers, discarding pretrained vision encoders and VAEs. It claims state-of-the-art performance on multimodal benchmarks, that unified pixel-space modeling competes with latent-space approaches for image generation, and that the encoder-free variant achieves stronger results at scale (especially on fine-grained perception tasks) while an encoder-based variant converges faster early in pretraining.

Significance. If the experimental claims hold under matched training conditions, the work would demonstrate that end-to-end pixel-space learning can acquire visual representations competitive with large-scale pretrained encoders, simplifying architectures and removing reliance on separate modular components for understanding versus generation.

major comments (2)

Abstract: The assertion of state-of-the-art performance and scale advantages for the encoder-free design supplies no quantitative metrics, baselines, ablation details, or error analysis, making the central claims impossible to evaluate from the provided text.
Experiments section: No scaling curves, parameter counts, dataset sizes, or controlled ablations isolating the effect of removing the vision encoder are reported; without these, performance differences cannot be attributed to the pixel-embedding design rather than differences in training regime, data volume, or optimization.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and describe the revisions that will be incorporated into the next version of the manuscript.

read point-by-point responses

Referee: Abstract: The assertion of state-of-the-art performance and scale advantages for the encoder-free design supplies no quantitative metrics, baselines, ablation details, or error analysis, making the central claims impossible to evaluate from the provided text.

Authors: We agree that the abstract, being a concise summary, does not contain specific numerical results. The body of the manuscript presents the supporting quantitative evidence on multimodal benchmarks together with comparisons to baselines. To make the central claims more directly evaluable, we will revise the abstract to include key performance metrics and a brief reference to the scale-related findings. This change will be implemented in the revised manuscript. revision: yes
Referee: Experiments section: No scaling curves, parameter counts, dataset sizes, or controlled ablations isolating the effect of removing the vision encoder are reported; without these, performance differences cannot be attributed to the pixel-embedding design rather than differences in training regime, data volume, or optimization.

Authors: We acknowledge that the current version lacks scaling curves and sufficiently detailed controlled ablations that isolate the contribution of the pixel-embedding approach. We will add scaling curves, explicit parameter counts, dataset sizes, and additional ablations that hold training regime, data volume, and optimization fixed. These additions will appear in the revised Experiments section and will strengthen the attribution of results to the architectural choice. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on empirical benchmarks with no derivations or self-referential fitting

full rationale

The paper contains no equations, derivations, or first-principles results that could reduce to their own inputs. Central claims about pixel embeddings competing with vision encoders are supported solely by experimental comparisons on multimodal benchmarks. No parameters are fitted to subsets and then relabeled as predictions, no uniqueness theorems are imported from self-citations, and no ansatzes are smuggled in. The work is self-contained because its results are externally falsifiable via replication on the stated tasks and datasets.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the untested premise that end-to-end pixel training suffices to match pretrained vision representations; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption Simple patch embedding layers trained end-to-end can acquire visual features competitive with those from large-scale pretrained vision encoders.
Invoked when the paper discards VAE and representation encoders in favor of direct pixel embeddings.

pith-pipeline@v0.9.0 · 5758 in / 1166 out tokens · 28664 ms · 2026-05-20T23:37:54.489575+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Tuna-2 drastically simplifies the model architecture by employing simple patch embedding layers to encode visual input, completely discarding the modular vision encoder designs such as the VAE or the representation encoder.
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we adopt the x-prediction and v-loss paradigm from JiT for pixel-space flow matching

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Lance: Unified Multimodal Modeling by Multi-Task Synergy
cs.CV 2026-05 unverdicted novelty 6.0

Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keepin...
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
cs.CV 2026-05 unverdicted novelty 6.0

V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...
STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation
cs.CV 2026-05 unverdicted novelty 6.0

STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.
Lance: Unified Multimodal Modeling by Multi-Task Synergy
cs.CV 2026-05 unverdicted novelty 5.0

Lance introduces a dual-stream MoE model with modality-aware rotary positional encoding and staged multi-task training that outperforms open-source unified models on image and video generation while retaining understa...
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
cs.CV 2026-05 unverdicted novelty 5.0

SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · cited by 4 Pith papers · 29 internal anchors

[1]

Ming-flash-omni: A sparse, unified architecture for multimodal perception and generation.arXiv preprint arXiv:2510.24821,

Inclusion AI, Bowen Ma, Cheng Zou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Chenyu Lian, Dandan Zheng, Fudong Wang, Furong Xu, et al. Ming-flash-omni: A sparse, unified architecture for multimodal perception and generation.arXiv preprint arXiv:2510.24821,

work page arXiv
[2]

LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Chunsheng Wu, et al. Llava-onevision-1.5: Fully open framework for democratized multimodal training.arXiv preprint arXiv:2509.23661,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Vggrpo: Towards world-consistent video generation with 4d latent reward, 2026

Zhaochong An, Menglin Jia, Haonan Qiu, Zijian Zhou, Xiaoke Huang, Zhiheng Liu, Weiming Ren, Kumara Kahatapitiya, Ding Liu, Sen He, et al. Onestory: Coherent multi-shot video generation with adaptive memory.CVPR, 2026a. Zhaochong An, Orest Kupyn, Théo Uscidda, Andrea Colaco, Karan Ahuja, Serge Belongie, Mar Gonzalez-Franco, and Marta Tintore Gazulla. Vggrp...

work page arXiv
[4]

Qwen Technical Report

Accessed: 2026-04-24. Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609,

work page internal anchor Pith review Pith/arXiv arXiv 2026
[5]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025a. Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv p...

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Bie, T., Cao, M., Chen, K., Du, L., Gong, M., Gong, Z., Gu, Y ., Hu, J., Huang, Z., Lan, Z., et al

Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sağnak Taşırlar. Fuyu-8b: A multimodal architecture for ai agents, October 2023.https://www.adept.ai/blog/fuyu-8b/. Adept blog post. Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabw...

work page arXiv 2023
[7]

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, et al. Z-image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025a. Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping ...

work page internal anchor Pith review Pith/arXiv arXiv
[9]

arXiv preprint arXiv:2504.07963 (2025)

Shoufa Chen, Chongjian Ge, Shilong Zhang, Peize Sun, and Ping Luo. Pixelflow: Pixel-space generative models with flow.arXiv preprint arXiv:2504.07963, 2025b. 14 Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv pr...

work page arXiv
[10]

Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, et al. Emu3. 5: Native multimodal models are world learners.arXiv preprint arXiv:2510.26583,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

From pixels to words–towards native vision-language primitives at scale.arXiv preprint arXiv:2510.14979, 2025a

Haiwen Diao, Mingxuan Li, Silei Wu, Linjun Dai, Xiaohua Wang, Hanming Deng, Lewei Lu, Dahua Lin, and Ziwei Liu. From pixels to words–towards native vision-language primitives at scale.arXiv preprint arXiv:2510.14979,

work page arXiv
[13]

Unified autoregressive visual generation and understanding with continuous tokens

Lijie Fan, Luming Tang, Siyang Qin, Tianhong Li, Xuan Yang, Siyuan Qiao, Andreas Steiner, Chen Sun, Yuanzhen Li, Tao Zhu, et al. Unified autoregressive visual generation and understanding with continuous tokens.arXiv preprint arXiv:2503.13436, 2025a. Weichen Fan, Haiwen Diao, Quan Wang, Dahua Lin, and Ziwei Liu. The prism hypothesis: Harmonizing semantic ...

work page arXiv
[14]

X-omni: Reinforcement learning makes discrete autoregressive image generative models great again, 2025

Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, et al. X-omni: Reinforcement learning makes discrete autoregressive image generative models great again.arXiv preprint arXiv:2507.22058,

work page arXiv
[15]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Vision as a dialect: Unifying visual understanding and generation via text-aligned representations.arXiv preprint arXiv:2506.18898,

Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, and Lu Jiang. Vision as a dialect: Unifying visual understanding and generation via text-aligned representations.arXiv preprint arXiv:2506.18898,

work page arXiv
[17]

Uni-x: Mitigating modality conflict with a two-end- separated architecture for unified multimodal models.arXiv preprint arXiv:2509.24365,

Jitai Hao, Hao Liu, Xinyan Xiao, Qiang Huang, and Jun Yu. Uni-x: Mitigating modality conflict with a two-end- separated architecture for unified multimodal models.arXiv preprint arXiv:2509.24365,

work page arXiv
[18]

Emma: Efficient multimodal understanding, generation, and editing with a unified architecture.arXiv preprint arXiv:2512.04810, 2025

Xin He, Longhui Wei, Jianbo Ouyang, Minghui Liao, Lingxi Xie, and Qi Tian. Emma: Efficient multimodal understanding, generation, and editing with a unified architecture.arXiv preprint arXiv:2512.04810,

work page arXiv
[19]

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Ming-univision: Joint image understanding and generation with a unified continuous tokenizer.arXiv preprint arXiv:2510.06590, 2025a

Ziyuan Huang, DanDan Zheng, Cheng Zou, Rui Liu, Xiaolong Wang, Kaixiang Ji, Weilong Chai, Jianxin Sun, Libin Wang, Yongjie Lv, Taozhi Huang, Jiajia Liu, Qingpei Guo, Ming Yang, Jingdong Chen, and Jun Zhou. Ming-univision: Joint image understanding and generation with a unified continuous tokenizer.arXiv preprint arXiv:2510.06590, 2025a. Ziyuan Huang, DanD...

work page arXiv
[21]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024a. Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, and Ying Shan. Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich...

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Back to Basics: Let Denoising Generative Models Denoise

Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise.arXiv preprint arXiv:2511.13720,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, et al. Videochat-flash: Hierarchical compression for long-context video modeling.arXiv preprint arXiv:2501.00574, 2024c. Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, and Weilin...

work page internal anchor Pith review Pith/arXiv arXiv
[24]

UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147, 2025a. Haokun Lin, Teng Wang, Yixiao Ge, Yuying Ge, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun, and ...

work page internal anchor Pith review Pith/arXiv arXiv
[25]

Tuna: Taming unified visual representations for native unified multimodal models

16 Zhiheng Liu, Weiming Ren, Haozhe Liu, Zijian Zhou, Shoufa Chen, Haonan Qiu, Xiaoke Huang, Zhaochong An, Fanny Yang, Aditya Patel, et al. Tuna: Taming unified visual representations for native unified multimodal models.arXiv preprint arXiv:2512.02014,

work page arXiv
[26]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101,

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Unitok: A unified tokenizer for visual generation and understanding.arXiv preprint arXiv:2502.20321, 2025a

Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, and Xiaojuan Qi. Unitok: A unified tokenizer for visual generation and understanding.arXiv preprint arXiv:2502.20321, 2025a. Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, et al. Janusflow: Harmonizin...

work page arXiv 2022
[28]

Does understanding inform generation in unified multimodal models? from analysis to path forward.arXiv preprint arXiv:2511.20561,

Yuwei Niu, Weiyang Jin, Jiaqi Liao, Chaoran Feng, Peng Jin, Bin Lin, Zongjian Li, Bin Zhu, Weihao Yu, and Li Yuan. Does understanding inform generation in unified multimodal models? from analysis to path forward.arXiv preprint arXiv:2511.20561,

work page arXiv
[29]

Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel

Accessed: 2026-04-24. Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel. Teaching clip to count to ten. InProceedings of the IEEE/CVF international conference on computer vision, pages 3170–3180,

work page 2026
[30]

Histream: Efficient high-resolution video generation via redundancy-eliminated streaming.arXiv preprint arXiv:2512.21338,

Haonan Qiu, Shikun Liu, Zijian Zhou, Zhaochong An, Weiming Ren, Zhiheng Liu, Jonas Schult, Sen He, Shoufa Chen, Yuren Cong, et al. Histream: Efficient high-resolution video generation via redundancy-eliminated streaming.arXiv preprint arXiv:2512.21338,

work page arXiv
[31]

Mammothmoda2: A unified ar-diffusion framework for multimodal understanding and generation

Tao Shen, Xin Wan, Taicai Chen, Rui Zhang, Junwen Pan, Dawei Lu, Fanding Lei, Zhilin Lu, Yunfei Yang, Chen Cheng, et al. Mammothmoda2: A unified ar-diffusion framework for multimodal understanding and generation. arXiv preprint arXiv:2511.18262,

work page arXiv
[32]

Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Shuangyong Song, Yunhai Tong, Xiangtai Li, Xuelong Li, et al. Muddit: Liberating generation beyond text-to-image with a unified discrete diffusion model.arXiv preprint arXiv:2505.23606,

work page internal anchor Pith review Pith/arXiv arXiv
[33]

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525,

work page internal anchor Pith review Pith/arXiv arXiv
[34]

Unilip: Adapting clip for unified multimodal understanding, generation and editing.arXiv preprint arXiv:2507.23278, 2025

17 Hao Tang, Chenwei Xie, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, and Liwei Wang. Unilip: Adapting clip for unified multimodal understanding, generation and editing.arXiv preprint arXiv:2507.23278,

work page arXiv
[35]

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818,

work page internal anchor Pith review Pith/arXiv arXiv
[36]

LongCat-Image Technical Report

Meituan LongCat Team, Hanghang Ma, Haoxian Tan, Jiale Huang, Junqiang Wu, Jun-Yan He, Lishuai Gao, Songlin Xiao, Xiaoming Wei, Xiaoqi Ma, et al. Longcat-image technical report.arXiv preprint arXiv:2512.07584,

work page internal anchor Pith review Pith/arXiv arXiv
[37]

Internvl-u: Democratizing unified multimodal models for understanding, reasoning, generation and editing

Changyao Tian, Danni Yang, Guanzhou Chen, Erfei Cui, Zhaokai Wang, Yuchen Duan, Penghao Yin, Sitao Chen, Ganlin Yang, Mingxin Liu, et al. Internvl-u: Democratizing unified multimodal models for understanding, reasoning, generation and editing.arXiv preprint arXiv:2603.09877,

work page arXiv
[38]

Beyond language modeling: An exploration of multimodal pretraining

Shengbang Tong, David Fan, John Nguyen, Ellis Brown, Gaoyue Zhou, Shengyi Qian, Boyang Zheng, Théophane Vallaeys, Junlin Han, Rob Fergus, et al. Beyond language modeling: An exploration of multimodal pretraining. arXiv preprint arXiv:2603.03276,

work page arXiv
[39]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786,

work page internal anchor Pith review Pith/arXiv arXiv
[40]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314,

work page internal anchor Pith review Pith/arXiv arXiv
[41]

Ovis-u1 technical report.arXiv preprint arXiv:2506.23044, 2025

Guo-Hua Wang, Shanshan Zhao, Xinjie Zhang, Liangfu Cao, Pengxin Zhan, Lunhao Duan, Shiyin Lu, Minghao Fu, Xiaohao Chen, Jianshan Zhao, et al. Ovis-u1 technical report.arXiv preprint arXiv:2506.23044, 2025a. Han Wang, Yongjie Ye, Bingru Li, Yuxiang Nie, Jinghui Lu, Jingqun Tang, Yanjie Wang, and Can Huang. Vision as lora.arXiv preprint arXiv:2503.20680, 20...

work page arXiv
[42]

Univideo: Unified understanding, generation, and editing for videos,

Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, and Wenhu Chen. Univideo: Unified understanding, generation, and editing for videos.arXiv preprint arXiv:2510.08377, 2025a. Hongyang Wei, Baixin Xu, Hongbo Liu, Size Wu, Jie Liu, Yi Peng, Peiyu Wang, Zexiang Liu, Jingwen He, Yidan Xietian, et al. Skywork unipic 2.0: Building ...

work page arXiv
[43]

Qwen-Image Technical Report

18 Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025a. Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. Omnigen2: Exploration to advanc...

work page internal anchor Pith review Pith/arXiv arXiv
[44]

Openuni: A simple baseline for unified multimodal understanding and generation.arXiv preprint arXiv:2505.23661, 2025b

Size Wu, Zhonghua Wu, Zerui Gong, Qingyi Tao, Sheng Jin, Qinyue Li, Wei Li, and Chen Change Loy. Openuni: A simple baseline for unified multimodal understanding and generation.arXiv preprint arXiv:2505.23661, 2025c. Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Zhonghua Wu, Qingyi Tao, Wentao Liu, Wei Li, and Chen Change Loy. Harmonizing visual representati...

work page arXiv
[45]

"Your output must be a single JSON object.\n\n

Ji Xie, Trevor Darrell, Luke Zettlemoyer, and XuDong Wang. Reconstruction alignment improves unified multimodal models.arXiv preprint arXiv:2509.07295, 2025a. Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimod...

work page arXiv
[46]

Show-o2: Improved Native Unified Multimodal Models

Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025b. Rongchang Xie, Chen Du, Ping Song, and Chang Liu. Muse-vl: Modeling unified vlm through semantic discrete encoding. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 24135–24146, 2025c. ...

work page internal anchor Pith review Pith/arXiv arXiv
[47]

Visulogic: A benchmark for evaluating visual reasoning in multi-modal large language models.arXiv preprint arXiv:2504.15279, 2025

Weiye Xu, Jiahao Wang, Weiyun Wang, Zhe Chen, Wengang Zhou, Aijun Yang, Lewei Lu, Houqiang Li, Xiaohua Wang, Xizhou Zhu, et al. Visulogic: A benchmark for evaluating visual reasoning in multi-modal large language models.arXiv preprint arXiv:2504.15279,

work page arXiv
[48]

MMaDA: Multimodal Large Diffusion Language Models

Jiawei Yang, Tianhong Li, Lijie Fan, Yonglong Tian, and Yue Wang. Latent denoising makes good tokenizers. InThe Fourteenth International Conference on Learning Representations, 2025a. Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Mmada: Multimodal large diffusion language models.arXiv preprint arXiv:2505.15809, 2025b....

work page internal anchor Pith review Pith/arXiv arXiv
[49]

ImgEdit: A Unified Image Editing Dataset and Benchmark

Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark.arXiv preprint arXiv:2505.20275,

work page internal anchor Pith review Pith/arXiv arXiv
[50]

Llada-o: An effective and length-adaptive omni diffusion model.arXiv preprint arXiv:2603.01068,

Zebin You, Xiaolu Zhang, Jun Zhou, Chongxuan Li, and Ji-Rong Wen. Llada-o: An effective and length-adaptive omni diffusion model.arXiv preprint arXiv:2603.01068,

work page arXiv
[51]

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think.arXiv preprint arXiv:2410.06940,

work page internal anchor Pith review Pith/arXiv arXiv
[52]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

19 Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities.arXiv preprint arXiv:2308.02490,

work page internal anchor Pith review Pith/arXiv arXiv
[53]

PixelDiT: Pixel Diffusion Transformers for Image Generation

Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, and Jiebo Luo. Pixeldit: Pixel diffusion transformers for image generation.arXiv preprint arXiv:2511.20645,

work page internal anchor Pith review Pith/arXiv arXiv
[54]

Uniflow: A unified pixel flow tokenizer for visual understanding and generation.arXiv preprint arXiv:2510.10575,

Zhengrong Yue, Haiyu Zhang, Xiangyu Zeng, Boyu Chen, Chenting Wang, Shaobin Zhuang, Lu Dong, KunPeng Du, Yi Wang, Limin Wang, et al. Uniflow: A unified pixel flow tokenizer for visual understanding and generation.arXiv preprint arXiv:2510.10575,

work page arXiv
[55]

Penguin-vl: Exploring the efficiency limits of vlm with llm-based vision encoders.arXiv preprint arXiv:2603.06569, 2026a

Boqiang Zhang, Lei Ke, Ruihan Yang, Qi Gao, Tianyuan Qu, Rossell Chen, Dong Yu, et al. Penguin-vl: Exploring the efficiency limits of vlm with llm-based vision encoders.arXiv preprint arXiv:2603.06569, 2026a. Huichao Zhang, Liao Qu, Yiheng Liu, Hang Chen, Yangyang Song, Yongsheng Dong, Shikun Sun, Xian Li, Xu Wang, Yi Jiang, et al. Nextflow: Unified seque...

work page arXiv
[56]

Diffusion Transformers with Representation Autoencoders

Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690,

work page internal anchor Pith review Pith/arXiv arXiv
[57]

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model.arXiv preprint arXiv:2408.11039,

work page internal anchor Pith review Pith/arXiv arXiv
[58]

Scaling zero-shot reference-to-video generation.arXiv preprint arXiv:2512.06905,

Zijian Zhou, Shikun Liu, Haozhe Liu, Haonan Qiu, Zhaochong An, Weiming Ren, Zhiheng Liu, Xiaoke Huang, Kam Woh Ng, Tian Xie, et al. Scaling zero-shot reference-to-video generation.arXiv preprint arXiv:2512.06905,

work page arXiv

[1] [1]

Ming-flash-omni: A sparse, unified architecture for multimodal perception and generation.arXiv preprint arXiv:2510.24821,

Inclusion AI, Bowen Ma, Cheng Zou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Chenyu Lian, Dandan Zheng, Fudong Wang, Furong Xu, et al. Ming-flash-omni: A sparse, unified architecture for multimodal perception and generation.arXiv preprint arXiv:2510.24821,

work page arXiv

[2] [2]

LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Chunsheng Wu, et al. Llava-onevision-1.5: Fully open framework for democratized multimodal training.arXiv preprint arXiv:2509.23661,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Vggrpo: Towards world-consistent video generation with 4d latent reward, 2026

Zhaochong An, Menglin Jia, Haonan Qiu, Zijian Zhou, Xiaoke Huang, Zhiheng Liu, Weiming Ren, Kumara Kahatapitiya, Ding Liu, Sen He, et al. Onestory: Coherent multi-shot video generation with adaptive memory.CVPR, 2026a. Zhaochong An, Orest Kupyn, Théo Uscidda, Andrea Colaco, Karan Ahuja, Serge Belongie, Mar Gonzalez-Franco, and Marta Tintore Gazulla. Vggrp...

work page arXiv

[4] [4]

Qwen Technical Report

Accessed: 2026-04-24. Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609,

work page internal anchor Pith review Pith/arXiv arXiv 2026

[5] [5]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025a. Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv p...

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Bie, T., Cao, M., Chen, K., Du, L., Gong, M., Gong, Z., Gu, Y ., Hu, J., Huang, Z., Lan, Z., et al

Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sağnak Taşırlar. Fuyu-8b: A multimodal architecture for ai agents, October 2023.https://www.adept.ai/blog/fuyu-8b/. Adept blog post. Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabw...

work page arXiv 2023

[7] [7]

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, et al. Z-image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025a. Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping ...

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

arXiv preprint arXiv:2504.07963 (2025)

Shoufa Chen, Chongjian Ge, Shilong Zhang, Peize Sun, and Ping Luo. Pixelflow: Pixel-space generative models with flow.arXiv preprint arXiv:2504.07963, 2025b. 14 Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv pr...

work page arXiv

[10] [10]

Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, et al. Emu3. 5: Native multimodal models are world learners.arXiv preprint arXiv:2510.26583,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

From pixels to words–towards native vision-language primitives at scale.arXiv preprint arXiv:2510.14979, 2025a

Haiwen Diao, Mingxuan Li, Silei Wu, Linjun Dai, Xiaohua Wang, Hanming Deng, Lewei Lu, Dahua Lin, and Ziwei Liu. From pixels to words–towards native vision-language primitives at scale.arXiv preprint arXiv:2510.14979,

work page arXiv

[13] [13]

Unified autoregressive visual generation and understanding with continuous tokens

Lijie Fan, Luming Tang, Siyang Qin, Tianhong Li, Xuan Yang, Siyuan Qiao, Andreas Steiner, Chen Sun, Yuanzhen Li, Tao Zhu, et al. Unified autoregressive visual generation and understanding with continuous tokens.arXiv preprint arXiv:2503.13436, 2025a. Weichen Fan, Haiwen Diao, Quan Wang, Dahua Lin, and Ziwei Liu. The prism hypothesis: Harmonizing semantic ...

work page arXiv

[14] [14]

X-omni: Reinforcement learning makes discrete autoregressive image generative models great again, 2025

Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, et al. X-omni: Reinforcement learning makes discrete autoregressive image generative models great again.arXiv preprint arXiv:2507.22058,

work page arXiv

[15] [15]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

Vision as a dialect: Unifying visual understanding and generation via text-aligned representations.arXiv preprint arXiv:2506.18898,

Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, and Lu Jiang. Vision as a dialect: Unifying visual understanding and generation via text-aligned representations.arXiv preprint arXiv:2506.18898,

work page arXiv

[17] [17]

Uni-x: Mitigating modality conflict with a two-end- separated architecture for unified multimodal models.arXiv preprint arXiv:2509.24365,

Jitai Hao, Hao Liu, Xinyan Xiao, Qiang Huang, and Jun Yu. Uni-x: Mitigating modality conflict with a two-end- separated architecture for unified multimodal models.arXiv preprint arXiv:2509.24365,

work page arXiv

[18] [18]

Emma: Efficient multimodal understanding, generation, and editing with a unified architecture.arXiv preprint arXiv:2512.04810, 2025

Xin He, Longhui Wei, Jianbo Ouyang, Minghui Liao, Lingxi Xie, and Qi Tian. Emma: Efficient multimodal understanding, generation, and editing with a unified architecture.arXiv preprint arXiv:2512.04810,

work page arXiv

[19] [19]

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135,

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

Ming-univision: Joint image understanding and generation with a unified continuous tokenizer.arXiv preprint arXiv:2510.06590, 2025a

Ziyuan Huang, DanDan Zheng, Cheng Zou, Rui Liu, Xiaolong Wang, Kaixiang Ji, Weilong Chai, Jianxin Sun, Libin Wang, Yongjie Lv, Taozhi Huang, Jiajia Liu, Qingpei Guo, Ming Yang, Jingdong Chen, and Jun Zhou. Ming-univision: Joint image understanding and generation with a unified continuous tokenizer.arXiv preprint arXiv:2510.06590, 2025a. Ziyuan Huang, DanD...

work page arXiv

[21] [21]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024a. Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, and Ying Shan. Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich...

work page internal anchor Pith review Pith/arXiv arXiv

[22] [22]

Back to Basics: Let Denoising Generative Models Denoise

Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise.arXiv preprint arXiv:2511.13720,

work page internal anchor Pith review Pith/arXiv arXiv

[23] [23]

VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, et al. Videochat-flash: Hierarchical compression for long-context video modeling.arXiv preprint arXiv:2501.00574, 2024c. Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, and Weilin...

work page internal anchor Pith review Pith/arXiv arXiv

[24] [24]

UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147, 2025a. Haokun Lin, Teng Wang, Yixiao Ge, Yuying Ge, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun, and ...

work page internal anchor Pith review Pith/arXiv arXiv

[25] [25]

Tuna: Taming unified visual representations for native unified multimodal models

16 Zhiheng Liu, Weiming Ren, Haozhe Liu, Zijian Zhou, Shoufa Chen, Haonan Qiu, Xiaoke Huang, Zhaochong An, Fanny Yang, Aditya Patel, et al. Tuna: Taming unified visual representations for native unified multimodal models.arXiv preprint arXiv:2512.02014,

work page arXiv

[26] [26]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101,

work page internal anchor Pith review Pith/arXiv arXiv

[27] [27]

Unitok: A unified tokenizer for visual generation and understanding.arXiv preprint arXiv:2502.20321, 2025a

Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, and Xiaojuan Qi. Unitok: A unified tokenizer for visual generation and understanding.arXiv preprint arXiv:2502.20321, 2025a. Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, et al. Janusflow: Harmonizin...

work page arXiv 2022

[28] [28]

Does understanding inform generation in unified multimodal models? from analysis to path forward.arXiv preprint arXiv:2511.20561,

Yuwei Niu, Weiyang Jin, Jiaqi Liao, Chaoran Feng, Peng Jin, Bin Lin, Zongjian Li, Bin Zhu, Weihao Yu, and Li Yuan. Does understanding inform generation in unified multimodal models? from analysis to path forward.arXiv preprint arXiv:2511.20561,

work page arXiv

[29] [29]

Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel

Accessed: 2026-04-24. Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel. Teaching clip to count to ten. InProceedings of the IEEE/CVF international conference on computer vision, pages 3170–3180,

work page 2026

[30] [30]

Histream: Efficient high-resolution video generation via redundancy-eliminated streaming.arXiv preprint arXiv:2512.21338,

Haonan Qiu, Shikun Liu, Zijian Zhou, Zhaochong An, Weiming Ren, Zhiheng Liu, Jonas Schult, Sen He, Shoufa Chen, Yuren Cong, et al. Histream: Efficient high-resolution video generation via redundancy-eliminated streaming.arXiv preprint arXiv:2512.21338,

work page arXiv

[31] [31]

Mammothmoda2: A unified ar-diffusion framework for multimodal understanding and generation

Tao Shen, Xin Wan, Taicai Chen, Rui Zhang, Junwen Pan, Dawei Lu, Fanding Lei, Zhilin Lu, Yunfei Yang, Chen Cheng, et al. Mammothmoda2: A unified ar-diffusion framework for multimodal understanding and generation. arXiv preprint arXiv:2511.18262,

work page arXiv

[32] [32]

Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Shuangyong Song, Yunhai Tong, Xiangtai Li, Xuelong Li, et al. Muddit: Liberating generation beyond text-to-image with a unified discrete diffusion model.arXiv preprint arXiv:2505.23606,

work page internal anchor Pith review Pith/arXiv arXiv

[33] [33]

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525,

work page internal anchor Pith review Pith/arXiv arXiv

[34] [34]

Unilip: Adapting clip for unified multimodal understanding, generation and editing.arXiv preprint arXiv:2507.23278, 2025

17 Hao Tang, Chenwei Xie, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, and Liwei Wang. Unilip: Adapting clip for unified multimodal understanding, generation and editing.arXiv preprint arXiv:2507.23278,

work page arXiv

[35] [35]

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818,

work page internal anchor Pith review Pith/arXiv arXiv

[36] [36]

LongCat-Image Technical Report

Meituan LongCat Team, Hanghang Ma, Haoxian Tan, Jiale Huang, Junqiang Wu, Jun-Yan He, Lishuai Gao, Songlin Xiao, Xiaoming Wei, Xiaoqi Ma, et al. Longcat-image technical report.arXiv preprint arXiv:2512.07584,

work page internal anchor Pith review Pith/arXiv arXiv

[37] [37]

Internvl-u: Democratizing unified multimodal models for understanding, reasoning, generation and editing

Changyao Tian, Danni Yang, Guanzhou Chen, Erfei Cui, Zhaokai Wang, Yuchen Duan, Penghao Yin, Sitao Chen, Ganlin Yang, Mingxin Liu, et al. Internvl-u: Democratizing unified multimodal models for understanding, reasoning, generation and editing.arXiv preprint arXiv:2603.09877,

work page arXiv

[38] [38]

Beyond language modeling: An exploration of multimodal pretraining

Shengbang Tong, David Fan, John Nguyen, Ellis Brown, Gaoyue Zhou, Shengyi Qian, Boyang Zheng, Théophane Vallaeys, Junlin Han, Rob Fergus, et al. Beyond language modeling: An exploration of multimodal pretraining. arXiv preprint arXiv:2603.03276,

work page arXiv

[39] [39]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786,

work page internal anchor Pith review Pith/arXiv arXiv

[40] [40]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314,

work page internal anchor Pith review Pith/arXiv arXiv

[41] [41]

Ovis-u1 technical report.arXiv preprint arXiv:2506.23044, 2025

Guo-Hua Wang, Shanshan Zhao, Xinjie Zhang, Liangfu Cao, Pengxin Zhan, Lunhao Duan, Shiyin Lu, Minghao Fu, Xiaohao Chen, Jianshan Zhao, et al. Ovis-u1 technical report.arXiv preprint arXiv:2506.23044, 2025a. Han Wang, Yongjie Ye, Bingru Li, Yuxiang Nie, Jinghui Lu, Jingqun Tang, Yanjie Wang, and Can Huang. Vision as lora.arXiv preprint arXiv:2503.20680, 20...

work page arXiv

[42] [42]

Univideo: Unified understanding, generation, and editing for videos,

Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, and Wenhu Chen. Univideo: Unified understanding, generation, and editing for videos.arXiv preprint arXiv:2510.08377, 2025a. Hongyang Wei, Baixin Xu, Hongbo Liu, Size Wu, Jie Liu, Yi Peng, Peiyu Wang, Zexiang Liu, Jingwen He, Yidan Xietian, et al. Skywork unipic 2.0: Building ...

work page arXiv

[43] [43]

Qwen-Image Technical Report

18 Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025a. Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. Omnigen2: Exploration to advanc...

work page internal anchor Pith review Pith/arXiv arXiv

[44] [44]

Openuni: A simple baseline for unified multimodal understanding and generation.arXiv preprint arXiv:2505.23661, 2025b

Size Wu, Zhonghua Wu, Zerui Gong, Qingyi Tao, Sheng Jin, Qinyue Li, Wei Li, and Chen Change Loy. Openuni: A simple baseline for unified multimodal understanding and generation.arXiv preprint arXiv:2505.23661, 2025c. Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Zhonghua Wu, Qingyi Tao, Wentao Liu, Wei Li, and Chen Change Loy. Harmonizing visual representati...

work page arXiv

[45] [45]

"Your output must be a single JSON object.\n\n

Ji Xie, Trevor Darrell, Luke Zettlemoyer, and XuDong Wang. Reconstruction alignment improves unified multimodal models.arXiv preprint arXiv:2509.07295, 2025a. Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimod...

work page arXiv

[46] [46]

Show-o2: Improved Native Unified Multimodal Models

Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025b. Rongchang Xie, Chen Du, Ping Song, and Chang Liu. Muse-vl: Modeling unified vlm through semantic discrete encoding. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 24135–24146, 2025c. ...

work page internal anchor Pith review Pith/arXiv arXiv

[47] [47]

Visulogic: A benchmark for evaluating visual reasoning in multi-modal large language models.arXiv preprint arXiv:2504.15279, 2025

Weiye Xu, Jiahao Wang, Weiyun Wang, Zhe Chen, Wengang Zhou, Aijun Yang, Lewei Lu, Houqiang Li, Xiaohua Wang, Xizhou Zhu, et al. Visulogic: A benchmark for evaluating visual reasoning in multi-modal large language models.arXiv preprint arXiv:2504.15279,

work page arXiv

[48] [48]

MMaDA: Multimodal Large Diffusion Language Models

Jiawei Yang, Tianhong Li, Lijie Fan, Yonglong Tian, and Yue Wang. Latent denoising makes good tokenizers. InThe Fourteenth International Conference on Learning Representations, 2025a. Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Mmada: Multimodal large diffusion language models.arXiv preprint arXiv:2505.15809, 2025b....

work page internal anchor Pith review Pith/arXiv arXiv

[49] [49]

ImgEdit: A Unified Image Editing Dataset and Benchmark

Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark.arXiv preprint arXiv:2505.20275,

work page internal anchor Pith review Pith/arXiv arXiv

[50] [50]

Llada-o: An effective and length-adaptive omni diffusion model.arXiv preprint arXiv:2603.01068,

Zebin You, Xiaolu Zhang, Jun Zhou, Chongxuan Li, and Ji-Rong Wen. Llada-o: An effective and length-adaptive omni diffusion model.arXiv preprint arXiv:2603.01068,

work page arXiv

[51] [51]

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think.arXiv preprint arXiv:2410.06940,

work page internal anchor Pith review Pith/arXiv arXiv

[52] [52]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

19 Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities.arXiv preprint arXiv:2308.02490,

work page internal anchor Pith review Pith/arXiv arXiv

[53] [53]

PixelDiT: Pixel Diffusion Transformers for Image Generation

Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, and Jiebo Luo. Pixeldit: Pixel diffusion transformers for image generation.arXiv preprint arXiv:2511.20645,

work page internal anchor Pith review Pith/arXiv arXiv

[54] [54]

Uniflow: A unified pixel flow tokenizer for visual understanding and generation.arXiv preprint arXiv:2510.10575,

Zhengrong Yue, Haiyu Zhang, Xiangyu Zeng, Boyu Chen, Chenting Wang, Shaobin Zhuang, Lu Dong, KunPeng Du, Yi Wang, Limin Wang, et al. Uniflow: A unified pixel flow tokenizer for visual understanding and generation.arXiv preprint arXiv:2510.10575,

work page arXiv

[55] [55]

Penguin-vl: Exploring the efficiency limits of vlm with llm-based vision encoders.arXiv preprint arXiv:2603.06569, 2026a

Boqiang Zhang, Lei Ke, Ruihan Yang, Qi Gao, Tianyuan Qu, Rossell Chen, Dong Yu, et al. Penguin-vl: Exploring the efficiency limits of vlm with llm-based vision encoders.arXiv preprint arXiv:2603.06569, 2026a. Huichao Zhang, Liao Qu, Yiheng Liu, Hang Chen, Yangyang Song, Yongsheng Dong, Shikun Sun, Xian Li, Xu Wang, Yi Jiang, et al. Nextflow: Unified seque...

work page arXiv

[56] [56]

Diffusion Transformers with Representation Autoencoders

Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690,

work page internal anchor Pith review Pith/arXiv arXiv

[57] [57]

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model.arXiv preprint arXiv:2408.11039,

work page internal anchor Pith review Pith/arXiv arXiv

[58] [58]

Scaling zero-shot reference-to-video generation.arXiv preprint arXiv:2512.06905,

Zijian Zhou, Shikun Liu, Haozhe Liu, Haonan Qiu, Zhaochong An, Weiming Ren, Zhiheng Liu, Xiaoke Huang, Kam Woh Ng, Tian Xie, et al. Scaling zero-shot reference-to-video generation.arXiv preprint arXiv:2512.06905,

work page arXiv