Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

Chenfei Wu; Junyang Lin; Lingchen Meng; Rongyao Fang; Shuai Bai; Wujian Peng; Xianwei Zhuang; Yuhuan Yang; Yuxuan Cai; Zuxuan Wu

arxiv: 2606.18249 · v2 · pith:WABYITTYnew · submitted 2026-06-16 · 💻 cs.CV

Wujian Peng, Lingchen Meng, Yuxuan Cai, Xianwei Zhuang, Yuhuan Yang, Rongyao Fang, Chenfei Wu, Junyang Lin, Zuxuan Wu, Shuai Bai

Pith reviewed 2026-06-27 01:15 UTC · model grok-4.3

classification 💻 cs.CV

keywords unified multimodal modeling · autoregressive modeling · visual tokenizer · image generation · image editing · multimodal understanding

The pith

A single discrete visual tokenizer unifies understanding and generation by letting an autoregressive model read its own outputs in shared context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that replacing two separate visual tokenizers with one discrete tokenizer removes the split in representation space and allows true unification inside a single autoregressive model. The tokenizer creates a shared context so the model can interpret the visual tokens it just generated without any extra re-encoding step. This matters to a reader because separate tokenizers have forced multimodal systems to treat understanding and generation as disconnected processes. The construction adapts a pretrained vision encoder through multi-level feature fusion and lookup-free bitwise quantization to retain both semantic and detailed information while keeping the vocabulary size manageable.

Core claim

UniAR shows that a single discrete visual tokenizer, obtained by adapting a pretrained vision encoder with multi-level feature fusion and lookup-free bitwise quantization, serves as the bridge between understanding and generation. This tokenizer supplies a shared context in which the autoregressive model directly consumes its own generated visual tokens. Parallel-bitwise-prediction of spatially grouped multi-level codes shortens the visual sequence, and a diffusion decoder converts the discrete tokens into images. Large-scale pre-training followed by supervised fine-tuning and reinforcement learning produces state-of-the-art image generation and editing results while remaining competitive on multimodal understanding benchmarks.
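To make "lookup-free bitwise quantization" concrete, the sketch below shows the general idea as it appears in prior lookup-free quantizers: each latent channel is binarized by its sign, so a d-channel latent maps to one of 2^d implicit codes without a learned codebook. The function names and the 4-channel example are illustrative assumptions; the abstract does not specify UniAR's exact quantizer.

```python
from typing import List


def lfq_quantize(latent: List[float]) -> List[int]:
    """Binarize each channel by sign (assumed rule): positive -> 1, otherwise 0."""
    return [1 if value > 0 else 0 for value in latent]


def bits_to_index(bits: List[int]) -> int:
    """Pack bits into the implicit token index, most significant bit first."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return index


def index_to_bits(index: int, num_bits: int) -> List[int]:
    """Inverse of bits_to_index: recover the bit pattern from a token index."""
    return [(index >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]


# Hypothetical 4-channel latent for one spatial position.
latent = [0.7, -1.2, 0.1, -0.3]
bits = lfq_quantize(latent)        # [1, 0, 1, 0]
index = bits_to_index(bits)        # 10
vocab_size = 2 ** len(latent)      # 16 implicit codes, no codebook stored
```

Because the vocabulary is implicit, adding one channel doubles the number of codes while adding only one bit to predict, which is one plausible reading of "scaling the effective visual vocabulary at minimal cost".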

What carries the argument

The single discrete visual tokenizer that supplies shared context so generated visual tokens can be interpreted directly without re-encoding.

If this is right

The model performs both image understanding and image generation inside one autoregressive sequence without switching token spaces.
Parallel-bitwise-prediction of multi-level codes reduces visual sequence length and speeds up generation.
A diffusion decoder converts the discrete tokens into high-fidelity images after autoregressive prediction.
After pre-training, fine-tuning, and reinforcement learning the system reaches top results on generation and editing while staying competitive on understanding tasks.
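The sequence-length consequence can be checked with simple arithmetic. The sketch below assumes that "spatially grouped" means each group_size x group_size block of token positions is emitted in a single autoregressive step; the grid size, number of levels, and bits per code are hypothetical values, not figures from the paper.

```python
def autoregressive_steps(grid_height: int, grid_width: int, group_size: int) -> int:
    """Steps needed when each group_size x group_size block is predicted at once."""
    if grid_height % group_size or grid_width % group_size:
        raise ValueError("grid dimensions must be divisible by group_size")
    return (grid_height // group_size) * (grid_width // group_size)


def bits_per_step(group_size: int, levels: int, bits_per_code: int) -> int:
    """Bits predicted in parallel at one step under the grouping assumption."""
    return group_size * group_size * levels * bits_per_code


# Hypothetical 32 x 32 token grid.
baseline = autoregressive_steps(32, 32, group_size=1)   # 1024 steps
grouped = autoregressive_steps(32, 32, group_size=2)    # 256 steps
parallel_bits = bits_per_step(group_size=2, levels=2, bits_per_code=16)  # 128 bits
```

Under these assumptions the number of sequential steps drops by a factor of group_size squared, while the per-step output grows by the same factor.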

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The shared-tokenizer design could allow an autoregressive model to chain understanding steps directly into generation steps inside a single forward pass.
The bitwise quantization approach may extend to other modalities such as video or audio for similar unification gains.
Because the tokenizer is derived from an existing vision encoder, the method may transfer to new backbone architectures with only modest retraining.

Load-bearing premise

Adapting one pretrained vision encoder with multi-level feature fusion and lookup-free bitwise quantization preserves enough high-level semantics and low-level details to support both understanding and generation tasks at once.
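One common way to realize multi-level feature fusion is to concatenate the features of a spatial position taken from several encoder depths and then project them linearly. The sketch below illustrates only that generic pattern; the function name, the two-level example, and the projection weights are assumptions, and the paper may fuse levels differently.

```python
from typing import List, Sequence

Vector = List[float]


def fuse_levels(level_features: Sequence[Vector], projection: Sequence[Vector]) -> Vector:
    """Concatenate per-level features, then apply a linear projection (no bias)."""
    concatenated = [value for features in level_features for value in features]
    for row in projection:
        if len(row) != len(concatenated):
            raise ValueError("each projection row must match the concatenated width")
    return [sum(w * x for w, x in zip(row, concatenated)) for row in projection]


# Hypothetical features for one position from a shallow and a deep layer.
shallow = [1.0, 2.0]   # stands in for low-level detail
deep = [3.0, 4.0]      # stands in for high-level semantics
projection = [
    [1.0, 0.0, 0.0, 0.0],     # passes a shallow channel through
    [0.0, 0.0, 0.0, 1.0],     # passes a deep channel through
    [0.5, 0.5, -0.5, -0.5],   # mixes both levels
]
fused = fuse_levels([shallow, deep], projection)   # [1.0, 4.0, -2.0]
```

The premise under test is whether such a fused vector, once quantized, still carries enough of both kinds of information.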

What would settle it

A controlled experiment that measures whether forcing the model to re-encode its own generated tokens improves or degrades performance on image-editing benchmarks would directly test whether the shared-context benefit is real.
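The two arms of such an experiment can be sketched as follows. Everything here is a hypothetical stand-in: `decode` and `encode` are placeholders for the image decoder and the tokenizer, and the toy callables exist only to show how the two contexts are built and compared.

```python
from typing import Callable, List, Sequence

Tokens = List[int]


def shared_context(prompt: Sequence[int], generated: Sequence[int]) -> Tokens:
    """Arm A: generated visual tokens stay in the sequence unchanged."""
    return list(prompt) + list(generated)


def reencoded_context(
    prompt: Sequence[int],
    generated: Sequence[int],
    decode: Callable[[Sequence[int]], list],
    encode: Callable[[list], Sequence[int]],
) -> Tokens:
    """Arm B: decode the tokens to an image, then tokenize that image again."""
    return list(prompt) + list(encode(decode(generated)))


def token_agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of positions at which two equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    if not a:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def toy_decode(tokens: Sequence[int]) -> list:
    """Toy stand-in for the decoder: maps each token to a fake pixel value."""
    return [2 * t for t in tokens]


def toy_encode(pixels: list) -> Tokens:
    """Toy stand-in for the tokenizer: exact inverse of toy_decode."""
    return [p // 2 for p in pixels]


arm_a = shared_context([9], [1, 2, 3, 4])
arm_b = reencoded_context([9], [1, 2, 3, 4], toy_decode, toy_encode)
agreement = token_agreement(arm_a, arm_b)   # 1.0 only because the toy round trip is lossless
```

In a real study the round trip would generally not be lossless, and the quantity of interest would be downstream editing quality in each arm rather than token agreement alone.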

Original abstract

Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework where a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel-bitwise-prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on discrete visual tokens to decode high-fidelity images. Through large-scale pre-training, followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at https://sharelab-sii.github.io/uniar-web.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

UniAR's single shared tokenizer with bitwise quantization is a concrete attempt at unification, but the results look driven more by scale than by evidence that the bridge works without tradeoffs.

UniAR's core move is replacing two separate visual tokenizers with one discrete one that feeds a shared autoregressive context. The model adapts a pretrained encoder via multi-level fusion and lookup-free bitwise quantization, then uses parallel-bitwise-prediction on grouped codes to shorten sequences, and finally decodes with a diffusion model. That setup is presented as the fix for the representation split that blocks true unification.

The paper does lay out a workable pipeline and reports SOTA numbers on generation and editing after pre-training, supervised fine-tuning, and RL, while staying competitive on understanding tasks. The parallel prediction step looks like a practical engineering choice that could cut generation time.

The soft spot is the central assumption that the adapted tokenizer keeps both high-level semantics and low-level pixel details. Without clear ablations isolating the quantization's effect on fidelity versus semantics, it is hard to know whether the unification benefit comes from the shared context or just from bigger training runs. The diffusion decoder at the end also means the system is not end-to-end autoregressive in pixel space.

This is for groups already building unified AR multimodal models and looking at tokenizer choices. A reader focused on reducing encoder fragmentation might pick up the bitwise scheme or the parallel prediction trick.

Send it to peer review. The unification problem is worth referee time even if the current evidence needs tighter controls on what the single tokenizer actually contributes.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes UniAR, a unified autoregressive framework for multimodal modeling that integrates visual understanding and generation via a single discrete visual tokenizer. This tokenizer is obtained by adapting a pretrained vision encoder through multi-level feature fusion and lookup-free bitwise quantization, which the authors claim preserves both high-level semantics and low-level details. The autoregressive component employs parallel-bitwise-prediction to jointly predict spatially grouped multi-level codes, shortening sequences, while a diffusion-based decoder reconstructs images from the discrete tokens. After large-scale pre-training, supervised fine-tuning, and reinforcement learning, the model is reported to achieve state-of-the-art results on image generation and editing tasks while remaining competitive on multimodal understanding benchmarks.

Significance. If the central architectural claim holds, namely that a single tokenizer enables a shared context in which generated visual tokens can be directly interpreted without re-encoding, this would constitute a meaningful simplification over approaches that maintain separate tokenizers for understanding and generation. The bitwise quantization scheme for vocabulary scaling at low cost and the parallel prediction strategy are technically interesting contributions. The work would be strengthened by reproducible code or detailed ablation tables that isolate the tokenizer's contribution; absent those, the significance remains conditional on verifying that the adaptation step does not trade off fidelity in one domain for the other.

major comments (2)

[Tokenizer adaptation] Tokenizer adaptation section (description of multi-level fusion + lookup-free bitwise quantization): the claim that this procedure 'preserves both high-level semantics and low-level details' is load-bearing for the unification benefit. No quantitative reconstruction metrics (e.g., PSNR, LPIPS, or FID on held-out images) or semantic retention metrics (e.g., linear probing accuracy or zero-shot classification) are referenced to demonstrate that the quantization step does not introduce unacceptable loss in either regime; without such evidence the 'key bridge' property cannot be evaluated.
[Results] Results section (claims of SOTA on generation/editing): the abstract states 'state-of-the-art performance' after pre-training, SFT, and RL but supplies no baselines, dataset sizes, or table references. If the full manuscript contains only aggregate scores without controls for tokenizer quality versus model scale, the attribution of gains specifically to the shared tokenizer remains unverified.
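Of the reconstruction metrics named in the first comment, PSNR is the simplest to state: it is 10 * log10(MAX^2 / MSE) between a reference image and its reconstruction. The sketch below computes it over flat lists of pixel values; the flat-list representation and the 8-bit default for `max_value` are assumptions made for illustration.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], reconstruction: Sequence[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf   # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 4-pixel example: every pixel is off by 10 intensity levels.
score = psnr([100, 100, 100, 100], [110, 90, 110, 90])   # about 28.13 dB
```

Higher values mean a closer pixel-level match. PSNR says nothing about semantic retention, which is why the comment pairs it with probing-style metrics.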

minor comments (1)

[Abstract] The abstract mentions 'parallel-bitwise-prediction' and 'spatially grouped, multi-level visual codes' without defining the grouping or bit allocation; a short notation table or equation would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below, indicating where revisions will be made to provide additional evidence.

Referee: [Tokenizer adaptation] Tokenizer adaptation section (description of multi-level fusion + lookup-free bitwise quantization): the claim that this procedure 'preserves both high-level semantics and low-level details' is load-bearing for the unification benefit. No quantitative reconstruction metrics (e.g., PSNR, LPIPS, or FID on held-out images) or semantic retention metrics (e.g., linear probing accuracy or zero-shot classification) are referenced to demonstrate that the quantization step does not introduce unacceptable loss in either regime; without such evidence the 'key bridge' property cannot be evaluated.

Authors: We agree that quantitative metrics are required to substantiate the claim that the adaptation preserves both high-level semantics and low-level details. The manuscript describes the multi-level fusion and bitwise quantization but does not reference the requested metrics. In the revised manuscript we will add reconstruction metrics (PSNR, LPIPS, FID on held-out images) and semantic retention metrics (linear probing accuracy, zero-shot classification) to the tokenizer adaptation section. revision: yes
Referee: [Results] Results section (claims of SOTA on generation/editing): the abstract states 'state-of-the-art performance' after pre-training, SFT, and RL but supplies no baselines, dataset sizes, or table references. If the full manuscript contains only aggregate scores without controls for tokenizer quality versus model scale, the attribution of gains specifically to the shared tokenizer remains unverified.

Authors: The abstract is a concise summary and does not contain baselines or table references; these appear in the results section of the full manuscript together with dataset sizes. To strengthen attribution of gains to the shared tokenizer, we will expand or add ablation studies that control for tokenizer quality versus model scale in the revised version. revision: partial

Circularity Check

0 steps flagged

No significant circularity; claims rest on architecture design and empirical training outcomes

The paper presents UniAR as an architectural proposal (single discrete visual tokenizer via adapted pretrained encoder, multi-level fusion, and bitwise quantization) whose unification benefit is asserted to emerge from large-scale pre-training, SFT, and RL rather than from any self-referential definition or fitted parameter renamed as prediction. No equations, uniqueness theorems, or self-citations are invoked in the provided text to force the central claim; the tokenizer's dual utility for understanding and generation is framed as an empirical outcome of the training regime, not a tautology. This is the normal non-circular case for an engineering paper whose results are externally falsifiable on benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Ledger is provisional and limited because only the abstract is available; no specific free parameters, axioms, or invented entities are detailed beyond high-level claims.

axioms (1)

domain assumption: A single discrete visual tokenizer can serve as a bridge for both understanding and generation tasks.
Core premise stated in the abstract as the key to unification.

invented entities (1)

Shared context-visual tokenizer (no independent evidence)
purpose: Bridge between understanding and generation in a unified autoregressive model
Introduced as the central innovation enabling shared context without re-encoding.

Reference graph

Works this paper leans on

72 extracted references · 25 linked inside Pith

[1]

Qwen3-vl technical report.arXivpreprintarXiv:2511.21631, 2025

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

Pith/arXiv arXiv 2025
[2]

Oneig-bench: Omni-dimensional nuanced evaluation for image generation

Jingjing Chang, Yixiao Fang, Peng Xing, Shuhan Wu, Wei Cheng, Rui Wang, Xianfang Zeng, Gang Yu, and Hai-Bao Chen. Oneig-bench: Omni-dimensional nuanced evaluation for image generation. InNeurIPS, 2025

2025
[3]

Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset

Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprintarXiv:2505.09568, 2025

Pith/arXiv arXiv 2025
[4]

Sharegpt-4o-image: Aligning multimodal models with gpt-4o-level image generation.arXiv preprint arXiv:2506.18095, 2025

Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji, Xidong Wang, Yunjin Yang, and Benyou Wang. Sharegpt-4o-image: Aligning multimodal models with gpt-4o-level image generation.arXiv preprint arXiv:2506.18095, 2025

arXiv 2025
[5]

Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

Pith/arXiv arXiv 2025
[6]

Comp: Continual multimodal pre-training for vision foundation models.arXivpreprintarXiv:2503.18931, 2025

Yitong Chen, Lingchen Meng, Wujian Peng, Zuxuan Wu, and Yu-Gang Jiang. Comp: Continual multimodal pre-training for vision foundation models.arXivpreprintarXiv:2503.18931, 2025

arXiv 2025
[7]

Catok: Taming mean flows for one-dimensional causal image tokenization

Yitong Chen, Zuxuan Wu, Xipeng Qiu, and Yu-Gang Jiang. Catok: Taming mean flows for one-dimensional causal image tokenization. InCVPR, 2026

2026
[8]

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXivpreprintarXiv:2507.06261, 2025

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXivpreprintarXiv:2507.06261, 2025

Pith/arXiv arXiv 2025
[9]

Paddleocr 3.0 technical report, 2025

Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, Yue Zhang, Wenyu Lv, Kui Huang, Yichao Zhang, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. Paddleocr 3.0 technical report, 2025. URLhttps://arxiv.org/abs/2507.05595

Pith/arXiv arXiv 2025
[10]

Emerging properties in unified multimodal pretraining.arXiv preprintarXiv:2505.14683, 2025

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprintarXiv:2505.14683, 2025

Pith/arXiv arXiv 2025
[11]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024

2024
[12]

Multimodal autoregressive pre-training of large vision encoders

Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor G Turrisi da Costa, Louis Béthune, Zhe Gan, et al. Multimodal autoregressive pre-training of large vision encoders. In CVPR, pages 9641–9654, 2025

2025
[13]

Mme: A comprehensive evaluation benchmark for multimodal large language models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. InNeurIPS, 2025

2025
[14]

X-omni: Reinforcement learning makes discrete autoregressive image generative models great again.arXivpreprintarXiv:2507.22058, 2025

Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, et al. X-omni: Reinforcement learning makes discrete autoregressive image generative models great again.arXivpreprintarXiv:2507.22058, 2025

arXiv 2025
[15]

Geneval: An object-focused framework for evaluating text-to-image alignment

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. InNeurIPS, 2023. 14

2023
[16]

The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

Pith/arXiv arXiv 2024
[17]

Vision as a dialect: Unifying visual understanding and generation via text-aligned representations

JiamingHan, HaoChen, YangZhao, HanyuWang, QiZhao, ZiyanYang, HaoHe, XiangyuYue, andLuJiang. Vision as a dialect: Unifying visual understanding and generation via text-aligned representations. InNeurIPS, 2025

2025
[18]

Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis

Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. InCVPR, 2025

2025
[19]

Classifier-free diffusion guidance.arXivpreprintarXiv:2207.12598, 2022

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXivpreprintarXiv:2207.12598, 2022

Pith/arXiv arXiv 2022
[20]

Gpt-4o system card.arXiv preprintarXiv:2410.21276, 2024

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprintarXiv:2410.21276, 2024

Pith/arXiv arXiv 2024
[21]

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space.arXiv preprintarXiv:2506.15742, 2025

Pith/arXiv arXiv 2025
[22]

Llava-onevision: Easy visual task transfer.arXiv preprintarXiv:2408.03326, 2024

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer.arXiv preprintarXiv:2408.03326, 2024

Pith/arXiv arXiv 2024
[23]

Seed-bench: Benchmarkingmultimodal llms with generative comprehension

BohaoLi,RuiWang,GuangzhiWang,YuyingGe,YixiaoGe,andYingShan. Seed-bench: Benchmarkingmultimodal llms with generative comprehension. InCVPR, 2024

2024
[24]

Xiaomi mimo-vl-miloco technical report.arXivpreprintarXiv:2512.17436, 2025

JiazeLi, JingyangChen, YuxunQu, JianzhongJu, ZhenboLuo, JianLuan, ShĳieXu, ZhenruLin, JunyouZhu, Boshen Xu, et al. Xiaomi mimo-vl-miloco technical report.arXivpreprintarXiv:2512.17436, 2025

arXiv 2025
[25]

Mvbench: A comprehensive multi-modal video understanding benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. InCVPR, pages 22195–22206, 2024

2024
[26]

Mogao: An omni foundation model for interleaved multi-modal generation.arXiv preprint arXiv:2505.05472, 2025

Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, and Weilin Huang. Mogao: An omni foundation model for interleaved multi-modal generation.arXiv preprint arXiv:2505.05472, 2025

Pith/arXiv arXiv 2025
[27]

Uniworld: High-resolution semantic encoders for unified visual understanding and generation

Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprintarXiv:2506.03147, 2025

Pith/arXiv arXiv 2025
[28]

Flow matching for generative modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprintarXiv:2210.02747, 2022

Pith/arXiv arXiv 2022
[29]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings ofthe IEEE/CVF conferenceoncomputervision andpattern recognition, pages 26296–26306, 2024

2024
[30]

Llavanext: Improved reasoning, ocr, and world knowledge, 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llavanext: Improved reasoning, ocr, and world knowledge, 2024

2024
[31]

Flow-grpo: Training flow matching models via online rl

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl. InNeurIPS, volume 38, pages 40783–40818, 2025

2025
[32]

Ocrbench: on the hidden mystery of ocr in large multimodal models.Science China InformationSciences, 67(12):220102, 2024

Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. Ocrbench: on the hidden mystery of ocr in large multimodal models.Science China InformationSciences, 67(12):220102, 2024

2024
[33]

Tuna: Taming unified visual representations for native unified multimodal models

Zhiheng Liu, Weiming Ren, Haozhe Liu, Zĳian Zhou, Shoufa Chen, Haonan Qiu, Xiaoke Huang, Zhaochong An, Fanny Yang, Aditya Patel, et al. Tuna: Taming unified visual representations for native unified multimodal models. arXiv preprintarXiv:2512.02014, 2025

arXiv 2025
[34]

Compositionaltext-to-imagegenerationviaregion-aware bimodal direct preference optimization

ZhuohanLiu,WujianPeng,YitongChen,andZuxuanWu. Compositionaltext-to-imagegenerationviaregion-aware bimodal direct preference optimization. InCVPR, pages 36604–36614, 2026

2026
[35]

Atoken: A unified tokenizer for vision.arXivpreprintarXiv:2509.14476, 2025

Jiasen Lu, Liangchen Song, Mingze Xu, Byeongjoo Ahn, Yanjun Wang, Chen Chen, Afshin Dehghan, and Yinfei Yang. Atoken: A unified tokenizer for vision.arXivpreprintarXiv:2509.14476, 2025. 15

arXiv 2025
[36]

Unitok: A unified tokenizer for visual generation and understanding

Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, and Xiaojuan Qi. Unitok: A unified tokenizer for visual generation and understanding. InNeurIPS, volume 38, pages 129274–129297, 2025

2025
[37]

Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation

Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, et al. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. InCVPR, 2025

2025
[38]

Chartqa: A benchmark for question answeringaboutchartswithvisualandlogicalreasoning

Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answeringaboutchartswithvisualandlogicalreasoning. In Findingsoftheassociationforcomputationallinguistics: ACL2022, pages 2263–2279, 2022

2022
[39]

Docvqa: Adatasetforvqaondocumentimages

MineshMathew,DimosthenisKaratzas,andCVJawahar. Docvqa: Adatasetforvqaondocumentimages. In WACV, pages 2200–2209, 2021

2021
[40]

Infographicvqa

Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infographicvqa. In WACV, pages 1697–1706, 2022

2022
[41]

Deepstack: Deeply stacking visual tokens is surprisingly simple and effective for lmms

Lingchen Meng, Jianwei Yang, Rui Tian, Xiyang Dai, Zuxuan Wu, Jianfeng Gao, and Yu-Gang Jiang. Deepstack: Deeply stacking visual tokens is surprisingly simple and effective for lmms. InNeurIPS, 2024

2024
[42]

Transfer between modalities with metaqueries.arXivpreprint arXiv:2504.06256, 2025

Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Transfer between modalities with metaqueries.arXivpreprint arXiv:2504.06256, 2025

Pith/arXiv arXiv 2025
[43]

Inst-it: Boosting instance understanding via explicit visual prompt instruction tuning

Wujian Peng, Lingchen Meng, Yitong Chen, Yiweng Xie, Yang Liu, Tao Gui, Hang Xu, Xipeng Qiu, Zuxuan Wu, and Yu-Gang Jiang. Inst-it: Boosting instance understanding via explicit visual prompt instruction tuning. InNeurIPS, 2025

2025
[44]

Du, Zehuan Yuan, and Xinglong Wu

Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K. Du, Zehuan Yuan, and Xinglong Wu. Tokenflow: Unified image tokenizer for multimodal understanding and generation. InCVPR, 2025

2025
[45]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InICML. PmLR, 2021

2021
[46]

A-okvqa: A benchmark for visual question answering using world knowledge

Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. InECCV, pages 146–162. Springer, 2022

2022
[47]

Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

Pith/arXiv arXiv 2024
[48]

Scalable image tokenization with index backpropagation quantization

Fengyuan Shi, Zhuoyan Luo, Yixiao Ge, Yujiu Yang, Ying Shan, and Limin Wang. Scalable image tokenization with index backpropagation quantization. InProceedingsofthe IEEE/CVFInternational ConferenceonComputer Vision, pages 16037–16046, 2025

2025
[49]

Towards vqa models that can read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. InCVPR, pages 8317–8326, 2019

2019
[50]

Unilip: Adapting clip for unified multimodal understanding, generation and editing.arXivpreprint arXiv:2507.23278, 2025

Hao Tang, Chenwei Xie, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, and Liwei Wang. Unilip: Adapting clip for unified multimodal understanding, generation and editing.arXivpreprint arXiv:2507.23278, 2025

arXiv 2025
[51]

Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024

Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024

Pith/arXiv arXiv 2024
[52]

Kimi-vl technical report.arXivpreprintarXiv:2504.07491, 2025

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report.arXivpreprintarXiv:2504.07491, 2025

Pith/arXiv arXiv 2025
[53]

Unigen-1.5: Enhancing image generation and editing through reward unification in reinforcement learning.arXiv preprint arXiv:2511.14760, 2025

RuiTian,MingfeiGao,HaimingGang,JiasenLu,ZheGan,YinfeiYang,ZuxuanWu,andAfshinDehghan. Unigen-1.5: Enhancing image generation and editing through reward unification in reinforcement learning.arXiv preprint arXiv:2511.14760, 2025

arXiv 2025
[54]

Unigen: Enhanced training & test-time strategies for unified multimodal understanding and generation

Rui Tian, Mingfei Gao, Mingze Xu, Jiaming Hu, Jiasen Lu, Zuxuan Wu, Yinfei Yang, and Afshin Dehghan. Unigen: Enhanced training & test-time strategies for unified multimodal understanding and generation. InNeurIPS, 2025. 16

2025
[55] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025
[56] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017
[57] Junke Wang, Yi Jiang, Zehuan Yuan, Bingyue Peng, Zuxuan Wu, and Yu-Gang Jiang. Omnitokenizer: A joint image-video tokenizer for visual generation. NeurIPS, 37:28281–28295, 2024
[58] Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024
[59] Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation. arXiv preprint arXiv:2503.05236, 2025
[60] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025
[61] Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871, 2025
[62] Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023
[63] Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models. In NeurIPS, volume 38, pages 47490–47518, 2025
[64] Zhiyuan Yan, Junyan Ye, Weijia Li, Zilong Huang, Shenghai Yuan, Xiangyang He, Kaiqing Lin, Jun He, Conghui He, and Li Yuan. Gpt-imgeval: A comprehensive benchmark for diagnosing gpt4o in image generation. arXiv preprint arXiv:2504.02782, 2025
[65] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025
[66] Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark. In NeurIPS, 2025
[67] Lijun Yu, José Lezama, Nitesh Bharadwaj Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, et al. Language model beats diffusion - tokenizer is key to visual generation. In ICLR, 2024
[68] Huichao Zhang, Liao Qu, Yiheng Liu, Hang Chen, Yangyang Song, Yongsheng Dong, Shikun Sun, Xian Li, Xu Wang, Yi Jiang, et al. Nextflow: Unified sequential modeling activates multimodal understanding and generation. arXiv preprint arXiv:2601.02204, 2026
[69] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, pages 3836–3847, 2023
[70] Yue Zhao, Yuanjun Xiong, and Philipp Krähenbühl. Image and video tokenization with binary spherical quantization. In ICLR, 2025
[71] Yue Zhao, Fuzhao Xue, Scott Reed, Linxi Fan, Yuke Zhu, Jan Kautz, Zhiding Yu, Philipp Krähenbühl, and De-An Huang. Qlip: Text-aligned visual tokenization unifies auto-regressive multimodal understanding and generation. arXiv preprint arXiv:2502.05178, 2025
[72] Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. In ICLR, 2025
