
arxiv: 2606.18249 · v2 · pith:WABYITTYnew · submitted 2026-06-16 · 💻 cs.CV

Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

Pith reviewed 2026-06-27 01:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords: unified multimodal modeling, autoregressive modeling, visual tokenizer, image generation, image editing, multimodal understanding

The pith

A single discrete visual tokenizer unifies understanding and generation by letting an autoregressive model read its own outputs in shared context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that replacing two separate visual tokenizers with one discrete tokenizer removes the split in representation space and allows true unification inside a single autoregressive model. The tokenizer creates a shared context so the model can interpret the visual tokens it just generated without any extra re-encoding step. This matters to a reader because separate tokenizers have forced multimodal systems to treat understanding and generation as disconnected processes. The construction adapts a pretrained vision encoder through multi-level feature fusion and lookup-free bitwise quantization to retain both semantic and detailed information while keeping the vocabulary size manageable.

Core claim

UniAR shows that a single discrete visual tokenizer, obtained by adapting a pretrained vision encoder with multi-level feature fusion and lookup-free bitwise quantization, serves as the bridge between understanding and generation. This tokenizer supplies a shared context in which the autoregressive model directly consumes its own generated visual tokens. Parallel-bitwise-prediction of spatially grouped multi-level codes shortens the visual sequence, and a diffusion decoder converts the discrete tokens into images. Large-scale pre-training followed by supervised fine-tuning and reinforcement learning produces state-of-the-art image generation and editing results while remaining competitive on multimodal understanding benchmarks.
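To make the quantization step concrete, here is a minimal sketch of generic lookup-free quantization, in which each latent channel is binarized by its sign and the token id is simply the integer those bits spell out. The text available here does not describe UniAR's exact scheme, so the function names, the bit ordering, and the 4-channel toy input are all assumptions made for illustration.

```python
import numpy as np

def lfq_quantize(z):
    """Generic lookup-free quantization (illustrative, not the paper's code).

    z: float array of shape (num_tokens, num_bits).
    Returns codes in {-1, +1}, bits in {0, 1}, and integer token ids.
    """
    bits = (z > 0).astype(np.int64)                        # one bit per channel
    codes = 2.0 * bits - 1.0                               # map {0, 1} -> {-1, +1}
    weights = 1 << np.arange(z.shape[1], dtype=np.int64)   # 1, 2, 4, ...
    ids = bits @ weights                                   # implicit index, no codebook
    return codes, bits, ids

def lfq_decode(ids, num_bits):
    """Recover the {-1, +1} code from an integer id by reading its bits."""
    bits = (ids[:, None] >> np.arange(num_bits)) & 1
    return 2.0 * bits - 1.0

# Toy latents: 2 tokens, 4 channels (hypothetical sizes).
z = np.array([[0.3, -1.2, 0.7, 0.1],
              [-0.5, 0.4, -0.9, -0.2]])
codes, bits, ids = lfq_quantize(z)
```

Because the id is computed from the bits, the same token can be mapped back to its code without any stored codebook. This is the property that lets a bitwise tokenizer grow its vocabulary by adding channels rather than table entries.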

What carries the argument

The single discrete visual tokenizer that supplies shared context so generated visual tokens can be interpreted directly without re-encoding.

If this is right

  • The model performs both image understanding and image generation inside one autoregressive sequence without switching token spaces.
  • Parallel-bitwise-prediction of multi-level codes reduces visual sequence length and speeds up generation.
  • A diffusion decoder converts the discrete tokens into high-fidelity images after autoregressive prediction.
  • After pre-training, fine-tuning, and reinforcement learning the system reaches top results on generation and editing while staying competitive on understanding tasks.
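The sequence-length claim in the second bullet can be illustrated with a small sketch. It assumes that one autoregressive step emits every bit for a spatial block of patches across all feature levels, and that the bits are decoded independently by thresholding. The grid size, group size, level count, bit width, and the independence assumption are all hypothetical and are not taken from the paper.

```python
import numpy as np

def visual_sequence_length(grid_h, grid_w, group):
    """Autoregressive steps needed when each step covers a group x group block."""
    assert grid_h % group == 0 and grid_w % group == 0
    return (grid_h // group) * (grid_w // group)

def predict_bits_in_parallel(logits):
    """Greedy decode: threshold every bit logit of a step at once.

    logits: (steps, bits_per_step) raw scores from a hypothetical bit head.
    """
    return (logits > 0).astype(np.int64)

# Hypothetical configuration, chosen only to make the arithmetic visible.
grid_h = grid_w = 32      # patch grid of the image
group = 2                 # 2 x 2 patches predicted per step
levels = 2                # feature levels fused by the tokenizer
bits_per_code = 16        # bits per patch per level

steps_plain = visual_sequence_length(grid_h, grid_w, 1)
steps_grouped = visual_sequence_length(grid_h, grid_w, group)
bits_per_step = group * group * levels * bits_per_code

rng = np.random.default_rng(0)
logits = rng.normal(size=(steps_grouped, bits_per_step))  # stand-in for model output
bits = predict_bits_in_parallel(logits)
```

Under these assumed sizes the grouped scheme takes a quarter of the steps while emitting the same total number of bits. That is the trade the bullet describes: fewer sequential steps, with more bits predicted jointly in each one.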

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The shared-tokenizer design could allow an autoregressive model to chain understanding steps directly into generation steps inside a single forward pass.
  • The bitwise quantization approach may extend to other modalities such as video or audio for similar unification gains.
  • Because the tokenizer is derived from an existing vision encoder, the method may transfer to new backbone architectures with only modest retraining.

Load-bearing premise

Adapting one pretrained vision encoder with multi-level feature fusion and lookup-free bitwise quantization preserves enough high-level semantics and low-level details to support both understanding and generation tasks at once.
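One way to see why a bitwise scheme can enlarge the vocabulary cheaply is to compare output-layer sizes. The symbols below are introduced here for illustration and do not come from the paper: d is the number of bits per visual code and h is the hidden width of the autoregressive model.

```latex
\[
  |\mathcal{V}| = 2^{d},
  \qquad
  \underbrace{h \cdot 2^{d}}_{\text{softmax head over all token ids}}
  \quad \text{versus} \quad
  \underbrace{h \cdot d}_{\text{one binary logit per bit}} .
\]
```

Adding one bit doubles the effective vocabulary but adds only h weights to a bitwise head, whereas a softmax head over token ids would double in size. Whether the premise holds still depends on the empirical question of how much semantic and low-level information survives the binarization.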

What would settle it

A controlled experiment that measures whether forcing the model to re-encode its own generated tokens improves or degrades performance on image-editing benchmarks would directly test whether the shared-context benefit is real.

Original abstract

Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework where a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel-bitwise-prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on discrete visual tokens to decode high-fidelity images. Through large-scale pre-training, followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at https://sharelab-sii.github.io/uniar-web.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes UniAR, a unified autoregressive framework for multimodal modeling that integrates visual understanding and generation via a single discrete visual tokenizer. This tokenizer is obtained by adapting a pretrained vision encoder through multi-level feature fusion and lookup-free bitwise quantization, which the authors claim preserves both high-level semantics and low-level details. The autoregressive component employs parallel-bitwise-prediction to jointly predict spatially grouped multi-level codes, shortening sequences, while a diffusion-based decoder reconstructs images from the discrete tokens. After large-scale pre-training, supervised fine-tuning, and reinforcement learning, the model is reported to achieve state-of-the-art results on image generation and editing tasks while remaining competitive on multimodal understanding benchmarks.

Significance. If the central architectural claim holds (that a single tokenizer enables a shared context in which generated visual tokens can be directly interpreted without re-encoding), this would constitute a meaningful simplification over approaches that maintain separate tokenizers for understanding and generation. The bitwise quantization scheme for vocabulary scaling at low cost and the parallel prediction strategy are technically interesting contributions. The work would be strengthened by reproducible code or detailed ablation tables that isolate the tokenizer's contribution; absent those, the significance remains conditional on verification that the adaptation step does not trade off fidelity in one domain for the other.

major comments (2)
  1. [Tokenizer adaptation] Tokenizer adaptation section (description of multi-level fusion + lookup-free bitwise quantization): the claim that this procedure 'preserves both high-level semantics and low-level details' is load-bearing for the unification benefit. No quantitative reconstruction metrics (e.g., PSNR, LPIPS, or FID on held-out images) or semantic retention metrics (e.g., linear probing accuracy or zero-shot classification) are referenced to demonstrate that the quantization step does not introduce unacceptable loss in either regime; without such evidence the 'key bridge' property cannot be evaluated.
  2. [Results] Results section (claims of SOTA on generation/editing): the abstract states 'state-of-the-art performance' after pre-training, SFT, and RL but supplies no baselines, dataset sizes, or table references. If the full manuscript contains only aggregate scores without controls for tokenizer quality versus model scale, the attribution of gains specifically to the shared tokenizer remains unverified.
minor comments (1)
  1. [Abstract] The abstract mentions 'parallel-bitwise-prediction' and 'spatially grouped, multi-level visual codes' without defining the grouping or bit allocation; a short notation table or equation would improve clarity.
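Of the reconstruction metrics named in major comment 1, PSNR is the simplest to state. The sketch below uses the standard definition, 10 log10(MAX^2 / MSE), on 8-bit images. The toy arrays stand in for an original image and its decoded reconstruction; they are not data from the paper.

```python
import numpy as np

def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    ref = np.asarray(reference, dtype=np.float64)   # avoid uint8 wrap-around
    rec = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0.0:
        return float("inf")                         # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

# Toy example: a flat gray image with a single channel value off by 10.
original = np.full((4, 4, 3), 128, dtype=np.uint8)
decoded = original.copy()
decoded[0, 0, 0] = 138
score = psnr(original, decoded)
```

PSNR only captures pixel-level fidelity. The referee's other suggestions (LPIPS, FID, linear probing, zero-shot classification) address perceptual and semantic retention, and each needs a pretrained network.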

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below, indicating where revisions will be made to provide additional evidence.

Point-by-point responses
  1. Referee: [Tokenizer adaptation] Tokenizer adaptation section (description of multi-level fusion + lookup-free bitwise quantization): the claim that this procedure 'preserves both high-level semantics and low-level details' is load-bearing for the unification benefit. No quantitative reconstruction metrics (e.g., PSNR, LPIPS, or FID on held-out images) or semantic retention metrics (e.g., linear probing accuracy or zero-shot classification) are referenced to demonstrate that the quantization step does not introduce unacceptable loss in either regime; without such evidence the 'key bridge' property cannot be evaluated.

    Authors: We agree that quantitative metrics are required to substantiate the claim that the adaptation preserves both high-level semantics and low-level details. The manuscript describes the multi-level fusion and bitwise quantization but does not reference the requested metrics. In the revised manuscript we will add reconstruction metrics (PSNR, LPIPS, FID on held-out images) and semantic retention metrics (linear probing accuracy, zero-shot classification) to the tokenizer adaptation section. revision: yes

  2. Referee: [Results] Results section (claims of SOTA on generation/editing): the abstract states 'state-of-the-art performance' after pre-training, SFT, and RL but supplies no baselines, dataset sizes, or table references. If the full manuscript contains only aggregate scores without controls for tokenizer quality versus model scale, the attribution of gains specifically to the shared tokenizer remains unverified.

    Authors: The abstract is a concise summary and does not contain baselines or table references; these appear in the results section of the full manuscript together with dataset sizes. To strengthen attribution of gains to the shared tokenizer, we will expand or add ablation studies that control for tokenizer quality versus model scale in the revised version. revision: partial

Circularity Check

0 steps flagged

No significant circularity; claims rest on architecture design and empirical training outcomes

Full rationale

The paper presents UniAR as an architectural proposal (single discrete visual tokenizer via adapted pretrained encoder, multi-level fusion, and bitwise quantization) whose unification benefit is asserted to emerge from large-scale pre-training, SFT, and RL rather than from any self-referential definition or fitted parameter renamed as prediction. No equations, uniqueness theorems, or self-citations are invoked in the provided text to force the central claim; the tokenizer's dual utility for understanding and generation is framed as an empirical outcome of the training regime, not a tautology. This is the normal non-circular case for an engineering paper whose results are externally falsifiable on benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Ledger is provisional and limited because only the abstract is available; no specific free parameters, axioms, or invented entities are detailed beyond high-level claims.

axioms (1)
  • domain assumption: A single discrete visual tokenizer can serve as a bridge for both understanding and generation tasks.
    Core premise stated in the abstract as the key to unification.
invented entities (1)
  • Shared context-visual tokenizer (no independent evidence)
    purpose: Bridge between understanding and generation in a unified autoregressive model
    Introduced as the central innovation enabling shared context without re-encoding.


