arxiv: 2503.14324 · v3 · submitted 2025-03-18 · 💻 cs.CV · cs.CL

DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

Pith reviewed 2026-05-22 23:49 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords DualToken · dual codebooks · visual tokenizer · unified vision-language model · image understanding · visual generation · autoregressive models · semantic alignment

The pith

DualToken resolves representation conflicts in unified vision-language models by using two separate codebooks instead of one.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Visual understanding and generation demand incompatible representation spaces when both are handled inside autoregressive language models. A tokenizer optimized for pixel reconstruction captures fine details well but lacks the abstract semantics needed for language-aligned understanding, while contrastive encoders do the reverse. DualToken avoids forcing one codebook to serve both goals by maintaining two distinct codebooks: one dedicated to high-level semantics and one to low-level visual details. This separation produces strong results on reconstruction metrics, zero-shot classification, and downstream multimodal benchmarks for both understanding and generation. Readers should care because the approach offers a direct architectural route to models that can both interpret and create visual content inside the same autoregressive framework.

Core claim

DualToken disentangles high-level semantics and low-level visual details by introducing separate codebooks for each, allowing a single tokenizer to support both visual understanding and generation without the performance conflicts that arise when a shared codebook must satisfy both reconstruction and semantic objectives simultaneously.

What carries the argument

Dual visual vocabularies consisting of two distinct codebooks, one for high-level semantics and one for low-level visual details, that replace a single shared codebook inside the tokenizer.
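
As a rough illustration of the idea, and not the authors' implementation, the sketch below quantizes two feature vectors against two independent codebooks by nearest-neighbor lookup. The codebook contents, the feature values, and the names `nearest_code` and `dual_tokenize` are hypothetical. The only structure assumed is that low-level and high-level features are each matched against their own codebook.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(feature: Vector, codebook: List[Vector]) -> int:
    """Return the index of the codebook entry closest to `feature` (squared L2)."""
    def sq_dist(entry: Vector) -> float:
        return sum((f - e) ** 2 for f, e in zip(feature, entry))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def dual_tokenize(
    low_level_feat: Vector,
    high_level_feat: Vector,
    pixel_codebook: List[Vector],
    semantic_codebook: List[Vector],
) -> Tuple[int, int]:
    """Quantize each feature against its own codebook; the codebooks never mix."""
    pixel_id = nearest_code(low_level_feat, pixel_codebook)
    semantic_id = nearest_code(high_level_feat, semantic_codebook)
    return pixel_id, semantic_id


# Hypothetical toy codebooks (real codebooks are learned and far larger).
PIXEL_CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
SEMANTIC_CODEBOOK = [(1.0, 1.0), (-1.0, -1.0)]

pixel_id, semantic_id = dual_tokenize(
    (0.9, 0.1), (0.8, 0.7), PIXEL_CODEBOOK, SEMANTIC_CODEBOOK
)
print(pixel_id, semantic_id)  # 1 0
```

Because each lookup touches only its own codebook, a gradient or update applied to one vocabulary in training would not directly move entries of the other, which is the separation the claim depends on.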

Load-bearing premise

The performance conflict between reconstruction and semantic objectives is caused by a single shared codebook and can be resolved simply by splitting it into two separate codebooks, without creating new optimization or integration conflicts.

What would settle it

A single-codebook tokenizer trained with the same combined objectives but with improved balancing or regularization that matches or exceeds DualToken on both rFID and zero-shot ImageNet accuracy would show the dual-codebook separation is not required.
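
The criterion above amounts to a dominance check on two metrics, where lower is better for rFID and higher is better for zero-shot accuracy. The sketch below encodes that check. The DualToken figures are the ones reported in the abstract; the two baseline results are hypothetical values invented only to exercise the function, and the names `TokenizerResult` and `matches_or_exceeds` are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizerResult:
    rfid: float       # reconstruction FID on ImageNet, lower is better
    zero_shot: float  # zero-shot ImageNet accuracy in percent, higher is better


def matches_or_exceeds(candidate: TokenizerResult, reference: TokenizerResult) -> bool:
    """True only if the candidate is at least as good on BOTH metrics."""
    return candidate.rfid <= reference.rfid and candidate.zero_shot >= reference.zero_shot


DUALTOKEN = TokenizerResult(rfid=0.25, zero_shot=82.0)  # figures from the abstract

# Hypothetical single-codebook baselines, not measured results.
baseline_a = TokenizerResult(rfid=0.24, zero_shot=82.3)
baseline_b = TokenizerResult(rfid=0.20, zero_shot=79.0)

print(matches_or_exceeds(baseline_a, DUALTOKEN))  # True: would undercut the premise
print(matches_or_exceeds(baseline_b, DUALTOKEN))  # False: better rFID, worse accuracy
```

A baseline like `baseline_b`, which wins on one metric and loses on the other, would not settle the question, since that trade-off is exactly the conflict the paper describes.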

Figures

Figures reproduced from arXiv: 2503.14324 by Jianhua Xu, Jiaqi Wang, Kaicheng Yu, Long Chen, Wei Song, Yadong Li, Yuran Wang, Zenan Zhou, Zijia Song.

Figure 1: Comparison with state-of-the-art vision encoders. (Left) We compare zero-shot classification accuracy and reconstruction FID on ImageNet-1K (val) across baseline methods and DualToken. DualToken achieves results comparable to or surpassing both semantic-only and reconstruction-only methods in both tasks. (Right) Reconstruction results of VILA-U and DualToken, our DualToken significantly outperforms VILA-U…
Figure 2: Overview of our unified vision tokenizer. Given input images the features extracted by the vision encoder are discretized using residual quantization. Then the discrete vision features are meanwhile put into the vision decoder to reconstruct images and used to perform the text-image alignment. During this process, the reconstruction loss and contrastive loss are computed to update the vision tower, endowin…
Figure 3: An overview of our framework utilizing dual visual codebooks for unified visual understanding and generation. (a) Directly using VQGAN and SigLIP to separately acquire high-level (semantic) and low-level (pixel) visual codebooks. (b) Our approach: decoupling high-level and low-level visual codebooks within a unified vision tokenizer. The image is converted into low-level visual tokens (green) and text-alig…
Figure 4: Visual generation results with DualToken. (Left) Our DualToken can generate high-quality images given text input. (Right) Following the pipeline introduced in …
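
The Figure 2 caption mentions residual quantization. For readers unfamiliar with the term, the sketch below shows the generic technique: each stage quantizes whatever residual the previous stages left behind, so the sum of the selected entries approximates the input more closely with depth. The codebooks, the depth of two, and the function names are hypothetical; the paper's actual configuration is not given on this page.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest(vec: Vector, codebook: List[Vector]) -> int:
    """Index of the codebook entry with the smallest squared L2 distance to `vec`."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, codebook[i])),
    )


def residual_quantize(
    feature: Vector, codebooks: List[List[Vector]]
) -> Tuple[List[int], List[float]]:
    """Quantize `feature` stage by stage; return the codes and the summed approximation."""
    residual = list(feature)
    approx = [0.0] * len(residual)
    codes: List[int] = []
    for codebook in codebooks:
        idx = nearest(residual, codebook)
        entry = codebook[idx]
        codes.append(idx)
        approx = [a + e for a, e in zip(approx, entry)]      # accumulate reconstruction
        residual = [r - e for r, e in zip(residual, entry)]  # what is still unexplained
    return codes, approx


# Hypothetical two-stage codebooks: a coarse stage followed by a fine correction.
COARSE = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
FINE = [(0.0, 0.0), (0.25, -0.25), (-0.25, 0.25)]

codes, approx = residual_quantize((1.2, 0.8), [COARSE, FINE])
print(codes, approx)  # [1, 1] [1.25, 0.75]
```

With only the coarse stage the approximation would be (1.0, 1.0); the second stage moves it to (1.25, 0.75), closer to the input (1.2, 0.8).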
Original abstract

The differing representation spaces required for visual understanding and generation pose a challenge in unifying them within the autoregressive paradigm of large language models. A vision tokenizer trained for reconstruction excels at capturing low-level visual appearance, making it well-suited for visual generation but lacking high-level semantic representations for understanding tasks. Conversely, a vision encoder trained via contrastive learning aligns well with language but struggles to decode back into the pixel space for generation tasks. To bridge this gap, we propose DualToken, a method that unifies representations for both understanding and generation within a single tokenizer. However, directly integrating reconstruction and semantic objectives creates conflicts, leading to degraded performance in both reconstruction fidelity and semantic accuracy. Instead of forcing a single codebook to capture both visual appearance and semantics, DualToken disentangles them by introducing separate codebooks for high-level semantics and low-level visual details. As a result, DualToken achieves 0.25 rFID and 82.0% zero-shot accuracy on ImageNet, and demonstrates strong effectiveness in downstream MLLM tasks for both understanding and generation. Specifically, our method surpasses VILA-U by 5.8 points on average across ten visual understanding benchmarks and delivers a 13% improvement on GenAI-Bench. Notably, incorporating dual visual tokens outperforms using a single token type on both understanding and generation tasks. We hope our research offers a new perspective on leveraging dual visual vocabularies for building unified vision-language models. Project page is available at https://songweii.github.io/dualtoken-project-page.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes DualToken, a vision tokenizer for autoregressive vision-language models that introduces two separate codebooks—one for high-level semantics and one for low-level visual details—to resolve conflicts that arise when a single codebook is trained for both reconstruction and semantic objectives. The paper reports that this disentanglement yields 0.25 rFID and 82.0% zero-shot ImageNet accuracy while surpassing VILA-U by 5.8 points on average across ten understanding benchmarks and delivering a 13% gain on GenAI-Bench; it further states that dual tokens outperform a single token type on both understanding and generation tasks.

Significance. If the reported gains are reproducible and attributable to the dual-vocabulary design, the work would supply a straightforward architectural response to a recognized tension between reconstruction fidelity and semantic alignment in unified VLMs. The explicit claim that dual tokens improve both task families over a single-token baseline is a constructive element of the presentation.

major comments (2)
  1. [Abstract] The central claim that separate codebooks resolve the reconstruction-semantic conflict without introducing new optimization or integration issues rests on an untested assumption; the provided text supplies no ablation studies, capacity-matched single-codebook controls, or training details that would isolate the contribution of disentanglement to the stated metrics (0.25 rFID, 82.0% accuracy, 5.8-point gain).
  2. [Experiments] No architecture diagram, loss formulation, or token-integration procedure is described, leaving open whether the dual codebooks are simply concatenated, routed conditionally, or otherwise combined in the autoregressive sequence, an omission that directly affects verification of the method's claimed advantage over VILA-U.
minor comments (1)
  1. [Abstract] The statement that 'incorporating dual visual tokens outperforms using a single token type' would be strengthened by an explicit pointer to the corresponding table or figure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review. We address the major comments below and will revise the manuscript accordingly to include additional ablations, diagrams, and descriptions.

Point-by-point responses
  1. Referee: [Abstract] The central claim that separate codebooks resolve the reconstruction-semantic conflict without introducing new optimization or integration issues rests on an untested assumption; the provided text supplies no ablation studies, capacity-matched single-codebook controls, or training details that would isolate the contribution of disentanglement to the stated metrics (0.25 rFID, 82.0% accuracy, 5.8-point gain).

    Authors: While the manuscript does include a comparison showing that dual tokens outperform a single token type on both tasks, we agree that more rigorous ablations, including capacity-matched single-codebook baselines and detailed training procedures, are needed to fully isolate the effect of disentanglement. We will add these in the revised version, along with explicit discussion of any optimization or integration considerations.
    Revision: yes

  2. Referee: [Experiments] No architecture diagram, loss formulation, or token-integration procedure is described, leaving open whether the dual codebooks are simply concatenated, routed conditionally, or otherwise combined in the autoregressive sequence, an omission that directly affects verification of the method's claimed advantage over VILA-U.

    Authors: We acknowledge this omission in the current manuscript. The revised version will include an architecture diagram, the full loss formulation, and a clear description of how the dual tokens are integrated into the autoregressive sequence (e.g., concatenation or conditional routing). This will facilitate verification and comparison with VILA-U.
    Revision: yes
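
The integration question raised in the second comment can be made concrete. The sketch below shows two candidate layouts for combining a semantic token stream and a pixel token stream into one autoregressive sequence: plain concatenation and per-position interleaving. Neither is confirmed as DualToken's actual procedure; the vocabulary size, the ID-offset scheme, and all function names are hypothetical.

```python
from typing import List

# Hypothetical semantic vocabulary size, used to shift pixel IDs into a disjoint range
# so the two vocabularies cannot collide inside one sequence.
SEMANTIC_VOCAB_SIZE = 8


def offset_pixel(pixel_ids: List[int]) -> List[int]:
    """Map pixel-codebook indices into the ID range above the semantic vocabulary."""
    return [SEMANTIC_VOCAB_SIZE + p for p in pixel_ids]


def concatenate(semantic_ids: List[int], pixel_ids: List[int]) -> List[int]:
    """Layout A: all semantic tokens first, then all pixel tokens."""
    return list(semantic_ids) + offset_pixel(pixel_ids)


def interleave(semantic_ids: List[int], pixel_ids: List[int]) -> List[int]:
    """Layout B: one semantic token followed by one pixel token per spatial position."""
    if len(semantic_ids) != len(pixel_ids):
        raise ValueError("interleaving assumes one token of each type per position")
    sequence: List[int] = []
    for s, p in zip(semantic_ids, offset_pixel(pixel_ids)):
        sequence.extend((s, p))
    return sequence


semantic = [3, 1, 4]
pixel = [0, 2, 5]
print(concatenate(semantic, pixel))  # [3, 1, 4, 8, 10, 13]
print(interleave(semantic, pixel))   # [3, 8, 1, 10, 4, 13]
```

The two layouts double the sequence length in the same way but change what each token can attend to at prediction time, which is why the referee treats the unspecified choice as material to the comparison with VILA-U.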

Circularity Check

0 steps flagged

No significant circularity in claimed derivation

Full rationale

The paper motivates DualToken via an empirical observation that a shared codebook creates reconstruction-semantic conflicts and proposes separate codebooks as a direct architectural response. No equations, first-principles derivations, or 'predictions' are presented that reduce by construction to fitted parameters from the same data. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. The reported metrics are framed as experimental outcomes of the disentanglement choice, not as quantities forced by the method's own definitions. This is the common case of a self-contained empirical engineering paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the effectiveness of the dual-codebook design, which is introduced without independent evidence or formal justification in the abstract; standard reconstruction and contrastive objectives are assumed to remain compatible once separated.

axioms (1)
  • domain assumption: Standard vision tokenizer training objectives (reconstruction and contrastive learning) remain compatible once separated into distinct codebooks.
    The paper builds directly on these existing training paradigms without additional justification.
invented entities (1)
  • Dual visual vocabularies (separate semantic and appearance codebooks): no independent evidence
    purpose: To capture high-level semantics and low-level details independently inside one tokenizer
    Newly postulated design choice introduced to resolve the stated objective conflict; no independent evidence supplied in the abstract.

pith-pipeline@v0.9.0 · 5832 in / 1288 out tokens · 34617 ms · 2026-05-22T23:49:23.955803+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Instead of forcing a single codebook to handle both semantic and perceptual information, DualToken disentangles them by introducing separate codebooks for high and low-level features... using shallow-layer features for reconstruction and deep-layer features for semantic learning

  • IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    this hierarchical decoupling not only resolves the conflict between the two objectives but also enables the semantic learning objective to enhance low-level reconstruction

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Vision Foundation Models as Generalist Tokenizers for Image Generation

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    VFMTok builds a generalist image tokenizer on frozen VFMs using adaptive quantization and semantic alignment, delivering gFID 1.36 for autoregressive and 1.25 for continuous generation on ImageNet with 3x faster convergence.

  2. Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling

    cs.CV · 2025-12 · conditional · novelty 6.0

    Scone unifies subject understanding and generation in a two-stage trained model to improve both composition and distinction in multi-subject image generation, outperforming prior open-source models on new benchmarks.

  3. WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens

    cs.CV · 2026-05 · unverdicted · novelty 5.0

    WinTok is a hybrid visual tokenizer that supplements pixel tokens with learnable semantic tokens distilled asymmetrically from foundation models to improve reconstruction, understanding, and generation.

  4. SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

    cs.CV · 2026-05 · unverdicted · novelty 5.0

    SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

  5. Show-o2: Improved Native Unified Multimodal Models

    cs.CV · 2025-06 · unverdicted · novelty 4.0

    Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
