DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies
Pith reviewed 2026-05-22 23:49 UTC · model grok-4.3
The pith
DualToken resolves representation conflicts in unified vision-language models by using two separate codebooks instead of one.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DualToken disentangles high-level semantics and low-level visual details by introducing separate codebooks for each, allowing a single tokenizer to support both visual understanding and generation without the performance conflicts that arise when a shared codebook must satisfy both reconstruction and semantic objectives simultaneously.
What carries the argument
Dual visual vocabularies consisting of two distinct codebooks, one for high-level semantics and one for low-level visual details, that replace a single shared codebook inside the tokenizer.
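To make the mechanism concrete, here is a minimal sketch of dual-codebook quantization in Python with NumPy. It assumes nearest-neighbor lookup under L2 distance and uses random arrays as stand-ins for encoder features. The codebook sizes, the feature dimension, and names such as `semantic_codebook` and `pixel_codebook` are hypothetical and not taken from the paper. Pairing deep-layer features with the semantic codebook and shallow-layer features with the pixel codebook follows the paper's description of using shallow-layer features for reconstruction and deep-layer features for semantic learning.

```python
import numpy as np

def quantize(features, codebook):
    """Return, for each feature vector, the index of its nearest codebook entry (L2)."""
    # Squared distances between every feature (N, D) and every code (K, D) -> (N, K).
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
num_patches, dim = 16, 8                          # hypothetical sizes
semantic_codebook = rng.normal(size=(32, dim))    # hypothetical high-level vocabulary
pixel_codebook = rng.normal(size=(64, dim))       # hypothetical low-level vocabulary

# Random stand-ins for encoder outputs (assumption: one vector per image patch).
deep_features = rng.normal(size=(num_patches, dim))
shallow_features = rng.normal(size=(num_patches, dim))

# Each patch gets two discrete ids, one from each vocabulary.
semantic_ids = quantize(deep_features, semantic_codebook)
pixel_ids = quantize(shallow_features, pixel_codebook)
```

Because the two lookups are independent, neither codebook has to trade reconstruction detail against semantic alignment, which is the separation the core claim relies on.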
Load-bearing premise
The performance conflict between reconstruction and semantic objectives is caused by the single shared codebook, and splitting it into two separate codebooks resolves that conflict without creating new optimization or integration conflicts.
What would settle it
If a single-codebook tokenizer, trained with the same combined objectives but with better loss balancing or regularization, matched or exceeded DualToken on both rFID and zero-shot ImageNet accuracy, that would show the dual-codebook separation is not required.
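The test above is a dominance check on two metrics that point in opposite directions: rFID is better when lower, accuracy is better when higher. A small Python sketch of that check follows. The function name and dictionary keys are hypothetical, the DualToken numbers are the ones reported in the abstract, and the single-codebook numbers are invented purely to exercise the function; they are not results from any experiment.

```python
def single_codebook_suffices(single, dual):
    """True if the single-codebook run matches or beats the dual run on BOTH metrics.

    rFID: lower is better. Zero-shot accuracy: higher is better.
    """
    return single["rfid"] <= dual["rfid"] and single["acc"] >= dual["acc"]

# Reported for DualToken in the abstract.
REPORTED_DUAL = {"rfid": 0.25, "acc": 82.0}

# Invented placeholder values, NOT measured results.
hypothetical_single = {"rfid": 0.30, "acc": 82.5}

# Better accuracy alone is not enough: the rFID is worse, so the check fails.
print(single_codebook_suffices(hypothetical_single, REPORTED_DUAL))  # prints False
```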
Original abstract
The differing representation spaces required for visual understanding and generation pose a challenge in unifying them within the autoregressive paradigm of large language models. A vision tokenizer trained for reconstruction excels at capturing low-level visual appearance, making it well-suited for visual generation but lacking high-level semantic representations for understanding tasks. Conversely, a vision encoder trained via contrastive learning aligns well with language but struggles to decode back into the pixel space for generation tasks. To bridge this gap, we propose DualToken, a method that unifies representations for both understanding and generation within a single tokenizer. However, directly integrating reconstruction and semantic objectives creates conflicts, leading to degraded performance in both reconstruction fidelity and semantic accuracy. Instead of forcing a single codebook to capture both visual appearance and semantics, DualToken disentangles them by introducing separate codebooks for high-level semantics and low-level visual details. As a result, DualToken achieves 0.25 rFID and 82.0% zero-shot accuracy on ImageNet, and demonstrates strong effectiveness in downstream MLLM tasks for both understanding and generation. Specifically, our method surpasses VILA-U by 5.8 points on average across ten visual understanding benchmarks and delivers a 13% improvement on GenAI-Bench. Notably, incorporating dual visual tokens outperforms using a single token type on both understanding and generation tasks. We hope our research offers a new perspective on leveraging dual visual vocabularies for building unified vision-language models. Project page is available at https://songweii.github.io/dualtoken-project-page.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DualToken, a vision tokenizer for autoregressive vision-language models that introduces two separate codebooks—one for high-level semantics and one for low-level visual details—to resolve conflicts that arise when a single codebook is trained for both reconstruction and semantic objectives. The paper reports that this disentanglement yields 0.25 rFID and 82.0% zero-shot ImageNet accuracy while surpassing VILA-U by 5.8 points on average across ten understanding benchmarks and delivering a 13% gain on GenAI-Bench; it further states that dual tokens outperform a single token type on both understanding and generation tasks.
Significance. If the reported gains are reproducible and attributable to the dual-vocabulary design, the work would supply a straightforward architectural response to a recognized tension between reconstruction fidelity and semantic alignment in unified VLMs. The explicit claim that dual tokens improve both task families over a single-token baseline is a constructive element of the presentation.
major comments (2)
- [Abstract] Abstract: the central claim that separate codebooks resolve the reconstruction-semantic conflict without introducing new optimization or integration issues rests on an untested assumption; the provided text supplies no ablation studies, capacity-matched single-codebook controls, or training details that would isolate the contribution of disentanglement to the stated metrics (0.25 rFID, 82.0% accuracy, 5.8-point gain).
- [Experiments] Experiments section: no architecture diagram, loss formulation, or token-integration procedure is described, leaving open whether the dual codebooks are simply concatenated, routed conditionally, or otherwise combined in the autoregressive sequence—an omission that directly affects verification of the method's claimed advantage over VILA-U.
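The phrase "capacity-matched single-codebook controls" in the first comment can be read in more than one way, and the reading changes the size of the control. The Python sketch below compares two readings under stated assumptions: the codebook sizes are hypothetical (the reviewed text does not give them), and "capacity" is taken to mean either total codebook entries or bits of index information per spatial position.

```python
import math

def bits_per_position(codebook_sizes):
    """Bits needed to pick one entry from each codebook at a single spatial position."""
    return sum(math.log2(k) for k in codebook_sizes)

# Hypothetical sizes; the actual DualToken codebook sizes are not stated here.
semantic_size, pixel_size = 4096, 4096

dual_bits = bits_per_position([semantic_size, pixel_size])

# Control A: one codebook with the same total number of entries.
entries_matched = semantic_size + pixel_size
# Control B: one codebook with the same bits per position as the dual pair.
bits_matched = semantic_size * pixel_size

print(dual_bits)                             # 24.0
print(bits_per_position([entries_matched]))  # 13.0
print(bits_per_position([bits_matched]))     # 24.0
```

Under these assumptions, matching entry count gives a much weaker single-codebook control than matching bits per position, so an ablation would need to state which notion it uses.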
minor comments (1)
- [Abstract] Abstract: the statement that 'incorporating dual visual tokens outperforms using a single token type' would be strengthened by an explicit pointer to the corresponding table or figure.
Simulated Author's Rebuttal
Thank you for the detailed review. We address the major comments below and will revise the manuscript accordingly to include additional ablations, diagrams, and descriptions.
- Referee: [Abstract] Abstract: the central claim that separate codebooks resolve the reconstruction-semantic conflict without introducing new optimization or integration issues rests on an untested assumption; the provided text supplies no ablation studies, capacity-matched single-codebook controls, or training details that would isolate the contribution of disentanglement to the stated metrics (0.25 rFID, 82.0% accuracy, 5.8-point gain).
  Authors: While the manuscript does include a comparison showing that dual tokens outperform a single token type on both tasks, we agree that more rigorous ablations, including capacity-matched single-codebook baselines and detailed training procedures, are needed to fully isolate the effect of disentanglement. We will add these in the revised version, along with explicit discussion of any optimization or integration considerations.
  revision: yes
- Referee: [Experiments] Experiments section: no architecture diagram, loss formulation, or token-integration procedure is described, leaving open whether the dual codebooks are simply concatenated, routed conditionally, or otherwise combined in the autoregressive sequence, an omission that directly affects verification of the method's claimed advantage over VILA-U.
  Authors: We acknowledge this omission in the current manuscript. The revised version will include an architecture diagram, the full loss formulation, and a clear description of how the dual tokens are integrated into the autoregressive sequence (e.g., concatenation or conditional routing). This will facilitate verification and comparison with VILA-U.
  revision: yes
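Since the integration procedure is left open, it may help to see how much the candidate layouts differ. The Python sketch below shows two ways of merging semantic and pixel token ids into one autoregressive sequence: block concatenation, which the rebuttal mentions, and per-position interleaving, which is an illustrative alternative not named in the source. The function names, the id-offset scheme, and the example ids are all hypothetical; this is not the paper's procedure.

```python
def offset_pixel_ids(pixel_ids, semantic_vocab_size):
    """Shift pixel ids so both vocabularies share one id space without collisions."""
    return [p + semantic_vocab_size for p in pixel_ids]

def concatenate(semantic_ids, pixel_ids):
    """Layout A: all semantic tokens first, then all pixel tokens."""
    return list(semantic_ids) + list(pixel_ids)

def interleave(semantic_ids, pixel_ids):
    """Layout B: semantic and pixel tokens alternate, one pair per spatial position."""
    if len(semantic_ids) != len(pixel_ids):
        raise ValueError("interleaving needs one token of each type per position")
    merged = []
    for s, p in zip(semantic_ids, pixel_ids):
        merged.extend([s, p])
    return merged

# Hypothetical ids for a three-patch image and a semantic vocabulary of size 8.
semantic_ids = [3, 1, 2]
pixel_ids = offset_pixel_ids([0, 5, 4], semantic_vocab_size=8)

print(concatenate(semantic_ids, pixel_ids))  # [3, 1, 2, 8, 13, 12]
print(interleave(semantic_ids, pixel_ids))   # [3, 8, 1, 13, 2, 12]
```

Both layouts produce the same number of tokens but a different prediction order, which is why the referee's request for an explicit description matters for any comparison with VILA-U.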
Circularity Check
No significant circularity in claimed derivation
The paper motivates DualToken via an empirical observation that a shared codebook creates reconstruction-semantic conflicts and proposes separate codebooks as a direct architectural response. No equations, first-principles derivations, or 'predictions' are presented that reduce by construction to fitted parameters from the same data. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. The reported metrics are framed as experimental outcomes of the disentanglement choice, not as quantities forced by the method's own definitions. This is the common case of a self-contained empirical engineering paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard vision tokenizer training objectives (reconstruction and contrastive learning) remain compatible once separated into distinct codebooks
invented entities (1)
- Dual visual vocabularies (separate semantic and appearance codebooks): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  "Instead of forcing a single codebook to handle both semantic and perceptual information, DualToken disentangles them by introducing separate codebooks for high and low-level features... using shallow-layer features for reconstruction and deep-layer features for semantic learning"
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · echoes
  "this hierarchical decoupling not only resolves the conflict between the two objectives but also enables the semantic learning objective to enhance low-level reconstruction"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 5 Pith papers
- Vision Foundation Models as Generalist Tokenizers for Image Generation
  VFMTok builds a generalist image tokenizer on frozen VFMs using adaptive quantization and semantic alignment, delivering gFID 1.36 for autoregressive and 1.25 for continuous generation on ImageNet with 3x faster convergence.
- Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling
  Scone unifies subject understanding and generation in a two-stage trained model to improve both composition and distinction in multi-subject image generation, outperforming prior open-source models on new benchmarks.
- WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens
  WinTok is a hybrid visual tokenizer that supplements pixel tokens with learnable semantic tokens distilled asymmetrically from foundation models to improve reconstruction, understanding, and generation.
- SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
  SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
- Show-o2: Improved Native Unified Multimodal Models
  Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.