Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification
Pith reviewed 2026-06-27 01:15 UTC · model grok-4.3
The pith
A single discrete visual tokenizer unifies understanding and generation by letting an autoregressive model read its own outputs in shared context.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UniAR shows that a single discrete visual tokenizer, obtained by adapting a pretrained vision encoder with multi-level feature fusion and lookup-free bitwise quantization, serves as the bridge between understanding and generation. This tokenizer supplies a shared context in which the autoregressive model directly consumes its own generated visual tokens. Parallel-bitwise-prediction of spatially grouped multi-level codes shortens the visual sequence, and a diffusion decoder converts the discrete tokens into images. Large-scale pre-training followed by supervised fine-tuning and reinforcement learning produces state-of-the-art image generation and editing results while remaining competitive on multimodal understanding benchmarks.
What carries the argument
The single discrete visual tokenizer that supplies shared context so generated visual tokens can be interpreted directly without re-encoding.
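To make "lookup-free bitwise quantization" concrete, here is a minimal sketch of the general idea as it appears in prior tokenizer work: each latent channel is reduced to one bit by its sign, and the bits are packed into an integer token id, so no learned codebook is searched. This is an illustration under that assumption, not UniAR's actual quantizer; the function names and the 4-channel latent are hypothetical.

```python
from typing import List


def lfq_bits(latent: List[float]) -> List[int]:
    """Quantize each channel to one bit by its sign (assumed scheme, no codebook)."""
    return [1 if x > 0 else 0 for x in latent]


def bits_to_index(bits: List[int]) -> int:
    """Pack bits into an integer token id; the first bit is the most significant."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index


def index_to_code(index: int, num_bits: int) -> List[float]:
    """Recover the quantized vector in {-1, +1}^num_bits from a token id."""
    return [
        1.0 if (index >> (num_bits - 1 - i)) & 1 else -1.0
        for i in range(num_bits)
    ]


# Hypothetical 4-channel latent for a single spatial position.
latent = [0.7, -1.2, 0.1, -0.4]
bits = lfq_bits(latent)            # [1, 0, 1, 0]
token = bits_to_index(bits)        # 0b1010 == 10
code = index_to_code(token, 4)     # [1.0, -1.0, 1.0, -1.0]
vocab_size = 2 ** len(latent)      # 16 implicit codes from 4 bits
```

Under this scheme the effective vocabulary doubles with every added bit while the quantizer itself stores no embedding table, which is one plausible reading of "scaling the effective visual vocabulary at minimal cost".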
If this is right
- The model performs both image understanding and image generation inside one autoregressive sequence without switching token spaces.
- Parallel-bitwise-prediction of multi-level codes reduces visual sequence length and speeds up generation.
- A diffusion decoder converts the discrete tokens into high-fidelity images after autoregressive prediction.
- After pre-training, fine-tuning, and reinforcement learning the system reaches top results on generation and editing while staying competitive on understanding tasks.
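The sequence-length claim in the second bullet can be illustrated with simple arithmetic. The sketch below compares a baseline that emits one code per autoregressive step against a scheme that emits all levels of a spatial group in one step. The grid size, number of levels, and group size are hypothetical values chosen for illustration, and the baseline is an assumed point of comparison rather than one taken from the paper.

```python
def visual_sequence_length(grid_h: int, grid_w: int, levels: int,
                           group: int, parallel: bool) -> int:
    """Count autoregressive steps needed to emit one image's visual codes."""
    if grid_h % group or grid_w % group:
        raise ValueError("grid must be divisible by the group size")
    if parallel:
        # One step emits every level of every position in a group x group block.
        return (grid_h // group) * (grid_w // group)
    # Assumed baseline: one step per code, per level, per spatial position.
    return grid_h * grid_w * levels


# Hypothetical configuration: 32x32 grid, 3 feature levels, 2x2 spatial groups.
baseline = visual_sequence_length(32, 32, levels=3, group=2, parallel=False)
grouped = visual_sequence_length(32, 32, levels=3, group=2, parallel=True)
reduction = baseline // grouped

print(baseline, grouped, reduction)  # 3072 256 12
```

With these assumed numbers the visual sequence shrinks by a factor of levels × group², which is where any speed-up in generation would come from.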
Where Pith is reading between the lines
- The shared-tokenizer design could allow an autoregressive model to chain understanding steps directly into generation steps inside a single forward pass.
- The bitwise quantization approach may extend to other modalities such as video or audio for similar unification gains.
- Because the tokenizer is derived from an existing vision encoder, the method may transfer to new backbone architectures with only modest retraining.
Load-bearing premise
Adapting one pretrained vision encoder with multi-level feature fusion and lookup-free bitwise quantization preserves enough high-level semantics and low-level details to support both understanding and generation tasks at once.
What would settle it
A controlled experiment that measures whether forcing the model to re-encode its own generated tokens improves or degrades performance on image-editing benchmarks would directly test whether the shared-context benefit is real.
Original abstract
Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework where a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel-bitwise-prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on discrete visual tokens to decode high-fidelity images. Through large-scale pre-training, followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at https://sharelab-sii.github.io/uniar-web.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes UniAR, a unified autoregressive framework for multimodal modeling that integrates visual understanding and generation via a single discrete visual tokenizer. This tokenizer is obtained by adapting a pretrained vision encoder through multi-level feature fusion and lookup-free bitwise quantization, which the authors claim preserves both high-level semantics and low-level details. The autoregressive component employs parallel-bitwise-prediction to jointly predict spatially grouped multi-level codes, shortening sequences, while a diffusion-based decoder reconstructs images from the discrete tokens. After large-scale pre-training, supervised fine-tuning, and reinforcement learning, the model is reported to achieve state-of-the-art results on image generation and editing tasks while remaining competitive on multimodal understanding benchmarks.
Significance. If the central architectural claim holds—that a single tokenizer enables a shared context in which generated visual tokens can be directly interpreted without re-encoding—this would constitute a meaningful simplification over approaches that maintain separate tokenizers for understanding and generation. The bitwise quantization scheme for vocabulary scaling at low cost and the parallel prediction strategy are technically interesting contributions. The work would be strengthened by explicit credit for any reproducible code or detailed ablation tables that isolate the tokenizer's contribution; absent those, the significance remains conditional on verification that the adaptation step does not trade off fidelity in one domain for the other.
major comments (2)
- [Tokenizer adaptation] Tokenizer adaptation section (description of multi-level fusion + lookup-free bitwise quantization): the claim that this procedure 'preserves both high-level semantics and low-level details' is load-bearing for the unification benefit. No quantitative reconstruction metrics (e.g., PSNR, LPIPS, or FID on held-out images) or semantic retention metrics (e.g., linear probing accuracy or zero-shot classification) are referenced to demonstrate that the quantization step does not introduce unacceptable loss in either regime; without such evidence the 'key bridge' property cannot be evaluated.
- [Results] Results section (claims of SOTA on generation/editing): the abstract states 'state-of-the-art performance' after pre-training, SFT, and RL but supplies no baselines, dataset sizes, or table references. If the full manuscript contains only aggregate scores without controls for tokenizer quality versus model scale, the attribution of gains specifically to the shared tokenizer remains unverified.
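For reference, PSNR, one of the reconstruction metrics the first major comment asks for, is straightforward to compute. The sketch below uses the standard definition, 10·log10(MAX² / MSE), on flat lists of pixel values; the sample pixel values are made up for illustration and are not results from the paper.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], reconstruction: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel arrays."""
    if len(reference) != len(reconstruction) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Made-up 8-bit pixel values: an original patch and its decoded reconstruction.
original = [52, 55, 61, 59]
decoded = [54, 55, 60, 57]
score = psnr(original, decoded)  # MSE = 2.25, roughly 44.61 dB
```

Reporting this alongside a perceptual metric and a semantic probe on the same held-out images would let readers see whether the tokenizer trades one kind of fidelity for the other.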
minor comments (1)
- [Abstract] The abstract mentions 'parallel-bitwise-prediction' and 'spatially grouped, multi-level visual codes' without defining the grouping or bit allocation; a short notation table or equation would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments point by point below, indicating where revisions will be made to provide additional evidence.
- Referee: [Tokenizer adaptation] Tokenizer adaptation section (description of multi-level fusion + lookup-free bitwise quantization): the claim that this procedure 'preserves both high-level semantics and low-level details' is load-bearing for the unification benefit. No quantitative reconstruction metrics (e.g., PSNR, LPIPS, or FID on held-out images) or semantic retention metrics (e.g., linear probing accuracy or zero-shot classification) are referenced to demonstrate that the quantization step does not introduce unacceptable loss in either regime; without such evidence the 'key bridge' property cannot be evaluated.
  Authors: We agree that quantitative metrics are required to substantiate the claim that the adaptation preserves both high-level semantics and low-level details. The manuscript describes the multi-level fusion and bitwise quantization but does not reference the requested metrics. In the revised manuscript we will add reconstruction metrics (PSNR, LPIPS, FID on held-out images) and semantic retention metrics (linear probing accuracy, zero-shot classification) to the tokenizer adaptation section.
  Revision: yes
- Referee: [Results] Results section (claims of SOTA on generation/editing): the abstract states 'state-of-the-art performance' after pre-training, SFT, and RL but supplies no baselines, dataset sizes, or table references. If the full manuscript contains only aggregate scores without controls for tokenizer quality versus model scale, the attribution of gains specifically to the shared tokenizer remains unverified.
  Authors: The abstract is a concise summary and does not contain baselines or table references; these appear in the results section of the full manuscript together with dataset sizes. To strengthen attribution of gains to the shared tokenizer, we will expand or add ablation studies that control for tokenizer quality versus model scale in the revised version.
  Revision: partial
Circularity Check
No significant circularity; claims rest on architecture design and empirical training outcomes
The paper presents UniAR as an architectural proposal (single discrete visual tokenizer via adapted pretrained encoder, multi-level fusion, and bitwise quantization) whose unification benefit is asserted to emerge from large-scale pre-training, SFT, and RL rather than from any self-referential definition or fitted parameter renamed as prediction. No equations, uniqueness theorems, or self-citations are invoked in the provided text to force the central claim; the tokenizer's dual utility for understanding and generation is framed as an empirical outcome of the training regime, not a tautology. This is the normal non-circular case for an engineering paper whose results are externally falsifiable on benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: a single discrete visual tokenizer can serve as a bridge for both understanding and generation tasks.
invented entities (1)
- Shared context-visual tokenizer: no independent evidence