
arxiv: 2606.11188 · v1 · pith:6WTOKKHNnew · submitted 2026-06-09 · 💻 cs.CV

ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations

Pith reviewed 2026-06-27 13:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords autoregressive multimodal model · discrete visual tokenizer · image generation · image editing · reinforcement learning · next token prediction · vision language model · unified representations

The pith

An autoregressive model using a shared discrete visual tokenizer unifies image understanding, generation, and editing, with RL inducing task synergy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

ARM shows that next-token prediction can serve as the basis for a multimodal system handling multiple image tasks. A tokenizer is first trained with several objectives to turn images into tokens that capture semantics, align with language, and allow reconstruction. This enables a 7B autoregressive model to learn from combined text and image data for both understanding and creation. Applying reinforcement learning then improves results on generation and editing while also making the two tasks support each other. The work argues this combination offers a path to scalable multimodal intelligence.

Core claim

The paper claims that a discrete representation-based autoregressive model can unify image understanding, generation, and editing in one framework. The central mechanism is a multi-objective discrete semantic visual tokenizer that maps images to compact token sequences supporting all three tasks in a shared space. Training a 7B autoregressive model on large-scale token sequences develops the necessary perception and generation abilities. Reinforcement learning is then applied to optimize for visual quality, instruction adherence, and edit consistency, which improves target metrics and surprisingly creates cross-task synergy.
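The RL stage is described as optimizing visual quality, instruction adherence, and edit consistency. One plausible way to turn such objectives into a single scalar reward is a weighted average, sketched below. The weights, the score dictionary keys, and the rule that edit consistency only applies to editing samples are all assumptions made for illustration; the abstract does not say how the objectives are combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; not taken from the paper.
    visual_quality: float = 1.0
    instruction_adherence: float = 1.0
    edit_consistency: float = 1.0


def combined_reward(scores, weights=RewardWeights(), is_edit=False):
    """Weighted average of per-objective scores, each assumed to lie in [0, 1].

    Edit consistency is only counted for editing samples (an assumption).
    """
    terms = [
        (weights.visual_quality, scores["visual_quality"]),
        (weights.instruction_adherence, scores["instruction_adherence"]),
    ]
    if is_edit:
        terms.append((weights.edit_consistency, scores["edit_consistency"]))
    total_weight = sum(w for w, _ in terms)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in terms) / total_weight
```

A generation sample would be scored with `combined_reward(scores)`, an editing sample with `combined_reward(scores, is_edit=True)`. Sharing the first two terms between the tasks is one simple way the two objectives could overlap, though whether that overlap explains the reported synergy is not something the abstract establishes.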

What carries the argument

The multi-objective discrete semantic visual tokenizer that creates token sequences for a shared latent space across tasks, processed through autoregressive next-token prediction and refined by reinforcement learning.
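As a rough illustration of what a shared discrete token space can look like, the sketch below packs text token ids and image token ids into one sequence and forms next-token prediction pairs. The vocabulary sizes, the begin/end-of-image markers, and the function names are assumptions made for the example; the abstract does not specify them.

```python
# Hypothetical sizes; not taken from the paper.
TEXT_VOCAB_SIZE = 1000
IMAGE_CODEBOOK_SIZE = 256
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE  # begin-of-image marker id
EOI = BOI + 1                                # end-of-image marker id


def pack_sequence(text_ids, image_codes):
    """Concatenate text ids and image codes into one id sequence.

    Image codes are shifted past the text vocabulary so that both
    modalities share a single id space.
    """
    for t in text_ids:
        if not 0 <= t < TEXT_VOCAB_SIZE:
            raise ValueError(f"text id out of range: {t}")
    for c in image_codes:
        if not 0 <= c < IMAGE_CODEBOOK_SIZE:
            raise ValueError(f"image code out of range: {c}")
    shifted = [TEXT_VOCAB_SIZE + c for c in image_codes]
    return list(text_ids) + [BOI] + shifted + [EOI]


def next_token_pairs(sequence):
    """Return (context, target) pairs for next-token prediction."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]
```

Under this layout, understanding, generation, and editing differ only in which tokens form the prompt and which form the target, so a single autoregressive model can in principle be trained on all three.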

If this is right

  • Metrics on text-to-image generation and instruction-guided editing improve after RL.
  • Cross-task synergy emerges between generation and editing.
  • The unified approach provides a scalable foundation for multimodal intelligence using next-token prediction.
  • Separate models for understanding, generation, and editing become unnecessary.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the shared space works, the same method could incorporate additional tasks like visual reasoning without new architectures.
  • The observed synergy implies that preference optimization can uncover beneficial interactions between tasks in the representation space.
  • Further scaling of the autoregressive model or data might amplify these unified capabilities.

Load-bearing premise

The multi-objective discrete semantic visual tokenizer succeeds in producing token sequences that support understanding, generation, and editing together without major task interference or information loss.

What would settle it

An experiment in which applying RL to optimize generation and editing causes a measurable decline in understanding performance or eliminates the reported synergy.

Original abstract

This paper introduces ARM, a discrete representation-based AutoRegressive Model that unifies image understanding, generation, and editing within a next-token prediction framework. ARM is built on three efforts: first, we train a discrete semantic visual tokenizer that maps images into compact token sequences. Our tokenizer is supervised with multiple objectives that jointly promote semantic discriminability, language alignment and faithful reconstruction, thereby supporting diverse tasks in a shared latent space. With this, we train a 7B autoregressive model over large-scale text and image token sequences, seamlessly developing vision-language perception and generation capabilities. Finally, to further improve preference-aligned behavior for text-to-image generation and instruction-guided editing, ARM applies reinforcement learning (RL) to optimize task-level objectives such as visual quality, instruction adherence, and edit consistency. Surprisingly, the results show that RL not only substantially improves performance on the target tasks (e.g., raising WISE overall from 0.50 to 0.56, GEdit-Bench-EN G_O from 5.75 to 6.68), but also induces cross-task synergy between text-to-image generation and editing. Collectively, these findings highlight autoregressive modeling, when paired with strong representations and preference optimization, as a scalable foundation for multimodal intelligence. Code: https://github.com/wdrink/ARM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces ARM, a 7B autoregressive multimodal model built on a multi-objective discrete semantic visual tokenizer that produces unified token sequences for image understanding, generation, and editing. After pretraining the tokenizer and the AR model on large-scale text-image sequences, the authors apply reinforcement learning to optimize task-level objectives for text-to-image generation and instruction-guided editing; they report metric gains (WISE overall 0.50→0.56; GEdit-Bench-EN G_O 5.75→6.68) and claim that the RL stage induces cross-task synergy between the two generation/editing tasks.
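To put the two reported gains on a comparable footing, the short calculation below converts them to relative improvements. The before and after values are the ones quoted from the abstract; the helper name `relative_gain` is ours.

```python
def relative_gain(before, after):
    """Relative change from before to after, as a percentage."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return 100.0 * (after - before) / before


# Values quoted in the abstract.
reported = {
    "WISE overall": (0.50, 0.56),
    "GEdit-Bench-EN G_O": (5.75, 6.68),
}

for name, (before, after) in reported.items():
    gain = relative_gain(before, after)
    print(f"{name}: {after - before:+.2f} absolute, {gain:.1f}% relative")
```

This works out to roughly 12% for WISE and 16% for GEdit-Bench-EN. The two benchmarks use different scales, so the relative figures are easier to compare than the absolute ones.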

Significance. If the synergy result is shown to arise from joint rather than independent optimization, the work would strengthen the case for unified discrete token spaces as a scalable substrate for multimodal AR models and would illustrate positive transfer from preference optimization across related vision-language tasks. The public code release is a concrete strength that supports reproducibility.

major comments (1)
  1. [Abstract] The central claim that RL 'induces cross-task synergy' between text-to-image generation and instruction-guided editing rests on an untested assumption. The abstract reports only the joint-RL outcome and attributes mutual gains to synergy, without any comparison to separate per-task RL runs (generation-only or editing-only). This control is load-bearing for the synergy interpretation; its absence leaves open the possibility that the observed improvements are additive rather than interactive.
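The control requested above can be made concrete with a small comparison. The sketch below assumes four runs exist (base model, generation-only RL, editing-only RL, joint RL), each scored on both tasks with a higher-is-better metric. The function name and the dictionary layout are hypothetical, and no numbers from the paper are used.

```python
def transfer_effects(base, gen_only, edit_only, joint):
    """Compare per-task RL runs against a joint RL run.

    Each argument is a dict with keys "gen" and "edit" holding a
    higher-is-better score for that task.
    """
    return {
        # Change on the task that was NOT optimized: a non-zero value
        # indicates cross-task transfer.
        "gen_to_edit_transfer": gen_only["edit"] - base["edit"],
        "edit_to_gen_transfer": edit_only["gen"] - base["gen"],
        # Joint-run score beyond what the matching single-task run achieves.
        "joint_extra_gen": joint["gen"] - gen_only["gen"],
        "joint_extra_edit": joint["edit"] - edit_only["edit"],
    }
```

If the transfer terms and the joint-extra terms are all near zero, the joint gains are simply the two single-task gains obtained side by side. Positive values would support an interactive reading, provided they exceed run-to-run variation across seeds.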

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and for highlighting the evidentiary gap in our synergy claim. We agree that the abstract's attribution of cross-task improvements to synergy requires qualification in the absence of per-task RL controls, and we will revise the manuscript to address this directly.

Point-by-point responses
  1. Referee: [Abstract] The central claim that RL 'induces cross-task synergy' between text-to-image generation and instruction-guided editing rests on an untested assumption. The abstract reports only the joint-RL outcome and attributes mutual gains to synergy, without any comparison to separate per-task RL runs (generation-only or editing-only). This control is load-bearing for the synergy interpretation; its absence leaves open the possibility that the observed improvements are additive rather than interactive.

    Authors: We acknowledge the validity of this observation. The manuscript reports performance gains on both tasks after joint RL but does not include separate generation-only or editing-only RL baselines that would isolate interactive effects from additive ones. The synergy interpretation is therefore inferential rather than directly tested. We will revise the abstract (and corresponding discussion) to describe the observed joint improvements without using the term 'synergy' or implying interactive transfer, and we will add an explicit limitations statement noting the lack of these controls. If compute resources allow during revision, we will attempt to run the per-task ablations and report them; otherwise the claim will be removed.

    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pipeline with no derivation reductions

Full rationale

The paper presents an empirical training sequence—multi-objective discrete tokenizer, autoregressive next-token modeling on text/image tokens, then RL for preference objectives—without any claimed mathematical derivations, uniqueness theorems, or predictions that reduce by construction to fitted inputs or self-citations. Reported metric gains (e.g., WISE 0.50→0.56) are framed as training outcomes, not tautological restatements. The work is self-contained against external benchmarks with no load-bearing self-citation chains or ansatz smuggling visible in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; all technical details are absent.


