ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations

Chaorui Deng; Danhui Guan; Feng Li; Hao Chen; Haoqi Fan; Jiacheng Pan; Jingxiang Sun; Junke Wang; Kaibin Tian; Kun Xu

arxiv: 2606.11188 · v1 · pith:6WTOKKHNnew · submitted 2026-06-09 · 💻 cs.CV

Junke Wang, Xiao Wang, Jiacheng Pan, Xuefeng Hu, Feng Li, Jingxiang Sun, Chaorui Deng, Zilong Chen, Yunpeng Chen, Kaibin Tian, Matthew Gwilliam, Hao Chen, Danhui Guan, Kun Xu, Weilin Huang, Zuxuan Wu, Haoqi Fan, Yu-Gang Jiang, Zhenheng Yang

Pith reviewed 2026-06-27 13:27 UTC · model grok-4.3

classification 💻 cs.CV

keywords autoregressive multimodal model · discrete visual tokenizer · image generation · image editing · reinforcement learning · next token prediction · vision language model · unified representations

The pith

An autoregressive model using a shared discrete visual tokenizer unifies image understanding, generation, and editing, with RL inducing task synergy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

ARM shows that next-token prediction can serve as the basis for a multimodal system handling multiple image tasks. A tokenizer is first trained with several objectives to turn images into tokens that capture semantics, align with language, and allow reconstruction. This enables a 7B autoregressive model to learn from combined text and image data for both understanding and creation. Applying reinforcement learning then improves results on generation and editing while also making the two tasks support each other. The work argues this combination offers a path to scalable multimodal intelligence.
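The idea of learning from combined text and image data with one next-token objective can be sketched as follows. This is a minimal illustration, not ARM's implementation: the vocabulary size, codebook size, the begin/end-of-image markers, and the function names are all assumptions made for the example.

```python
from typing import List, Tuple

# All sizes and special-token ids below are illustrative assumptions,
# not values reported for ARM.
TEXT_VOCAB_SIZE = 32_000       # hypothetical text vocabulary size
VISUAL_CODEBOOK_SIZE = 8_192   # hypothetical visual codebook size
BOI = TEXT_VOCAB_SIZE + VISUAL_CODEBOOK_SIZE  # hypothetical begin-of-image id
EOI = BOI + 1                                 # hypothetical end-of-image id


def to_unified_ids(text_ids: List[int], image_codes: List[int]) -> List[int]:
    """Concatenate text ids and offset visual codes into one id sequence."""
    for code in image_codes:
        if not 0 <= code < VISUAL_CODEBOOK_SIZE:
            raise ValueError(f"visual code {code} is outside the codebook")
    # Shift visual codes past the text vocabulary so the two id ranges never collide.
    shifted = [TEXT_VOCAB_SIZE + code for code in image_codes]
    return text_ids + [BOI] + shifted + [EOI]


def next_token_pairs(sequence: List[int]) -> List[Tuple[List[int], int]]:
    """Return (prefix, target) pairs: each prefix is used to predict the next id."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


prompt_ids = [17, 942, 5]     # stand-in for a tokenized caption
image_codes = [3, 8191, 42]   # stand-in for the visual tokenizer's output
sequence = to_unified_ids(prompt_ids, image_codes)
pairs = next_token_pairs(sequence)
```

Placing the caption before the image codes corresponds to generation; placing the image codes first and the text after would correspond to understanding, with the same training objective in both cases.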

Core claim

The paper claims that a discrete representation-based autoregressive model can unify image understanding, generation, and editing in one framework. The central mechanism is a multi-objective discrete semantic visual tokenizer that maps images to compact token sequences supporting all three tasks in a shared space. Training a 7B autoregressive model on large-scale token sequences develops the necessary perception and generation abilities. Reinforcement learning is then applied to optimize for visual quality, instruction adherence, and edit consistency, which improves target metrics and surprisingly creates cross-task synergy.
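To make the RL step concrete, here is a hedged sketch of how several task-level scores might be combined into one reward and turned into per-sample advantages. The abstract does not name the RL algorithm or the reward combination, so the weighted sum, the group-normalised advantage, the weights, and the example scores are all assumptions.

```python
from statistics import mean, pstdev
from typing import Dict, List

# Hypothetical weights: the abstract names the objectives but not how they are combined.
REWARD_WEIGHTS: Dict[str, float] = {
    "visual_quality": 1.0,
    "instruction_adherence": 1.0,
    "edit_consistency": 1.0,
}


def combined_reward(scores: Dict[str, float]) -> float:
    """Weighted sum over whichever objective scores are present for a sample."""
    return sum(REWARD_WEIGHTS[name] * value for name, value in scores.items())


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalise rewards within a group of samples drawn for the same prompt."""
    centre = mean(rewards)
    spread = pstdev(rewards)
    return [(reward - centre) / (spread + eps) for reward in rewards]


# Four made-up samples for one editing prompt, scored by placeholder reward models.
group = [
    {"visual_quality": 0.6, "instruction_adherence": 0.9, "edit_consistency": 0.7},
    {"visual_quality": 0.8, "instruction_adherence": 0.4, "edit_consistency": 0.5},
    {"visual_quality": 0.5, "instruction_adherence": 0.5, "edit_consistency": 0.9},
    {"visual_quality": 0.9, "instruction_adherence": 0.8, "edit_consistency": 0.8},
]
rewards = [combined_reward(scores) for scores in group]
advantages = group_relative_advantages(rewards)
```

A text-to-image sample would simply omit `edit_consistency`; samples with positive advantage would have their token sequences made more likely, and those with negative advantage less likely.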

What carries the argument

The multi-objective discrete semantic visual tokenizer that creates token sequences for a shared latent space across tasks, processed through autoregressive next-token prediction and refined by reinforcement learning.

If this is right

Metrics on text-to-image generation and instruction-guided editing improve after RL.
Cross-task synergy emerges between generation and editing.
The unified approach provides a scalable foundation for multimodal intelligence using next-token prediction.
Separate models for understanding, generation, and editing become unnecessary.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the shared space works, the same method could incorporate additional tasks like visual reasoning without new architectures.
The observed synergy implies that preference optimization can uncover beneficial interactions between tasks in the representation space.
Further scaling of the autoregressive model or data might amplify these unified capabilities.

Load-bearing premise

The multi-objective discrete semantic visual tokenizer succeeds in producing token sequences that support understanding, generation, and editing together without major task interference or information loss.
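One common way to write such a multi-objective tokenizer loss is a weighted sum. The form below is an assumption for illustration: the abstract names the three goals but gives neither the individual loss terms nor their weights, so every symbol here is hypothetical.

```latex
\mathcal{L}_{\mathrm{tok}}
  = \lambda_{\mathrm{sem}}\,\mathcal{L}_{\mathrm{sem}}
  + \lambda_{\mathrm{align}}\,\mathcal{L}_{\mathrm{align}}
  + \lambda_{\mathrm{rec}}\,\mathcal{L}_{\mathrm{rec}},
\qquad
\lambda_{\mathrm{sem}},\ \lambda_{\mathrm{align}},\ \lambda_{\mathrm{rec}} \ge 0
```

Under this reading, the premise amounts to saying that some setting of the weights keeps all three terms low at once. Task interference would show up as one term rising whenever another weight is increased.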

What would settle it

An experiment in which applying RL to optimize generation and editing causes a measurable decline in understanding performance or eliminates the reported synergy.

Original abstract

This paper introduces ARM, a discrete representation-based AutoRegressive Model that unifies image understanding, generation, and editing within a next-token prediction framework. ARM is built on three efforts: first, we train a discrete semantic visual tokenizer that maps images into compact token sequences. Our tokenizer is supervised with multiple objectives that jointly promote semantic discriminability, language alignment and faithful reconstruction, thereby supporting diverse tasks in a shared latent space. With this, we train a 7B autoregressive model over large-scale text and image token sequences, seamlessly developing vision-language perception and generation capabilities. Finally, to further improve preference-aligned behavior for text-to-image generation and instruction-guided editing, ARM applies reinforcement learning (RL) to optimize task-level objectives such as visual quality, instruction adherence, and edit consistency. Surprisingly, the results show that RL not only substantially improves performance on the target tasks (e.g., raising WISE overall from 0.50 to 0.56, GEdit-Bench-EN G_O from 5.75 to 6.68), but also induces cross-task synergy between text-to-image generation and editing. Collectively, these findings highlight autoregressive modeling, when paired with strong representations and preference optimization, as a scalable foundation for multimodal intelligence. Code: https://github.com/wdrink/ARM.
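The two reported improvements are on different scales, so it can help to compare them as relative gains. The short calculation below uses only the before/after values quoted in the abstract; the helper name `gain` is introduced here for illustration.

```python
from typing import Dict, Tuple


def gain(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain, relative gain in percent)."""
    absolute = after - before
    return absolute, 100.0 * absolute / before


# Before/after values as quoted in the abstract.
reported: Dict[str, Tuple[float, float]] = {
    "WISE overall": (0.50, 0.56),
    "GEdit-Bench-EN G_O": (5.75, 6.68),
}

for name, (before, after) in reported.items():
    absolute, relative = gain(before, after)
    print(f"{name}: +{absolute:.2f} absolute, +{relative:.1f}% relative")
# WISE overall: +0.06 absolute, +12.0% relative
# GEdit-Bench-EN G_O: +0.93 absolute, +16.2% relative
```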

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ARM's main addition is a multi-objective tokenizer plus RL that lifts both generation and editing metrics, but the claimed cross-task synergy lacks the joint-vs-separate controls needed to confirm it.

The paper's core move is training one discrete visual tokenizer with combined semantic, alignment, and reconstruction losses, feeding it into a 7B autoregressive model, then running RL on both text-to-image generation and instruction-guided editing. They report gains on WISE (0.50 to 0.56) and GEdit-Bench (5.75 to 6.68) and say the RL stage creates positive transfer between the two tasks.

The tokenizer design is the clearest piece of new work. Supervising it for multiple objectives at once is a direct attempt to make the same token space usable for perception, generation, and editing without obvious task interference. That is a practical step beyond single-objective tokenizers in prior autoregressive multimodal models. The RL application after the base AR training is also straightforward and produces the reported metric improvements.

The soft spot is the synergy interpretation. The abstract presents the mutual gains as evidence that RL induces cross-task synergy, yet it gives no results from running RL on generation alone or editing alone. Without those controls it is difficult to tell whether the improvements come from interaction between the tasks or simply from applying RL to each task independently. The abstract also omits baselines, error bars, dataset sizes, and ablation numbers, so the strength of the metric claims is hard to judge from the summary alone.

This is incremental work that sits inside the existing autoregressive multimodal line rather than breaking new ground on fundamentals. Readers already tracking models that unify generation and understanding under next-token prediction will find the tokenizer and RL details useful to examine. The central empirical claims are testable once the full experiments are available.

I would bring the tokenizer section and the RL results to a reading group to check the implementation details. The paper is coherent enough on its own terms to merit peer review, mainly so referees can ask for the missing controls on the synergy claim and the full set of ablations.

Referee Report

1 major / 0 minor

Summary. The paper introduces ARM, a 7B autoregressive multimodal model built on a multi-objective discrete semantic visual tokenizer that produces unified token sequences for image understanding, generation, and editing. After pretraining the tokenizer and the AR model on large-scale text-image sequences, the authors apply reinforcement learning to optimize task-level objectives for text-to-image generation and instruction-guided editing; they report metric gains (WISE overall 0.50→0.56; GEdit-Bench-EN G_O 5.75→6.68) and claim that the RL stage induces cross-task synergy between generation and editing.

Significance. If the synergy result is shown to arise from joint rather than independent optimization, the work would strengthen the case for unified discrete token spaces as a scalable substrate for multimodal AR models and would illustrate positive transfer from preference optimization across related vision-language tasks. The public code release is a concrete strength that supports reproducibility.

major comments (1)

[Abstract] The central claim that RL 'induces cross-task synergy' between text-to-image generation and instruction-guided editing rests on an untested assumption. The abstract reports only the joint-RL outcome and attributes mutual gains to synergy, without any comparison to separate per-task RL runs (generation-only or editing-only). This control is load-bearing for the synergy interpretation; its absence leaves open the possibility that the observed improvements are additive rather than interactive.
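The requested control can be stated as a small calculation over four runs per metric: no RL, RL on the metric's own task only, RL on the other task only, and joint RL. The sketch below is hypothetical. The class and function names are invented here, and the two single-task scores in the example are placeholders, not measured results; only the base and joint values come from the abstract.

```python
from dataclasses import dataclass


@dataclass
class TaskScores:
    """Scores on one task's metric under four training conditions."""
    base: float           # before RL
    own_task_rl: float    # RL on this task only
    other_task_rl: float  # RL on the other task only
    joint_rl: float       # RL on both tasks together


def transfer_from_other_task(scores: TaskScores) -> float:
    """Gain on this metric from RL that never optimized this task."""
    return scores.other_task_rl - scores.base


def joint_minus_own_task(scores: TaskScores) -> float:
    """Extra gain from joint RL beyond what single-task RL already gives."""
    return scores.joint_rl - scores.own_task_rl


# WISE overall: base and joint values are from the abstract;
# the two single-task values are PLACEHOLDERS, not reported numbers.
generation = TaskScores(base=0.50, own_task_rl=0.54, other_task_rl=0.51, joint_rl=0.56)

cross_task_gain = transfer_from_other_task(generation)
joint_bonus = joint_minus_own_task(generation)
```

A synergy claim would need both quantities to be positive by more than run-to-run noise. If the joint score matched the own-task score, the improvement would be explained by applying RL to each task independently.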

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and for highlighting the evidentiary gap in our synergy claim. We agree that the abstract's attribution of cross-task improvements to synergy requires qualification in the absence of per-task RL controls, and we will revise the manuscript to address this directly.

Referee: [Abstract] The central claim that RL 'induces cross-task synergy' between text-to-image generation and instruction-guided editing rests on an untested assumption. The abstract reports only the joint-RL outcome and attributes mutual gains to synergy, without any comparison to separate per-task RL runs (generation-only or editing-only). This control is load-bearing for the synergy interpretation; its absence leaves open the possibility that the observed improvements are additive rather than interactive.

Authors: We acknowledge the validity of this observation. The manuscript reports performance gains on both tasks after joint RL but does not include separate generation-only or editing-only RL baselines that would isolate interactive effects from additive ones. The synergy interpretation is therefore inferential rather than directly tested. We will revise the abstract (and corresponding discussion) to describe the observed joint improvements without using the term 'synergy' or implying interactive transfer, and we will add an explicit limitations statement noting the lack of these controls. If compute resources allow during revision, we will attempt to run the per-task ablations and report them; otherwise the claim will be removed.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pipeline with no derivation reductions

The paper presents an empirical training sequence—multi-objective discrete tokenizer, autoregressive next-token modeling on text/image tokens, then RL for preference objectives—without any claimed mathematical derivations, uniqueness theorems, or predictions that reduce by construction to fitted inputs or self-citations. Reported metric gains (e.g., WISE 0.50→0.56) are framed as training outcomes, not tautological restatements. The work is self-contained against external benchmarks with no load-bearing self-citation chains or ansatz smuggling visible in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; all technical details are absent.

pith-pipeline@v0.9.1-grok · 5833 in / 1155 out tokens · 26927 ms · 2026-06-27T13:27:17.759493+00:00 · methodology

Reference graph

Works this paper leans on

104 extracted references · 54 canonical work pages · 41 internal anchors

[1]

GPT-4 Technical Report

JoshAchiam,StevenAdler,SandhiniAgarwal,LamaAhmad,IlgeAkkaya,FlorenciaLeoniAleman,DiogoAlmeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXivpreprintarXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022

2022
[3]

Humanedit: A high-quality human-rewarded dataset for instruction-based image editing.arXivpreprintarXiv:2412.04280, 2024

Jinbin Bai, Wei Chow, Ling Yang, Xiangtai Li, Juncheng Li, Hanwang Zhang, and Shuicheng Yan. Humanedit: A high-quality human-rewarded dataset for instruction-based image editing.arXivpreprintarXiv:2412.04280, 2024

work page arXiv 2024
[4]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shĳie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXivpreprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Flux.https://github.com/black-forest-labs/flux, 2024

Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024

2024
[6]

Perception encoder: The best visual embeddings are not at the output of the network

Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, et al. Perception encoder: The best visual embeddings are not at the output of the network. InNeurIPS, 2025

2025
[7]

Instructpix2pix: Learning to follow image editing instructions

Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. InCVPR, 2023

2023
[8]

Video generation models as world simulators.OpenAI Blog, 2024

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators.OpenAI Blog, 2024

2024
[9]

Language models are few-shot learners

TomBrown,BenjaminMann,NickRyder,MelanieSubbiah,JaredDKaplan,PrafullaDhariwal,ArvindNeelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. InNeurIPS, 2020

2020
[10]

Maskgit: Masked generative image transformer

Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. InCVPR, 2022

2022
[11]

Muse: Text-to-image generation via masked generative transformers

HuiwenChang,HanZhang,JarredBarber,AJMaschinot,JoseLezama,LuJiang,Ming-HsuanYang,KevinMurphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. arXivpreprintarXiv:2301.00704, 2023

work page arXiv 2023
[12]

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXivpreprintarXiv:2505.09568, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart-alpha: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXivpreprintarXiv:2310.00426, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

Sharegpt-4o-image: Aligning multimodal models with gpt-4o-level image generation.arXiv preprint arXiv:2506.18095, 2025

Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji, Xidong Wang, Yunjin Yang, and Benyou Wang. Sharegpt-4o-image: Aligning multimodal models with gpt-4o-level image generation.arXiv preprint arXiv:2506.18095, 2025

work page arXiv 2025
[15]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprintarXiv:2412.05271, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

Common crawl: Open repository of web crawl data.https://commoncrawl.org/, 2007

Common Crawl. Common crawl: Open repository of web crawl data.https://commoncrawl.org/, 2007

2007
[18]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXivpreprintarXiv:2505.14683, 2025. 13

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InCVPR, 2009

2009
[20]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021

2021
[21]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InICML, 2024

2024
[22]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models.arXivpreprintarXiv:2306.13394, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

Yuying Ge, Sĳie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodalmodelswithunifiedmulti-granularitycomprehensionandgeneration. arXivpreprintarXiv:2404.14396, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Geneval: An object-focused framework for evaluating text-to-image alignment

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. InNeurIPS, 2023

2023
[25]

Generative adversarial nets

Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. InNeurIPS, 2014

2014
[26]

Experiment with gemini 2.0 flash native image generation.https://developers

Google Developers Blog. Experiment with gemini 2.0 flash native image generation.https://developers. googleblog.com/en/experiment-with-gemini-20-flash-native-image-generation/, March 2025

2025
[27]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 2025

2025
[28]

Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report.arXivpreprint arXiv:2505.07062, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Dera: Decoupled representation alignment for video tokenization.arXivpreprintarXiv:2512.04483, 2025

Pengbo Guo, Junke Wang, Zhen Xing, Chengxu Liu, Daoguo Dong, Xueming Qian, and Zuxuan Wu. Dera: Decoupled representation alignment for video tokenization.arXivpreprintarXiv:2512.04483, 2025

work page arXiv 2025
[30]

arXiv preprint arXiv:2506.18898 , year=

Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, and Lu Jiang. Vision as a dialect: Unifying visual understanding and generation via text-aligned representations.arXivpreprint arXiv:2506.18898, 2025

work page arXiv 2025
[31]

Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis

Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. InCVPR, 2025

2025
[32]

Mvimgnet2

Xiaoguang Han, Yushuang Wu, Luyue Shi, Haolin Liu, Hongjie Liao, Lingteng Qiu, Weihao Yuan, Xiaodong Gu, Zilong Dong, and Shuguang Cui. Mvimgnet2. 0: A larger-scale dataset of multi-view images.arXiv preprint arXiv:2412.01430, 2024

work page arXiv 2024
[33]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXivpreprintarXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[34]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. InNeurIPS, 2020

2020
[35]

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment.arXivpreprintarXiv:2403.05135, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

Gqa: Anewdatasetforreal-worldvisualreasoningandcompositional question answering

DrewAHudsonandChristopherDManning. Gqa: Anewdatasetforreal-worldvisualreasoningandcompositional question answering. InCVPR, 2019

2019
[37]

Hq-edit: A high-quality dataset for instruction-based image editing.arXivpreprintarXiv:2404.09990, 2024

MudeHui, SiweiYang, BingchenZhao, YichunShi, HengWang, PengWang, YuyinZhou, andCihangXie. Hq-edit: A high-quality dataset for instruction-based image editing.arXivpreprintarXiv:2404.09990, 2024

work page arXiv 2024
[38]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXivpreprintarXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

Auto-Encoding Variational Bayes

Diederik P Kingma and Max Welling. Auto-encoding variational bayes.arXiv preprintarXiv:1312.6114, 2013. 14

work page internal anchor Pith review Pith/arXiv arXiv 2013
[40]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXivpreprintarXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Seed-bench: Benchmarking multimodal large language models

Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench: Benchmarking multimodal large language models. InCVPR, 2024

2024
[42]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InICML, 2023

2023
[43]

arXiv preprint arXiv:2406.08418 , year=

Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, et al. Omnicorpus: A unified multimodal corpus of 10 billion-level images interleaved with text.arXivpreprintarXiv:2406.08418, 2024

work page arXiv 2024
[44]

Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models.arXivpreprintarXiv:2305.10355, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[45]

Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, and Weilin Huang. Mogao: An omni foundation model for interleaved multi-modal generation.arXiv preprint arXiv:2505.05472, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld: High-resolution semantic encoders for unified visual understanding and generation. arXivpreprintarXiv:2506.03147, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[47]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXivpreprintarXiv:2210.02747, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[48]

World Model on Million-Length Video And Language With Blockwise RingAttention

Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with blockwise ringattention.arXivpreprint arXiv:2402.08268, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[49]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InNeurIPS, 2023

2023
[50]

Step1X-Edit: A Practical Framework for General Image Editing

Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing.arXivpreprintarXiv:2504.17761, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[51]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXivpreprintarXiv:2209.03003, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[52]

Mmbench: Is your multi-modal model an all-around player? InECCV, 2024

YuanLiu,HaodongDuan,YuanhanZhang,BoLi,SongyangZhang,WangboZhao,YikeYuan,JiaqiWang,Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player? InECCV, 2024

2024
[53]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXivpreprintarXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[54]

Atoken: A unified tokenizer for vision.arXivpreprintarXiv:2509.14476, 2025

Jiasen Lu, Liangchen Song, Mingze Xu, Byeongjoo Ahn, Yanjun Wang, Chen Chen, Afshin Dehghan, and Yinfei Yang. Atoken: A unified tokenizer for vision.arXivpreprintarXiv:2509.14476, 2025

work page arXiv 2025
[55]

Unitok: A unified tokenizer for visual generation and understanding

Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, and Xiaojuan Qi. Unitok: A unified tokenizer for visual generation and understanding. InNeurIPS, 2025

2025
[56]

Finite Scalar Quantization: VQ-VAE Made Simple

Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: Vq-vae made simple.arXiv preprintarXiv:2309.15505, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[57]

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Kunpeng Ning, Bin Zhu, et al. Wise: A world knowledge-informed semantic evaluation for text-to-image generation.arXiv preprintarXiv:2503.07265, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[58]

Dall·e 3.https://openai.com/research/dall-e-3, September 2023

OpenAI. Dall·e 3.https://openai.com/research/dall-e-3, September 2023

2023
[59]

Introducing gpt-4.1 in the api

OpenAI. Introducing gpt-4.1 in the api. https://openai.com/index/gpt-4-1/, 2025

2025
[60]

Openai o3 and o4-mini system card

OpenAI. Openai o3 and o4-mini system card. Technical report, OpenAI, April 2025. URL https://cdn. openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf . Ac- cessed: 2026-01-28. 15

2025
[61]

Transfer between Modalities with MetaQueries

Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Transfer between modalities with metaqueries.arXiv preprint arXiv:2504.06256, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[62]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, 2023

2023
[63]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[64]

Variational autoencoder for deep learning of images, labels and captions

Yunchen Pu, Zhe Gan, Ricardo Henao, Xin Yuan, Chunyuan Li, Andrew Stevens, and Lawrence Carin. Variational autoencoder for deep learning of images, labels and captions. InNeurIPS, 2016

2016
[65]

arXiv preprint arXiv:2503.21758 , year=

Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Jiakang Yuan, Xinyue Li, Dongyang Liu, et al. Lumina-image 2.0: A unified and efficient image generative framework.arXiv preprintarXiv:2503.21758, 2025

work page arXiv 2025
[66]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InICML, 2021

2021
[67]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InCVPR, 2022

2022
[68]

Seedream 4.0: Toward Next-generation Multimodal Image Generation

Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation.arXiv preprint arXiv:2509.20427, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[69]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXivpreprintarXiv:2402.03300, 2024

[70]

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. In ECCS, 2025.

[71]

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.

[72]

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024.

[73]

Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners. In CVPR, 2024.

[74]

Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.

[75]

Qwen Team et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.

[76]

Rui Tian, Mingfei Gao, Mingze Xu, Jiaming Hu, Jiasen Lu, Zuxuan Wu, Yinfei Yang, and Afshin Dehghan. Unigen: Enhanced training & test-time strategies for unified multimodal understanding and generation. In NeurIPS, 2025.

[77]

Rui Tian, Mingfei Gao, Haiming Gang, Jiasen Lu, Zhe Gan, Yinfei Yang, Zuxuan Wu, and Afshin Dehghan. Unigen-1.5: Enhancing image generation and editing through reward unification in reinforcement learning. In CVPR, 2026.

[78]

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.

[79]

Hanyu Wang, Jiaming Han, Ziyan Yang, Qi Zhao, Shanchuan Lin, Xiangyu Yue, Abhinav Shrivastava, Zhenheng Yang, and Hao Chen. Growing visual generative capacity for pre-trained MLLMs. arXiv preprint arXiv:2510.01546, 2025.

[80]

Junke Wang, Yi Jiang, Zehuan Yuan, Bingyue Peng, Zuxuan Wu, and Yu-Gang Jiang. OmniTokenizer: A joint image-video tokenizer for visual generation. In NeurIPS, 2024.

