ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations
Pith reviewed 2026-06-27 13:27 UTC · model grok-4.3
The pith
An autoregressive model using a shared discrete visual tokenizer unifies image understanding, generation, and editing, with RL inducing task synergy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a discrete representation-based autoregressive model can unify image understanding, generation, and editing in one framework. The central mechanism is a multi-objective discrete semantic visual tokenizer that maps images to compact token sequences supporting all three tasks in a shared space. Training a 7B autoregressive model on large-scale token sequences develops the necessary perception and generation abilities. Reinforcement learning is then applied to optimize visual quality, instruction adherence, and edit consistency; the paper reports that this improves the target metrics and, unexpectedly, induces cross-task synergy between generation and editing.
What carries the argument
The argument rests on the multi-objective discrete semantic visual tokenizer, which maps images to token sequences in a latent space shared across tasks. Those sequences are modeled with autoregressive next-token prediction, and the resulting model is then refined by reinforcement learning.
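To make the idea of a shared token space concrete, here is a minimal sketch of how discrete image codes and text tokens could be placed in one id space and turned into next-token prediction targets. It is not taken from the paper or its code release: the vocabulary sizes, the `BOI`/`EOI` delimiter ids, and the function names are all hypothetical, and the id-offset scheme is one common convention rather than ARM's confirmed design.

```python
from typing import List, Tuple

# All constants are hypothetical and chosen only for illustration.
TEXT_VOCAB_SIZE = 1000       # assumed size of the text vocabulary
IMAGE_CODEBOOK_SIZE = 512    # assumed size of the visual codebook
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE  # assumed begin-of-image id
EOI = BOI + 1                                # assumed end-of-image id


def image_to_unified(codes: List[int]) -> List[int]:
    """Shift codebook indices past the text vocabulary so that both
    modalities share a single id space."""
    for code in codes:
        if not 0 <= code < IMAGE_CODEBOOK_SIZE:
            raise ValueError(f"image code {code} is out of range")
    return [TEXT_VOCAB_SIZE + code for code in codes]


def build_sequence(text_ids: List[int], image_codes: List[int]) -> List[int]:
    """Concatenate text tokens and one delimited image into one sequence."""
    return text_ids + [BOI] + image_to_unified(image_codes) + [EOI]


def next_token_pairs(seq: List[int]) -> List[Tuple[List[int], int]]:
    """Return (context, target) pairs as used by next-token prediction."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]


seq = build_sequence([5, 17, 42], [0, 511, 3])
pairs = next_token_pairs(seq)
print(seq)         # one flat sequence covering both modalities
print(len(pairs))  # one prediction target per position after the first
```

Because every position, text or image, is predicted the same way, understanding (image in, text out), generation (text in, image out), and editing (image plus instruction in, image out) differ only in how the sequence is ordered.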
If this is right
- Metrics on text-to-image generation and instruction-guided editing improve after RL.
- Cross-task synergy emerges between generation and editing.
- The unified approach provides a scalable foundation for multimodal intelligence using next-token prediction.
- Separate models for understanding, generation, and editing become unnecessary.
Where Pith is reading between the lines
- If the shared space works, the same method could incorporate additional tasks like visual reasoning without new architectures.
- The observed synergy implies that preference optimization can uncover beneficial interactions between tasks in the representation space.
- Further scaling of the autoregressive model or data might amplify these unified capabilities.
Load-bearing premise
The multi-objective discrete semantic visual tokenizer succeeds in producing token sequences that support understanding, generation, and editing together without major task interference or information loss.
What would settle it
An experiment in which applying RL to optimize generation and editing causes a measurable decline in understanding performance or eliminates the reported synergy.
Original abstract
This paper introduces ARM, a discrete representation-based AutoRegressive Model that unifies image understanding, generation, and editing within a next-token prediction framework. ARM is built on three efforts: first, we train a discrete semantic visual tokenizer that maps images into compact token sequences. Our tokenizer is supervised with multiple objectives that jointly promote semantic discriminability, language alignment and faithful reconstruction, thereby supporting diverse tasks in a shared latent space. With this, we train a 7B autoregressive model over large-scale text and image token sequences, seamlessly developing vision-language perception and generation capabilities. Finally, to further improve preference-aligned behavior for text-to-image generation and instruction-guided editing, ARM applies reinforcement learning (RL) to optimize task-level objectives such as visual quality, instruction adherence, and edit consistency. Surprisingly, the results show that RL not only substantially improves performance on the target tasks (e.g., raising WISE overall from 0.50 to 0.56, GEdit-Bench-EN G_O from 5.75 to 6.68), but also induces cross-task synergy between text-to-image generation and editing. Collectively, these findings highlight autoregressive modeling, when paired with strong representations and preference optimization, as a scalable foundation for multimodal intelligence. Code: https://github.com/wdrink/ARM.
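The two headline numbers in the abstract are on visibly different numeric ranges, so it can help to look at them as relative gains as well. The short sketch below uses only the before/after values quoted in the abstract; the function name `gains` is introduced here for illustration.

```python
def gains(before: float, after: float):
    """Return (absolute gain, relative gain) for a before/after metric pair."""
    absolute = after - before
    relative = absolute / before
    return absolute, relative


# Before/after RL values as quoted in the abstract.
reported = {
    "WISE overall": (0.50, 0.56),
    "GEdit-Bench-EN G_O": (5.75, 6.68),
}

for name, (before, after) in reported.items():
    abs_gain, rel_gain = gains(before, after)
    print(f"{name}: +{abs_gain:.2f} absolute, +{rel_gain:.1%} relative")
# WISE overall: +0.06 absolute, +12.0% relative
# GEdit-Bench-EN G_O: +0.93 absolute, +16.2% relative
```

These figures describe the size of the improvement only. They say nothing about whether the gains come from each task's own reward or from interaction between tasks.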
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ARM, a 7B autoregressive multimodal model built on a multi-objective discrete semantic visual tokenizer that produces unified token sequences for image understanding, generation, and editing. After pretraining the tokenizer and the AR model on large-scale text-image sequences, the authors apply reinforcement learning to optimize task-level objectives for text-to-image generation and instruction-guided editing; they report metric gains (WISE overall 0.50→0.56; GEdit-Bench-EN G_O 5.75→6.68) and claim that the RL stage induces cross-task synergy between the two generation/editing tasks.
Significance. If the synergy result is shown to arise from joint rather than independent optimization, the work would strengthen the case for unified discrete token spaces as a scalable substrate for multimodal AR models and would illustrate positive transfer from preference optimization across related vision-language tasks. The public code release is a concrete strength that supports reproducibility.
Major comments (1)
- [Abstract] The central claim that RL 'induces cross-task synergy' between text-to-image generation and instruction-guided editing rests on an untested assumption. The abstract reports only the joint-RL outcome and attributes mutual gains to synergy, without any comparison to separate per-task RL runs (generation-only or editing-only). This control is load-bearing for the synergy interpretation; its absence leaves open the possibility that the observed improvements are additive rather than interactive.
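The control the referee asks for amounts to a small factorial comparison. The sketch below shows one way the four runs could be compared on a single benchmark once the per-task ablations exist. The class, field names, and the interaction formula are assumptions made for illustration, and the numbers in the usage example are placeholders, not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass
class TaskScores:
    """Scores on ONE benchmark under four training conditions."""
    baseline: float       # before any RL
    own_task_rl: float    # RL on this benchmark's task only
    other_task_rl: float  # RL on the other task only
    joint_rl: float       # RL on both tasks together


def decompose(s: TaskScores) -> dict:
    """Split the joint-RL gain into own-task, transfer, and interaction parts."""
    own_gain = s.own_task_rl - s.baseline
    transfer = s.other_task_rl - s.baseline   # cross-task transfer alone
    joint_gain = s.joint_rl - s.baseline
    # Positive interaction: joint RL gains more than the two
    # single-task effects added together.
    interaction = joint_gain - (own_gain + transfer)
    return {
        "own_gain": own_gain,
        "transfer": transfer,
        "joint_gain": joint_gain,
        "interaction": interaction,
    }


# Placeholder values only; NOT results from the paper.
example = TaskScores(baseline=1.0, own_task_rl=1.5, other_task_rl=1.25, joint_rl=2.0)
report = decompose(example)
print(report)
```

Under this reading, a nonzero `transfer` or a positive `interaction` would support the synergy claim, while a `joint_gain` equal to `own_gain` with zero `transfer` would mean each task improved independently of the other. A real analysis would also need multiple seeds to separate such differences from run-to-run noise.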
Simulated Author's Rebuttal
We thank the referee for the detailed review and for highlighting the evidentiary gap in our synergy claim. We agree that the abstract's attribution of cross-task improvements to synergy requires qualification in the absence of per-task RL controls, and we will revise the manuscript to address this directly.
Point-by-point responses
- Referee: [Abstract] The central claim that RL 'induces cross-task synergy' between text-to-image generation and instruction-guided editing rests on an untested assumption. The abstract reports only the joint-RL outcome and attributes mutual gains to synergy, without any comparison to separate per-task RL runs (generation-only or editing-only). This control is load-bearing for the synergy interpretation; its absence leaves open the possibility that the observed improvements are additive rather than interactive.
  Authors: We acknowledge the validity of this observation. The manuscript reports performance gains on both tasks after joint RL but does not include separate generation-only or editing-only RL baselines that would isolate interactive effects from additive ones. The synergy interpretation is therefore inferential rather than directly tested. We will revise the abstract (and corresponding discussion) to describe the observed joint improvements without using the term 'synergy' or implying interactive transfer, and we will add an explicit limitations statement noting the lack of these controls. If compute resources allow during revision, we will attempt to run the per-task ablations and report them; otherwise the claim will be removed.
  Revision: yes
Circularity Check
No circularity: empirical pipeline with no derivation reductions
The paper presents an empirical training sequence—multi-objective discrete tokenizer, autoregressive next-token modeling on text/image tokens, then RL for preference objectives—without any claimed mathematical derivations, uniqueness theorems, or predictions that reduce by construction to fitted inputs or self-citations. Reported metric gains (e.g., WISE 0.50→0.56) are framed as training outcomes, not tautological restatements. The work is self-contained against external benchmarks with no load-bearing self-citation chains or ansatz smuggling visible in the provided text.