STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation
Pith reviewed 2026-05-11 02:40 UTC · model grok-4.3
The pith
Autoregressive normalizing flows share the causal structure of language models, enabling unified multimodal generation of interleaved text and images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Autoregressive normalizing flows are themselves autoregressive Transformers, sharing the same causal mask, KV-cache mechanism, and left-to-right structure as LLMs, which makes them the most natural paradigm for true unified multimodal generation. STARFlow2 uses the Pretzel architecture to vertically interleave a pretrained VLM stream with a TarFlow stream via residual skip connections, both under the same causal mask; combined with a deep-shallow flow design and a unified FAE latent space, this enables cache-friendly interleaved generation in which text and visual outputs enter the KV-cache directly.
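The structural claim can be illustrated with a toy autoregressive affine flow: the transform at position t depends only on the prefix, so the density (forward) pass parallelizes while sampling (inverse) is necessarily left-to-right — the same interface a KV-cache serves in an LLM. This is a minimal sketch under assumed toy parameters; `causal_params` is a stand-in for a causal Transformer block, not the paper's actual TarFlow.

```python
import numpy as np

# Toy autoregressive affine flow over a 1-D sequence. The scale and
# shift at position t depend only on x[:t] (causal), mirroring an
# LLM's left-to-right structure. Illustrative only.

def causal_params(prefix):
    # Stand-in for a causal Transformer block: any function of the
    # prefix works, as long as it never sees position t itself.
    s = 0.5 * np.tanh(prefix.sum())   # log-scale
    b = 0.1 * prefix.sum()            # shift
    return s, b

def forward(x):
    """x -> z: computable in parallel given the whole sequence."""
    z = np.empty_like(x)
    for t in range(len(x)):
        s, b = causal_params(x[:t])
        z[t] = (x[t] - b) * np.exp(-s)
    return z

def inverse(z):
    """z -> x: necessarily sequential, since position t needs x[:t] --
    exactly autoregressive decoding with a KV-cache."""
    x = np.empty_like(z)
    for t in range(len(z)):
        s, b = causal_params(x[:t])
        x[t] = z[t] * np.exp(s) + b
    return x

x = np.array([0.3, -1.2, 0.7])
assert np.allclose(inverse(forward(x)), x)  # bijective round trip
```

The round trip checks bijectivity; in a real TarFlow-style model, `causal_params` would be the cached attention stack shared with the language stream.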
What carries the argument
The Pretzel architecture, which vertically interleaves a pretrained VLM with a TarFlow stream using residual skip connections under a shared causal mask, supported by a deep-shallow flow design and unified FAE latent space.
If this is right
- Text and visual outputs enter the KV-cache directly without re-encoding, supporting efficient interleaved sequence generation.
- Unified FAE latent space enables consistent token handling across modalities in a single generative process.
- Performance on image generation and multimodal understanding benchmarks shows flows can serve as a foundation for unified modeling.
- Cache-friendly design allows longer mixed outputs compared to approaches with structural mismatches between modalities.
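The cache-friendliness point above can be sketched as a single sequence into which both text embeddings and flow-sampled visual latents are appended. All names here (`KVCache`, `embed_text`, `flow_sample`) are hypothetical illustrations, not the paper's API; the contrast is with LLM-plus-diffusion hybrids, which must re-encode a generated image before text can continue.

```python
# Toy unified KV-cache shared by both modalities.

class KVCache:
    def __init__(self):
        self.entries = []          # (modality, vector) pairs
        self.reencode_calls = 0    # would be > 0 in LLM+diffusion hybrids

    def append(self, entry):
        self.entries.append(entry)

def embed_text(token):
    return ("text", hash(token) % 97)

def flow_sample(cache):
    # An autoregressive flow conditions on the cached prefix and emits
    # a latent already in the model's representation space, so it can
    # be appended without a vision-encoder pass.
    return ("image", len(cache.entries))

cache = KVCache()
for tok in ["a", "red", "cube"]:
    cache.append(embed_text(tok))
cache.append(flow_sample(cache))      # image latent enters the cache directly
cache.append(embed_text("rendered"))  # text continues from the same cache

assert len(cache.entries) == 5
assert cache.reencode_calls == 0      # no re-encoding step was needed
```

The invariant `reencode_calls == 0` is the whole point: mixed outputs grow one cache monotonically, which is what permits longer interleaved sequences.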
Where Pith is reading between the lines
- The interleaving method could extend to additional modalities such as audio by adding parallel flow streams under the same mask.
- Residual connections between pretrained components might reduce training costs when adapting the approach to new domains.
- This structure suggests testing whether flow-based components improve long-range coherence in multimodal sequences relative to separate modality models.
Load-bearing premise
That the structural similarity between autoregressive flows and LLMs is sufficient to make flows the most natural foundation for unified multimodal generation, and that vertically interleaving a pretrained VLM with a TarFlow stream via residual connections will produce effective cache-friendly interleaved outputs without major compatibility issues.
What would settle it
A demonstration that STARFlow2 generates incoherent interleaved text-image sequences, or that visual outputs require re-encoding rather than entering the shared KV-cache directly, would refute the load-bearing premise.
Original abstract
Deep generative models have advanced rapidly across text and vision, motivating unified multimodal systems that can understand, reason over, and generate interleaved text-image sequences. Most existing approaches combine autoregressive language modeling with diffusion-based image generators, inheriting a structural mismatch between causal text generation and iterative visual denoising. We observe that autoregressive normalizing flows are autoregressive Transformers--sharing the same causal mask, KV-cache mechanism, and left-to-right structure as LLMs--making them the most natural paradigm for true unified multimodal generation. We present STARFlow2, built on the Pretzel architecture that vertically interleaves a pretrained VLM stream with a TarFlow stream via residual skip connections, both operating under the same causal mask. Combined with a deep-shallow flow design and a unified FAE latent space, STARFlow2 enables cache-friendly interleaved generation where both text and visual outputs directly enter the KV-cache without re-encoding. Experiments demonstrate strong performance across image generation and multimodal understanding benchmarks, validating autoregressive flows as a viable foundation for unified multimodal modeling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that autoregressive normalizing flows are structurally identical to autoregressive Transformers (sharing causal masks, KV-cache, and left-to-right generation), making them the most natural paradigm for unified multimodal generation of interleaved text-image sequences. It introduces STARFlow2 via the Pretzel architecture, which vertically interleaves a pretrained VLM stream with a TarFlow stream using residual skip connections under a shared causal mask, augmented by a deep-shallow flow design and unified FAE latent space. This enables cache-friendly interleaved generation where outputs directly populate the KV-cache. The paper asserts strong performance on image generation and multimodal understanding benchmarks, validating flows as a foundation for unified multimodal modeling.
Significance. If the architectural claims hold, the work could be significant by offering a structurally coherent alternative to hybrid LLM-diffusion multimodal systems, preserving exact-likelihood training and KV-cache efficiency across modalities. The core observation equating autoregressive flows with Transformers is a potentially useful insight that could guide future unified models. However, the absence of any quantitative results or implementation details in the abstract substantially weakens the ability to gauge its practical impact or novelty relative to existing flow-based or autoregressive multimodal efforts.
major comments (2)
- [Pretzel architecture] The Pretzel architecture description (abstract and architecture section): residual skip connections from the non-invertible pretrained VLM stream into the TarFlow stream risk violating bijectivity and tractable Jacobian computation required for a valid normalizing flow. The manuscript must explicitly show how invertibility is maintained (e.g., via invertible residual blocks or Jacobian adjustments) because this is load-bearing for the exact-likelihood objective and the claim that the model qualifies as a flow-based unified generator.
- [Abstract] Abstract (Experiments paragraph): the claim of 'strong performance across image generation and multimodal understanding benchmarks' is unsupported by any quantitative results, error bars, ablation studies, or experimental details. This prevents assessment of whether the implementation actually validates the central structural analogy and architectural choices.
minor comments (1)
- [Abstract] The abstract introduces multiple new terms (Pretzel architecture, TarFlow stream, FAE latent space) without immediate definitions or references to prior work; these should be clarified on first use for readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the Pretzel architecture's invertibility and the need for quantitative support in the abstract. We address each major comment below and indicate planned revisions to the manuscript.
Point-by-point responses
- Referee: [Pretzel architecture] The Pretzel architecture description (abstract and architecture section): residual skip connections from the non-invertible pretrained VLM stream into the TarFlow stream risk violating bijectivity and tractable Jacobian computation required for a valid normalizing flow. The manuscript must explicitly show how invertibility is maintained (e.g., via invertible residual blocks or Jacobian adjustments) because this is load-bearing for the exact-likelihood objective and the claim that the model qualifies as a flow-based unified generator.
  Authors: We agree this is a critical point for validating the flow-based claims. In the Pretzel design, the VLM stream functions purely as a fixed conditioner whose outputs are projected and incorporated via invertible residual blocks inside the TarFlow layers; the overall transformation remains bijective because the Jacobian is computed solely over the flow parameters (the VLM contributes no additional determinant term). However, the current manuscript does not provide an explicit derivation or diagram of this Jacobian handling. We will add a formal subsection with the invertibility proof and Jacobian formula in the architecture section. revision: yes
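The rebuttal's argument — that a non-invertible conditioner does not break bijectivity so long as it enters only through the flow's parameters — can be checked in a few lines. This is a hedged sketch of that general principle, not the paper's actual layers; `vlm_features` stands in for the frozen VLM stream.

```python
import numpy as np

# Conditioning from a frozen, non-invertible "VLM" enters only through
# the coupling's scale and shift. The map x -> z stays bijective in x,
# and the log-determinant depends only on the scale, not on how the
# conditioning was produced. Illustrative sketch only.

def vlm_features(context):
    # Arbitrary non-invertible function of the text context.
    return float(np.tanh(np.asarray(context)).mean())

def coupling_forward(x, c):
    s = 0.3 * c                  # log-scale from the conditioner
    b = 0.5 * c                  # shift from the conditioner
    z = x * np.exp(s) + b
    logdet = s * x.size          # exact, tractable; no VLM determinant term
    return z, logdet

def coupling_inverse(z, c):
    s = 0.3 * c
    b = 0.5 * c
    return (z - b) * np.exp(-s)

c = vlm_features([0.2, -1.0, 0.5])  # frozen conditioner output
x = np.array([1.0, -0.4])
z, logdet = coupling_forward(x, c)
assert np.allclose(coupling_inverse(z, c), x)  # bijective given c
```

The inverse only needs the same conditioning value `c`, which is replayable from the KV-cache; this is the sense in which the exact-likelihood objective survives the residual skip connections.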
- Referee: [Abstract] Abstract (Experiments paragraph): the claim of 'strong performance across image generation and multimodal understanding benchmarks' is unsupported by any quantitative results, error bars, ablation studies, or experimental details. This prevents assessment of whether the implementation actually validates the central structural analogy and architectural choices.
  Authors: We acknowledge that the abstract's performance statement would be stronger with concrete numbers. The full Experiments section already contains quantitative results (FID, IS, and multimodal accuracy metrics with baselines and ablations). To address the concern directly, we will revise the abstract to include two or three key quantitative highlights and a brief note on the evaluation protocol while respecting length constraints. revision: yes
Circularity Check
No significant circularity; structural analogy is observational motivation
full rationale
The paper motivates STARFlow2 by observing that autoregressive normalizing flows share the causal mask, KV-cache, and left-to-right structure of LLMs, then proposes the Pretzel architecture (vertical interleaving of VLM and TarFlow streams via residuals under one causal mask, plus a deep-shallow flow and a unified FAE latent space). This is presented as an empirical design choice validated by experiments, not a derivation that reduces by construction to fitted inputs, self-definitions, or self-citation chains. No equations or load-bearing premises collapse to tautology; the analogy is external to the model equations and does not import uniqueness theorems or ansatzes from the authors' prior work. The derivation chain remains self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Autoregressive normalizing flows share the same causal mask, KV-cache mechanism, and left-to-right structure as LLMs
invented entities (3)
- Pretzel architecture: no independent evidence
- TarFlow stream: no independent evidence
- FAE latent space: no independent evidence