pith. machine review for the scientific record.

arxiv: 2605.08029 · v1 · submitted 2026-05-08 · 💻 cs.CV · cs.LG

Recognition: no theorem link

STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:40 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords multimodal generation · normalizing flows · autoregressive models · vision-language models · unified generation · interleaved sequences · text-image generation

The pith

Autoregressive normalizing flows share the causal structure of language models, enabling unified multimodal generation of interleaved text and images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that autoregressive normalizing flows are structurally identical to autoregressive transformers used in large language models, sharing causal masking, key-value caching, and sequential generation order. This similarity makes flows a better match than diffusion models for building systems that generate mixed text and image sequences in one pass. The authors introduce STARFlow2, which combines a pretrained vision-language model with a flow-based stream through residual connections under a shared causal mask. This setup allows both text and visual tokens to be processed and cached together without needing to re-encode outputs. If successful, it provides a path to more coherent and efficient multimodal models that handle understanding and generation in the same framework.

Core claim

Autoregressive normalizing flows are autoregressive Transformers sharing the same causal mask, KV-cache mechanism, and left-to-right structure as LLMs, making them the most natural paradigm for true unified multimodal generation. STARFlow2 uses the Pretzel architecture to vertically interleave a pretrained VLM stream with a TarFlow stream via residual skip connections, both under the same causal mask, along with a deep-shallow flow design and unified FAE latent space, enabling cache-friendly interleaved generation where text and visual outputs directly enter the KV-cache.
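
To make the structural claim concrete, here is a minimal sketch, not the authors' code, of how an autoregressive flow block can literally be a causal Transformer: a stack with the usual causal mask reads the previously generated latent tokens and emits per-token affine parameters, so sampling runs left to right with a KV-cache exactly as in LLM decoding, and the exact log-likelihood needs only the sum of the predicted log-scales.

```python
# Minimal sketch (assumed, not the authors' implementation): an autoregressive
# flow block realized as a causal Transformer that predicts per-token affine
# parameters from strictly earlier tokens, the way an LLM decoder would.
import torch
import torch.nn as nn

class CausalFlowBlock(nn.Module):
    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.to_affine = nn.Linear(dim, 2 * dim)  # per-token (log_scale, shift)

    def forward(self, z: torch.Tensor):
        # z: (batch, seq, dim) latent tokens. Shift right so the parameters for
        # token i depend only on tokens < i -- the same left-to-right dependency,
        # causal mask, and KV-cache pattern as an LLM decoder.
        ctx = torch.cat([torch.zeros_like(z[:, :1]), z[:, :-1]], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(z.size(1)).to(z.device)
        h = self.encoder(ctx, mask=mask)
        log_scale, shift = self.to_affine(h).chunk(2, dim=-1)
        z_out = z * log_scale.exp() + shift   # elementwise, hence invertible
        log_det = log_scale.sum(dim=(1, 2))   # triangular Jacobian: exact likelihood is cheap
        return z_out, log_det
```

Inversion runs token by token: the parameters for token i are recomputed from the already inverted prefix, which is exactly the sequential, cache-friendly pattern of LLM decoding.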

What carries the argument

The Pretzel architecture, which vertically interleaves a pretrained VLM with a TarFlow stream using residual skip connections under a shared causal mask, supported by a deep-shallow flow design and unified FAE latent space.
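
The abstract does not spell out the Pretzel wiring, so the sketch below is only one plausible reading: a frozen pretrained VLM block and a trainable flow block sit at the same depth, the VLM hidden state is projected and added residually into the flow stream, and both streams advance under one shared causal mask. Every name here is illustrative rather than taken from the paper.

```python
# Hypothetical sketch of one Pretzel-style layer, assuming the pretrained VLM
# stream acts as a frozen conditioner feeding the flow stream through a
# residual projection, with both streams under the same causal mask.
import torch.nn as nn

class PretzelLayer(nn.Module):
    def __init__(self, dim: int, vlm_block: nn.Module, flow_block: nn.Module):
        super().__init__()
        self.vlm_block = vlm_block        # pretrained, kept frozen
        self.flow_block = flow_block      # trainable TarFlow-style block
        self.skip = nn.Linear(dim, dim)   # residual skip: VLM stream -> flow stream
        for p in self.vlm_block.parameters():
            p.requires_grad_(False)

    def forward(self, h_vlm, h_flow, causal_mask):
        # Both streams advance under the *same* causal mask, so text and visual
        # positions share one autoregressive ordering and one KV-cache layout.
        h_vlm = self.vlm_block(h_vlm, mask=causal_mask)
        h_flow = self.flow_block(h_flow + self.skip(h_vlm), mask=causal_mask)
        return h_vlm, h_flow
```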

If this is right

  • Text and visual outputs enter the KV-cache directly without re-encoding, supporting efficient interleaved sequence generation (see the decoding sketch after this list).
  • Unified FAE latent space enables consistent token handling across modalities in a single generative process.
  • Performance on image generation and multimodal understanding benchmarks shows flows can serve as a foundation for unified modeling.
  • Cache-friendly design allows longer mixed outputs compared to approaches with structural mismatches between modalities.
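
For the first point above, a hypothetical decoding loop makes the cache claim concrete: generated text tokens and sampled image latents are appended to one shared KV-cache and are never pushed back through an encoder. Every attribute and method name below is a placeholder, not an API from the paper.

```python
# Illustrative only: interleaved generation against a single shared KV-cache.
# All method names are placeholders standing in for whatever the model exposes.
def generate_interleaved(model, prompt_ids, max_steps=1024):
    cache = model.init_kv_cache()
    for tok in prompt_ids:                             # prefill: the prompt enters the cache
        model.step_text(tok, cache)
    outputs = []
    for _ in range(max_steps):
        if model.next_is_image(cache):
            latent = model.sample_image_latent(cache)  # one causal flow pass
            model.append_latent(latent, cache)         # latent enters the cache directly
            outputs.append(("image", latent))
        else:
            tok = model.sample_text_token(cache)
            model.step_text(tok, cache)                # text token enters the cache
            outputs.append(("text", tok))
            if tok == model.eos_id:
                break
    return outputs
```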

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The interleaving method could extend to additional modalities such as audio by adding parallel flow streams under the same mask.
  • Residual connections between pretrained components might reduce training costs when adapting the approach to new domains.
  • This structure suggests testing whether flow-based components improve long-range coherence in multimodal sequences relative to separate modality models.

Load-bearing premise

That the structural similarity between autoregressive flows and LLMs is sufficient to make flows the most natural foundation for unified multimodal generation, and that vertically interleaving a pretrained VLM with a TarFlow stream via residual connections will produce effective cache-friendly interleaved outputs without major compatibility issues.

What would settle it

An experiment where STARFlow2 generates incoherent interleaved text-image sequences or where visual outputs require re-encoding instead of directly entering the shared KV-cache.

read the original abstract

Deep generative models have advanced rapidly across text and vision, motivating unified multimodal systems that can understand, reason over, and generate interleaved text-image sequences. Most existing approaches combine autoregressive language modeling with diffusion-based image generators, inheriting a structural mismatch between causal text generation and iterative visual denoising. We observe that autoregressive normalizing flows are autoregressive Transformers--sharing the same causal mask, KV-cache mechanism, and left-to-right structure as LLMs--making them the most natural paradigm for true unified multimodal generation. We present STARFlow2, built on the Pretzel architecture that vertically interleaves a pretrained VLM stream with a TarFlow stream via residual skip connections, both operating under the same causal mask. Combined with a deep-shallow flow design and a unified FAE latent space, STARFlow2 enables cache-friendly interleaved generation where both text and visual outputs directly enter the KV-cache without re-encoding. Experiments demonstrate strong performance across image generation and multimodal understanding benchmarks, validating autoregressive flows as a viable foundation for unified multimodal modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that autoregressive normalizing flows are structurally identical to autoregressive Transformers (sharing causal masks, KV-cache, and left-to-right generation), making them the most natural paradigm for unified multimodal generation of interleaved text-image sequences. It introduces STARFlow2 via the Pretzel architecture, which vertically interleaves a pretrained VLM stream with a TarFlow stream using residual skip connections under a shared causal mask, augmented by a deep-shallow flow design and unified FAE latent space. This enables cache-friendly interleaved generation where outputs directly populate the KV-cache. The paper asserts strong performance on image generation and multimodal understanding benchmarks, validating flows as a foundation for unified multimodal modeling.

Significance. If the architectural claims hold, the work could be significant by offering a structurally coherent alternative to hybrid LLM-diffusion multimodal systems, preserving exact-likelihood training and KV-cache efficiency across modalities. The core observation equating autoregressive flows with Transformers is a potentially useful insight that could guide future unified models. However, the absence of any quantitative results or implementation details in the abstract substantially weakens the ability to gauge its practical impact or novelty relative to existing flow-based or autoregressive multimodal efforts.

major comments (2)
  1. [Pretzel architecture] The Pretzel architecture description (abstract and architecture section): residual skip connections from the non-invertible pretrained VLM stream into the TarFlow stream risk violating bijectivity and tractable Jacobian computation required for a valid normalizing flow. The manuscript must explicitly show how invertibility is maintained (e.g., via invertible residual blocks or Jacobian adjustments) because this is load-bearing for the exact-likelihood objective and the claim that the model qualifies as a flow-based unified generator.
  2. [Abstract] Abstract (Experiments paragraph): the claim of 'strong performance across image generation and multimodal understanding benchmarks' is unsupported by any quantitative results, error bars, ablation studies, or experimental details. This prevents assessment of whether the implementation actually validates the central structural analogy and architectural choices.
minor comments (1)
  1. [Abstract] The abstract introduces multiple new terms (Pretzel architecture, TarFlow stream, FAE latent space) without immediate definitions or references to prior work; these should be clarified on first use for readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the Pretzel architecture's invertibility and the need for quantitative support in the abstract. We address each major comment below and indicate planned revisions to the manuscript.

read point-by-point responses
  1. Referee: [Pretzel architecture] The Pretzel architecture description (abstract and architecture section): residual skip connections from the non-invertible pretrained VLM stream into the TarFlow stream risk violating bijectivity and tractable Jacobian computation required for a valid normalizing flow. The manuscript must explicitly show how invertibility is maintained (e.g., via invertible residual blocks or Jacobian adjustments) because this is load-bearing for the exact-likelihood objective and the claim that the model qualifies as a flow-based unified generator.

    Authors: We agree this is a critical point for validating the flow-based claims. In the Pretzel design, the VLM stream functions purely as a fixed conditioner whose outputs are projected and incorporated via invertible residual blocks inside the TarFlow layers; the overall transformation remains bijective because the Jacobian is computed solely over the flow parameters (the VLM contributes no additional determinant term). However, the current manuscript does not provide an explicit derivation or diagram of this Jacobian handling. We will add a formal subsection with the invertibility proof and Jacobian formula in the architecture section (a toy numerical sketch of this argument appears after these responses). revision: yes

  2. Referee: [Abstract] Abstract (Experiments paragraph): the claim of 'strong performance across image generation and multimodal understanding benchmarks' is unsupported by any quantitative results, error bars, ablation studies, or experimental details. This prevents assessment of whether the implementation actually validates the central structural analogy and architectural choices.

    Authors: We acknowledge that the abstract's performance statement would be stronger with concrete numbers. The full Experiments section already contains quantitative results (FID, IS, and multimodal accuracy metrics with baselines and ablations). To address the concern directly, we will revise the abstract to include two or three key quantitative highlights and a brief note on the evaluation protocol while respecting length constraints. revision: yes
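
The invertibility argument in the first response can be checked on a toy case. Assuming the VLM output enters only as a conditioner c of an elementwise affine transform of the latent z, the Jacobian with respect to z is diagonal, so the log-determinant is the sum of the log-scales and carries no term from however c was produced.

```python
# Toy numerical check (an illustrative assumption, not the paper's exact layer):
# if z' = z * exp(s(c)) + b(c) with c a frozen conditioner, then dz'/dz = diag(exp(s(c)))
# and log|det J| = sum(s(c)), independent of how c was computed.
import torch

torch.manual_seed(0)
dim = 5
z = torch.randn(dim)
c = torch.randn(dim)            # stands in for a frozen VLM feature

def flow(z):
    log_scale = torch.tanh(c)   # parameters depend on the conditioner, not on z
    shift = 0.5 * c
    return z * log_scale.exp() + shift

J = torch.autograd.functional.jacobian(flow, z)
print(torch.logdet(J).item())        # matches ...
print(torch.tanh(c).sum().item())    # ... the analytic sum of log-scales
```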

Circularity Check

0 steps flagged

No significant circularity; structural analogy is observational motivation

full rationale

The paper motivates STARFlow2 by observing that autoregressive normalizing flows share causal mask, KV-cache, and left-to-right structure with LLMs, then proposes the Pretzel architecture (vertical interleaving of VLM and TarFlow streams via residuals under one causal mask, plus deep-shallow flow and unified FAE latent space). This is presented as an empirical design choice validated by experiments, not a derivation that reduces by construction to fitted inputs, self-definitions, or self-citation chains. No equations or load-bearing premises collapse to tautology; the analogy is external to the model equations and does not import uniqueness theorems or ansatzes from prior self-work. The derivation chain remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

Based solely on the abstract, the central claim rests on the domain assumption that structural equivalence in causal masking and KV caching makes autoregressive flows the natural choice for multimodal unification. No explicit free parameters are detailed, and none of the invented entities carry independent evidence.

axioms (1)
  • domain assumption: Autoregressive normalizing flows share the same causal mask, KV-cache mechanism, and left-to-right structure as LLMs
    Directly stated in the abstract as the key observation motivating the work.
invented entities (3)
  • Pretzel architecture (no independent evidence)
    purpose: Vertically interleaves pretrained VLM stream with TarFlow stream via residual skip connections
    Introduced as the backbone of STARFlow2
  • TarFlow stream (no independent evidence)
    purpose: Autoregressive flow component operating under the shared causal mask
    Core generative stream in the proposed model
  • FAE latent space (no independent evidence)
    purpose: Unified latent space allowing text and visual outputs to enter KV-cache directly
    Enables cache-friendly interleaved generation

pith-pipeline@v0.9.0 · 5511 in / 1514 out tokens · 59538 ms · 2026-05-11T02:40:33.244855+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 17 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774,

  2. [2]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. CoRR, abs/2308.12966,

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report.ar...

  4. [4]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025a. Tianrong Chen, Jiatao Gu, David Berthelot, Joshua Susskind, and Shuangfei Zhai. Normalizin...

  5. [5]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Wei Chen, Lin Li, Yongqi Yang, Bin Wen, Fan Yang, Tingting Gao, Yu Wu, and Long Chen. Comm: A coherent interleaved image-text dataset for multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8073–8082, 2025b. Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and...

  6. [6]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683,

  7. [7]

    NICE: Non-linear Independent Components Estimation

    Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516,

  8. [8]

    Density estimation using Real NVP

    Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803,

  9. [9]

    One layer is enough: Adapting pretrained visual encoders for image generation. arXiv preprint arXiv:2512.07829,

    Yuan Gao, Chen Chen, Tianrong Chen, and Jiatao Gu. One layer is enough: Adapting pretrained visual encoders for image generation. arXiv preprint arXiv:2512.07829,

  10. [10]

    Seed-data-edit technical report: A hybrid dataset for instructional image editing

    Yuying Ge, Sijie Zhao, Chen Li, Yixiao Ge, and Ying Shan. Seed-data-edit technical report: A hybrid dataset for instructional image editing. arXiv preprint arXiv:2405.04007,

  11. [11]

    STARFlow: Scaling Latent Normalizing Flows for High-Resolution Image Synthesis

    Jiatao Gu, Tianrong Chen, David Berthelot, Huangjie Zheng, Yuyang Wang, Ruixiang Zhang, Laurent Dinh, Miguel Angel Bautista, Josh Susskind, and Shuangfei Zhai. Starflow: Scaling latent normalizing flows for high-resolution image synthesis. arXiv preprint arXiv:2506.06276, 2025a. Jiatao Gu, Ying Shen, Tianrong Chen, Laurent Dinh, Yuyang Wang, Miguel Ang...

  12. [12]

    ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024a. Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal...

  13. [13]

    Zebra-cot: A dataset for interleaved vision language reasoning

    Ang Li, Charles Wang, Deqing Fu, Kaiyu Yue, Zikui Cai, Wang Bill Zhu, Ollie Liu, Peng Guo, Willie Neiswanger, Furong Huang, et al. Zebra-cot: A dataset for interleaved vision language reasoning. arXiv preprint arXiv:2507.16746, 2025a. Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with gener...

  14. [14]

    Onecat: Decoder-only auto-regressive model for unified understanding and generation. arXiv preprint arXiv:2509.03498, 2025

    Han Li, Xinyu Peng, Yaoming Wang, Zelin Peng, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Wenrui Dai, and Hongkai Xiong. Onecat: Decoder-only auto-regressive model for unified understanding and generation. arXiv preprint arXiv:2509.03498, 2025b. Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis,...

  15. [15]

    Mogao: An omni foundation model for interleaved multi-modal generation. arXiv preprint arXiv:2505.05472, 2025

    Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, and Weilin Huang. Mogao: An omni foundation model for interleaved multi-modal generation. arXiv preprint arXiv:2505.05472,

  16. [16]

    UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld-v1: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147,

  17. [17]

    Tuna: Taming unified visual representations for native unified multimodal models. arXiv preprint arXiv:2512.02014, 2025

    URL https://openreview.net/forum?id=PqvMRDCJT9t. Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024a. Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaq...

  18. [18]

    Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

    Zhiheng Liu, Weiming Ren, Xiaoke Huang, Shoufa Chen, Tianhong Li, Mengzhao Chen, Yatai Ji, Sen He, Jonas Schult, Belinda Zeng, et al. Tuna-2: Pixel embeddings beat vision encoders for multimodal understanding and generation. arXiv preprint arXiv:2604.24763,

  19. [19]

    Open-magvit2: An open-source project toward democratizing auto-regressive visual generation. arXiv preprint arXiv:2409.04410, 2024

    Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, and Ying Shan. Open-magvit2: An open-source project toward democratizing auto-regressive visual generation. arXiv preprint arXiv:2409.04410,

  20. [20]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193,

  21. [21]

    Transfer between Modalities with MetaQueries

    Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Transfer between modalities with metaqueries. arXiv preprint arXiv:2504.06256,

  22. [22]

    Pico-banana-400k: A large-scale dataset for text-guided image editing. arXiv preprint arXiv:2510.19808, 2025

    Yusu Qian, Eli Bocek-Rivele, Liangchen Song, Jialing Tong, Yinfei Yang, Jiasen Lu, Wenze Hu, and Zhe Gan. Pico-banana-400k: A large-scale dataset for text-guided image editing. arXiv preprint arXiv:2510.19808,

  23. [23]

    Llamafusion: Adapting pretrained language models for multimodal generation. arXiv preprint arXiv:2412.15188, 2024

    Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, and Lili Yu. Lmfusion: Adapting pretrained language models for multimodal generation. arXiv preprint arXiv:2412.15188,

  24. [24]

    Metamorph: Multimodal understanding and generation via instruction tuning. arXiv preprint arXiv:2412.14164, 2024

    Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri IYER, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. Advances in Neural Information Processing Systems, 37:87310–87356, 2024a. Shengbang Tong, David Fan, Jiachen Zhu, Yunyang Xi...

  25. [25]

    Emu3: Next-Token Prediction is All You Need

    Chunwei Wang, Guansong Lu, Junwei Yang, Runhui Huang, Jianhua Han, Lu Hou, Wei Zhang, and Hang Xu. Illume: Illuminating your llms to see, draw, and self-enhance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21612–21622, 2025a. Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, ...

  26. [26]

    Bridging continuous and discrete tokens for autoregressive visual generation. arXiv preprint arXiv:2503.16430,

    Yuqing Wang, Zhijie Lin, Yao Teng, Yuanzhi Zhu, Shuhuai Ren, Jiashi Feng, and Xihui Liu. Bridging continuous and discrete tokens for autoregressive visual generation. arXiv preprint arXiv:2503.16430, 2025b. Cong Wei, Zheyang Xiong, Weiming Ren, Xeron Du, Ge Zhang, and Wenhu Chen. Omniedit: Building image editing generalist models through specialist supe...

  27. [27]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324,

  28. [28]

    OmniGen2: Towards Instruction-Aligned Multimodal Generation

    URL https://arxiv.org/abs/2506.18871. Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528,

  29. [29]

    Show-o2: Improved Native Unified Multimodal Models

    Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564,

  30. [30]

    Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

    Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039,

  31. [31]

    Multi-turn Consistent Image Editing

    Zijun Zhou, Yingying Deng, Xiangyu He, Weiming Dong, and Fan Tang. Multi-turn consistent image editing. arXiv preprint arXiv:2505.04320,