
arxiv: 2605.17766 · v1 · pith:A7BR4SYOnew · submitted 2026-05-18 · 💻 cs.CV

LatentUMM: Dual Latent Alignment for Unified Multimodal Models

Pith reviewed 2026-05-20 12:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords unified multimodal models · latent space alignment · cross-modal consistency · semantic drift · multimodal generation · dual capacity alignment · latent dynamics stabilization

The pith

Unified multimodal models reduce functional inconsistency by explicitly aligning transformations into and out of their shared latent space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Unified multimodal models learn a shared latent space for both understanding and generation but still show inconsistency when moving between these capabilities. The authors trace this not to missing shared representations but to unaligned mappings that take data in and out of the latent space, which lets semantic drift occur during modality switches. They introduce LatentUMM to build a stronger shared latent space through two stages of dual latent alignment followed by dynamics stabilization. If the approach holds, models could maintain semantic consistency across generation and re-encoding without redesigning their core architectures.

Core claim

The central claim is that functional inconsistency between generation and re-encoding in unified multimodal models stems from the absence of explicit alignment between the transformations mapping into and out of the shared latent space. LatentUMM addresses this in two stages. First, dual latent alignment operates at the modality level, using cross-modal semantics from a stronger embedding model, and at the capacity level, enforcing bidirectional consistency under generation and re-encoding. Second, latent dynamics stabilization uses stochastic latent rollouts and preference optimization to favor consistent trajectories. Experiments indicate that this yields improved cross-modal consistency across diverse architectures.

What carries the argument

Dual latent alignment, which enforces consistency at the modality level using structured cross-modal semantics and at the capacity level for bidirectional generation and re-encoding.
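
The two levels of alignment can be read as two loss terms. The Python sketch below is one hypothetical way to combine them and is not taken from the authors' code. The cosine-distance choice, the weight `lambda_cons`, the assumption that teacher embeddings are already projected to the latent dimension, and every function and parameter name are illustrative assumptions.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]
# Hypothetical: maps a latent to the latent recovered after generating
# an output from it and re-encoding that output.
RoundTrip = Callable[[Vector], Vector]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity (0 for parallel vectors, 1 for orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + 1e-12)


def dual_alignment_loss(
    z_text: Vector,
    z_image: Vector,
    teacher_text: Vector,
    teacher_image: Vector,
    text_round_trip: RoundTrip,
    image_round_trip: RoundTrip,
    lambda_cons: float = 1.0,
) -> float:
    """Assumed form: modality-level term plus weighted capacity-level term."""
    # Modality level: pull each latent toward the stronger embedding
    # model's representation of the same input.
    modality = cosine_distance(z_text, teacher_text) + cosine_distance(
        z_image, teacher_image
    )
    # Capacity level: a latent should survive generation followed by
    # re-encoding, checked in both directions.
    capacity = cosine_distance(z_text, text_round_trip(z_text)) + cosine_distance(
        z_image, image_round_trip(z_image)
    )
    return modality + lambda_cons * capacity
```

With identity round trips and latents equal to the teacher embeddings the loss is close to zero; any drift introduced by a round trip raises the capacity term.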

If this is right

  • Generation and re-encoding follow more consistent trajectories that preserve semantic content across modality changes.
  • Cross-modal alignment imposes structured semantics that strengthen the shared latent space without new training data.
  • Bidirectional capacity alignment reduces drift in both directions between generation and understanding.
  • Stochastic rollouts combined with preference optimization favor trajectories that maintain consistency.
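
The last point, rollouts followed by preference optimization, can be sketched as follows. This is a minimal illustration under stated assumptions: Gaussian noise as the rollout model, a caller-supplied `consistency` score, and a DPO-style loss are all choices made here for concreteness. The summary above only says "preference optimization", so the specific loss and every name below are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def sample_rollouts(
    z: Vector, k: int, sigma: float, rng: random.Random
) -> List[List[float]]:
    """Draw k stochastic rollouts by adding Gaussian noise to a latent
    (the noise model is an assumption)."""
    return [[x + rng.gauss(0.0, sigma) for x in z] for _ in range(k)]


def pick_preference_pair(
    rollouts: Sequence[Vector], consistency: Callable[[Vector], float]
) -> Tuple[Vector, Vector]:
    """Return (chosen, rejected): the most and least consistent rollouts."""
    ranked = sorted(rollouts, key=consistency)
    return ranked[-1], ranked[0]


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO-style loss: -log(sigmoid(beta * (chosen margin - rejected margin)))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss falls below log 2 when the trained policy raises the chosen rollout's likelihood relative to the reference more than it raises the rejected one's, which is the sense in which consistent trajectories are favored.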

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same alignment principle could extend to models handling more than two modalities at once.
  • Similar stabilization steps might reduce drift in sequential tasks such as video captioning followed by re-generation.
  • Applying the method to larger-scale unified models would show whether the gains remain architecture-agnostic.

Load-bearing premise

That the inconsistency arises mainly from unaligned input and output transformations in the latent space rather than from other training factors or insufficient shared representations.

What would settle it

Running the dual alignment stage on a model, once with the stronger embedding model and once without it, and measuring whether consistency between generation and re-encoding improves only in the former case would test the claim directly.
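
One way to read off the outcome of such a test is to compare gains over an untouched baseline. The helper below is a hypothetical bookkeeping sketch: the three condition names and the "share" statistic are assumptions introduced here, and no scores from the paper are used.

```python
from typing import Dict


def consistency_gains(scores: Dict[str, float]) -> Dict[str, float]:
    """Compare consistency gains with and without the stronger teacher.

    Expected keys (hypothetical names):
      "baseline"             - model before any alignment
      "align_base_teacher"   - dual alignment without the stronger embedding model
      "align_strong_teacher" - dual alignment with the stronger embedding model
    """
    base = scores["baseline"]
    gain_base = scores["align_base_teacher"] - base
    gain_strong = scores["align_strong_teacher"] - base
    # Fraction of the strong-teacher gain that disappears when the
    # stronger embedding model is removed; undefined if there is no gain.
    if gain_strong != 0:
        share = (gain_strong - gain_base) / gain_strong
    else:
        share = float("nan")
    return {
        "gain_base_teacher": gain_base,
        "gain_strong_teacher": gain_strong,
        "share_lost_without_strong_teacher": share,
    }
```

A share near 1 would suggest the improvement depends on the stronger embedding model, while a share near 0 would suggest the alignment objectives carry the gain on their own.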

Figures

Figures reproduced from arXiv: 2605.17766 by Hayes Bai, Jindong Wang, Marios Savvides, Wenwen Wang, Yinyi Luo.

Figure 1: Overview of baseline UMMs (left) versus our proposed LatentUMM (right).
Figure 2: Overview of LatentUMM.
2 Related Work. Unified Multimodal Models. Recent advances in UMMs aim to integrate multimodal understanding and generation within a single architecture [1, 6, 10, 39, 53, 58]. A dominant approach adopts decoder-only autoregressive transformers trained on interleaved multimodal tokens [5, 40]. Another line of work explores hybrid generative frameworks that combine autoregressive model…
Figure 3: Latent space analysis.
… that the model not only generates high-quality outputs but also maintains stronger semantic alignment when re-interpreting them. In contrast, SFT achieves competitive results on individual tasks but underperforms in the unified setting, highlighting the limitation of optimizing generation and understanding independently.
A scene where a silver spoon is inside a transparent glass, a ye…
Figure 4: Baseline UMMs (left) vs. LatentUMM (right) on a sequential interaction task.
Analysis of latent space alignment. To directly assess whether LatentUMM improves the internal alignment of multimodal representations, we evaluate the text and image embeddings in the shared latent space
Figure 5: Expected output
Figure 7: Expected output
Figure 9: Qualitative image generation.
D.2.2 (2) Effect of consistency weight. We analyze the impact of the consistency weight λ1 on the trade-off between reconstruction consistency and output diversity. We report relative changes with respect to the lowest setting (λ1 = 0.1).

  λ1     Consistency (% change)    Diversity (% change)
  0.1    0.0%                      0.0%
  0.5    +5.8%                     -6.9%
  1.0    +8.6%                     -7.7%
Original abstract

Unified multimodal models (UMMs) achieve strong performance in both understanding and generation by learning a shared latent space, yet they often exhibit functional inconsistency between these two capabilities. We observe that this issue does not stem from a lack of shared representations, but from the absence of explicit alignment between the transformations that map into and out of the latent space. As a result, generation and re-encoding can follow inconsistent trajectories, leading to semantic drift under modality transitions. In this work, we propose LatentUMM, a framework that constructs an enhanced shared latent space to explicitly align these transformations and improve cross-modal consistency. LatentUMM consists of two stages. First, dual latent alignment enforces consistency at both the modality and capacity levels: cross-modal alignment uses a stronger embedding model to impose structured cross-modal semantics, while dual capacity alignment enforces bidirectional consistency under generation and re-encoding. Second, latent dynamics stabilization improves robustness via stochastic latent rollouts and preference optimization, favoring trajectories that better preserve semantic consistency. Experiments show that LatentUMM consistently improves multimodal consistency across diverse architectures. Code is available at: https://github.com/AIFrontierLab/TorchUMM/tree/main/src/umm/post_training/LatentUMM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes LatentUMM, a two-stage post-training framework for unified multimodal models (UMMs) that addresses functional inconsistency between generation and understanding tasks. It claims this inconsistency arises specifically from unaligned in/out transformations of the shared latent space rather than insufficient shared representations. Stage 1 performs dual latent alignment (cross-modal alignment via a stronger embedding model plus dual-capacity bidirectional consistency enforcement); Stage 2 applies latent dynamics stabilization through stochastic rollouts and preference optimization. Experiments are reported to show consistent gains in cross-modal consistency across diverse UMM architectures, with code released.

Significance. If the central attribution holds and the gains are isolated from confounds, the work could offer a lightweight, architecture-agnostic way to reduce semantic drift in existing UMMs, which would be practically useful. The public code release is a positive factor for reproducibility. The significance is tempered by the need to confirm that observed improvements stem from the proposed explicit alignment rather than auxiliary factors such as embedding model capacity.

major comments (2)
  1. [Abstract / §1] Abstract and §1 (motivation): the claim that inconsistency 'does not stem from a lack of shared representations, but from the absence of explicit alignment between the transformations' is load-bearing for the entire framework. No ablation is described that holds representation capacity fixed while varying only the alignment losses, leaving the weakest assumption under-supported.
  2. [§3] §3 (Dual Latent Alignment, cross-modal component): the method explicitly uses a stronger embedding model to impose structured cross-modal semantics. This creates a direct confound with the central claim, as any consistency gains could be driven by the superior pretraining or capacity of the stronger model rather than the alignment or bidirectional consistency losses. An ablation that applies the same stronger model with and without the proposed alignment objectives is required to isolate the contribution.
minor comments (2)
  1. [Experiments] The description of 'multimodal consistency' metrics and the exact baselines used in the experiments section would benefit from additional quantitative detail and error bars to allow direct comparison.
  2. [§2 / §3] Notation for the latent transformations (e.g., the in/out maps) should be introduced with explicit equations early in §2 or §3 to improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments, which help strengthen the empirical support for our central claims. We address each major point below and have revised the manuscript to incorporate the requested ablations.

Point-by-point responses
  1. Referee: [Abstract / §1] Abstract and §1 (motivation): the claim that inconsistency 'does not stem from a lack of shared representations, but from the absence of explicit alignment between the transformations' is load-bearing for the entire framework. No ablation is described that holds representation capacity fixed while varying only the alignment losses, leaving the weakest assumption under-supported.

    Authors: We agree that the load-bearing claim requires an ablation that isolates alignment losses while holding representation capacity fixed. In the revised manuscript we have added this ablation: we train with the base embedding model (fixed capacity) and compare the full dual latent alignment objectives against a variant that removes the alignment losses but retains the same shared latent space and training budget. The results show consistent gains from the alignment terms alone, providing direct support for the claim that the inconsistency arises from unaligned in/out transformations rather than insufficient shared representations. revision: yes

  2. Referee: [§3] §3 (Dual Latent Alignment, cross-modal component): the method explicitly uses a stronger embedding model to impose structured cross-modal semantics. This creates a direct confound with the central claim, as any consistency gains could be driven by the superior pretraining or capacity of the stronger model rather than the alignment or bidirectional consistency losses. An ablation that applies the same stronger model with and without the proposed alignment objectives is required to isolate the contribution.

    Authors: The referee correctly notes the potential confound introduced by the stronger embedding model. We have added the requested ablation in the revised manuscript: we apply the identical stronger embedding model in two settings—one with our dual latent alignment objectives and one without—while keeping all other training details fixed. The comparison isolates the contribution of the alignment and bidirectional consistency losses, showing that these objectives account for the majority of the observed consistency improvements beyond the capacity of the stronger model alone. revision: yes

Circularity Check

0 steps flagged

No circularity: framework adds explicit alignment stages as independent construction

full rationale

The paper's derivation begins with an empirical observation that functional inconsistency arises specifically from absent explicit alignment between latent in/out transformations rather than insufficient shared representations. It then introduces LatentUMM as a two-stage construction: dual latent alignment (cross-modal via stronger embedding plus bidirectional capacity consistency) followed by stabilization via stochastic rollouts and preference optimization. No equations, fitted parameters, or self-citations are shown reducing the claimed consistency gains to the original inputs by construction. The central claim remains a novel architectural addition whose validity is tested externally via experiments on diverse architectures, keeping the chain self-contained without self-definitional loops or load-bearing prior-author uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on domain assumptions about the value of stronger embeddings for semantics and the enforceability of bidirectional consistency; no explicit free parameters or new invented entities are named in the abstract.

axioms (2)
  • Domain assumption: A stronger embedding model imposes structured cross-modal semantics that improve the shared latent space.
    Invoked to justify the cross-modal alignment stage.
  • Domain assumption: Bidirectional consistency between generation and re-encoding can be directly enforced by alignment at the capacity level.
    Central premise of the dual capacity alignment component.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem: unclear.

    Paper passage: Dual latent alignment enforces consistency at both the modality and capacity levels: cross-modal alignment uses a stronger embedding model to impose structured cross-modal semantics, while dual capacity alignment enforces bidirectional consistency under generation and re-encoding.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 21 internal anchors

  1. [1]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models- architecture, training and dataset.arXiv preprint arXiv:2505.09568, 2025

  2. [2]

    Improving faithfulness in abstractive summarization with contrast candidate generation and selection

    Sihao Chen, Fan Zhang, Kazoo Sone, and Dan Roth. Improving faithfulness in abstractive summarization with contrast candidate generation and selection. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5935–5941, 2021

  3. [3]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020

  4. [4]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

  5. [5]

    Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, et al. Emu3. 5: Native multimodal models are world learners. arXiv preprint arXiv:2510.26583, 2025

  6. [6]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

  7. [7]

    VSE++: Improving Visual-Semantic Embeddings with Hard Negatives

    Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. Vse++: Improving visual-semantic embeddings with hard negatives.arXiv preprint arXiv:1707.05612, 2017

  8. [8]

    Devise: A deep visual-semantic embedding model.Advances in neural information processing systems, 26, 2013

    Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc’Aurelio Ranzato, and Tomas Mikolov. Devise: A deep visual-semantic embedding model.Advances in neural information processing systems, 26, 2013

  9. [9]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2306.13394, 2023

  10. [10]

    TokenFlow: Consistent Diffusion Features for Consistent Video Editing

    Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing.arXiv preprint arXiv:2307.10373, 2023

  11. [11]

    Learning latent dynamics for planning from pixels

    Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. InInternational conference on machine learning, pages 2555–2565. PMLR, 2019

  12. [12]

    Turning internal gap into self-improvement: Promoting the generation-understanding unification in mllms.arXiv preprint arXiv:2507.16663, 2025

    Yujin Han, Hao Chen, Andi Han, Zhiheng Wang, Xinyu Liu, Yingya Zhang, Shiwei Zhang, and Difan Zou. Turning internal gap into self-improvement: Promoting the generation-understanding unification in mllms.arXiv preprint arXiv:2507.16663, 2025

  13. [13]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

  14. [14]

    ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024

  15. [15]

    Interleaving reasoning for better text-to-image generation

    Wenxuan Huang, Shuang Chen, Zheyong Xie, Shaosheng Cao, Shixiang Tang, Yufan Shen, Qingyu Yin, Wenbo Hu, Xiaoman Wang, Yuntian Tang, et al. Interleaving reasoning for better text-to-image generation.arXiv preprint arXiv:2509.06945, 2025

  16. [16]

    Can unified generation and understanding models maintain semantic equivalence across different output modalities?arXiv preprint arXiv:2602.23711, 2026

    Hongbo Jiang, Jie Li, Yunhang Shen, Pingyang Dai, Xing Sun, Haoyu Cao, and Liujuan Cao. Can unified generation and understanding models maintain semantic equivalence across different output modalities?arXiv preprint arXiv:2602.23711, 2026

  17. [17]

    Understanding and constructing latent modality structures in multi-modal representation learning

    Qian Jiang, Changyou Chen, Han Zhao, Liqun Chen, Qing Ping, Son Dinh Tran, Yi Xu, Belinda Zeng, and Trishul Chilimbi. Understanding and constructing latent modality structures in multi-modal representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7661–7671, 2023

  18. [18]

    Latentum: Unleashing the potential of interleaved cross-modal reasoning via a latent-space unified model.arXiv preprint arXiv:2604.02097, 2026

    Jiachun Jin, Zetong Zhou, Xiao Yang, Hao Zhang, Pengfei Liu, Jun Zhu, and Zhijie Deng. Latentum: Unleashing the potential of interleaved cross-modal reasoning via a latent-space unified model.arXiv preprint arXiv:2604.02097, 2026

  19. [19]

    Srum: Fine-grained self-rewarding for unified multimodal models.arXiv preprint arXiv:2510.12784, 2025

    Weiyang Jin, Yuwei Niu, Jiaqi Liao, Chengqi Duan, Aoxue Li, Shenghua Gao, and Xihui Liu. Srum: Fine-grained self-rewarding for unified multimodal models.arXiv preprint arXiv:2510.12784, 2025

  20. [20]

    Gemini Embedding: Generalizable Embeddings from Gemini

    Jinhyuk Lee, Feiyang Chen, Sahil Dua, Daniel Cer, Madhuri Shanbhogue, Iftekhar Naim, Gustavo Hernández Ábrego, Zhe Li, Kaifeng Chen, Henrique Schechter Vera, et al. Gemini embedding: Generalizable embeddings from gemini. arXiv preprint arXiv:2503.07891, 2025

  21. [21]

    Ueval: A benchmark for unified multimodal generation. arXiv preprint arXiv:2601.22155, 2026

    Bo Li, Yida Yin, Wenhao Chai, Xingyu Fu, and Zhuang Liu. Ueval: A benchmark for unified multimodal generation.arXiv preprint arXiv:2601.22155, 2026

  22. [22]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023

  23. [23]

    Representation interpretation with spatial encoding and multimodal analytics

    Ninghao Liu, Mengnan Du, and Xia Hu. Representation interpretation with spatial encoding and multimodal analytics. InProceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 60–68, 2019

  24. [24]

    Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233. Springer, 2024

  25. [25]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255, 2023

  26. [26]

    Self-corrected image generation with explainable latent rewards.arXiv preprint arXiv:2603.24965, 2026

    Yinyi Luo, Hrishikesh Gokhale, Marios Savvides, Jindong Wang, and Shengfeng He. Self-corrected image generation with explainable latent rewards.arXiv preprint arXiv:2603.24965, 2026

  27. [27]

    TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training

    Yinyi Luo, Wenwen Wang, Hayes Bai, Hongyu Zhu, Hao Chen, Pan He, Marios Savvides, Sharon Li, and Jindong Wang. Torchumm: A unified multimodal model codebase for evaluation, analysis, and post-training.arXiv preprint arXiv:2604.10784, 2026

  28. [28]

    Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation, 2024

    Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, Liang Zhao, Yisong Wang, Jiaying Liu, and Chong Ruan. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation, 2024

  29. [29]

    Unirl: Self-improving unified multimodal models via supervised and reinforcement learning.arXiv preprint arXiv:2505.23380, 2025

    Weijia Mao, Zhenheng Yang, and Mike Zheng Shou. Unirl: Self-improving unified multimodal models via supervised and reinforcement learning.arXiv preprint arXiv:2505.23380, 2025

  30. [30]

    WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

    Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Kunpeng Ning, Bin Zhu, et al. Wise: A world knowledge-informed semantic evaluation for text-to-image generation.arXiv preprint arXiv:2503.07265, 2025

  31. [31]

    Uni-cot: Towards unified chain-of-thought reasoning across text and vision. arXiv preprint arXiv:2508.05606, 2025

    Luozheng Qin, Jia Gong, Yuqing Sun, Tianjiao Li, Mengping Yang, Xiaomeng Yang, Chao Qu, Zhiyu Tan, and Hao Li. Uni-cot: Towards unified chain-of-thought reasoning across text and vision.arXiv preprint arXiv:2508.05606, 2025

  32. [32]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  33. [33]

    Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

  34. [34]

    Object hallucination in image captioning

    Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4035–4045, 2018

  35. [35]

    Realunify: Do unified models truly benefit from unification? a comprehensive benchmark.arXiv preprint arXiv:2509.24897, 2025

    Yang Shi, Yuhao Dong, Yue Ding, Yuran Wang, Xuanyu Zhu, Sheng Zhou, Wenting Liu, Haochen Tian, Rundong Wang, Huanqian Wang, et al. Realunify: Do unified models truly benefit from unification? a comprehensive benchmark.arXiv preprint arXiv:2509.24897, 2025

  36. [36]

    Unigame: Turning a unified multimodal model into its own adversary

    Zhaolong Su, Wang Lu, Hao Chen, Sharon Li, and Jindong Wang. Unigame: Turning a unified multimodal model into its own adversary. InCVPR, 2026

  37. [37]

    Generation enhances understanding in unified multimodal models via multi-representation generation

    Zihan Su, Hongyang Wei, Kangrui Cen, Yong Wang, Guanhua Chen, Chun Yuan, and Xiangxiang Chu. Generation enhances understanding in unified multimodal models via multi-representation generation. arXiv preprint arXiv:2601.21406, 2026

  38. [38]

    Quantifying the gap between understanding and generation within unified multimodal models. arXiv preprint arXiv:2602.02140, 2026

    Chenlong Wang, Yuhang Chen, Zhihan Hu, Dongping Chen, Wenhu Chen, Sarah Wiegreffe, and Tianyi Zhou. Quantifying the gap between understanding and generation within unified multimodal models. arXiv preprint arXiv:2602.02140, 2026

  39. [39]

    Deepgen 1.0: A lightweight unified multimodal model for advancing image generation and editing.arXiv preprint arXiv:2602.12205, 2026

    Dianyi Wang, Ruihang Li, Feng Han, Chaofan Ma, Wei Song, Siyuan Wang, Yibin Wang, Yi Xin, Hongjian Liu, Zhixiong Zhang, et al. Deepgen 1.0: A lightweight unified multimodal model for advancing image generation and editing.arXiv preprint arXiv:2602.12205, 2026

  40. [40]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024

  41. [41]

    Unig2u-bench: Do unified models advance multimodal understanding?arXiv preprint arXiv:2603.03241, 2026

    Zimo Wen, Boxiu Li, Wanbo Zhang, Junxiang Lei, Xiaoyu Chen, Yijia Fan, Qi Zhang, Yujiang Wang, Lili Qiu, Bo Li, et al. Unig2u-bench: Do unified models advance multimodal understanding?arXiv preprint arXiv:2603.03241, 2026

  42. [42]

    Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

    Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. arXiv preprint arXiv:2410.13848, 2024

  43. [43]

    Openuni: A simple baseline for unified multimodal understanding and generation. arXiv preprint arXiv:2505.23661, 2025

    Size Wu, Zhonghua Wu, Zerui Gong, Qingyi Tao, Sheng Jin, Qinyue Li, Wei Li, and Chen Change Loy. Openuni: A simple baseline for unified multimodal understanding and generation.arXiv preprint arXiv:2505.23661, 2025

  44. [44]

    Harmonizing visual representations for unified multimodal understanding and generation

    Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Zhonghua Wu, Qingyi Tao, Wentao Liu, Wei Li, and Chen Change Loy. Harmonizing visual representations for unified multimodal understanding and generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17739–17750, 2025

  45. [45]

    Omnibridge: Unified multimodal understanding, generation, and retrieval via latent space alignment.arXiv preprint arXiv:2509.19018, 2025

    Teng Xiao, Zuchao Li, and Lefei Zhang. Omnibridge: Unified multimodal understanding, generation, and retrieval via latent space alignment.arXiv preprint arXiv:2509.19018, 2025

  46. [46]

    Reconstruction alignment improves unified multimodal models

    Ji Xie, Trevor Darrell, Luke Zettlemoyer, and XuDong Wang. Reconstruction alignment improves unified multimodal models. InICLR, 2026

  47. [47]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024

  48. [48]

    Show-o2: Improved Native Unified Multimodal Models

    Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564, 2025

  49. [49]

    Can understanding and generation truly benefit together–or just coexist?

    Zhiyuan Yan, Kaiqing Lin, Zongjian Li, Junyan Ye, Hui Han, Zhendong Wang, Hao Liu, Bin Lin, Hao Li, Xue Xu, et al. Can understanding and generation truly benefit together–or just coexist? arXiv e-prints, pages arXiv–2509, 2025

  50. [50]

    Hermesflow: Seamlessly closing the gap in multimodal understanding and generation

    Ling Yang, Xinchen Zhang, Ye Tian, Chenming Shang, Minghao Xu, Wentao Zhang, and Bin Cui. Hermesflow: Seamlessly closing the gap in multimodal understanding and generation. arXiv preprint arXiv:2502.12148, 2025

  51. [51]

    Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models

    Songlin Yang, Xianghao Kong, and Anyi Rao. Pseudo-unification: Entropy probing reveals divergent information patterns in unified multimodal models. arXiv preprint arXiv:2604.10949, 2026

  52. [52]

    ImgEdit: A Unified Image Editing Dataset and Benchmark

    Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275, 2025

  53. [53]

    A survey on multimodal large language models

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, 11(12):nwae403, 2024

  54. [54]

    MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

    Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023

  55. [55]

    The latent space: Foundation, evolution, mechanism, ability, and outlook

    Xinlei Yu, Zhangquan Chen, Yongbo He, Tianyu Fu, Cheng Yang, Chengming Xu, Yue Ma, Xiaobin Hu, Zhe Cao, Jie Xu, et al. The latent space: Foundation, evolution, mechanism, ability, and outlook. arXiv preprint arXiv:2604.02029, 2026

  56. [56]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9556–9567, 2024

  57. [57]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023.

  58. [58]

    Unified multimodal understanding and generation models: Advances, challenges, and opportunities

    Shanshan Zhao, Xinjie Zhang, Jintao Guo, Jiakui Hu, Lunhao Duan, Minghao Fu, Yong Xien Chng, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, et al. Unified multimodal understanding and generation models: Advances, challenges, and opportunities. arXiv preprint arXiv:2505.02567, 2025

  59. [59]

    Unpaired image-to-image translation using cycle-consistent adversarial networks

    Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017

  60. [60]

    Latentexplainer: Explaining latent representations in deep generative models with multi-modal foundation models

    Mengdan Zhu, Raasikh Kanjiani, Jiahui Lu, Andrew Choi, Qirui Ye, and Liang Zhao. Latentexplainer: Explaining latent representations in deep generative models with multi-modal foundation models. arXiv preprint arXiv:2406.14862, 2024

  61. [61]

    Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark

    Kai Zou, Ziqi Huang, Yuhao Dong, Shulin Tian, Dian Zheng, Hongbo Liu, Jingwen He, Bin Liu, Yu Qiao, and Ziwei Liu. Uni-mmmu: A massive multi-discipline multimodal unified benchmark. arXiv preprint arXiv:2510.13759, 2025.

Appendix Contents

1 Introduction
2 Related Work
3 Method
3.1 Problem Formulation ....