
arxiv: 2605.18714 · v1 · pith:U6HXRAVQnew · submitted 2026-05-18 · 💻 cs.CV · cs.AI

Semantic Generative Tuning for Unified Multimodal Models

Pith reviewed 2026-05-20 11:28 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords unified multimodal models · semantic generative tuning · image segmentation · generative post-training · multimodal alignment · feature separability · vision-language models

The pith

Image segmentation as a generative proxy aligns understanding and generation in unified multimodal models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Unified multimodal models combine visual understanding and generation in one system, yet separate training leaves their representation spaces misaligned. The paper shows that high-level semantic tasks like image segmentation work best as generative proxies during post-training because they supply structural information instead of low-level texture details. This choice improves both perception accuracy and the spatial coherence of generated outputs. The resulting Semantic Generative Tuning method also sharpens feature separability and adjusts attention patterns between visual and text elements.

Core claim

High-level semantic tasks, particularly image segmentation, serve as optimal proxies that significantly enhance both vision-centric perception and generative layout fidelity. SGT formulates these tasks as generative objectives to bridge the isolation between understanding and generation, improving feature linear separability and optimizing visual-textual attention allocation.

What carries the argument

Semantic Generative Tuning (SGT), a post-training method that treats image segmentation as the central generative proxy to align multimodal representation spaces.
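
A minimal sketch of what "segmentation as a generative proxy" can look like in data terms, assuming the label map is rendered as a color image that the model is then trained to generate from the RGB input plus a short instruction. The palette, the names `mask_to_rgb_target` and `build_training_pair`, and the instruction string are illustrative assumptions, not details taken from the paper.

```python
from typing import Dict, List, Tuple

RGB = Tuple[int, int, int]

# Hypothetical palette (class id -> flat color); not taken from the paper.
PALETTE: Dict[int, RGB] = {
    0: (0, 0, 0),       # background
    1: (220, 20, 60),   # e.g. "person"
    2: (0, 128, 255),   # e.g. "car"
}


def mask_to_rgb_target(
    mask: List[List[int]], palette: Dict[int, RGB] = PALETTE
) -> List[List[RGB]]:
    """Render an H x W grid of class ids as an H x W grid of RGB pixels."""
    return [[palette[class_id] for class_id in row] for row in mask]


def build_training_pair(
    image_path: str,
    mask: List[List[int]],
    instruction: str = "Segment the image.",  # assumed wording
) -> dict:
    """Package one sample: the model sees (image, instruction) and is
    trained to generate the colorized mask as its output image."""
    return {
        "input_image": image_path,
        "instruction": instruction,
        "target_image": mask_to_rgb_target(mask),
    }


toy_mask = [[0, 1], [2, 2]]
pair = build_training_pair("example.jpg", toy_mask)
```

The point of the rendering step is that the target carries region structure only: every pixel of a class gets the same flat color, so there is no texture for the generator to reproduce.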

If this is right

  • Better linear separability of visual features.
  • More effective visual-textual attention patterns.
  • Higher scores on both understanding and generation benchmarks.
  • Tighter coupling between perception and layout fidelity in generated images.
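
The first bullet can be made concrete with a linear probe: freeze the features, fit a linear classifier on top, and read its accuracy as a separability score. The sketch below uses a plain perceptron on toy 2-D points as a stand-in; the probe type, the toy data, and the function names are assumptions for illustration, since the page does not state which separability metric the paper uses.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def train_perceptron(
    features: Sequence[Vector], labels: Sequence[int], epochs: int = 50
) -> Tuple[List[float], float]:
    """Fit a linear classifier; labels must be -1 or +1."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(features, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * score <= 0:  # misclassified or on the boundary
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:  # converged: data is linearly separated
            break
    return weights, bias


def probe_accuracy(
    features: Sequence[Vector],
    labels: Sequence[int],
    weights: Sequence[float],
    bias: float,
) -> float:
    """Fraction of points the linear classifier labels correctly."""
    correct = 0
    for x, y in zip(features, labels):
        score = sum(w * xi for w, xi in zip(weights, x)) + bias
        prediction = 1 if score > 0 else -1
        if prediction == y:
            correct += 1
    return correct / len(labels)


# Toy "features": one separable set, one XOR-like set that no line can split.
separable = [(2.0, 1.0), (3.0, 2.0), (-2.0, -1.0), (-3.0, -1.5)]
xor_like = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
labels = [1, 1, -1, -1]

acc_separable = probe_accuracy(separable, labels, *train_perceptron(separable, labels))
acc_xor = probe_accuracy(xor_like, labels, *train_perceptron(xor_like, labels))
```

In a real comparison the probe would be fit on features from the baseline and from the tuned model, with accuracy measured on a held-out split rather than on the training points as in this toy.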

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar proxy tasks could be tested for video or 3D generation to check if structural semantics remain effective.
  • The method might reduce reliance on separate specialist models for understanding versus creation.
  • Real-world editing or scene synthesis pipelines could gain from the reported layout improvements.

Load-bearing premise

Segmentation supplies structural semantics without introducing its own biases or texture distractions, and the observed gains on tested benchmarks and models will transfer to other architectures and real-world data.

What would settle it

Running SGT on a new model or dataset yields no gain or a drop in both comprehension accuracy and generative layout metrics relative to the decoupled baseline.
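
The criterion above is a conjunction: the claim is challenged only when neither metric improves. A small sketch of that check, where the `Scores` fields, the function name, the `tolerance` parameter, and the example numbers are all hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    comprehension: float  # understanding-benchmark accuracy
    layout: float         # generative layout metric


def claim_is_challenged(
    baseline: Scores, tuned: Scores, tolerance: float = 0.0
) -> bool:
    """True when tuning fails to improve BOTH metrics by more than `tolerance`."""
    comprehension_gain = tuned.comprehension - baseline.comprehension
    layout_gain = tuned.layout - baseline.layout
    return comprehension_gain <= tolerance and layout_gain <= tolerance


# Illustrative numbers only, not results from the paper.
decoupled = Scores(comprehension=0.60, layout=0.70)
no_gain = Scores(comprehension=0.60, layout=0.65)
one_gain = Scores(comprehension=0.65, layout=0.70)
```

A nonzero `tolerance` would let the check ignore gains smaller than run-to-run noise, which matters because the page reports no error bars.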

Figures

Figures reproduced from arXiv: 2605.18714 by Songsong Yu, Yanwei Li, Ying Shan, Yuxin Chen.

Figure 1: Comparison of alignment strategies for UMMs.
Figure 2: Overview of the generative tuning paradigm. An RGB image and a concise textual instruction are processed by respective vision and text encoders to extract independent embeddings. UMMs then integrate these embeddings and map the representations to the designated task. Because empirical evaluations demonstrate that visual generation targets at an advanced semantic level yield the most significant performan…
Figure 3: Empirical evaluation of the hierarchical task ladder across diverse understanding and generation dimensions. (a) High-level proxy tasks yield greater performance gains than low-level tasks in multimodal understanding. (b) Various generative objectives consistently improve performance in the position dimension, yielding comparable overall gains. (From left to right): Position, Colors, Color Attributes, Co…
Figure 4: Qualitative comparison on compositional text-to-image generation. 4.3 More Explorations Optimal data recipe. While our analysis in Sec. 3.3 indicates that SGT independently enhances both understanding and generation, we posit that a comprehensive post-training regime must synergize SGT objectives with SFT data to maximize performance. Therefore, we conduct an ablation study to determine the optimal data sa…
Figure 5: Ablation studies on segmentation data integration.
Figure 6: Training dynamics with different SFT:Seg ratios.
Figure 7: Feature space analysis on fine-grained classes.
Figure 8: Analysis of attention patterns. (a) Layer-wise changes in attention to visual features for three proxy tasks relative to the BAGEL baseline, demonstrating a consistent increase in visual focus in deeper layers. (b) Attention distribution over text tokens. The segmentation objective effectively enhances the focus on critical tokens (Object, Color, Relation). 4.4 Mechanistic Insights: Why Semantic Proxies …
Figure 9: Illustration of various computer vision tasks. Top row: RGB Image, Semantic Segmentation, Instance Segmentation, Panoptic Segmentation, Object Detection, and Depth Estimation. Bottom row: (This figure serves solely illustrative purposes and does not originate from the MS COCO dataset.) De-raining, De-hazing, Denoising, Image Super-Resolution (ISR), Deblurring, Edge Detection, Low-light Enhancement, and RGB…
Figure 10: Visualization of images generated by SGT, demonstrating high-quality and diverse generations across a wide range of prompts and scenes. compared to GenEval. The lack of substantial improvement on this benchmark suggests that the SGT framework does not inherently facilitate complex instruction parsing capabilities. Further enhancement of these editing proficiencies likely requires the integration of speci…
Figure 11: Pseudocode for t-SNE visualization pipeline. visualization, we first apply Principal Component Analysis (PCA) to reduce the feature dimensionality to 50, followed by t-SNE projection onto a 2D plane for visualization. For t-SNE, we adopt the default perplexity value of 30. The complete pipeline is summarized in …
Figure 12: Pseudocode for keyword-level attention analysis during image generation. We extract keywords from the prompt, compute GQA attention maps at selected timesteps and layers, and aggregate attention scores for each keyword to quantify its influence on the generated image.
Figure 13: Token-level attention distribution during image generation.
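
The captions for Figures 4 and 6 refer to mixing SFT data with segmentation data at an SFT:Seg ratio. One simple way to realize such a mix is a repeating block schedule, sketched below. The function name `mixed_schedule`, the tagging scheme, and the 3:1 ratio in the example are assumptions for illustration; this page does not state which ratio the paper found best or how its sampler is implemented.

```python
from itertools import cycle, islice
from typing import Iterator, Sequence, Tuple


def mixed_schedule(
    sft_data: Sequence[str],
    seg_data: Sequence[str],
    sft_parts: int,
    seg_parts: int,
) -> Iterator[Tuple[str, str]]:
    """Endlessly yield (source, sample) pairs in blocks of
    `sft_parts` SFT samples followed by `seg_parts` segmentation samples."""
    if sft_parts < 0 or seg_parts < 0 or sft_parts + seg_parts == 0:
        raise ValueError("ratio parts must be non-negative and not both zero")
    if (sft_parts > 0 and not sft_data) or (seg_parts > 0 and not seg_data):
        raise ValueError("a stream with a nonzero ratio part must be non-empty")
    sft_iter, seg_iter = cycle(sft_data), cycle(seg_data)
    while True:
        for _ in range(sft_parts):
            yield ("sft", next(sft_iter))
        for _ in range(seg_parts):
            yield ("seg", next(seg_iter))


# Hypothetical 3:1 SFT:Seg mix over tiny placeholder datasets.
first_eight = list(islice(mixed_schedule(["a", "b"], ["m"], 3, 1), 8))
```

A deterministic block schedule makes the realized ratio exact over any whole number of blocks; a random sampler with the same expected ratio would be an equally reasonable choice.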

Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single architecture. However, prevailing training paradigms independently optimize understanding via sparse text signals and generation through dense pixel objectives. Such a decoupled strategy yields misaligned representation spaces, isolating visual understanding from generation and hindering their mutual reinforcement. This work presents the first systematic investigation into generative post-training, where we formulate hierarchical visual tasks as generative proxies to bridge the isolation in UMMs. Our empirical investigation reveals that high-level semantic tasks, particularly image segmentation, serve as optimal proxies. Unlike low-level tasks that distract models with texture details, segmentation provides structural semantics that significantly enhance both vision-centric perception and generative layout fidelity. Building upon these insights, we introduce Semantic Generative Tuning (SGT), a novel paradigm that leverages segmentation as a generative proxy to align and synergize multimodal capabilities. Mechanistic analyses further demonstrate that SGT fundamentally improves feature linear separability and optimizes visual-textual attention allocation patterns. Extensive evaluations show that SGT consistently improves both multimodal comprehension and generative fidelity across mainstream benchmarks. Our code is available at https://song2yu.github.io/SGT/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces Semantic Generative Tuning (SGT), a generative post-training paradigm for unified multimodal models (UMMs). It formulates hierarchical visual tasks as generative proxies to address the misalignment between sparse-text understanding and dense-pixel generation. The central empirical finding is that high-level semantic tasks, particularly image segmentation, serve as optimal proxies by supplying structural semantics that improve vision-centric perception and generative layout fidelity, in contrast to low-level tasks that introduce texture distractions. The work supports this with mechanistic analyses of feature linear separability and visual-textual attention allocation, plus benchmark evaluations showing consistent gains; code is released.

Significance. If the empirical results and mechanistic claims hold under rigorous controls, the work would offer a practical and conceptually clean route to joint optimization of understanding and generation in UMMs. The emphasis on mechanistic diagnostics (separability, attention patterns) and the release of code are positive contributions that aid reproducibility and follow-up work. The absence of concrete numerical results, error bars, baseline tables, or dataset specifications in the abstract, however, prevents a precise evaluation of effect sizes or generalizability at this stage.

major comments (1)
  1. The optimality claim for segmentation as a proxy (abstract and empirical investigation section) rests on the assertion that benefits arise specifically from structural semantics rather than from any dense pixel-level supervision. The manuscript compares segmentation against low-level tasks but does not report a controlled ablation that replaces segmentation with another high-level dense generative proxy (e.g., depth estimation or panoptic segmentation) under identical generative formulation, loss weighting, and compute budget. Without this isolation, it remains unclear whether the observed gains are attributable to the claimed semantic property or to generic properties of dense supervision; this directly affects the load-bearing distinction drawn in the abstract.
minor comments (2)
  1. Abstract: the statement that SGT 'consistently improves both multimodal comprehension and generative fidelity across mainstream benchmarks' is presented without any quantitative deltas, baseline names, or dataset identifiers, which reduces immediate assessability of the empirical contribution.
  2. The mechanistic analysis section would benefit from explicit statements of the linear-separability metric and the precise attention-allocation statistic used, together with statistical significance tests against the untuned baseline.
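
As an illustration of the kind of explicit statistic the second minor comment asks for, the sketch below computes a per-keyword attention share: for each keyword, the attention mass that image-query tokens place on that keyword's text tokens, averaged over queries. It assumes the attention matrix has already been averaged over heads, layers, and timesteps; the function name, the input layout, and the toy values are hypothetical and may differ from the statistic the paper actually reports.

```python
from typing import Dict, List, Sequence


def keyword_attention_share(
    attention: Sequence[Sequence[float]],
    keyword_tokens: Dict[str, List[int]],
) -> Dict[str, float]:
    """Mean attention mass per keyword.

    attention: rows are image-query tokens, columns are text tokens;
               each row is that query's attention over the prompt.
    keyword_tokens: keyword -> column indices of its text tokens.
    """
    if not attention:
        raise ValueError("attention must contain at least one query row")
    num_queries = len(attention)
    shares = {}
    for keyword, columns in keyword_tokens.items():
        total = sum(sum(row[c] for c in columns) for row in attention)
        shares[keyword] = total / num_queries
    return shares


# Toy example: two query tokens, three text tokens.
toy_attention = [
    [0.5, 0.25, 0.25],
    [0.25, 0.25, 0.5],
]
toy_keywords = {"object": [0], "color": [1, 2]}
shares = keyword_attention_share(toy_attention, toy_keywords)
```

Comparing these shares between the tuned model and the untuned baseline, with a significance test over prompts, would give the referee's requested comparison a concrete form.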

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review of our manuscript. We address the major comment point by point below, providing clarifications and outlining revisions where appropriate.

  1. Referee: The optimality claim for segmentation as a proxy (abstract and empirical investigation section) rests on the assertion that benefits arise specifically from structural semantics rather than from any dense pixel-level supervision. The manuscript compares segmentation against low-level tasks but does not report a controlled ablation that replaces segmentation with another high-level dense generative proxy (e.g., depth estimation or panoptic segmentation) under identical generative formulation, loss weighting, and compute budget. Without this isolation, it remains unclear whether the observed gains are attributable to the claimed semantic property or to generic properties of dense supervision; this directly affects the load-bearing distinction drawn in the abstract.

    Authors: We thank the referee for highlighting this important aspect of our empirical investigation. Our current comparisons focus on contrasting high-level semantic tasks like segmentation with low-level tasks (e.g., edge detection) to demonstrate that structural semantics, rather than mere dense pixel supervision, drive the improvements in perception and generation. However, we acknowledge that a direct comparison with other high-level dense proxies such as depth estimation or panoptic segmentation, under matched conditions, would further strengthen the specificity of our claims regarding segmentation's optimality. We will incorporate additional ablation studies in the revised version, ensuring identical generative formulation, loss weighting, and compute budget. This will help isolate whether the benefits are unique to segmentation's structural semantics or shared among high-level tasks. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical proxy selection rests on external benchmarks


The paper conducts an empirical investigation comparing hierarchical visual tasks as generative proxies for UMM post-training. The claim that segmentation is optimal is grounded in benchmark improvements and mechanistic analyses (linear separability, attention patterns) rather than any internal derivation, fitted parameter renamed as prediction, or self-referential definition. No equations reduce outputs to inputs by construction, and evaluations use external datasets and models. The contribution is self-contained against independent measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard domain assumptions about representation alignment in multimodal models and introduces no new free parameters, invented entities, or non-standard axioms beyond the empirical choice of proxy task.

axioms (1)
  • domain assumption: Decoupled optimization of understanding via text and generation via pixels produces misaligned representation spaces that hinder mutual reinforcement.
    This premise is stated directly in the opening of the abstract as the motivation for the work.

pith-pipeline@v0.9.0 · 5730 in / 1197 out tokens · 63323 ms · 2026-05-20T11:28:40.210151+00:00 · methodology



Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 22 internal anchors

  1. [1]

    Assran, Q

    Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., Le- Cun, Y., Ballas, N.: Self-supervised learning from images with a joint-embedding predictive architecture. arXiv:2301.08243 (2023) 2, 7

  2. [2]

    Qwen2.5-VL Technical Report

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. arXiv:2502.13923 (2025) 9

  3. [3]

    Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., Ramesh, A.: Video gener- ation models as world simulators (2024),https://openai.com/research/video- generation-models-as-world-simulators1

  4. [4]

    Chen, F., Jing, M., Lu, W., Feng, Y., Li, X., Cao, X.: Unihetero: Could gen- eration enhance understanding for vision-language-model at large data scale? arXiv:2512.23512 (2025) 5

  5. [5]

    Chen, L., Li, J., Dong, X., Zhang, P., Zang, Y., Chen, Z., Duan, H., Wang, J., Qiao, Y., Lin, D., et al.: Are we on the right way for evaluating large vision-language models? NeurIPS37, 27056–27087 (2024) 6, 9, 11, 17

  6. [6]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Chen, X., Wu, Z., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C.: Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv:2501.17811 (2025) 3, 5, 10

  7. [7]

    arXiv:2401.14404 (2024) 4

    Chen, X., Liu, Z., Xie, S., He, K.: Deconstructing denoising diffusion models for self-supervised learning. arXiv:2401.14404 (2024) 4

  8. [8]

    NeurIPS36, 49250–49267 (2023) 2

    Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning. NeurIPS36, 49250–49267 (2023) 2

  9. [9]

    Emerging Properties in Unified Multimodal Pretraining

    Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv:2505.14683 (2025) 3, 5, 7, 9, 10

  10. [10]

    In: ICLR (2024),https://openreview.net/forum? id=y01KGvd9Bw2

    Dong, R., Han, C., Peng, Y., Qi, Z., Ge, Z., Yang, J., Zhao, L., Sun, J., Zhou, H., Wei, H., Kong, X., Zhang, X., Ma, K., Yi, L.: DreamLLM: Synergistic multimodal comprehension and creation. In: ICLR (2024),https://openreview.net/forum? id=y01KGvd9Bw2

  11. [11]

    arXiv preprint arXiv:2511.23386 (2025) 4, 7, 9

    Du, S., Guo, J., Li, B., Cui, S., Xu, Z., Luo, Y., Wei, Y., Gai, K., Wang, X., Wu, K., et al.: Vqrae: Representation quantization autoencoders for multimodal understanding, generation and reconstruction. arXiv:2511.23386 (2025) 4

  12. [12]

    In: ACMMM

    Duan, H., Yang, J., Qiao, Y., Fang, X., Chen, L., Liu, Y., Dong, X., Zang, Y., Zhang, P., Wang, J., et al.: Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In: ACMMM. pp. 11198–11201 (2024) 9

  13. [13]

    In: ICML (2024) 1, 4

    Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: ICML (2024) 1, 4

  14. [14]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv:2306.13394 (2023) 9, 11

  15. [15]

    In: ECCV

    Fu, X., Yin, W., Hu, M., Wang, K., Ma, Y., Tan, P., Shen, S., Lin, D., Long, X.: Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image. In: ECCV. pp. 241–258. Springer (2024) 4 26 Songsong Yu et al

  16. [16]

    BLINK: Multimodal Large Language Models Can See but Not Perceive

    Fu, X., Hu, Y., Li, B., Feng, Y., Wang, H., Lin, X., Roth, D., Smith, N.A., Ma, W.C., Krishna, R.: Blink: Multimodal large language models can see but not per- ceive. arXiv:2404.12390 (2024) 9, 11

  17. [17]

    arXiv:2407.00783 (2024) 4

    Fuest, M., Ma, P., Gui, M., Schusterbauer, J., Hu, V.T., Ommer, B.: Diffusion models and representation learning: A survey. arXiv:2407.00783 (2024) 4

  18. [18]

    SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

    Ge,Y.,Zhao,S.,Zhu,J.,Ge,Y.,Yi,K.,Song,L.,Li,C.,Ding,X.,Shan,Y.:Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv:2404.14396 (2024) 3

  19. [19]

    NeurIPS36, 52132–52152 (2023) 3, 6, 9, 14

    Ghosh, D., Hajishirzi, H., Schmidt, L.: Geneval: An object-focused framework for evaluating text-to-image alignment. NeurIPS36, 52132–52152 (2023) 3, 6, 9, 14

  20. [20]

    In: CVPR

    Graikos,A.,Yellapragada,S.,Le,M.Q.,Kapse,S.,Prasanna,P.,Saltz,J.,Samaras, D.: Learned representation-guided diffusion models for large-image generation. In: CVPR. pp. 8532–8542 (2024) 4

  21. [21]

    In: CVPR

    Guan, T., Liu, F., Wu, X., Xian, R., Li, Z., Liu, X., Wang, X., Chen, L., Huang, F., Yacoob, Y., et al.: Hallusionbench: an advanced diagnostic suite for entangled lan- guage hallucination and visual illusion in large vision-language models. In: CVPR. pp. 14375–14385 (2024) 6, 9, 11, 17

  22. [22]

    In: CVPR

    Han, J., Liu, J., Jiang, Y., Yan, B., Zhang, Y., Yuan, Z., Peng, B., Liu, X.: Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. In: CVPR. pp. 15733–15744 (2025) 1

  23. [23]

    In: CVPR

    Hudson, D.A., Zoran, D., Malinowski, M., Lampinen, A.K., Jaegle, A., McClelland, J.L., Matthey, L., Hill, F., Lerchner, A.: Soda: Bottleneck diffusion models for representation learning. In: CVPR. pp. 23115–23127 (2024) 4

  24. [24]

    arXiv:2402.03161 (2024) 2

    Jin, Y., Sun, Z., Xu, K., Chen, L., Jiang, H., Huang, Q., Song, C., Liu, Y., Zhang, D., Song, Y., et al.: Video-lavit: Unified video-language pre-training with decoupled visual-motional tokenization. arXiv:2402.03161 (2024) 2

  25. [25]

    Unified language-vision pretraining in llm with dynamic discrete visual tokenization.arXiv preprint arXiv:2309.04669, 2023

    Jin, Y., Xu, K., Chen, L., Liao, C., Tan, J., Huang, Q., Chen, B., Lei, C., Liu, A., Song, C., et al.: Unified language-vision pretraining in llm with dynamic discrete visual tokenization. arXiv:2309.04669 (2023) 2

  26. [26]

    In: ICCV

    Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: ICCV. pp. 4015–4026 (2023) 9

  27. [27]

    LLaVA-OneVision: Easy Visual Task Transfer

    Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Li, Y., Liu, Z., Li, C.: Llava-onevision: Easy visual task transfer. arXiv:2408.03326 (2024) 9

  28. [28]

    In: EMNLP

    Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision-language models. In: EMNLP. pp. 292–305 (2023) 6, 9, 11, 17

  29. [29]

    Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback

    Li, Z., Liu, Z., Zhang, Q., Lin, B., Yuan, S., Yan, Z., Ye, Y., Yu, W., Niu, Y., Yuan, L.: Uniworld-v2: Reinforce image editing with diffusion negative-aware finetuning and mllm implicit feedback. arXiv:2510.16888 (2025) 2

  30. [30]

    arXiv:2512.19680 (2025) 5

    Liao, X., He, Q., Xu, K., Qu, X., Li, Y., Wei, W., Yao, A.: Va-π: Variational policy alignment for pixel-aware autoregressive generation. arXiv:2512.19680 (2025) 5

  31. [31]

    UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    Lin, B., Li, Z., Cheng, X., Niu, Y., Ye, Y., He, X., Yuan, S., Yu, W., Wang, S., Ge, Y., et al.: Uniworld: High-resolution semantic encoders for unified visual understanding and generation. arXiv:2506.03147 (2025) 10

  32. [32]

    Transactions of the Association for Computational Linguistics (2023) 6, 9, 17

    Liu, F., Emerson, G.E.T., Collier, N.: Visual spatial reasoning. Transactions of the Association for Computational Linguistics (2023) 6, 9, 17

  33. [33]

    In: NeurIPS (2023) 1

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023) 1

  34. [34]

    Step1X-Edit: A Practical Framework for General Image Editing

    Liu, S., Han, Y., Xing, P., Yin, F., Wang, R., Cheng, W., Liao, J., Wang, Y., Fu, H., Han, C., Li, G., Peng, Y., Sun, Q., Wu, J., Cai, Y., Ge, Z., Ming, R., Xia, L., Semantic Generative Tuning for Unified Multimodal Models 27 Zeng, X., Zhu, Y., Jiao, B., Zhang, X., Yu, G., Jiang, D.: Step1x-edit: A practical framework for general image editing. arXiv:2504...

  35. [35]

    Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: Mmbench: Is your multi-modal model an all-around player? In: ECCV. pp. 216–233. Springer (2024) 9

  36. [36]

    Science China Information Sciences67(12), 220102 (2024) 6, 17

    Liu, Y., Li, Z., Huang, M., Yang, B., Yu, W., Li, C., Yin, X.C., Liu, C.L., Jin, L., Bai, X.: Ocrbench: on the hidden mystery of ocr in large multimodal models. Science China Information Sciences67(12), 220102 (2024) 6, 17

  37. [37]

    arXiv:2312.17172 (2023) 3

    Lu, J., Clark, C., Lee, S., Zhang, Z., Khosla, S., Marten, R., Hoiem, D., Kembhavi, A.: Unified-io 2: Scaling autoregressive multimodal models with vision, language, audio, and action. arXiv:2312.17172 (2023) 3

  38. [38]

    In: ICLR (2024) 6, 9, 17

    Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K.W., Galley, M., Gao, J.: Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In: ICLR (2024) 6, 9, 17

  39. [39]

    In: NeurIPS (2022) 6, 17

    Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to explain: Multimodal reasoning via thought chains for science question answering. In: NeurIPS (2022) 6, 17

  40. [40]

    arXiv:2405.15232 (2024) 4

    Luo, R., Li, Y., Chen, L., He, W., Lin, T.E., Liu, Z., Zhang, L., Song, Z., Xia, X., Liu, T., et al.: Deem: Diffusion models serve as the eyes of large language models for image perception. arXiv:2405.15232 (2024) 4

  41. [41]

    Unitok: A unified tokenizer for visual generation and understanding.arXiv preprint arXiv:2502.20321, 2025a

    Ma, C., Jiang, Y., Wu, J., Yang, J., Yu, X., Yuan, Z., Peng, B., Qi, X.: Unitok: A unified tokenizer for visual generation and understanding. arXiv:2502.20321 (2025) 3

  42. [42]

    In: ICCV

    Ma, S., Ge, Y., Wang, T., Guo, Y., Ge, Y., Shan, Y.: Genhancer: Imperfect genera- tive models are secretly strong vision-centric enhancers. In: ICCV. pp. 24402–24412 (2025) 4, 5, 7

  43. [43]

    In: CVPR

    Ma, Y., Liu, X., Chen, X., Liu, W., Wu, C., Wu, Z., Pan, Z., Xie, Z., Zhang, H., Yu, X., et al.: Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. In: CVPR. pp. 7739–7751 (2025) 2

  44. [44]

    In: WACV

    Mathew, M., Karatzas, D., Jawahar, C.: Docvqa: A dataset for vqa on document images. In: WACV. pp. 2200–2209 (2021) 6, 17

  45. [45]

    WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

    Niu, Y., Ning, M., Zheng, M., Jin, W., Lin, B., Jin, P., Liao, J., Ning, K., Feng, C., Zhu, B., Yuan, L.: Wise: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv:2503.07265 (2025) 2

  46. [47]

    Transfer between Modalities with MetaQueries

    Pan, X., Shukla, S.N., Singh, A., Zhao, Z., Mishra, S.K., Wang, J., Xu, Z., Chen, J., Li, K., Juefei-Xu, F., et al.: Transfer between modalities with metaqueries. arXiv:2504.06256 (2025) 2

  47. [48]

    In: ECCV

    Parihar, R., Sachidanand, V., Mani, S., Karmali, T., Venkatesh Babu, R.: Pre- cisecontrol: Enhancing text-to-image diffusion models with fine-grained attribute control. In: ECCV. pp. 469–487. Springer (2024) 4

  48. [49]

    In: CVPR (2022) 14

    Peng, X., Wei, Y., Deng, A., Wang, D., Hu, D.: Balanced multimodal learning via on-the-fly gradient modulation. In: CVPR (2022) 14

  49. [50]

    In: CVPR

    Qu, L., Zhang, H., Liu, Y., Wang, X., Jiang, Y., Gao, Y., Ye, H., Du, D.K., Yuan, Z., Wu, X.: Tokenflow: Unified image tokenizer for multimodal understanding and generation. In: CVPR. pp. 2545–2555 (2025) 4

  50. [51]

    2025.doi:10.48550/arXiv.2412.15188

    Shi,W.,Han,X.,Zhou,C.,Liang,W.,Lin,X.V.,Zettlemoyer,L.,Yu,L.:Lmfusion: Adapting pretrained language models for multimodal generation. arXiv:2412.15188 (2024) 2 28 Songsong Yu et al

  51. [52]

    In: CVPR

    Shipard, J., Wiliem, A., Thanh, K.N., Xiang, W., Fookes, C.: Diversity is definitely needed: Improving model-agnostic zero-shot classification via stable diffusion. In: CVPR. pp. 769–778 (2023) 4

  52. [53]

    Generation enhances understanding in unified multimodal models via multi-representation generation

    Su, Z., Wei, H., Cen, K., Wang, Y., Chen, G., Yuan, C., Chu, X.: Generation en- hances understanding in unified multimodal models via multi-representation gen- eration. arXiv:2601.21406 (2026) 4, 10

  53. [54]

    Unilip: Adapting clip for unified multimodal understanding, generation and editing.arXiv preprint arXiv:2507.23278, 2025

    Tang, H., Xie, C., Bao, X., Weng, T., Li, P., Zheng, Y., Wang, L.: Unilip: Adapting clip for unified multimodal understanding, generation and editing. arXiv:2507.23278 (2025) 10

  54. [55]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Team, C.: Chameleon: Mixed-modal early-fusion foundation models. arXiv:2405.09818 (2024) 4, 10

  55. [56]

    NeurIPS37, 84839–84865 (2024) 1

    Tian, K., Jiang, Y., Yuan, Z., Peng, B., Wang, L.: Visual autoregressive model- ing: Scalable image generation via next-scale prediction. NeurIPS37, 84839–84865 (2024) 1

  56. [57]

    NeurIPS 36, 48382–48402 (2023) 4

    Tian, Y., Fan, L., Isola, P., Chang, H., Krishnan, D.: Stablerep: Synthetic images from text-to-image models make strong visual representation learners. NeurIPS 36, 48382–48402 (2023) 4

  57. [58]

    NeurIPs37, 87310–87356 (2024) 3, 6, 9, 11, 17

    Tong, P., Brown, E., Wu, P., Woo, S., Iyer, A.J.V., Akula, S.C., Yang, S., Yang, J., Middepogu, M., Wang, Z., et al.: Cambrian-1: A fully open, vision-centric ex- ploration of multimodal llms. NeurIPs37, 87310–87356 (2024) 3, 6, 9, 11, 17

  58. [59]

    MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

    Tong, S., Fan, D., Zhu, J., Xiong, Y., Chen, X., Sinha, K., Rabbat, M., LeCun, Y., Xie, S., Liu, Z.: Metamorph: Multimodal understanding and generation via instruction tuning. arXiv:2412.14164 (2024) 4

  59. [60]

    In: CVPR

    Tong, S., Liu, Z., Zhai, Y., Ma, Y., LeCun, Y., Xie, S.: Eyes wide shut? exploring the visual shortcomings of multimodal llms. In: CVPR. pp. 9568–9578 (2024) 6, 9, 11, 17

  60. [61]

    Reconstructive visual instruction tuning.arXiv preprint arXiv:2410.09575, 2024

    Wang, H., Zheng, A., Zhao, Y., Wang, T., Ge, Z., Zhang, X., Zhang, Z.: Recon- structive visual instruction tuning. arXiv:2410.09575 (2024) 4, 5

  61. [62]

    arXiv:2508.03320 (2025) 2

    Wang, P., Peng, Y., Gan, Y., Hu, L., Xie, T., Wang, X., Wei, Y., Tang, C., Zhu, B., Li, C., et al.: Skywork unipic: Unified autoregressive modeling for visual un- derstanding and generation. arXiv:2508.03320 (2025) 2

  62. [63]

    Diffusion Feedback Helps CLIP See Better

    Wang, W., Sun, Q., Zhang, F., Tang, Y., Liu, J., Wang, X.: Diffusion feedback helps clip see better. arXiv:2407.20171 (2024) 4, 5

  63. [64]

    Emu3: Next-Token Prediction is All You Need

    Wang, X., Zhang, X., Luo, Z., Sun, Q., Cui, Y., Wang, J., Zhang, F., Wang, Y., Li, Z., Yu, Q., et al.: Emu3: Next-token prediction is all you need. arXiv:2409.18869 (2024) 3, 10

  64. [65]

    InfoDiffusion: Representation Learning Using Information Maximizing Diffusion Models

    Wang, Y., Schiff, Y., Gokaslan, A., Pan, W., Wang, F., De Sa, C., Kuleshov, V.: Infodiffusion: Representation learning using information maximizing diffusion models. In: ICML. pp. 36336–36354. PMLR (2023) 4

  65. [66]

    LightFusion: A Light-Weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation

    Wang, Z., Chen, Z., Gou, C., Li, F., Deng, C., Zhu, D., Li, K., Yu, W., Tu, H., Fan, H., et al.: Lightfusion: A light-weighted, double fusion framework for unified multimodal understanding and generation. arXiv:2510.22946 (2025) 3

  66. [67]

    Diffusion Models as Masked Autoencoders

    Wei, C., Mangalam, K., Huang, P.Y., Li, Y., Fan, H., Xu, H., Wang, H., Xie, C., Yuille, A., Feichtenhofer, C.: Diffusion models as masked autoencoders. In: ICCV. pp. 16284–16294 (2023) 4

  67. [68]

    Fast Diffusion-Based Counterfactuals for Shortcut Removal and Generation

    Weng, N., Pegios, P., Petersen, E., Feragen, A., Bigdeli, S.: Fast diffusion-based counterfactuals for shortcut removal and generation. In: ECCV. pp. 338–357. Springer (2024) 4

  68. [69]

    Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

    Wu, C., Chen, X., Wu, Z., Ma, Y., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C., et al.: Janus: Decoupling visual encoding for unified multimodal understanding and generation. arXiv:2410.13848 (2024) 2

  69. [70]

    OmniGen2: Exploration to Advanced Multimodal Generation

    Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., Li, W., Jiang, X., Liu, Y., Zhou, J., et al.: Omnigen2: Exploration to advanced multimodal generation. arXiv:2506.18871 (2025) 3, 5, 7, 9, 10

  70. [71]

    Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think

    Wu, G., Zhang, S., Shi, R., Gao, S., Chen, Z., Wang, L., Chen, Z., Gao, H., Tang, Y., Yang, J., et al.: Representation entanglement for generation: Training diffusion transformers is much easier than you think. arXiv:2507.01467 (2025) 4

  71. [72]

    Liquid: Language Models Are Scalable and Unified Multi-Modal Generators

    Wu, J., Jiang, Y., Ma, C., Liu, Y., Zhao, H., Yuan, Z., Bai, S., Bai, X.: Liquid: Language models are scalable and unified multi-modal generators. IJCV (2025) 3

  72. [73]

    OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation

    Wu, S., Wu, Z., Gong, Z., Tao, Q., Jin, S., Li, Q., Li, W., Loy, C.C.: Openuni: A simple baseline for unified multimodal understanding and generation. arXiv:2505.23661 (2025) 5, 10

  73. [74]

    Harmonizing Visual Representations for Unified Multimodal Understanding and Generation

    Wu, S., Zhang, W., Xu, L., Jin, S., Wu, Z., Tao, Q., Liu, W., Li, W., Loy, C.C.: Harmonizing visual representations for unified multimodal understanding and generation. In: ICCV. pp. 17739–17750 (2025) 10

  74. [75]

    VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

    Wu, Y., Zhang, Z., Chen, J., Tang, H., Li, D., Fang, Y., Zhu, L., Xie, E., Yin, H., Yi, L., et al.: Vila-u: a unified foundation model integrating visual understanding and generation. arXiv:2409.04429 (2024) 3

  75. [76]

    xAI: Grok-1.5 vision preview. https://x.ai/news/grok-1.5v (2024) 9

  76. [77]

    Reconstruction Alignment Improves Unified Multimodal Models

    Xie, J., Darrell, T., Zettlemoyer, L., Wang, X.: Reconstruction alignment improves unified multimodal models. arXiv:2509.07295 (2025) 2, 4, 8, 10

  77. [78]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Xie, J., Mao, W., Bai, Z., Zhang, D.J., Wang, W., Lin, K.Q., Gu, Y., Chen, Z., Yang, Z., Shou, M.Z.: Show-o: One single transformer to unify multimodal understanding and generation. arXiv:2408.12528 (2024) 4, 10

  78. [79]

    Show-o2: Improved Native Unified Multimodal Models

    Xie, J., Yang, Z., Shou, M.Z.: Show-o2: Improved native unified multimodal models. arXiv:2506.15564 (2025) 4

  79. [80]

    MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling

    Yang, J., Yin, D., Zhou, Y., Rao, F., Zhai, W., Cao, Y., Zha, Z.J.: Mmar: Towards lossless multi-modal auto-regressive probabilistic modeling. In: CVPR. pp. 7974–7985 (2025) 2

  80. [81]

    Diffusion Model as Representation Learner

    Yang, X., Wang, X.: Diffusion model as representation learner. In: ICCV. pp. 18938–18949 (2023) 4

Showing first 80 references.