Semantic Generative Tuning for Unified Multimodal Models

Songsong Yu; Yanwei Li; Ying Shan; Yuxin Chen

arxiv: 2605.18714 · v1 · pith:U6HXRAVQnew · submitted 2026-05-18 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-20 11:28 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords unified multimodal models · semantic generative tuning · image segmentation · generative post-training · multimodal alignment · feature separability · vision-language models

The pith

Image segmentation as a generative proxy aligns understanding and generation in unified multimodal models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Unified multimodal models combine visual understanding and generation in one system, yet separate training leaves their representation spaces misaligned. The paper shows that high-level semantic tasks like image segmentation work best as generative proxies during post-training because they supply structural information instead of low-level texture details. This choice improves both perception accuracy and the spatial coherence of generated outputs. The resulting Semantic Generative Tuning method also sharpens feature separability and adjusts attention patterns between visual and text elements.

Core claim

High-level semantic tasks, particularly image segmentation, serve as optimal proxies that significantly enhance both vision-centric perception and generative layout fidelity. SGT formulates these tasks as generative objectives to bridge the isolation between understanding and generation, improving feature linear separability and optimizing visual-textual attention allocation.

What carries the argument

Semantic Generative Tuning (SGT), a post-training method that treats image segmentation as the central generative proxy to align multimodal representation spaces.

If this is right

Better linear separability of visual features.
More effective visual-textual attention patterns.
Higher scores on both understanding and generation benchmarks.
Tighter coupling between perception and layout fidelity in generated images.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar proxy tasks could be tested for video or 3D generation to check if structural semantics remain effective.
The method might reduce reliance on separate specialist models for understanding versus creation.
Real-world editing or scene synthesis pipelines could gain from the reported layout improvements.

Load-bearing premise

Segmentation supplies structural semantics without introducing its own biases or texture distractions, and the observed gains on tested benchmarks and models will transfer to other architectures and real-world data.

What would settle it

The claim would be undermined if running SGT on a new model or dataset yielded no gain, or a drop, in both comprehension accuracy and generative layout metrics relative to the decoupled baseline.

Figures

Figures reproduced from arXiv: 2605.18714 by Songsong Yu, Yanwei Li, Ying Shan, Yuxin Chen.

**Figure 2.** Overview of the generative tuning paradigm. An RGB image and a concise textual instruction are processed by respective vision and text encoders to extract independent embeddings. UMMs then integrate these embeddings and map the representations to the designated task. Because empirical evaluations demonstrate that visual generation targets at an advanced semantic level yield the most significant performan…

**Figure 3.** Empirical evaluation of the hierarchical task ladder across diverse understanding and generation dimensions. (a) High-level proxy tasks yield greater performance gains than low-level tasks in multimodal understanding. (b) Various generative objectives consistently improve performance in the position dimension, yielding comparable overall gains. (From left to right): Position, Colors, Color Attributes, Co…

**Figure 4.** Qualitative comparison on compositional text-to-image generation.

4.3 More Explorations. Optimal data recipe. While our analysis in Sec. 3.3 indicates that SGT independently enhances both understanding and generation, we posit that a comprehensive post-training regime must synergize SGT objectives with SFT data to maximize performance. Therefore, we conduct an ablation study to determine the optimal data sa…

**Figure 5.** Ablation studies on segmentation data integration.

**Figure 6.** Training dynamics with different SFT:Seg ratios.

**Figure 7.** Feature space analysis on fine-grained classes.

**Figure 8.** Analysis of attention patterns. (a) Layer-wise changes in attention to visual features for three proxy tasks relative to the BAGEL baseline, demonstrating a consistent increase in visual focus in deeper layers. (b) Attention distribution over text tokens. The segmentation objective effectively enhances the focus on critical tokens (Object, Color, Relation).

4.4 Mechanistic Insights: Why Semantic Proxies …

**Figure 9.** Illustration of various computer vision tasks. Top row: RGB Image, Semantic Segmentation, Instance Segmentation, Panoptic Segmentation, Object Detection, and Depth Estimation. Bottom row: De-raining, De-hazing, Denoising, Image Super-Resolution (ISR), Deblurring, Edge Detection, Low-light Enhancement, and RGB… (This figure serves solely illustrative purposes and does not originate from the MS COCO dataset.)

**Figure 10.** Visualization of images generated by SGT, demonstrating high-quality and diverse generations across a wide range of prompts and scenes.

… compared to GenEval. The lack of substantial improvement on this benchmark suggests that the SGT framework does not inherently facilitate complex instruction parsing capabilities. Further enhancement of these editing proficiencies likely requires the integration of speci…

**Figure 11.** Pseudocode for t-SNE visualization pipeline.

… visualization, we first apply Principal Component Analysis (PCA) to reduce the feature dimensionality to 50, followed by t-SNE projection onto a 2D plane for visualization. For t-SNE, we adopt the default perplexity value of 30. The complete pipeline is summarized in …

**Figure 12.** Pseudocode for keyword-level attention analysis during image generation. We extract keywords from the prompt, compute GQA attention maps at selected timesteps and layers, and aggregate attention scores for each keyword to quantify its influence on the generated image.

**Figure 13.** Token-level attention distribution during image generation.
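The keyword-level attention analysis described in the Figure 12 caption can be illustrated with a minimal sketch. It is not the authors' pseudocode: it assumes the attention map has already been extracted and averaged over heads, layers, and timesteps into one row per image token, and the function name `keyword_attention`, the example prompt, and all numbers are hypothetical.

```python
from statistics import mean


def keyword_attention(attn, keyword_spans):
    """Aggregate image-to-text attention per keyword.

    attn: list of rows, one per image (query) token; each row holds the
          attention weights over the prompt's text tokens.
    keyword_spans: dict mapping a keyword category to the indices of the
          text tokens that make up that keyword.
    Returns a dict mapping each category to its mean attention score.
    """
    n_text = len(attn[0])
    if any(len(row) != n_text for row in attn):
        raise ValueError("every attention row must cover the same text tokens")
    # Average the attention each text token receives over all image tokens.
    per_token = [mean(row[j] for row in attn) for j in range(n_text)]
    # Average over the tokens belonging to each keyword.
    return {
        name: mean(per_token[j] for j in indices)
        for name, indices in keyword_spans.items()
    }


# Hypothetical prompt: "a red cube left of a ball" (7 text tokens).
attn = [
    [0.10, 0.30, 0.20, 0.20, 0.05, 0.05, 0.10],  # image token 0
    [0.10, 0.10, 0.40, 0.20, 0.05, 0.05, 0.10],  # image token 1
]
spans = {"Object": [2], "Color": [1], "Relation": [3, 4]}
scores = keyword_attention(attn, spans)
```

Taking index spans instead of matching strings lets a keyword cover several tokens, which is the usual case once a prompt has been split by a subword tokenizer.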

Original abstract

Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single architecture. However, prevailing training paradigms independently optimize understanding via sparse text signals and generation through dense pixel objectives. Such a decoupled strategy yields misaligned representation spaces, isolating visual understanding from generation and hindering their mutual reinforcement. This work presents the first systematic investigation into generative post-training, where we formulate hierarchical visual tasks as generative proxies to bridge the isolation in UMMs. Our empirical investigation reveals that high-level semantic tasks, particularly image segmentation, serve as optimal proxies. Unlike low-level tasks that distract models with texture details, segmentation provides structural semantics that significantly enhance both vision-centric perception and generative layout fidelity. Building upon these insights, we introduce Semantic Generative Tuning (SGT), a novel paradigm that leverages segmentation as a generative proxy to align and synergize multimodal capabilities. Mechanistic analyses further demonstrate that SGT fundamentally improves feature linear separability and optimizes visual-textual attention allocation patterns. Extensive evaluations show that SGT consistently improves both multimodal comprehension and generative fidelity across mainstream benchmarks. Our code is available at https://song2yu.github.io/SGT/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Segmentation as generative proxy improves UMM alignment more than low-level tasks, but the semantic explanation still needs tighter isolation from other dense objectives.

The main takeaway is that post-training unified multimodal models with segmentation as a generative proxy produces measurable gains in both perception and generation fidelity compared to low-level pixel tasks. The paper frames this as the first systematic study of generative post-training and identifies segmentation as the strongest proxy among the ones tested. They back the claim with mechanistic checks showing improved feature linear separability and more coherent visual-textual attention patterns, plus benchmark improvements and public code. That combination of empirical results and some explanatory analysis is the part that stands out as useful rather than routine extension work. The systematic comparison across task types is also a clear step beyond single-proxy experiments in prior tuning papers.

The soft spot is the causal attribution. The central argument is that segmentation supplies structural semantics without texture distractions, yet the comparisons are mainly against low-level tasks. There is no direct ablation that swaps in other high-level dense proxies such as depth estimation or panoptic segmentation under matched generative formulation and compute. Without that control, the observed benefits could trace to supervision density or loss properties shared across any dense pixel objective rather than the claimed semantic structure. If the full paper already contains those runs, the concern shrinks; from the presented evidence it remains open.

This work is aimed at researchers who train or fine-tune unified multimodal architectures and want a practical post-training recipe. It has enough concrete method, analysis, and reproducibility elements to merit a serious referee, even if revisions will likely focus on stronger isolation of the mechanism. I would send it out for review rather than desk reject.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces Semantic Generative Tuning (SGT), a generative post-training paradigm for unified multimodal models (UMMs). It formulates hierarchical visual tasks as generative proxies to address the misalignment between sparse-text understanding and dense-pixel generation. The central empirical finding is that high-level semantic tasks, particularly image segmentation, serve as optimal proxies by supplying structural semantics that improve vision-centric perception and generative layout fidelity, in contrast to low-level tasks that introduce texture distractions. The work supports this with mechanistic analyses of feature linear separability and visual-textual attention allocation, plus benchmark evaluations showing consistent gains; code is released.

Significance. If the empirical results and mechanistic claims hold under rigorous controls, the work would offer a practical and conceptually clean route to joint optimization of understanding and generation in UMMs. The emphasis on mechanistic diagnostics (separability, attention patterns) and the release of code are positive contributions that aid reproducibility and follow-up work. The absence of concrete numerical results, error bars, baseline tables, or dataset specifications in the abstract, however, prevents a precise evaluation of effect sizes or generalizability at this stage.

major comments (1)

The optimality claim for segmentation as a proxy (abstract and empirical investigation section) rests on the assertion that benefits arise specifically from structural semantics rather than from any dense pixel-level supervision. The manuscript compares segmentation against low-level tasks but does not report a controlled ablation that replaces segmentation with another high-level dense generative proxy (e.g., depth estimation or panoptic segmentation) under identical generative formulation, loss weighting, and compute budget. Without this isolation, it remains unclear whether the observed gains are attributable to the claimed semantic property or to generic properties of dense supervision; this directly affects the load-bearing distinction drawn in the abstract.

minor comments (2)

Abstract: the statement that SGT 'consistently improves both multimodal comprehension and generative fidelity across mainstream benchmarks' is presented without any quantitative deltas, baseline names, or dataset identifiers, which reduces immediate assessability of the empirical contribution.
The mechanistic analysis section would benefit from explicit statements of the linear-separability metric and the precise attention-allocation statistic used, together with statistical significance tests against the untuned baseline.
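The second minor comment asks for an explicit linear-separability metric. One common choice is the held-out accuracy of a linear probe trained on frozen features. The sketch below illustrates that idea under stated assumptions: it is not the paper's metric (which this page does not specify), the function names are hypothetical, and the features are synthetic 2D clusters standing in for real model activations.

```python
import math
import random


def train_linear_probe(feats, labels, epochs=200, lr=0.1):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim = len(feats[0])
    w, b, n = [0.0] * dim, 0.0, len(feats)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(feats, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))      # clamp to avoid exp overflow
            err = 1.0 / (1.0 + math.exp(-z)) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, feats, labels):
    """Fraction of samples on the correct side of the learned hyperplane."""
    correct = 0
    for x, y in zip(feats, labels):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((z > 0) == (y == 1))
    return correct / len(feats)


# Synthetic stand-in for frozen features: two well-separated 2D clusters.
rng = random.Random(0)
feats, labels = [], []
for i in range(100):
    y = i % 2
    center = 2.0 if y else -2.0
    feats.append([rng.gauss(center, 0.5), rng.gauss(center, 0.5)])
    labels.append(y)

# Train on the first 70 samples, evaluate on the remaining 30.
w, b = train_linear_probe(feats[:70], labels[:70])
accuracy = probe_accuracy(w, b, feats[70:], labels[70:])
```

Comparing this accuracy between the tuned and untuned model on the same frozen layer, across several seeds, would give the kind of baseline-relative statistic the comment requests.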

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review of our manuscript. We address the major comment point by point below, providing clarifications and outlining revisions where appropriate.

Referee: The optimality claim for segmentation as a proxy (abstract and empirical investigation section) rests on the assertion that benefits arise specifically from structural semantics rather than from any dense pixel-level supervision. The manuscript compares segmentation against low-level tasks but does not report a controlled ablation that replaces segmentation with another high-level dense generative proxy (e.g., depth estimation or panoptic segmentation) under identical generative formulation, loss weighting, and compute budget. Without this isolation, it remains unclear whether the observed gains are attributable to the claimed semantic property or to generic properties of dense supervision; this directly affects the load-bearing distinction drawn in the abstract.

Authors: We thank the referee for highlighting this important aspect of our empirical investigation. Our current comparisons focus on contrasting high-level semantic tasks like segmentation with low-level tasks (e.g., edge detection) to demonstrate that structural semantics, rather than mere dense pixel supervision, drive the improvements in perception and generation. However, we acknowledge that a direct comparison with other high-level dense proxies such as depth estimation or panoptic segmentation, under matched conditions, would further strengthen the specificity of our claims regarding segmentation's optimality. We will incorporate additional ablation studies in the revised version, ensuring identical generative formulation, loss weighting, and compute budget. This will help isolate whether the benefits are unique to segmentation's structural semantics or shared among high-level tasks.

Revision: yes
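A matched ablation of the kind discussed above changes only the proxy task and holds everything else fixed. The sketch below shows one way to enforce that in configuration code. Every name and value in it (`AblationConfig`, the field names, the step count, batch size, and seed) is a hypothetical placeholder and is not taken from the paper or its code release.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class AblationConfig:
    """Hypothetical run configuration; all fields are illustrative."""
    proxy_task: str      # the only field allowed to vary across runs
    loss_weight: float   # weight on the generative proxy loss
    train_steps: int     # proxy for the compute budget
    batch_size: int
    seed: int


def matched_ablation(base, proxies):
    """Return one config per proxy task, identical to base otherwise."""
    return [replace(base, proxy_task=p) for p in proxies]


def differing_fields(a, b):
    """Names of the fields on which two configs disagree."""
    da, db = asdict(a), asdict(b)
    return {key for key in da if da[key] != db[key]}


base = AblationConfig(
    proxy_task="segmentation",
    loss_weight=1.0,
    train_steps=10_000,
    batch_size=64,
    seed=0,
)
runs = matched_ablation(
    base, ["segmentation", "panoptic_segmentation", "depth_estimation"]
)
```

Making the dataclass frozen and deriving each run with `replace` means a run cannot silently drift from the base settings, and `differing_fields` gives a cheap check before launching.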

Circularity Check

0 steps flagged

No circularity: empirical proxy selection rests on external benchmarks

The paper conducts an empirical investigation comparing hierarchical visual tasks as generative proxies for UMM post-training. The claim that segmentation is optimal is grounded in benchmark improvements and mechanistic analyses (linear separability, attention patterns) rather than any internal derivation, fitted parameter renamed as prediction, or self-referential definition. No equations reduce outputs to inputs by construction, and evaluations use external datasets and models. The contribution is self-contained against independent measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard domain assumptions about representation alignment in multimodal models and introduces no new free parameters, invented entities, or non-standard axioms beyond the empirical choice of proxy task.

axioms (1)

domain assumption: Decoupled optimization of understanding via text and generation via pixels produces misaligned representation spaces that hinder mutual reinforcement.
This premise is stated directly in the opening of the abstract as the motivation for the work.

pith-pipeline@v0.9.0 · 5730 in / 1197 out tokens · 63323 ms · 2026-05-20T11:28:40.210151+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "high-level semantic tasks, particularly image segmentation, serve as optimal proxies... segmentation provides structural semantics that significantly enhance both vision-centric perception and generative layout fidelity"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "SGT fundamentally improves feature linear separability and optimizes visual-textual attention allocation pattern"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


work page 2023
[58]

NeurIPs37, 87310–87356 (2024) 3, 6, 9, 11, 17

Tong, P., Brown, E., Wu, P., Woo, S., Iyer, A.J.V., Akula, S.C., Yang, S., Yang, J., Middepogu, M., Wang, Z., et al.: Cambrian-1: A fully open, vision-centric ex- ploration of multimodal llms. NeurIPs37, 87310–87356 (2024) 3, 6, 9, 11, 17

work page 2024
[59]

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

Tong, S., Fan, D., Zhu, J., Xiong, Y., Chen, X., Sinha, K., Rabbat, M., LeCun, Y., Xie, S., Liu, Z.: Metamorph: Multimodal understanding and generation via instruction tuning. arXiv:2412.14164 (2024) 4

work page internal anchor Pith review Pith/arXiv arXiv 2024
[60]

In: CVPR

Tong, S., Liu, Z., Zhai, Y., Ma, Y., LeCun, Y., Xie, S.: Eyes wide shut? exploring the visual shortcomings of multimodal llms. In: CVPR. pp. 9568–9578 (2024) 6, 9, 11, 17

work page 2024
[61]

Reconstructive visual instruction tuning.arXiv preprint arXiv:2410.09575, 2024

Wang, H., Zheng, A., Zhao, Y., Wang, T., Ge, Z., Zhang, X., Zhang, Z.: Recon- structive visual instruction tuning. arXiv:2410.09575 (2024) 4, 5

work page arXiv 2024
[62]

arXiv:2508.03320 (2025) 2

Wang, P., Peng, Y., Gan, Y., Hu, L., Xie, T., Wang, X., Wei, Y., Tang, C., Zhu, B., Li, C., et al.: Skywork unipic: Unified autoregressive modeling for visual un- derstanding and generation. arXiv:2508.03320 (2025) 2

work page arXiv 2025
[63]

arXiv preprint arXiv:2407.20171 , year=

Wang, W., Sun, Q., Zhang, F., Tang, Y., Liu, J., Wang, X.: Diffusion feedback helps clip see better. arXiv:2407.20171 (2024) 4, 5

work page arXiv 2024
[64]

Emu3: Next-Token Prediction is All You Need

Wang, X., Zhang, X., Luo, Z., Sun, Q., Cui, Y., Wang, J., Zhang, F., Wang, Y., Li, Z., Yu, Q., et al.: Emu3: Next-token prediction is all you need. arXiv:2409.18869 (2024) 3, 10

work page internal anchor Pith review Pith/arXiv arXiv 2024
[65]

In: ICML

Wang, Y., Schiff, Y., Gokaslan, A., Pan, W., Wang, F., De Sa, C., Kuleshov, V.: Infodiffusion: Representation learning using information maximizing diffusion models. In: ICML. pp. 36336–36354. PMLR (2023) 4

work page 2023
[66]

arXiv:2510.22946 (2025) 3

Wang, Z., Chen, Z., Gou, C., Li, F., Deng, C., Zhu, D., Li, K., Yu, W., Tu, H., Fan, H., et al.: Lightfusion: A light-weighted, double fusion framework for unified multimodal understanding and generation. arXiv:2510.22946 (2025) 3

work page arXiv 2025
[67]

In: ICCV

Wei, C., Mangalam, K., Huang, P.Y., Li, Y., Fan, H., Xu, H., Wang, H., Xie, C., Yuille, A., Feichtenhofer, C.: Diffusion models as masked autoencoders. In: ICCV. pp. 16284–16294 (2023) 4

work page 2023
[68]

In: ECCV

Weng, N., Pegios, P., Petersen, E., Feragen, A., Bigdeli, S.: Fast diffusion-based counterfactuals for shortcut removal and generation. In: ECCV. pp. 338–357. Springer (2024) 4

work page 2024
[69]

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

Wu, C., Chen, X., Wu, Z., Ma, Y., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C., et al.: Janus: Decoupling visual encoding for unified multimodal understanding and generation. arXiv:2410.13848 (2024) 2 Semantic Generative Tuning for Unified Multimodal Models 29

work page internal anchor Pith review Pith/arXiv arXiv 2024
[70]

OmniGen2: Towards Instruction-Aligned Multimodal Generation

Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., Li, W., Jiang, X., Liu, Y., Zhou, J., et al.: Omnigen2: Exploration to advanced multimodal generation. arXiv:2506.18871 (2025) 3, 5, 7, 9, 10

work page internal anchor Pith review Pith/arXiv arXiv 2025
[71]

arXiv preprint arXiv:2507.01467 (2025)

Wu, G., Zhang, S., Shi, R., Gao, S., Chen, Z., Wang, L., Chen, Z., Gao, H., Tang, Y., Yang, J., et al.: Representation entanglement for generation: Training diffusion transformers is much easier than you think. arXiv:2507.01467 (2025) 4

work page arXiv 2025
[72]

IJCV (2025) 3

Wu, J., Jiang, Y., Ma, C., Liu, Y., Zhao, H., Yuan, Z., Bai, S., Bai, X.: Liquid: Language models are scalable and unified multi-modal generators. IJCV (2025) 3

work page 2025
[73]

Openuni: A simple baseline for unified multimodal understanding and generation.arXiv preprint arXiv:2505.23661, 2025b

Wu, S., Wu, Z., Gong, Z., Tao, Q., Jin, S., Li, Q., Li, W., Loy, C.C.: Ope- nuni: A simple baseline for unified multimodal understanding and generation. arXiv:2505.23661 (2025) 5, 10

work page arXiv 2025
[74]

In: ICCV

Wu, S., Zhang, W., Xu, L., Jin, S., Wu, Z., Tao, Q., Liu, W., Li, W., Loy, C.C.: Harmonizing visual representations for unified multimodal understanding and gen- eration. In: ICCV. pp. 17739–17750 (2025) 10

work page 2025
[75]

VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

Wu, Y., Zhang, Z., Chen, J., Tang, H., Li, D., Fang, Y., Zhu, L., Xie, E., Yin, H., Yi, L., et al.: Vila-u: a unified foundation model integrating visual understanding and generation. arXiv:2409.04429 (2024) 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[76]

xAI: Grok-1.5 vision preview.https://x.ai/news/grok-1.5v(2024) 9

work page 2024
[77]

"Your output must be a single JSON object.\n\n

Xie, J., Darrell, T., Zettlemoyer, L., Wang, X.: Reconstruction alignment improves unified multimodal models. arXiv:2509.07295 (2025) 2, 4, 8, 10

work page arXiv 2025
[78]

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Xie, J., Mao, W., Bai, Z., Zhang, D.J., Wang, W., Lin, K.Q., Gu, Y., Chen, Z., Yang, Z., Shou, M.Z.: Show-o: One single transformer to unify multimodal under- standing and generation. arXiv:2408.12528 (2024) 4, 10

work page internal anchor Pith review Pith/arXiv arXiv 2024
[79]

Show-o2: Improved Native Unified Multimodal Models

Xie,J.,Yang,Z.,Shou,M.Z.:Show-o2:Improvednativeunifiedmultimodalmodels. arXiv:2506.15564 (2025) 4

work page internal anchor Pith review Pith/arXiv arXiv 2025
[80]

In: CVPR

Yang, J., Yin, D., Zhou, Y., Rao, F., Zhai, W., Cao, Y., Zha, Z.J.: Mmar: Towards lossless multi-modal auto-regressive probabilistic modeling. In: CVPR. pp. 7974– 7985 (2025) 2

work page 2025
[81]

In: ICCV

Yang, X., Wang, X.: Diffusion model as representation learner. In: ICCV. pp. 18938–18949 (2023) 4

work page 2023

Showing first 80 references.

[1] Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., Ballas, N.: Self-supervised learning from images with a joint-embedding predictive architecture. arXiv:2301.08243 (2023)

[2] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. arXiv:2502.13923 (2025)

[3] Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., Ramesh, A.: Video generation models as world simulators (2024), https://openai.com/research/video-generation-models-as-world-simulators

[4] Chen, F., Jing, M., Lu, W., Feng, Y., Li, X., Cao, X.: Unihetero: Could generation enhance understanding for vision-language-model at large data scale? arXiv:2512.23512 (2025)

[5] Chen, L., Li, J., Dong, X., Zhang, P., Zang, Y., Chen, Z., Duan, H., Wang, J., Qiao, Y., Lin, D., et al.: Are we on the right way for evaluating large vision-language models? NeurIPS 37, 27056–27087 (2024)

[6] Chen, X., Wu, Z., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C.: Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv:2501.17811 (2025)

[7] Chen, X., Liu, Z., Xie, S., He, K.: Deconstructing denoising diffusion models for self-supervised learning. arXiv:2401.14404 (2024)

[8] Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning. NeurIPS 36, 49250–49267 (2023)

[9] Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv:2505.14683 (2025)

[10] Dong, R., Han, C., Peng, Y., Qi, Z., Ge, Z., Yang, J., Zhao, L., Sun, J., Zhou, H., Wei, H., Kong, X., Zhang, X., Ma, K., Yi, L.: DreamLLM: Synergistic multimodal comprehension and creation. In: ICLR (2024), https://openreview.net/forum?id=y01KGvd9Bw

[11] Du, S., Guo, J., Li, B., Cui, S., Xu, Z., Luo, Y., Wei, Y., Gai, K., Wang, X., Wu, K., et al.: Vqrae: Representation quantization autoencoders for multimodal understanding, generation and reconstruction. arXiv:2511.23386 (2025)

[12] Duan, H., Yang, J., Qiao, Y., Fang, X., Chen, L., Liu, Y., Dong, X., Zang, Y., Zhang, P., Wang, J., et al.: Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In: ACMMM. pp. 11198–11201 (2024)

[13] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: ICML (2024)

[14] Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv:2306.13394 (2023)

[15] Fu, X., Yin, W., Hu, M., Wang, K., Ma, Y., Tan, P., Shen, S., Lin, D., Long, X.: Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image. In: ECCV. pp. 241–258. Springer (2024)

[16] Fu, X., Hu, Y., Li, B., Feng, Y., Wang, H., Lin, X., Roth, D., Smith, N.A., Ma, W.C., Krishna, R.: Blink: Multimodal large language models can see but not perceive. arXiv:2404.12390 (2024)

[17] Fuest, M., Ma, P., Gui, M., Schusterbauer, J., Hu, V.T., Ommer, B.: Diffusion models and representation learning: A survey. arXiv:2407.00783 (2024)

[18] Ge, Y., Zhao, S., Zhu, J., Ge, Y., Yi, K., Song, L., Li, C., Ding, X., Shan, Y.: Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv:2404.14396 (2024)

[19] Ghosh, D., Hajishirzi, H., Schmidt, L.: Geneval: An object-focused framework for evaluating text-to-image alignment. NeurIPS 36, 52132–52152 (2023)

[20] Graikos, A., Yellapragada, S., Le, M.Q., Kapse, S., Prasanna, P., Saltz, J., Samaras, D.: Learned representation-guided diffusion models for large-image generation. In: CVPR. pp. 8532–8542 (2024)

[21] Guan, T., Liu, F., Wu, X., Xian, R., Li, Z., Liu, X., Wang, X., Chen, L., Huang, F., Yacoob, Y., et al.: Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In: CVPR. pp. 14375–14385 (2024)

[22] Han, J., Liu, J., Jiang, Y., Yan, B., Zhang, Y., Yuan, Z., Peng, B., Liu, X.: Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. In: CVPR. pp. 15733–15744 (2025)

[23] Hudson, D.A., Zoran, D., Malinowski, M., Lampinen, A.K., Jaegle, A., McClelland, J.L., Matthey, L., Hill, F., Lerchner, A.: Soda: Bottleneck diffusion models for representation learning. In: CVPR. pp. 23115–23127 (2024)

[24] Jin, Y., Sun, Z., Xu, K., Chen, L., Jiang, H., Huang, Q., Song, C., Liu, Y., Zhang, D., Song, Y., et al.: Video-lavit: Unified video-language pre-training with decoupled visual-motional tokenization. arXiv:2402.03161 (2024)

[25] Jin, Y., Xu, K., Chen, L., Liao, C., Tan, J., Huang, Q., Chen, B., Lei, C., Liu, A., Song, C., et al.: Unified language-vision pretraining in llm with dynamic discrete visual tokenization. arXiv:2309.04669 (2023)

[26] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: ICCV. pp. 4015–4026 (2023)

[27] Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Li, Y., Liu, Z., Li, C.: Llava-onevision: Easy visual task transfer. arXiv:2408.03326 (2024)

[28] Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision-language models. In: EMNLP. pp. 292–305 (2023)

[29] Li, Z., Liu, Z., Zhang, Q., Lin, B., Yuan, S., Yan, Z., Ye, Y., Yu, W., Niu, Y., Yuan, L.: Uniworld-v2: Reinforce image editing with diffusion negative-aware finetuning and mllm implicit feedback. arXiv:2510.16888 (2025)

[30] Liao, X., He, Q., Xu, K., Qu, X., Li, Y., Wei, W., Yao, A.: Va-π: Variational policy alignment for pixel-aware autoregressive generation. arXiv:2512.19680 (2025)

[31] Lin, B., Li, Z., Cheng, X., Niu, Y., Ye, Y., He, X., Yuan, S., Yu, W., Wang, S., Ge, Y., et al.: Uniworld: High-resolution semantic encoders for unified visual understanding and generation. arXiv:2506.03147 (2025)

[32] Liu, F., Emerson, G.E.T., Collier, N.: Visual spatial reasoning. Transactions of the Association for Computational Linguistics (2023)

[33] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)

[34] Liu, S., Han, Y., Xing, P., Yin, F., Wang, R., Cheng, W., Liao, J., Wang, Y., Fu, H., Han, C., Li, G., Peng, Y., Sun, Q., Wu, J., Cai, Y., Ge, Z., Ming, R., Xia, L., Zeng, X., Zhu, Y., Jiao, B., Zhang, X., Yu, G., Jiang, D.: Step1x-edit: A practical framework for general image editing. arXiv:2504...

[35] Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: Mmbench: Is your multi-modal model an all-around player? In: ECCV. pp. 216–233. Springer (2024)

[36] Liu, Y., Li, Z., Huang, M., Yang, B., Yu, W., Li, C., Yin, X.C., Liu, C.L., Jin, L., Bai, X.: Ocrbench: on the hidden mystery of ocr in large multimodal models. Science China Information Sciences 67(12), 220102 (2024)

[37] Lu, J., Clark, C., Lee, S., Zhang, Z., Khosla, S., Marten, R., Hoiem, D., Kembhavi, A.: Unified-io 2: Scaling autoregressive multimodal models with vision, language, audio, and action. arXiv:2312.17172 (2023)

[38] Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K.W., Galley, M., Gao, J.: Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In: ICLR (2024)

[39] Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to explain: Multimodal reasoning via thought chains for science question answering. In: NeurIPS (2022)

[40] Luo, R., Li, Y., Chen, L., He, W., Lin, T.E., Liu, Z., Zhang, L., Song, Z., Xia, X., Liu, T., et al.: Deem: Diffusion models serve as the eyes of large language models for image perception. arXiv:2405.15232 (2024)

[41] Ma, C., Jiang, Y., Wu, J., Yang, J., Yu, X., Yuan, Z., Peng, B., Qi, X.: Unitok: A unified tokenizer for visual generation and understanding. arXiv:2502.20321 (2025)

[42] Ma, S., Ge, Y., Wang, T., Guo, Y., Ge, Y., Shan, Y.: Genhancer: Imperfect generative models are secretly strong vision-centric enhancers. In: ICCV. pp. 24402–24412 (2025)

[43] Ma, Y., Liu, X., Chen, X., Liu, W., Wu, C., Wu, Z., Pan, Z., Xie, Z., Zhang, H., Yu, X., et al.: Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. In: CVPR. pp. 7739–7751 (2025)

[44] Mathew, M., Karatzas, D., Jawahar, C.: Docvqa: A dataset for vqa on document images. In: WACV. pp. 2200–2209 (2021)

[45] Niu, Y., Ning, M., Zheng, M., Jin, W., Lin, B., Jin, P., Liao, J., Ning, K., Feng, C., Zhu, B., Yuan, L.: Wise: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv:2503.07265 (2025)

[47] Pan, X., Shukla, S.N., Singh, A., Zhao, Z., Mishra, S.K., Wang, J., Xu, Z., Chen, J., Li, K., Juefei-Xu, F., et al.: Transfer between modalities with metaqueries. arXiv:2504.06256 (2025)

[48] Parihar, R., Sachidanand, V., Mani, S., Karmali, T., Venkatesh Babu, R.: Precisecontrol: Enhancing text-to-image diffusion models with fine-grained attribute control. In: ECCV. pp. 469–487. Springer (2024)

[49] Peng, X., Wei, Y., Deng, A., Wang, D., Hu, D.: Balanced multimodal learning via on-the-fly gradient modulation. In: CVPR (2022)

[50] Qu, L., Zhang, H., Liu, Y., Wang, X., Jiang, Y., Gao, Y., Ye, H., Du, D.K., Yuan, Z., Wu, X.: Tokenflow: Unified image tokenizer for multimodal understanding and generation. In: CVPR. pp. 2545–2555 (2025)

[51] Shi, W., Han, X., Zhou, C., Liang, W., Lin, X.V., Zettlemoyer, L., Yu, L.: Lmfusion: Adapting pretrained language models for multimodal generation. arXiv:2412.15188 (2024)

[52] Shipard, J., Wiliem, A., Thanh, K.N., Xiang, W., Fookes, C.: Diversity is definitely needed: Improving model-agnostic zero-shot classification via stable diffusion. In: CVPR. pp. 769–778 (2023)

[53] Su, Z., Wei, H., Cen, K., Wang, Y., Chen, G., Yuan, C., Chu, X.: Generation enhances understanding in unified multimodal models via multi-representation generation. arXiv:2601.21406 (2026)

[54] Tang, H., Xie, C., Bao, X., Weng, T., Li, P., Zheng, Y., Wang, L.: Unilip: Adapting clip for unified multimodal understanding, generation and editing. arXiv:2507.23278 (2025)

[55] Team, C.: Chameleon: Mixed-modal early-fusion foundation models. arXiv:2405.09818 (2024)

[56] Tian, K., Jiang, Y., Yuan, Z., Peng, B., Wang, L.: Visual autoregressive modeling: Scalable image generation via next-scale prediction. NeurIPS 37, 84839–84865 (2024)

[57] Tian, Y., Fan, L., Isola, P., Chang, H., Krishnan, D.: Stablerep: Synthetic images from text-to-image models make strong visual representation learners. NeurIPS 36, 48382–48402 (2023)

[58] Tong, P., Brown, E., Wu, P., Woo, S., Iyer, A.J.V., Akula, S.C., Yang, S., Yang, J., Middepogu, M., Wang, Z., et al.: Cambrian-1: A fully open, vision-centric exploration of multimodal llms. NeurIPS 37, 87310–87356 (2024)

[59] Tong, S., Fan, D., Zhu, J., Xiong, Y., Chen, X., Sinha, K., Rabbat, M., LeCun, Y., Xie, S., Liu, Z.: Metamorph: Multimodal understanding and generation via instruction tuning. arXiv:2412.14164 (2024)

[60] Tong, S., Liu, Z., Zhai, Y., Ma, Y., LeCun, Y., Xie, S.: Eyes wide shut? exploring the visual shortcomings of multimodal llms. In: CVPR. pp. 9568–9578 (2024)

[61] Wang, H., Zheng, A., Zhao, Y., Wang, T., Ge, Z., Zhang, X., Zhang, Z.: Reconstructive visual instruction tuning. arXiv:2410.09575 (2024)

[62] Wang, P., Peng, Y., Gan, Y., Hu, L., Xie, T., Wang, X., Wei, Y., Tang, C., Zhu, B., Li, C., et al.: Skywork unipic: Unified autoregressive modeling for visual understanding and generation. arXiv:2508.03320 (2025)

[63] Wang, W., Sun, Q., Zhang, F., Tang, Y., Liu, J., Wang, X.: Diffusion feedback helps clip see better. arXiv:2407.20171 (2024)

[64] Wang, X., Zhang, X., Luo, Z., Sun, Q., Cui, Y., Wang, J., Zhang, F., Wang, Y., Li, Z., Yu, Q., et al.: Emu3: Next-token prediction is all you need. arXiv:2409.18869 (2024)

[65] Wang, Y., Schiff, Y., Gokaslan, A., Pan, W., Wang, F., De Sa, C., Kuleshov, V.: Infodiffusion: Representation learning using information maximizing diffusion models. In: ICML. pp. 36336–36354. PMLR (2023)

[66] Wang, Z., Chen, Z., Gou, C., Li, F., Deng, C., Zhu, D., Li, K., Yu, W., Tu, H., Fan, H., et al.: Lightfusion: A light-weighted, double fusion framework for unified multimodal understanding and generation. arXiv:2510.22946 (2025)

[67] Wei, C., Mangalam, K., Huang, P.Y., Li, Y., Fan, H., Xu, H., Wang, H., Xie, C., Yuille, A., Feichtenhofer, C.: Diffusion models as masked autoencoders. In: ICCV. pp. 16284–16294 (2023)

[68] Weng, N., Pegios, P., Petersen, E., Feragen, A., Bigdeli, S.: Fast diffusion-based counterfactuals for shortcut removal and generation. In: ECCV. pp. 338–357. Springer (2024)

[69] Wu, C., Chen, X., Wu, Z., Ma, Y., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C., et al.: Janus: Decoupling visual encoding for unified multimodal understanding and generation. arXiv:2410.13848 (2024)

[70] Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., Li, W., Jiang, X., Liu, Y., Zhou, J., et al.: Omnigen2: Exploration to advanced multimodal generation. arXiv:2506.18871 (2025)

[71] Wu, G., Zhang, S., Shi, R., Gao, S., Chen, Z., Wang, L., Chen, Z., Gao, H., Tang, Y., Yang, J., et al.: Representation entanglement for generation: Training diffusion transformers is much easier than you think. arXiv:2507.01467 (2025)

[72] Wu, J., Jiang, Y., Ma, C., Liu, Y., Zhao, H., Yuan, Z., Bai, S., Bai, X.: Liquid: Language models are scalable and unified multi-modal generators. IJCV (2025)

[73] Wu, S., Wu, Z., Gong, Z., Tao, Q., Jin, S., Li, Q., Li, W., Loy, C.C.: Openuni: A simple baseline for unified multimodal understanding and generation. arXiv:2505.23661 (2025)

[74] Wu, S., Zhang, W., Xu, L., Jin, S., Wu, Z., Tao, Q., Liu, W., Li, W., Loy, C.C.: Harmonizing visual representations for unified multimodal understanding and generation. In: ICCV. pp. 17739–17750 (2025)

[75] Wu, Y., Zhang, Z., Chen, J., Tang, H., Li, D., Fang, Y., Zhu, L., Xie, E., Yin, H., Yi, L., et al.: Vila-u: a unified foundation model integrating visual understanding and generation. arXiv:2409.04429 (2024)

[76] xAI: Grok-1.5 vision preview. https://x.ai/news/grok-1.5v (2024)

[77] Xie, J., Darrell, T., Zettlemoyer, L., Wang, X.: Reconstruction alignment improves unified multimodal models. arXiv:2509.07295 (2025)

[78] Xie, J., Mao, W., Bai, Z., Zhang, D.J., Wang, W., Lin, K.Q., Gu, Y., Chen, Z., Yang, Z., Shou, M.Z.: Show-o: One single transformer to unify multimodal understanding and generation. arXiv:2408.12528 (2024)

[79] Xie, J., Yang, Z., Shou, M.Z.: Show-o2: Improved native unified multimodal models. arXiv:2506.15564 (2025)

[80] Yang, J., Yin, D., Zhou, Y., Rao, F., Zhai, W., Cao, Y., Zha, Z.J.: Mmar: Towards lossless multi-modal auto-regressive probabilistic modeling. In: CVPR. pp. 7974–7985 (2025)

[81] Yang, X., Wang, X.: Diffusion model as representation learner. In: ICCV. pp. 18938–18949 (2023)