arxiv: 2408.12528 · v7 · submitted 2024-08-22 · 💻 cs.CV

Recognition: 3 theorem links

· Lean Theorem

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

David Junhao Zhang, Jinheng Xie, Kevin Qinghong Lin, Mike Zheng Shou, Weihao Wang, Weijia Mao, Yuchao Gu, Zechen Bai, Zhenheng Yang, Zhijie Chen

Pith reviewed 2026-05-11 20:58 UTC · model grok-4.3

classification 💻 cs.CV

keywords multimodal understandingmultimodal generationunified transformerautoregressive modelingdiscrete diffusionvision-language taskstext-to-image generation

0 comments

The pith

A single transformer unifies multimodal understanding and generation by combining autoregressive and discrete diffusion modeling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents Show-o as a single transformer model capable of both understanding multimodal inputs and generating outputs across modalities. It achieves this by integrating autoregressive modeling, which processes data sequentially, with discrete diffusion modeling, which refines data in parallel steps. The model supports tasks such as visual question answering, text-to-image generation, and mixed modality operations. If the approach works as described, it could allow one model to replace multiple specialized systems for vision and language tasks. Readers might care because this points to simpler, more general AI architectures for handling diverse multimodal challenges.

Core claim

We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and generation. Unlike fully autoregressive models, Show-o unifies autoregressive and (discrete) diffusion modeling to adaptively handle inputs and outputs of various and mixed modalities. The unified model flexibly supports a wide range of vision-language tasks including visual question-answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation. Across various benchmarks, it demonstrates comparable or superior performance to existing individual models with an equivalent or larger number of parameters tailored for understanding or generation.

What carries the argument

Show-o, the unified transformer that integrates autoregressive modeling for sequential tasks and discrete diffusion for generative refinement to handle mixed modalities adaptively.

If this is right

The model can answer questions about images using its understanding capabilities.
It can generate images from text descriptions using the diffusion component.
It enables text-guided image editing such as inpainting and extrapolation.
It supports generation involving mixed text and image sequences.
It matches or exceeds the performance of separate specialized models using only one set of parameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This could simplify training and deployment by removing the need for separate understanding and generation models in production systems.
Interactive applications might allow real-time shifts from interpreting input to producing new content without model changes.
Scaling the hybrid approach may reveal different efficiency patterns than pure autoregressive or pure diffusion models.

Load-bearing premise

A single transformer can be trained to balance the requirements of autoregressive sequential prediction and discrete diffusion iterative refinement without major performance losses in understanding or generation tasks.

What would settle it

A direct comparison on a standard text-to-image benchmark where Show-o produces images with substantially higher FID scores than a dedicated diffusion model of the same size, or lower accuracy on a visual question answering benchmark than a specialized autoregressive vision-language model.

read the original abstract

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Show-o puts autoregressive and discrete diffusion into one transformer for mixed multimodal tasks and releases the code, but the real test is whether the unification delivers without noticeable trade-offs on either understanding or generation.

read the letter

Show-o tries to handle both understanding and generation in a single transformer by letting the model switch between autoregressive token prediction and discrete diffusion denoising depending on the input and output. That adaptive mix is the main technical move, and it covers a useful set of tasks: VQA, text-to-image, inpainting, extrapolation, and mixed-modality generation all from the same weights. Releasing code and models is straightforward and lets people check the claims directly, which is the right way to do this kind of work. The architecture description shows how they keep the same backbone while routing the modeling style per task, which avoids the need for two separate systems. That is a clean practical step beyond fully autoregressive unification attempts. The performance claims are that it matches or beats specialized models of similar or larger size across benchmarks. If the full experiments include proper ablations on the loss balancing and task-specific metrics, that would make the central argument solid. Right now the abstract leaves the strength of those numbers open, so the soft spot is whether the single model really avoids paying a price in one direction or the other. The math and setup look standard for the area with no obvious internal contradictions or circular steps. Citations are typical for multimodal transformer papers. This is worth a reading group for anyone working on unified vision-language models, because the implementation is public and the task coverage is broad. A reader who wants to see how far one transformer can stretch across understanding and generation will get concrete value from the details. It deserves peer review because the idea is timely, the release makes verification possible, and the unification claim is testable even if the results need tightening.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces Show-o, a single transformer architecture that unifies multimodal understanding and generation by integrating autoregressive token prediction with discrete diffusion denoising. The model adaptively handles mixed-modality inputs and outputs to support tasks including visual question answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation. It reports performance that is comparable or superior to specialized models of similar or larger scale across benchmarks, with code and weights publicly released.

Significance. If the empirical claims hold, the work is significant for demonstrating a practical unification of autoregressive and diffusion paradigms in one transformer, advancing toward generalist multimodal foundation models. The public release of code and models is a clear strength that supports reproducibility and independent verification of the adaptive handling mechanism.

major comments (2)

[§3.2] §3.2 and Eq. (3): the adaptive routing between autoregressive and diffusion paths is described at a high level but lacks a precise formulation of the conditioning or gating mechanism that decides the modeling mode for a given token or modality; without this, it is difficult to assess whether the unification introduces hidden task-specific biases that could undermine the 'single transformer' claim.
[Table 4] Table 4, VQA and text-to-image rows: the reported gains over baselines are modest (e.g., +1.2 accuracy points on VQAv2) yet the paper does not include an ablation isolating the contribution of the discrete diffusion component versus the shared transformer backbone; this makes it hard to confirm that unification does not incur the performance trade-off the central claim implicitly rules out.

minor comments (3)

The abstract states 'comparable or superior performance' but does not quantify parameter counts for all compared models; adding a column or footnote in the main results tables would improve clarity.
[Figure 2] Figure 2: the diagram of the unified forward pass would benefit from explicit labels distinguishing AR token prediction steps from diffusion denoising steps.
[§5.3] §5.3: the training details mention a combined loss but do not specify the weighting hyperparameter schedule; a short paragraph or equation would help readers reproduce the balancing.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment, constructive feedback, and recommendation for minor revision. We address each major comment below and will update the manuscript accordingly to improve clarity and strengthen the empirical support.

read point-by-point responses

Referee: [§3.2] §3.2 and Eq. (3): the adaptive routing between autoregressive and diffusion paths is described at a high level but lacks a precise formulation of the conditioning or gating mechanism that decides the modeling mode for a given token or modality; without this, it is difficult to assess whether the unification introduces hidden task-specific biases that could undermine the 'single transformer' claim.

Authors: We thank the referee for highlighting the need for greater precision here. Section 3.2 and Eq. (3) describe the unified objective that combines autoregressive cross-entropy loss on understanding tokens with discrete diffusion denoising loss on generation tokens, with the shared transformer processing a mixed sequence. The routing is conditioned explicitly on modality indicator tokens and task prompts prepended to the input, which control the attention mask (causal for AR segments, bidirectional for diffusion segments) and the loss applied to each position. No additional learned gating parameters are used. To eliminate any ambiguity about potential hidden biases, we will revise §3.2 to include a formal definition of the mode selection function, pseudocode for the forward pass, and an explicit statement that all decisions derive solely from the input conditioning rather than task-specific modules. revision: yes
Referee: Table 4, VQA and text-to-image rows: the reported gains over baselines are modest (e.g., +1.2 accuracy points on VQAv2) yet the paper does not include an ablation isolating the contribution of the discrete diffusion component versus the shared transformer backbone; this makes it hard to confirm that unification does not incur the performance trade-off the central claim implicitly rules out.

Authors: We appreciate the referee's observation that the numerical gains are modest in some settings and that an explicit ablation would better isolate the diffusion component's role. The central claim is that a single transformer can match or exceed specialized models without architectural specialization; Table 4 supports this by direct comparison to larger or task-specific baselines. Nevertheless, we agree that an ablation against a pure-AR variant on the identical backbone would provide stronger evidence against hidden trade-offs. In the revised manuscript we will add such an ablation (on VQAv2 and a text-to-image benchmark) comparing the full hybrid model to an AR-only ablation that uses the same transformer weights and training data but replaces the diffusion loss with standard next-token prediction. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents Show-o as an empirical architectural contribution: a single transformer that interleaves autoregressive token prediction with discrete diffusion denoising steps to handle mixed-modality inputs and outputs. All load-bearing claims (unification across VQA, text-to-image, inpainting, etc.) are justified by benchmark comparisons against specialized models and by the public release of code/weights for external reproduction. No equations or sections reduce a claimed prediction to a fitted parameter by construction, invoke a self-citation as a uniqueness theorem, or smuggle an ansatz via prior work. The derivation chain is therefore self-contained against external benchmarks rather than internally tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents identification of specific free parameters or axioms; the claim rests on the empirical effectiveness of the proposed hybrid modeling approach, which in ML papers typically involves many training hyperparameters and standard transformer assumptions.

pith-pipeline@v0.9.0 · 5456 in / 1057 out tokens · 41014 ms · 2026-05-11T20:58:40.213140+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.RealityFromDistinction reality_from_one_distinction unclear
We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and generation. Unlike fully autoregressive models, Show-o unifies autoregressive and (discrete) diffusion modeling to adaptively handle inputs and outputs of various and mixed modalities.
IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced unclear
Show-o innovatively unifies autoregressive and (discrete) diffusion modeling within one single transformer
IndisputableMonolith.Foundation.PhiForcing phi_forcing unclear
the unified model flexibly supports a wide range of vision-language tasks including visual question-answering, text-to-image generation

Forward citations

Cited by 39 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Flow-GRPO: Training Flow Matching Models via Online RL
cs.CV 2025-05 unverdicted novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
cs.CV 2026-05 unverdicted novelty 7.0

INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models
cs.CV 2026-05 unverdicted novelty 7.0

ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.
Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models
cs.CV 2026-04 unverdicted novelty 7.0

XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning...
Exploring Spatial Intelligence from a Generative Perspective
cs.CV 2026-04 unverdicted novelty 7.0

Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.
UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models
cs.CV 2026-04 unverdicted novelty 7.0

UDM-GRPO is the first RL integration for uniform discrete diffusion models, using final clean samples as actions and forward-process trajectory reconstruction to raise GenEval accuracy from 69% to 96% and OCR accuracy...
IAD-Unify: A Region-Grounded Unified Model for Industrial Anomaly Segmentation, Understanding, and Generation
cs.CV 2026-04 unverdicted novelty 7.0

IAD-Unify unifies industrial anomaly segmentation, region-grounded language understanding, and mask-guided generation in one framework using DINOv2 token injection into Qwen3.5, supported by the new Anomaly-56K datase...
Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models
cs.CV 2026-04 unverdicted novelty 7.0

Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.
Learning Vision-Language-Action World Models for Autonomous Driving
cs.CV 2026-04 unverdicted novelty 7.0

VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
cs.CV 2025-05 unverdicted novelty 7.0

DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.
AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward
cs.CV 2026-05 unverdicted novelty 6.0

AlphaGRPO uses GRPO on unified multimodal models together with decompositional verifiable rewards to unlock self-reflective reasoning and refinement, yielding benchmark gains in generation and zero-shot editing.
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
cs.CV 2026-05 unverdicted novelty 6.0

V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
cs.CV 2026-05 unverdicted novelty 6.0

Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...
PresentAgent-2: Towards Generalist Multimodal Presentation Agents
cs.CV 2026-05 unverdicted novelty 6.0

PresentAgent-2 generates query-driven multimodal presentation videos with research grounding, supporting single-speaker, multi-speaker discussion, and interactive question-answering modes.
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
cs.CV 2026-05 unverdicted novelty 6.0

Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...
MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality
cs.CV 2026-05 unverdicted novelty 6.0

MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.
SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness
cs.CV 2026-04 unverdicted novelty 6.0

SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.
Meta-CoT: Enhancing Granularity and Generalization in Image Editing
cs.CV 2026-04 unverdicted novelty 6.0

Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.
CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging
cs.CV 2026-04 unverdicted novelty 6.0

CheXmix combines masked autoencoder pretraining with early-fusion generative modeling to outperform prior models on chest X-ray classification by up to 8.6% AUROC, inpainting by 51%, and report generation by 45% on GREEN.
Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs
cs.CV 2026-04 unverdicted novelty 6.0

IMU-to-4D uses wearable IMU data and repurposed LLMs to predict coherent 4D human motion plus coarse scene structure, outperforming cascaded state-of-the-art pipelines in temporal stability.
Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens
cs.CV 2026-04 unverdicted novelty 6.0

Viewpoint tokens learned on a mixed 3D-rendered and photorealistic dataset enable precise camera control in text-to-image generation while factorizing geometry from appearance and transferring to unseen object categories.
Combating Pattern and Content Bias: Adversarial Feature Learning for Generalized AI-Generated Image Detection
cs.CV 2026-04 unverdicted novelty 6.0

MAFL uses adversarial training to suppress pattern and content biases, guiding models to learn shared generative features for better cross-model generalization in detecting AI images.
Nucleus-Image: Sparse MoE for Image Generation
cs.CV 2026-04 unverdicted novelty 6.0

A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.
Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator
cs.CV 2026-04 unverdicted novelty 6.0

Uni-ViGU unifies video generation and understanding by extending a diffusion video generator with unified continuous-discrete flow matching, modality-driven MoE layers, and bidirectional training stages that repurpose...
Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding
cs.CV 2026-04 unverdicted novelty 6.0

Symbiotic-MoE introduces modality-aware expert disentanglement and progressive training in a multimodal MoE to achieve synergistic generation and understanding without task interference or extra parameters.
MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation
cs.CV 2026-04 unverdicted novelty 6.0

MAR-GRPO stabilizes GRPO for AR-diffusion hybrids via multi-trajectory expectation and uncertainty-based token selection, yielding better visual quality, stability, and spatial understanding than baselines.
CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models
cs.CV 2026-04 unverdicted novelty 6.0

CLEAR uses degradation-aware fine-tuning, a latent representation bridge, and interleaved reinforcement learning to connect generative and reasoning capabilities in multimodal models for better degraded image understanding.
ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving
cs.CV 2026-04 unverdicted novelty 6.0

ExploreVLA augments VLA driving models with future RGB and depth prediction for dense supervision and uses prediction uncertainty as a safety-gated intrinsic reward for RL-based exploration, reaching SOTA PDMS 93.7 on NAVSIM.
Emu3: Next-Token Prediction is All You Need
cs.CV 2024-09 unverdicted novelty 6.0

Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
cs.CV 2026-05 unverdicted novelty 5.0

SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence
cs.CV 2026-05 unverdicted novelty 5.0

Integrating generative novel-view synthesis into LMM reasoning loops improves accuracy on spatial subtasks by 1.3 to 3.9 percentage points across multiple models and tasks.
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
cs.CV 2026-04 unverdicted novelty 5.0

Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
cs.CV 2025-06 unverdicted novelty 5.0

UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.
Emerging Properties in Unified Multimodal Pretraining
cs.CV 2025-05 unverdicted novelty 5.0

BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
cs.CV 2025-05 conditional novelty 5.0

BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.
Hallucination of Multimodal Large Language Models: A Survey
cs.CV 2024-04 accept novelty 5.0

The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
cs.CV 2026-05 unverdicted novelty 4.0

Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.
TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training
cs.AI 2026-04 unverdicted novelty 4.0

TorchUMM is the first unified codebase and benchmark suite for standardized evaluation of diverse unified multimodal models on understanding, generation, and editing tasks.
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
cs.AI 2025-01 conditional novelty 3.0

Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · cited by 39 Pith papers · 16 internal anchors

[1]

Gemini: A Family of Highly Capable Multimodal Models

Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 1,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. CoRR, abs/2308.12966,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Hallucination of Multimodal Large Language Models: A Survey

Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucination of multimodal large language models: A survey.arXiv preprint arXiv:2404.18930,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Muse: Text-to-image generation via masked generative transformers

Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image gen- eration via masked generative transformers.arXiv preprint arXiv:2301.00704,

work page arXiv
[5]

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

11 Published as a conference paper at ICLR 2025 Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions.arXiv preprint arXiv:2311.12793,

work page internal anchor Pith review arXiv 2025
[6]

Seed-x: Multimodal models with unified multi-granularity comprehension and generation

Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396,

work page arXiv
[7]

Generative adversarial nets.NeurIPS,

12 Published as a conference paper at ICLR 2025 Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets.NeurIPS,

work page 2025
[8]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Auto-Encoding Variational Bayes

Diederik P Kingma and Max Welling. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Videopoet: A large language model for zero-shot video generation.arXiv:2312.14125,

Dan Kondratyuk, Lijun Yu, Xiuye Gu, Jos ´e Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, et al. Videopoet: A large language model for zero-shot video generation.arXiv preprint arXiv:2312.14125,

work page arXiv
[11]

Textbooks Are All You Need II: phi-1.5 technical report

Yuanzhi Li, S´ebastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need ii: phi-1.5 technical report.arXiv preprint arXiv:2309.05463,

work page internal anchor Pith review arXiv
[12]

McKinzie, Z

Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, et al. Mm1: Methods, analysis & insights from multimodal llm pre-training.arXiv preprint arXiv:2403.09611,

work page arXiv
[13]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

13 Published as a conference paper at ICLR 2025 Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models.arXiv preprint arXiv:2112.10741,

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text- conditional image generation with CLIP latents.CoRR, abs/2204.06125, 2022a. Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text- conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 1(2):3, 2022b. Su...

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525,

work page internal anchor Pith review arXiv
[17]

Generative multimodal models are in-context learners.CoRR, abs/2312.13286, 2023a

Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners.CoRR, abs/2312.13286, 2023a. 14 Published as a conference paper at ICLR 2025 Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongche...

work page arXiv 2025
[18]

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Cambrian-1: A fully open, vision-centric exploration of multimodal llms.arXiv preprint arXiv:2406.16860,

Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms.arXiv preprint arXiv:2406.16860,

work page arXiv
[20]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth ´ee Lacroix, Baptiste Rozi `ere, Naman Goyal, Eric Hambro, Faisal Azhar, Aur ´elien Rodriguez, Ar- mand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models.CoRR, abs/2302.13971,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Emu3: Next-Token Prediction is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.NeurIPS, 30, 2017a. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.NeurIPS, 2017b. Xinlong Wang, Xiaos...

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Small-scale proxies for large-scale transformer training instabilities

Mitchell Wortsman, Peter J Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D Co- Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, et al. Small-scale proxies for large-scale transformer training instabilities.arXiv preprint arXiv:2309.14322,

work page arXiv
[23]

Next-gpt: Any-to-any multimodal llm

Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. InICCV, 2023a. Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. Next-gpt: Any-to-any multi- modal llm.arXiv preprint arXi...

work page arXiv
[24]

X-vila: Cross-modality alignment for large language model.arXiv preprint arXiv:2405.19335, 2024a

Hanrong Ye, De-An Huang, Yao Lu, Zhiding Yu, Wei Ping, Andrew Tao, Jan Kautz, Song Han, Dan Xu, Pavlo Molchanov, et al. X-vila: Cross-modality alignment for large language model.arXiv preprint arXiv:2405.19335, 2024a. Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, and Fei Huang. mplug-owl2: Revolutionizing multi-modal...

work page arXiv 2025
[25]

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

Lijun Yu, Jos´e Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, et al. Language model beats diffusion– tokenizer is key to visual generation.arXiv preprint arXiv:2310.05737,

work page internal anchor Pith review arXiv
[26]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

URLhttps://api. semanticscholar.org/CorpusID:271909855. Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models.CoRR, abs/2304.10592, 2023a. Jinguo Zhu, Xiaohan Ding, Yixiao Ge, Yuying Ge, Sijie Zhao, Hengshuang Zhao, Xiaohua Wang, and Ying Shan. VL-GPT: A ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Further, MaskGIT (Chang et al.,

and Copilot4D (Zhang et al., 2024). Further, MaskGIT (Chang et al.,

work page 2024
[28]

=q(x t−1|xt)because of the Markov property. In the following, we introduce theAbsorbing-Uniform Discrete Diffusionby defining the stochas- tic transition matrixQ t as follows: Qt =Q a t Qu t ,(7) wheree m is a one-hot vector with a value of 1 at the index of[MASK]token,Q a t = (1−α t)I+ αt1e⊤ m, andQ u t =I−β t(I−e me⊤ m) + βt (K+1) (1−e m)(1−e m)⊤. Here,...

work page 2025
[29]

and the following parameterization of the reverse process: pθ(xt−1|xt) = X x0 q(xt−1|xt,x 0)pθ(x0|xt),(10) the variational lower bound can be further expressed under the image distributionq(x 0)(referring to the proof provided by Zhang et al. (2024) as detailed in the Appendix B): Eq(x0)[logp θ(x0)]≥E q(x0)[−LELBO(x0, θ)]≥ TX t=1 Eq(x0)q(xt|x0)[logp θ(x0|...

work page 2024
[30]

has successfully scaled up such a paradigm for text-to-image models of 3B parameters using 460M image-text pairs. 18 Published as a conference paper at ICLR 2025 B ALTERNATIVELOWERBOUND FOR THE VARIATIONAL DIFFUSION Eq(x0)[logp θ(x0)] =E q(x0)[log Z pθ(x0,x 1 · · ·x T )dx1 · · ·x T ] =E q(x0) logE q(x1:T |x0) pθ(x0:T−1 |xT ) q(x1:T |x0) p(xT ) ≥E q(x0)q(x...

work page 2025
[31]

Here, we directly lever- age the class names from ImageNet-1K as textual inputs for learning class-conditional image gen- eration

and large-scale image-text pairs are adopted to train Show- o for class-conditional image generation and image captioning, respectively. Here, we directly lever- age the class names from ImageNet-1K as textual inputs for learning class-conditional image gen- eration. This stage primarily involves the learning of new learnable embeddings for discrete image...

work page 2025
[32]

dataset. iii)Image-Text Data: For pre-training tasks correspond to multi- modal understanding and generation, we assemble roughly 35M image-text pairs from the publicly available datasets including CC12M (Changpinyo et al., 2021), SA1B (Kirillov et al., 2023), and LAION-aesthetics-12M† . Additionally, we further increase the data scale to around 2.0B by i...

work page 2021
[33]

A 3D render of a futuristic car made of glass, driving through a city of mirrors

is utilized for mixed-modality generation. F IMPLEMENTATIONDETAILS We initially conduct joint training of Show-o using the RefinedWeb, a collection of image-text pairs, and the ImageNet-1K for language modeling, image captioning, and class-conditional image gener- ation, respectively, over 500K steps. Subsequently, we replace the class-conditional generat...

work page 2023
[34]

Abluecardrivespastawhitepicketfenceonasunnyday

I ABLATIONSTUDIES Impact of Vision Encoder for Multimodal Understanding.The default Show-o employs MAGVIT-v2 to encode images into discrete tokens for both multimodal understanding and gen- eration. Inspired by the literature (Liu et al., 2024b), we investigate the impact of the most popular design choice of vision encoder,i.e.,the pre-trained CLIP ViT (R...

work page 2021
[35]

Increasing the sampling steps to25allows the synthesis of an image that closely adheres to the prompt

With just five steps, Show-o can produce an image that is roughly related to the given prompt. Increasing the sampling steps to25allows the synthesis of an image that closely adheres to the prompt. When the sampling step is set as 50, the generated image becomes more detailed and realistic. In contrast, auto-regressive models Team (2024); Sun 24 Published...

work page 2024