Emu: Generative Pretraining in Multimodality
Pith reviewed 2026-05-16 20:18 UTC · model grok-4.3
The pith
A single Transformer model generates images and text by autoregressively predicting the next token or visual embedding from interleaved inputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that encoding visual signals into embeddings and training a Transformer autoregressively on interleaved multimodal sequences, with a unified objective of next-text-token classification or next-visual-embedding regression, yields a single model that performs both image-to-text and text-to-image tasks at competitive levels.
What carries the argument
The unified next-token and next-embedding prediction objective applied to sequences mixing text tokens and visual embeddings.
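To make this objective concrete, here is a minimal PyTorch-style sketch; it illustrates the recipe described above, not the authors' implementation, and the function and argument names (unified_autoregressive_loss, is_visual_next, text_head, visual_head) are placeholders, as is the 1:1 weighting between the two terms.

```python
import torch
import torch.nn.functional as F

def unified_autoregressive_loss(hidden_states, next_token_ids, next_visual_emb,
                                is_visual_next, text_head, visual_head):
    """Sketch: classify the next text token or regress the next visual embedding.

    hidden_states:   (seq_len, d_model) transformer outputs
    next_token_ids:  (seq_len,) target token ids, read only where the next
                     position is text
    next_visual_emb: (seq_len, d_visual) target embeddings, read only where the
                     next position is visual
    is_visual_next:  (seq_len,) bool mask, True where the next position is a
                     visual embedding rather than a text token
    """
    # Text positions: standard next-token cross-entropy.
    text_logits = text_head(hidden_states[~is_visual_next])       # (n_text, vocab)
    loss_text = F.cross_entropy(text_logits, next_token_ids[~is_visual_next])

    # Visual positions: regress the next visual embedding.
    visual_pred = visual_head(hidden_states[is_visual_next])      # (n_vis, d_visual)
    loss_visual = F.mse_loss(visual_pred, next_visual_emb[is_visual_next])

    # One scalar loss drives both modalities; the relative weighting is a free
    # parameter of the recipe and is assumed to be 1:1 here.
    return loss_text + loss_visual
```

A real implementation would also mask padding and handle batches containing only one modality; the point is only that a single autoregressive loss drives both heads.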
If this is right
- Handles interleaved image, text, and video inputs indiscriminately.
- Shows strong zero-shot and few-shot performance on image captioning, VQA, video QA, and text-to-image generation.
- Supports in-context multimodal generation.
- Extends to instruction-tuned multimodal assistants.
Where Pith is reading between the lines
- Applying the same unified objective to additional modalities like audio could create broader foundation models.
- Reducing reliance on modality-specific architectures may simplify scaling to new data sources.
- The approach opens questions about whether regression on embeddings captures fine-grained visual details as effectively as specialized generative models.
Load-bearing premise
That jointly training on next text token and next visual embedding prediction will produce coherent and useful generations across modalities without dedicated per-modality losses or decoders.
What would settle it
If evaluations show that Emu underperforms significantly on text-to-image generation metrics compared to models trained with diffusion or other specialized objectives, while still doing well on understanding tasks, that would falsify the claim that the unified objective suffices for generation.
Original abstract
We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and texts in multimodal context. This omnivore model can take in any single-modality or multimodal data input indiscriminately (e.g., interleaved image, text and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, and together with text tokens form an interleaved input sequence. Emu is then end-to-end trained with a unified objective of classifying the next text token or regressing the next visual embedding in the multimodal sequence. This versatile multimodality empowers the exploration of diverse pretraining data sources at scale, such as videos with interleaved frames and text, webpages with interleaved images and text, as well as web-scale image-text pairs and video-text pairs. Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks, and supports in-context image and text generation. Across a broad range of zero-shot/few-shot tasks including image captioning, visual question answering, video question answering and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models. Extended capabilities such as multimodal assistants via instruction tuning are also demonstrated with impressive performance.
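As a rough sketch of the abstract's sentence "visual signals are encoded into embeddings, and together with text tokens form an interleaved input sequence", the module below assembles such a sequence. The segment format, the linear projection, and all dimensions are assumptions made for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

class InterleavedEmbedder(nn.Module):
    """Illustrative module: turn an interleaved (text, image, text, ...) document
    into one embedding sequence for a decoder-only transformer."""

    def __init__(self, vocab_size: int, d_model: int, d_visual: int):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Linear projection aligning visual-encoder outputs with the language
        # model width; the paper's actual alignment module may differ.
        self.visual_proj = nn.Linear(d_visual, d_model)

    def forward(self, segments):
        # segments: list of ("text", LongTensor of token ids) or
        #           ("image", FloatTensor of shape (n_patches, d_visual))
        parts = []
        for kind, payload in segments:
            if kind == "text":
                parts.append(self.tok_emb(payload))        # (n_tokens, d_model)
            else:
                parts.append(self.visual_proj(payload))    # (n_patches, d_model)
        return torch.cat(parts, dim=0)                      # (seq_len, d_model)

# Hypothetical usage: a caption, one image encoded to 64 patch embeddings, a question.
embedder = InterleavedEmbedder(vocab_size=32000, d_model=4096, d_visual=1024)
sequence = embedder([("text", torch.randint(0, 32000, (12,))),
                     ("image", torch.randn(64, 1024)),
                     ("text", torch.randint(0, 32000, (8,)))])
```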
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Emu, a Transformer-based multimodal foundation model that encodes visual signals (images and video frames) into embeddings, interleaves them with text tokens, and trains end-to-end via a single autoregressive objective: next-token classification for text and next-embedding regression for visuals. The model is claimed to handle arbitrary single- or multi-modal inputs (including interleaved video-text and webpage data) and to achieve superb zero-shot/few-shot performance on image captioning, visual question answering, video question answering, and text-to-image generation relative to prior large multimodal models; extended instruction-tuned capabilities are also shown.
Significance. If the performance claims are substantiated, the work would be significant for demonstrating that a single unified autoregressive objective on interleaved embeddings can produce coherent cross-modal generation without modality-specific losses or architectures, thereby simplifying scaling to diverse pretraining corpora such as videos with interleaved frames and web-scale image-text pairs.
major comments (2)
- [Abstract] Abstract and results sections: the central claim of 'superb performance' across captioning, VQA, video QA, and text-to-image tasks is unsupported by any quantitative benchmark scores, tables, or direct comparisons to SOTA models; without these numbers it is impossible to evaluate whether the unified next-token/next-embedding objective actually delivers the claimed gains or whether hidden per-modality engineering was required.
- [Abstract] Training objective description: the unified loss is described only at a high level (next-text-token classification or next-visual-embedding regression) with no specification of the relative weighting between the two heads, the precise regression loss (e.g., L2 vs. cosine), or the projection layers that map visual embeddings into the shared sequence; these details are load-bearing for the claim that modality-specific components were truly omitted.
minor comments (1)
- [Abstract] The abstract uses the term 'omnivore model' without prior definition; a brief parenthetical gloss would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. The manuscript contains quantitative results and training details in the full text, but we agree the abstract can be strengthened for clarity. We address each point below and revise accordingly.
Point-by-point responses
- Referee: [Abstract] Abstract and results sections: the central claim of 'superb performance' across captioning, VQA, video QA, and text-to-image tasks is unsupported by any quantitative benchmark scores, tables, or direct comparisons to SOTA models; without these numbers it is impossible to evaluate whether the unified next-token/next-embedding objective actually delivers the claimed gains or whether hidden per-modality engineering was required.
Authors: The full manuscript includes quantitative benchmark scores and direct SOTA comparisons in Section 4 (Experiments), with tables reporting CIDEr/BLEU for captioning, accuracy for VQA and video QA, and FID/IS for text-to-image generation against models such as Flamingo, BLIP-2, and Parti. We will revise the abstract to include key numerical highlights (e.g., specific zero-shot scores) to make the performance claims more concrete without exceeding length limits. revision: partial
- Referee: [Abstract] Training objective description: the unified loss is described only at a high level (next-text-token classification or next-visual-embedding regression) with no specification of the relative weighting between the two heads, the precise regression loss (e.g., L2 vs. cosine), or the projection layers that map visual embeddings into the shared sequence; these details are load-bearing for the claim that modality-specific components were truly omitted.
Authors: We agree these details strengthen the unified-objective claim. Section 3.2 of the manuscript specifies equal weighting between the classification and regression heads, MSE (L2) loss on normalized visual embeddings for regression, and linear projection layers to align visual encoder outputs with the shared transformer dimension. We have added a concise version of these specifications to the abstract in the revised manuscript. revision: yes
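If the specifics stated in this simulated rebuttal held (equal weighting between heads, MSE on normalized visual embeddings, a linear regression head), the combined loss would reduce to something like the sketch below. These details are restated from the rebuttal for illustration and are not verified against the released model.

```python
import torch.nn.functional as F

def rebuttal_style_loss(text_hidden, next_token_ids, visual_hidden, next_visual_emb,
                        lm_head, regression_head):
    """Two-head loss as described in the simulated rebuttal (assumed, not verified)."""
    # Next-text-token classification head.
    lm_loss = F.cross_entropy(lm_head(text_hidden), next_token_ids)

    # Linear regression head with L2 loss on normalized visual embedding targets.
    pred = regression_head(visual_hidden)
    target = F.normalize(next_visual_emb, dim=-1)
    reg_loss = F.mse_loss(pred, target)

    # Equal (1:1) weighting between the two heads.
    return lm_loss + reg_loss
```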
Circularity Check
No circularity: empirical training procedure with external benchmarks
full rationale
The paper presents Emu as a Transformer trained end-to-end on interleaved visual embeddings and text tokens using a unified next-token classification / next-embedding regression objective. No equations, fitted parameters, or self-citations are shown that reduce the claimed multimodal coherence or task performance to internal definitions by construction. All performance assertions reference external zero-shot/few-shot benchmarks rather than tautological internal quantities, leaving the derivation self-contained as a standard autoregressive pretraining recipe.
Axiom & Free-Parameter Ledger
free parameters (2)
- visual embedding dimension and projection layers
- training data mixture ratios
axioms (1)
- domain assumption: Visual signals can be encoded into fixed-length embeddings that are compatible with text token sequences for autoregressive modeling.
Forward citations
Cited by 17 Pith papers
- Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.
- Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
- SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
- CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging
CheXmix combines masked autoencoder pretraining with early-fusion generative modeling to outperform prior models on chest X-ray classification by up to 8.6% AUROC, inpainting by 51%, and report generation by 45% on GREEN.
- Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation
MDPD mutually distills knowledge between a frozen backbone and a learnable side network during fine-tuning, then discards the side network at inference to accelerate speed by at least 25% while preserving accuracy.
- MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning
MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter an...
- Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding
Symbiotic-MoE introduces modality-aware expert disentanglement and progressive training in a multimodal MoE to achieve synergistic generation and understanding without task interference or extra parameters.
- Mind the Gap No More: Achieving Zero-Gap Multimodal Integration via One Tokenizer
One Tokenizer achieves zero-gap multimodal integration by mapping all inputs to a unified token vocabulary, allowing native LLMs to perform deep cross-modal reasoning without modular encoders or fusion layers, and out...
- MMaDA: Multimodal Large Diffusion Language Models
MMaDA is a unified multimodal diffusion model using mixed chain-of-thought fine-tuning and a new UniGRPO reinforcement learning algorithm that outperforms specialized models in reasoning, understanding, and text-to-im...
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.
- Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Qwen-Audio trains a unified model on diverse audio and tasks with hierarchical tags to enable strong zero-shot performance on audio understanding benchmarks and multi-turn audio chat.
- Emerging Properties in Unified Multimodal Pretraining
BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
- DeepSeek-VL: Towards Real-World Vision-Language Understanding
DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder,...
- Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.
- A Survey on Multimodal Large Language Models
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
- Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.
discussion (0)