pith. machine review for the scientific record.

arxiv: 2505.14683 · v3 · submitted 2025-05-20 · 💻 cs.CV

Recognition: 1 theorem link

Emerging Properties in Unified Multimodal Pretraining

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords unified multimodal model · decoder-only pretraining · interleaved data · emergent reasoning · multimodal generation · image manipulation · world navigation

The pith

A unified decoder-only model pretrained on trillions of interleaved multimodal tokens exhibits emerging capabilities in complex multimodal reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BAGEL as an open-source foundational model that supports both multimodal understanding and generation within a single decoder-only architecture. It is pretrained on a curated collection of trillions of tokens drawn from large-scale interleaved text, image, video, and web data. The central observation is that scaling this unified pretraining produces new abilities for complex multimodal reasoning that were not present in smaller or less diverse training runs. These abilities let the model outperform other open-source unified models on standard benchmarks for generation and understanding while also handling tasks such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation.

Core claim

BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning.

What carries the argument

The unified decoder-only architecture pretrained on interleaved multimodal data at trillion-token scale, which processes sequences containing mixed modalities in a single forward pass.
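To make that single-forward-pass claim concrete, here is a minimal sketch of a decoder-only model consuming an interleaved text/image sequence. Everything in it (the class name, the patch projection, the dimensions) is a hypothetical illustration of the general technique, not BAGEL's actual tokenizer or architecture.

```python
import torch
import torch.nn as nn

class InterleavedDecoder(nn.Module):
    """Toy decoder-only model over one interleaved text/image token sequence."""

    def __init__(self, vocab_size=32000, d_model=256, n_layers=2, n_heads=4,
                 patch_dim=3 * 16 * 16):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.image_proj = nn.Linear(patch_dim, d_model)  # flattened 16x16 RGB patches
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        # An encoder stack with a causal mask behaves like a decoder-only model.
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, image_patches, layout):
        # layout lists ("text", slice) / ("image", slice) spans in reading order;
        # all spans are embedded and concatenated into ONE mixed-modality sequence.
        pieces = []
        for kind, span in layout:
            if kind == "text":
                pieces.append(self.text_embed(text_ids[:, span]))
            else:
                pieces.append(self.image_proj(image_patches[:, span]))
        x = torch.cat(pieces, dim=1)                      # (batch, total_len, d_model)
        seq_len = x.size(1)
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h = self.blocks(x, mask=causal)                   # single forward pass
        return self.lm_head(h)                            # next-token logits

# Smoke test on a [text, image, text] interleaved sample.
model = InterleavedDecoder()
text_ids = torch.randint(0, 32000, (1, 8))
image_patches = torch.randn(1, 4, 3 * 16 * 16)            # four flattened patches
layout = [("text", slice(0, 5)), ("image", slice(0, 4)), ("text", slice(5, 8))]
print(model(text_ids, image_patches, layout).shape)       # torch.Size([1, 12, 32000])
```

The only point carried over from the review is that text embeddings and projected image patches share one sequence and one causal attention stack; how the real model tokenizes images and decodes them back into pixels is outside the scope of this sketch.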

Load-bearing premise

That the observed performance gains and new reasoning abilities result specifically from the unified decoder-only pretraining on the curated interleaved data rather than from other unstated factors such as model size or data quality details.

What would settle it

A controlled comparison in which a model of comparable size is trained on the same total token volume but with modalities presented separately rather than interleaved, then tested for the presence of the advanced reasoning abilities.
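Read as an experiment design, that test is a two-arm pretraining comparison in which only the data layout differs. The sketch below is a hypothetical spec of such a comparison; the arm names, token budget, model size, and probe list are placeholder assumptions, not settings or results from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PretrainArm:
    """One arm of the hypothetical controlled comparison."""
    name: str
    data_layout: str          # "interleaved" vs. "separated"
    total_tokens: float       # same token budget in both arms
    n_params: float           # same model size in both arms
    optimizer: str = "AdamW"  # same optimizer and schedule in both arms

# Both arms match size, data volume, and optimizer; only the layout differs.
BUDGET_TOKENS = 2e12
MODEL_PARAMS = 7e9

arms = [
    PretrainArm("interleaved", "interleaved", BUDGET_TOKENS, MODEL_PARAMS),
    PretrainArm("separated", "separated", BUDGET_TOKENS, MODEL_PARAMS),
]

# The claim would be settled by running the same probes of the "emerging"
# abilities on both arms and checking whether they appear only when the
# modalities are interleaved.
EMERGENCE_PROBES = [
    "free_form_image_manipulation",
    "future_frame_prediction",
    "3d_manipulation",
    "world_navigation",
]

for arm in arms:
    print(f"{arm.name}: layout={arm.data_layout}, tokens={arm.total_tokens:.0e}, "
          f"params={arm.n_params:.0e}, probes={len(EMERGENCE_PROBES)}")
```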

read the original abstract

Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further opportunities for multimodal research, we share the key findings, pretraining details, data creation protocol, and release our code and checkpoints to the community. The project page is at https://bagel-ai.org/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces BAGEL, an open-source unified decoder-only multimodal model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. It claims that this scaling produces emerging capabilities in complex multimodal reasoning, leading to significant outperformance over open-source unified models on standard benchmarks for both multimodal generation and understanding, plus advanced abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. Pretraining details, data creation protocol, code, and checkpoints are released to support further research.

Significance. If the empirical claims are substantiated with proper controls and quantitative evidence, the work would demonstrate the viability of unified decoder-only pretraining on diverse interleaved multimodal data at trillion-token scale, highlighting potential emergent reasoning properties. The explicit release of code and checkpoints strengthens reproducibility and enables community follow-up, which is a clear positive for the field.

major comments (2)
  1. Abstract: strong claims of benchmark outperformance and emerging abilities (free-form manipulation, future frame prediction, 3D/world navigation) are asserted without any quantitative scores, tables, figures, or error analysis, leaving the central empirical claims unsupported by visible evidence.
  2. Experiments section (and associated ablations): no controlled comparisons are presented that hold model scale, optimizer, and total compute fixed while varying only the unified decoder-only architecture and interleaved multimodal mixture versus text-only or non-unified baselines; this prevents secure causal attribution of gains to the claimed pretraining approach rather than unstated factors such as data quality or scale.
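For a sense of what holding total compute fixed means here, a common rule of thumb for dense transformers estimates training compute as roughly C ≈ 6 * N * D FLOPs for N parameters and D training tokens. The snippet below applies that approximation to purely illustrative sizes; none of the numbers come from the paper.

```python
# Rough training-compute estimate for a dense decoder-only transformer,
# using the common approximation C ≈ 6 * N * D (parameters * tokens).
# All sizes below are illustrative assumptions, not BAGEL's configuration.

def train_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

candidate_runs = {
    "unified, interleaved data": (7e9, 2e12),
    "unified, separated data":   (7e9, 2e12),   # same compute, different mixture
    "text-only baseline":        (7e9, 2e12),   # same compute, no visual tokens
}

for name, (n, d) in candidate_runs.items():
    print(f"{name:28s} ~{train_flops(n, d):.1e} FLOPs")
# Each arm is a ~8.4e22-FLOP run at these placeholder sizes, which is the
# cost the simulated rebuttal below cites when arguing that fully controlled
# comparisons are infeasible at this scale.
```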

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The two major comments highlight important aspects of presentation and experimental rigor. We address each point below, indicating where revisions will be made to the manuscript.

read point-by-point responses
  1. Referee: Abstract: strong claims of benchmark outperformance and emerging abilities (free-form manipulation, future frame prediction, 3D/world navigation) are asserted without any quantitative scores, tables, figures, or error analysis, leaving the central empirical claims unsupported by visible evidence.

    Authors: We agree that the abstract would benefit from explicit quantitative anchors. The full paper reports benchmark results in Section 4 (with Tables 1–4 providing exact scores on multimodal understanding and generation tasks) and qualitative demonstrations of the advanced capabilities in Section 5. We will revise the abstract to include the primary numerical improvements (e.g., the reported gains over prior open-source unified models) and will add parenthetical references to the relevant tables and figures. This change will be made in the next revision. revision: yes

  2. Referee: Experiments section (and associated ablations): no controlled comparisons are presented that hold model scale, optimizer, and total compute fixed while varying only the unified decoder-only architecture and interleaved multimodal mixture versus text-only or non-unified baselines; this prevents secure causal attribution of gains to the claimed pretraining approach rather than unstated factors such as data quality or scale.

    Authors: We acknowledge the value of perfectly controlled ablations. At trillion-token scale, however, training multiple models while strictly holding architecture, optimizer, and total compute constant is not feasible within reasonable resource limits. Our experimental design instead compares BAGEL against published open-source unified models that report comparable training scales and data volumes, and we include targeted ablations on data mixture and model components that were computationally tractable. We will add a new subsection (likely in Section 4.4 or a dedicated Limitations paragraph) that explicitly discusses the practical constraints on controlled experiments at this scale, clarifies what factors are matched in our comparisons, and notes that the released code and checkpoints enable independent follow-up studies with tighter controls. This constitutes a partial revision focused on transparency rather than new large-scale runs. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical pretraining claims rest on benchmark outcomes, not self-referential definitions or fitted predictions.

full rationale

The paper reports results from scaling a decoder-only model on trillions of curated interleaved multimodal tokens, with performance gains and emerging abilities (manipulation, prediction, navigation) measured on standard benchmarks. No equations, derivations, or first-principles steps are presented that reduce to inputs by construction. Claims attribute outcomes to the unified pretraining setup but do so via direct empirical comparison rather than self-definition, renamed known results, or load-bearing self-citations. With no mathematical chain or parameter-fitting loop present, the argument rests on external benchmarks rather than on its own constructions.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim depends on the effectiveness of large-scale interleaved multimodal pretraining in a decoder-only model. Key free parameters include total training tokens, model size, and data curation rules, all chosen by the authors. The main domain assumption is that a single decoder-only transformer can natively support both understanding and generation when trained on mixed sequences.

free parameters (2)
  • total training tokens
    Trillions of tokens are used; the exact count and sampling ratios are design choices that the performance claims depend on.
  • model scale hyperparameters
    Number of parameters, layers, and hidden size are selected by hand to enable the reported scaling behavior.
axioms (1)
  • domain assumption: A decoder-only transformer can jointly model multimodal understanding and generation when trained on interleaved sequences.
    This architectural premise underpins the entire unified pretraining approach described in the abstract.
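To show how the ledger's free parameters might be written down explicitly, here is a hypothetical configuration sketch; every field name and value is a placeholder chosen for illustration, not a setting reported for BAGEL.

```python
# Hypothetical pretraining configuration expressing the ledger's free
# parameters explicitly. Values are placeholders, not BAGEL's settings.
PRETRAIN_CONFIG = {
    "total_training_tokens": 2.0e12,      # free parameter: overall token budget
    "sampling_ratios": {                  # free parameter: data-mixture weights
        "interleaved_web": 0.40,
        "image_text_pairs": 0.25,
        "video_frames": 0.20,
        "text_only": 0.15,
    },
    "model_scale": {                      # free parameter: size hyperparameters
        "n_params": 7.0e9,
        "n_layers": 32,
        "hidden_size": 4096,
    },
    # Axiom (architectural premise): a single decoder-only transformer
    # handles both understanding and generation over these mixed sequences.
    "architecture": "decoder_only_unified",
}

assert abs(sum(PRETRAIN_CONFIG["sampling_ratios"].values()) - 1.0) < 1e-9
```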

pith-pipeline@v0.9.0 · 5489 in / 1331 out tokens · 85604 ms · 2026-05-10T16:17:39.545061+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning

    cs.AI 2026-04 unverdicted novelty 8.0

    FeynmanBench is the first benchmark for evaluating multimodal LLMs on diagrammatic reasoning with Feynman diagrams, revealing systematic failures in enforcing physical constraints and global topology.

  2. ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

    cs.CV 2026-05 unverdicted novelty 7.0

    ATLAS uses a single functional token to unify agentic and latent visual reasoning without image generation or external execution.

  3. From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.

  4. Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

    cs.CV 2026-05 unverdicted novelty 7.0

    Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.

  5. Inline Critic Steers Image Editing

    cs.CV 2026-05 conditional novelty 7.0

    Inline Critic uses a learnable token to critique and steer a frozen image-editing model's intermediate layers during generation, delivering state-of-the-art results on GEdit-Bench, RISEBench, and KRIS-Bench.

  6. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.

  7. Action Emergence from Streaming Intent

    cs.RO 2026-05 unverdicted novelty 7.0

    A new VLA model called SI uses a four-step chain-of-thought to derive driving intent and applies it via classifier-free guidance to a flow-matching trajectory generator, showing competitive Waymo scores and intent-con...

  8. G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models

    cs.CV 2026-05 unverdicted novelty 7.0

    G²TR reduces visual tokens and prefill computation by 1.94x in separate-encoder UMMs via generation-guided importance from VAE latent consistency while preserving reasoning accuracy and editing quality.

  9. Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.

  10. UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    UniCustom fuses ViT and VAE features before VLM encoding and uses two-stage training plus slot-wise regularization to improve subject consistency in multi-reference diffusion-based image generation.

  11. Dynamic Execution Commitment of Vision-Language-Action Models

    cs.CV 2026-05 unverdicted novelty 7.0

    A3 determines the execution horizon in VLA models as the longest prefix of actions that passes consensus-based verification and sequential consistency checks.

  12. UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning

    cs.MM 2026-05 unverdicted novelty 7.0

    UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixe...

  13. Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization

    cs.CV 2026-05 unverdicted novelty 7.0

    DRoRAE fuses multi-layer features from pretrained vision encoders to recover lost low-level details, reducing rFID from 0.57 to 0.29 and generation FID from 1.74 to 1.65 on ImageNet-256.

  14. Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization

    cs.CV 2026-05 unverdicted novelty 7.0

    DRoRAE adaptively fuses multi-layer features from vision encoders via energy-constrained routing to enrich visual tokens, cutting rFID from 0.57 to 0.29 and generation FID from 1.74 to 1.65 on ImageNet-256 while revea...

  15. What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 7.0

    A method using attention head vectors detects and suppresses risky content generation in Diffusion Transformers at inference time.

  16. MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    MULTITEXTEDIT benchmark reveals that all tested text-in-image editing models show pronounced degradation on non-English languages, especially Hebrew and Arabic, mainly in text accuracy and script fidelity.

  17. ScribbleEdit: Synthetic Data for Image Editing with Scribbles and Text

    cs.CV 2026-05 conditional novelty 7.0

    ScribbleEdit is a synthetic dataset combining scribbles and text for training image editing models that produce spatially aligned and semantically consistent results.

  18. Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 7.0

    XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning...

  19. Edit Where You Mean: Region-Aware Adapter Injection for Mask-Free Local Image Editing

    cs.CV 2026-04 unverdicted novelty 7.0

    A co-trained adapter framework enables mask-free local editing in DiTs by factorizing edit semantics from spatial location and jointly learning a mask predictor.

  20. SketchVLM: Vision language models can annotate images to explain thoughts and guide users

    cs.CV 2026-04 unverdicted novelty 7.0

    SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines.

  21. Probing Visual Planning in Image Editing Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Image editing models fail zero-shot visual planning on abstract mazes and queen puzzles but generalize after finetuning, yet still cannot match human zero-shot efficiency.

  22. Exploring Spatial Intelligence from a Generative Perspective

    cs.CV 2026-04 unverdicted novelty 7.0

    Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.

  23. ATIR: Towards Audio-Text Interleaved Contextual Retrieval

    cs.SD 2026-04 unverdicted novelty 7.0

    Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.

  24. HP-Edit: A Human-Preference Post-Training Framework for Image Editing

    cs.CV 2026-04 unverdicted novelty 7.0

    HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.

  25. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 7.0

    UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.

  26. UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.

  27. Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion

    cs.CV 2026-04 unverdicted novelty 7.0

    3D-ARD+ unifies autoregressive token prediction with diffusion-based 3D latent generation to co-produce indoor scene layouts and object geometries that follow complex text-specified spatial and semantic constraints.

  28. ${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

    cs.LG 2026-04 unverdicted novelty 7.0

    π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

  29. OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

    cs.CV 2026-04 unverdicted novelty 7.0

    OmniScript is a new 8B omni-modal model that turns long cinematic videos into scene-by-scene scripts and matches top proprietary models on temporal localization and semantic accuracy.

  30. Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.

  31. RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

    cs.CV 2026-04 unverdicted novelty 7.0

    RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.

  32. V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators

    cs.CV 2026-03 unverdicted novelty 7.0

    V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the...

  33. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 6.0

    MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.

  34. Action Emergence from Streaming Intent

    cs.RO 2026-05 unverdicted novelty 6.0

    Streaming Intent lets a VLA model derive driving intent via streamed chain-of-thought reasoning and use it to steer a flow-matching action head, yielding competitive Waymo scores plus intent-based trajectory control w...

  35. Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm

    cs.CV 2026-05 unverdicted novelty 6.0

    V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...

  36. UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.

  37. GeoR-Bench: Evaluating Geoscience Visual Reasoning

    cs.CV 2026-05 unverdicted novelty 6.0

    GeoR-Bench shows top multimodal models reach only 42.7% strict accuracy on geoscience visual reasoning tasks while open-source models reach 10.3%, with outputs often visually plausible yet scientifically inaccurate.

  38. PresentAgent-2: Towards Generalist Multimodal Presentation Agents

    cs.CV 2026-05 unverdicted novelty 6.0

    PresentAgent-2 generates query-driven multimodal presentation videos with research grounding, supporting single-speaker, multi-speaker discussion, and interactive question-answering modes.

  39. HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

    cs.CV 2026-05 unverdicted novelty 6.0

    A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...

  40. Do multimodal models imagine electric sheep?

    cs.CV 2026-05 conditional novelty 6.0

    Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.

  41. SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.

  42. STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.

  43. Continuous Latent Diffusion Language Model

    cs.CL 2026-05 unverdicted novelty 6.0

    Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing l...

  44. MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality

    cs.CV 2026-05 unverdicted novelty 6.0

    MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.

  45. AlbumFill: Album-Guided Reasoning and Retrieval for Personalized Image Completion

    cs.CV 2026-05 unverdicted novelty 6.0

    AlbumFill retrieves identity-consistent references from personal albums via VLM-inferred semantic cues to support personalized image completion.

  46. PhysEdit: Physically-Consistent Region-Aware Image Editing via Adaptive Spatio-Temporal Reasoning

    cs.CV 2026-05 unverdicted novelty 6.0

    PhysEdit introduces adaptive reasoning depth and spatial masking to make image editing faster and more instruction-aligned without retraining the base model.

  47. Proteo-R1: Reasoning Foundation Models for De Novo Protein Design

    cs.LG 2026-05 unverdicted novelty 6.0

    Proteo-R1 decouples an MLLM-based understanding expert that selects functional residues from a diffusion-based generation expert that builds protein structures under those explicit constraints.

  48. Leveraging Verifier-Based Reinforcement Learning in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.

  49. SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness

    cs.CV 2026-04 unverdicted novelty 6.0

    SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.

  50. Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Refinement via Regeneration (RvR) reformulates image refinement in unified multimodal models as conditional regeneration using prompt and semantic tokens from the initial image, yielding higher alignment scores than e...

  51. DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    DDA-Thinker decouples planning from generation and applies dual-atomic RL with checklist-based rewards to boost reasoning in image editing, yielding competitive results on RISE-Bench and KRIS-Bench.

  52. Meta-CoT: Enhancing Granularity and Generalization in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.

  53. SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.

  54. Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs

    cs.CV 2026-04 unverdicted novelty 6.0

    IMU-to-4D uses wearable IMU data and repurposed LLMs to predict coherent 4D human motion plus coarse scene structure, outperforming cascaded state-of-the-art pipelines in temporal stability.

  55. Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens

    cs.CV 2026-04 unverdicted novelty 6.0

    Viewpoint tokens learned on a mixed 3D-rendered and photorealistic dataset enable precise camera control in text-to-image generation while factorizing geometry from appearance and transferring to unseen object categories.

  56. How Far Are Video Models from True Multimodal Reasoning?

    cs.CV 2026-04 unverdicted novelty 6.0

    Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.

  57. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 6.0

    UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.

  58. Generative Refinement Networks for Visual Synthesis

    cs.CV 2026-04 unverdicted novelty 6.0

    GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.

  59. Self-Adversarial One Step Generation via Condition Shifting

    cs.CV 2026-04 unverdicted novelty 6.0

    APEX derives self-adversarial gradients from condition-shifted velocity fields in flow models to achieve high-fidelity one-step generation, outperforming much larger models and multi-step teachers.

  60. Nucleus-Image: Sparse MoE for Image Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.

Reference graph

Works this paper leans on

110 extracted references · 110 canonical work pages · cited by 99 Pith papers · 24 internal anchors

  1. [1]

    Scaling laws for generative mixed-modal language models

    Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, and Luke Zettlemoyer. Scaling laws for generative mixed-modal language models. In ICML, 2023

  2. [2]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023

  3. [3]

    Humanedit: A high-quality human-rewarded dataset for instruction-based image editing

    Jinbin Bai, Wei Chow, Ling Yang, Xiangtai Li, Juncheng Li, Hanwang Zhang, and Shuicheng Yan. Humanedit: A high-quality human-rewarded dataset for instruction-based image editing.arXiv preprint arXiv:2412.04280, 2024

  4. [4]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report.ar...

  5. [5]

    Improving image generation with better captions.OpenAI blog, 2023

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions.OpenAI blog, 2023

  6. [6]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions. In CVPR, 2023

  7. [7]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InICCV, 2021

  8. [8]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion

    Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. InNeurIPS, 2024

  9. [9]

    Pixart-sigma: Weak-to-strong training of diffusion transformer for 4k text-to-image generation

    Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-sigma: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. In ECCV, 2024

  10. [10]

    An empirical study of gpt-4o image generation capabilities.arXiv preprint arXiv:2504.05979, 2025

    Sixiang Chen, Jinbin Bai, Zhuoran Zhao, Tian Ye, Qingyu Shi, Donghao Zhou, Wenhao Chai, Xin Lin, Jianzong Wu, Chao Tang, Shilin Xu, Tao Zhang, Haobo Yuan, Yikang Zhou, Wei Chow, Linfeng Li, Xiangtai Li, Lei Zhu, and Lu Qi. An empirical study of gpt-4o image generation capabilities.arXiv preprint arXiv:2504.05979, 2025

  11. [11]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Chengyue Wu, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, and Ping Luo. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

  12. [12]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, Qingyun Li, Yimin Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, Conghui He, Botian Shi, Xingcheng Zhang, Han Lv, Yi Wang, Wenqi Shao, Pei Chu, Zhongying Tu, Tong He, Zhiyong Wu, Huipeng Deng, Jia...

  13. [13]

    How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. SCIS, 2024

  14. [14]

    Common crawl - open repository of web crawl data., 2007

    Common Crawl. Common crawl - open repository of web crawl data., 2007. URLhttps://commoncrawl.org/

  15. [15]

    Scaling vision transformers to 22 billion parameters

    Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In ICML, 2023

  16. [16]

    Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution

    Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdulmohsin, et al. Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution. In NeurIPS, 2023

  17. [17]

    Causal diffusion transformers for generative modeling

    Chaorui Deng, Deyao Zhu, Kunchang Li, Shi Guang, and Haoqi Fan. Causal diffusion transformers for generative modeling. arXiv preprint arXiv:2412.12095, 2024

  18. [18]

    Dreamllm: Synergistic multimodal comprehension and creation

    Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, et al. Dreamllm: Synergistic multimodal comprehension and creation. InICLR, 2024

  19. [19]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024

  20. [20]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023

  21. [21]

    Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars

    Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars.arXiv preprint arXiv:2503.01307, 2025

  22. [22]

    Seed-data-edit technical report: A hybrid dataset for instructional image editing

    Yuying Ge, Sijie Zhao, Chen Li, Yixiao Ge, and Ying Shan. Seed-data-edit technical report: A hybrid dataset for instructional image editing.arXiv preprint arXiv:2405.04007, 2024

  23. [23]

    Seed-x: Multimodal models with unified multi-granularity comprehension and generation.arXiv preprint arXiv:2404.14396,

    Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arxiv:2404.14396, 2024

  24. [24]

    Experiment with gemini 2.0 flash native image generation, 2025

    Google Gemini2. Experiment with gemini 2.0 flash native image generation, 2025. URL https://developers.googleblog.com/en/experiment-with-gemini-20-flash-native-image-generation/

  25. [25]

    Geneval: An object-focused framework for evaluating text-to-image alignment

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. InNeurIPS, 2023

  26. [26]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  27. [27]

    Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale.arXiv preprint arXiv:2412.05237,

    Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, and Xiang Yue. Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale.arXiv preprint arXiv:2412.05237, 2024

  28. [28]

    Mvimgnet2.0: A larger-scale dataset of multi-view images

    Xiaoguang Han, Yushuang Wu, Luyue Shi, Haolin Liu, Hongjie Liao, Lingteng Qiu, Weihao Yuan, Xiaodong Gu, Zilong Dong, and Shuguang Cui. Mvimgnet2.0: A larger-scale dataset of multi-view images. arXiv preprint arXiv:2412.01430, 2024

  29. [29]

    Classifier-free diffusion guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. InNeurIPS Workshopon Deep Generative Models and Downstream Applications, 2021

  30. [30]

    Minicpm: Unveiling the potential of small language models with scalable training strategies

    Shengding Hu, Yuge Tu, Xu Han, Ganqu Cui, Chaoqun He, Weilin Zhao, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Xinrong Zhang, Zhen Leng Thai, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, dahai li, Zhiyuan Liu, and Maosong Sun. Minicpm: Unveiling the potential of small language models with...

  31. [31]

    Straightening out the straight-through estimator: Overcoming optimization challenges in vector quantized networks

    Minyoung Huh, Brian Cheung, Pulkit Agrawal, and Phillip Isola. Straightening out the straight-through estimator: Overcoming optimization challenges in vector quantized networks. InICML, 2023

  32. [32]

    Hq-edit: A high-quality dataset for instruction-based image editing

    Mude Hui, Siwei Yang, Bingchen Zhao, Yichun Shi, Heng Wang, Peng Wang, Yuyin Zhou, and Cihang Xie. Hq-edit: A high-quality dataset for instruction-based image editing.arXiv preprint arXiv:2404.09990, 2024

  33. [33]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  34. [34]

    FastText.zip: Compressing text classification models

    Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. Fasttext.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016

  35. [35]

    Flux, 2024

    Black Forest Labs. Flux, 2024. URLhttps://github.com/black-forest-labs/flux

  36. [36]

    Autoregressive image generation using residual quantization

    Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. In CVPR, 2022

  37. [37]

    LLaVA-onevision: Easy visual task transfer.TMLR, 2025

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-onevision: Easy visual task transfer.TMLR, 2025

  38. [38]

    Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation

    Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024

  39. [39]

    Omnicorpus: A unified multimodal corpus of 10 billion-level images interleaved with text

    Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, et al. Omnicorpus: A unified multimodal corpus of 10 billion-level images interleaved with text. arXiv preprint arXiv:2406.08418, 2024

  40. [40]

    Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models.TMLR, 2025

    Weixin Liang, LILI YU, Liang Luo, Srini Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen tau Yih, Luke Zettlemoyer, and Xi Victoria Lin. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models.TMLR, 2025

  41. [41]

    Flow matching for generative modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In ICLR, 2023

  42. [42]

    World model on million-length video and language with blockwise ringattention

    Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention.arXiv preprint arxiv:2402.08268, 2024

  43. [44]

    Step1X-Edit: A Practical Framework for General Image Editing

    Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025

  44. [45]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    Xingchao Liu, Chengyue Gong, et al. Flow straight and fast: Learning to generate and transfer data with rectified flow. InICLR, 2023

  45. [46]

    Mmbench: Is your multi-modal model an all-around player? InECCV, 2024

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InECCV, 2024

  46. [47]

    Decoupled Weight Decay Regularization

    I Loshchilov. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

  47. [48]

    Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action

    Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action. In CVPR, pages 26439–26455, 2024

  48. [49]

    Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. InNeurIPS Workshop on Mathematical Reasoning and AI, 2023

  49. [50]

    Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation

    Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, Liang Zhao, Yisong Wang, Jiaying Liu, and Chong Ruan. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. arXiv preprint arXiv:2411.07975, 2024

  50. [51]

    Finite scalar quantization: Vq-vae made simple.arXiv preprint arXiv:2309.15505, 2023

    Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: Vq-vae made simple. arXiv preprint arXiv:2309.15505, 2023

  51. [52]

    A theory on Adam instability in large-scale machine learning.arXiv preprint arXiv:2304.09871, 2023

    Igor Molybog, Peter Albert, Moya Chen, Zachary DeVito, David Esiobu, Naman Goyal, Punit Singh Koura, Sharan Narang, Andrew Poulton, Ruan Silva, et al. A theory on adam instability in large-scale machine learning. arXiv preprint arXiv:2304.09871, 2023

  52. [53]

    Wise: A world knowledge-informed semantic evaluation for text-to-image generation.arXiv preprint arXiv:2503.07265,

    Yuwei Niu, Munan Ning, Mengren Zheng, Bin Lin, Peng Jin, Jiaqi Liao, Kunpeng Ning, Bin Zhu, and Li Yuan. Wise: A world knowledge-informed semantic evaluation for text-to-image generation.arXiv preprint arXiv:2503.07265, 2025

  53. [54]

    Introducing gpt-4.1 in the api.OpenAI Blog, 2025

    OpenAI. Introducing gpt-4.1 in the api.OpenAI Blog, 2025. URL https://openai.com/index/gpt-4-1/

  54. [55]

    Introducing 4o image generation, 2025

    OpenAI. Introducing 4o image generation, 2025. URL https://openai.com/index/introducing-4o-image-generation/

  55. [56]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedba...

  56. [57]

    Transfer between Modalities with MetaQueries

    Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, Ji Hou, and Saining Xie. Transfer between modalities with metaqueries. arXiv preprint arXiv:2504.06256, 2025

  57. [58]

    Sdxl: Improving latent diffusion models for high-resolution image synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. InICLR, 2024

  58. [59]

    Tokenflow: Unified image tokenizer for multimodal understanding and generation

    Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K Du, Zehuan Yuan, and Xinglong Wu. Tokenflow: Unified image tokenizer for multimodal understanding and generation.arXiv preprint arXiv:2412.03069, 2024

  59. [60]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents.arXiv preprint arxiv:2204.06125, 2022

  60. [61]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InCVPR, pages 10684–10695, 2022

  61. [62]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. InNeurIPS, 2022

  62. [63]

    Seaweed-7b: Cost-effective training of video generation foundation model

    Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, et al. Seaweed-7b: Cost-effective training of video generation foundation model.arXiv preprint arXiv:2504.08685, 2025

  63. [64]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  64. [65]

    GLU Variants Improve Transformer

    Noam Shazeer. Glu variants improve transformer.arXiv preprint arXiv:2002.05202, 2020

  65. [66]

    Llamafusion: Adapting pretrained language models for multimodal generation.arXiv preprint arXiv:2412.15188,

    Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, and Lili Yu. Llamafusion: Adapting pretrained language models for multimodal generation.arXiv preprint arXiv:2412.15188, 2024

  66. [67]

    Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

  67. [68]

    Emu: Generative pretraining in multimodality

    Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Emu: Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023

  68. [69]

    Generative multimodal models are in-context learners

    Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners. InCVPR, 2024

  69. [70]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024

  70. [71]

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Enming Yuan, Enzhe Lu, Fang Li, Flood Sung, Guangda Wei, Guokun Lai, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haoning Wu, Haotian Yao, Haoyu Lu, Heng Wang, Hongcheng Gao, Huabi...

  71. [72]

    Flexattention: The flexibility of pytorch with the performance of flashattention

    Pytorch Team. Flexattention: The flexibility of pytorch with the performance of flashattention. Pytorch Blog. URL https://pytorch.org/blog/flexattention/

  73. [74]

    Metamorph: Multimodal understanding and generation via instruction tuning.arXiv preprint arXiv:2412.14164, 2024

    Shengbang Tong, David Fan, Jiachen Zhu, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, and Zhuang Liu. Metamorph: Multimodal understanding and generation via instruction tuning. arXiv preprint arXiv:2412.14164, 2024

  74. [75]

    Eyes wide shut? exploring the visual shortcomings of multimodal llms

    Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. InCVPR, pages 9568–9578, 2024

  75. [76]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025

  76. [77]

    ILLUME: illuminating your llms to see, draw, and self-enhance.arXiv preprint arXiv:2412.06673, 2024

    Chunwei Wang, Guansong Lu, Junwei Yang, Runhui Huang, Jianhua Han, Lu Hou, Wei Zhang, and Hang Xu. Illume: Illuminating your llms to see, draw, and self-enhance.arXiv preprint arXiv:2412.06673, 2024

  77. [78]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

  78. [79]

    Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content

    Qiuheng Wang, Yukai Shi, Jiarong Ou, Rui Chen, Ke Lin, Jiahao Wang, Boyuan Jiang, Haotian Yang, Mingwu Zheng, Xin Tao, et al. Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content.arXiv preprint arXiv:2410.08260, 2024

  79. [80]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arxiv:2409.18869, 2024

  80. [81]

    Omniedit: Building image editing generalist models through specialist supervision

    Cong Wei, Zheyang Xiong, Weiming Ren, Xeron Du, Ge Zhang, and Wenhu Chen. Omniedit: Building image editing generalist models through specialist supervision. InICLR, 2024

Showing first 80 references.