super hub Mixed citations

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Chenhui Gou, Deyao Zhu, Feng Li, Kunchang Li, Zeyu Wang · 2025 · cs.CV · arXiv 2505.14683

Mixed citation behavior. Most common role is background (62%).

208 Pith papers citing it

Background 62% of classified citations

open full Pith review browse 208 citing papers more from Chaorui Deng arXiv PDF

abstract

Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further opportunities for multimodal research, we share the key findings, pretraining details, data creation protocal, and release our code and checkpoints to the community. The project page is at https://bagel-ai.org/

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 47 baseline 17 method 7 other 2 dataset 1

citation-polarity summary

background 46 baseline 17 use method 7 unclear 2 support 1 use dataset 1

claims ledger

abstract Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in bot

authors

Chaorui Deng Chenhui Gou Deyao Zhu Feng Li Kunchang Li Zeyu Wang

co-cited works

representative citing papers

GEAR: Guided End-to-End AutoRegression for Image Synthesis

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

GEAR jointly trains VQ tokenizer and AR generator end-to-end via dual hard/soft read-out and representation alignment, achieving up to 10x faster ImageNet gFID convergence than LlamaGen-REPA while generalizing across quantizers and to text-to-image.

PhyEditBench: A Real-World Multi-Stage Benchmark for Physics-Aware Image Editing

cs.CV · 2026-06-25 · unverdicted · novelty 7.0

PhyEditBench is a new benchmark for physics-aware image editing with real and synthetic instances plus a training-free PhyWorld baseline that uses test-time scaling to outperform SOTA models.

ChronoPhyBench: Do MLLMs Truly Understand the World or Merely Exploit Language Priors?

cs.CV · 2026-06-06 · unverdicted · novelty 7.0

ChronoPhyBench is a new benchmark and dataset for chronological physical dynamics reasoning that combines video-conditioned next-state prediction with VQA to reduce language bias in MLLM evaluation.

Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

Astra couples an RL-trained VLM policy with a view-consistent Bagel-based world simulator to enable agentic imagination during spatial reasoning, yielding benchmark gains on MMSI-Bench and MindCube.

MetaPoint: Unlocking Precise Spatial Control in Agentic Visual Generation

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

MetaPoint represents 2D coordinates as special tokens in visual generative models to enable precise spatial control using existing positional encodings without architectural modifications.

Imagine Before You Draw: Visual Prompt Engineering for Image Generation

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

VPE inserts an internal autoregressive visual semantic token generation step to guide image token production in unified models, reporting faster convergence, higher quality, and superior editing preservation (PSNR 26.76 vs 19.92) versus external alternatives.

Diffusing in the Right Space: A Systematic Study of Latent Diffusability

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.

MemoGen: Can Past Experience Improve Future Text-to-Image Generation?

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

MemoGen is a training-free agentic framework that stores task understanding, references, visual feedback, and lessons from past generations as reusable memory to improve text-to-image output over evolution rounds.

OctoT2I: A Self-Evolving Agentic Text-to-Image Router

cs.AI · 2026-06-01 · unverdicted · novelty 7.0

OctoT2I uses a no-supervision PSEL loop to discover model capability frontiers and route T2I tasks, reaching 0.96 GenEval score with 90.3% speedup over Flow-GRPO.

DeepLatent: Think with Images via Parallel Latent Visual Reasoning

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.

Benchmarking and Enhancing Text-to-Image Models for Generating Visual Representations in Early Arithmetic Education

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

Introduces E2V-Bench benchmark for equation-to-visual generation in early arithmetic education, shows T2I models fail on numerical accuracy and relations, and reports partial gains from benchmark-guided enhancements.

How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

View Dropout forces reliance on intermediate thinking images in unified multimodal models, with panoramic renderings proving most effective for out-of-domain cross-view spatial reasoning.

Dual-Pathway Geometry-Aware MLLM for Spatial Intelligence

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

GAMSI is a dual-pathway Geometry-Aware MLLM using Metric-Structure Decoupled Queries and Expert-Guided Visual Grounding on RGB inputs alone, trained on a new 152k-sample MTS dataset to reach SOTA on seven spatial benchmarks.

ETCHR: Editing To Clarify and Harness Reasoning

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

A decoupled question-conditioned image editor trained via supervised imitation then VLM-reward enhancement improves MLLM visual reasoning Pass@1 by 4.6-5.5 points across models and tasks.

VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

VINS-120K supplies the first large-scale set of instruction-image-edited-image triplets at ultra-high resolution together with an adaptation strategy that improves detail synthesis.

VisAnalog: A Diagnostic Suite for Visual Concept Transfer on Natural Images

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

VisAnalog is a new controlled benchmark showing VLMs substantially underperform humans on visual concept transfer under one- to four-step deterministic transformations, with relation inference as the main failure mode.

MotiMotion: Motion-Controlled Video Generation with Visual Reasoning

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

MotiMotion adds visual reasoning via a training-free VLM to refine primary trajectories and hallucinate secondary motions, plus a confidence-aware guidance scheme, yielding more plausible interactions on the new MotiBench benchmark.

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

cs.CV · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.

Accelerating Rectified Flow Models via Trajectory-Aware Caching

cs.CV · 2026-05-16 · unverdicted · novelty 7.0

TACache accelerates rectified flow sampling up to 4.14x for text-to-image and 2.11x for text-to-video via offline skip scheduling from cumulative variation thresholds and online velocity reconstruction using historical orthogonal directions.

ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

ATLAS uses a single functional token to unify agentic and latent visual reasoning without image generation or external execution.

From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.

Editor's Choice: Evaluating Abstract Intent in Image Editing through Atomic Entity Analysis

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

Presents Entity-Rubrics and AbstractEdit benchmark to measure image editing models on abstract intent, finding standard models struggle to balance edit intent with image preservation.

Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.

Inline Critic Steers Image Editing

cs.CV · 2026-05-12 · conditional · novelty 7.0

Inline Critic uses a learnable token to critique and steer a frozen image-editing model's intermediate layers during generation, delivering state-of-the-art results on GEdit-Bench, RISEBench, and KRIS-Bench.

citing papers explorer

Showing 7 of 7 citing papers after filters.

UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning cs.MM · 2026-05-12 · unverdicted · none · ref 4 · internal anchor
UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixed strategies.
${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities cs.LG · 2026-04-16 · unverdicted · none · ref 107 · internal anchor
π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
Leveraging Verifier-Based Reinforcement Learning in Image Editing cs.CV · 2026-04-30 · unverdicted · none · ref 14 · 2 links · internal anchor
Edit-R1 builds a CoT-based reasoning reward model (RRM) via SFT and GCPO, then applies it with GRPO to improve image editing models such as FLUX.1-kontext.
Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models cs.CV · 2026-04-28 · unverdicted · none · ref 12 · internal anchor
Refinement via Regeneration (RvR) reformulates image refinement in unified multimodal models as conditional regeneration using prompt and semantic tokens from the initial image, yielding higher alignment scores than editing-based methods.
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs cs.CV · 2026-04-13 · unverdicted · none · ref 17 · internal anchor
POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
Continuous Adversarial Flow Models cs.LG · 2026-04-13 · unverdicted · none · ref 13 · internal anchor
Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-image benchmarks.
Towards Robust Sequential Decomposition for Complex Image Editing cs.CV · 2026-05-10 · unverdicted · none · ref 10 · 2 links · internal anchor
Develops a synthetic data pipeline for training sequential decomposition in generative image editing, showing robust gains with complexity and sim-to-real transfer via co-training.

Emerging Properties in Unified Multimodal Pretraining

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer