3D-ARD+ unifies autoregressive token prediction with diffusion-based 3D latent generation to co-produce indoor scene layouts and object geometries that follow complex text-specified spatial and semantic constraints.
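To make the hybrid concrete, here is a minimal sketch of the general autoregressive-plus-diffusion pattern the summary describes: a transformer stream emits discrete layout tokens while a small denoiser iteratively refines a continuous shape latent conditioned on the layout state. Every class name, dimension, and the crude fixed-step refinement loop below are illustrative assumptions, not 3D-ARD+'s actual architecture.

import torch
import torch.nn as nn

class ARDiffusionSceneSketch(nn.Module):
    """Hypothetical stand-in, not 3D-ARD+'s real interface."""
    def __init__(self, vocab=1024, d=256, latent_dim=64, steps=10):
        super().__init__()
        self.latent_dim, self.steps = latent_dim, steps
        self.embed = nn.Embedding(vocab, d)
        # Autoregressive stream: predicts the next discrete layout token.
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.ar = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, vocab)
        # Diffusion-style stream: refines a continuous 3D shape latent
        # conditioned on the current layout state.
        self.denoiser = nn.Sequential(
            nn.Linear(latent_dim + d, 256), nn.SiLU(), nn.Linear(256, latent_dim))

    @torch.no_grad()
    def generate(self, prompt_tokens):
        h = self.ar(self.embed(prompt_tokens))        # (B, T, d); causal mask omitted
        next_token = self.head(h[:, -1]).argmax(-1)   # next layout token
        cond = h[:, -1]                               # layout state conditions geometry
        z = torch.randn(prompt_tokens.size(0), self.latent_dim)
        for _ in range(self.steps):                   # crude fixed-step refinement
            z = z - 0.1 * self.denoiser(torch.cat([z, cond], dim=-1))
        return next_token, z

token, shape_latent = ARDiffusionSceneSketch().generate(torch.randint(0, 1024, (1, 8)))

In the real system the denoiser would follow a proper diffusion sampler and the two stages would share a schedule; the point here is only the split between discrete layout tokens and continuous geometry latents.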
Transfer between Modalities with MetaQueries
20 Pith papers cite this work.
representative citing papers
STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.
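As a rough illustration of what "interleaving two streams via residual skips" can mean, the toy block below projects pooled language-stream activations into the image stream at every layer. Block types, widths, and the mean-pooling are my assumptions; the source line gives no further detail on STARFlow2's design.

import torch
import torch.nn as nn

class InterleavedStreams(nn.Module):
    def __init__(self, d=256, layers=4):
        super().__init__()
        self.vlm_blocks = nn.ModuleList([nn.Linear(d, d) for _ in range(layers)])
        self.flow_blocks = nn.ModuleList([nn.Linear(d, d) for _ in range(layers)])
        self.skips = nn.ModuleList([nn.Linear(d, d) for _ in range(layers)])

    def forward(self, text_h, image_h):
        for vlm, flow, skip in zip(self.vlm_blocks, self.flow_blocks, self.skips):
            text_h = text_h + torch.relu(vlm(text_h))
            # Residual skip: pooled language-stream state is injected
            # into the flow stream at every layer.
            text_summary = skip(text_h.mean(dim=1, keepdim=True))
            image_h = image_h + torch.relu(flow(image_h)) + text_summary
        return image_h

out = InterleavedStreams()(torch.randn(1, 8, 256), torch.randn(1, 16, 256))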
MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding state-of-the-art generation quality and improved semantic performance over its teacher model.
Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.
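The underlying register-token mechanism is easy to show in isolation: learnable extra tokens are appended before attention so outlier activation mass has somewhere to land, then discarded afterward. The dual-stage scheduling itself is not reproduced here; sizes and placement are placeholder assumptions.

import torch
import torch.nn as nn

class RegisterBlock(nn.Module):
    def __init__(self, d=256, n_registers=4):
        super().__init__()
        # Learnable register tokens, appended to the image-token sequence.
        self.registers = nn.Parameter(torch.zeros(1, n_registers, d))
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.n = n_registers

    def forward(self, x):                        # x: (B, T, d) image tokens
        r = self.registers.expand(x.size(0), -1, -1)
        h = torch.cat([x, r], dim=1)             # give outliers somewhere to go
        h, _ = self.attn(h, h, h)
        return x + h[:, : -self.n]               # drop registers, keep the residual

y = RegisterBlock()(torch.randn(2, 16, 256))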
Refinement via Regeneration (RvR) reformulates image refinement in unified multimodal models as conditional regeneration using prompt and semantic tokens from the initial image, yielding higher alignment scores than editing-based methods.
Meta-CoT uses a two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.
This work shows that highly discriminative LLM text features are required for, and enable, the first effective one-step text-conditioned image generation with MeanFlow.
SLQ adapts frozen MLLMs for multimodal retrieval by appending shared latent queries to text and image tokens and introduces KARR-Bench to test knowledge-aware reasoning retrieval.
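A minimal sketch of the shared-latent-query idea, assuming a frozen attention module as a stand-in for the MLLM backbone: the same learnable queries are appended to text tokens or image tokens, and their output states serve as the retrieval embedding for either modality.

import torch
import torch.nn as nn

class SharedLatentQueries(nn.Module):
    def __init__(self, d=256, n_queries=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_queries, d) * 0.02)
        self.backbone = nn.MultiheadAttention(d, 4, batch_first=True)
        for p in self.backbone.parameters():     # the backbone stays frozen
            p.requires_grad_(False)
        self.n = n_queries

    def embed(self, tokens):                     # tokens: text OR image features
        q = self.queries.expand(tokens.size(0), -1, -1)
        h = torch.cat([tokens, q], dim=1)        # append the shared queries
        h, _ = self.backbone(h, h, h)
        # Output states at the query positions become the retrieval embedding.
        return h[:, -self.n :].mean(dim=1)

enc = SharedLatentQueries()
text_emb = enc.embed(torch.randn(1, 12, 256))    # e.g. 12 text tokens
image_emb = enc.embed(torch.randn(1, 49, 256))   # e.g. 49 image patches
score = torch.cosine_similarity(text_emb, image_emb)

Because the queries are shared, text and image embeddings land in one comparable space, which is what makes cross-modal retrieval with a frozen backbone plausible.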
APEX derives self-adversarial gradients from condition-shifted velocity fields in flow models to achieve high-fidelity one-step generation, outperforming much larger models and multi-step teachers.
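One way to read "self-adversarial gradients from condition-shifted velocity fields" is as a velocity difference between the true and a shifted condition; the sketch below implements only that reading, with toy dimensions, and does not reproduce APEX's actual loss or schedule.

import torch
import torch.nn as nn

# Toy velocity field of a flow model: input is (sample, condition).
velocity = nn.Sequential(nn.Linear(64 + 32, 128), nn.SiLU(), nn.Linear(128, 64))

def shifted_velocity_signal(x, cond, cond_shifted):
    v_true = velocity(torch.cat([x, cond], dim=-1))
    v_shift = velocity(torch.cat([x, cond_shifted], dim=-1))
    return v_true - v_shift        # self-adversarial update direction

x = torch.randn(4, 64)
g = shifted_velocity_signal(x, torch.randn(4, 32), torch.randn(4, 32))
x_refined = x + g                  # crude one-step refinement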
Uni-ViGU unifies video generation and understanding by extending a diffusion video generator with unified continuous-discrete flow matching, modality-driven MoE layers, and bidirectional training stages that repurpose generative knowledge for discriminative tasks.
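Of the three ingredients, the modality-driven MoE layer is the simplest to sketch: tokens are routed to experts by a modality id rather than a learned router. This is a toy reading; Uni-ViGU's actual routing and expert design may differ.

import torch
import torch.nn as nn

class ModalityMoE(nn.Module):
    def __init__(self, d=256):
        super().__init__()
        # One expert per modality: 0 = discrete text tokens, 1 = continuous video.
        self.experts = nn.ModuleList([nn.Linear(d, d), nn.Linear(d, d)])

    def forward(self, x, modality_id):           # x: (B, T, d); modality_id: (B, T)
        out = torch.zeros_like(x)
        for m, expert in enumerate(self.experts):
            mask = (modality_id == m).unsqueeze(-1)
            # Wasteful but clear: run every expert, keep only its own tokens.
            out = out + torch.where(mask, expert(x), torch.zeros_like(x))
        return out

y = ModalityMoE()(torch.randn(1, 10, 256), torch.randint(0, 2, (1, 10)))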
Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.
LTX-2 generates high-quality synchronized audiovisual content from text prompts via an asymmetric 14B-video / 5B-audio dual-stream transformer with cross-attention and modality-aware guidance.
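The asymmetric dual-stream idea reduces to two stacks of different widths exchanging information through cross-attention, as in the toy block below; the tiny dimensions are placeholders, with LTX-2's 14B/5B configuration only as inspiration.

import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    def __init__(self, d_video=256, d_audio=128):
        super().__init__()
        self.a2v = nn.MultiheadAttention(d_video, 4, kdim=d_audio, vdim=d_audio,
                                         batch_first=True)
        self.v2a = nn.MultiheadAttention(d_audio, 4, kdim=d_video, vdim=d_video,
                                         batch_first=True)

    def forward(self, v, a):
        v = v + self.a2v(v, a, a)[0]    # video tokens attend to audio tokens
        a = a + self.v2a(a, v, v)[0]    # audio tokens attend to video tokens
        return v, a

v, a = DualStreamBlock()(torch.randn(1, 32, 256), torch.randn(1, 20, 128))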
UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.
BAGEL is a unified decoder-only model that exhibits emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.
BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.
RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.
MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.
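Since this mechanism echoes the MetaQueries-style interface of the hub paper, a small sketch helps: learnable queries cross-attend into frozen VLM hidden states, and their outputs become the conditioning embeddings handed to a diffusion decoder. Module names and sizes below are illustrative, not MMCORE's actual components.

import torch
import torch.nn as nn

class QueryBridge(nn.Module):
    def __init__(self, d=256, n_queries=16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_queries, d) * 0.02)
        self.cross = nn.MultiheadAttention(d, 4, batch_first=True)

    def forward(self, vlm_hidden):                # (B, T, d) frozen VLM states
        q = self.queries.expand(vlm_hidden.size(0), -1, -1)
        cond, _ = self.cross(q, vlm_hidden, vlm_hidden)
        return cond                               # (B, n_queries, d) conditioning

cond = QueryBridge()(torch.randn(1, 77, 256))     # feed cond to a diffusion decoder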
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
Step1X-Edit integrates a multimodal LLM with a diffusion decoder, trained on a custom high-quality dataset, to deliver image editing performance that surpasses open-source baselines and approaches proprietary models on the new GEdit-Bench.
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
citing papers explorer
- SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs
- LTX-2: Efficient Joint Audio-Visual Foundation Model
- BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset