pith. machine review for the scientific record.

arxiv: 2311.15127 · v1 · submitted 2023-11-25 · 💻 cs.CV

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Adam Letts, Andreas Blattmann, Daniel Mendelevitch, Dominik Lorenz, Maciej Kilian, Robin Rombach, Sumith Kulal, Tim Dockhorn, Varun Jampani, Vikram Voleti, Yam Levi, Zion English

Pith reviewed 2026-05-10 22:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords latent video diffusion, text-to-video generation, image-to-video generation, dataset curation, multi-view diffusion, pretraining stages, video finetuning, motion representation

The pith

Three training stages on a curated large dataset turn latent diffusion models into competitive video generators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that scaling up video generation requires following a precise sequence of training phases rather than simply adding temporal layers to image models. It begins with pretraining on images, moves to video pretraining on a large dataset prepared with systematic captioning and filtering, and finishes with finetuning on high-quality videos. This recipe produces a base model whose text-to-video output competes with closed systems. The same base model supports direct image-to-video animation and can be adapted with modest additional training for generating multiple consistent views of an object. The work highlights that careful data curation is what makes the later stages effective.

Core claim

We present Stable Video Diffusion, a latent video diffusion model for high-resolution text-to-video and image-to-video generation. We identify three stages for successful training: text-to-image pretraining, video pretraining on a well-curated dataset, and high-quality video finetuning. A systematic curation process of captioning and filtering is required to produce high-quality videos. The resulting base model is competitive with closed-source text-to-video systems, provides a strong motion representation for image-to-video tasks, and supplies a multi-view 3D prior that allows finetuning into a feedforward multi-view diffusion model outperforming image-based methods at a fraction of the 3D-

What carries the argument

The three-stage training pipeline of text-to-image pretraining, video pretraining on a curated dataset, and high-quality video finetuning, together with the dataset curation process of captioning and filtering.
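
The stage ordering described above can be sketched as a small configuration check. This is a minimal illustration only: the `Stage` class, the `validate` helper, and the data labels are hypothetical names introduced here, not part of the released code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str
    data: str                  # free-text description of the training data (placeholder labels)
    init_from: Optional[str]   # name of the stage whose weights initialise this one


# The three stages named in the paper; the data labels are placeholders.
PIPELINE = [
    Stage("text_to_image_pretraining", "image-text pairs", None),
    Stage("video_pretraining", "large curated video dataset", "text_to_image_pretraining"),
    Stage("high_quality_finetuning", "small high-quality video set", "video_pretraining"),
]


def validate(pipeline):
    """Check that every stage starts from the stage immediately before it."""
    previous = None
    for stage in pipeline:
        if stage.init_from != previous:
            raise ValueError(f"{stage.name} must initialise from {previous!r}")
        previous = stage.name
    return [stage.name for stage in pipeline]


order = validate(PIPELINE)
print(" -> ".join(order))
```

Encoding the dependency explicitly makes a reordered or collapsed pipeline fail fast, which is the property the paper's argument relies on.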

If this is right

  • The base model adapts to image-to-video generation and to camera-motion control through low-rank adaptation modules.
  • It can be further finetuned into a multi-view diffusion model that jointly generates multiple object views in one forward pass.
  • The overall training strategy produces video generation quality that is competitive with closed-source systems.
  • Releasing the trained weights and code makes the approach available for community use on related video tasks.
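
For readers unfamiliar with low-rank adaptation, the idea behind the camera-motion modules can be sketched in a few lines. This is a generic LoRA illustration in pure Python, assuming the standard formulation W' = W + (alpha / r) · B · A with B initialised to zero; the matrix sizes and function names are hypothetical and do not describe the paper's actual layers.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the effective weight after adaptation."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Hypothetical sizes: an 8x8 frozen weight adapted with rank 2.
d_out, d_in, rank, alpha = 8, 8, 2, 2.0
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]     # frozen base weight
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]   # trainable, rank x d_in
b = [[0.0] * rank for _ in range(d_out)]                               # trainable, d_out x rank, zero-initialised

# With B = 0 the adapted weight equals the base weight, so training starts from the base model.
w_adapted = lora_weight(w, a, b, alpha, rank)
trainable = rank * (d_in + d_out)
print(trainable, "trainable values vs", d_in * d_out, "in the full matrix")
```

Only `a` and `b` would be updated during adaptation, which is why such modules are cheap to train relative to the base model.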

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same staged curation and training pattern may transfer to scaling diffusion models for other sequential data such as audio or 3D motion sequences.
  • The learned multi-view prior could support downstream tasks like video depth estimation or novel-view synthesis with little extra supervision.
  • Widespread use of this open recipe may reduce dependence on proprietary video datasets in the broader field of generative modeling.

Load-bearing premise

That the three identified training stages plus systematic captioning and filtering of the pretraining data are necessary and sufficient for high-quality video output.

What would settle it

Training an otherwise identical model on the same videos but without the described captioning and filtering steps, or with the stages reordered or collapsed, and measuring whether text-to-video quality remains competitive.
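
One way to lay out such an experiment is as an explicit grid of runs. The sketch below is an assumption about how the comparison could be organised, not a protocol from the paper; it covers only the "skip one stage" and "no curation" variants, and reordered stages would need additional entries.

```python
from itertools import product

STAGES = ("text_to_image_pretraining", "video_pretraining", "high_quality_finetuning")


def ablation_grid():
    """Enumerate runs: curated vs. uncurated data, with the full pipeline or one stage skipped."""
    stage_variants = [STAGES] + [tuple(s for s in STAGES if s != skipped) for skipped in STAGES]
    runs = []
    for curated, stages in product((True, False), stage_variants):
        # Curation only affects the video pretraining data, so the uncurated
        # variant is redundant when that stage is skipped.
        if not curated and "video_pretraining" not in stages:
            continue
        runs.append({"curated": curated, "stages": stages})
    return runs


runs = ablation_grid()
for run in runs:
    label = "curated" if run["curated"] else "uncurated"
    print(f"{label:9s} | {' -> '.join(run['stages'])}")
```

Each run would then be scored with the same text-to-video metric so that differences can be attributed to a single changed factor.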

Original abstract

We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets. However, training methods in the literature vary widely, and the field has yet to agree on a unified strategy for curating video data. In this paper, we identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning. Furthermore, we demonstrate the necessity of a well-curated pretraining dataset for generating high-quality videos and present a systematic curation process to train a strong base model, including captioning and filtering strategies. We then explore the impact of finetuning our base model on high-quality data and train a text-to-video model that is competitive with closed-source video generation. We also show that our base model provides a powerful motion representation for downstream tasks such as image-to-video generation and adaptability to camera motion-specific LoRA modules. Finally, we demonstrate that our model provides a strong multi-view 3D-prior and can serve as a base to finetune a multi-view diffusion model that jointly generates multiple views of objects in a feedforward fashion, outperforming image-based methods at a fraction of their compute budget. We release code and model weights at https://github.com/Stability-AI/generative-models .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents Stable Video Diffusion (SVD), a latent video diffusion model for high-resolution text-to-video and image-to-video generation. It identifies three training stages—text-to-image pretraining, video pretraining on a systematically curated large dataset (with captioning and filtering), and high-quality finetuning—and claims this pipeline produces models competitive with closed-source systems. The work further shows the base model provides strong motion priors for image-to-video tasks and camera-motion LoRAs, and serves as an effective starting point for finetuning a multi-view diffusion model that jointly generates multiple object views in a feedforward manner, outperforming image-based methods at lower compute. Code and weights are released publicly.

Significance. If the results hold, the work is significant as one of the first detailed, large-scale open efforts to scale latent video diffusion models, providing a reproducible recipe for data curation and staged training that addresses the field's lack of consensus on video data strategies. The public release of code/weights and the demonstration of the model's utility as a 3D prior for multi-view generation (at a fraction of prior compute) are clear strengths that can accelerate downstream research in generative video and 3D vision.

major comments (2)
  1. [§3 and §4] §3 (Training stages) and §4 (Experiments): The central claim that the three-stage pipeline plus systematic curation (captioning + filtering) is necessary for competitive performance lacks load-bearing ablations. No quantitative comparisons (FVD, CLIP similarity, or human preference) are reported for an otherwise identical model trained on uncurated or randomly subsampled data of equal size, or for variants that skip one stage (e.g., video pretraining without image pretraining). This makes it impossible to isolate whether headline results are driven by the claimed pipeline versus model scale and the base image LDM.
  2. [§4.3] §4.3 (Multi-view generation): The claim that the finetuned multi-view model outperforms image-based methods at a fraction of compute is load-bearing for the 3D-prior contribution, yet the section provides no exact baseline details (number of views, resolution, or total FLOPs) or error bars on the reported metrics, preventing verification of the efficiency advantage.
minor comments (2)
  1. [Figure 2] Figure 2 and associated text: Qualitative video examples would benefit from explicit mention of sampling parameters (guidance scale, number of frames, inference steps) to allow reproduction.
  2. [§2] Notation in §2 (Preliminaries): The temporal layer insertion into the U-Net is described at a high level; adding a short equation or diagram for the 3D convolution / attention modification would improve clarity.
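
As an illustration of the kind of equation minor comment 2 asks for, the block below sketches one common way temporal attention is interleaved with spatial layers in video latent diffusion models. The notation (B, T, C, H, W, z, d) is assumed here for illustration and is not taken from the paper, which may use a different formulation.

```latex
% Assumed notation: z is a batch of latent videos with
% B videos, T frames, C channels, and H x W spatial size.
\begin{aligned}
\text{spatial layers:}\quad
  & z \in \mathbb{R}^{(B \cdot T) \times C \times H \times W}
  && \text{(each frame is processed as an independent image)} \\
\text{temporal attention:}\quad
  & z' \in \mathbb{R}^{(B \cdot H \cdot W) \times T \times C}
  && \text{(each spatial location becomes a length-$T$ sequence)} \\
  & \operatorname{Attn}(Q, K, V)
    = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V
  && \text{($Q, K, V$ are linear projections of $z'$; $d$ is the key dimension)}
\end{aligned}
```

Under this reading, the pretrained spatial layers are reused unchanged and only the reshaping between the two layouts is new.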

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our work. We appreciate the emphasis on strengthening the experimental claims through additional ablations and details. Below we respond point-by-point to the major comments, indicating where revisions will be made.

Point-by-point responses
  1. Referee: [§3 and §4] §3 (Training stages) and §4 (Experiments): The central claim that the three-stage pipeline plus systematic curation (captioning + filtering) is necessary for competitive performance lacks load-bearing ablations. No quantitative comparisons (FVD, CLIP similarity, or human preference) are reported for an otherwise identical model trained on uncurated or randomly subsampled data of equal size, or for variants that skip one stage (e.g., video pretraining without image pretraining). This makes it impossible to isolate whether headline results are driven by the claimed pipeline versus model scale and the base image LDM.

    Authors: We acknowledge that direct ablations comparing the full pipeline against uncurated data of equal size or ablated stages would provide stronger isolation of each component's contribution. However, each full-scale training run on our dataset size requires substantial compute resources that were not available for multiple parallel experiments. The staged approach builds directly on established practices from large-scale image latent diffusion models, where text-to-image pretraining has been shown to be critical for high-quality generation. Our results demonstrate that the complete pipeline yields competitive performance with closed-source systems, and the public release of code and weights enables the community to conduct further controlled ablations. In the revision we will add an explicit discussion of this limitation and the practical constraints that prevented exhaustive ablations.
    revision: partial

  2. Referee: [§4.3] §4.3 (Multi-view generation): The claim that the finetuned multi-view model outperforms image-based methods at a fraction of compute is load-bearing for the 3D-prior contribution, yet the section provides no exact baseline details (number of views, resolution, or total FLOPs) or error bars on the reported metrics, preventing verification of the efficiency advantage.

    Authors: We agree that precise baseline specifications and uncertainty estimates are necessary to substantiate the efficiency claims. In the revised manuscript we will expand §4.3 to include the exact number of views, output resolution, and estimated total FLOPs for both our multi-view model and the compared image-based methods. We will also report standard deviations or error bars on the quantitative metrics where multiple runs or samples permit.
    revision: yes
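
The error bars promised in the second response amount to reporting a mean and sample standard deviation per method. A minimal sketch using Python's standard library follows; the method names and numbers are placeholders invented for illustration and are not results from the paper.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) for a list of per-run metric values."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(scores), stdev(scores)


# Placeholder numbers, NOT results from the paper.
runs = {"method_a": [0.80, 0.82, 0.81], "method_b": [0.78, 0.79, 0.80]}
for name, scores in runs.items():
    m, s = summarize(scores)
    print(f"{name}: {m:.3f} ± {s:.3f} (n={len(scores)})")
```

Reporting `n` alongside the spread lets a reader judge how much weight the error bar can carry.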

Circularity Check

0 steps flagged

No circularity: empirical training pipeline with external benchmarks

Full rationale

The paper reports results from training a latent video diffusion model via three sequential stages (text-to-image pretraining, video pretraining on curated data, high-quality finetuning) and evaluates performance on standard external metrics and tasks such as text-to-video generation, image-to-video, and multi-view synthesis. No equations, fitted parameters, or predictions are presented as independent derivations; claims rest on experimental outcomes compared to closed-source baselines and prior image-based methods. Self-citations to the authors' earlier Stable Diffusion work describe the base architecture but do not form a load-bearing circular chain, as the video-specific contributions (curation process, stage ordering, LoRA adaptations) are validated through new training runs and downstream evaluations rather than reducing to definitions or prior self-referential results.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard assumptions from the latent diffusion literature plus empirical choices in data filtering and training schedules; no new physical entities or ungrounded mathematical axioms are introduced.

free parameters (2)
  • training hyperparameters across stages
    Learning rates, batch sizes, and schedule details for the three training phases are chosen empirically but not enumerated in the abstract.
  • data filtering thresholds
    Specific captioning and quality filters applied during pretraining dataset curation are described at a high level without exact parameter values.
axioms (2)
  • domain assumption: Latent diffusion models trained on images can be extended to video by inserting temporal layers and continuing training.
    Invoked in the description of turning 2D image LDMs into video models.
  • domain assumption: Well-curated large-scale video data improves generative quality over smaller or unfiltered sets.
    Central to the claim that systematic curation is necessary for strong base models.
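
To make the "data filtering thresholds" entry concrete, the sketch below shows how threshold-based clip filtering could look. The score names and threshold values are hypothetical and chosen only for illustration; the ledger above notes that the actual values are not given.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    clip_id: str
    motion_score: float      # hypothetical: higher means more motion
    text_area_ratio: float   # hypothetical: fraction of the frame covered by overlaid text
    aesthetic_score: float   # hypothetical: higher means more visually appealing


# Hypothetical thresholds; the real values are free parameters not stated in the abstract.
THRESHOLDS = {"min_motion": 0.5, "max_text_area": 0.1, "min_aesthetic": 4.0}


def keep(clip, t=THRESHOLDS):
    """Return True if the clip passes every filter."""
    return (clip.motion_score >= t["min_motion"]
            and clip.text_area_ratio <= t["max_text_area"]
            and clip.aesthetic_score >= t["min_aesthetic"])


clips = [
    Clip("a", 1.2, 0.02, 5.1),   # passes all filters
    Clip("b", 0.1, 0.00, 6.0),   # rejected: nearly static
    Clip("c", 2.0, 0.40, 5.5),   # rejected: too much overlaid text
]
kept = [c.clip_id for c in clips if keep(c)]
print(kept)
```

Because each threshold changes which clips survive, every one of them is a free parameter in the sense of the ledger and would need to be reported for the curation step to be reproducible.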

pith-pipeline@v0.9.0 · 5620 in / 1571 out tokens · 44167 ms · 2026-05-10T22:53:09.529342+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking

    cs.CV 2026-05 unverdicted novelty 8.0

    TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.

  2. PhysInOne: Visual Physics Learning and Reasoning in One Suite

    cs.CV 2026-04 unverdicted novelty 8.0

    PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and m...

  3. OP4KSR: One-Step Patch-Free 4K Super-Resolution with Periodic Artifact Suppression

    cs.CV 2026-05 unverdicted novelty 7.0

    OP4KSR enables efficient one-step 4K super-resolution without patches by adapting Flux with RoPE rescaling and periodicity loss to suppress artifacts.

  4. JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampli...

  5. OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    OmniNFT introduces modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting in an online diffusion RL framework to improve audio-video quality, alignment, and synchronization.

  6. From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

  7. MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics

    cs.CV 2026-05 unverdicted novelty 7.0

    MoCam uses structured denoising dynamics in diffusion models to temporally decouple geometric alignment from appearance refinement, enabling unified novel view synthesis that outperforms prior methods on imperfect poi...

  8. MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics

    cs.CV 2026-05 unverdicted novelty 7.0

    MoCam unifies static and dynamic novel view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion denoising process.

  9. RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition

    cs.CV 2026-05 unverdicted novelty 7.0

    RevealLayer decomposes natural images into multiple RGBA layers using diffusion models with region-aware attention, occlusion-guided adaptation, and a composite loss, outperforming prior methods on a new benchmark dataset.

  10. Single-Shot HDR Recovery via a Video Diffusion Prior

    cs.CV 2026-05 unverdicted novelty 7.0

    Single-shot HDR is achieved by conditioning a video diffusion model on an LDR input to generate an exposure bracket and fusing the bracket with per-pixel weights from a lightweight UNet.

  11. HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation

    cs.CV 2026-05 conditional novelty 7.0

    HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.

  12. DCR: Counterfactual Attractor Guidance for Rare Compositional Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.

  13. FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction

    cs.CV 2026-05 unverdicted novelty 7.0

    FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.

  14. Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency

    cs.CV 2026-05 unverdicted novelty 7.0

    Eulerian adjacent-frame motion guidance plus bidirectional geometric consistency improves training speed, temporal coherence, and artifact reduction in diffusion-based image animation.

  15. Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency

    cs.CV 2026-05 unverdicted novelty 7.0

    Eulerian adjacent-frame motion fields with bidirectional cycle consistency checks enable faster parallel training and fewer artifacts in diffusion model image animation compared to initial-frame Lagrangian guidance.

  16. EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields

    cs.CV 2026-05 unverdicted novelty 7.0

    EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.

  17. Sparse-to-Complete: From Sparse Image Captures to Complete 3D Scenes

    cs.CV 2026-05 unverdicted novelty 7.0

    S2C-3D reconstructs complete high-fidelity 3D scenes from as few as 6-8 images by finetuning a diffusion model on scene data, applying consistency-conditioned sampling, and planning trajectories for full coverage.

  18. A unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping

    cs.CV 2026-05 unverdicted novelty 7.0

    Presents the first large-scale benchmark for multi-frame geometric distortion removal in videos under severe refractive warping, using real and synthetic data across four distortion levels and evaluating classical and...

  19. Generative Modeling with Orbit-Space Particle Flow Matching

    cs.GR 2026-05 unverdicted novelty 7.0

    OGPP is a particle flow-matching method using orbit-space canonicalization and geometric paths that achieves lower error and fewer steps than prior approaches on 3D benchmarks.

  20. TrajShield: Trajectory-Level Safety Mediation for Defending Text-to-Video Models Against Jailbreak Attacks

    cs.CV 2026-05 unverdicted novelty 7.0

    TrajShield is a training-free defense that reduces jailbreak success rates by 52.44% on average in text-to-video models by localizing and neutralizing risks through trajectory simulation and causal intervention.

  21. VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

    cs.CV 2026-05 unverdicted novelty 7.0

    VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

  22. Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion

    cs.RO 2026-05 unverdicted novelty 7.0

    Action Agent pairs LLM-driven video generation with a flow-constrained diffusion transformer to produce velocity commands, raising video success to 86% and delivering 64.7% real-world navigation on a Unitree G1 humanoid.

  23. AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    AsymTalker maintains identity consistency in long-term diffusion talking-head videos by encoding temporal references from a static image and training a student model under inference-like conditions via asymmetric dist...

  24. TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions

    cs.CV 2026-04 unverdicted novelty 7.0

    TransVLM formalizes Shot Transition Detection as identifying full temporal transition segments rather than single cut points and introduces a VLM that injects optical flow as a motion prior via simple feature fusion, ...

  25. $Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...

  26. Latent Space Probing for Adult Content Detection in Video Generative Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.

  27. WorldMark: A Unified Benchmark Suite for Interactive Video World Models

    cs.CV 2026-04 unverdicted novelty 7.0

    WorldMark is the first public benchmark that standardizes scenes, trajectories, and control interfaces across heterogeneous interactive image-to-video world models.

  28. Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.

  29. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 7.0

    UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.

  30. DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    DreamStereo uses GAPW, PBDP, and SASI to enable real-time stereo video inpainting at 25 FPS for HD videos by reducing over 70% redundant computation while maintaining quality.

  31. LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling

    cs.CL 2026-04 unverdicted novelty 7.0

    LangFlow is the first continuous diffusion language model to rival discrete diffusion on perplexity and generative perplexity while exceeding autoregressive baselines on several zero-shot tasks.

  32. Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

    cs.CV 2026-04 unverdicted novelty 7.0

    A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

  33. Tracking High-order Evolutions via Cascading Low-rank Fitting

    cs.LG 2026-04 unverdicted novelty 7.0

    Cascading low-rank fitting approximates successive high-order derivatives in diffusion models via a shared base function with sequentially added low-rank components, accompanied by theorems proving monotonic non-incre...

  34. Immune2V: Image Immunization Against Dual-Stream Image-to-Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Immune2V immunizes images against dual-stream I2V generation by enforcing temporally balanced latent divergence and aligning generative features to a precomputed collapse trajectory, yielding stronger persistent degra...

  35. Envisioning the Future, One Step at a Time

    cs.CV 2026-04 unverdicted novelty 7.0

    An autoregressive diffusion model on sparse point trajectories predicts multi-modal future scene dynamics from single images with orders-of-magnitude faster sampling than dense video simulators while matching accuracy.

  36. RewardFlow: Generate Images by Optimizing What You Reward

    cs.CV 2026-04 unverdicted novelty 7.0

    RewardFlow unifies differentiable rewards including a new VQA-based one and uses a prompt-aware adaptive policy with Langevin dynamics to achieve state-of-the-art image editing and compositional generation.

  37. Novel View Synthesis as Video Completion

    cs.CV 2026-04 unverdicted novelty 7.0

    Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.

  38. DiV-INR: Extreme Low-Bitrate Diffusion Video Compression with INR Conditioning

    eess.IV 2026-04 unverdicted novelty 7.0

    DiV-INR integrates implicit neural representations as conditioning signals for diffusion models to achieve better perceptual quality than HEVC, VVC, and prior neural codecs at extremely low bitrates under 0.05 bpp.

  39. ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.

  40. SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    SEM-ROVER generates large multiview-consistent 3D urban driving scenes via semantic-conditioned diffusion on Σ-Voxfield voxel grids with progressive outpainting and deferred rendering.

  41. OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control

    cs.CV 2026-04 unverdicted novelty 7.0

    OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.

  42. HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    HumANDiff improves motion consistency in human video generation by sampling diffusion noise on an articulated human body template and adding joint appearance-motion prediction plus a geometric consistency loss.

  43. A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens

    cs.CV 2026-04 conditional novelty 7.0

    Delta tokens compress VFM feature differences into single tokens, enabling a lightweight generative world model that predicts diverse futures with far lower compute than existing approaches.

  44. UENR-600K: A Large-Scale Physically Grounded Dataset for Nighttime Video Deraining

    cs.CV 2026-04 unverdicted novelty 7.0

    UENR-600K is a 600,000-frame synthetic dataset for nighttime video deraining that uses 3D rain particle simulation in Unreal Engine to enable better generalization to real scenes.

  45. Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation

    cs.CV 2026-04 conditional novelty 7.0

    SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.

  46. ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs

    cs.AR 2026-03 unverdicted novelty 7.0

    ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.

  47. GVCC: Zero-Shot Video Compression via Codebook-Driven Stochastic Rectified Flow

    cs.CV 2026-03 unverdicted novelty 7.0

    GVCC achieves the lowest LPIPS on UVG at bitrates down to 0.003 bpp by encoding stochastic innovations in a marginal-preserving stochastic process derived from a pretrained rectified-flow video model, with 65% LPIPS r...

  48. OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

    cs.CV 2024-07 unverdicted novelty 7.0

    OpenVid-1M supplies 1 million high-quality text-video pairs and introduces MVDiT to improve text-to-video generation by better using both visual structure and text semantics.

  49. Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm

    cs.CV 2026-05 unverdicted novelty 6.0

    V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...

  50. UniFixer: A Universal Reference-Guided Fixer for Diffusion-Based View Synthesis

    cs.CV 2026-05 unverdicted novelty 6.0

    UniFixer is a universal reference-guided framework that fixes spatial, temporal, and backbone-related degradations in diffusion-based view synthesis via coarse-to-fine modules and achieves zero-shot SOTA results on no...

  51. OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation

    cs.CV 2026-05 unverdicted novelty 6.0

    OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.

  52. VidSplat: Gaussian Splatting Reconstruction with Geometry-Guided Video Diffusion Priors

    cs.CV 2026-05 unverdicted novelty 6.0

    VidSplat iteratively synthesizes novel views with geometry-guided video diffusion to enable robust Gaussian splatting reconstruction from sparse or single-image inputs.

  53. GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth

    cs.CV 2026-05 unverdicted novelty 6.0

    GemDepth achieves improved 3D-consistent video depth by embedding predicted inter-frame camera poses into a network with an Alternating Spatio-Temporal Transformer for better spatial precision and temporal coherence.

  54. GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth

    cs.CV 2026-05 unverdicted novelty 6.0

    GemDepth embeds predicted camera poses into a spatio-temporal transformer to achieve state-of-the-art 3D-consistent video depth estimation.

  55. GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth

    cs.CV 2026-05 unverdicted novelty 6.0

    GemDepth predicts inter-frame camera poses to inject geometric embeddings into a spatio-temporal transformer, yielding state-of-the-art 3D-consistent video depth.

  56. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.

  57. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...

  58. How Mobile World Model Guides GUI Agents?

    cs.AI 2026-05 unverdicted novelty 6.0

    Mobile world models in text, image, and code modalities reach state-of-the-art on their benchmarks and improve downstream GUI agent performance, with code best for in-distribution accuracy and text more robust for out...

  59. SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    SARA improves text alignment and motion quality in video diffusion models by routing token-relation distillation supervision to semantically salient pairs using a Stage-1 aligner trained with SAM masks and InfoNCE.

  60. Implicit Preference Alignment for Human Image Animation

    cs.CV 2026-05 unverdicted novelty 6.0

    IPA aligns animation models for superior hand quality via implicit reward maximization on self-generated samples plus hand-focused local optimization, avoiding expensive paired data.

Reference graph

Works this paper leans on

116 extracted references · 116 canonical work pages · cited by 144 Pith papers · 21 internal anchors

  1. [1]

    Latent-Shift: Latent diffusion with temporal shift for efficient text-to-video generation

    Jie An, Songyang Zhang, Harry Yang, Sonal Gupta, Jia-Bin Huang, Jiebo Luo, and Xi Yin. Latent-shift: Latent diffusion with temporal shift for efficient text-to-video generation. arXiv preprint arXiv:2304.08477, 2023. 3

  2. [2]

    Renderdiffusion: Image diffusion for 3d reconstruction, inpainting and generation

    Titas Anciukevičius, Zexiang Xu, Matthew Fisher, Paul Henderson, Hakan Bilen, Niloy J Mitra, and Paul Guerrero. Renderdiffusion: Image diffusion for 3d reconstruction, inpainting and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12608–12618, 2023. 15

  3. [3]

    A general language assistant as a laboratory for alignment, 2021

    Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. A general language assistant as a labo...

  4. [4]

    Stochastic variational video prediction

    Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H. Campbell, and Sergey Levine. Stochastic variational video prediction. In International Conference on Learning Representations, 2018. 15

  5. [5]

    Character region awareness for text detection

    Youngmin Baek, Bado Lee, Dongyoon Han, Sangdoo Yun, and Hwalsuk Lee. Character region awareness for text detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9365–9374,

  6. [6]

    Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

  7. [7]

    Frozen in time: A joint video and image encoder for end-to-end retrieval, 2022

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval, 2022. 3, 5, 15

  8. [8]

    ipoke: Poking a still image for controlled stochastic video synthesis

    Andreas Blattmann, Timo Milbich, Michael Dorkenwald, and Björn Ommer. ipoke: Poking a still image for controlled stochastic video synthesis. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, 2021. 15

  9. [9]

    Align your latents: High-resolution video synthesis with latent diffusion models, 2023

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models. arXiv:2304.08818, 2023. 2, 3, 4, 5, 6, 7, 15, 19, 20, 23, 25

  10. [10]

    Generating long videos of dynamic scenes

    Tim Brooks, Janne Hellsten, Miika Aittala, Ting-Chun Wang, Timo Aila, Jaakko Lehtinen, Ming-Yu Liu, Alexei A Efros, and Tero Karras. Generating long videos of dynamic scenes. In NeurIPS, 2022. 3, 15, 24

  11. [11]

    Quo vadis, action recognition? a new model and the kinetics dataset

    Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017. 15, 24

  12. [12]

    Improved conditional vrnns for video prediction

    Lluis Castrejon, Nicolas Ballas, and Aaron Courville. Improved conditional vrnns for video prediction. In The IEEE International Conference on Computer Vision (ICCV),

  13. [13]

    Emu: Enhancing image generation models using photogenic needles in a haystack, 2023

    Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, Matthew Yu, Abhishek Kadian, Filip Radenovic, Dhruv Mahajan, Kunpeng Li, Yue Zhao, Vladan Petrovic, Mitesh Kumar Singh, Simran Motwani, Yi Wen, Yiwen Song, Roshan Sumbaly, Vignesh Ramanathan, Zijian He, Peter Vajda...

  14. [14]

    Objaverse-XL: A universe of 10m+ 3d objects

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-XL: A universe of 10m+ 3d objects. arXiv preprint arXiv:2307.05663, 2023. 2, 5, 7, 8

  15. [15]

    Objaverse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023. 7

  16. [16]

    Nerdi: Single-view nerf synthesis with language-guided diffusion as general image priors

    Congyue Deng, Chiyu Jiang, Charles R Qi, Xinchen Yan, Yin Zhou, Leonidas Guibas, Dragomir Anguelov, et al. Nerdi: Single-view nerf synthesis with language-guided diffusion as general image priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20637–20647, 2023. 15

  17. [17]

    Stochastic video generation with a learned prior

    Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018,

  18. [18]

    Diffusion Models Beat GANs on Image Synthesis

    Prafulla Dhariwal and Alex Nichol. Diffusion Models Beat GANs on Image Synthesis. arXiv:2105.05233, 2021. 25

  19. [19]

    Stochastic image-to-video synthesis using cinns

    Michael Dorkenwald, Timo Milbich, Andreas Blattmann, Robin Rombach, Konstantinos G. Derpanis, and Björn Ommer. Stochastic image-to-video synthesis using cinns. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, 2021. 15

  20. [20]

    Google scanned objects: A high-quality dataset of 3d scanned household items

    Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B McHugh, and Vincent Vanhoucke. Google scanned objects: A high-quality dataset of 3d scanned household items. In 2022 International Conference on Robotics and Automation (ICRA), pages 2553–2560. IEEE, 2022. 8

  21. [21]

    Arpad E. Elo. The Rating of Chessplayers, Past and Present. Arco Pub., New York, 1978. 4, 22

  22. [22]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. arXiv preprint arXiv:2012.09841, 2020. 3

  23. [23]

    Structure and content-guided video synthesis with diffusion models

    Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models,

  24. [24]

    Two-frame motion estimation based on polynomial expansion

    Gunnar Farnebäck. Two-frame motion estimation based on polynomial expansion. pages 363–370, 2003. 4, 17

  25. [25]

    Stylevideogan: A temporal generative model using a pretrained stylegan

    Gereon Fox, Ayush Tewari, Mohamed Elgharib, and Christian Theobalt. Stylevideogan: A temporal generative model using a pretrained stylegan. In British Machine Vision Conference (BMVC), 2021. 15

  26. [26]

    Stochastic latent residual video prediction

    Jean-Yves Franceschi, Edouard Delasalles, Mickaël Chen, Sylvain Lamprier, and Patrick Gallinari. Stochastic latent residual video prediction. In Proceedings of the 37th International Conference on Machine Learning, 2020. 15

  27. [27]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020. 3

  28. [28]

    Long video generation with time-agnostic vqgan and time-sensitive transformer

    Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, and Devi Parikh. Long video generation with time-agnostic vqgan and time-sensitive transformer. In Computer Vision – ECCV 2022, pages 102–118, Cham, 2022. Springer Nature Switzerland. 15

  29. [29]

    Preserve your own correlation: A noise prior for video diffusion models

    Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, and Yogesh Balaji. Preserve your own correlation: A noise prior for video diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22930–22941, 2023. 2, 3, 6, 15

  30. [30]

    Generative adversarial nets

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014. 3

  31. [31]

    Reuse and diffuse: Iterative denoising for text-to-video generation

    Jiaxi Gu, Shicong Wang, Haoyu Zhao, Tianyi Lu, Xing Zhang, Zuxuan Wu, Songcen Xu, Wei Zhang, Yu-Gang Jiang, and Hang Xu. Reuse and diffuse: Iterative denoising for text-to-video generation. arXiv preprint arXiv:2309.03549, 2023. 3

  32. [32]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023. 2, 3, 7, 15, 20

  33. [33]

    Rv-gan: Recurrent gan for unconditional video generation

    Sonam Gupta, Arti Keshari, and Sukhendu Das. Rv-gan: Recurrent gan for unconditional video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 2024–2033, 2022. 15

  34. [34]

    Diffusion with offset noise, 2023

    Nicholas Guttenberg and CrossLabs. Diffusion with offset noise, 2023. 19, 23

  35. [35]

    Latent video diffusion models for high-fidelity long video generation, 2023

    Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation, 2023. 3

  36. [36]

    Classifier-free diffusion guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. 7, 19, 20

  37. [37]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-Free Diffusion Guidance. arXiv:2207.12598, 2022. 19, 23

  38. [38]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020. 2, 25

  39. [39]

    Cascaded diffusion models for high fidelity image generation

    Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. arXiv preprint arXiv:2106.15282, 2021. 7, 15, 20

  40. [41]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022. 2, 3, 4, 15, 20

  41. [42]

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diffusion models. arXiv preprint arXiv:2204.03458, 2022. 2, 15

  42. [43]

    Cogvideo: Large-scale pretraining for text-to-video generation via transformers, 2022

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers, 2022. 2, 3, 6, 15

  43. [44]

    Simple diffusion: End-to-end diffusion for high resolution images

    Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. arXiv preprint arXiv:2301.11093, 2023. 6

  44. [45]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. 2

  45. [46]

    Estimation of Non-Normalized Statistical Models by Score Matching

    Aapo Hyvärinen and Peter Dayan. Estimation of Non-Normalized Statistical Models by Score Matching. Journal of Machine Learning Research, 6(4), 2005. 18

  46. [47]

    Openclip, 2021

    Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip, 2021. 3

  47. [48]

    Open source computer vision library

    Itseez. Open source computer vision library. https://github.com/itseez/opencv, 2015. 4, 17

  48. [49]

    Shap-e: Generating conditional 3d implicit functions, 2023

    Heewoo Jun and Alex Nichol. Shap-e: Generating conditional 3d implicit functions, 2023. 15

  49. [50]

    Lower dimensional kernels for video discriminators

    Emmanuel Kahembwe and Subramanian Ramamoorthy. Lower dimensional kernels for video discriminators. Neural Networks, 132:506–520, 2020. 15

  50. [51]

    Elucidating the Design Space of Diffusion-Based Generative Models

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the Design Space of Diffusion-Based Generative Models. arXiv:2206.00364, 2022. 3, 6, 18, 19

  51. [52]

    Text2video-zero: Text-to-image diffusion models are zero-shot video generators

    Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators,

  52. [53]

    Variational diffusion models

    Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in neural information processing systems, 34:21696–21707, 2021. 19

  53. [54]

    Pika labs

    Pika Labs. Pika labs, https://www.pika.art/,

  54. [55]

    Stochastic adversarial video prediction

    Alex X. Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, and Sergey Levine. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018. 15

  55. [56]

    Common Diffusion Noise Schedules and Sample Steps are Flawed

    Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common Diffusion Noise Schedules and Sample Steps are Flawed. arXiv:2305.08891, 2023. 19

  56. [57]

    Zero-1-to-3: Zero-shot one image to 3d object, 2023

    Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object, 2023. 2, 5, 8, 15

  57. [58]

    Syncdreamer: Generating multiview-consistent images from a single-view image

    Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. Syncdreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023. 2, 5, 7, 8, 15, 16

  58. [59]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 19, 20, 23

  59. [60]

    Transformation-based adversarial video prediction on large-scale data

    Pauline Luc, Aidan Clark, Sander Dieleman, Diego de Las Casas, Yotam Doron, Albin Cassirer, and Karen Simonyan. Transformation-based adversarial video prediction on large-scale data. ArXiv, 2020. 15

  60. [61]

    On distillation of guided diffusion models, 2023

    Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik P. Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models, 2023. 15

  61. [62]

    Point-e: A system for generating 3d point clouds from complex prompts, 2022

    Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts, 2022. 15

  62. [63]

    The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

    Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023. 3

  63. [64]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. arXiv:2307.01952, 2023. 2, 3, 5, 24

  64. [65]

    Training contrastive captioners

    Giovanni Puccetti, Maciej Kilian, and Romain Beaumont. Training contrastive captioners. LAION blog, 2023. 17

  65. [66]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020, 2021. 2, 3, 4, 8, 18

  66. [67]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints, 2019. 3

  67. [68]

    How dall·e 2 works, 2022

    Aditya Ramesh. How dall·e 2 works, 2022. 2

  68. [70]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv:2204.06125, 2022. 16

  69. [72]

    High-Resolution Image Synthesis with Latent Diffusion Models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. arXiv preprint arXiv:2112.10752, 2021. 7, 8

  70. [73]

    U-Net: Convolutional Networks for Biomedical Image Segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv:1505.04597, 2015. 7, 20, 23

  71. [74]

    Gen-2 by runway

    RunwayML. Gen-2 by runway, https://research.runwayml.com/gen2, 2023. 2, 6, 7, 24

  72. [75]

    Image super-resolution via iterative refinement

    Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. arXiv preprint arXiv:2104.07636, 2021. 2

  73. [76]

    Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022. 2

  74. [77]

    Temporal generative adversarial nets with singular value clipping

    Masaki Saito, Eiichi Matsumoto, and Shunta Saito. Temporal generative adversarial nets with singular value clipping. In ICCV, 2017. 15

  75. [78]

    Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal gan

    Masaki Saito, Shunta Saito, Masanori Koyama, and Sosuke Kobayashi. Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal gan. International Journal of Computer Vision, 2020. 15

  76. [79]

    Progressive Distillation for Fast Sampling of Diffusion Models

    Tim Salimans and Jonathan Ho. Progressive Distillation for Fast Sampling of Diffusion Models. arXiv preprint arXiv:2202.00512, 2022. 15, 19

  77. [80]

    Laion-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022. 3, 4, 16, 18

  78. [81]

    Mvdream: Multi-view diffusion for 3d generation

    Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023. 15

  79. [82]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-A-Video: Text-to-Video Generation without Text-Video Data. arXiv:2209.14792, 2022. 2, 3, 4, 5, 6, 15, 20

  80. [83]

    Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2

    Ivan Skorokhodov, Sergey Tulyakov, and Mohamed Elhoseiny. Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3626–3636, 2022. 15

Showing first 80 references.