pith. machine review for the scientific record.

arXiv: 2506.08009 · v2 · submitted 2025-06-09 · 💻 cs.CV · cs.AI · cs.LG

Recognition: 2 theorem links

· Lean Theorem

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:30 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG

keywords autoregressive video diffusion · exposure bias · self forcing · KV caching · real-time video generation · train-test gap · streaming video

The pith

Self Forcing trains autoregressive video diffusion models on their own generated outputs to close the exposure bias gap and enable real-time streaming.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Self Forcing to address exposure bias in autoregressive video diffusion models, where training uses ground-truth context but inference must rely on the model's own imperfect outputs. It performs autoregressive rollout with key-value caching during training so each frame is conditioned on previously self-generated frames, then applies a holistic loss over the full video sequence. Efficiency comes from using a few-step diffusion process together with stochastic gradient truncation. The approach also adds a rolling KV cache for extrapolation. This yields real-time streaming video generation at sub-second latency on a single GPU while matching or exceeding the quality of slower non-causal models.

Core claim

Self Forcing conditions each frame's generation on previously self-generated outputs by performing autoregressive rollout with key-value caching during training. This enables supervision through a holistic loss at the video level that directly evaluates the quality of the entire generated sequence, rather than relying solely on traditional frame-wise objectives, and supports efficient inference via few-step diffusion, stochastic gradient truncation, and a rolling KV cache mechanism.
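To make the core claim concrete, here is a minimal sketch of a training-time rollout of this kind in PyTorch-style code. The interfaces (`model.denoise`, `model.extract_kv`, `holistic_loss`) are hypothetical stand-ins chosen for illustration, not the authors' implementation, and the four-step denoising loop is an assumption based on the abstract's mention of a few-step diffusion process.

```python
# Hedged sketch of Self Forcing-style training as described above: every frame
# is denoised from noise while attending to a KV cache built from the model's
# own earlier outputs, and one video-level loss supervises the whole rollout.
# All interfaces below are hypothetical stand-ins, not the authors' code.
import torch

def self_forcing_training_step(model, holistic_loss, prompt_emb,
                               num_frames: int, num_denoise_steps: int = 4):
    kv_cache = []                         # KV tensors from previously generated frames
    frames = []
    for _ in range(num_frames):
        x = torch.randn(1, model.channels, model.height, model.width)  # start from noise, as at inference
        for s in reversed(range(num_denoise_steps)):                   # few-step denoising
            x = model.denoise(x, step=s, prompt=prompt_emb, kv_cache=kv_cache)
        kv_cache.append(model.extract_kv(x, prompt=prompt_emb))        # later frames condition on this output
        frames.append(x)
    video = torch.stack(frames, dim=1)    # (batch, time, channels, height, width)
    loss = holistic_loss(video)           # one loss over the entire generated sequence
    loss.backward()
    return loss
```

The cost of backpropagating through the full rollout is what the stochastic gradient truncation discussed below is meant to contain.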

What carries the argument

Self Forcing, the training paradigm of autoregressive rollout with KV caching that conditions each frame on self-generated prior outputs and applies a video-level loss.

If this is right

  • Real-time streaming video generation with sub-second latency on a single GPU
  • Generation quality that matches or surpasses significantly slower non-causal diffusion models
  • Efficient autoregressive video extrapolation through the rolling KV cache (see the sketch after this list)
  • Holistic video-level supervision instead of per-frame objectives
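The rolling KV cache mentioned above can be pictured as a fixed-capacity window over per-frame key/value tensors, so memory stays constant as the video extends. The sketch below is one plausible reading using a bounded deque; the drop-oldest eviction policy and the toy tensor shapes are assumptions, not the paper's specification.

```python
# Hedged sketch of a rolling KV cache for autoregressive extrapolation: keep
# only the most recent `window` frames' key/value tensors so memory stays
# constant as the video grows. The drop-oldest policy is an assumption.
from collections import deque
import torch

class RollingKVCache:
    def __init__(self, window: int):
        self.entries = deque(maxlen=window)   # each entry: (key, value) for one frame

    def append(self, key: torch.Tensor, value: torch.Tensor) -> None:
        # a deque with maxlen silently evicts the oldest frame when full
        self.entries.append((key, value))

    def context(self):
        # concatenate cached keys/values along the token axis for attention
        keys = torch.cat([k for k, _ in self.entries], dim=1)
        values = torch.cat([v for _, v in self.entries], dim=1)
        return keys, values

# toy usage: after generating frame t, cache its KV and attend over the window
cache = RollingKVCache(window=8)
for t in range(32):
    k = torch.randn(1, 16, 64)   # (batch, tokens per frame, dim), toy shapes
    v = torch.randn(1, 16, 64)
    cache.append(k, v)
    ctx_k, ctx_v = cache.context()   # at most 8 * 16 cached tokens, regardless of t
```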

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same self-conditioning idea could reduce error accumulation in other long-horizon autoregressive tasks such as audio or 3D scene generation.
  • Rolling KV caches may allow extension to substantially longer output sequences without proportional memory growth.
  • Stochastic gradient truncation could be combined with other efficiency techniques to scale the method to higher-resolution video.

Load-bearing premise

That autoregressive rollout with KV caching during training using a few-step diffusion model and stochastic gradient truncation accurately simulates inference conditions without introducing substantial new biases or quality degradation.
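One way to read "stochastic gradient truncation" is that backpropagation through the chain of self-generated frames is cut at a randomly sampled point, so gradients from the video-level loss only travel a bounded distance back through the rollout. The sketch below illustrates that reading with `tensor.detach()`; the uniform choice of cut point and the toy linear generator are assumptions for illustration, not the paper's stated rule.

```python
# Hedged sketch of stochastic gradient truncation over an autoregressive rollout:
# sample a cut point and detach frames generated before it, so the video-level
# loss cannot backpropagate through the early part of the generation chain.
import random
import torch

def rollout_with_truncation(step_fn, init_frame: torch.Tensor, num_frames: int) -> torch.Tensor:
    frames = [init_frame]
    cut = random.randint(0, num_frames - 1)     # where the backprop chain is cut
    for t in range(1, num_frames):
        prev = frames[-1]
        if t <= cut:
            prev = prev.detach()                # block gradients from flowing into earlier frames
        frames.append(step_fn(prev))
    return torch.stack(frames, dim=0)

# toy usage: a linear map stands in for the (much larger) diffusion generator
generator = torch.nn.Linear(8, 8)
video = rollout_with_truncation(generator, torch.randn(8), num_frames=16)
video.sum().backward()   # the frame-to-frame chain is only traversed after the cut point
```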

What would settle it

A side-by-side evaluation, run under identical inference settings, in which Self Forcing models either score lower on perceptual quality than non-causal diffusion models or fail to reach sub-second latency on a single GPU.

read the original abstract

We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs during inference. Unlike prior methods that denoise future frames based on ground-truth context frames, Self Forcing conditions each frame's generation on previously self-generated outputs by performing autoregressive rollout with key-value (KV) caching during training. This strategy enables supervision through a holistic loss at the video level that directly evaluates the quality of the entire generated sequence, rather than relying solely on traditional frame-wise objectives. To ensure training efficiency, we employ a few-step diffusion model along with a stochastic gradient truncation strategy, effectively balancing computational cost and performance. We further introduce a rolling KV cache mechanism that enables efficient autoregressive video extrapolation. Extensive experiments demonstrate that our approach achieves real-time streaming video generation with sub-second latency on a single GPU, while matching or even surpassing the generation quality of significantly slower and non-causal diffusion models. Project website: http://self-forcing.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Self Forcing, a training paradigm for autoregressive video diffusion models that mitigates exposure bias by performing autoregressive rollout with KV caching during training, conditioning each frame on previously self-generated outputs rather than ground-truth context. It employs a few-step diffusion model and stochastic gradient truncation to maintain training efficiency, introduces a rolling KV cache for extrapolation, and applies a holistic video-level loss. Experiments claim this enables real-time streaming video generation with sub-second latency on a single GPU while matching or surpassing the quality of slower, non-causal diffusion baselines.

Significance. If the training-time approximations faithfully reproduce inference conditions, the approach could enable practical causal autoregressive video generation for low-latency applications. The KV-caching and rolling-cache mechanisms provide concrete efficiency gains, and the shift to self-conditioned training with video-level supervision is a direct procedural response to exposure bias.

major comments (2)
  1. [Section 3] Training procedure (Section 3): The few-step diffusion approximation combined with stochastic gradient truncation is presented as sufficient to simulate full inference-time error accumulation and KV-cache evolution, yet no quantitative analysis (e.g., comparison of noise schedules, drift metrics, or cache-state divergence) is provided to bound the discrepancy; this directly underpins the headline claim that Self Forcing closes the train-test gap without quality degradation.
  2. [Section 4] Experimental validation (Section 4): The reported sub-second latency and quality parity with non-causal models rely on the truncated training procedure, but the manuscript lacks ablations isolating the effects of step count and truncation probability on long-horizon consistency and cache behavior; without these, it is unclear whether the performance gains are robust or artifacts of the efficiency shortcuts.
minor comments (2)
  1. [Abstract] The abstract states that supervision occurs 'through a holistic loss at the video level,' but the precise formulation of this loss relative to the standard per-frame diffusion objective is not shown as an equation; adding it would clarify the difference from prior frame-wise training (one possible contrast is sketched after these comments).
  2. Figure captions and method diagrams would benefit from explicit annotation of the stochastic truncation points and the rolling KV-cache update rule to improve reproducibility.
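For orientation only, the contrast asked for in the first minor comment could be written roughly as follows. The abstract does not give the loss, so the distribution-matching form of the video-level term is an assumption for illustration, not the authors' equation.

```latex
% Frame-wise objective (standard conditional diffusion loss, one term per frame,
% conditioned on ground-truth context x_{<i}):
\mathcal{L}_{\text{frame}} =
  \sum_{i=1}^{T} \mathbb{E}_{t,\epsilon}
  \left\| \epsilon - \epsilon_\theta\!\left(x_i^{t} \mid x_{<i},\, t\right) \right\|^2

% Holistic video-level objective: a functional D of the full self-generated
% rollout, where each frame is conditioned on the model's own earlier outputs.
% The choice of D (e.g., a distribution-matching or adversarial divergence)
% is an assumption for illustration.
\mathcal{L}_{\text{video}} =
  \mathbb{E}\!\left[ D\!\left( \hat{x}_{1:T} \right) \right],
  \qquad \hat{x}_i \sim p_\theta\!\left( \cdot \mid \hat{x}_{<i} \right)
```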

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We appreciate the focus on the training approximations and experimental rigor. Below we address each major comment point by point and describe the revisions we will make to strengthen the paper.

read point-by-point responses
  1. Referee: [Section 3] Training procedure (Section 3): The few-step diffusion approximation combined with stochastic gradient truncation is presented as sufficient to simulate full inference-time error accumulation and KV-cache evolution, yet no quantitative analysis (e.g., comparison of noise schedules, drift metrics, or cache-state divergence) is provided to bound the discrepancy; this directly underpins the headline claim that Self Forcing closes the train-test gap without quality degradation.

    Authors: We agree that explicit quantitative bounds on the discrepancy would strengthen the justification for the few-step diffusion and stochastic gradient truncation. The current manuscript demonstrates effectiveness through end-to-end video-level quality metrics, latency results, and comparisons to non-causal baselines, which indirectly support that the approximations preserve the benefits of self-forcing. To directly address the concern, we will add a new analysis subsection in Section 3 that includes quantitative comparisons such as cache-state divergence (measured via L2 distance on KV tensors) and drift metrics (e.g., accumulated noise schedule deviation) between truncated and full rollouts on short sequences. This will provide explicit bounds and better support the claim that the train-test gap is closed without quality degradation (one such measurement is sketched after these responses). revision: yes

  2. Referee: [Section 4] Experimental validation (Section 4): The reported sub-second latency and quality parity with non-causal models rely on the truncated training procedure, but the manuscript lacks ablations isolating the effects of step count and truncation probability on long-horizon consistency and cache behavior; without these, it is unclear whether the performance gains are robust or artifacts of the efficiency shortcuts.

    Authors: We acknowledge that dedicated ablations isolating step count and truncation probability would improve clarity on robustness. The existing experiments already vary sequence lengths and report consistent quality across different video durations, with the rolling KV cache enabling extrapolation. However, to isolate these hyperparameters, we will expand Section 4 with new ablation tables that vary diffusion steps (1, 2, 4, 8) and truncation probabilities (0.1, 0.3, 0.5), reporting metrics for long-horizon consistency (e.g., temporal coherence scores) and cache behavior (e.g., cache hit rates and state divergence over 100+ frames). These additions will confirm that the gains are not artifacts of the shortcuts. revision: yes
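As a concrete picture of the measurements promised above (the cache-state divergence in response 1 and the step-count / truncation-probability grid in response 2), here is a hedged sketch. The random tensors are stand-ins for real rollout outputs, and the grid values simply echo the numbers quoted in the responses; this is not the authors' evaluation code.

```python
# Hedged sketch of the cache-state divergence measurement and ablation grid
# described in the simulated rebuttal: per-frame L2 distance between KV tensors
# from a truncated-gradient rollout and a full rollout.
import torch

def cache_divergence(kv_full: list[torch.Tensor], kv_trunc: list[torch.Tensor]) -> list[float]:
    """Per-frame L2 distance between two sequences of cached KV tensors."""
    return [torch.linalg.vector_norm(a - b).item() for a, b in zip(kv_full, kv_trunc)]

for num_steps in (1, 2, 4, 8):              # diffusion step counts from response 2
    for trunc_prob in (0.1, 0.3, 0.5):      # truncation probabilities from response 2
        kv_full = [torch.randn(16, 64) for _ in range(100)]              # full-rollout stand-in
        kv_trunc = [kv + 0.01 * torch.randn_like(kv) for kv in kv_full]  # truncated-rollout stand-in
        drift = cache_divergence(kv_full, kv_trunc)
        print(f"steps={num_steps} p_trunc={trunc_prob} mean divergence={sum(drift)/len(drift):.4f}")
```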

Circularity Check

0 steps flagged

No significant circularity; training procedure is a self-contained procedural change.

full rationale

The paper presents Self Forcing as a training strategy that performs autoregressive rollout with KV caching to address exposure bias, supplemented by few-step diffusion and stochastic gradient truncation for tractability. No equations, fitted parameters, or self-citations are shown to reduce the claimed performance gains (real-time causal generation matching non-causal baselines) to the inputs by construction. The derivation chain consists of standard diffusion objectives with modified conditioning and rollout, evaluated externally against baselines. This is the most common honest finding for method papers without mathematical self-reference.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the domain assumption that few-step diffusion plus stochastic gradient truncation preserves sufficient training signal for the autoregressive objective, plus standard diffusion model assumptions about noise schedules and conditioning.

free parameters (1)
  • number of diffusion steps
    Few-step diffusion model chosen to balance training speed and quality; exact count not specified in abstract.
axioms (1)
  • domain assumption Few-step diffusion approximates the full multi-step denoising process sufficiently for training the autoregressive objective
    Invoked to enable efficient autoregressive rollout during training.
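As a small illustration of the "number of diffusion steps" free parameter, the sketch below shows how a few-step sampler visits only a coarse subset of the timesteps a full sampler would traverse; the linear schedule and the specific step counts are assumptions, not values taken from the paper.

```python
# Hedged illustration of the free parameter above: a few-step sampler subsamples
# the timestep schedule of a full multi-step denoising process.
import torch

def step_schedule(num_steps: int, total_timesteps: int = 1000) -> torch.Tensor:
    # evenly spaced timesteps from high noise (t = total_timesteps - 1) down to t = 0
    return torch.linspace(total_timesteps - 1, 0, num_steps).round().long()

print(step_schedule(4))        # tensor([999, 666, 333,   0]) -- the few-step regime
print(step_schedule(50)[:5])   # a full sampler visits many more intermediate noise levels
```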

pith-pipeline@v0.9.0 · 5502 in / 1215 out tokens · 30872 ms · 2026-05-11T01:30:50.959758+00:00 · methodology

discussion (0)


Forward citations

Cited by 54 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ReconPhys: Reconstruct Appearance and Physical Attributes from Single Video

    cs.CV 2026-04 unverdicted novelty 8.0

    ReconPhys is the first feedforward neural network that jointly reconstructs 3D geometry and appearance via Gaussian Splatting while estimating physical attributes from a single monocular video using self-supervised training.

  2. KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

    cs.CV 2026-05 unverdicted novelty 7.0

    KVPO aligns streaming autoregressive video generators with human preferences via ODE-native GRPO, using KV cache for semantic exploration and TVE for velocity-based policy modeling, yielding gains in quality and alignment.

  3. Discrete Stochastic Localization for Non-autoregressive Generation

    cs.LG 2026-05 unverdicted novelty 7.0

    Discrete Stochastic Localization provides a continuous-state framework with SNR-invariant denoisers on unit-sphere embeddings, enabling one network to support multiple per-token noise paths and improving MAUVE on OpenWebText.

  4. CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

    cs.CV 2026-05 unverdicted novelty 7.0

    CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.

  5. DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies

    cs.RO 2026-05 unverdicted novelty 7.0

    DreamAvoid uses a Dream Trigger, Action Proposer, and Dream Evaluator trained on success/failure/boundary data to let VLA policies avoid critical-phase failures via test-time future dreaming.

  6. FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction

    cs.CV 2026-05 unverdicted novelty 7.0

    FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.

  7. Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.

  8. AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    AsymTalker maintains identity consistency in long-term diffusion talking-head videos by encoding temporal references from a static image and training a student model under inference-like conditions via asymmetric dist...

  9. Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

    cs.CV 2026-04 unverdicted novelty 7.0

    Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x highe...

  10. Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

    cs.AI 2026-04 unverdicted novelty 7.0

    Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...

  11. Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.

  12. X-Cache: Cross-Chunk Block Caching for Few-Step Autoregressive World Models Inference

    cs.CV 2026-04 unverdicted novelty 7.0

    X-Cache achieves 71% block skip rate and 2.6x wall-clock speedup in few-step autoregressive multi-camera driving world models via cross-chunk residual caching with dual-metric gating and forced KV updates.

  13. MultiWorld: Scalable Multi-Agent Multi-View Video World Models

    cs.CV 2026-04 unverdicted novelty 7.0

    MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.

  14. Speculative Decoding for Autoregressive Video Generation

    cs.CV 2026-04 conditional novelty 7.0

    A training-free speculative decoding method for block-based autoregressive video diffusion uses a quality router on worst-frame ImageReward scores to accept drafter proposals, achieving up to 2.09x speedup at 95.7% qu...

  15. Efficient Video Diffusion Models: Advancements and Challenges

    cs.CV 2026-04 unverdicted novelty 7.0

    A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

  16. DiV-INR: Extreme Low-Bitrate Diffusion Video Compression with INR Conditioning

    eess.IV 2026-04 unverdicted novelty 7.0

    DiV-INR integrates implicit neural representations as conditioning signals for diffusion models to achieve better perceptual quality than HEVC, VVC, and prior neural codecs at extremely low bitrates under 0.05 bpp.

  17. Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis

    cs.CV 2026-04 unverdicted novelty 7.0

    Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressi...

  18. Quantitative Video World Model Evaluation for Geometric-Consistency

    cs.CV 2026-05 unverdicted novelty 6.0

    PDI-Bench computes 3D projective residuals from segmented and tracked points to quantify geometric inconsistency in AI-generated videos.

  19. Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video

    cs.CV 2026-05 unverdicted novelty 6.0

    Warp-as-History enables zero-shot camera trajectory following in frozen video models by supplying camera-warped pseudo-history, with single-video LoRA fine-tuning improving generalization to unseen videos.

  20. Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

    cs.CV 2026-05 unverdicted novelty 6.0

    Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.

  21. Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.

  22. SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SWIFT introduces a semantic injection cache with head-wise updates and an adaptive dynamic window plus segment anchors to achieve efficient multi-prompt long video generation at 22.6 FPS while preserving quality in ca...

  23. Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Unison introduces a unified framework using semantic-guided harmonization and bidirectional cross-modal forcing to generate human-centric videos with improved synchronization between motion, speech, and sound effects.

  24. ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models

    cs.CV 2026-05 unverdicted novelty 6.0

    ACWM-Phys benchmark shows action-conditioned world models generalize on simple geometric interactions but drop sharply on deformable contacts, high-dimensional control, and complex articulated motion, indicating relia...

  25. FlashMol: High-Quality Molecule Generation in as Few as Four Steps

    cs.LG 2026-05 unverdicted novelty 6.0

    FlashMol produces chemically valid 3D molecules in 4 steps via distribution matching distillation with respaced timesteps and Jensen-Shannon regularization, matching or exceeding 1000-step teacher performance on QM9 a...

  26. RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control

    cs.CV 2026-05 unverdicted novelty 6.0

    RealCam is a causal autoregressive model for real-time camera-controlled video-to-video generation, using cross-frame in-context teacher distillation and loop-closed data augmentation to achieve high fidelity and consistency.

  27. D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.

  28. Stream-T1: Test-Time Scaling for Streaming Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...

  29. Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models

    cs.CV 2026-05 unverdicted novelty 6.0

    M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and lon...

  30. AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.

  31. AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    AsymK-Talker introduces kernel-conditioned loop generation, temporal reference encoding, and asymmetric kernel distillation to achieve real-time, drift-resistant talking head synthesis from audio using diffusion models.

  32. Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.

  33. CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.

  34. Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.

  35. Repurposing 3D Generative Model for Autoregressive Layout Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    LaviGen turns 3D generative models into an autoregressive layout generator that models geometric and physical constraints, delivering 19% higher physical plausibility and 65% faster inference on the LayoutVLM benchmark.

  36. Human Cognition in Machines: A Unified Perspective of World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  37. From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation

    cs.CV 2026-04 unverdicted novelty 6.0

    Interpolating exo and ego videos into a single continuous sequence lets diffusion sequence models generate more coherent first-person videos than direct conditioning, even without pose interpolation.

  38. DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer

    cs.CV 2026-04 unverdicted novelty 6.0

    RTR-DiT distills a bidirectional DiT teacher into an autoregressive few-step model using Self Forcing and Distribution Matching Distillation, plus a reference-preserving KV cache, to enable stable real-time text- and ...

  39. Lyra 2.0: Explorable Generative 3D Worlds

    cs.CV 2026-04 unverdicted novelty 6.0

    Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.

  40. Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation

    cs.CV 2026-04 conditional novelty 6.0

    Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.

  41. Lighting-grounded Video Generation with Renderer-based Agent Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    LiVER conditions video diffusion models on renderer-derived 3D control signals for disentangled, editable control over object layout, lighting, and camera trajectory.

  42. LPM 1.0: Video-based Character Performance Model

    cs.CV 2026-04 unverdicted novelty 6.0

    LPM 1.0 generates infinite-length, identity-stable, real-time audio-visual conversational performances for single characters using a distilled causal diffusion transformer and a new benchmark.

  43. INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

    cs.CV 2026-04 unverdicted novelty 6.0

    INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...

  44. Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Salt improves low-step video generation quality by adding endpoint-consistent regularization to distribution matching distillation and using cache-conditioned feature alignment for autoregressive models.

  45. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  46. LongLive: Real-time Interactive Long Video Generation

    cs.CV 2025-09 conditional novelty 6.0

    LongLive is a causal autoregressive video generator that produces up to 240-second interactive videos at 20.7 FPS on one H100 GPU after 32 GPU-days of fine-tuning from a 1.3B short-clip model.

  47. Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training

    cs.CV 2026-05 unverdicted novelty 5.0

    Sword improves world model simulators for VLA policies by disentangling visual style from dynamics and bootstrapping latents for better consistency, outperforming baselines on LIBERO in generalization and RL post-trai...

  48. A Systematic Post-Train Framework for Video Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.

  49. PortraitDirector: A Hierarchical Disentanglement Framework for Controllable and Real-time Facial Reenactment

    cs.CV 2026-04 unverdicted novelty 5.0

    PortraitDirector uses hierarchical disentanglement of spatial physical motions and semantic emotions to deliver controllable, high-fidelity real-time facial reenactment at 20 FPS.

  50. TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    TurboTalk uses progressive distillation from 4 steps to 1 step with distribution matching and adversarial training to achieve 120x faster single-step audio-driven talking avatar video generation.

  51. MuSteerNet: Human Reaction Generation from Videos via Observation-Reaction Mutual Steering

    cs.CV 2026-03 unverdicted novelty 5.0

    MuSteerNet generates realistic 3D human reactions from videos by mutually steering visual observations and reaction motions to reduce content mismatch.

  52. HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

    cs.CV 2026-04 unverdicted novelty 4.0

    HY-World 2.0 generates and reconstructs high-fidelity navigable 3D Gaussian Splatting worlds from text, images, or videos via upgraded panorama, planning, expansion, and composition modules, with released code claimin...

  53. Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

    cs.CV 2026-04 unverdicted novelty 4.0

    Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...

  54. Evolution of Video Generative Foundations

    cs.CV 2026-04 unverdicted novelty 2.0

    This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
