Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

Chongxuan Li; Guande He; Hang Su; Hongzhou Zhu; Jun Zhu; Min Zhao

arxiv: 2602.02214 · v5 · pith:77QS24PVnew · submitted 2026-02-02 · 💻 cs.CV

Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, Jun Zhu

Pith reviewed 2026-05-22 11:42 UTC · model grok-4.3

classification 💻 cs.CV

keywords autoregressive video generation · diffusion distillation · causal attention · real-time interactive video · ODE initialization · DMD procedure · Self Forcing

The pith

Causal Forcing uses an autoregressive teacher for ODE initialization to recover the teacher's flow map when distilling into causal video models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that distilling bidirectional video diffusion models into autoregressive students creates an architectural mismatch because bidirectional teachers violate frame-level injectivity under the PF-ODE. This forces the student toward a conditional-expectation solution rather than the desired flow map. Causal Forcing fixes the initialization step by switching to an autoregressive teacher before running the same DMD procedure used in prior work. The resulting models generate higher-quality real-time interactive video. A sympathetic reader cares because the change directly improves metrics that matter for dynamic, instruction-following video without changing the downstream training loop.

Core claim

By initializing the autoregressive student via ODE distillation from an autoregressive teacher, Causal Forcing satisfies the frame-level injectivity condition that bidirectional teachers violate, thereby recovering the teacher's flow map rather than converging to a conditional-expectation solution, after which the DMD procedure produces superior few-step causal video generators.
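
The difference between recovering a map and converging to a conditional-expectation solution can be illustrated with a toy regression. This is a minimal sketch, not the paper's method: it assumes scalar "noisy" inputs and "clean" targets, and the helper `mse_optimal_lookup` is a hypothetical name introduced here. When one input is paired with two different targets, the squared-error-optimal prediction is their mean, which matches neither target.

```python
from collections import defaultdict

def mse_optimal_lookup(pairs):
    """MSE-optimal predictor on discrete inputs: the mean target per input."""
    buckets = defaultdict(list)
    for x, y in pairs:
        buckets[x].append(y)
    return {x: sum(ys) / len(ys) for x, ys in buckets.items()}

# Injective pairing (illustrative data): each input has exactly one target.
injective_pairs = [(0.0, -1.0), (1.0, 1.0)]
# Non-injective pairing: the same input 0.0 appears with two different targets.
non_injective_pairs = [(0.0, -1.0), (0.0, 1.0)]

fit_injective = mse_optimal_lookup(injective_pairs)
fit_non_injective = mse_optimal_lookup(non_injective_pairs)

print(fit_injective)      # every input keeps its own target
print(fit_non_injective)  # the shared input collapses to the average target
```

In the non-injective case the fitted value lies between the two targets, which is the toy analogue of the blurred output attributed to the conditional-expectation solution.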

What carries the argument

Causal Forcing, which replaces the bidirectional teacher with an autoregressive teacher solely for the ODE initialization step to enforce injectivity before applying DMD.

If this is right

Autoregressive video generators distilled this way outperform prior Self Forcing baselines on dynamic degree, vision reward, and instruction following.
Causal attention can replace full attention in the student without the performance penalty previously observed.
Real-time interactive video generation becomes viable at higher visual and temporal fidelity.
The same two-stage recipe (AR-teacher ODE init followed by DMD) applies to other diffusion-based sequence models.
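
The causal-versus-full attention contrast in the list above can be sketched as a mask over frame tokens. This is an illustrative construction under stated assumptions: strict frame-level block causality and a fixed number of tokens per frame are assumed here, and `frame_attention_mask` is a hypothetical helper, not an API from the paper's code.

```python
def frame_attention_mask(num_frames, tokens_per_frame, causal):
    """Boolean mask[q][k]: True if query token q may attend to key token k.

    With causal=True, a token sees every token of its own frame and of
    earlier frames; with causal=False, every token sees every token.
    """
    n = num_frames * tokens_per_frame
    mask = []
    for q in range(n):
        q_frame = q // tokens_per_frame
        row = []
        for k in range(n):
            k_frame = k // tokens_per_frame
            row.append(k_frame <= q_frame if causal else True)
        mask.append(row)
    return mask

full_mask = frame_attention_mask(3, 2, causal=False)
causal_mask = frame_attention_mask(3, 2, causal=True)

for row in causal_mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The printed pattern is block lower-triangular: tokens in the first frame see only that frame, while tokens in the last frame see the whole sequence.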

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The technique may extend to distilling other teacher-student pairs that differ in causality or attention scope.
Longer video sequences or higher frame rates could be tested to check whether the injectivity benefit persists.
Interactive applications such as live editing or simulation may see reduced latency once the distilled models run at full speed.
Combining Causal Forcing with additional compression steps could push generation toward sub-frame latency.

Load-bearing premise

An autoregressive teacher produces frame-level injectivity under the PF-ODE so that the flow map can be recovered.

What would settle it

A controlled comparison in which an autoregressive teacher for ODE initialization yields equal or worse final AR student quality than a bidirectional teacher on the same downstream DMD stage would falsify the central claim.

Figures

Figures reproduced from arXiv: 2602.02214 by Chongxuan Li, Guande He, Hang Su, Hongzhou Zhu, Jun Zhu, Min Zhao.

**Figure 1.** Limitations of existing methods. While distilling from the same bidirectional base model, SOTA autoregressive diffusion distillation methods like Self Forcing still lag significantly behind standard DMD, which distills a bidirectional student.

**Figure 2.** DMD fails to bridge the architectural gap. Initializing the autoregressive student with standard DMD removes the sampling-step gap and isolates the architectural gap, yet still underperforms standard DMD. This indicates that the architectural gap cannot be resolved by the DMD stage and should instead be addressed during the preceding ODE initialization.

**Figure 3.** Necessary principle for ODE initialization and why Self Forcing is flawed. ODE distillation requires injective paired data. (a) Standard ODE distillation, which distills a bidirectional teacher to a bidirectional student, satisfies this requirement at the video level. (b) For an AR student, injectivity must hold at the frame level: each noisy frame maps to a unique clean frame via the PF-ODE of the AR teac…

**Figure 4.** TF vs. DF in AR diffusion training. Contrary to common belief, DF leads to video collapse due to the training-inference gap, whereas TF produces higher visual quality.

**Figure 5.** Performance comparison between Self Forcing (SF) and ours. DMD with Self Forcing's ODE initialization shows weaker dynamics and artifacts, whereas with causal ODE initialization, it achieves stronger dynamics with higher visual fidelity. From Sec. 4.1 (Setup), implementation details: following Self Forcing (Huang et al., 2025a), we adopt Wan2.1-T2V-1.3B (Wan et al., 2025) as our base model to fine-tune…

**Figure 6.** Qualitative comparisons with existing methods. Our method achieves substantially higher dynamics and better visual quality than existing distilled autoregressive video models (CausVid and Self Forcing), while matching or even surpassing bidirectional diffusion models (Wan2.1). More video demos and all the prompts used in this paper are provided in the supplementary materials.

**Figure 7.** Performance comparison with 4-step generation before the DMD stage. Without having reached the DMD stage yet, we directly compare the 4-step generation of the autoregressive diffusion model with the 4-step generation of the causal ODE-distilled model. Autoregressive diffusion exhibits inter-frame abrupt changes, indicating suboptimal causality under 4 steps, whereas the causal ODE-distilled model remains m…

**Figure 8.** Performance comparison of DMD with different initialization. DMD with Self Forcing's ODE initialization shows weak dynamics and abrupt artifacts. Initializing with TF-trained autoregressive diffusion brings a large improvement but still exhibits abrupt changes (e.g., two red flowers turning into one), whereas causal ODE initialization yields the highest quality and the most stable results.

**Figure 9.** Student initialization is not the bottleneck of ODE distillation. With causal ODE distillation, the student with a bidirectional initial model achieves similar performance to that with a causal initial model, both better than Self Forcing's ODE distillation.

**Figure 10.** Comparison between asymmetric CD and causal CD. Asymmetric CD appears highly blurry and exhibits abrupt artifacts, whereas causal CD results remain of much better quality and more stable. … model $v_\theta$, an $x_0$-prediction form for $G_\theta$ already satisfies the required boundary conditions, without any additional design: $G_\theta(x^i, x^{<i}_{\mathrm{gt}}, t) = x^i - t\, v_\theta(x^i, x^{<i}_{\mathrm{gt}}, t)$ (31). This simplified design may not be optimal …
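
The x0-prediction form in Eq. (31), quoted in the excerpt after the Figure 10 caption, can be checked numerically. This is a hedged sketch: it assumes a linear interpolation path x_t = (1 - t) * x0 + t * eps with velocity v = eps - x0, a convention the excerpt does not state, and it drops the conditioning on earlier frames for brevity. The function name `x0_prediction` is hypothetical.

```python
import random

def x0_prediction(x_t, v, t):
    """x0-prediction form G = x_t - t * v, applied elementwise."""
    return [xi - t * vi for xi, vi in zip(x_t, v)]

random.seed(0)
x0 = [random.gauss(0.0, 1.0) for _ in range(4)]   # toy "clean frame"
eps = [random.gauss(0.0, 1.0) for _ in range(4)]  # toy noise sample
t = 0.7

# Assumed path and velocity (not stated in the excerpt).
x_t = [(1 - t) * a + t * b for a, b in zip(x0, eps)]
v_true = [b - a for a, b in zip(x0, eps)]

# With the exact velocity, the form recovers the clean sample.
recovered = x0_prediction(x_t, v_true, t)
max_err = max(abs(r - a) for r, a in zip(recovered, x0))

# Boundary condition: at t = 0 the output equals the input for any velocity.
boundary = x0_prediction(x0, eps, 0.0)

print(max_err)
print(boundary == x0)
```

Under these assumptions the boundary condition at t = 0 holds by construction, which is consistent with the excerpt's remark that no additional design is needed.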

Original abstract

To achieve real-time interactive video generation, current methods distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models, facing an architectural gap when full attention is replaced by causal attention. However, existing approaches do not bridge this gap theoretically. They initialize the AR student via ODE distillation, which requires frame-level injectivity, where each noisy frame must map to a unique clean frame under the PF-ODE of an AR teacher. Distilling an AR student from a bidirectional teacher violates this condition, preventing recovery of the teacher's flow map and instead inducing a conditional-expectation solution, which degrades performance. To address this issue, we propose Causal Forcing, which uses an autoregressive teacher for ODE initialization to bridge the architectural gap, and then applies the same DMD procedure as in Self Forcing. Empirical results show that our method outperforms all baselines across all metrics, surpassing the SOTA Self Forcing by 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following. Project page: https://thu-ml.github.io/CausalForcing.github.io/. Code: https://github.com/thu-ml/Causal-Forcing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Causal Forcing adds an autoregressive teacher for ODE initialization to bridge the causal gap in video diffusion distillation, but the injectivity assumption stays unverified.

The main thing to know is that this paper proposes using an autoregressive teacher for the ODE initialization step when distilling bidirectional video diffusion models into fast autoregressive ones. The authors claim this restores frame-level injectivity under the probability flow ODE, letting the student recover the teacher's flow map instead of settling for a conditional expectation that hurts quality. They then run the same DMD procedure as Self Forcing on top of that initialization and report concrete gains over the prior state of the art.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces Causal Forcing to distill pretrained bidirectional video diffusion models into few-step autoregressive models for real-time interactive video generation. It identifies an architectural gap arising from replacing full attention with causal attention and proposes using an autoregressive teacher for ODE initialization to satisfy frame-level injectivity under the PF-ODE (allowing recovery of the teacher's flow map rather than a conditional-expectation solution), followed by the DMD procedure from Self Forcing. Empirical results claim outperformance over all baselines, including gains of 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following relative to Self Forcing.

Significance. If the frame-level injectivity assumption holds, the work supplies a mechanistically motivated fix for the bidirectional-to-autoregressive distillation gap and reports concrete metric improvements in dynamic content and instruction adherence. The public release of code and a project page is a clear strength for reproducibility and follow-up work.

major comments (1)

Abstract: the central mechanistic claim is that 'frame-level injectivity' holds under the PF-ODE for an autoregressive teacher (due to causal attention) but is violated by bidirectional teachers (due to future context). No formal argument, injectivity proof, or numerical verification (e.g., checking uniqueness of the noisy-to-clean mapping on held-out frames) is supplied. This assumption is load-bearing for attributing the reported gains to flow-map recovery rather than to other factors such as training schedule or initialization details.
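
The numerical verification requested above could take a shape like the following. This is only an illustration of the definition being tested, not a procedure from the manuscript: `injectivity_violations` and its tolerances are hypothetical, and the data are made up. With continuous noise, exact input collisions are unlikely, so a real check would need a more careful design.

```python
def injectivity_violations(pairs, in_tol=1e-6, out_tol=1e-3):
    """Return index pairs (i, j) whose inputs nearly coincide but whose outputs differ."""
    def dist(a, b):
        # Max-norm distance between two equal-length vectors.
        return max(abs(p - q) for p, q in zip(a, b))

    violations = []
    for i in range(len(pairs)):
        for j in range(i + 1, len(pairs)):
            (x_i, y_i), (x_j, y_j) = pairs[i], pairs[j]
            if dist(x_i, x_j) <= in_tol and dist(y_i, y_j) > out_tol:
                violations.append((i, j))
    return violations

# Illustrative (noisy frame, clean frame) pairs; values are invented.
ar_like = [((0.1, 0.2), (1.0, 0.0)), ((0.3, 0.4), (0.0, 1.0))]
bidirectional_like = [((0.1, 0.2), (1.0, 0.0)), ((0.1, 0.2), (0.0, 1.0))]

print(injectivity_violations(ar_like))
print(injectivity_violations(bidirectional_like))
```

An empty result means no two near-identical inputs disagree on their output. Any reported index pair is a counterexample to injectivity on that sample.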

minor comments (1)

The abstract reports specific percentage improvements but does not indicate whether they are averaged over multiple seeds or include error bars; adding this information would strengthen the empirical section.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. We appreciate the recognition of the mechanistic motivation and the value of our public code release. We address the major comment below.

Referee: Abstract: the central mechanistic claim is that 'frame-level injectivity' holds under the PF-ODE for an autoregressive teacher (due to causal attention) but is violated by bidirectional teachers (due to future context). No formal argument, injectivity proof, or numerical verification (e.g., checking uniqueness of the noisy-to-clean mapping on held-out frames) is supplied. This assumption is load-bearing for attributing the reported gains to flow-map recovery rather than to other factors such as training schedule or initialization details.

Authors: We agree that a more explicit justification would strengthen the manuscript. The core intuition, as stated in the paper, is that causal attention in the autoregressive teacher restricts the PF-ODE evolution of frame i to depend only on frames 1 through i. This per-frame conditioning makes the mapping from a noisy frame to its clean counterpart unique under the teacher's flow, satisfying frame-level injectivity. Bidirectional attention, by contrast, allows future-frame information to influence the ODE trajectory of earlier frames, rendering the per-frame mapping non-injective and yielding a conditional-expectation solution instead of the teacher's flow map. While the initial submission relied on this architectural reasoning without a formal injectivity proof or additional numerical checks, we will add a dedicated paragraph in Section 3 together with a simple numerical verification on a low-dimensional toy diffusion model to confirm uniqueness of the noisy-to-clean mapping for causal versus bidirectional teachers. We believe these additions will better isolate the contribution of the AR-teacher initialization from other training factors; the existing ablations already show that replacing the bidirectional teacher with an autoregressive one yields the reported gains even under matched schedules.

revision: partial
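
The dependence structure described in this response can be illustrated with a toy attention layer. This is a sketch under strong simplifications, not the toy diffusion experiment the authors promise: frames are scalars, queries, keys and values all equal the frame value, and no PF-ODE is integrated. It only shows that an early frame's output is unaffected by future frames under causal attention and affected under bidirectional attention.

```python
import math

def attention(frames, causal):
    """Toy single-head self-attention over scalar frames with q = k = v."""
    outputs = []
    for i, q in enumerate(frames):
        # Causal: frame i sees frames 0..i only. Bidirectional: sees all frames.
        visible = frames[: i + 1] if causal else frames
        weights = [math.exp(q * k) for k in visible]
        total = sum(weights)
        outputs.append(sum(w * v for w, v in zip(weights, visible)) / total)
    return outputs

video_a = [0.5, 1.0, -1.0]
video_b = [0.5, 1.0, 2.0]  # same first two frames, different final frame

causal_a, causal_b = attention(video_a, True), attention(video_b, True)
full_a, full_b = attention(video_a, False), attention(video_b, False)

print(causal_a[:2] == causal_b[:2])  # early frames ignore the future frame
print(full_a[0] == full_b[0])        # first frame changes with the future frame
```

Under the bidirectional variant, the same first frame produces different outputs depending on what follows it, which is the frame-level ambiguity the response attributes to bidirectional teachers.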

Circularity Check

0 steps flagged

Minor self-citation to prior DMD procedure; core AR-teacher initialization is independent

The paper re-uses the DMD procedure from Self Forcing but introduces a distinct initialization step that relies on the stated frame-level injectivity property of autoregressive teachers under the PF-ODE. This assumption is presented as a direct consequence of causal attention lacking future context, separate from any fitted parameters or self-referential definitions within the current work. No equation or derivation reduces by construction to prior outputs of the same run, and the empirical comparisons are reported as external validation. The self-citation is not load-bearing for the novel contribution and does not trigger higher circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that an autoregressive teacher satisfies frame-level injectivity under the PF-ODE, plus the empirical claim that the resulting student outperforms baselines. No free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: Frame-level injectivity is required for ODE distillation to recover the teacher's flow map rather than a conditional-expectation solution.
Explicitly invoked in the abstract to explain why bidirectional-to-AR distillation fails.

pith-pipeline@v0.9.0 · 5792 in / 1252 out tokens · 50840 ms · 2026-05-22T11:42:38.845391+00:00 · methodology

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Q-ARVD: Quantizing Autoregressive Video Diffusion Models
cs.CV 2026-05 unverdicted novelty 7.0

Q-ARVD introduces final-quality-aware frame weighting and outlier-aware adaptive dual-scale quantization to enable accurate low-bit inference for autoregressive video diffusion models.
Goodbye Drift: Anchored Tree Sampling for Long-Horizon Video-to-Video Generation
cs.CV 2026-05 unverdicted novelty 7.0

Anchored Tree Sampling converts horizon-compounding drift into anchor-bounded drift by organizing video generation as a sparse-to-dense tree of imputations instead of left-to-right autoregressive rollout.
LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation
cs.CV 2026-05 unverdicted novelty 7.0

LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell ...
CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
cs.CV 2026-05 unverdicted novelty 7.0

CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.
HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation
cs.CV 2026-05 conditional novelty 7.0

HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.
MultiWorld: Scalable Multi-Agent Multi-View Video World Models
cs.CV 2026-04 unverdicted novelty 7.0

MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.
Efficient Video Diffusion Models: Advancements and Challenges
cs.CV 2026-04 unverdicted novelty 7.0

A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models
cs.CV 2026-05 unverdicted novelty 6.0

SCOPE adds per-pixel action conditioning to pretrained video diffusion models and releases the CrossFPS multi-game dataset to support cross-game FPS world model simulation with zero-shot transfer.
WorldKV: Efficient World Memory with World Retrieval and Compression
cs.CV 2026-05 unverdicted novelty 6.0

WorldKV enables persistent world memory in autoregressive video diffusion models by selectively retrieving and compressing KV-cache chunks, matching full-cache fidelity at roughly twice the throughput without training.
DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

DySink uses adaptive retrieval of relevant historical frames plus a sink anomaly gate to improve dynamic degree and temporal quality in minute-long autoregressive video generation.
Xiaomi Auto World Model: A Joint World Model Integrating Reconstruction and Generation for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 6.0

Xiaomi EV World Model integrates WorldRec for sparse-query 3D Gaussian reconstruction and WorldGen for fast causal video generation via bidirectional pretraining and causal fine-tuning to support autonomous driving si...
FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization
cs.CV 2026-05 unverdicted novelty 6.0

FashionChameleon achieves interactive multi-garment video customization in real time by training a teacher model with in-context learning on single-garment pairs, applying streaming distillation, and using training-fr...
PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation
cs.CV 2026-05 conditional novelty 6.0

PhyMotion scores generated human videos by grounding recovered 3D poses in a physics simulator across kinematic, contact, and dynamic axes, yielding stronger human correlation and larger RL post-training gains than pr...
Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.
HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation
cs.CV 2026-05 unverdicted novelty 6.0

HorizonDrive is a new anti-drifting autoregressive training and distillation method that enables minute-scale stable driving video rollouts by making the teacher model rollout-capable via scheduled rollout recovery an...
Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models
cs.CV 2026-05 unverdicted novelty 6.0

Forcing-KV applies head-specific static and dynamic pruning to KV caches in AR video diffusion models, achieving over 29 fps, 30% memory reduction, and up to 2.82x speedup at maintained quality.
Human Cognition in Machines: A Unified Perspective of World Models
cs.RO 2026-04 unverdicted novelty 6.0

The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...
Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation
cs.CV 2026-04 conditional novelty 6.0

Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.
Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation
cs.CV 2026-04 unverdicted novelty 6.0

Salt improves low-step video generation quality by adding endpoint-consistent regularization to distribution matching distillation and using cache-conditioned feature alignment for autoregressive models.
Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms
eess.IV 2026-03 unverdicted novelty 6.0

Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.
One-Forcing: Towards Stable One-Step Autoregressive Video Generation
cs.CV 2026-05 unverdicted novelty 5.0

One-Forcing augments DMD with a GAN loss to enable stable one-step causal autoregressive video generation, reporting a VBench score of 83.76 as SOTA among one-step methods.
One Sentence, One Drama: Personalized Short-Form Drama Generation via Multi-Agent Systems
cs.CV 2026-05 unverdicted novelty 5.0

A hierarchical multi-agent framework converts a single sentence into a short drama using debate-based scripting, 3D-grounded first frames for spatial consistency, and multi-stage reviewer loops.
Focused Forcing: Content-Aware Per-Frame KV Selection for Efficient Autoregressive Video Diffusion
cs.CV 2026-05 unverdicted novelty 5.0

Focused Forcing is a training-free per-frame KV selection method that combines attention scores with diversity metrics and head-importance estimation to accelerate autoregressive video diffusion up to 1.48x while impr...
A Systematic Post-Train Framework for Video Generation
cs.CV 2026-04 unverdicted novelty 5.0

A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.
Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory
cs.CV 2026-04 unverdicted novelty 4.0

Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...