Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation
Canonical reference. 78% of citing Pith papers cite this work as background.
abstract
To achieve real-time interactive video generation, current methods distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models, facing an architectural gap when full attention is replaced by causal attention. However, existing approaches do not bridge this gap theoretically. They initialize the AR student via ODE distillation, which requires frame-level injectivity, where each noisy frame must map to a unique clean frame under the PF-ODE of an AR teacher. Distilling an AR student from a bidirectional teacher violates this condition, preventing recovery of the teacher's flow map and instead inducing a conditional-expectation solution, which degrades performance. To address this issue, we propose Causal Forcing, which uses an autoregressive teacher for ODE initialization to bridge the architectural gap, and then applies the same DMD procedure as in Self Forcing. Empirical results show that our method outperforms all baselines across all metrics, surpassing the SOTA Self Forcing by 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following. Project page: https://thu-ml.github.io/CausalForcing.github.io/; code: https://github.com/thu-ml/Causal-Forcing.
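The conditional-expectation effect described in the abstract can be illustrated with a toy regression. The sketch below is not the paper's method: the function name `mse_optimal_prediction` and the scalar "frames" are hypothetical, chosen only to show that when one noisy input is paired with several clean targets, the least-squares solution is their average rather than any single target.

```python
# Toy illustration (an assumption-laden sketch, not the paper's training code):
# least-squares regression on non-injective pairs collapses to a conditional mean.
from statistics import mean


def mse_optimal_prediction(pairs):
    """Return, for each noisy value, the prediction that minimizes squared error.

    For a fixed input, the squared-error minimizer is the mean of all clean
    targets paired with that input, i.e. the conditional expectation.
    """
    groups = {}
    for noisy, clean in pairs:
        groups.setdefault(noisy, []).append(clean)
    return {noisy: mean(cleans) for noisy, cleans in groups.items()}


# Injective pairing: each noisy value has exactly one clean target.
injective = [(0.1, -1.0), (0.9, 1.0)]

# Non-injective pairing: one noisy value is paired with two different targets.
non_injective = [(0.5, -1.0), (0.5, 1.0)]

print(mse_optimal_prediction(injective))      # {0.1: -1.0, 0.9: 1.0}
print(mse_optimal_prediction(non_injective))  # {0.5: 0.0}
```

In the injective case the regression recovers each target exactly; in the non-injective case it returns 0.0, a value that matches neither target.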
years
2026: 50
citing papers explorer
-
VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion
VideoMLA applies multi-head latent attention with 3D-RoPE decoupling to autoregressive video diffusion, delivering 92.7% KV memory reduction while matching short-horizon baselines and leading long-horizon VBench scores.
-
Towards Memory-Efficient Autoregressive Video Generation via Instance-Specific Parametric Absorption
ISPA reduces KV cache size by up to 50% in AR video models by transitioning layers to local attention and applying instance-specific least-squares weight modulation to compensate for lost history.
-
TempAct: Advancing Temporal Plausibility in Autoregressive Video Generation via Planner-Executor RL
TempAct introduces a planner-executor RL framework with hierarchical group exploration and rewards to improve temporal consistency in autoregressive video diffusion models.
-
TetherCache: Stabilizing Autoregressive Long-Form Video Generation with Gated Recall and Trusted Alignment
TetherCache organizes KV-cache into sink, memory, and recent regions and applies gated recall with attention-diversity balancing plus trusted memory editing to stabilize long-horizon autoregressive video diffusion.
-
LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation
LongLive-RAG formulates long video generation as retrieval-augmented generation by treating self-generated latents as a dynamic searchable history and adding a Window Temporal Delta Loss for better retrieval.
-
MBench: A Comprehensive Benchmark on Memory Capability for Video World Models
MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.
-
AdaState: Self-Evolving Anchors for Streaming Video Generation
AdaState replaces the static first-frame KV anchor with an evolving hidden latent that the model denoises alongside content, treating time as relative to enable recurrence and richer dynamics in streaming video generation.
-
Future Forcing: Future-aware Training-free KV Cache Policy for Autoregressive Video Generation
Future Forcing constructs a future query proxy from historical pre-RoPE statistics to score and merge KV tokens, improving subject consistency by up to 1.49 on VBench-Long for 60s AR video generation.
-
Q-ARVD: Quantizing Autoregressive Video Diffusion Models
Q-ARVD introduces final-quality-aware frame weighting and outlier-aware adaptive dual-scale quantization to enable accurate low-bit inference for autoregressive video diffusion models.
-
DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation
DySink maintains a memory bank and retrieves relevant historical frames as dynamic sinks while using an anomaly gate to suppress collapse, yielding higher temporal quality and dynamic degree on minute-long videos.
-
Goodbye Drift: Anchored Tree Sampling for Long-Horizon Video-to-Video Generation
Anchored Tree Sampling converts horizon-compounding drift into anchor-bounded drift by organizing video generation as a sparse-to-dense tree of imputations instead of left-to-right autoregressive rollout.
-
LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation
LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell and other GPUs.
-
CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.
-
MultiWorld: Scalable Multi-Agent Multi-View Video World Models
MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.
-
Efficient Video Diffusion Models: Advancements and Challenges
A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
-
HandsOnWorld: Unconstrained Egocentric Video Generation with Camera-Disentangled Hand Control
HandsOnWorld creates a hand-controlled egocentric video generator from unconstrained monocular video via a new EgoVid-Pro dataset from monocular reconstruction and a Plücker Hand Map that disentangles camera and hand motion.
-
ABot-M0.5: Unified Mobility-and-Manipulation World Action Model
ABot-M0.5 proposes a unified mobility-and-manipulation world action model using three alignment strategies that achieves state-of-the-art performance on mobile and fine-grained manipulation benchmarks.
-
LiveEdit: Towards Real-Time Diffusion-Based Streaming Video Editing
LiveEdit distills a bidirectional video foundation model into a unidirectional streaming editor via three-stage training plus mask caching to reach 12.66 FPS with stable edits.
-
ZeroGVC: Zero-Shot Generative Video Compression with Autoregressive Diffusion Priors
ZeroGVC performs zero-shot generative video compression by guiding pretrained autoregressive diffusion priors with codebook noise vectors for P-frames after encoding the initial I-frame.
-
ActWorld: From Explorable to Interactive World Model via Action-Aware Memory
ActWorld extends navigation-centric world models to support mid-rollout object interactions via chunk-autoregressive generation, action-aware memory routing, and a persistent memory bank, backed by a 100K annotated interaction dataset.
-
AnchorEdit: Maintaining Temporal Consistency in Multi-turn Image Editing via Causal Memory
AnchorEdit is presented as the first autoregressive diffusion framework for causal multi-turn image editing, claiming SOTA consistency over 10+ rounds via three-stage training and a memory mechanism.
-
K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling
K-Forcing introduces progressive self-forcing distillation to train a conditional push-forward model that jointly decodes k future tokens per forward pass, yielding 2.4-3.5x speedup at k=4 with modest quality loss on LM1B and OpenWebText.
-
Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation
Echo-Infinity replaces handcrafted KV-cache schedules with end-to-end optimized Memory Queries and a Unified Relative RoPE recipe to support real-time infinite video generation in diffusion transformers.
-
Video-Mirai: Autoregressive Video Diffusion Models Need Foresight
Video-Mirai is a training method that distills non-causal future targets into causal video diffusion states to boost long-horizon consistency without changing inference architecture or cost.
-
PointAction: 3D Points as Universal Action Representations for Robot Control
PointAction uses predicted dynamic 3D pointmaps from fine-tuned video models as an embodiment-agnostic action representation to map video predictions to executable robot actions.
-
LiveBand: Live Accompaniment Generation in the Audio Domain
LiveBand generates high-fidelity music accompaniments to live audio in real time via a causal transformer in audio latent space trained with adversarial sequence-level supervision.
-
Robust Dreamer: Deviation-Aware Latent Gaussian Memory for Action-Controlled AR Video Generation
Robust Dreamer uses Latent Gaussian Memory anchored to diffusion latents and Deviation Learning with a Dynamic Deviation Archive to reduce drift in long-horizon action-controlled image-to-video generation, reporting SOTA results on ScanNet, DL3DV, and OmniWorldGame.
-
SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer
SANA-Streaming delivers 1280x704 streaming video editing at 24 FPS end-to-end on an RTX 5090 using hybrid DiT blocks, cycle-reverse training, and mixed-precision quantization.
-
minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models
minWM supplies an end-to-end pipeline that fine-tunes bidirectional T2V/TI2V models with camera control then distills them via Causal Forcing into few-step autoregressive generators for low-latency rollout.
-
SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models
SCOPE adds per-pixel action conditioning to pretrained video diffusion models and releases the CrossFPS multi-game dataset to support cross-game FPS world model simulation with zero-shot transfer.
-
WorldKV: Efficient World Memory with World Retrieval and Compression
WorldKV enables persistent world memory in autoregressive video diffusion models by selectively retrieving and compressing KV-cache chunks, matching full-cache fidelity at roughly twice the throughput without training.
-
FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization
FashionChameleon achieves interactive multi-garment video customization at 23.8 FPS via in-context teacher models, streaming distillation, and training-free KV cache rescheduling while using only single-garment data.
-
Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation
Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.
-
HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation
HorizonDrive is a new anti-drifting autoregressive training and distillation method that enables minute-scale stable driving video rollouts by making the teacher model rollout-capable via scheduled rollout recovery and teacher rollout DMD.
-
Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models
Forcing-KV applies head-specific static and dynamic pruning to KV caches in AR video diffusion models, achieving over 29 fps, 30% memory reduction, and up to 2.82x speedup at maintained quality.
-
Human Cognition in Machines: A Unified Perspective of World Models
The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and proposes Epistemic World Models as a new category for scientific discovery agents.
-
Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms
Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.
-
MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model
MaineCoon is presented as the first 22B-parameter real-time streaming audio-visual autoregressive model optimized for social-interactive applications, using novel training techniques and an agentic inference framework.
-
WorldOlympiad: Can Your World Model Survive a Triathlon?
WorldOlympiad is a new benchmark decomposing world-model evaluation into physical, geometry, and interaction tracks using segmentation, MLLM judges, Gaussian splatting, and action prompts on diverse scenarios.
-
Ultra Flash: Scaling Real-Time Streaming Video Generation to High Resolutions
Ultra Flash introduces a cascaded streaming super-resolution framework with specialized training, upsampling, and optimization to enable real-time high-resolution video generation from low-res diffusion models.
-
One-Forcing: Towards Stable One-Step Autoregressive Video Generation
One-Forcing augments DMD with a GAN loss to enable stable one-step causal autoregressive video generation, reporting a VBench score of 83.76 as SOTA among one-step methods.
-
One Sentence, One Drama: Personalized Short-Form Drama Generation via Multi-Agent Systems
A hierarchical multi-agent framework converts a single sentence into a short drama using debate-based scripting, 3D-grounded first frames for spatial consistency, and multi-stage reviewer loops.
-
Focused Forcing: Content-Aware Per-Frame KV Selection for Efficient Autoregressive Video Diffusion
Focused Forcing is a training-free per-frame KV selection method that combines attention scores with diversity metrics and head-importance estimation to accelerate autoregressive video diffusion up to 1.48x while improving quality.
-
Xiaomi Auto World Model: A Joint World Model Integrating Reconstruction and Generation for Autonomous Driving
A unified system integrating sparse-query 3D Gaussian reconstruction with multi-stage causal video generation for autonomous driving world models.
-
Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation
Causal Forcing++ applies causal consistency distillation to enable scalable frame-wise 1-2 step autoregressive video generation, outperforming prior 4-step chunk-wise methods on quality metrics while halving first-frame latency.
-
A Systematic Post-Train Framework for Video Generation
A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.
-
Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory
Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive distillation on a 5B model.