AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.
MAGI-1: Autoregressive Video Generation at Scale
Canonical reference. 74% of citing Pith papers cite this work as background.
abstract
We present MAGI-1, a world model that generates videos by autoregressively predicting a sequence of video chunks, defined as fixed-length segments of consecutive frames. Trained to denoise per-chunk noise that increases monotonically over time, MAGI-1 enables causal temporal modeling and naturally supports streaming generation. It achieves strong performance on image-to-video (I2V) tasks conditioned on text instructions, providing high temporal consistency and scalability, which are made possible by several algorithmic innovations and a dedicated infrastructure stack. MAGI-1 facilitates controllable generation via chunk-wise prompting and supports real-time, memory-efficient deployment by maintaining constant peak inference cost, regardless of video length. The largest variant of MAGI-1 comprises 24 billion parameters and supports context lengths of up to 4 million tokens, demonstrating the scalability and robustness of our approach. The code and models are available at https://github.com/SandAI-org/MAGI-1 and https://github.com/SandAI-org/MagiAttention. The product can be accessed at https://sand.ai.
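The abstract's two key ideas (per-chunk noise that increases monotonically over time, and peak inference cost that stays constant regardless of video length) can be illustrated with a toy schedule. This is only a minimal sketch under stated assumptions: the linear noise ramp, the fixed `stagger` between chunk start times, and the function names `chunk_noise_levels` and `active_chunks` are hypothetical and are not taken from the MAGI-1 code or paper.

```python
def chunk_noise_levels(num_chunks: int, step: int,
                       steps_per_chunk: int, stagger: int) -> list[float]:
    """Toy per-chunk noise levels at a global step (1.0 = pure noise, 0.0 = clean).

    Assumption (hypothetical): chunk i starts denoising at step i * stagger and
    is denoised linearly over steps_per_chunk steps.
    """
    levels = []
    for i in range(num_chunks):
        # Number of denoising steps chunk i has completed so far, clamped to range.
        done = min(max(step - i * stagger, 0), steps_per_chunk)
        levels.append(1.0 - done / steps_per_chunk)
    return levels


def active_chunks(num_chunks: int, step: int,
                  steps_per_chunk: int, stagger: int) -> list[int]:
    """Indices of chunks that are processed at this step (started, not yet finished)."""
    return [i for i in range(num_chunks)
            if i * stagger <= step < i * stagger + steps_per_chunk]


if __name__ == "__main__":
    # Earlier chunks are always at least as clean as later ones.
    print(chunk_noise_levels(num_chunks=4, step=4, steps_per_chunk=4, stagger=2))
    # The number of chunks in flight does not grow with video length.
    print(active_chunks(num_chunks=100, step=4, steps_per_chunk=4, stagger=2))
```

In this toy setup the number of chunks in flight is bounded by `steps_per_chunk / stagger` however many chunks the video has, which is one way to read the "constant peak inference cost" claim; the actual scheduler used by MAGI-1 may differ.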
representative citing papers
PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and motion transfer.
ISPA reduces KV cache size by up to 50% in AR video models by transitioning layers to local attention and applying instance-specific least-squares weight modulation to compensate for lost history.
MemLearner introduces a learning-based adaptive context query method using query tokens in video world models to improve long-term scene consistency over rule-based retrieval.
PRA approximates sequential rollout training in parallel for pixel-space AR models via intermediate states and a pixel decoder, achieving FID 2.58 (135M params) and 1.94 (511M params) on ImageNet-1K 256x256, a new state of the art among pixel-space AR models.
PhaseLock extracts motion priors from 2-step inference and enforces them via Latent Delta Guidance to raise physical consistency scores by 6.2 points on average in image-to-video diffusion models.
LongLive-RAG formulates long video generation as retrieval-augmented generation by treating self-generated latents as a dynamic searchable history and adding a Window Temporal Delta Loss for better retrieval.
CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.
Q-ARVD introduces final-quality-aware frame weighting and outlier-aware adaptive dual-scale quantization to enable accurate low-bit inference for autoregressive video diffusion models.
DySink maintains a memory bank and retrieves relevant historical frames as dynamic sinks while using an anomaly gate to suppress collapse, yielding higher temporal quality and dynamic degree on minute-long videos.
LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell and other GPUs.
Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video generation under bounded cache.
FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.
Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.
An autoregressive diffusion model on sparse point trajectories predicts multi-modal future scene dynamics from single images with orders-of-magnitude faster sampling than dense video simulators while matching accuracy.
Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressive video synthesis.
A single transformer model using a new markup representation generates functional floorplans from diverse conditions and outperforms prior task-specific methods on the RPLAN dataset.
SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.
Echo-Infinity replaces handcrafted KV-cache schedules with end-to-end optimized Memory Queries and a Unified Relative RoPE recipe to support real-time infinite video generation in diffusion transformers.
DSA adds a jointly trained confidence head to autoregressive video diffusion models that dynamically allocates fewer or more denoising steps per frame, achieving 22.63 FPS real-time generation on H100 while matching VBench quality.
AAD-1 uses a causal generator with a bidirectional holistic discriminator plus phased distribution matching before adversarial training to reach state-of-the-art one-step autoregressive video generation on VBench.
A training method that distills non-causal future targets into causal video diffusion states to boost long-horizon consistency without changing inference architecture or cost.
A causal VAE with variable reference guidance and a Rectified Flow Transformer enables real-time streamable high-quality talking portrait video generation from audio and images.
citing papers explorer
- Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
  Reward Forcing combines EMA-Sink tokens and Rewarded Distribution Matching Distillation to deliver state-of-the-art streaming video generation at 23.1 FPS without copying initial frames.
- Generative View Stitching
  Generative View Stitching samples full video sequences in parallel using off-the-shelf Diffusion Forcing models plus Omni Guidance to produce stable, collision-free, loop-closing camera-guided videos.
- Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
  Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.
- Rolling Forcing: Autoregressive Long Video Diffusion in Real Time
  Rolling Forcing generates multi-minute videos in real time by jointly denoising frames at increasing noise levels, anchoring attention to early frames, and using windowed distillation to limit error accumulation.
- LongLive: Real-time Interactive Long Video Generation
  LongLive is a causal autoregressive video generator that produces up to 240-second interactive videos at 20.7 FPS on one H100 GPU after 32 GPU-days of fine-tuning from a 1.3B short-clip model.
- Matrix-game 2.0: An open-source real-time and streaming interactive world model
  Matrix-Game 2.0 introduces a scalable data pipeline, action-injection module, and few-step distillation to enable real-time streaming video generation at 25 FPS from game-engine interactions, with open-sourced weights and code.
- Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation
  Inferix provides an optimized inference engine for semi-autoregressive block-diffusion decoding to support high-quality, variable-length video generation in world simulation applications.