AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.
hub Canonical reference
MAGI-1: Autoregressive Video Generation at Scale
Canonical reference. 74% of citing Pith papers cite this work as background.
abstract
We present MAGI-1, a world model that generates videos by autoregressively predicting a sequence of video chunks, defined as fixed-length segments of consecutive frames. Trained to denoise per-chunk noise that increases monotonically over time, MAGI-1 enables causal temporal modeling and naturally supports streaming generation. It achieves strong performance on image-to-video (I2V) tasks conditioned on text instructions, providing high temporal consistency and scalability, which are made possible by several algorithmic innovations and a dedicated infrastructure stack. MAGI-1 facilitates controllable generation via chunk-wise prompting and supports real-time, memory-efficient deployment by maintaining constant peak inference cost, regardless of video length. The largest variant of MAGI-1 comprises 24 billion parameters and supports context lengths of up to 4 million tokens, demonstrating the scalability and robustness of our approach. The code and models are available at https://github.com/SandAI-org/MAGI-1 and https://github.com/SandAI-org/MagiAttention. The product can be accessed at https://sand.ai.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and motion transfer.
MemLearner introduces a learning-based adaptive context query method using query tokens in video world models to improve long-term scene consistency over rule-based retrieval.
PRA approximates sequential rollout training in parallel for pixel-space AR models via intermediate states and a pixel decoder, achieving FID 2.58 (135M params) and 1.94 (511M params) on ImageNet-1K 256x256, new SOTA among pixel-space AR models.
LongLive-RAG formulates long video generation as retrieval-augmented generation by treating self-generated latents as a dynamic searchable history and adding a Window Temporal Delta Loss for better retrieval.
CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.
Q-ARVD introduces final-quality-aware frame weighting and outlier-aware adaptive dual-scale quantization to enable accurate low-bit inference for autoregressive video diffusion models.
DySink maintains a memory bank and retrieves relevant historical frames as dynamic sinks while using an anomaly gate to suppress collapse, yielding higher temporal quality and dynamic degree on minute-long videos.
LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell and other GPUs.
Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video generation under bounded cache.
FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.
Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.
An autoregressive diffusion model on sparse point trajectories predicts multi-modal future scene dynamics from single images with orders-of-magnitude faster sampling than dense video simulators while matching accuracy.
Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressive video synthesis.
A single transformer model using a new markup representation generates functional floorplans from diverse conditions and outperforms prior task-specific methods on the RPLAN dataset.
SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.
DSA adds a jointly trained confidence head to autoregressive video diffusion models that dynamically allocates fewer or more denoising steps per frame, achieving 22.63 FPS real-time generation on H100 while matching VBench quality.
AAD-1 uses a causal generator with a bidirectional holistic discriminator plus phased distribution matching before adversarial training to reach state-of-the-art one-step autoregressive video generation on VBench.
Training method distills non-causal future targets into causal video diffusion states to boost long-horizon consistency without changing inference architecture or cost.
A causal VAE with variable reference guidance and a Rectified Flow Transformer enables real-time streamable high-quality talking portrait video generation from audio and images.
Lumos-Nexus is a training-efficient video generation framework using two-stage alignment of a lightweight model followed by progressive frequency bridging to a high-fidelity generator in homogeneous latent space, plus the new VR-Bench for reasoning evaluation.
Robust Dreamer uses Latent Gaussian Memory anchored to diffusion latents and Deviation Learning with a Dynamic Deviation Archive to reduce drift in long-horizon action-controlled image-to-video generation, reporting SOTA results on ScanNet, DL3DV, and OmniWorldGame.
Block-diffusion decoder with prior-calibrated scoring and early stopping produces streaming zero-shot TTS at quality comparable to AR and NAR baselines with lower real-time factor.
citing papers explorer
No citing papers match the current filters.