hub Canonical reference

MAGI-1: Autoregressive Video Generation at Scale

Sand.ai, Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li · 2025 · cs.CV · arXiv 2505.13211

Canonical reference. 74% of citing Pith papers cite this work as background.

76 Pith papers citing it

Background 74% of classified citations

open full Pith review browse 76 citing papers arXiv PDF

abstract

We present MAGI-1, a world model that generates videos by autoregressively predicting a sequence of video chunks, defined as fixed-length segments of consecutive frames. Trained to denoise per-chunk noise that increases monotonically over time, MAGI-1 enables causal temporal modeling and naturally supports streaming generation. It achieves strong performance on image-to-video (I2V) tasks conditioned on text instructions, providing high temporal consistency and scalability, which are made possible by several algorithmic innovations and a dedicated infrastructure stack. MAGI-1 facilitates controllable generation via chunk-wise prompting and supports real-time, memory-efficient deployment by maintaining constant peak inference cost, regardless of video length. The largest variant of MAGI-1 comprises 24 billion parameters and supports context lengths of up to 4 million tokens, demonstrating the scalability and robustness of our approach. The code and models are available at https://github.com/SandAI-org/MAGI-1 and https://github.com/SandAI-org/MagiAttention. The product can be accessed at https://sand.ai.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 15 baseline 2 dataset 1 method 1

citation-polarity summary

background 14 baseline 2 unclear 1 use dataset 1 use method 1

representative citing papers

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

cs.CV · 2026-05-13 · unverdicted · novelty 8.0

AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.

PhysInOne: Visual Physics Learning and Reasoning in One Suite

cs.CV · 2026-04-10 · unverdicted · novelty 8.0

PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and motion transfer.

Towards Memory-Efficient Autoregressive Video Generation via Instance-Specific Parametric Absorption

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

ISPA reduces KV cache size by up to 50% in AR video models by transitioning layers to local attention and applying instance-specific least-squares weight modulation to compensate for lost history.

MemLearner: Learning to Query Context memory for Video World Models

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

MemLearner introduces a learning-based adaptive context query method using query tokens in video world models to improve long-term scene consistency over rule-based retrieval.

Parallel Rollout Approximation for Pixel-Space Autoregressive Image Generation

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

PRA approximates sequential rollout training in parallel for pixel-space AR models via intermediate states and a pixel decoder, achieving FID 2.58 (135M params) and 1.94 (511M params) on ImageNet-1K 256x256, new SOTA among pixel-space AR models.

Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

PhaseLock extracts motion priors from 2-step inference and enforces them via Latent Delta Guidance to raise physical consistency scores by 6.2 points on average in image-to-video diffusion models.

LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

LongLive-RAG formulates long video generation as retrieval-augmented generation by treating self-generated latents as a dynamic searchable history and adding a Window Temporal Delta Loss for better retrieval.

CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.

Q-ARVD: Quantizing Autoregressive Video Diffusion Models

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

Q-ARVD introduces final-quality-aware frame weighting and outlier-aware adaptive dual-scale quantization to enable accurate low-bit inference for autoregressive video diffusion models.

DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation

cs.CV · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

DySink maintains a memory bank and retrieves relevant historical frames as dynamic sinks while using an anomaly gate to suppress collapse, yielding higher temporal quality and dynamic degree on minute-long videos.

LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell and other GPUs.

Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation

cs.CV · 2026-05-15 · unverdicted · novelty 7.0

Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video generation under bounded cache.

FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction

cs.CV · 2026-05-07 · unverdicted · novelty 7.0

FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.

Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation

cs.CV · 2026-05-05 · unverdicted · novelty 7.0

Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.

Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation

cs.CV · 2026-04-23 · unverdicted · novelty 7.0

Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.

Envisioning the Future, One Step at a Time

cs.CV · 2026-04-10 · unverdicted · novelty 7.0

An autoregressive diffusion model on sparse point trajectories predicts multi-modal future scene dynamics from single images with orders-of-magnitude faster sampling than dense video simulators while matching accuracy.

Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressive video synthesis.

Unified Vector Floorplan Generation via Markup Representation

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

A single transformer model using a new markup representation generates functional floorplans from diverse conditions and outperforms prior task-specific methods on the RPLAN dataset.

Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation

cs.CV · 2026-04-03 · conditional · novelty 7.0

SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.

Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation

cs.MM · 2026-06-03 · unverdicted · novelty 6.0

Echo-Infinity replaces handcrafted KV-cache schedules with end-to-end optimized Memory Queries and a Unified Relative RoPE recipe to support real-time infinite video generation in diffusion transformers.

DSA: Dynamic Step Allocation for Fast Autoregressive Video Generation

cs.CV · 2026-06-03 · unverdicted · novelty 6.0

DSA adds a jointly trained confidence head to autoregressive video diffusion models that dynamically allocates fewer or more denoising steps per frame, achieving 22.63 FPS real-time generation on H100 while matching VBench quality.

AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

AAD-1 uses a causal generator with a bidirectional holistic discriminator plus phased distribution matching before adversarial training to reach state-of-the-art one-step autoregressive video generation on VBench.

Video-Mirai: Autoregressive Video Diffusion Models Need Foresight

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

Training method distills non-causal future targets into causal video diffusion states to boost long-horizon consistency without changing inference architecture or cost.

Real-Time Generation of Streamable Talking Portrait Video with Reference-Guided Deep Compression VAEs

cs.CV · 2026-06-01 · unverdicted · novelty 6.0

A causal VAE with variable reference guidance and a Rectified Flow Transformer enables real-time streamable high-quality talking portrait video generation from audio and images.

citing papers explorer

Showing 50 of 76 citing papers.

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation cs.CV · 2026-05-13 · unverdicted · none · ref 39 · internal anchor
AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.
PhysInOne: Visual Physics Learning and Reasoning in One Suite cs.CV · 2026-04-10 · unverdicted · none · ref 1 · internal anchor
PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and motion transfer.
Towards Memory-Efficient Autoregressive Video Generation via Instance-Specific Parametric Absorption cs.CV · 2026-07-01 · unverdicted · none · ref 37 · internal anchor
ISPA reduces KV cache size by up to 50% in AR video models by transitioning layers to local attention and applying instance-specific least-squares weight modulation to compensate for lost history.
MemLearner: Learning to Query Context memory for Video World Models cs.CV · 2026-06-30 · unverdicted · none · ref 1 · internal anchor
MemLearner introduces a learning-based adaptive context query method using query tokens in video world models to improve long-term scene consistency over rule-based retrieval.
Parallel Rollout Approximation for Pixel-Space Autoregressive Image Generation cs.CV · 2026-06-26 · unverdicted · none · ref 12 · internal anchor
PRA approximates sequential rollout training in parallel for pixel-space AR models via intermediate states and a pixel decoder, achieving FID 2.58 (135M params) and 1.94 (511M params) on ImageNet-1K 256x256, new SOTA among pixel-space AR models.
Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them cs.CV · 2026-06-04 · unverdicted · none · ref 87 · internal anchor
PhaseLock extracts motion priors from 2-step inference and enforces them via Latent Delta Guidance to raise physical consistency scores by 6.2 points on average in image-to-video diffusion models.
LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation cs.CV · 2026-06-01 · unverdicted · none · ref 42 · internal anchor
LongLive-RAG formulates long video generation as retrieval-augmented generation by treating self-generated latents as a dynamic searchable history and adding a Window Temporal Delta Loss for better retrieval.
CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models cs.CV · 2026-05-22 · unverdicted · none · ref 46 · internal anchor
CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.
Q-ARVD: Quantizing Autoregressive Video Diffusion Models cs.CV · 2026-05-20 · unverdicted · none · ref 19 · internal anchor
Q-ARVD introduces final-quality-aware frame weighting and outlier-aware adaptive dual-scale quantization to enable accurate low-bit inference for autoregressive video diffusion models.
DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation cs.CV · 2026-05-20 · unverdicted · none · ref 14 · 2 links · internal anchor
DySink maintains a memory bank and retrieves relevant historical frames as dynamic sinks while using an anomaly gate to suppress collapse, yielding higher temporal quality and dynamic degree on minute-long videos.
LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation cs.CV · 2026-05-18 · unverdicted · none · ref 57 · internal anchor
LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell and other GPUs.
Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation cs.CV · 2026-05-15 · unverdicted · none · ref 15 · internal anchor
Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video generation under bounded cache.
FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction cs.CV · 2026-05-07 · unverdicted · none · ref 30 · internal anchor
FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation cs.CV · 2026-05-05 · unverdicted · none · ref 30 · internal anchor
Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation cs.CV · 2026-04-23 · unverdicted · none · ref 16 · internal anchor
Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.
Envisioning the Future, One Step at a Time cs.CV · 2026-04-10 · unverdicted · none · ref 100 · internal anchor
An autoregressive diffusion model on sparse point trajectories predicts multi-modal future scene dynamics from single images with orders-of-magnitude faster sampling than dense video simulators while matching accuracy.
Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis cs.CV · 2026-04-08 · unverdicted · none · ref 22 · internal anchor
Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressive video synthesis.
Unified Vector Floorplan Generation via Markup Representation cs.CV · 2026-04-06 · unverdicted · none · ref 28 · internal anchor
A single transformer model using a new markup representation generates functional floorplans from diverse conditions and outperforms prior task-specific methods on the RPLAN dataset.
Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation cs.CV · 2026-04-03 · conditional · none · ref 48 · internal anchor
SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.
Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation cs.MM · 2026-06-03 · unverdicted · none · ref 32 · internal anchor
Echo-Infinity replaces handcrafted KV-cache schedules with end-to-end optimized Memory Queries and a Unified Relative RoPE recipe to support real-time infinite video generation in diffusion transformers.
DSA: Dynamic Step Allocation for Fast Autoregressive Video Generation cs.CV · 2026-06-03 · unverdicted · none · ref 55 · internal anchor
DSA adds a jointly trained confidence head to autoregressive video diffusion models that dynamically allocates fewer or more denoising steps per frame, achieving 22.63 FPS real-time generation on H100 while matching VBench quality.
AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation cs.CV · 2026-06-02 · unverdicted · none · ref 15 · internal anchor
AAD-1 uses a causal generator with a bidirectional holistic discriminator plus phased distribution matching before adversarial training to reach state-of-the-art one-step autoregressive video generation on VBench.
Video-Mirai: Autoregressive Video Diffusion Models Need Foresight cs.CV · 2026-06-02 · unverdicted · none · ref 31 · internal anchor
Training method distills non-causal future targets into causal video diffusion states to boost long-horizon consistency without changing inference architecture or cost.
Real-Time Generation of Streamable Talking Portrait Video with Reference-Guided Deep Compression VAEs cs.CV · 2026-06-01 · unverdicted · none · ref 44 · internal anchor
A causal VAE with variable reference guidance and a Rectified Flow Transformer enables real-time streamable high-quality talking portrait video generation from audio and images.
Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models cs.CV · 2026-05-29 · unverdicted · none · ref 42 · internal anchor
Lumos-Nexus is a training-efficient video generation framework using two-stage alignment of a lightweight model followed by progressive frequency bridging to a high-fidelity generator in homogeneous latent space, plus the new VR-Bench for reasoning evaluation.
Robust Dreamer: Deviation-Aware Latent Gaussian Memory for Action-Controlled AR Video Generation cs.CV · 2026-05-29 · unverdicted · none · ref 51 · internal anchor
Robust Dreamer uses Latent Gaussian Memory anchored to diffusion latents and Deviation Learning with a Dynamic Deviation Archive to reduce drift in long-horizon action-controlled image-to-video generation, reporting SOTA results on ScanNet, DL3DV, and OmniWorldGame.
Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS cs.SD · 2026-05-29 · unverdicted · none · ref 1 · internal anchor
Block-diffusion decoder with prior-calibrated scoring and early stopping produces streaming zero-shot TTS at quality comparable to AR and NAR baselines with lower real-time factor.
minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models cs.CV · 2026-05-28 · unverdicted · none · ref 27 · internal anchor
minWM supplies an end-to-end pipeline that fine-tunes bidirectional T2V/TI2V models with camera control then distills them via Causal Forcing into few-step autoregressive generators for low-latency rollout.
WorldKV: Efficient World Memory with World Retrieval and Compression cs.CV · 2026-05-21 · unverdicted · none · ref 24 · internal anchor
WorldKV enables persistent world memory in autoregressive video diffusion models by selectively retrieving and compressing KV-cache chunks, matching full-cache fidelity at roughly twice the throughput without training.
World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks cs.CV · 2026-05-19 · unverdicted · none · ref 31 · internal anchor
Proposes World-Ego Modeling with WEM using CP-MoE diffusion and a new HTEWorld benchmark, claiming SOTA on hybrid navigation-manipulation tasks.
Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models cs.CV · 2026-05-19 · unverdicted · none · ref 20 · 3 links · internal anchor
Reference-frame dominance in self-attention suppresses motion in image-to-video models; DyMoS rebalances attention from generated frames to the reference during initial denoising steps to improve dynamics while preserving fidelity.
Advancing Narrative Long Video Generation via Training-Free Identity-Aware Memory cs.CV · 2026-05-18 · unverdicted · none · ref 35 · internal anchor
IAMFlow is a training-free identity-aware memory system that tracks entities via LLM global ID assignment and VLM frame verification to reduce identity drift in narrative long video generation from shifting prompts.
Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos cs.CV · 2026-05-18 · unverdicted · none · ref 23 · internal anchor
MIGA introduces two-stage alignment to close train-inference gaps and dual consistency enhancement via self-reflection and long-range guidance to achieve SOTA temporal consistency in infinite-frame video generation on VBench and NarrLV.
AtlasVid: Efficient Ultra-High-Resolution Long Video Generation via Decoupled Global-Local Modeling cs.CV · 2026-05-15 · unverdicted · none · ref 18 · internal anchor
AtlasVid proposes a decoupled global-local diffusion framework that trains at low resolution with LoRA and generalizes to ultra-high-resolution long video synthesis via semantic proxy guidance and locality-preserving attention.
RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO cs.CV · 2026-05-14 · unverdicted · none · ref 16 · internal anchor
RAVEN aligns training and inference for causal autoregressive video diffusion via interleaved rollout repacking and introduces CM-GRPO for direct RL on consistency-model kernels, claiming better quality than recent baselines.
EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration cs.CV · 2026-05-14 · unverdicted · none · ref 38 · internal anchor
EverAnimate restores drifted latent flow trajectories in chunked video generation via persistent latent propagation and restorative flow matching, achieving measurable gains in PSNR, SSIM, LPIPS, and FID over prior long-animation methods with only LoRA tuning.
Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity cs.CV · 2026-05-14 · unverdicted · none · ref 47 · internal anchor
Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation cs.CV · 2026-05-13 · unverdicted · none · ref 2 · internal anchor
Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.
Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models cs.CV · 2026-05-10 · unverdicted · none · ref 4 · internal anchor
Forcing-KV applies head-specific static and dynamic pruning to KV caches in AR video diffusion models, achieving over 29 fps, 30% memory reduction, and up to 2.82x speedup at maintained quality.
What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion cs.CV · 2026-05-08 · unverdicted · none · ref 79 · internal anchor
Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.
SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation cs.CV · 2026-05-07 · unverdicted · none · ref 1 · 2 links · internal anchor
SwiftI2V achieves comparable 2K I2V quality to end-to-end models on VBench-I2V while cutting GPU time by 202x through low-resolution motion planning followed by strongly image-conditioned segment-wise high-resolution synthesis.
RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control cs.CV · 2026-05-07 · unverdicted · none · ref 36 · internal anchor
RealCam is a causal autoregressive model for real-time camera-controlled video-to-video generation, using cross-frame in-context teacher distillation and loop-closed data augmentation to achieve high fidelity and consistency.
Stream-T1: Test-Time Scaling for Streaming Video Generation cs.CV · 2026-05-06 · unverdicted · none · ref 36 · internal anchor
Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve temporal consistency and visual quality.
Motion-Aware Caching for Efficient Autoregressive Video Generation cs.CV · 2026-05-03 · conditional · none · ref 33 · 2 links · internal anchor
MotionCache accelerates autoregressive video generation up to 6.28x by motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on SkyReels-V2 and MAGI-1.
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation cs.CV · 2026-04-28 · unverdicted · none · ref 40 · internal anchor
Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.
Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation cs.CV · 2026-04-20 · unverdicted · none · ref 46 · internal anchor
A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer cs.CV · 2026-04-15 · unverdicted · none · ref 30 · internal anchor
RTR-DiT distills a bidirectional DiT teacher into an autoregressive few-step model using Self Forcing and Distribution Matching Distillation, plus a reference-preserving KV cache, to enable stable real-time text- and reference-guided video stylization.
MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation cs.CV · 2026-04-08 · unverdicted · none · ref 29 · internal anchor
MAR-GRPO stabilizes GRPO for AR-diffusion hybrids via multi-trajectory expectation and uncertainty-based token selection, yielding better visual quality, stability, and spatial understanding than baselines.
GeoWorld: Geometric World Models cs.CV · 2026-02-26 · unverdicted · none · ref 73 · internal anchor
GeoWorld applies hyperbolic geometry to JEPA world models and introduces geometric reinforcement learning, reporting modest success-rate gains of ~3% and ~2% on 3- and 4-step planning tasks versus V-JEPA 2.
World Action Models are Zero-shot Policies cs.RO · 2026-02-17 · unverdicted · none · ref 78 · internal anchor
DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment transfer with 10-30 minutes of data.

MAGI-1: Autoregressive Video Generation at Scale

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer