End-to-End Training for Autoregressive Video Diffusion via Self-Resampling

Ceyuan Yang; Dahua Lin; Hao He; Meng Wei; Weilin Huang; Yang Zhao; Yuwei Guo; Zhenheng Yang

arxiv: 2512.15702 · v2 · pith:O644HULKnew · submitted 2025-12-17 · 💻 cs.CV

Yuwei Guo, Ceyuan Yang, Hao He, Yang Zhao, Meng Wei, Zhenheng Yang, Weilin Huang, Dahua Lin

keywords: training, autoregressive, diffusion, history, video, while, approach, end-to-end

Autoregressive video diffusion models hold promise for world simulation but are vulnerable to exposure bias arising from the train-test mismatch. While recent works address this via post-training, they typically rely on a bidirectional teacher model or discriminator. To achieve an end-to-end solution, we introduce Resampling Forcing, a teacher-free framework that enables training autoregressive video models from scratch and at scale. Central to our approach is a self-resampling scheme that simulates inference-time model errors on history frames during training. Conditioned on these degraded histories, a sparse causal mask enforces temporal causality while enabling parallel training with frame-level diffusion loss. To facilitate efficient long-horizon generation, we further introduce history routing, a parameter-free mechanism that dynamically retrieves the top-k most relevant history frames for each query. Experiments demonstrate that our approach achieves performance comparable to distillation-based baselines while exhibiting superior temporal consistency on longer videos owing to native-length training.
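The history routing step described in the abstract can be illustrated with a small sketch. The abstract states only that routing is parameter-free and retrieves the top-k most relevant history frames for each query; everything else here is an assumption made for illustration, not the authors' implementation. In particular, the function names `mean_pool` and `route_history` are hypothetical, and both mean pooling of a frame's key tokens and dot-product scoring are assumed choices.

```python
from typing import List, Sequence


def mean_pool(tokens: Sequence[Sequence[float]]) -> List[float]:
    """Average a frame's token vectors into one descriptor (assumed pooling)."""
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]


def route_history(query: Sequence[float],
                  history_keys: Sequence[Sequence[Sequence[float]]],
                  k: int) -> List[int]:
    """Return the indices of the k history frames that best match the query.

    Scores are plain dot products between the query and each frame's pooled
    key (an assumed relevance measure), so selection adds no learned weights.
    """
    scores = []
    for frame_idx, frame_tokens in enumerate(history_keys):
        pooled = mean_pool(frame_tokens)
        score = sum(q * p for q, p in zip(query, pooled))
        scores.append((score, frame_idx))
    # Highest score first; ties are broken toward the more recent frame.
    scores.sort(reverse=True)
    selected = [frame_idx for _, frame_idx in scores[:k]]
    # Return the selected frames in temporal order.
    return sorted(selected)


# Four history frames, each with two 2-dimensional key tokens.
history = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[1.0, 1.0], [1.0, 1.0]],
    [[-1.0, 0.0], [-1.0, 0.0]],
]
selected_frames = route_history([1.0, 0.5], history, k=2)
```

In this sketch, attention for the current query would then be restricted to the selected frames, which keeps the cost per query bounded by k rather than by the full history length.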

Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
cs.CV 2026-05 unverdicted novelty 8.0

AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.

DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation
cs.CV 2026-05 unverdicted novelty 7.0

DySink maintains a memory bank and retrieves relevant historical frames as dynamic sinks while using an anomaly gate to suppress collapse, yielding higher temporal quality and dynamic degree on minute-long videos.

CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
cs.CV 2026-05 unverdicted novelty 7.0

CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.

HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation
cs.CV 2026-05 conditional novelty 7.0

HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.

Efficient Video Diffusion Models: Advancements and Challenges
cs.CV 2026-04 unverdicted novelty 7.0

A survey that groups efficient video diffusion methods into four paradigms (step distillation, efficient attention, model compression, and cache/trajectory optimization) and outlines open challenges for practical use.

Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis
cs.CV 2026-04 unverdicted novelty 7.0

Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressi...

Robust Dreamer: Deviation-Aware Latent Gaussian Memory for Action-Controlled AR Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

Robust Dreamer uses Latent Gaussian Memory anchored to diffusion latents and Deviation Learning with a Dynamic Deviation Archive to reduce drift in long-horizon action-controlled image-to-video generation, reporting S...

DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

DySink uses adaptive retrieval of relevant historical frames plus a sink anomaly gate to improve dynamic degree and temporal quality in minute-long autoregressive video generation.

RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO
cs.CV 2026-05 unverdicted novelty 6.0

RAVEN aligns training and inference for causal autoregressive video diffusion via interleaved rollout repacking and introduces CM-GRPO for direct RL on consistency-model kernels, claiming better quality than recent baselines.

Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity
cs.CV 2026-05 unverdicted novelty 6.0

Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.

HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation
cs.CV 2026-05 unverdicted novelty 6.0

HorizonDrive is a new anti-drifting autoregressive training and distillation method that enables minute-scale stable driving video rollouts by making the teacher model rollout-capable via scheduled rollout recovery an...

Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion
cs.CV 2026-02 unverdicted novelty 6.0

Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.

Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation
cs.CV 2026-02 conditional novelty 6.0

Causal Forcing uses an autoregressive teacher for ODE initialization in diffusion distillation to close the causal attention gap and deliver better real-time video generation than Self Forcing.

Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation
cs.CV 2026-02 conditional novelty 6.0

Causal Forcing initializes autoregressive diffusion students from AR teachers to recover flow maps that bidirectional teachers cannot provide, delivering 19%+ gains over Self Forcing on dynamic degree and related metrics.

One Sentence, One Drama: Personalized Short-Form Drama Generation via Multi-Agent Systems
cs.CV 2026-05 unverdicted novelty 5.0

A hierarchical multi-agent framework converts a single sentence into a short drama using debate-based scripting, 3D-grounded first frames for spatial consistency, and multi-stage reviewer loops.

Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation
cs.CV 2026-05 unverdicted novelty 5.0

Causal Forcing++ applies causal consistency distillation to enable scalable frame-wise 1-2 step autoregressive video generation, outperforming prior 4-step chunk-wise methods on quality metrics while halving first-fra...