
arxiv: 2603.17426 · v2 · pith:RLFPJZQA · submitted 2026-03-18 · 💻 cs.CV

SHIFT: Motion Alignment in Video Diffusion Models with Adversarial Hybrid Fine-Tuning

classification 💻 cs.CV
keywords fine-tuning, motion, diffusion, models, shift, video, adversarial

Image-conditioned video diffusion models achieve impressive visual realism but often suffer from weakened motion fidelity, e.g., reduced motion dynamics or degraded long-term temporal coherence, especially after fine-tuning. To address this, we study motion alignment in video diffusion post-training and introduce pixel-motion rewards based on pixel flux dynamics that capture both instantaneous and long-term motion consistency. We further propose Smooth Hybrid Fine-Tuning (SHIFT), a scalable reward-driven framework that unifies supervised fine-tuning and advantage-weighted fine-tuning. Benefiting from novel adversarial advantages, SHIFT improves convergence speed and mitigates reward hacking. Experiments show that our approach efficiently resolves the dynamic-degree collapse that occurs during supervised fine-tuning of modern video diffusion models. Project page: https://xiye20.github.io/projects/SHIFT/.
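The abstract does not state SHIFT's actual objective. As a minimal sketch, assuming a generic advantage-weighted regression (AWR-style) weighting and a simple convex blend with the plain supervised loss, a hybrid of supervised and advantage-weighted fine-tuning could look like the following. The names `hybrid_loss`, `alpha`, `beta`, and `max_weight` are hypothetical. The paper's adversarial advantages and pixel-motion rewards are not modeled here; `rewards` stands in for any per-sample motion reward, and `per_sample_losses` for per-sample denoising losses.

```python
import math
from statistics import mean, pstdev


def normalized_advantages(rewards, eps=1e-6):
    """Group-normalize rewards: A_i = (r_i - mean(r)) / (std(r) + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def advantage_weights(advantages, beta=1.0, max_weight=20.0):
    """AWR-style weights exp(A_i / beta), clipped to limit outlier influence.

    Assumption: a generic exponential weighting, not SHIFT's adversarial advantages.
    """
    return [min(math.exp(a / beta), max_weight) for a in advantages]


def hybrid_loss(per_sample_losses, rewards, alpha=0.5, beta=1.0):
    """Hypothetical blend of a supervised (SFT) loss and an advantage-weighted loss.

    alpha=1.0 recovers ordinary SFT (uniform average of per-sample losses);
    alpha=0.0 gives a purely advantage-weighted average.
    """
    if len(per_sample_losses) != len(rewards):
        raise ValueError("need exactly one reward per sample")
    sft = mean(per_sample_losses)
    weights = advantage_weights(normalized_advantages(rewards), beta=beta)
    weighted = sum(w * l for w, l in zip(weights, per_sample_losses)) / sum(weights)
    return alpha * sft + (1.0 - alpha) * weighted


if __name__ == "__main__":
    losses = [0.9, 0.4, 0.6]   # illustrative per-sample losses
    rewards = [0.1, 0.8, 0.5]  # illustrative per-sample motion rewards
    for a in (1.0, 0.5, 0.0):
        print(a, hybrid_loss(losses, rewards, alpha=a))
```

Lowering `alpha` shifts emphasis toward high-reward samples, so in this toy input the loss drops because the highest-reward sample also has the lowest loss. In a real training loop the weights would typically be treated as constants (no gradient through the rewards); this scalar sketch does not involve autograd.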

