A Systematic Post-Train Framework for Video Generation

· 2026 · cs.CV · arXiv 2604.25427

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

open full Pith review browse 2 citing papers arXiv PDF

abstract

While large-scale video diffusion models have demonstrated impressive capabilities in generating high-resolution and semantically rich content, a significant gap remains between their pretraining performance and real-world deployment requirements due to critical issues such as prompt sensitivity, temporal inconsistency, and prohibitive inference costs. To bridge this gap, we propose a comprehensive post-training framework that systematically aligns pretrained models with user intentions through four synergistic stages: we first employ Supervised Fine-Tuning (SFT) to transform the base model into a stable instruction-following policy, followed by a Reinforcement Learning from Human Feedback (RLHF) stage that utilizes a novel Group Relative Policy Optimization (GRPO) method tailored for video diffusion to enhance perceptual quality and temporal coherence; subsequently, we integrate Prompt Enhancement via a specialized language model to refine user inputs, and finally address system efficiency through Inference Optimization. Together, these components provide a systematic approach to improving visual quality, temporal coherence, and instruction following, while preserving the controllability learned during pretraining. The result is a practical blueprint for building scalable post-training pipelines that are stable, adaptable, and effective in real-world deployment. Extensive experiments demonstrate that this unified pipeline effectively mitigates common artifacts and significantly improves controllability and visual aesthetics while adhering to strict sampling cost constraints.

citation-role summary

method 1

citation-polarity summary

use method 1

representative citing papers

Do Joint Audio-Video Generation Models Understand Physics?

cs.SD · 2026-05-08 · unverdicted · novelty 7.0

Current joint audio-video generation models lack robust physical commonsense, especially during transitions and when prompted for impossible behaviors.

Video Models Can Reason with Verifiable Rewards

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

VideoRLVR uses SDE-GRPO optimization, dense decomposed rewards, and Early-Step Focus to train video diffusion models on verifiable reasoning tasks, outperforming supervised fine-tuning and other video generators on Maze, FlowFree, and Sokoban.

citing papers explorer

Showing 2 of 2 citing papers.

Do Joint Audio-Video Generation Models Understand Physics? cs.SD · 2026-05-08 · unverdicted · none · ref 42 · internal anchor
Current joint audio-video generation models lack robust physical commonsense, especially during transitions and when prompted for impossible behaviors.
Video Models Can Reason with Verifiable Rewards cs.CV · 2026-05-14 · unverdicted · none · ref 43 · internal anchor
VideoRLVR uses SDE-GRPO optimization, dense decomposed rewards, and Early-Step Focus to train video diffusion models on verifiable reasoning tasks, outperforming supervised fine-tuning and other video generators on Maze, FlowFree, and Sokoban.

A Systematic Post-Train Framework for Video Generation

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer