AR-CoPO: Align Autoregressive Video Generation with Contrastive Policy Optimization
read the original abstract
Streaming autoregressive (AR) video generators combined with few-step distillation achieve low-latency, high-quality synthesis, yet remain difficult to align via reinforcement learning from human feedback (RLHF). Existing SDE-based GRPO methods face challenges in this setting: few-step ODEs and consistency model samplers deviate from standard flow-matching ODEs, and their short, low-stochasticity trajectories are highly sensitive to initialization noise, rendering intermediate SDE exploration ineffective. We propose AR-CoPO (AutoRegressive Contrastive Policy Optimization), a framework that adapts the Neighbor GRPO contrastive perspective to streaming AR generation. AR-CoPO introduces chunk-level alignment via a forking mechanism that constructs neighborhood candidates at a randomly selected chunk, assigns sequence-level rewards, and performs localized GRPO updates. We further propose a semi-on-policy training strategy that complements on-policy exploration with exploitation over a replay buffer of reference rollouts, improving generation quality across domains. Experiments on Self-Forcing demonstrate that AR-CoPO improves both out-of-domain generalization and in-domain human preference alignment over the baseline, providing evidence of genuine alignment rather than reward hacking.
This paper has not been read by Pith yet.
Forward citations
Cited by 3 Pith papers
-
TempAct: Advancing Temporal Plausibility in Autoregressive Video Generation via Planner-Executor RL
TempAct introduces a planner-executor RL framework with hierarchical group exploration and rewards to improve temporal consistency in autoregressive video diffusion models.
-
KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration
KVPO aligns streaming autoregressive video generators with human preferences via ODE-native GRPO, using KV cache for semantic exploration and TVE for velocity-based policy modeling, yielding gains in quality and alignment.
-
TempAct: Advancing Temporal Plausibility in Autoregressive Video Generation via Planner-Executor RL
TempAct applies hierarchical planner-executor RL with group exploration and multi-level rewards to improve temporal consistency in autoregressive video models.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.