pith. sign in

hub Canonical reference

Improving Video Generation with Human Feedback

Canonical reference. 73% of citing Pith papers cite this work as background.

37 Pith papers citing it
Background 73% of classified citations
abstract

Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. Specifically, we begin by constructing a large-scale human preference dataset focused on modern video generation models, incorporating pairwise annotations across multi-dimensions. We then introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy. From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models. These include two training-time strategies: direct preference optimization for flow (Flow-DPO) and reward weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos. Experimental results indicate that VideoReward significantly outperforms existing reward models, and Flow-DPO demonstrates superior performance compared to both Flow-RWR and supervised fine-tuning methods. Additionally, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs.

hub tools

citation-role summary

background 16 baseline 4 method 2

citation-polarity summary

clear filters

representative citing papers

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

cs.CV · 2026-05-14 · conditional · novelty 7.0

CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight bimanual manipulation tasks.

RewardHarness: Self-Evolving Agentic Post-Training

cs.AI · 2026-05-09 · unverdicted · novelty 7.0

RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.

Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

cs.CV · 2026-02-11 · unverdicted · novelty 7.0

DiNa-LRM introduces a diffusion-native latent reward model using a noise-calibrated Thurstone likelihood on noisy states, matching VLM performance at lower compute in image alignment and preference optimization.

MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

cs.AI · 2025-07-29 · unverdicted · novelty 7.0

MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.

Stream-T1: Test-Time Scaling for Streaming Video Generation

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve temporal consistency and visual quality.

citing papers explorer

Showing 10 of 10 citing papers after filters.

  • Flow-GRPO: Training Flow Matching Models via Online RL cs.CV · 2025-05-08 · unverdicted · none · ref 14 · internal anchor

    Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

  • MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE cs.AI · 2025-07-29 · unverdicted · none · ref 19 · internal anchor

    MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.

  • Unified Reward Model for Multimodal Understanding and Generation cs.CV · 2025-03-07 · unverdicted · none · ref 7 · internal anchor

    UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.

  • Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation cs.CV · 2025-12-04 · conditional · none · ref 44 · internal anchor

    Reward Forcing combines EMA-Sink tokens and Rewarded Distribution Matching Distillation to deliver state-of-the-art streaming video generation at 23.1 FPS without copying initial frames.

  • Seeing What Matters: Visual Preference Policy Optimization for Visual Generation cs.CV · 2025-11-24 · unverdicted · none · ref 23 · internal anchor

    ViPO enhances GRPO for visual generation by creating spatially and temporally aware advantage maps from pretrained vision models to focus optimization on perceptually important regions.

  • Self-Forcing++: Towards Minute-Scale High-Quality Video Generation cs.CV · 2025-10-02 · conditional · none · ref 37 · internal anchor

    Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.

  • DanceGRPO: Unleashing GRPO on Visual Generation cs.CV · 2025-05-12 · unverdicted · none · ref 14 · internal anchor

    DanceGRPO applies GRPO to visual generation tasks to achieve stable policy optimization across diffusion models, rectified flows, multiple tasks, and diverse reward models, outperforming prior RL methods.

  • SkyReels-V2: Infinite-length Film Generative Model cs.CV · 2025-04-17 · unverdicted · none · ref 46 · internal anchor

    SkyReels-V2 produces infinite-length film videos via MLLM-based captioning, progressive pretraining, motion RL, and diffusion forcing with non-decreasing noise schedules.

  • World Simulation with Video Foundation Models for Physical AI cs.CV · 2025-10-28 · unverdicted · none · ref 47 · internal anchor

    Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.

  • Seedance 1.0: Exploring the Boundaries of Video Generation Models cs.CV · 2025-06-10 · unverdicted · none · ref 18 · internal anchor

    Seedance 1.0 generates 5-second 1080p videos in about 41 seconds with claimed superior motion quality, prompt adherence, and multi-shot consistency compared to prior models.