pith. machine review for the scientific record. sign in

arxiv: 2604.25427 · v1 · submitted 2026-04-28 · 💻 cs.CV

Recognition: unknown

A Systematic Post-Train Framework for Video Generation

Zeyue Xue , Siming Fu , Jie Huang , Shuai Lu , Haoran Li , Yijun Liu , Yuming Li , Xiaoxuan He , Mengzhao Chen , Haoyang Huang , Nan Duan , Ping Luo

Authors on Pith no claims yet

Pith reviewed 2026-05-07 16:55 UTC · model grok-4.3

classification 💻 cs.CV
keywords video diffusion modelspost-trainingRLHFGRPOtemporal coherenceinstruction followinginference optimizationprompt enhancement
0
0 comments X

The pith

A four-stage post-training framework aligns video diffusion models with user intentions by using supervised fine-tuning, group-relative reinforcement learning, prompt enhancement, and inference optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a post-training process that takes pretrained video diffusion models and makes them more reliable for practical use. It begins with supervised fine-tuning to turn the model into one that follows instructions, then applies a custom reinforcement learning step called GRPO to raise visual quality and reduce flickering over time. Next it refines the input prompts with a separate language model and finally tunes the generation process for speed. A reader would care because current video generators often produce inconsistent or off-prompt results at high computational cost, and this pipeline claims to fix those problems without raising the cost of each new video.

Core claim

We propose a comprehensive post-training framework that systematically aligns pretrained models with user intentions through four synergistic stages: we first employ Supervised Fine-Tuning (SFT) to transform the base model into a stable instruction-following policy, followed by a Reinforcement Learning from Human Feedback (RLHF) stage that utilizes a novel Group Relative Policy Optimization (GRPO) method tailored for video diffusion to enhance perceptual quality and temporal coherence; subsequently, we integrate Prompt Enhancement via a specialized language model to refine user inputs, and finally address system efficiency through Inference Optimization. Together, these components provide a

What carries the argument

The Group Relative Policy Optimization (GRPO) method, a reinforcement learning technique that compares groups of video samples to assign relative rewards and thereby improves perceptual quality and temporal coherence after supervised fine-tuning.

If this is right

  • The resulting models follow text prompts more accurately than the original pretrained versions.
  • Generated videos exhibit fewer visual artifacts and smoother motion across frames.
  • Instruction following and visual quality both improve while the number of sampling steps stays fixed.
  • The same sequence of stages can be applied to other pretrained video diffusion models without architecture changes.
  • Real-world deployment becomes more practical because controllability and aesthetics rise without extra per-video compute.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The GRPO reward design might transfer to image-only diffusion models if the temporal coherence term is replaced by a spatial consistency term.
  • Repeated cycles of this post-training loop could gradually reduce the need for ever-larger pretraining runs.
  • If prompt enhancement is removed, the controllability gains from SFT alone may still hold, allowing lighter pipelines for resource-constrained settings.
  • Extending GRPO to longer video clips would test whether the group-relative comparison scales without quadratic growth in memory.

Load-bearing premise

The four stages can be combined so that the GRPO reinforcement step improves quality without erasing the instruction-following behavior learned earlier or creating new training instabilities.

What would settle it

Train the same base video diffusion model once with the full four-stage pipeline and once with supervised fine-tuning alone, then compare human preference scores and temporal consistency metrics on identical prompts while measuring exact sampling steps and wall-clock time per video.

Figures

Figures reproduced from arXiv: 2604.25427 by Haoran Li, Haoyang Huang, Jie Huang, Mengzhao Chen, Nan Duan, Ping Luo, Shuai Lu, Siming Fu, Xiaoxuan He, Yijun Liu, Yuming Li, Zeyue Xue.

Figure 1
Figure 1. Figure 1: Overview of our post-training framework for video generation. We organize the pipeline into four complementary stages to bridge pretrained models and practical deployment. In Phase 1, supervised fine-tuning (SFT) uses curated data to establish a stable instruction-following baseline. In Phase 2, RLHF via a GRPO-based trainer aligns the generator with multi-dimensional reward signals, improving aesthetics, … view at source ↗
Figure 2
Figure 2. Figure 2: The visualization of RLHF on Wan-2.1. 8 view at source ↗
read the original abstract

While large-scale video diffusion models have demonstrated impressive capabilities in generating high-resolution and semantically rich content, a significant gap remains between their pretraining performance and real-world deployment requirements due to critical issues such as prompt sensitivity, temporal inconsistency, and prohibitive inference costs. To bridge this gap, we propose a comprehensive post-training framework that systematically aligns pretrained models with user intentions through four synergistic stages: we first employ Supervised Fine-Tuning (SFT) to transform the base model into a stable instruction-following policy, followed by a Reinforcement Learning from Human Feedback (RLHF) stage that utilizes a novel Group Relative Policy Optimization (GRPO) method tailored for video diffusion to enhance perceptual quality and temporal coherence; subsequently, we integrate Prompt Enhancement via a specialized language model to refine user inputs, and finally address system efficiency through Inference Optimization. Together, these components provide a systematic approach to improving visual quality, temporal coherence, and instruction following, while preserving the controllability learned during pretraining. The result is a practical blueprint for building scalable post-training pipelines that are stable, adaptable, and effective in real-world deployment. Extensive experiments demonstrate that this unified pipeline effectively mitigates common artifacts and significantly improves controllability and visual aesthetics while adhering to strict sampling cost constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a four-stage post-training framework for video diffusion models to address prompt sensitivity, temporal inconsistency, and high inference costs. The stages are: (1) Supervised Fine-Tuning (SFT) to create a stable instruction-following policy, (2) RLHF using a novel Group Relative Policy Optimization (GRPO) method to improve perceptual quality and temporal coherence, (3) Prompt Enhancement via a specialized language model, and (4) Inference Optimization for efficiency. The central claim is that this synergistic pipeline mitigates artifacts, improves controllability and visual aesthetics, and preserves pretraining benefits while respecting sampling cost constraints, as demonstrated by extensive experiments.

Significance. If the empirical claims hold, the work could provide a practical, modular blueprint for post-training large video generation models, helping close the gap between pretraining capabilities and real-world deployment requirements in the video diffusion field.

major comments (2)
  1. [Abstract] Abstract: the assertion that 'extensive experiments demonstrate that this unified pipeline effectively mitigates common artifacts and significantly improves controllability and visual aesthetics' is unsupported by any quantitative results, baselines, ablation studies, metrics, or error analysis. This is load-bearing for the central claim of synergistic improvement under fixed sampling cost.
  2. [§3 (Framework)] The description of GRPO (Group Relative Policy Optimization) and its integration after SFT lacks any derivation, loss formulation, or pseudocode; without these, it is impossible to evaluate whether the method preserves controllability from pretraining or introduces instabilities as assumed in the weakest premise.
minor comments (1)
  1. Ensure all acronyms (SFT, RLHF, GRPO) are expanded on first use in the main text and that the four stages are clearly numbered and cross-referenced in the experiments section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below, with plans for targeted revisions to strengthen the presentation of our claims and methods.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion that 'extensive experiments demonstrate that this unified pipeline effectively mitigates common artifacts and significantly improves controllability and visual aesthetics' is unsupported by any quantitative results, baselines, ablation studies, metrics, or error analysis. This is load-bearing for the central claim of synergistic improvement under fixed sampling cost.

    Authors: We agree that the abstract would benefit from explicit quantitative anchors to support the central claim. In the revised manuscript, we will update the abstract to reference specific metrics (e.g., relative gains in perceptual quality scores, temporal coherence indices, and instruction-following accuracy) drawn from the experiments, along with baseline comparisons and ablation highlights, all under fixed sampling budgets. The full quantitative results, including error analysis and synergistic effects, are presented in Sections 4–5; we will add a concise summary table or bullet points to the abstract for immediate visibility. revision: yes

  2. Referee: [§3 (Framework)] The description of GRPO (Group Relative Policy Optimization) and its integration after SFT lacks any derivation, loss formulation, or pseudocode; without these, it is impossible to evaluate whether the method preserves controllability from pretraining or introduces instabilities as assumed in the weakest premise.

    Authors: We acknowledge that the current description of GRPO in Section 3 is high-level and requires formalization for reproducibility and evaluation. In the revised version, we will expand this section to include: (1) the mathematical derivation of the GRPO objective as a group-relative extension of policy optimization tailored to video diffusion trajectories, (2) the explicit loss formulation that combines reward signals for perceptual quality and temporal consistency while regularizing against deviation from the SFT policy, and (3) pseudocode for the full GRPO training loop. This will explicitly demonstrate how the method initializes from the SFT checkpoint to preserve pretraining controllability and incorporates variance-reduction techniques to mitigate instability risks. revision: yes

Circularity Check

0 steps flagged

No significant circularity; framework is descriptive and experiment-driven

full rationale

The paper presents a four-stage post-training pipeline (SFT, GRPO-based RLHF, prompt enhancement, inference optimization) for video diffusion models. No mathematical derivations, equations, fitted parameters, or first-principles predictions appear in the provided abstract or description. Claims rest on experimental outcomes rather than any self-referential reduction where a result is defined by or equivalent to its inputs by construction. The central premise of synergistic stage integration is stated as an empirical observation, not a tautological or self-cited derivation. This is a standard systems/framework paper without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract describes an applied engineering framework built on standard machine learning techniques without defining new mathematical axioms, free parameters, or invented physical entities. GRPO is presented as a methodological contribution rather than a new entity.

pith-pipeline@v0.9.0 · 5550 in / 1271 out tokens · 55994 ms · 2026-05-07T16:55:04.912896+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Do Joint Audio-Video Generation Models Understand Physics?

    cs.SD 2026-05 unverdicted novelty 7.0

    Current joint audio-video generation models lack robust physical commonsense, especially during transitions and when prompted for impossible behaviors.

Reference graph

Works this paper leans on

50 extracted references · 31 canonical work pages · cited by 1 Pith paper · 17 internal anchors

  1. [1]

    Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

  2. [2]

    Scaling rectified flow trans- 9 formers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- 9 formers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024

  3. [3]

    High- resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  4. [4]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

  5. [5]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

  6. [6]

    Seedream 2.0: A native chinese-english bilingual image generation foundation model.arXiv preprint arXiv:2503.07703, 2025

    Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, et al. Seedream 2.0: A native chinese-english bilingual image generation foundation model.arXiv preprint arXiv:2503.07703, 2025

  7. [7]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models.arXiv preprint arXiv:2506.09113, 2025

  8. [8]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

  9. [9]

    Kling-omni technical report.arXiv preprint arXiv:2512.16776, 2025

    Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, et al. Kling-omni technical report.arXiv preprint arXiv:2512.16776, 2025

  10. [10]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

  11. [11]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

  12. [12]

    Evalcrafter: Benchmarking and evaluating large video generation models

    Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22139–22149, 2024

  13. [13]

    DanceGRPO: Unleashing GRPO on Visual Generation

    Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818, 2025

  14. [14]

    Seedream 3.0 Technical Report

    Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xi- aochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report.arXiv preprint arXiv:2504.11346, 2025

  15. [15]

    Promptist: Automated prompt optimization for text-to- image synthesis

    WeiJie Li, Jin Wang, and Xuejie Zhang. Promptist: Automated prompt optimization for text-to- image synthesis. InCCF international conference on natural language processing and Chinese computing, pages 295–306. Springer, 2024

  16. [16]

    Neuroprompts: An adaptive framework to optimize prompts for text-to-image generation

    Shachar Rosenman, Vasudev Lal, and Phillip Howard. Neuroprompts: An adaptive framework to optimize prompts for text-to-image generation. InProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 159–167, 2024

  17. [17]

    Improving text-to-image consistency via automatic prompt optimization.arXiv preprint arXiv:2403.17804, 2024

    Oscar Mañas, Pietro Astolfi, Melissa Hall, Candace Ross, Jack Urbanek, Adina Williams, Aishwarya Agrawal, Adriana Romero-Soriano, and Michal Drozdzal. Improving text-to-image consistency via automatic prompt optimization.arXiv preprint arXiv:2403.17804, 2024. 10

  18. [18]

    Reprompt: Reasoning-augmented reprompting for text-to-image generation via reinforcement learning.arXiv preprint arXiv:2505.17540, 2025

    Mingrui Wu, Lu Wang, Pu Zhao, Fangkai Yang, Jianjin Zhang, Jianfeng Liu, Yuefeng Zhan, Weihao Han, Hao Sun, Jiayi Ji, et al. Reprompt: Reasoning-augmented reprompting for text-to-image generation via reinforcement learning.arXiv preprint arXiv:2505.17540, 2025

  19. [19]

    Promptrl: Prompt matters in rl for flow-based image generation,

    Fu-Yun Wang, Han Zhang, Michael Gharbi, Hongsheng Li, and Taesung Park. Promptrl: Prompt matters in rl for flow-based image generation.arXiv preprint arXiv:2602.01382, 2026

  20. [20]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  21. [21]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  22. [22]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025

  23. [23]

    MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

    Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Yiming Cheng, Miles Yang, Zhao Zhong, and Liefeng Bo. Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde. arXiv preprint arXiv:2507.21802, 2025

  24. [24]

    Coefficients-preserving sampling for reinforcement learning with flow matching

    Feng Wang and Zihao Yu. Coefficients-preserving sampling for reinforcement learning with flow matching.arXiv preprint arXiv:2509.05952, 2025

  25. [25]

    Tempflow-grpo: When timing matters for grpo in flow models.arXiv preprint arXiv:2508.04324,

    Xiaoxuan He, Siming Fu, Yuke Zhao, Wanli Li, Jian Yang, Dacheng Yin, Fengyun Rao, and Bo Zhang. Tempflow-grpo: When timing matters for grpo in flow models.arXiv preprint arXiv:2508.04324, 2025

  26. [26]

    G2rpo: Granular grpo for precise reward in flow models

    Yujie Zhou, Pengyang Ling, Jiazi Bu, Yibin Wang, Yuhang Zang, Jiaqi Wang, Li Niu, and Guangtao Zhai. G2rpo: Granular grpo for precise reward in flow models. 2025

  27. [27]

    E-grpo: High entropy steps drive effective reinforcement learning for flow models.arXiv preprint arXiv:2601.00423, 2026

    Shengjun Zhang, Zhang Zhang, Chensheng Dai, and Yueqi Duan. E-grpo: High entropy steps drive effective reinforcement learning for flow models.arXiv preprint arXiv:2601.00423, 2026

  28. [28]

    arXiv preprint arXiv:2509.06040 (2025) 2, 3

    Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, and Shanghang Zhang. Branchgrpo: Stable and efficient grpo with structured branching in diffusion models.arXiv preprint arXiv:2509.06040, 2025

  29. [29]

    DiffusionNFT: Online Diffusion Reinforcement with Forward Process

    Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. Diffusionnft: Online diffusion reinforcement with forward process.arXiv preprint arXiv:2509.16117, 2025

  30. [30]

    arXiv preprint arXiv:2509.25050 , year=

    Shuchen Xue, Chongjian Ge, Shilong Zhang, Yichen Li, and Zhi-Ming Ma. Advantage weighted matching: Aligning rl with pretraining in diffusion models.arXiv preprint arXiv:2509.25050, 2025

  31. [31]

    Astrolabe: Steering forward-process reinforcement learning for distilled autoregressive video models.arXiv preprint arXiv:2603.17051,

    Songchun Zhang, Zeyue Xue, Siming Fu, Jie Huang, Xianghao Kong, Y Ma, Haoyang Huang, Nan Duan, and Anyi Rao. Astrolabe: Steering forward-process reinforcement learning for distilled autoregressive video models.arXiv preprint arXiv:2603.17051, 2026

  32. [32]

    Professor forcing: A new algorithm for training recurrent networks.Advances in neural information processing systems, 29, 2016

    Alex M Lamb, Anirudh Goyal ALIAS PARTH GOY AL, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks.Advances in neural information processing systems, 29, 2016

  33. [33]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

    Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

  34. [34]

    From slow bidirectional to fast autoregressive video diffusion models

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22963–22974, 2025. 11

  35. [35]

    One-step diffusion with distribution matching distillation

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024

  36. [36]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025

  37. [37]

    LongLive: Real-time Interactive Long Video Generation

    Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. Longlive: Real-time interactive long video generation.arXiv preprint arXiv:2509.22622, 2025

  38. [38]

    Omniforcing: Unleashing real-time joint audio-visual generation.arXiv preprint arXiv:2603.11647, 2026

    Yaofeng Su, Yuming Li, Zeyue Xue, Jie Huang, Siming Fu, Haoran Li, Ying Li, Zezhong Qian, Haoyang Huang, and Nan Duan. Omniforcing: Unleashing real-time joint audio-visual generation.arXiv preprint arXiv:2603.11647, 2026

  39. [39]

    Causal forcing: Autoregressivediffusiondistillationdonerightforhigh-qualityreal-timeinteractivevideogeneration

    Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, and Jun Zhu. Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation.arXiv preprint arXiv:2602.02214, 2026

  40. [40]

    Hpsv3: Towards wide-spectrum hu- man preference score

    Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. Hpsv3: Towards wide-spectrum hu- man preference score. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15086–15095, 2025

  41. [41]

    Imagereward: Learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023

  42. [42]

    Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652–36663, 2023

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652–36663, 2023

  43. [43]

    Improving Video Generation with Human Feedback

    Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, et al. Improving video generation with human feedback.arXiv preprint arXiv:2501.13918, 2025

  44. [44]

    Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

    Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341, 2023

  45. [45]

    Human preference score: Better aligning text-to-image models with human preference

    Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score: Better aligning text-to-image models with human preference. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2096–2105, 2023

  46. [46]

    Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation

    Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, et al. Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2105–2123, 2024

  47. [47]

    Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation

    Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, et al. Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 11269–11277, 2026

  48. [48]

    Unified Reward Model for Multimodal Understanding and Generation

    Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation.arXiv preprint arXiv:2503.05236, 2025

  49. [49]

    RewardDance: Reward scaling in visual generation.arXiv preprint arXiv:2509.08826, 2025

    Jie Wu, Yu Gao, Zilyu Ye, Ming Li, Liang Li, Hanzhong Guo, Jie Liu, Zeyue Xue, Xiaoxia Hou, Wei Liu, et al. Rewarddance: Reward scaling in visual generation.arXiv preprint arXiv:2509.08826, 2025

  50. [50]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025. 12