pith. machine review for the scientific record.

arxiv: 2512.04678 · v2 · submitted 2025-12-04 · 💻 cs.CV

Recognition: 3 theorem links

Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 18:12 UTC · model grok-4.3

classification 💻 cs.CV
keywords streaming video generation · video diffusion models · knowledge distillation · exponential moving average · motion dynamics · reward model · real-time inference

The pith

EMA-updated sink tokens and reward-weighted distillation fix copied frames and weak motion in streaming video models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing few-step video diffusion models use sliding-window attention with the first frames held as fixed sink tokens. This design keeps long-range attention stable but causes later frames to duplicate the opening content and lose motion. Reward Forcing replaces the static sinks with EMA-Sink tokens that are continuously refreshed by folding evicted frames into the memory via exponential moving average, preserving both distant context and recent change at zero added cost. The same framework adds Rewarded Distribution Matching Distillation, which scores every training clip for motion strength with a vision-language model and up-weights high-scoring examples so the student model learns lively dynamics rather than averaging across all data equally.
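To make the EMA-Sink idea concrete, here is a minimal numpy sketch of the update described above: tokens evicted from the sliding window are folded into a fixed-size sink memory with an exponential moving average. The function names, shapes, pooling choice, and decay value are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ema_sink_update(sink_tokens, evicted_tokens, decay=0.95):
    """Fold tokens that just left the sliding attention window into a
    fixed-size sink memory via an exponential moving average.

    sink_tokens:    (num_sink, dim) current sink memory, initialized
                    from the first frames' tokens.
    evicted_tokens: (num_sink, dim) evicted tokens, pooled or projected
                    to the same fixed size.
    decay:          EMA decay factor (a free parameter per the ledger).
    """
    return decay * sink_tokens + (1.0 - decay) * evicted_tokens

# Toy streaming loop: the sink keeps a slowly refreshed summary of history
# while the sliding window holds only the most recent frame tokens.
dim, num_sink, window = 8, 4, 6
rng = np.random.default_rng(0)

sink = rng.normal(size=(num_sink, dim))            # initialized from first frames
window_tokens = list(rng.normal(size=(window, dim)))

for step in range(20):
    window_tokens.append(rng.normal(size=(dim,)))  # newest frame token arrives
    evicted = window_tokens.pop(0)                 # oldest token exits the window
    # one possible fusion: broadcast the evicted token across all sink slots
    sink = ema_sink_update(sink, np.tile(evicted, (num_sink, 1)))
    # the attention context at this step would be [sink ; window_tokens]
```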

Core claim

Reward Forcing claims that replacing fixed initial-frame sink tokens with exponential moving-average updates, combined with reweighting the distribution-matching loss toward high-dynamics samples, eliminates frame copying, restores motion richness, and delivers state-of-the-art benchmark results alongside streaming generation at 23.1 frames per second on a single H100 GPU.

What carries the argument

EMA-Sink, a fixed-size set of tokens initialized from the first frames and updated by fusing tokens that exit the sliding attention window via an exponential moving average, paired with Re-DMD, which reweights the distillation objective according to per-sample motion rewards from a vision-language model.
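A hedged sketch of the reweighting half of that machinery: per-sample distillation losses are averaged with weights derived from the VLM motion rewards rather than uniformly. The softmax form, the `strength` parameter, and the toy numbers are assumptions for illustration; the paper's Re-DMD objective may combine the reward differently.

```python
import numpy as np

def re_dmd_weights(rewards, strength=0.5):
    """Map per-sample motion rewards (e.g. VLM dynamics ratings) to
    normalized sample weights; `strength` plays the role of the reward
    weighting hyperparameter listed in the ledger."""
    logits = strength * np.asarray(rewards, dtype=float)
    w = np.exp(logits - logits.max())
    return w / w.sum()

def rewarded_dmd_loss(per_sample_dmd_loss, rewards, strength=0.5):
    """Reweight a batch of distribution-matching losses toward
    high-dynamics samples instead of treating every sample equally."""
    w = re_dmd_weights(rewards, strength)
    return float(np.sum(w * np.asarray(per_sample_dmd_loss, dtype=float)))

# Toy batch: two lively clips and two near-static clips.
losses  = [1.2, 0.9, 1.1, 1.0]   # per-sample DMD losses from the student
rewards = [4.0, 5.0, 1.0, 0.5]   # VLM motion scores (higher = more dynamic)
print(rewarded_dmd_loss(losses, rewards))   # dynamic clips dominate the objective
print(float(np.mean(losses)))               # vanilla DMD: uniform average
```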

If this is right

  • Streaming video generation reaches interactive speeds on single-GPU hardware while preserving long-horizon consistency.
  • Distilled models can emphasize dynamic content during training without losing fidelity to the teacher distribution.
  • Standard video benchmarks record state-of-the-art quality under the new EMA-Sink and Re-DMD regime.
  • High-quality real-time video becomes practical for interactive world simulation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same EMA memory update could replace static context tokens in other long-sequence generators such as audio or 3-D scene models.
  • If the vision-language reward generalizes, the forcing technique may improve motion quality in non-streaming text-to-video systems.
  • Real-time performance enables closed-loop setups where generated video is fed back as new conditioning.
  • Selective emphasis on high-variance samples during distillation offers a general route to increasing output diversity.

Load-bearing premise

The vision-language model that scores motion dynamics supplies ratings that genuinely identify better training examples without introducing new artifacts or shifting the output distribution in unintended ways.

What would settle it

Train the model with Reward Forcing and measure both motion-magnitude metrics and the fraction of frames that closely match the initial frame; if motion scores fail to rise above the baseline or initial-frame duplication remains high, the claimed improvements do not hold.
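A crude, self-contained version of that measurement, assuming frames as arrays in [0, 1]; the duplication threshold and the difference-based motion proxy are stand-ins for whatever optical-flow or benchmark metrics the paper actually reports.

```python
import numpy as np

def frame_copy_fraction(frames, threshold=0.02):
    """Fraction of later frames that nearly duplicate the first frame,
    measured by mean absolute pixel difference."""
    first = frames[0]
    diffs = [np.abs(f - first).mean() for f in frames[1:]]
    return float(np.mean([d < threshold for d in diffs]))

def motion_magnitude(frames):
    """Average frame-to-frame change, a cheap proxy for motion strength."""
    return float(np.mean([np.abs(frames[i + 1] - frames[i]).mean()
                          for i in range(len(frames) - 1)]))

# Toy check: a frozen clip vs. a slowly drifting one.
static = [np.full((16, 16), 0.5) for _ in range(10)]
moving = [np.full((16, 16), 0.5) + 0.03 * t for t in range(10)]
print(frame_copy_fraction(static), motion_magnitude(static))   # ~1.0 and ~0.0
print(frame_copy_fraction(moving), motion_magnitude(moving))   # ~0.0 and ~0.03
```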

read the original abstract

Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and diminished motion dynamics. To address this, we introduce Reward Forcing, a novel framework with two key designs. First, we propose EMA-Sink, which maintains fixed-size tokens initialized from initial frames and continuously updated by fusing evicted tokens via exponential moving average as they exit the sliding window. Without additional computation cost, EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial frame copying while maintaining long-horizon consistency. Second, to better distill motion dynamics from teacher models, we propose a novel Rewarded Distribution Matching Distillation (Re-DMD). Vanilla distribution matching treats every training sample equally, limiting the model's ability to prioritize dynamic content. Instead, Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model. Re-DMD significantly enhances motion quality while preserving data fidelity. We include both quantitative and qualitative experiments to show that Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that Reward Forcing enables efficient streaming video generation by combining EMA-Sink (exponential moving average updates to fixed-size sink tokens in sliding-window attention to preserve long-term context and recent dynamics without extra cost) with Rewarded Distribution Matching Distillation (Re-DMD), which biases distillation toward high-dynamics samples scored by a vision-language model. This addresses copied initial frames and reduced motion in prior distillation methods, yielding SOTA quantitative/qualitative results on standard benchmarks and 23.1 FPS inference on a single H100 GPU.

Significance. If the VLM reward signal proves reliable and unbiased, the work would meaningfully advance practical streaming video diffusion by mitigating error accumulation and motion degradation in long-horizon generation, with direct relevance to interactive simulation and real-time applications. The reported FPS and benchmark gains, if reproducible, represent a concrete efficiency-quality tradeoff improvement over existing sliding-window distillation baselines.

major comments (2)
  1. [Experiments] Experiments section: The central claim that Re-DMD improves motion dynamics rests on VLM-rated sample prioritization, yet no quantitative validation is provided (e.g., Pearson correlation between VLM scores and optical-flow magnitude, or human preference studies comparing Re-DMD vs. uniform DMD). Without such checks, it remains possible that the VLM introduces systematic bias or artifacts that degrade temporal consistency precisely in the streaming regime.
  2. [§3.2] §3.2 (Re-DMD description): The reward weighting strength is listed as a free hyperparameter, yet the abstract and method claim that Re-DMD 'significantly enhances motion quality while preserving data fidelity' without reporting sensitivity analysis or ablation on this parameter; if performance collapses outside a narrow range, the method's robustness is overstated.
minor comments (2)
  1. [Abstract] The abstract states 'without additional computation cost' for EMA-Sink, but the EMA fusion of evicted tokens is an extra operation governed by a tunable decay factor; its training-time overhead, however small, should be quantified.
  2. [Tables] Table captions and metric definitions (e.g., exact formulation of the dynamics reward) would benefit from explicit equations to allow direct replication.
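On the first major comment, the kind of check the referee asks for is cheap to run once per-clip VLM scores and flow magnitudes are in hand. A minimal sketch follows; the numbers are hypothetical placeholders, and real values would come from the VLM and an optical-flow estimator over a held-out validation set.

```python
import numpy as np

def reward_signal_check(vlm_scores, flow_magnitudes):
    """Sanity-check the VLM motion reward: Pearson correlation against
    per-clip mean optical-flow magnitude, plus top-quartile rank overlap.
    Both inputs hold one value per validation clip."""
    s = np.asarray(vlm_scores, dtype=float)
    f = np.asarray(flow_magnitudes, dtype=float)
    pearson_r = float(np.corrcoef(s, f)[0, 1])
    k = max(1, len(s) // 4)
    top_overlap = len(set(np.argsort(-s)[:k]) & set(np.argsort(-f)[:k])) / k
    return pearson_r, top_overlap

# Hypothetical per-clip numbers, purely to show the call shape.
print(reward_signal_check([4.2, 1.1, 3.5, 0.8, 2.9, 4.8],
                          [6.0, 1.4, 4.9, 1.1, 3.8, 7.2]))
```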

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: The central claim that Re-DMD improves motion dynamics rests on VLM-rated sample prioritization, yet no quantitative validation is provided (e.g., Pearson correlation between VLM scores and optical-flow magnitude, or human preference studies comparing Re-DMD vs. uniform DMD). Without such checks, it remains possible that the VLM introduces systematic bias or artifacts that degrade temporal consistency precisely in the streaming regime.

    Authors: We agree that additional quantitative validation would bolster the claims regarding the VLM reward signal. In the revised version, we will add an analysis computing the Pearson correlation between VLM-assigned scores and optical flow magnitudes across a validation set of video samples. Additionally, we will include results from a human preference study involving 50 participants comparing videos generated with Re-DMD against those from uniform DMD, focusing on motion quality and temporal consistency. These additions will help confirm the reliability of the VLM signal and address potential bias concerns. revision: yes

  2. Referee: [§3.2] §3.2 (Re-DMD description): The reward weighting strength is listed as a free hyperparameter, yet the abstract and method claim that Re-DMD 'significantly enhances motion quality while preserving data fidelity' without reporting sensitivity analysis or ablation on this parameter; if performance collapses outside a narrow range, the method's robustness is overstated.

    Authors: The referee is correct that the reward weighting strength λ is a hyperparameter. We chose λ=0.5 based on initial tuning to achieve the reported balance. To demonstrate robustness, the revised manuscript will include an ablation study varying λ from 0.1 to 1.0, reporting metrics such as FVD, motion magnitude, and data fidelity measures. This will show that performance remains stable within a reasonable range around the selected value, supporting the claims without overstatement. revision: yes

Circularity Check

0 steps flagged

No significant circularity; novel components are independent of inputs

full rationale

The paper's core contributions—EMA-Sink token management via exponential moving average fusion of evicted tokens and Re-DMD which reweights distillation samples by external VLM-rated dynamics—are presented as new designs that extend standard sliding-window distillation without reducing to self-defined parameters or fitted inputs. No equations equate a claimed prediction or output distribution to an internal fit or prior self-citation by construction. The VLM reward signal is treated as an external oracle rather than derived from the model's own outputs, and performance claims rest on empirical benchmarks rather than tautological renaming or uniqueness theorems imported from the same authors. Any self-citations to prior distillation work are non-load-bearing and do not substitute for the novel mechanisms.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The framework rests on the assumption that a VLM can serve as a reliable motion-quality oracle and that EMA fusion preserves both consistency and dynamics without new failure modes; no new physical entities are introduced.

free parameters (2)
  • EMA decay factor
    Controls how quickly evicted tokens influence the sink tokens; must be chosen or tuned.
  • Reward weighting strength
    Determines how strongly high-dynamics samples are prioritized during distillation.
axioms (1)
  • domain assumption: Vision-language models can reliably quantify motion dynamics in generated video frames
    Invoked to create the reward signal for Re-DMD.

pith-pipeline@v0.9.0 · 5585 in / 1273 out tokens · 27927 ms · 2026-05-16T18:12:24.693015+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.DAlembert.Inevitability bilinear_family_forced · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Re-DMD biases the model’s output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model.

  • Foundation.DimensionForcing dimension_forced · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

    cs.CV 2026-05 unverdicted novelty 7.0

    KVPO aligns streaming autoregressive video generators with human preferences via ODE-native GRPO, using KV cache for semantic exploration and TVE for velocity-based policy modeling, yielding gains in quality and alignment.

  2. DMGD: Train-Free Dataset Distillation with Semantic-Distribution Matching in Diffusion Models

    cs.CV 2026-05 unverdicted novelty 7.0

    DMGD achieves better performance than fine-tuned SOTA methods in dataset distillation on ImageNet subsets by using semantic matching through conditional likelihood optimization and OT-based distribution matching in a ...

  3. Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.

  4. Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

    cs.CV 2026-04 unverdicted novelty 7.0

    Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x highe...

  5. Efficient Video Diffusion Models: Advancements and Challenges

    cs.CV 2026-04 unverdicted novelty 7.0

    A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

  6. Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

    cs.LG 2026-04 unverdicted novelty 7.0

    The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

  7. Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis

    cs.CV 2026-04 unverdicted novelty 7.0

    Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressi...

  8. Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

    cs.CV 2026-05 unverdicted novelty 6.0

    Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.

  9. PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

    cs.CV 2026-05 conditional novelty 6.0

    PhyMotion scores generated human videos by grounding recovered 3D poses in a physics simulator across kinematic, contact, and dynamic axes, yielding stronger human correlation and larger RL post-training gains than pr...

  10. Stream-T1: Test-Time Scaling for Streaming Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...

  11. Human Cognition in Machines: A Unified Perspective of World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  12. Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation

    cs.CV 2026-04 conditional novelty 6.0

    Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.

  13. Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Salt improves low-step video generation quality by adding endpoint-consistent regularization to distribution matching distillation and using cache-conditioned feature alignment for autoregressive models.

  14. Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

    eess.IV 2026-03 unverdicted novelty 6.0

    Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

  15. Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

    cs.CV 2026-02 unverdicted novelty 6.0

    Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.

  16. EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation

    cs.CV 2026-02 unverdicted novelty 4.0

    EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consi...

  17. Advancing Open-source World Models

    cs.CV 2026-01 unverdicted novelty 4.0

    LingBot-World is presented as an open-source world model that delivers high-fidelity simulation, minute-level contextual consistency, and real-time interactivity under one second latency.

Reference graph

Works this paper leans on

99 extracted references · 99 canonical work pages · cited by 17 Pith papers · 31 internal anchors
