Refining Multidimensional Video Reward Models via Disentangled Influence Functions

Hideki Nakayama; Muyao Wang; Zeke Xie

arxiv: 2605.28203 · v1 · pith:353ECJWKnew · submitted 2026-05-27 · 💻 cs.LG

Refining Multidimensional Video Reward Models via Disentangled Influence Functions

Muyao Wang , Zeke Xie , Hideki Nakayama This is my paper

Pith reviewed 2026-06-29 13:43 UTC · model grok-4.3

classification 💻 cs.LG

keywords multidimensional video reward modelstext-to-video generationinfluence functionsdata pruningreweightingdimensional heterogeneitysupervision risk

0 comments

The pith

Disentangled influence functions estimate per-dimension supervision risk to refine multidimensional video reward models beyond global filtering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that training samples for video reward models have reliability that varies across different evaluation dimensions, a phenomenon called dimensional heterogeneity. Global filtering methods that use a single metric for all dimensions are therefore ill-suited for these tasks. The authors propose a disentangled influence framework to estimate risk separately for each dimension. Using this, they develop pruning and reweighting strategies that remove or downweight high-risk samples on a per-dimension basis. Experiments show these strategies produce reward models more aligned with ground truth than standard approaches.

Core claim

The central claim is that addressing dimensional heterogeneity through a disentangled influence framework enables dimension-specific data refinement strategies, specifically Dimension-Disentangled Pruning and Dimension-Disentangled Reweighting, which outperform global filtering baselines in aligning multidimensional video reward models with ground truth.

What carries the argument

The disentangled influence framework, which efficiently estimates dimension-specific supervision risk for each evaluation dimension.

If this is right

Dimension-Disentangled Pruning removes samples with extreme high-risk for specific dimensions.
Dimension-Disentangled Reweighting softly down-weights high-risk supervision per dimension.
These yield reward models with superior alignment to ground truth compared to global methods.
The approach handles the complex nature of video data in T2V tasks more effectively than scalar-metric filters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could extend to other multimodal tasks where evaluation dimensions show varying data reliability.
Influence functions may be adapted more broadly for fine-grained curation when scalar metrics fall short.
Scalability tests on larger video datasets could expose limits in the risk estimation process.

Load-bearing premise

Dimension-specific supervision risk can be reliably estimated by the disentangled influence framework without the estimation itself introducing new biases or requiring unavailable per-dimension ground-truth labels.

What would settle it

An experiment on a held-out test set where the disentangled pruning and reweighting strategies produce no better or worse alignment metrics than global filtering baselines would falsify the performance claim.

Figures

Figures reproduced from arXiv: 2605.28203 by Hideki Nakayama, Muyao Wang, Zeke Xie.

**Figure 1.** Figure 1: Dimensional Heterogeneity in Video Training Data. We visualize the distribution of sample self-influence scores for Visual Quality (xaxis) versus Dynamic Degree (y-axis). The scatter plot exhibits a dispersed distribution with negligible correlation, indicating that supervision reliability is dimension-specific. Samples with high self-influence in one dimension, which we use as a proxy for dimension-spe… view at source ↗

**Figure 2.** Figure 2: We analyze the impact of removing the top- [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Dimensional Heterogeneity across reward dimensions. We visualize the relationship between Visual Quality influence (x-axis) and three other reward dimensions (y-axis): (a) Temporal Consistency, (b) Text Alignment, and (c) Factual Consistency. The color indicates the normalized Standard Influence (Total Loss). The distinct “L-shaped” distribution reveals that supervision reliability is disentangled. Crucial… view at source ↗

**Figure 4.** Figure 4: Qualitative Visualization of Label Noise in T2V Alignment. We show samples with high T2V self-influence but low visual self-influence. Although they are labeled as Score 1 (Worst Alignment) in the training set, human re-evaluation rates them as Score 4, as the videos match substantial parts of their prompts, such as “a knight on horse” or “kid sleeping with cat”. This confirms label noise in the T2V alignm… view at source ↗

**Figure 5.** Figure 5: Scatter plot comparisons across dimensions. [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

**Figure 6.** Figure 6: Label noise visualization. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗

**Figure 7.** Figure 7: Label noise visualization. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗

read the original abstract

As Text-to-Video (T2V) generation models continue to evolve, the complexity of video evaluation necessitates a fine-grained assessment across various axes. To address this, recent works have focused on developing Multidimensional Video Reward Models (MVRMs), which decompose the evaluation process to better align with the multifaceted nature of human visual perception. However, training effective MVRMs is fundamentally challenged by the complex nature of video data. In this work, we identify a critical phenomenon termed Dimensional Heterogeneity: the reliability of a training sample can vary substantially across evaluation dimensions, meaning that a sample may provide reliable supervision for one objective while inducing high supervision risk for another. Consequently, prevailing data-centric methods that filter based on global scalar metrics are ill-posed for T2V tasks. To address this, we propose a disentangled influence framework that that efficiently estimates dimension-specific supervision risk. Leveraging this framework, we introduce two dimension-disentangled refinement strategies: Dimension-Disentangled Pruning, which removes extreme high-risk samples, and Dimension-Disentangled Reweighting, which softly down-weights high-risk supervision. Extensive experiments demonstrate that our disentangled strategies significantly outperform global filtering baselines, yielding reward models with superior alignment to ground truth.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper flags that sample reliability varies across reward dimensions in video models and adapts influence functions for per-dimension pruning and reweighting, but the estimates lack direct validation.

read the letter

The main thing here is that the authors notice training samples for multidimensional video reward models can be good for one evaluation axis and bad for another, which global filtering misses. They call this dimensional heterogeneity and build a disentangled influence approach to score risk per dimension, then prune or reweight accordingly. That framing and the two concrete strategies are the actual new pieces relative to standard influence-function data cleaning.

What works is the problem statement. It lines up with how video data really behaves—motion quality and semantic match do not always move together—and the abstract shows they ran comparisons against global baselines. If the experiments hold up with proper controls, the refinement step could be a practical tweak for anyone training these reward models.

The soft spot is validation. Influence functions give an estimate from gradients, but splitting them across dimensions requires some decomposition or attribution step. Without per-dimension ground-truth labels, there is no direct check on whether the risk scores match actual supervision quality. The abstract notes labels are unavailable in practice, so any reported gains could be sensitive to how the disentangling is done or to shared features across dimensions. If the estimates are noisy, the advantage over global methods might shrink.

This is for people working on reward models for text-to-video or similar multimodal alignment tasks. A reader who already uses influence functions or data pruning in that area would get the most out of the adaptation. The work is coherent enough on its own terms to go to referees, though the lack of visible derivation or metric details means it would need a full methods check. I would send it for review rather than desk reject.

Referee Report

2 major / 1 minor

Summary. The paper identifies a phenomenon called Dimensional Heterogeneity in training data for Multidimensional Video Reward Models (MVRMs) for Text-to-Video generation, where a sample's reliability can differ substantially across evaluation dimensions. It argues that global scalar filtering is therefore ill-posed and proposes a disentangled influence framework to estimate per-dimension supervision risk. From this it derives two strategies—Dimension-Disentangled Pruning (removing extreme high-risk samples) and Dimension-Disentangled Reweighting (softly down-weighting high-risk supervision)—and reports that both outperform global filtering baselines, producing reward models with superior alignment to ground truth.

Significance. If the dimension-specific influence estimates prove accurate and the reported gains hold under proper controls, the work would offer a practical data-centric improvement for training fine-grained reward models on complex video data, where scalar metrics are known to be insufficient.

major comments (2)

[Abstract (and § on disentangled influence framework)] The central claim that the disentangled strategies outperform global baselines rests on the accuracy of the per-dimension supervision-risk estimates. The abstract states that per-dimension ground-truth labels are unavailable in practice, yet provides no alternative validation (e.g., synthetic probes, human correlation studies, or ablation on known noisy dimensions) that the influence-derived scores recover true dimension-specific quality rather than shared video-feature artifacts. Without such evidence the reported gains could be artifacts of the estimation procedure itself.
[Disentangled influence framework description] Influence functions are defined on a scalar loss; the manuscript must specify the exact decomposition (loss splitting, gradient attribution, or auxiliary heads) used to obtain dimension-specific scores. If the decomposition correlates estimates across dimensions or re-introduces the global metric through shared parameters, the “disentangled” claim is undermined. No equation or algorithmic box in the provided description clarifies this step.

minor comments (1)

[Abstract] Abstract contains a repeated word: “framework that that efficiently”.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity and evidence.

read point-by-point responses

Referee: [Abstract (and § on disentangled influence framework)] The central claim that the disentangled strategies outperform global baselines rests on the accuracy of the per-dimension supervision-risk estimates. The abstract states that per-dimension ground-truth labels are unavailable in practice, yet provides no alternative validation (e.g., synthetic probes, human correlation studies, or ablation on known noisy dimensions) that the influence-derived scores recover true dimension-specific quality rather than shared video-feature artifacts. Without such evidence the reported gains could be artifacts of the estimation procedure itself.

Authors: We agree that the absence of direct per-dimension ground-truth labels makes validation important. The reported gains in ground-truth alignment provide indirect support that the estimates capture dimension-specific effects rather than artifacts, as global baselines underperform. To strengthen this, we will add synthetic probe experiments with controlled dimension-specific noise injection in the revision. revision: yes
Referee: [Disentangled influence framework description] Influence functions are defined on a scalar loss; the manuscript must specify the exact decomposition (loss splitting, gradient attribution, or auxiliary heads) used to obtain dimension-specific scores. If the decomposition correlates estimates across dimensions or re-introduces the global metric through shared parameters, the “disentangled” claim is undermined. No equation or algorithmic box in the provided description clarifies this step.

Authors: The full manuscript specifies the decomposition via auxiliary dimension-specific heads combined with a per-dimension loss split in the framework section. We will add an explicit algorithmic box and equations in the revision to clarify the procedure and confirm that shared parameters do not reintroduce global metrics. revision: yes

Circularity Check

0 steps flagged

No circularity: framework and empirical claims are independent of inputs

full rationale

The provided abstract and description introduce Dimensional Heterogeneity as an observed phenomenon and propose a disentangled influence framework to estimate per-dimension supervision risk, followed by pruning and reweighting strategies. No equations, derivations, or self-citations are present that reduce any prediction or uniqueness claim to a fitted parameter or prior author result by construction. The central claim of outperformance over global baselines is presented as an empirical finding from experiments, not a definitional or self-referential necessity. This is the most common honest outcome when no load-bearing reduction is exhibited.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no concrete free parameters, axioms, or invented entities can be identified from the manuscript.

pith-pipeline@v0.9.1-grok · 5748 in / 1121 out tokens · 33154 ms · 2026-06-29T13:43:14.733386+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

35 extracted references · 14 canonical work pages · 10 internal anchors

[1]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Kling-Omni Technical Report

Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, et al. Kling-omni technical report.arXiv preprint arXiv:2512.16776, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[6]

ModelScope Text-to-Video Technical Report

Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report.arXiv preprint arXiv:2308.06571, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Learning to summarize with human feedback

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea V oss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in neural information processing systems, 33:3008–3021, 2020

2020
[8]

Fine-Tuning Language Models from Human Preferences

Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909
[9]

Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017

2017
[10]

Evalcrafter: Benchmarking and evaluating large video generation models

Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22139–22149, 2024

2024
[11]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

2024
[12]

Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation

Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo Chen, Xu Sun, and Lu Hou. Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation. Advances in Neural Information Processing Systems, 36:62352–62387, 2023

2023
[13]

Mj-video: Benchmarking and rewarding video generation with fine-grained video preference

Haibo Tong, Zhaoyang Wang, Zhaorun Chen, Haonian Ji, Shi Qiu, Siwei Han, Kexin Geng, Zhongkai Xue, Yiyang Zhou, Peng Xia, et al. Mj-video: Benchmarking and rewarding video generation with fine-grained video preference. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems
[14]

Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation

Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, et al. Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2105–2123, 2024

2024
[15]

VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, et al. Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation.arXiv preprint arXiv:2412.21059, 2024. 10

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Improving Video Generation with Human Feedback

Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, et al. Improving video generation with human feedback.arXiv preprint arXiv:2501.13918, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Understanding impact of human feedback via influence functions.arXiv preprint arXiv:2501.05790, 2025

Taywon Min, Haeone Lee, Yongchan Kwon, and Kimin Lee. Understanding impact of human feedback via influence functions.arXiv preprint arXiv:2501.05790, 2025

work page arXiv 2025
[18]

Understanding black-box predictions via influence functions

Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. InInternational conference on machine learning, pages 1885–1894. PMLR, 2017

2017
[19]

Estimating training data influence by tracing gradient descent.Advances in Neural Information Processing Systems, 33:19920–19930, 2020

Garima Pruthi, Frederick Liu, Satyen Kale, and Mukund Sundararajan. Estimating training data influence by tracing gradient descent.Advances in Neural Information Processing Systems, 33:19920–19930, 2020

2020
[20]

Boosting text-to-video generative model with mllms feedback.Advances in Neural Information Processing Systems, 37:139444–139469, 2024

Xun Wu, Shaohan Huang, Guolong Wang, Jing Xiong, and Furu Wei. Boosting text-to-video generative model with mllms feedback.Advances in Neural Information Processing Systems, 37:139444–139469, 2024

2024
[21]

Towards Accurate Generative Models of Video: A New Metric & Challenges

Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[22]

Clipscore: A reference-free evaluation metric for image captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. InProceedings of the 2021 conference on empirical methods in natural language processing, pages 7514–7528, 2021

2021
[23]

Mantis: Interleaved multi-image instruction tuning.Transactions on Machine Learning Research

Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning.Transactions on Machine Learning Research
[24]

Video-llava: Learning united visual representation by alignment before projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. InProceedings of the 2024 conference on empirical methods in natural language processing, pages 5971–5984, 2024

2024
[25]

Multi-task learning using uncertainty to weigh losses for scene geometry and semantics

Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 7482–7491, 2018

2018
[26]

Gradient surgery for multi-task learning.Advances in neural information processing systems, 33:5824–5836, 2020

Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning.Advances in neural information processing systems, 33:5824–5836, 2020

2020
[27]

Reasonable effectiveness of random weighting: A litmus test for multi-task learning.Transactions on Machine Learning Research

Baijiong Lin, Feiyang Ye, Yu Zhang, and Ivor Tsang. Reasonable effectiveness of random weighting: A litmus test for multi-task learning.Transactions on Machine Learning Research
[28]

Data pruning via moving-one-sample-out.Advances in Neural Information Processing Systems, 36, 2024

Haoru Tan, Sitong Wu, Fei Du, Yukang Chen, Zhibin Wang, Fan Wang, and Xiaojuan Qi. Data pruning via moving-one-sample-out.Advances in Neural Information Processing Systems, 36, 2024

2024
[29]

Relatif: Identifying explanatory training samples via relative influence

Elnaz Barshan, Marc-Etienne Brunet, and Gintare Karolina Dziugaite. Relatif: Identifying explanatory training samples via relative influence. InInternational Conference on Artificial Intelligence and Statistics, pages 1899–1909. PMLR, 2020

1909
[30]

Trak: Attributing model behavior at scale.arXiv preprint arXiv:2303.14186, 2023

Sung Min Park, Kristian Georgiev, Andrew Ilyas, Guillaume Leclerc, and Aleksander Madry. Trak: Attributing model behavior at scale.arXiv preprint arXiv:2303.14186, 2023

work page arXiv 2023
[31]

Fast approximate natural gradient descent in a kronecker factored eigenbasis.Advances in neural information processing systems, 31, 2018

Thomas George, César Laurent, Xavier Bouthillier, Nicolas Ballas, and Pascal Vincent. Fast approximate natural gradient descent in a kronecker factored eigenbasis.Advances in neural information processing systems, 31, 2018

2018
[32]

Variational bayesian last layers.arXiv preprint arXiv:2404.11599, 2024

James Harrison, John Willes, and Jasper Snoek. Variational bayesian last layers.arXiv preprint arXiv:2404.11599, 2024

work page arXiv 2024
[33]

Genai arena: An open evaluation platform for generative models.Advances in Neural Information Processing Systems, 37:79889–79908, 2024

Dongfu Jiang, Max Ku, Tianle Li, Yuansheng Ni, Shizhuo Sun, Rongqi Fan, and Wenhu Chen. Genai arena: An open evaluation platform for generative models.Advances in Neural Information Processing Systems, 37:79889–79908, 2024. 11 A Proof of Theorem 3.1 In this section, we provide the detailed derivation of the Disentangled Influence Decomposition presented i...

2024
[34]

Parameter-Space Reduction:By restricting the influence analysis to the last linear layer, we reduce the problem scale from the vast full-parameter space ( P≈8B ) to the low- dimensional feature space (d≈4096), realizing a complexity reduction ofO(P)→ O(d)
[35]

Backward-Free Estimation:Crucially, our closed-form derivation allows us to compute the exact gradient norm using only forward-pass statistics (residuals and embeddings). This completely eliminates the need for the computationally expensive backpropagation process (backward pass), which typically consumes significantly more time and memory than the forwar...

work page arXiv

[1] [1]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Kling-Omni Technical Report

Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, et al. Kling-omni technical report.arXiv preprint arXiv:2512.16776, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[6] [6]

ModelScope Text-to-Video Technical Report

Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report.arXiv preprint arXiv:2308.06571, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[7] [7]

Learning to summarize with human feedback

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea V oss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in neural information processing systems, 33:3008–3021, 2020

2020

[8] [8]

Fine-Tuning Language Models from Human Preferences

Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909

[9] [9]

Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017

2017

[10] [10]

Evalcrafter: Benchmarking and evaluating large video generation models

Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22139–22149, 2024

2024

[11] [11]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

2024

[12] [12]

Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation

Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo Chen, Xu Sun, and Lu Hou. Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation. Advances in Neural Information Processing Systems, 36:62352–62387, 2023

2023

[13] [13]

Mj-video: Benchmarking and rewarding video generation with fine-grained video preference

Haibo Tong, Zhaoyang Wang, Zhaorun Chen, Haonian Ji, Shi Qiu, Siwei Han, Kexin Geng, Zhongkai Xue, Yiyang Zhou, Peng Xia, et al. Mj-video: Benchmarking and rewarding video generation with fine-grained video preference. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

[14] [14]

Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation

Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, et al. Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2105–2123, 2024

2024

[15] [15]

VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, et al. Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation.arXiv preprint arXiv:2412.21059, 2024. 10

work page internal anchor Pith review Pith/arXiv arXiv 2024

[16] [16]

Improving Video Generation with Human Feedback

Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, et al. Improving video generation with human feedback.arXiv preprint arXiv:2501.13918, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] [17]

Understanding impact of human feedback via influence functions.arXiv preprint arXiv:2501.05790, 2025

Taywon Min, Haeone Lee, Yongchan Kwon, and Kimin Lee. Understanding impact of human feedback via influence functions.arXiv preprint arXiv:2501.05790, 2025

work page arXiv 2025

[18] [18]

Understanding black-box predictions via influence functions

Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. InInternational conference on machine learning, pages 1885–1894. PMLR, 2017

2017

[19] [19]

Estimating training data influence by tracing gradient descent.Advances in Neural Information Processing Systems, 33:19920–19930, 2020

Garima Pruthi, Frederick Liu, Satyen Kale, and Mukund Sundararajan. Estimating training data influence by tracing gradient descent.Advances in Neural Information Processing Systems, 33:19920–19930, 2020

2020

[20] [20]

Boosting text-to-video generative model with mllms feedback.Advances in Neural Information Processing Systems, 37:139444–139469, 2024

Xun Wu, Shaohan Huang, Guolong Wang, Jing Xiong, and Furu Wei. Boosting text-to-video generative model with mllms feedback.Advances in Neural Information Processing Systems, 37:139444–139469, 2024

2024

[21] [21]

Towards Accurate Generative Models of Video: A New Metric & Challenges

Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[22] [22]

Clipscore: A reference-free evaluation metric for image captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. InProceedings of the 2021 conference on empirical methods in natural language processing, pages 7514–7528, 2021

2021

[23] [23]

Mantis: Interleaved multi-image instruction tuning.Transactions on Machine Learning Research

Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning.Transactions on Machine Learning Research

[24] [24]

Video-llava: Learning united visual representation by alignment before projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. InProceedings of the 2024 conference on empirical methods in natural language processing, pages 5971–5984, 2024

2024

[25] [25]

Multi-task learning using uncertainty to weigh losses for scene geometry and semantics

Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 7482–7491, 2018

2018

[26] [26]

Gradient surgery for multi-task learning.Advances in neural information processing systems, 33:5824–5836, 2020

Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning.Advances in neural information processing systems, 33:5824–5836, 2020

2020

[27] [27]

Reasonable effectiveness of random weighting: A litmus test for multi-task learning.Transactions on Machine Learning Research

Baijiong Lin, Feiyang Ye, Yu Zhang, and Ivor Tsang. Reasonable effectiveness of random weighting: A litmus test for multi-task learning.Transactions on Machine Learning Research

[28] [28]

Data pruning via moving-one-sample-out.Advances in Neural Information Processing Systems, 36, 2024

Haoru Tan, Sitong Wu, Fei Du, Yukang Chen, Zhibin Wang, Fan Wang, and Xiaojuan Qi. Data pruning via moving-one-sample-out.Advances in Neural Information Processing Systems, 36, 2024

2024

[29] [29]

Relatif: Identifying explanatory training samples via relative influence

Elnaz Barshan, Marc-Etienne Brunet, and Gintare Karolina Dziugaite. Relatif: Identifying explanatory training samples via relative influence. InInternational Conference on Artificial Intelligence and Statistics, pages 1899–1909. PMLR, 2020

1909

[30] [30]

Trak: Attributing model behavior at scale.arXiv preprint arXiv:2303.14186, 2023

Sung Min Park, Kristian Georgiev, Andrew Ilyas, Guillaume Leclerc, and Aleksander Madry. Trak: Attributing model behavior at scale.arXiv preprint arXiv:2303.14186, 2023

work page arXiv 2023

[31] [31]

Fast approximate natural gradient descent in a kronecker factored eigenbasis.Advances in neural information processing systems, 31, 2018

Thomas George, César Laurent, Xavier Bouthillier, Nicolas Ballas, and Pascal Vincent. Fast approximate natural gradient descent in a kronecker factored eigenbasis.Advances in neural information processing systems, 31, 2018

2018

[32] [32]

Variational bayesian last layers.arXiv preprint arXiv:2404.11599, 2024

James Harrison, John Willes, and Jasper Snoek. Variational bayesian last layers.arXiv preprint arXiv:2404.11599, 2024

work page arXiv 2024

[33] [33]

Genai arena: An open evaluation platform for generative models.Advances in Neural Information Processing Systems, 37:79889–79908, 2024

Dongfu Jiang, Max Ku, Tianle Li, Yuansheng Ni, Shizhuo Sun, Rongqi Fan, and Wenhu Chen. Genai arena: An open evaluation platform for generative models.Advances in Neural Information Processing Systems, 37:79889–79908, 2024. 11 A Proof of Theorem 3.1 In this section, we provide the detailed derivation of the Disentangled Influence Decomposition presented i...

2024

[34] [34]

Parameter-Space Reduction:By restricting the influence analysis to the last linear layer, we reduce the problem scale from the vast full-parameter space ( P≈8B ) to the low- dimensional feature space (d≈4096), realizing a complexity reduction ofO(P)→ O(d)

[35] [35]

Backward-Free Estimation:Crucially, our closed-form derivation allows us to compute the exact gradient norm using only forward-pass statistics (residuals and embeddings). This completely eliminates the need for the computationally expensive backpropagation process (backward pass), which typically consumes significantly more time and memory than the forwar...

work page arXiv