pith. machine review for the scientific record.

arxiv: 2604.16272 · v2 · submitted 2026-04-17 · 💻 cs.CV · cs.AI · cs.CL

Recognition: unknown

VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects

Bangya Liu, Haozhi Wang, Jiayi Zhang, Jie Yang, Jiongze Yu, Minglai Yang, Mingyang Wu, Qing Yin, Qi Zheng, Sicong Jiang, Siyuan Yang, Xiangbo Gao, Xinghao Chen, Zhengzhong Tu, Zihan Wang

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:20 UTC · model grok-4.3

classification 💻 cs.CV cs.AI cs.CL
keywords editing · video · quality · vefx-reward · benchmark · generic · instruction · systems

The pith

VEFX-Bench releases a large human-labeled video editing dataset, a multi-dimensional reward model, and a standardized benchmark that better matches human judgments than generic evaluators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors noticed that testing AI video editors is hard because there are no good large datasets with human ratings or specialized judges. They built VEFX-Dataset by collecting thousands of before-and-after video pairs across many edit types and had people rate each one separately on whether the edit followed the instruction, whether the final video looked good, and whether the changes stayed only in the intended areas. From this data they trained VEFX-Reward, a model that looks at the original video, the text instruction, and the edited result together to predict scores on those three aspects using ordinal regression. They also created VEFX-Bench, a smaller fixed set of 300 examples for fair head-to-head tests of different editing systems. When they compared VEFX-Reward to ordinary vision-language models, it matched human opinions more closely on standard metrics and preference tests. Applying the new evaluator to existing commercial and open-source editors showed that current systems often produce visually plausible results but still fail at precise instruction following or keeping edits localized.
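
As a concrete illustration of the scoring scheme described above, here is a minimal sketch of a per-dimension ordinal-regression head. It assumes a 5-level rating scale, a CORAL-style cumulative-link formulation, and an upstream backbone that has already fused the source video, instruction, and edited video into one feature vector; the names, feature dimension, and fusion mechanism are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch (not the paper's architecture) of a per-dimension
# ordinal-regression scoring head in the spirit of VEFX-Reward.
# Assumptions: a 5-level rating scale, a CORAL-style cumulative-link head,
# and an upstream backbone that has already fused (source video, instruction,
# edited video) into a single feature vector.
import torch
import torch.nn as nn

DIMENSIONS = ["instruction_following", "rendering_quality", "edit_exclusivity"]
NUM_LEVELS = 5  # hypothetical ordinal scale 1..5

class OrdinalHead(nn.Module):
    """CORAL-style head: shared projection plus K-1 ordered thresholds."""
    def __init__(self, feat_dim: int, num_levels: int = NUM_LEVELS):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 1, bias=False)
        self.thresholds = nn.Parameter(torch.zeros(num_levels - 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # logit at position k corresponds to P(score > k) before the sigmoid
        return self.fc(feats) - self.thresholds

class MultiDimRewardHead(nn.Module):
    """One ordinal head per quality dimension on top of the fused embedding."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.heads = nn.ModuleDict({d: OrdinalHead(feat_dim) for d in DIMENSIONS})

    def forward(self, fused_feats: torch.Tensor) -> dict:
        scores = {}
        for dim, head in self.heads.items():
            probs = torch.sigmoid(head(fused_feats))  # P(score > k), shape (B, K-1)
            scores[dim] = 1.0 + probs.sum(dim=-1)     # expected ordinal level in [1, 5]
        return scores

# Toy usage: in practice fused_feats would come from a video-language backbone.
fused_feats = torch.randn(2, 768)
print(MultiDimRewardHead(768)(fused_feats))
```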

Core claim

Experiments show that VEFX-Reward aligns more strongly with human judgments than generic VLM judges and prior reward models on both standard IQA/VQA metrics and group-wise preference evaluation.
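
For readers unfamiliar with the metrics named in the claim, the sketch below shows how SRCC and PLCC, the standard IQA/VQA correlation measures, would be computed between evaluator scores and human ratings. The arrays are toy placeholders, not the paper's data.

```python
# Toy illustration of the alignment metrics named above: SRCC (Spearman) and
# PLCC (Pearson) between reward-model scores and human ratings.
import numpy as np
from scipy.stats import pearsonr, spearmanr

human = np.array([4.0, 2.0, 5.0, 3.0, 1.0, 4.0])      # mean human ratings per example
predicted = np.array([3.6, 2.4, 4.8, 3.1, 1.5, 3.9])  # evaluator scores per example

srcc, _ = spearmanr(human, predicted)  # rank-order agreement
plcc, _ = pearsonr(human, predicted)   # linear agreement
print(f"SRCC={srcc:.3f}  PLCC={plcc:.3f}")
```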

Load-bearing premise

The three decoupled dimensions (Instruction Following, Rendering Quality, Edit Exclusivity) are assumed to comprehensively and independently capture editing quality, and the collected human annotations are treated as reliable ground truth without detailed discussion of inter-annotator agreement or potential biases.

read the original abstract

As AI-assisted video creation becomes increasingly practical, instruction-guided video editing has become essential for refining generated or captured footage to meet professional requirements. Yet the field still lacks both a large-scale human-annotated dataset with complete editing examples and a standardized evaluator for comparing editing systems. Existing resources are limited by small scale, missing edited outputs, or the absence of human quality labels, while current evaluation often relies on expensive manual inspection or generic vision-language model judges that are not specialized for editing quality. We introduce VEFX-Dataset, a human-annotated dataset containing 5,049 video editing examples across 9 major editing categories and 32 subcategories, each labeled along three decoupled dimensions: Instruction Following, Rendering Quality, and Edit Exclusivity. Building on VEFX-Dataset, we propose VEFX-Reward, a reward model designed specifically for video editing quality assessment. VEFX-Reward jointly processes the source video, the editing instruction, and the edited video, and predicts per-dimension quality scores via ordinal regression. We further release VEFX-Bench, a benchmark of 300 curated video-prompt pairs for standardized comparison of editing systems. Experiments show that VEFX-Reward aligns more strongly with human judgments than generic VLM judges and prior reward models on both standard IQA/VQA metrics and group-wise preference evaluation. Using VEFX-Reward as an evaluator, we benchmark representative commercial and open-source video editing systems, revealing a persistent gap between visual plausibility, instruction following, and edit locality in current models. Our project page is https://xiangbogaobarry.github.io/VEFX-Bench/.
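
One plausible reading of the group-wise preference evaluation mentioned in the abstract is sketched below: within each group of edited outputs for the same source and instruction, check whether the evaluator's top-ranked output matches the human-preferred one. The function name and protocol details are assumptions for illustration; the paper's exact protocol may differ.

```python
# Hedged sketch of a group-wise preference check: within each group of edits of
# the same source/instruction, does the evaluator rank the human-preferred
# output highest? This is one plausible reading of "group-wise preference
# evaluation", not the paper's documented protocol.
from collections import defaultdict

def groupwise_preference_accuracy(records):
    """records: iterable of (group_id, model_score, human_score) tuples."""
    groups = defaultdict(list)
    for gid, model_s, human_s in records:
        groups[gid].append((model_s, human_s))
    hits = 0
    for items in groups.values():
        best_by_model = max(range(len(items)), key=lambda i: items[i][0])
        best_by_human = max(range(len(items)), key=lambda i: items[i][1])
        hits += int(best_by_model == best_by_human)
    return hits / len(groups)

toy = [("g1", 0.8, 5), ("g1", 0.3, 2), ("g2", 0.4, 3), ("g2", 0.9, 4)]
print(groupwise_preference_accuracy(toy))  # 1.0 on this toy input
```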

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces VEFX-Dataset, a human-annotated collection of 5,049 video editing examples spanning 9 major categories and 32 subcategories, with each example labeled on three decoupled dimensions (Instruction Following, Rendering Quality, and Edit Exclusivity). It proposes VEFX-Reward, a model that jointly processes source video, editing instruction, and edited output to predict per-dimension quality scores via ordinal regression, and releases VEFX-Bench, a standardized set of 300 video-prompt pairs. Experiments claim that VEFX-Reward exhibits stronger correlation with human judgments than generic VLM judges and prior reward models on IQA/VQA metrics and group-wise preference evaluation, and the authors use the model to benchmark commercial and open-source editing systems.

Significance. If the human annotations prove reliable and the reported alignment gains hold under scrutiny, the work supplies a valuable large-scale resource and specialized evaluator for instruction-guided video editing, an area currently hampered by reliance on generic VLMs or costly manual review. The public release of the dataset, reward model, and benchmark would enable reproducible comparisons and targeted improvements in editing locality, instruction adherence, and rendering quality.

major comments (3)
  1. [VEFX-Dataset description] Description of VEFX-Dataset: No inter-annotator agreement statistics (Fleiss' kappa, Krippendorff's alpha, or pairwise correlations), annotator count, qualification criteria, or bias analysis are reported for the 5,049 labels across the three dimensions. This directly undermines the central claim that VEFX-Reward aligns more strongly with 'human judgments,' as the ground truth reliability remains unverified.
  2. [Experiments] Experiments section: The abstract and results claim superior performance of VEFX-Reward over baselines on IQA/VQA metrics and group-wise preference, yet supply no details on training procedure, data splits, hyperparameter choices, or ablation studies. Without these, the superiority result cannot be assessed for robustness or overfitting to the annotation set.
  3. [Introduction and VEFX-Dataset] Introduction and VEFX-Dataset: The three dimensions are asserted to be 'decoupled' and jointly sufficient for editing quality, but no evidence (e.g., correlation matrices between dimensions or coverage analysis against editing failure modes) is provided to support independence or completeness. If dimensions overlap or miss key aspects, both the reward model training and the benchmark conclusions become questionable.
minor comments (1)
  1. [Abstract] The abstract states that VEFX-Reward 'jointly processes the source video, the editing instruction, and the edited video,' but the precise architecture (e.g., fusion mechanism, backbone choice) is not summarized, hindering quick assessment of novelty relative to prior multimodal reward models.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to improve the manuscript.

read point-by-point responses
  1. Referee: Description of VEFX-Dataset: No inter-annotator agreement statistics (Fleiss' kappa, Krippendorff's alpha, or pairwise correlations), annotator count, qualification criteria, or bias analysis are reported for the 5,049 labels across the three dimensions. This directly undermines the central claim that VEFX-Reward aligns more strongly with 'human judgments,' as the ground truth reliability remains unverified.

    Authors: We agree that reporting inter-annotator agreement is important for validating annotation reliability. The original manuscript omitted these details for brevity. We have computed Fleiss' kappa and Krippendorff's alpha across the three dimensions (showing substantial agreement), and will add these statistics, the annotator count (five qualified editors), qualification criteria, and bias analysis to the revised VEFX-Dataset section to better support the human judgment alignment claims. revision: yes

  2. Referee: Experiments section: The abstract and results claim superior performance of VEFX-Reward over baselines on IQA/VQA metrics and group-wise preference, yet supply no details on training procedure, data splits, hyperparameter choices, or ablation studies. Without these, the superiority result cannot be assessed for robustness or overfitting to the annotation set.

    Authors: We acknowledge the need for full experimental transparency. The revised manuscript will include a new subsection detailing the VEFX-Reward training procedure, data splits (80/10/10), hyperparameter choices, optimization settings, and ablation studies on input modalities and loss terms. This will allow assessment of robustness and reduce concerns about overfitting. revision: yes

  3. Referee: Introduction and VEFX-Dataset: The three dimensions are asserted to be 'decoupled' and jointly sufficient for editing quality, but no evidence (e.g., correlation matrices between dimensions or coverage analysis against editing failure modes) is provided to support independence or completeness. If dimensions overlap or miss key aspects, both the reward model training and the benchmark conclusions become questionable.

    Authors: The dimensions were chosen to target distinct failure modes based on video editing literature. To provide empirical support, the revised VEFX-Dataset section will include a correlation matrix demonstrating low inter-dimension correlations and a coverage analysis against common editing failure modes. These additions will substantiate the design choices and their sufficiency for the reward model and benchmark (a minimal sketch of the agreement and correlation analyses follows these responses). revision: yes
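
A minimal sketch of the two analyses promised in responses 1 and 3: inter-annotator agreement (Fleiss' kappa on per-dimension ratings) and an inter-dimension Spearman correlation matrix. The rater count, the 1-5 scale, and the toy data are assumptions for illustration only, not the paper's annotation setup.

```python
# Hedged sketch of the rebuttal's promised analyses: inter-annotator agreement
# (Fleiss' kappa) and inter-dimension correlation. Rater counts, scale, and
# data below are illustrative assumptions.
import numpy as np
from scipy.stats import spearmanr
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)

# (n_examples, n_raters) integer ratings on a 1..5 scale for one dimension
ratings = rng.integers(1, 6, size=(50, 5))
table, _ = aggregate_raters(ratings)  # per-example counts over the 5 categories
print("Fleiss' kappa:", fleiss_kappa(table))

# (n_examples, 3) per-example mean scores for the three quality dimensions
scores = rng.uniform(1, 5, size=(50, 3))
rho, _ = spearmanr(scores)            # 3x3 Spearman correlation matrix (columns = dimensions)
print("inter-dimension Spearman matrix:\n", np.round(rho, 2))
```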

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external human annotations as independent ground truth

full rationale

The paper collects a new human-annotated dataset (VEFX-Dataset) with 5,049 examples labeled on three dimensions, trains VEFX-Reward on it via ordinal regression, and evaluates alignment against held-out human judgments plus comparisons to external VLM judges and prior reward models. No equations, fitted parameters, or self-citations reduce the reported correlations or benchmark results to quantities defined by the same inputs. The central claims are empirical performance numbers on separate test splits and external baselines, with no self-definitional loops, renamed predictions, or load-bearing self-citations. This is the standard non-circular structure for a benchmark-plus-reward-model paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claims rest on the assumption that human annotations constitute reliable ground truth and that the chosen three dimensions plus ordinal regression are appropriate for modeling editing quality.

axioms (2)
  • domain assumption Human annotations on the three dimensions provide reliable and unbiased ground truth for video editing quality
    The dataset labels are used both to train VEFX-Reward and to validate its alignment with humans.
  • ad hoc to paper The three dimensions are independent and jointly sufficient to evaluate editing quality
    The paper introduces and relies on this decoupled labeling scheme without external validation of independence.

pith-pipeline@v0.9.0 · 5652 in / 1485 out tokens · 54243 ms · 2026-05-10T08:20:17.513667+00:00 · methodology

discussion (0)

