pith. machine review for the scientific record.

arxiv: 2604.16272 · v2 · submitted 2026-04-17 · 💻 cs.CV · cs.AI · cs.CL

Recognition: unknown

VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects

Bangya Liu, Haozhi Wang, Jiayi Zhang, Jie Yang, Jiongze Yu, Minglai Yang, Mingyang Wu, Qing Yin, Qi Zheng, Sicong Jiang, Siyuan Yang, Xiangbo Gao, Xinghao Chen, Zhengzhong Tu, Zihan Wang

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:20 UTC · model grok-4.3

classification 💻 cs.CV cs.AI cs.CL
keywords editing · video · quality · vefx-reward · benchmark · generic · instruction · systems

The pith

VEFX-Bench releases a large human-labeled video editing dataset, a multi-dimensional reward model, and a standardized benchmark that better matches human judgments than generic evaluators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors noticed that testing AI video editors is hard because there are no good large datasets with human ratings or specialized judges. They built VEFX-Dataset by collecting thousands of before-and-after video pairs across many edit types and had people rate each one separately on whether the edit followed the instruction, whether the final video looked good, and whether the changes stayed only in the intended areas. From this data they trained VEFX-Reward, a model that looks at the original video, the text instruction, and the edited result together to predict scores on those three aspects using ordinal regression. They also created VEFX-Bench, a smaller fixed set of 300 examples for fair head-to-head tests of different editing systems. When they compared VEFX-Reward to ordinary vision-language models, it matched human opinions more closely on standard metrics and preference tests. Applying the new evaluator to existing commercial and open-source editors showed that current systems often produce visually plausible results but still fail at precise instruction following or keeping edits localized.
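
As a concrete illustration of the scoring scheme described above, here is a minimal sketch of a per-dimension ordinal-regression head. It assumes a 5-level rating scale, a CORAL-style cumulative-link formulation, and an upstream backbone that has already fused the source video, instruction, and edited video into one feature vector; the names, feature dimension, and fusion mechanism are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch (not the paper's architecture) of a per-dimension
# ordinal-regression scoring head in the spirit of VEFX-Reward.
# Assumptions: a 5-level rating scale, a CORAL-style cumulative-link head,
# and an upstream backbone that has already fused (source video, instruction,
# edited video) into a single feature vector.
import torch
import torch.nn as nn

DIMENSIONS = ["instruction_following", "rendering_quality", "edit_exclusivity"]
NUM_LEVELS = 5  # hypothetical ordinal scale 1..5

class OrdinalHead(nn.Module):
    """CORAL-style head: shared projection plus K-1 ordered thresholds."""
    def __init__(self, feat_dim: int, num_levels: int = NUM_LEVELS):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 1, bias=False)
        self.thresholds = nn.Parameter(torch.zeros(num_levels - 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # logit at position k corresponds to P(score > k) before the sigmoid
        return self.fc(feats) - self.thresholds

class MultiDimRewardHead(nn.Module):
    """One ordinal head per quality dimension on top of the fused embedding."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.heads = nn.ModuleDict({d: OrdinalHead(feat_dim) for d in DIMENSIONS})

    def forward(self, fused_feats: torch.Tensor) -> dict:
        scores = {}
        for dim, head in self.heads.items():
            probs = torch.sigmoid(head(fused_feats))  # P(score > k), shape (B, K-1)
            scores[dim] = 1.0 + probs.sum(dim=-1)     # expected ordinal level in [1, 5]
        return scores

# Toy usage: in practice fused_feats would come from a video-language backbone.
fused_feats = torch.randn(2, 768)
print(MultiDimRewardHead(768)(fused_feats))
```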

Core claim

Experiments show that VEFX-Reward aligns more strongly with human judgments than generic VLM judges and prior reward models on both standard IQA/VQA metrics and group-wise preference evaluation.
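
For readers unfamiliar with the metrics named in the claim, the sketch below shows how SRCC and PLCC, the standard IQA/VQA correlation measures, would be computed between evaluator scores and human ratings. The arrays are toy placeholders, not the paper's data.

```python
# Toy illustration of the alignment metrics named above: SRCC (Spearman) and
# PLCC (Pearson) between reward-model scores and human ratings.
import numpy as np
from scipy.stats import pearsonr, spearmanr

human = np.array([4.0, 2.0, 5.0, 3.0, 1.0, 4.0])      # mean human ratings per example
predicted = np.array([3.6, 2.4, 4.8, 3.1, 1.5, 3.9])  # evaluator scores per example

srcc, _ = spearmanr(human, predicted)  # rank-order agreement
plcc, _ = pearsonr(human, predicted)   # linear agreement
print(f"SRCC={srcc:.3f}  PLCC={plcc:.3f}")
```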

Load-bearing premise

The three decoupled dimensions (Instruction Following, Rendering Quality, Edit Exclusivity) are assumed to comprehensively and independently capture editing quality, and the collected human annotations are treated as reliable ground truth without detailed discussion of inter-annotator agreement or potential biases.

read the original abstract

As AI-assisted video creation becomes increasingly practical, instruction-guided video editing has become essential for refining generated or captured footage to meet professional requirements. Yet the field still lacks both a large-scale human-annotated dataset with complete editing examples and a standardized evaluator for comparing editing systems. Existing resources are limited by small scale, missing edited outputs, or the absence of human quality labels, while current evaluation often relies on expensive manual inspection or generic vision-language model judges that are not specialized for editing quality. We introduce VEFX-Dataset, a human-annotated dataset containing 5,049 video editing examples across 9 major editing categories and 32 subcategories, each labeled along three decoupled dimensions: Instruction Following, Rendering Quality, and Edit Exclusivity. Building on VEFX-Dataset, we propose VEFX-Reward, a reward model designed specifically for video editing quality assessment. VEFX-Reward jointly processes the source video, the editing instruction, and the edited video, and predicts per-dimension quality scores via ordinal regression. We further release VEFX-Bench, a benchmark of 300 curated video-prompt pairs for standardized comparison of editing systems. Experiments show that VEFX-Reward aligns more strongly with human judgments than generic VLM judges and prior reward models on both standard IQA/VQA metrics and group-wise preference evaluation. Using VEFX-Reward as an evaluator, we benchmark representative commercial and open-source video editing systems, revealing a persistent gap between visual plausibility, instruction following, and edit locality in current models. Our project page is https://xiangbogaobarry.github.io/VEFX-Bench/.
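
One plausible reading of the group-wise preference evaluation mentioned in the abstract is sketched below: within each group of edited outputs for the same source and instruction, check whether the evaluator's top-ranked output matches the human-preferred one. The function name and protocol details are assumptions for illustration; the paper's exact protocol may differ.

```python
# Hedged sketch of a group-wise preference check: within each group of edits of
# the same source/instruction, does the evaluator rank the human-preferred
# output highest? This is one plausible reading of "group-wise preference
# evaluation", not the paper's documented protocol.
from collections import defaultdict

def groupwise_preference_accuracy(records):
    """records: iterable of (group_id, model_score, human_score) tuples."""
    groups = defaultdict(list)
    for gid, model_s, human_s in records:
        groups[gid].append((model_s, human_s))
    hits = 0
    for items in groups.values():
        best_by_model = max(range(len(items)), key=lambda i: items[i][0])
        best_by_human = max(range(len(items)), key=lambda i: items[i][1])
        hits += int(best_by_model == best_by_human)
    return hits / len(groups)

toy = [("g1", 0.8, 5), ("g1", 0.3, 2), ("g2", 0.4, 3), ("g2", 0.9, 4)]
print(groupwise_preference_accuracy(toy))  # 1.0 on this toy input
```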

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces VEFX-Dataset, a human-annotated collection of 5,049 video editing examples spanning 9 major categories and 32 subcategories, with each example labeled on three decoupled dimensions (Instruction Following, Rendering Quality, and Edit Exclusivity). It proposes VEFX-Reward, a model that jointly processes source video, editing instruction, and edited output to predict per-dimension quality scores via ordinal regression, and releases VEFX-Bench, a standardized set of 300 video-prompt pairs. Experiments claim that VEFX-Reward exhibits stronger correlation with human judgments than generic VLM judges and prior reward models on IQA/VQA metrics and group-wise preference evaluation, and the authors use the model to benchmark commercial and open-source editing systems.

Significance. If the human annotations prove reliable and the reported alignment gains hold under scrutiny, the work supplies a valuable large-scale resource and specialized evaluator for instruction-guided video editing, an area currently hampered by reliance on generic VLMs or costly manual review. The public release of the dataset, reward model, and benchmark would enable reproducible comparisons and targeted improvements in editing locality, instruction adherence, and rendering quality.

major comments (3)
  1. [VEFX-Dataset description] Description of VEFX-Dataset: No inter-annotator agreement statistics (Fleiss' kappa, Krippendorff's alpha, or pairwise correlations), annotator count, qualification criteria, or bias analysis are reported for the 5,049 labels across the three dimensions. This directly undermines the central claim that VEFX-Reward aligns more strongly with 'human judgments,' as the ground truth reliability remains unverified.
  2. [Experiments] Experiments section: The abstract and results claim superior performance of VEFX-Reward over baselines on IQA/VQA metrics and group-wise preference, yet supply no details on training procedure, data splits, hyperparameter choices, or ablation studies. Without these, the superiority result cannot be assessed for robustness or overfitting to the annotation set.
  3. [Introduction and VEFX-Dataset] Introduction and VEFX-Dataset: The three dimensions are asserted to be 'decoupled' and jointly sufficient for editing quality, but no evidence (e.g., correlation matrices between dimensions or coverage analysis against editing failure modes) is provided to support independence or completeness. If dimensions overlap or miss key aspects, both the reward model training and the benchmark conclusions become questionable.
minor comments (1)
  1. [Abstract] The abstract states that VEFX-Reward 'jointly processes the source video, the editing instruction, and the edited video,' but the precise architecture (e.g., fusion mechanism, backbone choice) is not summarized, hindering quick assessment of novelty relative to prior multimodal reward models.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to improve the manuscript.

read point-by-point responses
  1. Referee: Description of VEFX-Dataset: No inter-annotator agreement statistics (Fleiss' kappa, Krippendorff's alpha, or pairwise correlations), annotator count, qualification criteria, or bias analysis are reported for the 5,049 labels across the three dimensions. This directly undermines the central claim that VEFX-Reward aligns more strongly with 'human judgments,' as the ground truth reliability remains unverified.

    Authors: We agree that reporting inter-annotator agreement is important for validating annotation reliability. The original manuscript omitted these details for brevity. We have computed Fleiss' kappa and Krippendorff's alpha across the three dimensions (showing substantial agreement), and will add these statistics, the annotator count (five qualified editors), qualification criteria, and bias analysis to the revised VEFX-Dataset section to better support the human judgment alignment claims. revision: yes

  2. Referee: Experiments section: The abstract and results claim superior performance of VEFX-Reward over baselines on IQA/VQA metrics and group-wise preference, yet supply no details on training procedure, data splits, hyperparameter choices, or ablation studies. Without these, the superiority result cannot be assessed for robustness or overfitting to the annotation set.

    Authors: We acknowledge the need for full experimental transparency. The revised manuscript will include a new subsection detailing the VEFX-Reward training procedure, data splits (80/10/10), hyperparameter choices, optimization settings, and ablation studies on input modalities and loss terms. This will allow assessment of robustness and reduce concerns about overfitting. revision: yes

  3. Referee: Introduction and VEFX-Dataset: The three dimensions are asserted to be 'decoupled' and jointly sufficient for editing quality, but no evidence (e.g., correlation matrices between dimensions or coverage analysis against editing failure modes) is provided to support independence or completeness. If dimensions overlap or miss key aspects, both the reward model training and the benchmark conclusions become questionable.

    Authors: The dimensions were chosen to target distinct failure modes based on video editing literature. To provide empirical support, the revised VEFX-Dataset section will include a correlation matrix demonstrating low inter-dimension correlations and a coverage analysis against common editing failure modes. These additions will substantiate the design choices and their sufficiency for the reward model and benchmark (a minimal sketch of the agreement and correlation analyses follows these responses). revision: yes
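
A minimal sketch of the two analyses promised in responses 1 and 3: inter-annotator agreement (Fleiss' kappa on per-dimension ratings) and an inter-dimension Spearman correlation matrix. The rater count, the 1-5 scale, and the toy data are assumptions for illustration only, not the paper's annotation setup.

```python
# Hedged sketch of the rebuttal's promised analyses: inter-annotator agreement
# (Fleiss' kappa) and inter-dimension correlation. Rater counts, scale, and
# data below are illustrative assumptions.
import numpy as np
from scipy.stats import spearmanr
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)

# (n_examples, n_raters) integer ratings on a 1..5 scale for one dimension
ratings = rng.integers(1, 6, size=(50, 5))
table, _ = aggregate_raters(ratings)  # per-example counts over the 5 categories
print("Fleiss' kappa:", fleiss_kappa(table))

# (n_examples, 3) per-example mean scores for the three quality dimensions
scores = rng.uniform(1, 5, size=(50, 3))
rho, _ = spearmanr(scores)            # 3x3 Spearman correlation matrix (columns = dimensions)
print("inter-dimension Spearman matrix:\n", np.round(rho, 2))
```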

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external human annotations as independent ground truth

full rationale

The paper collects a new human-annotated dataset (VEFX-Dataset) with 5,049 examples labeled on three dimensions, trains VEFX-Reward on it via ordinal regression, and evaluates alignment against held-out human judgments plus comparisons to external VLM judges and prior reward models. No equations, fitted parameters, or self-citations reduce the reported correlations or benchmark results to quantities defined by the same inputs. The central claims are empirical performance numbers on separate test splits and external baselines, with no self-definitional loops, renamed predictions, or load-bearing self-citations. This is the standard non-circular structure for a benchmark-plus-reward-model paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claims rest on the assumption that human annotations constitute reliable ground truth and that the chosen three dimensions plus ordinal regression are appropriate for modeling editing quality.

axioms (2)
  • domain assumption Human annotations on the three dimensions provide reliable and unbiased ground truth for video editing quality
    The dataset labels are used both to train VEFX-Reward and to validate its alignment with humans.
  • ad hoc to paper The three dimensions are independent and jointly sufficient to evaluate editing quality
    The paper introduces and relies on this decoupled labeling scheme without external validation of independence.

pith-pipeline@v0.9.0 · 5652 in / 1485 out tokens · 54243 ms · 2026-05-10T08:20:17.513667+00:00 · methodology

discussion (0)

