Recognition: unknown
VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects
Pith reviewed 2026-05-10 08:20 UTC · model grok-4.3
The pith
VEFX-Bench releases a large human-labeled video editing dataset, a multi-dimensional reward model, and a standardized benchmark that better matches human judgments than generic evaluators.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Experiments show that VEFX-Reward aligns more strongly with human judgments than generic VLM judges and prior reward models on both standard IQA/VQA metrics and group-wise preference evaluation.
Load-bearing premise
The three decoupled dimensions (Instruction Following, Rendering Quality, Edit Exclusivity) are assumed to comprehensively and independently capture editing quality, and the collected human annotations are treated as reliable ground truth without detailed discussion of inter-annotator agreement or potential biases.
read the original abstract
As AI-assisted video creation becomes increasingly practical, instruction-guided video editing has become essential for refining generated or captured footage to meet professional requirements. Yet the field still lacks both a large-scale human-annotated dataset with complete editing examples and a standardized evaluator for comparing editing systems. Existing resources are limited by small scale, missing edited outputs, or the absence of human quality labels, while current evaluation often relies on expensive manual inspection or generic vision-language model judges that are not specialized for editing quality. We introduce VEFX-Dataset, a human-annotated dataset containing 5,049 video editing examples across 9 major editing categories and 32 subcategories, each labeled along three decoupled dimensions: Instruction Following, Rendering Quality, and Edit Exclusivity. Building on VEFX-Dataset, we propose VEFX-Reward, a reward model designed specifically for video editing quality assessment. VEFX-Reward jointly processes the source video, the editing instruction, and the edited video, and predicts per-dimension quality scores via ordinal regression. We further release VEFX-Bench, a benchmark of 300 curated video-prompt pairs for standardized comparison of editing systems. Experiments show that VEFX-Reward aligns more strongly with human judgments than generic VLM judges and prior reward models on both standard IQA/VQA metrics and group-wise preference evaluation. Using VEFX-Reward as an evaluator, we benchmark representative commercial and open-source video editing systems, revealing a persistent gap between visual plausibility, instruction following, and edit locality in current models. Our project page is https://xiangbogaobarry.github.io/VEFX-Bench/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VEFX-Dataset, a human-annotated collection of 5,049 video editing examples spanning 9 major categories and 32 subcategories, with each example labeled on three decoupled dimensions (Instruction Following, Rendering Quality, and Edit Exclusivity) via ordinal regression. It proposes VEFX-Reward, a model that jointly processes source video, editing instruction, and edited output to predict per-dimension quality scores, and releases VEFX-Bench, a standardized set of 300 video-prompt pairs. Experiments claim that VEFX-Reward exhibits stronger correlation with human judgments than generic VLM judges and prior reward models on IQA/VQA metrics and group-wise preference evaluation, and use the model to benchmark commercial and open-source editing systems.
Significance. If the human annotations prove reliable and the reported alignment gains hold under scrutiny, the work supplies a valuable large-scale resource and specialized evaluator for instruction-guided video editing, an area currently hampered by reliance on generic VLMs or costly manual review. The public release of the dataset, reward model, and benchmark would enable reproducible comparisons and targeted improvements in editing locality, instruction adherence, and rendering quality.
major comments (3)
- [VEFX-Dataset description] Description of VEFX-Dataset: No inter-annotator agreement statistics (Fleiss' kappa, Krippendorff's alpha, or pairwise correlations), annotator count, qualification criteria, or bias analysis are reported for the 5,049 labels across the three dimensions. This directly undermines the central claim that VEFX-Reward aligns more strongly with 'human judgments,' as the ground truth reliability remains unverified.
- [Experiments] Experiments section: The abstract and results claim superior performance of VEFX-Reward over baselines on IQA/VQA metrics and group-wise preference, yet supply no details on training procedure, data splits, hyperparameter choices, or ablation studies. Without these, the superiority result cannot be assessed for robustness or overfitting to the annotation set.
- [Introduction and VEFX-Dataset] Introduction and VEFX-Dataset: The three dimensions are asserted to be 'decoupled' and jointly sufficient for editing quality, but no evidence (e.g., correlation matrices between dimensions or coverage analysis against editing failure modes) is provided to support independence or completeness. If dimensions overlap or miss key aspects, both the reward model training and the benchmark conclusions become questionable.
minor comments (1)
- [Abstract] The abstract states that VEFX-Reward 'jointly processes the source video, the editing instruction, and the edited video,' but the precise architecture (e.g., fusion mechanism, backbone choice) is not summarized, hindering quick assessment of novelty relative to prior multimodal reward models.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to improve the manuscript.
read point-by-point responses
-
Referee: Description of VEFX-Dataset: No inter-annotator agreement statistics (Fleiss' kappa, Krippendorff's alpha, or pairwise correlations), annotator count, qualification criteria, or bias analysis are reported for the 5,049 labels across the three dimensions. This directly undermines the central claim that VEFX-Reward aligns more strongly with 'human judgments,' as the ground truth reliability remains unverified.
Authors: We agree that reporting inter-annotator agreement is important for validating annotation reliability. The original manuscript omitted these details for brevity. We have computed Fleiss' kappa and Krippendorff's alpha across the three dimensions (showing substantial agreement), and will add these statistics, the annotator count (five qualified editors), qualification criteria, and bias analysis to the revised VEFX-Dataset section to better support the human judgment alignment claims. revision: yes
-
Referee: Experiments section: The abstract and results claim superior performance of VEFX-Reward over baselines on IQA/VQA metrics and group-wise preference, yet supply no details on training procedure, data splits, hyperparameter choices, or ablation studies. Without these, the superiority result cannot be assessed for robustness or overfitting to the annotation set.
Authors: We acknowledge the need for full experimental transparency. The revised manuscript will include a new subsection detailing the VEFX-Reward training procedure, data splits (80/10/10), hyperparameter choices, optimization settings, and ablation studies on input modalities and loss terms. This will allow assessment of robustness and reduce concerns about overfitting. revision: yes
-
Referee: Introduction and VEFX-Dataset: The three dimensions are asserted to be 'decoupled' and jointly sufficient for editing quality, but no evidence (e.g., correlation matrices between dimensions or coverage analysis against editing failure modes) is provided to support independence or completeness. If dimensions overlap or miss key aspects, both the reward model training and the benchmark conclusions become questionable.
Authors: The dimensions were chosen to target distinct failure modes based on video editing literature. To provide empirical support, the revised VEFX-Dataset section will include a correlation matrix demonstrating low inter-dimension correlations and a coverage analysis against common editing failure modes. These additions will substantiate the design choices and their sufficiency for the reward model and benchmark. revision: yes
Circularity Check
No significant circularity; derivation relies on external human annotations as independent ground truth
full rationale
The paper collects a new human-annotated dataset (VEFX-Dataset) with 5,049 examples labeled on three dimensions, trains VEFX-Reward on it via ordinal regression, and evaluates alignment against held-out human judgments plus comparisons to external VLM judges and prior reward models. No equations, fitted parameters, or self-citations reduce the reported correlations or benchmark results to quantities defined by the same inputs. The central claims are empirical performance numbers on separate test splits and external baselines, with no self-definitional loops, renamed predictions, or load-bearing self-citations. This is the standard non-circular structure for a benchmark-plus-reward-model paper.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Human annotations on the three dimensions provide reliable and unbiased ground truth for video editing quality
- ad hoc to paper The three dimensions are independent and jointly sufficient to evaluate editing quality
Reference graph
Works this paper leans on
-
[1]
Wan: Open and Advanced Large-Scale Video Generative Models
T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yanget al., “Wan: Open and advanced large-scale video generative models,”arXiv preprint arXiv:2503.20314, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Goku: Flow based video generative foundation models,
S. Chen, C. Ge, Y. Zhang, Y. Zhang, F. Zhu, H. Yang, H. Hao, H. Wu, Z. Lai, Y. Huet al., “Goku: Flow based video generative foundation models,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 23516–23527
2025
-
[3]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhanget al., “Hunyuanvideo: A systematic framework for large video generative models,”arXiv preprint arXiv:2412.03603, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
Movie Gen: A Cast of Media Foundation Models
A. Polyak, A. Zohar, A. Brown, A. Tjandra, A. Sinha, A. Lee, A. Vyas, B. Shi, C.-Y. Ma, C.-Y. Chuanget al., “Movie gen: A cast of media foundation models,”arXiv preprint arXiv:2410.13720, 2024
work page internal anchor Pith review arXiv 2024
-
[5]
Sora: Creating video from text,
OpenAI, “Sora: Creating video from text,” 2024
2024
-
[6]
arXiv preprint arXiv:2406.13527 (2024)
R. Li, P. Pan, B. Yang, D. Xu, S. Zhou, X. Zhang, Z. Li, A. Kadambi, Z. Wang, Z. Tuet al., “4k4dgen: Panoramic 4d generation at 4k resolution,”arXiv preprint arXiv:2406.13527, 2024
-
[7]
Consid-gen: View-consistent and identity-preserving image-to-video generation,
M. Wu, A. Mishra, S. Dey, S. Xing, N. Ravipati, H. Wu, B. Li, and Z. Tu, “Consid-gen: View-consistent and identity-preserving image-to-video generation,”arXiv preprint arXiv:2602.10113, 2026
-
[8]
Veo3 technical report,
DeepMind, “Veo3 technical report,” DeepMind, Technical Report, 2025, accessed: 2026-02-18. [Online]. Available: https://storage.googleapis.com/deepmind-media/veo/Veo-3-Tech-Report.pdf
2025
-
[9]
Kling AI Omni / VIDEO O1 creative interface,
Kling AI, “Kling AI Omni / VIDEO O1 creative interface,” 2025
2025
-
[10]
Grok Imagine — ai image & video generation by xai,
xAI, “Grok Imagine — ai image & video generation by xai,” 2026
2026
-
[11]
Vace: All-in-one video creation and editing.arXiv preprint arXiv:2503.07598, 2025
Z. Jiang, Z. Han, C. Mao, J. Zhang, Y. Pan, and Y. Liu, “Vace: All-in-one video creation and editing,”arXiv preprint arXiv:2503.07598, 2025
-
[12]
C. Wei, Q. Liu, Z. Ye, Q. Wang, X. Wang, P. Wan, K. Gai, and W. Chen, “Univideo: Unified understanding, generation, and editing for videos,”arXiv preprint arXiv:2510.08377, 2025
-
[13]
Editboard: Towards a comprehensive evaluation benchmark for text-based video editing models,
Y. Chen, P. Chen, X. Zhang, Y. Huang, and Q. Xie, “Editboard: Towards a comprehensive evaluation benchmark for text-based video editing models,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 15, 2025, pp. 15975–15983
2025
-
[14]
M. Li, C. Xie, Y. Wu, L. Zhang, and M. Wang, “Five: A fine-grained video editing benchmark for evaluating emerging diffusion and rectified flow models,”arXiv preprint arXiv:2503.13684, 2025
-
[15]
Y. Chen, J. Zhang, T. Hu, Y. Zeng, Z. Xue, Q. He, C. Wang, Y. Liu, X. Hu, and S. Yan, “Ivebench: Modern benchmark suite for instruction-guided video editing assessment,”arXiv preprint arXiv:2510.11647, 2025
-
[16]
H. He, J. Wang, J. Zhang, Z. Xue, X. Bu, Q. Yang, S. Wen, and L. Xie, “Openve-3m: A large-scale high-quality dataset for instruction-guided video editing,”arXiv preprint arXiv:2512.07826, 2025
-
[17]
Ve-bench: Subjective-aligned benchmark suite for text-driven video editing quality assessment,
S. Sun, X. Liang, S. Fan, W. Gao, and W. Gao, “Ve-bench: Subjective-aligned benchmark suite for text-driven video editing quality assessment,”arXiv preprint arXiv:2408.11481, 2024
-
[18]
K. Wu, S. Jiang, M. Ku, P. Nie, M. Liu, and W. Chen, “Editreward: A human-aligned reward model for instruction-guided image editing,”arXiv preprint arXiv:2509.26346, 2025
-
[19]
Improving Video Generation with Human Feedback
J. Liu, G. Liu, J. Liang, Z. Yuan, X. Liu, M. Zheng, X. Wu, Q. Wang, M. Xia, X. Wanget al., “Improving video generation with human feedback,”arXiv preprint arXiv:2501.13918, 2025
work page internal anchor Pith review arXiv 2025
-
[20]
Physics-Aware Video Instance Removal Benchmark
Z. Li, X. Chen, L. Jiang, D. Hou, F. Lin, K. Yamada, X. Gao, and Z. Tu, “Physics-aware video instance removal benchmark,”arXiv preprint arXiv:2604.05898, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[21]
V oid: Video object and interaction deletion.arXiv preprint arXiv:2604.02296, 2026
S. Motamed, W. Harvey, B. Klein, L. Van Gool, Z. Yuan, and T.-Y. Cheng, “Void: Video object and interaction deletion,”arXiv preprint arXiv:2604.02296, 2026
-
[22]
arXiv preprint arXiv:2511.20640 (2025)
R. Burgert, C. Herrmann, F. Cole, M. S. Ryoo, N. Wadhwa, A. Voynov, and N. Ruiz, “Motionv2v: Editing motion in a video,”arXiv preprint arXiv:2511.20640, 2025
-
[23]
Pisco: Precise video instance insertion with sparse control.arXiv preprint arXiv:2602.08277, 2026
X. Gao, R. Li, X. Chen, Y. Wu, S. Feng, Q. Yin, and Z. Tu, “Pisco: Precise video instance insertion with sparse control,”arXiv preprint arXiv:2602.08277, 2026
-
[24]
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
Y. Guo, C. Yang, A. Rao, Z. Liang, Y. Wang, Y. Qiao, M. Agrawala, D. Lin, and B. Dai, “Animatediff: Animate your personalized text-to-image diffusion models without specific tuning,”arXiv preprint arXiv:2307.04725, 2023
work page internal anchor Pith review arXiv 2023
-
[25]
Diffusion models: A comprehensive survey of methods and applications,
L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, and M.-H. Yang, “Diffusion models: A comprehensive survey of methods and applications,”ACM computing surveys, vol. 56, no. 4, pp. 1–39, 2023. 14
2023
-
[26]
G. Cheng, X. Gao, L. Hu, S. Hu, M. Huang, C. Ji, J. Li, D. Meng, J. Qi, P. Qiaoet al., “Wan-animate: Unified character animation and replacement with holistic replication,”arXiv preprint arXiv:2509.14055, 2025
-
[27]
Luma ray2,
Luma AI, “Luma ray2,”https://lumalabs.ai/ray2, 2025
2025
-
[28]
Learning transferable visual models from natural language supervision,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” inICML, 2021
2021
-
[29]
The unreasonable effectiveness of deep features as a perceptual metric,
R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 586–595
2018
-
[30]
Vbench: Comprehensive benchmark suite for video generative models,
Z. Huang, Y. He, J. Yu, F. Zhang, C. Si, Y. Jiang, Y. Zhang, T. Wu, Q. Jin, N. Chanpaisitet al., “Vbench: Comprehensive benchmark suite for video generative models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 21807–21818
2024
-
[31]
Vbench++: Comprehensive and versatile benchmark suite for video generative models,
Z. Huang, F. Zhang, X. Xu, Y. He, J. Yu, Z. Dong, Q. Ma, N. Chanpaisit, C. Si, Y. Jianget al., “Vbench++: Comprehensive and versatile benchmark suite for video generative models,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025
2025
-
[32]
Imagereward: Learning and evaluating human preferences for text-to-image generation,
J. Xu, X. Liu, Y. Wu, Y. Tong, Q. Li, M. Ding, J. Tang, and Y. Dong, “Imagereward: Learning and evaluating human preferences for text-to-image generation,”Advances in Neural Information Processing Systems, vol. 36, pp. 15903–15935, 2023
2023
-
[33]
X. Wu, Y. Hao, K. Sun, Y. Chen, F. Zhu, R. Zhao, and H. Li, “Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis,”arXiv preprint arXiv:2306.09341, 2023
work page internal anchor Pith review arXiv 2023
-
[34]
Pick-a-pic: An open dataset of user preferences for text-to-image generation,
Y. Kirstain, A. Polyak, U. Singer, S. Matiana, J. Penna, and O. Levy, “Pick-a-pic: An open dataset of user preferences for text-to-image generation,”Advances in neural information processing systems, vol. 36, pp. 36652– 36663, 2023
2023
-
[35]
Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation,
X. He, D. Jiang, G. Zhang, M. Ku, A. Soni, S. Siu, H. Chen, A. Chandra, Z. Jiang, A. Arulrajet al., “Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation,” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 2105–2123
2024
-
[36]
Densedpo: Fine-grained temporal preference optimization for video diffusion models,
Z. Wu, A. Kag, I. Skorokhodov, W. Menapace, A. Mirzaei, I. Gilitschenski, S. Tulyakov, and A. Siarohin, “Densedpo: Fine-grained temporal preference optimization for video diffusion models,”arXiv preprint arXiv:2506.03517, 2025
-
[37]
Worldscore: A unified evaluation benchmark for world generation,
H. Duan, H.-X. Yu, S. Chen, L. Fei-Fei, and J. Wu, “Worldscore: A unified evaluation benchmark for world generation,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 27713– 27724
2025
-
[38]
X. Gao, M. Wu, S. Yang, J. Yu, P. Taghavi, F. Lin, and Z. Tu, “The pulse of motion: Measuring physical frame rate from visual dynamics,”arXiv preprint arXiv:2603.14375, 2026
-
[39]
Open-Sora: Democratizing Efficient Video Production for All
Z. Zhenget al., “Open-Sora: Democratizing efficient video production for all,”arXiv preprint arXiv:2412.20404, 2024
work page internal anchor Pith review arXiv 2024
-
[40]
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation
K. Nanet al., “OpenVid-1M: A large-scale high-quality dataset for text-to-video generation,”arXiv preprint arXiv:2407.02371, 2024
work page internal anchor Pith review arXiv 2024
-
[41]
Gemini 3 Flash — deepmind ai model,
Google DeepMind, “Gemini 3 Flash — deepmind ai model,” 2025
2025
-
[42]
A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lvet al., “Qwen3 technical report,”arXiv preprint arXiv:2505.09388, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[43]
Deep ordinal regression using optimal transport loss and unimodal output probabilities,
U. Shaham, I. Zaidman, and J. Svirsky, “Deep ordinal regression using optimal transport loss and unimodal output probabilities,”arXiv preprint arXiv:2011.07607, 2020
-
[44]
Gemini 3.1 Pro — deepmind ai model,
Google DeepMind, “Gemini 3.1 Pro — deepmind ai model,” 2025
2025
-
[45]
ByteDance Seed: Models and research,
ByteDance Seed Team, “ByteDance Seed: Models and research,” https://seed.bytedance.com/, 2025, accessed: 2026-02-27
2025
-
[46]
Kling video 3.0 model user guide,
Kling AI, “Kling video 3.0 model user guide,” https://kling.ai/quickstart/klingai-video-3-model-user-guide, Feb. 2026, accessed: 2026-04-16
2026
-
[47]
Kling video o1 user guide,
——, “Kling video o1 user guide,” https://kling.ai/quickstart/klingai-video-o1-user-guide, Dec. 2025, accessed: 2026-04-16
2025
-
[48]
Introducing runway gen-4.5: A new frontier for video generation,
Runway, “Introducing runway gen-4.5: A new frontier for video generation,” https://runwayml.com/research/ introducing-runway-gen-4.5, Dec. 2025, accessed: 2026-04-16
2025
-
[49]
Seedance 2.0 official launch,
ByteDance Seed Team, “Seedance 2.0 official launch,” https://seed.bytedance.com/en/blog/ 15 official-launch-of-seedance-2-0, Feb. 2026, accessed: 2026-04-16
2026
-
[50]
Grok imagine api,
xAI, “Grok imagine api,” https://x.ai/news/grok-imagine-api, Jan. 2026, accessed: 2026-04-16
2026
-
[51]
Luma ai launches ray3,
Luma AI, “Luma ai launches ray3,” https://lumalabs.ai/news/ray3, Sep. 2025, accessed: 2026-04-16
2025
-
[52]
Alibaba unveils wan2.6 series enabling everyone to star in videos,
Alibaba Cloud, “Alibaba unveils wan2.6 series enabling everyone to star in videos,” https://www.alibabacloud.com/ blog/alibaba-unveils-wan2-6-series-enabling-everyone-to-star-in-videos_602742, Dec. 2025, accessed: 2026-04-16
2025
-
[53]
Introducing ray2,
Luma AI, “Introducing ray2,” https://lumalabs.ai/changelog/introducing-ray2, Jan. 2025, accessed: 2026-04-16
2025
-
[54]
R. L. Keeney and H. Raiffa,Decisions with Multiple Objectives: Preferences and Value Tradeoffs. Cambridge University Press, 1993
1993
-
[55]
Multiplicative utility functions,
R. L. Keeney, “Multiplicative utility functions,”Operations Research, vol. 22, no. 1, pp. 22–34, 1974
1974
-
[56]
Inference and missing data,
D. B. Rubin, “Inference and missing data,”Biometrika, vol. 63, no. 3, pp. 581–592, 1976
1976
-
[57]
A generalization of sampling without replacement from a finite universe,
D. G. Horvitz and D. J. Thompson, “A generalization of sampling without replacement from a finite universe,” Journal of the American Statistical Association, vol. 47, no. 260, pp. 663–685, 1952
1952
-
[58]
Estimation of regression coefficients when some regressors are not always observed,
J. M. Robins, A. Rotnitzky, and L. P. Zhao, “Estimation of regression coefficients when some regressors are not always observed,”Journal of the American Statistical Association, vol. 89, no. 427, pp. 846–866, 1994
1994
-
[59]
Review of inverse probability weighting for dealing with missing data,
S. R. Seaman and I. R. White, “Review of inverse probability weighting for dealing with missing data,”Statistical Methods in Medical Research, vol. 22, no. 3, pp. 278–295, 2013
2013
-
[60]
SAM 2: Segment Anything in Images and Videos
N. Raviet al., “SAM 2: Segment anything in images and videos,”arXiv preprint arXiv:2408.00714, 2024
work page internal anchor Pith review arXiv 2024
-
[61]
Rose: Remove objects with side effects in videos.arXiv preprint arXiv:2508.18633,
C. Miao, Y. Feng, J. Zeng, Z. Gao, H. Liu, Y. Yan, D. Qi, X. Chen, B. Wang, and H. Zhao, “Rose: Remove objects with side effects in videos,”arXiv preprint arXiv:2508.18633, 2025
-
[62]
ViTPose: Simple vision transformer baselines for human pose estimation,
Y. Xu, J. Zhang, Q. Zhang, and D. Tao, “ViTPose: Simple vision transformer baselines for human pose estimation,” inNeurIPS, 2022
2022
-
[63]
Depth Anything 3: Recovering the Visual Space from Any Views
H. Lin, S. Chen, J. Liew, D. Y. Chen, Z. Li, G. Shi, J. Feng, and B. Kang, “Depth anything 3: Recovering the visual space from any views,”arXiv preprint arXiv:2511.10647, 2025
work page internal anchor Pith review arXiv 2025
-
[64]
ReCamMaster: Camera-controlled generative rendering from a single video,
J. Heet al., “ReCamMaster: Camera-controlled generative rendering from a single video,”arXiv preprint arXiv:2501.12007, 2025
-
[65]
Light-x: Generative 4d video rendering with camera and illumination control,
T. Liu, Z. Chen, Z. Huang, S. Xu, S. Zhang, C. Ye, B. Li, Z. Cao, W. Li, H. Zhaoet al., “Light-x: Generative 4d video rendering with camera and illumination control,”arXiv preprint arXiv:2512.05115, 2025. 16 Appendix A Additional Training Details forVEFX-Reward We provide the implementation details omitted from the main paper.VEFX-Reward is trained on the...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.