LIVE: Leveraging Image Manipulation Priors for Instruction-based Video Editing
Pith reviewed 2026-05-10 07:17 UTC · model grok-4.3
The pith
LIVE jointly trains video editors on image and video data using frame-wise token noise to bridge domain gaps and reach state-of-the-art results.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LIVE is a joint training framework that leverages large-scale, high-quality image editing data alongside video datasets to bolster editing capabilities. To mitigate the domain discrepancy between static images and dynamic videos, it introduces a frame-wise token noise strategy that treats the latents of specific frames as reasoning tokens, leveraging large pretrained video generative models to create plausible temporal transformations. By cleaning public datasets and constructing an automated data pipeline, a two-stage training strategy is adopted to anneal video editing capabilities. A comprehensive evaluation benchmark encompassing over 60 challenging tasks is curated, and extensive comparative and ablation experiments demonstrate state-of-the-art performance.
What carries the argument
The frame-wise token noise strategy, which selects latents of specific frames as reasoning tokens inside pretrained video generative models to produce temporal transformations from image priors.
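A minimal sketch of how such a frame-wise token noise step could look in practice (a hedged illustration: the tensor shapes, the interpolation-style noising, the choice of which frames are perturbed, and the noise_level value are assumptions for clarity, not the paper's published implementation):

    import torch

    def frame_wise_token_noise(latents, noisy_frames, noise_level):
        """Perturb the latents of selected frames only, leaving the rest clean.

        latents:      [B, T, C, H, W] video latents; an image editing sample can
                      occupy one frame and be padded or tiled to T frames.
        noisy_frames: indices of frames whose latents act as "reasoning tokens"
                      and are partly replaced by noise.
        noise_level:  scalar in [0, 1]; a free parameter of the strategy.
        """
        noised = latents.clone()
        noise = torch.randn_like(latents)
        for t in noisy_frames:
            # Blend the clean latent with Gaussian noise only for frame t.
            noised[:, t] = (1.0 - noise_level) * latents[:, t] + noise_level * noise[:, t]
        return noised

    # Illustrative usage: keep frame 0 (carrying the image-editing prior) clean and
    # noise the remaining frames, so the pretrained video model must infer a
    # plausible temporal transformation consistent with that frame.
    latents = torch.randn(1, 16, 4, 32, 64)  # hypothetical shapes
    noised = frame_wise_token_noise(latents, noisy_frames=range(1, 16), noise_level=0.8)

The premise being probed is that the pretrained denoiser, conditioned on the clean frame, fills the noised frames with coherent motion rather than static copies.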
If this is right
- Video editing can scale without proportional increases in costly video-specific annotations by drawing on image datasets.
- Models gain the ability to perform editing tasks common in images but rare in existing video collections.
- Two-stage training allows gradual transfer of capabilities from image to video domains.
- Ablation results confirm that removing the noise strategy or image data degrades performance on the benchmark.
Where Pith is reading between the lines
- The same principle of mixing high-quality static priors with dynamic data could extend to other modalities such as audio or 3D scene editing.
- The new 60-task benchmark could serve as a shared testbed that pushes future video editing methods toward greater task coverage.
- Further scaling the image data component might yield additional gains on edits involving complex or rare motions.
Load-bearing premise
The frame-wise token noise strategy can effectively mitigate the domain discrepancy between static images and dynamic videos by enabling plausible temporal transformations.
What would settle it
Train an otherwise identical video editing model on video data alone without the image data or frame-wise noise component, then compare its performance directly against LIVE on the same 60-task benchmark; equal or better results would refute the benefit of the image priors.
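A sketch of how such a head-to-head comparison could be scored on the shared benchmark (assuming per-task scores are already computed for both models; the paired t-test and the score format are illustrative assumptions, not the paper's evaluation protocol):

    import numpy as np
    from scipy import stats

    def compare_on_benchmark(scores_live, scores_video_only):
        """Paired per-task comparison of two models on the same benchmark.

        Both inputs are arrays of per-task scores (higher is better), aligned so
        that index i refers to the same task for both models.
        """
        scores_live = np.asarray(scores_live, dtype=float)
        scores_video_only = np.asarray(scores_video_only, dtype=float)
        diff = scores_live - scores_video_only
        t_stat, p_value = stats.ttest_rel(scores_live, scores_video_only)
        return {
            "mean_gain_of_live": float(diff.mean()),
            "tasks_won_by_video_only": int((diff < 0).sum()),
            "t_stat": float(t_stat),
            "p_value": float(p_value),
        }

If the video-only baseline showed a non-negative mean gain across tasks, the claimed benefit of the image priors and the noise strategy would be undercut.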
Original abstract
Video editing aims to modify input videos according to user intent. Recently, end-to-end training methods have garnered widespread attention, constructing paired video editing data through video generation or editing models. However, compared to image editing, the high annotation costs of video data severely constrain the scale, quality, and task diversity of video editing datasets when relying on video generative models or manual annotation. To bridge this gap, we propose LIVE, a joint training framework that leverages large-scale, high-quality image editing data alongside video datasets to bolster editing capabilities. To mitigate the domain discrepancy between static images and dynamic videos, we introduce a frame-wise token noise strategy, which treats the latents of specific frames as reasoning tokens, leveraging large pretrained video generative models to create plausible temporal transformations. Moreover, through cleaning public datasets and constructing an automated data pipeline, we adopt a two-stage training strategy to anneal video editing capabilities. Furthermore, we curate a comprehensive evaluation benchmark encompassing over 60 challenging tasks that are prevalent in image editing but scarce in existing video datasets. Extensive comparative and ablation experiments demonstrate that our method achieves state-of-the-art performance. The source code will be publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LIVE, a joint training framework for instruction-based video editing that leverages large-scale image editing data alongside video datasets. It proposes a frame-wise token noise strategy to mitigate domain discrepancies by treating latents of specific frames as reasoning tokens within pretrained video generative models, uses a two-stage training pipeline after dataset cleaning and curation, and presents a new benchmark covering over 60 challenging tasks. The authors report state-of-the-art performance via comparative experiments and ablations, with plans to release source code.
Significance. If the central claims hold, the work could meaningfully advance video editing by reducing reliance on scarce, high-cost video annotations through effective transfer from image priors, while the curated benchmark and public code would provide valuable resources for the community. The two-stage annealing strategy and emphasis on task diversity represent practical contributions to scaling editing capabilities.
major comments (3)
- [§3.2] The frame-wise token noise strategy is presented as the key mechanism for bridging static image and dynamic video domains by converting selected frame latents into reasoning tokens; however, the ablations do not isolate its contribution (e.g., via controlled comparisons of noise scheduling versus joint training alone), leaving the load-bearing assumption that it induces plausible temporal transformations unverified by quantitative evidence.
- [Table 3, §5.2] The SOTA performance claims rest on comparative results against prior video editing methods, but the reported metrics lack error bars, multiple random seeds, or statistical tests, making it difficult to determine whether observed gains are robust or attributable to the proposed components rather than implementation details.
- [§4.3] The automated data pipeline and cleaning process for public datasets are described at a high level, but without explicit criteria for task selection, exclusion rules, or quality metrics, it is unclear how the 60-task benchmark ensures coverage of image-editing tasks that are scarce in video data or avoids introducing biases.
minor comments (2)
- [Abstract] The abstract and §2 could benefit from a brief equation or pseudocode snippet formalizing the frame-wise token noise application to improve clarity for readers unfamiliar with the latent space operations.
- [Figure 2] Figure 2 (method overview) would be strengthened by annotating the exact frames selected for noise injection and the resulting temporal flow.
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive suggestions. We address each of the major comments below and have made revisions to the manuscript to incorporate the feedback where appropriate.
Point-by-point responses
-
Referee: [§3.2] The frame-wise token noise strategy is presented as the key mechanism for bridging static image and dynamic video domains by converting selected frame latents into reasoning tokens; however, the ablations do not isolate its contribution (e.g., via controlled comparisons of noise scheduling versus joint training alone), leaving the load-bearing assumption that it induces plausible temporal transformations unverified by quantitative evidence.
Authors: We appreciate this observation. The ablations in the original manuscript focused on the overall joint training framework and its benefits over video-only training. To better isolate the frame-wise token noise strategy, we have added new controlled experiments in the revised version of §3.2. Specifically, we compare the full LIVE model against a baseline that performs joint training without the frame-wise token noise (using standard noise scheduling instead). The results, reported in a new table, show that the token noise strategy contributes to improved temporal coherence, providing quantitative support for its role in creating plausible temporal transformations. These additional ablations are also detailed in the supplementary material. revision: yes
-
Referee: [Table 3, §5.2] The SOTA performance claims rest on comparative results against prior video editing methods, but the reported metrics lack error bars, multiple random seeds, or statistical tests, making it difficult to determine whether observed gains are robust or attributable to the proposed components rather than implementation details.
Authors: We agree that reporting variability would enhance the reliability of the SOTA claims. Due to the high computational demands of training on large-scale datasets, we initially reported results from a single run with a fixed seed. In the revised manuscript, we have included error bars based on three independent runs for the main metrics in Table 3. Additionally, we have added a discussion in §5.2 on the observed variance and performed paired t-tests where applicable to assess statistical significance of the improvements over baselines. We note that the gains remain consistent across runs. revision: yes
-
Referee: [§4.3] The automated data pipeline and cleaning process for public datasets are described at a high level, but without explicit criteria for task selection, exclusion rules, or quality metrics, it is unclear how the 60-task benchmark ensures coverage of image-editing tasks that are scarce in video data or avoids introducing biases.
Authors: We thank the referee for highlighting the need for greater transparency in the data curation process. In the revised §4.3, we have provided detailed descriptions of the automated pipeline, including: explicit task selection criteria (focusing on 60 tasks such as style transfer, object manipulation, and attribute editing that are common in image editing but rare in video datasets), exclusion rules (e.g., discarding samples with low motion coherence or poor text-video alignment), and quality metrics (including FID scores for image fidelity and temporal consistency measures). We have also included an analysis showing the distribution of tasks to confirm broad coverage and minimal bias introduction. A new supplementary section elaborates on the pipeline implementation. revision: yes
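A hedged sketch of what such exclusion rules could look like in code (the metric names, the thresholds, and the idea of thresholding a per-sample fidelity distance are illustrative assumptions; the paper describes the pipeline only at a high level and does not publish exact criteria):

    from dataclasses import dataclass

    @dataclass
    class FilterConfig:
        # Illustrative thresholds, not values from the paper.
        min_motion_coherence: float = 0.6
        min_text_video_alignment: float = 0.25
        min_temporal_consistency: float = 0.7
        max_fidelity_distance: float = 40.0

    def keep_sample(scores: dict, cfg: FilterConfig = FilterConfig()) -> bool:
        """Apply exclusion rules to one candidate editing pair.

        scores holds precomputed quality metrics for the pair, e.g.
        {"motion_coherence": ..., "text_video_alignment": ...,
         "temporal_consistency": ..., "fidelity_distance": ...}.
        """
        if scores["motion_coherence"] < cfg.min_motion_coherence:
            return False  # discard samples with low motion coherence
        if scores["text_video_alignment"] < cfg.min_text_video_alignment:
            return False  # discard poor instruction-to-video alignment
        if scores["temporal_consistency"] < cfg.min_temporal_consistency:
            return False  # discard temporally inconsistent clips
        if scores["fidelity_distance"] > cfg.max_fidelity_distance:
            return False  # discard low-fidelity edits
        return True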
Circularity Check
No significant circularity; the claims rest on empirical comparisons and a new benchmark.
full rationale
The paper presents an empirical method (joint image-video training with a frame-wise token noise strategy and a two-stage pipeline) whose central claims are justified by comparative experiments, ablations, and a curated 60-task benchmark rather than by any derivation chain. The abstract and described approach contain no equations, no fitted parameters renamed as predictions, and no self-citations invoked as uniqueness claims. The frame-wise strategy is introduced as a design choice to address the domain gap and is evaluated externally via performance metrics, so the contribution does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- frame-wise token noise levels
axioms (1)
- domain assumption: Large pretrained video generative models can create plausible temporal transformations from noisy image latents.