JAVEDIT: Joint Audio-Visual Instruction-Guided Video Editing with Agentic Data Curation
Pith reviewed 2026-06-28 11:02 UTC · model grok-4.3
The pith
JAVEdit introduces a 100k triplet dataset and benchmark enabling a model for instruction-guided joint audio-visual video editing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
JAVEdit-100k is the first large-scale high-quality dataset for instruction-guided joint audio-visual editing, built from human-centric videos with approximately 100K triplets spanning five categories and created through four generation pipelines together with an agent-in-the-loop quality control mechanism. JAVEditBench supplies curated source videos and human-aligned instructions for evaluation across all categories. The JAVEdit model serves as a baseline that outperforms all comparison methods on five of six evaluation metrics.
What carries the argument
The agent-in-the-loop quality control mechanism paired with four generation pipelines that produces the editing triplets for the JAVEdit-100k dataset.
If this is right
- Instruction inputs can now drive coordinated changes to both visual content and audio in the same video.
- Training resources exist for five distinct editing categories focused on human subjects and speech.
- Future models can be compared on a shared benchmark that includes human-aligned instructions.
- Performance gains on joint editing metrics become measurable rather than anecdotal.
Where Pith is reading between the lines
- The curation method could be adapted to generate training data for other instruction-driven multimodal tasks such as audio-only or text-to-video generation.
- If the triplets prove robust, the same pipelines might support iterative refinement loops where the agent flags and corrects its own outputs without full human review.
- Scaling the dataset beyond human-centric videos would test whether the editing approach generalizes to scenes without people.
Load-bearing premise
The four generation pipelines combined with agent-in-the-loop quality control produce editing triplets that are both high-quality and free of systematic biases or artifacts that would undermine downstream model training.
What would settle it
Retraining the JAVEdit model on the same architecture but with a dataset whose triplets were independently verified by humans to contain frequent artifacts or biases, followed by re-evaluation on JAVEditBench showing no outperformance on the five metrics.
Figures
read the original abstract
While instruction-based video editing has seen significant progress, joint audio-visual editing remains constrained by the absence of dedicated datasets and benchmarks. To bridge this gap, we present JAVEdit-100k, the first large-scale, high-quality dataset tailored for instruction-guided joint audio-visual editing. Focusing on human-centric videos, JAVEdit-100k comprises approximately 100K editing triplets spanning five distinct categories, including subject editing and speech editing. This dataset is rigorously constructed via four meticulously designed generation pipelines, seamlessly paired with an agent-in-the-loop quality control mechanism. Furthermore, to address the lack of standardized evaluation within the field, we introduce JAVEditBench, a comprehensive benchmark featuring curated source videos and human-aligned instructions across all editing categories. Finally, we propose JAVEdit, a pioneering baseline model for instruction-guided joint audio-visual editing. Experiments show that \model\ outperforms all baselines on five of six evaluation metrics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces JAVEdit-100k, the first large-scale dataset of ~100K instruction-guided joint audio-visual editing triplets for human-centric videos, constructed from four generation pipelines with agent-in-the-loop quality control. It also presents JAVEditBench, a new evaluation benchmark with curated videos and human-aligned instructions across five editing categories, and proposes JAVEdit as a baseline model. Experiments are reported to show that JAVEdit outperforms all baselines on five of six evaluation metrics.
Significance. If the dataset construction and performance claims hold under independent validation, the work would provide the first dedicated resource for joint audio-visual instruction editing, addressing a clear gap in datasets and benchmarks. The creation of a new large-scale triplet dataset and benchmark is a concrete contribution that could support future model development in this area.
major comments (2)
- [Abstract] Abstract: The central claim that JAVEdit 'outperforms all baselines on five of six evaluation metrics' supplies no definitions of the metrics, no implementation details for the baselines, no statistical significance tests, and no quantitative validation of the generated JAVEdit-100k triplets (e.g., failure rates or human scores on audio-visual alignment). This information is load-bearing for assessing whether the reported outperformance reflects genuine editing capability or artifacts from the four pipelines.
- [Abstract / Dataset section] Dataset construction (implied in the abstract description of four pipelines + agent QC): No independent quantitative checks are described for pipeline-specific artifacts (e.g., speech editing consistency, subject identity preservation) or for whether the agent-in-the-loop QC introduces correlated errors that the model could exploit during training. Without such validation, the claim that the triplets are 'rigorously constructed' and 'high-quality' cannot be evaluated, directly affecting the generalizability of the JAVEditBench results.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment point by point below, providing clarifications from the manuscript and indicating planned revisions where appropriate.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that JAVEdit 'outperforms all baselines on five of six evaluation metrics' supplies no definitions of the metrics, no implementation details for the baselines, no statistical significance tests, and no quantitative validation of the generated JAVEdit-100k triplets (e.g., failure rates or human scores on audio-visual alignment). This information is load-bearing for assessing whether the reported outperformance reflects genuine editing capability or artifacts from the four pipelines.
Authors: The abstract is kept concise per conference guidelines, but the requested details appear in the main text: the six metrics (including CLIP-T, AV-Align, and perceptual scores) are formally defined in Section 4.1; baseline implementations and adaptations for joint audio-visual editing are specified in Section 4.3; paired t-test results with p-values confirming significance on the five metrics are reported in Table 3 and the supplementary material. Quantitative validation of JAVEdit-100k (human scores on audio-visual alignment averaging 4.3/5 and failure rates below 4%) is presented in Section 3.5 with inter-annotator agreement. To improve accessibility, we will revise the abstract to briefly reference these sections and add a sentence on validation. We will also move the significance tests into the main experimental section. revision: partial
-
Referee: [Abstract / Dataset section] Dataset construction (implied in the abstract description of four pipelines + agent QC): No independent quantitative checks are described for pipeline-specific artifacts (e.g., speech editing consistency, subject identity preservation) or for whether the agent-in-the-loop QC introduces correlated errors that the model could exploit during training. Without such validation, the claim that the triplets are 'rigorously constructed' and 'high-quality' cannot be evaluated, directly affecting the generalizability of the JAVEditBench results.
Authors: We acknowledge the value of pipeline-specific artifact analysis. Section 3.2 details the four generation pipelines and Section 3.3 describes the agent-in-the-loop QC process, with aggregate human quality scores (audio-visual alignment, instruction fidelity) reported in Table 1 and Section 3.5. However, we did not break these down per pipeline for speech consistency or identity preservation, nor explicitly test for correlated errors exploitable by the model. We will add a new table and analysis subsection in the revision providing these per-pipeline human evaluations and an error correlation study. This will be included in the main paper or as supplementary material to support the quality claims. revision: yes
Circularity Check
No circularity: empirical dataset and model introduction with no self-referential derivations or fitted predictions
full rationale
The paper introduces JAVEdit-100k via four generation pipelines plus agent QC, JAVEditBench, and the JAVEdit model, then reports empirical outperformance on five of six metrics. No equations, fitted parameters, or predictive steps appear in the abstract or described content. Claims do not reduce by construction to prior inputs, self-citations, or ansatzes; the work creates new artifacts rather than deriving results from previously fitted quantities. This matches the default non-circular case for dataset/model papers.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Insvie-1m: Effective instruction-based video editing with elaborate dataset construction
Yuhui Wu, Liyi Chen, Ruibin Li, Shihao Wang, Chenxi Xie, and Lei Zhang. Insvie-1m: Effective instruction-based video editing with elaborate dataset construction. InICCV, pages 16692–16701, 2025
2025
-
[2]
Haoyang He, Jie Wang, Jiangning Zhang, Zhucun Xue, Xingyuan Bu, Qiangpeng Yang, Shilei Wen, and Lei Xie. Openve-3m: A large-scale high-quality dataset for instruction-guided video editing.arXiv preprint arXiv:2512.07826, 2025. URLhttps://arxiv.org/abs/2512.07826
arXiv 2025
-
[3]
Scaling instruction-based video editing with a high-quality synthetic dataset
Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, et al. Scaling instruction-based video editing with a high-quality synthetic dataset. arXiv preprint arXiv:2510.15742, 2025
arXiv 2025
-
[4]
Zero-shot audio-visual editing via cross-modal delta denoising, 2025
Yan-Bo Lin, Kevin Lin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Chung-Ching Lin, Xiaofei Wang, Gedas Bertasius, and Lijuan Wang. Zero-shot audio-visual editing via cross-modal delta denoising, 2025. URL https://arxiv.org/abs/2503.20782
arXiv 2025
-
[5]
Av-edit: Multimodal generative sound effect editing via audio-visual semantic joint control
Xinyue Guo, Xiaoran Yang, Lipan Zhang, Jianxuan Yang, Zhao Wang, and Jian Luan. Av-edit: Multimodal generative sound effect editing via audio-visual semantic joint control. InAAAI, volume 40, pages 21504–21512, 2026
2026
-
[6]
Haojie Zheng, Shuchen Weng, Jingqi Liu, Siqi Yang, Boxin Shi, and Xinlong Wang. Audio-sync video instance editing with granularity-aware mask refiner.arXiv preprint arXiv:2512.10571, 2025
Pith/arXiv arXiv 2025
-
[7]
Openhumanvid: A large-scale high-quality dataset for enhancing human-centric video generation
Hui Li, Mingwang Xu, Yun Zhan, Shan Mu, Jiaye Li, Kaihui Cheng, Yuxuan Chen, Tan Chen, Mao Ye, Jingdong Wang, et al. Openhumanvid: A large-scale high-quality dataset for enhancing human-centric video generation. In CVPR, pages 7752–7762, 2025
2025
-
[8]
Vidgen-1m: A large-scale dataset for text-to-video generation.arXiv preprint arXiv:2408.02629, 2024
Zhiyu Tan, Xiaomeng Yang, Luozheng Qin, and Hao Li. Vidgen-1m: A large-scale dataset for text-to-video generation.arXiv preprint arXiv:2408.02629, 2024
arXiv 2024
-
[9]
Vggsound: A large-scale audio-visual dataset
Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. Vggsound: A large-scale audio-visual dataset. arXiv preprint arXiv:2004.14368, 2020. URLhttps://arxiv.org/abs/2004.14368. 11 JA VEdit
arXiv 2004
-
[11]
Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content
Qiuheng Wang, Yukai Shi, Jiarong Ou, Rui Chen, Ke Lin, Jiahao Wang, Boyuan Jiang, Haotian Yang, Mingwu Zheng, Xin Tao, et al. Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content. InCVPR, pages 8428–8437, 2025
2025
-
[12]
Qwen3-omni technical report.arXiv preprint arXiv:2509.17765, 2025
Jin Xu, Zhifang Guo, et al. Qwen3-omni technical report.arXiv preprint arXiv:2509.17765, 2025. URL https://arxiv.org/abs/2509.17765
Pith/arXiv arXiv 2025
-
[13]
Ju-Chiang Wang, Wei-Tsung Lu, and Jitong Chen. Mel-roformer for vocal separation and vocal melody transcrip- tion.arXiv preprint arXiv:2409.04702, 2024. URLhttps://arxiv.org/abs/2409.04702
arXiv 2024
-
[15]
URLhttps://arxiv.org/abs/2505.23625
-
[16]
Sam audio: Segment anything in audio.arXiv preprint arXiv:2512.18099, 2025
Bowen Shi, Andros Tjandra, John Hoffman, Helin Wang, Yi-Chiao Wu, Luya Gao, Julius Richter, Matt Le, Apoorv Vyas, Sanyuan Chen, Christoph Feichtenhofer, Piotr Dollár, Wei-Ning Hsu, and Ann Lee. Sam audio: Segment anything in audio.arXiv preprint arXiv:2512.18099, 2025. URL https://arxiv.org/abs/2512.18099
arXiv 2025
-
[17]
Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025
An Yang, Anfeng Li, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. URL https: //arxiv.org/abs/2505.09388
Pith/arXiv arXiv 2025
-
[18]
Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025
Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025
Pith/arXiv arXiv 2025
-
[19]
Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025
Team Wan, Ang Wang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025
Pith/arXiv arXiv 2025
-
[20]
Dreamvoice: Text-guided voice conversion.arXiv preprint arXiv:2406.16314, 2024
Jiarui Hai, Karan Thakkar, Helin Wang, Zengyi Qin, and Mounya Elhilali. Dreamvoice: Text-guided voice conversion.arXiv preprint arXiv:2406.16314, 2024. URLhttps://arxiv.org/abs/2406.16314
arXiv 2024
-
[21]
Xijie Huang, Chengming Xu, Donghao Luo, Xiaobin Hu, Peng Tang, Xu Peng, Jiangning Zhang, Chengjie Wang, and Yanwei Fu. Ffp-300k: Scaling first-frame propagation for generalizable video editing.arXiv preprint arXiv:2601.01720, 2026
arXiv 2026
-
[22]
Sizhe Shan, Qiulin Li, Yutao Cui, Miles Yang, Yuehai Wang, Qun Yang, Jin Zhou, and Zhao Zhong. Hunyuanvideo- foley: Multimodal diffusion with representation alignment for high-fidelity foley audio generation.arXiv preprint arXiv:2508.16930, 2025. URLhttps://arxiv.org/abs/2508.16930
arXiv 2025
-
[23]
Minimax- remover: Taming bad noise helps video object removal.arXiv preprint arXiv:2505.24873, 2025
Bojia Zi, Weixuan Peng, Xianbiao Qi, Jianan Wang, Shihao Zhao, Rong Xiao, and Kam-Fai Wong. Minimax- remover: Taming bad noise helps video object removal.arXiv preprint arXiv:2505.24873, 2025
arXiv 2025
-
[25]
URLhttps://arxiv.org/abs/2511.16719
-
[26]
Qwen3-tts technical report.arXiv preprint arXiv:2601.15621, 2026
Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, et al. Qwen3-tts technical report.arXiv preprint arXiv:2601.15621, 2026
Pith/arXiv arXiv 2026
-
[27]
Gheorghe Comanici, Eric Bieber, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. URL https: //arxiv.org/abs/2507.06261
Pith/arXiv arXiv 2025
-
[28]
Latentsync: Taming audio-conditioned latent diffusion models for lip sync with syncnet supervision
Chunyu Li, Chao Zhang, Weikai Xu, Jingyu Lin, Jinghui Xie, Weiguo Feng, Bingyue Peng, Cunjian Chen, and Weiwei Xing. Latentsync: Taming audio-conditioned latent diffusion models for lip sync with syncnet supervision. arXiv preprint arXiv:2412.09262, 2024. 12 JA VEdit
arXiv 2024
-
[29]
Kaito Baba, Wataru Nakata, Yuki Saito, and Hiroshi Saruwatari. The t05 system for the voicemos challenge 2024: Transfer learning from deep image classifier to naturalness mos prediction of high-quality synthetic speech.arXiv preprint arXiv:2409.09305, 2024. URLhttps://arxiv.org/abs/2409.09305
arXiv 2024
-
[30]
Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, and Mike Zheng Shou. Kiwi-edit: Versatile video editing via instruction and reference guidance.arXiv preprint arXiv:2603.02175, 2026
Pith/arXiv arXiv 2026
-
[31]
Peng Xia, Jianwen Chen, Xinyu Yang, Haoqin Tu, Jiaqi Liu, Kaiwen Xiong, Siwei Han, Shi Qiu, Haonian Ji, Yuyin Zhou, et al. Metaclaw: Just talk–an agent that meta-learns and evolves in the wild.arXiv preprint arXiv:2603.17187, 2026
arXiv 2026
-
[32]
Tianmeng Hu, Biao Luo, Chunhua Yang, and Tingwen Huang. Mo-mix: Multi-objective multi-agent cooperative decision-making with deep reinforcement learning.IEEE TPAMI, 45(10):12098–12112, October 2023. ISSN 1939-3539. doi: 10.1109/tpami.2023.3283537. URL http://dx.doi.org/10.1109/TPAMI.2023. 3283537
-
[33]
Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020
2020
-
[34]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023
2023
-
[35]
Dip: Taming diffusion models in pixel space.arXiv preprint arXiv:2511.18822, 2025
Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Xiaobin Hu, Hanzhen Zhao, Chengjie Wang, Jian Yang, and Ying Tai. Dip: Taming diffusion models in pixel space.arXiv preprint arXiv:2511.18822, 2025
arXiv 2025
-
[36]
Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022
Pith/arXiv arXiv 2022
-
[37]
Ragd: Regional-aware diffusion model for text-to-image generation
Zhennan Chen, Yajie Li, Haofan Wang, Zhibo Chen, Zhengkai Jiang, Jun Li, Qian Wang, Jian Yang, and Ying Tai. Ragd: Regional-aware diffusion model for text-to-image generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19331–19341, 2025
2025
-
[38]
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations.arXiv preprint arXiv:2011.13456, 2020
Pith/arXiv arXiv 2011
-
[39]
L2p: Unlocking latent potential for pixel generation.arXiv preprint arXiv:2605.12013, 2026
Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Jiawei Chen, Zhuoqi Zeng, Wei Zhang, Chengjie Wang, Jian Yang, and Ying Tai. L2p: Unlocking latent potential for pixel generation.arXiv preprint arXiv:2605.12013, 2026
Pith/arXiv arXiv 2026
-
[40]
Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.007...
Pith/arXiv arXiv 2024
-
[41]
Ivebench: Modern benchmark suite for instruction-guided video editing assessment
Yinan Chen, Jiangning Zhang, Teng Hu, Yuxiang Zeng, Zhucun Xue, Qingdong He, Chengjie Wang, Yong Liu, Xiaobin Hu, and Shuicheng Yan. Ivebench: Modern benchmark suite for instruction-guided video editing assessment. InThe F ourteenth International Conference on Learning Representations, 2026. 13 JA VEdit JA VEdit: Joint Audio-Visual Instruction-Guided Vide...
2026
-
[42]
=6 pairwise comparisons are constructed, resulting in 60×6=360 pairwise evaluation instances. 15 JA VEdit Annotators and Protocol.We recruit 5 annotators with professional backgrounds in video production or computer vision research. All annotators undergo a calibration session with 10 practice examples and detailed scoring guidelines before the formal eva...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.