pith. machine review for the scientific record.

arXiv: 2603.02175 · v4 · submitted 2026-03-02 · 💻 cs.CV · cs.AI

Recognition: unknown

Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, Mike Zheng Shou

Authors on Pith: no claims yet
classification: 💻 cs.CV · cs.AI
keywords: editing, reference, video, data, kiwi-edit, training, visual, architecture
Abstract

Instruction-based video editing has witnessed rapid progress, yet current methods often struggle with precise visual control, as natural language is inherently limited in describing complex visual nuances. Although reference-guided editing offers a robust solution, its potential is currently bottlenecked by the scarcity of high-quality paired training data. To bridge this gap, we introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets, leveraging image generative models to create synthesized reference scaffolds. Using this pipeline, we construct RefVIE, a large-scale dataset tailored for instruction-reference-following tasks, and establish RefVIE-Bench for comprehensive evaluation. Furthermore, we propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for reference semantic guidance. Our model achieves significant gains in instruction following and reference fidelity via a progressive multi-stage training curriculum. Extensive experiments demonstrate that our data and architecture establish a new state-of-the-art in controllable video editing. All datasets, models, and code are released at https://github.com/showlab/Kiwi-Edit.
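The abstract names two concrete ingredients: training quadruplets assembled from existing editing pairs plus a synthesized reference, and a module that fuses learnable queries with latent visual features of that reference. The PyTorch sketch below shows one plausible reading of the second ingredient: a small bank of learnable query tokens cross-attends over reference features, and the resulting tokens are concatenated with instruction embeddings as conditioning for the editing backbone. All names, dimensions, and the fusion layout here are assumptions for illustration; the abstract does not specify Kiwi-Edit's actual module.

```python
# Minimal sketch of reference-semantic guidance via learnable queries.
# Hypothetical module; shapes and fusion layout are assumptions, not the
# paper's published architecture.
import torch
import torch.nn as nn

class ReferenceQueryEncoder(nn.Module):
    """Learnable queries cross-attend over latent visual features of the
    reference image, yielding a fixed-length set of conditioning tokens."""
    def __init__(self, num_queries=32, dim=768, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, ref_latents):  # ref_latents: (B, N_patches, dim)
        b = ref_latents.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        attn_out, _ = self.cross_attn(q, ref_latents, ref_latents)
        x = self.norm(q + attn_out)
        return x + self.ffn(x)  # (B, num_queries, dim)

encoder = ReferenceQueryEncoder()
ref_latents = torch.randn(2, 256, 768)   # e.g. 16x16 patch features of the reference
text_tokens = torch.randn(2, 77, 768)    # e.g. CLIP-style instruction embeddings
# Conditioning = [instruction tokens ; reference query tokens], fed to the
# video backbone's cross-attention layers (backbone omitted here).
cond = torch.cat([text_tokens, encoder(ref_latents)], dim=1)
print(cond.shape)                        # torch.Size([2, 109, 768])
```

Under this reading, a training quadruplet would be (source clip, instruction, synthesized reference, edited clip), with ref_latents above obtained by encoding the third element.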

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    Sparkle supplies a large-scale dataset and benchmark for instruction-driven video background replacement, enabling models that generate more natural and temporally consistent new scenes than earlier approaches.

  2. InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    InsEdit adapts a video diffusion backbone for text-instruction video editing via Mutual Context Attention, achieving SOTA open-source results with O(100K) data while also supporting image editing.

  3. ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.

  4. Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

    cs.CV · 2026-05 · unverdicted · novelty 4.0

    Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top results on video generation and editing benchmarks with 4-step inference, running up to 95.9x faster than baselines.