Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset

arXiv preprint arXiv:2510.15742 · 2025

11 Pith papers cite this work. Polarity classification is still being indexed.

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

Aurora: Unified Video Editing with a Tool-Using Agent

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

Aurora introduces a VLM-based agent that converts raw user video edit requests into structured conditioning inputs for a unified diffusion transformer, improving performance on underspecified tasks via a new benchmark.

InstructAV2AV: Instruction-Guided Audio-Video Joint Editing

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

InstructAV2AV is an end-to-end instruction-guided audio-video joint editing model that adapts a pre-trained backbone with gated attention and two-stage training, outperforming prior methods on 11 metrics after building the InsAVE-80K dataset.

Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance

cs.CV · 2026-05-07 · unverdicted · novelty 7.0

Sparkle supplies a large-scale dataset and benchmark for instruction-driven video background replacement, enabling models that generate more natural and temporally consistent new scenes than earlier approaches.

LIVE: Leveraging Image Manipulation Priors for Instruction-based Video Editing

cs.CV · 2026-04-18 · unverdicted · novelty 6.0

LIVE achieves state-of-the-art instruction-based video editing by jointly training on image and video data with a frame-wise token noise strategy to bridge domain gaps and a new benchmark of over 60 tasks.

ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.

Bernini: Latent Semantic Planning for Video Diffusion

cs.CV · 2026-05-21 · unverdicted · novelty 5.0

Bernini is a framework that uses an MLLM planner to output semantic representations for a DiT renderer to generate or edit videos, reporting SOTA benchmark performance.

Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

cs.CV · 2026-05-04 · unverdicted · novelty 4.0

Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.

Advancing Open-source World Models

cs.CV · 2026-01-28 · unverdicted · novelty 4.0

LingBot-World is presented as an open-source world model that delivers high-fidelity simulation, minute-level contextual consistency, and real-time interactivity under one second latency.

LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

cs.CV · 2026-04-13 · unverdicted · novelty 3.0

This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challenges like instance permanence and consistent interaction.

Occlusion-Aware Physics-Semantic Keyframe Selection for Robust Video Editing

cs.CV · 2026-05-22

MiVE: Multiscale Vision-language features for reference-guided video Editing

cs.CV · 2026-05-14

citing papers explorer

Showing 11 of 11 citing papers.

Aurora: Unified Video Editing with a Tool-Using Agent · cs.CV · 2026-05-18 · unverdicted · none · ref 1
InstructAV2AV: Instruction-Guided Audio-Video Joint Editing · cs.CV · 2026-05-18 · unverdicted · none · ref 1
Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance · cs.CV · 2026-05-07 · unverdicted · none · ref 1
LIVE: Leveraging Image Manipulation Priors for Instruction-based Video Editing · cs.CV · 2026-04-18 · unverdicted · none · ref 2
ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks · cs.CV · 2026-04-09 · unverdicted · none · ref 3
Bernini: Latent Semantic Planning for Video Diffusion · cs.CV · 2026-05-21 · unverdicted · none · ref 3
Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE · cs.CV · 2026-05-04 · unverdicted · none · ref 35
Advancing Open-source World Models · cs.CV · 2026-01-28 · unverdicted · none · ref 3
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation · cs.CV · 2026-04-13 · unverdicted · none · ref 6
Occlusion-Aware Physics-Semantic Keyframe Selection for Robust Video Editing · cs.CV · 2026-05-22 · unreviewed · ref 120
MiVE: Multiscale Vision-language features for reference-guided video Editing · cs.CV · 2026-05-14 · unreviewed · ref 6
