Aurora: Unified Video Editing with a Tool-Using Agent

Hang Hua; Jiebo Luo; Wei Xiong; Yongsheng Yu; Zhenghong Zhou; Zhiyuan Xiao; Ziyun Zeng

arxiv: 2605.18748 · v1 · pith:2MMWD7RJnew · submitted 2026-05-18 · 💻 cs.CV


Yongsheng Yu, Ziyun Zeng, Zhiyuan Xiao, Zhenghong Zhou, Hang Hua, Wei Xiong, Jiebo Luo

Pith reviewed 2026-05-20 10:43 UTC · model grok-4.3

classification 💻 cs.CV

keywords video editing · vision-language model · agent · diffusion transformer · edit planning · reference selection · underspecification · AgentEdit-Bench


The pith

A trained tool-using vision-language agent converts raw user requests into structured plans for a unified video diffusion transformer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Aurora as a framework that adds a vision-language model agent to a video diffusion model for editing tasks. The agent uses tools to interpret incomplete instructions, create edit plans, and choose reference images so the diffusion model receives complete inputs. This targets the common problem that real user requests lack the precise text, visuals, and spatial details the model expects for operations like object replacement or style changes. Training combines supervised examples for planning with preference data to refine tool use. Results on a new benchmark and prior ones indicate gains over direct instruction methods and compatibility with other editing models.

Core claim

Aurora pairs a tool-augmented vision-language model agent with a unified video diffusion transformer. The agent maps a raw user request to a structured edit plan aligned with the transformer's conditioning channels, thereby resolving textual and visual underspecification before generation. The agent is trained with supervised data for complete edit planning and reference-image selection, together with preference pairs for robust tool use and instruction refinement.

What carries the argument

The tool-augmented vision-language model agent that generates edit plans and selects reference images to align with the diffusion transformer's inputs.

If this is right

Aurora produces higher-quality edits than instruction-only baselines on AgentEdit-Bench and existing video editing benchmarks.
The VLM agent transfers to compatible frozen video editing models without retraining the diffusion component.
The framework supports replacement, removal, style transfer, and reference-driven insertion from natural language requests that omit model-ready details.
Structured planning before diffusion reduces failures caused by missing spatial grounding or reference images.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar agent layers could extend to other generative models that require precise conditioning from loose user input.
The separation of planning from generation may allow independent scaling of the agent for more complex multi-step edits.
Benchmarks focused on underspecification could become standard for testing agent-augmented creative tools.

Load-bearing premise

The trained VLM agent can reliably resolve textual and visual underspecification in raw user requests without introducing planning errors that degrade the downstream diffusion output.

What would settle it

If Aurora videos score lower than instruction-only baselines on human preference ratings or automated metrics across AgentEdit-Bench and the two existing benchmarks, the claimed improvement would not hold.

read the original abstract

Recent video editing models have converged on a unified conditioning design: a single diffusion transformer jointly consumes text, source video, and reference images, and one set of weights covers replacement, removal, style transfer, and reference-driven insertion. The design is flexible, but it assumes that the user already provides model-ready text, reference images, and spatial grounding for local edits, which real requests often omit. We present Aurora, an agentic video editing framework that pairs a tool-augmented vision-language model (VLM) agent with a unified video diffusion transformer. The VLM agent maps a raw user request to a structured edit plan aligned with the transformer's conditioning channels, thereby resolving textual and visual underspecification before generation. We train the VLM agent with supervised data for complete edit planning and reference-image selection, together with preference pairs for robust tool use and instruction refinement. We introduce AgentEdit-Bench to evaluate agent-enhanced video editing under textual and visual underspecification. Experiments on AgentEdit-Bench and two existing video editing benchmarks show that Aurora improves over instruction-only baselines and that the VLM agent transfers to compatible frozen video editing models. Project page: https://yeates.github.io/Aurora-Page

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Aurora adds a VLM agent to turn vague user requests into structured inputs for unified video diffusion models, but the reported gains rest on high-level results without isolating the agent's contribution.


The main point is that Aurora trains a tool-augmented VLM to map raw editing requests into edit plans and reference images that a unified conditioning diffusion transformer can use directly. This targets the practical gap where existing unified models expect already-clean text, references, and grounding that typical users do not supply. The agent is trained on supervised planning and reference selection plus preference pairs for tool use and refinement, then tested on a new AgentEdit-Bench plus two prior benchmarks. The abstract states clear gains over instruction-only baselines and successful transfer to compatible frozen models. That combination of agent preprocessing with a single diffusion backbone is the concrete new piece. The benchmark itself is also useful for anyone studying underspecified video editing.

The results are presented at a summary level, with no dataset sizes, statistical tests, or per-component breakdowns supplied in the abstract. The stress-test concern holds: there is no reported separation of agent planning accuracy from downstream video metrics, so it remains unclear whether observed improvements come from better reference handling or from the agent resolving underspecification without introducing new errors. If the full paper contains ablations or error correlations that address this, they would strengthen the central claim substantially.

This work is aimed at researchers building practical video editing systems and at groups exploring agentic interfaces for generative models. A reader focused on usability gaps in diffusion-based editing would find the framework and benchmark worth examining. The paper deserves a serious referee because the usability problem is genuine and the proposed architecture is a direct response to it, even though the current evidence needs more granular validation to be fully convincing.

Referee Report

2 major / 2 minor

Summary. The paper introduces Aurora, a framework pairing a tool-augmented VLM agent with a unified video diffusion transformer. The agent converts raw user requests into structured edit plans (resolving textual/visual underspecification) via supervised training on planning/reference selection plus preference optimization; the system is evaluated on a new AgentEdit-Bench plus two prior benchmarks, reporting gains over instruction-only baselines and successful transfer to compatible frozen editing models.

Significance. If the empirical gains are shown to arise specifically from the agent's planning rather than ancillary factors, the work would meaningfully advance practical video editing by making unified diffusion models usable with underspecified natural-language inputs. The transfer results to frozen models and the introduction of AgentEdit-Bench for underspecification testing are concrete strengths that could influence follow-on agentic editing systems.

major comments (2)

[§4] §4 (Experiments): aggregate improvements on AgentEdit-Bench and prior benchmarks are reported, but no per-example breakdown of agent planning accuracy, reference-selection errors, or correlation between those errors and downstream metrics (temporal consistency, edit fidelity) is provided. This leaves the central claim—that the supervised+preference-trained agent reliably resolves underspecification without harming the frozen diffusion output—unsupported by the necessary isolation.
[§3] §3 (AgentEdit-Bench): the benchmark is introduced to evaluate underspecification handling, yet the manuscript supplies no dataset size, curation protocol for textual/visual underspecification, or statistical significance tests on the reported gains. Without these, the evaluation cannot be assessed for robustness or generalizability.
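The per-example analysis requested in the first major comment could take the form of a rank correlation between an agent-level error count and a downstream score. The following is a minimal sketch in plain Python, not the paper's analysis: the function names are illustrative, and the example values are made-up placeholders rather than results from AgentEdit-Bench.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder data (hypothetical): planning errors per example and
# the edit-fidelity score of the resulting video.
planning_errors = [0, 1, 0, 2, 1, 0]
edit_fidelity = [2.8, 2.1, 3.0, 1.2, 1.9, 2.7]
rho = spearman(planning_errors, edit_fidelity)
```

A strongly negative `rho` on real data would indicate that planning errors track degraded output, which is the isolation the comment asks for; a rank correlation is used here because rubric scores are ordinal.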

minor comments (2)

[§2] Notation for the structured edit plan (text, reference images, spatial grounding) is introduced in §2 but never formalized with an explicit tuple or schema; adding a short definition would improve reproducibility.
[Figure 3] Figure 3 (qualitative results) would benefit from side-by-side comparison with the instruction-only baseline on the same underspecified inputs to visually illustrate the agent's contribution.
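The explicit schema requested in the first minor comment might look like the sketch below. It is an assumption-laden illustration: the class name `EditPlan`, the field names, and the normalized bounding-box form of spatial grounding are all hypothetical and are not taken from the paper, which may define the plan differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max), normalized to [0, 1].
BBox = Tuple[float, float, float, float]


@dataclass(frozen=True)
class EditPlan:
    """Illustrative edit-plan schema; not the paper's definition."""
    text: str  # model-ready edit instruction
    reference_images: List[str] = field(default_factory=list)  # paths or URIs
    spatial_grounding: Optional[BBox] = None  # None for global edits

    def validate(self) -> None:
        """Raise ValueError if the plan is empty or the box is malformed."""
        if not self.text.strip():
            raise ValueError("edit plan needs non-empty text")
        if self.spatial_grounding is not None:
            x0, y0, x1, y1 = self.spatial_grounding
            if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
                raise ValueError("bounding box must be normalized with min < max")


# Example plan for a local replacement edit (all values are made up).
plan = EditPlan(
    text="Replace the red car with a blue bicycle",
    reference_images=["refs/bicycle.png"],
    spatial_grounding=(0.2, 0.4, 0.6, 0.9),
)
plan.validate()
```

Making the grounding optional keeps one type usable for both local edits (replacement, removal) and global ones (style transfer), which matches the single-model coverage the paper describes.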

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important aspects of our evaluation that can be strengthened. We address each major comment below and will incorporate the suggested additions in the revised manuscript.


Referee: [§4] §4 (Experiments): aggregate improvements on AgentEdit-Bench and prior benchmarks are reported, but no per-example breakdown of agent planning accuracy, reference-selection errors, or correlation between those errors and downstream metrics (temporal consistency, edit fidelity) is provided. This leaves the central claim—that the supervised+preference-trained agent reliably resolves underspecification without harming the frozen diffusion output—unsupported by the necessary isolation.

Authors: We agree that isolating the agent's contribution through per-example analysis would strengthen the central claim. In the revised version we will add a dedicated analysis subsection that reports planning accuracy and reference-selection error rates on a representative sample of examples from AgentEdit-Bench, together with quantitative correlations between these agent-level metrics and downstream measures such as temporal consistency and edit fidelity. This will provide direct evidence that the observed gains arise from the agent's planning rather than ancillary factors. revision: yes

Referee: [§3] §3 (AgentEdit-Bench): the benchmark is introduced to evaluate underspecification handling, yet the manuscript supplies no dataset size, curation protocol for textual/visual underspecification, or statistical significance tests on the reported gains. Without these, the evaluation cannot be assessed for robustness or generalizability.

Authors: We acknowledge that these details are necessary for a complete assessment of the benchmark. We will expand the AgentEdit-Bench section to explicitly state the total number of examples, describe the curation protocol used to generate instances exhibiting textual and visual underspecification, and include statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) for the performance differences versus instruction-only baselines. revision: yes
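One of the tests named in this response, a bootstrap confidence interval on paired per-example scores, can be sketched as follows. This is a generic percentile bootstrap in plain Python, offered as an illustration rather than the authors' protocol; the function name, the resample count, and the example scores are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def paired_bootstrap_ci(
    system: Sequence[float],
    baseline: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, CI lower, CI upper) for paired scores.

    Resamples examples with replacement and takes percentiles of the
    resampled mean differences (percentile bootstrap).
    """
    if len(system) != len(baseline) or not system:
        raise ValueError("need two non-empty sequences of equal length")
    rng = random.Random(seed)
    diffs: List[float] = [s - b for s, b in zip(system, baseline)]
    n = len(diffs)
    observed = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lower, upper


# Made-up per-example scores for an agent-assisted system and an
# instruction-only baseline on the same examples.
agent_scores = [2.6, 2.9, 2.1, 3.0, 2.4, 2.8, 1.9, 2.7]
baseline_scores = [2.2, 2.5, 2.3, 2.4, 2.0, 2.6, 1.8, 2.1]
mean_diff, ci_low, ci_high = paired_bootstrap_ci(agent_scores, baseline_scores)
```

If the interval excludes zero, the gain is unlikely to be a resampling artifact at the chosen level. Pairing matters here because both systems are scored on the same examples, so resampling the differences removes per-example difficulty from the comparison.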

Circularity Check

0 steps flagged

No circularity: purely empirical framework with no derivations or self-referential reductions

full rationale

The paper introduces an agentic video editing system pairing a trained VLM agent with a unified diffusion transformer, evaluated on a new benchmark (AgentEdit-Bench) and prior benchmarks. No equations, derivations, or formal proof chains appear in the abstract or described content. All central claims rest on aggregate experimental improvements over instruction-only baselines and transfer tests to frozen models. These results are externally falsifiable via the reported benchmarks and do not reduce by construction to fitted parameters, self-definitions, or self-citation chains. The work is self-contained as standard empirical CV research.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied systems paper with no mathematical derivations. No free parameters, axioms, or invented entities are specified in the abstract; training details for the VLM agent are described at a high level only.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

The VLM agent maps a raw user request to a structured edit plan... (y', c, q, m) = π_ϕ(V_src, y, R)

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

Aurora video DiT... flow-matching objective L_FM

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[49] Instruction Following (0–3). Was the requested edit performed in the named region with the named target? 3: requested edit performed correctly in the right region. 2: edit performed but with a minor inaccuracy or omission. 1: edit partially performed or applied to the wrong region. 0: instruction ignored, or the opposite was done

[50] Edit Region Localization (0–3). Was the change confined to the specified region?

[53] Temporal Consistency (0–3). Inferred from the AFTER frame, would the edited entity remain stable across the clip without ghosting or pop-in?

[54] IP Presence (0–3). Is the named entity actually visible and recognizable in roughly the right region?

[55] IP Identity Match (0–3). Does the visible entity match the specific real-world identity, brand colors, logo, and signature shape? You may rely only on your internal world knowledge of the brand or product. Return your evaluation in exactly this format: Instruction Following: [score] - [one-sentence justification] Edit Region Localization: [score] - [one-s...

[56] Instruction Following (0–3). Was the requested edit performed? 3: edit performed correctly. 2: edit performed with a minor inaccuracy. 1: edit partially performed or applied to the wrong region. 0: instruction ignored, or the opposite was done. Removal-only clause: if the model replaced the target with a new object instead of removing it, give at most 1

[57] Edit Region Localization (0–3). Was the change confined to the specified region or, for global edits, to the implied scope?

[58] Source Preservation (0–3). Are subject motion, geometry and lighting outside the edit region preserved?

[59] Visual Quality (0–3). Realism, seamless integration, lighting and shadow match, scale and perspective

[60] Temporal Consistency (0–3). Inferred from the AFTER frame, would the result remain stable across the clip? Return your evaluation in exactly this format: Instruction Following: [score] - [one-sentence justification] Edit Region Localization: [score] - [one-sentence justification] Source Preservation: [score] - [one-sentence justification] Visual Quality: ...

work page

[1] Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, et al. Scaling instruction-based video editing with a high-quality synthetic dataset. arXiv preprint arXiv:2510.15742, 2025

[2] Black Forest Labs. FLUX.2 [klein]: Towards interactive visual intelligence. Blog post, January 2026. URL https://bfl.ai/blog/flux2-klein-towards-interactive-visual-intelligence. Model page: https://bfl.ai/models/flux-2-klein

[3] Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, et al. Z-Image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699, 2025

[4] Liyang Chen, Tianxiang Ma, Jiawei Liu, Bingchuan Li, Zhuowei Chen, Lijie Liu, Xu He, Gen Li, Qian He, and Zhiyong Wu. HuMo: Human-centric video generation via collaborative multi-modal conditioning. arXiv preprint arXiv:2509.08519, 2025

[5] Wei Chow, Linfeng Li, Lingdong Kong, Zefeng Li, Qi Xu, Hang Song, Tian Ye, Xian Wang, Jinbin Bai, Shilin Xu, Xiangtai Li, Junting Pan, Shaoteng Liu, Ran Zhou, Tianshu Yang, and Songhua Liu. EditMGT: Unleashing potentials of masked generative transformers in image editing. arXiv preprint arXiv:2512.11715, 2025

[6] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025

[7] Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan, Yilei Jiang, Hongyu Li, Dian Zheng, Chenyang Wang, and Xiangyu Yue. Gen-Searcher: Reinforcing agentic search for image generation. arXiv preprint arXiv:2603.28767, 2026

[8] Yang Fu, Yike Zheng, Ziyun Dai, and Henghui Ding. EffectErase: Joint video object removal and insertion for high-quality effect erasing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026

[9] Michal Geyer, Omer Bar Tal, Shai Bagon, and Tali Dekel. TokenFlow: Consistent diffusion features for consistent video editing. In ICLR, 2024

[10] Google DeepMind. Gemini 3.1 Flash-Lite. Model card, 2026. URL https://deepmind.google/models/gemini/flash-lite/. Model card PDF: https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-1-Flash-Lite-Model-Card.pdf

[11] Haoyang He, Jie Wang, Jiangning Zhang, Zhucun Xue, Xingyuan Bu, Qiangpeng Yang, Shilei Wen, and Lei Xie. OpenVE-3M: A large-scale high-quality dataset for instruction-guided video editing. arXiv preprint arXiv:2512.07826, 2025

[12] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

[13] Hang Hua, Ziyun Zeng, Yizhi Song, Yunlong Tang, Liu He, Daniel Aliaga, Wei Xiong, and Jiebo Luo. MMIG-Bench: Towards comprehensive and explainable evaluation of multi-modal image generation models. Advances in Neural Information Processing Systems, 2026

[14] Kaiyi Huang, Yukun Huang, Xuefei Ning, Zinan Lin, Yu Wang, and Xihui Liu. GenMAC: Compositional text-to-video generation with multi-agent collaboration. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026

[15] Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. VACE: All-in-one video creation and editing. arXiv preprint arXiv:2503.07598, 2025

[16] Xuan Ju, Tianyu Wang, Yuqian Zhou, He Zhang, Qing Liu, Nanxuan Zhao, Zhifei Zhang, Yijun Li, Yuanhao Cai, Shaoteng Liu, et al. EditVerse: Unifying image and video editing and generation with in-context learning. arXiv preprint arXiv:2509.20360, 2025

[17] Runjia Li, Moayed Haji-Ali, Ashkan Mirzaei, Chaoyang Wang, Arpit Sahni, Ivan Skorokhodov, Aliaksandr Siarohin, Tomas Jakab, Junlin Han, Sergey Tulyakov, Philip Torr, and Willi Menapace. EgoEdit: Dataset, real-time streaming model, and benchmark for egocentric video editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026

[18] Zhengyang Liang, Daoan Zhang, Huichi Zhou, Rui Huang, Bobo Li, Yuechen Zhang, Shengqiong Wu, Xiaohan Wang, Jiebo Luo, Lizi Liao, et al. UniVA: Universal video agent towards open-source next-generation video generalist. arXiv preprint arXiv:2511.08521, 2025

[19] Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, and Mike Zheng Shou. Kiwi-Edit: Versatile video editing via instruction and reference guidance. arXiv preprint arXiv:2603.02175, 2026

[20] Yunlong Lin, Linqing Wang, Kunjie Lin, Zixu Lin, Kaixiong Gong, Wenbo Li, Bin Lin, Zhenxi Li, Shiyi Zhang, Yuyang Peng, Wenxun Dai, Xinghao Ding, Chunyu Wang, and Qinglin Lu. JarvisEvo: Towards a self-evolving photo editing agent with synergistic editor-evaluator optimization. arXiv preprint arXiv:2511.23002, 2025

[21] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In European Conference on Computer Vision, 2024

[22] Chenxuan Miao, Yutong Feng, Jianshu Zeng, Zixiang Gao, Hantang Liu, Yunfeng Yan, Donglian Qi, Xi Chen, Bin Wang, and Hengshuang Zhao. ROSE: Remove objects with side effects in videos. Advances in Neural Information Processing Systems, 38, 2025

[23] Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. FateZero: Fusing attentions for zero-shot text-based video editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15932–15942, 2023

[24] Qwen Team. Qwen3-VL: A family of open multimodal large language models. Model card and blog post, 2025. URL https://qwenlm.github.io/blog/qwen3-vl/

[25] Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

[26] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems, 2023

[27] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment anything in images and videos. In ICLR, 2025

[28] Runway. Introducing Runway Aleph. https://runwayml.com/research/introducing-runway-aleph, 2025. Accessed: 2025-09-10

[29] Serper. Serper: The world’s fastest and cheapest Google search API. https://serper.dev/, 2024. Accessed: 2026-04-25

[30] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

[31] DecartAI Team. Lucy Edit: Open-weight text-guided video editing. 2025

[32] Changyao Tian, Danni Yang, Guanzhou Chen, Erfei Cui, Zhaokai Wang, Yuchen Duan, Penghao Yin, Sitao Chen, Ganlin Yang, Mingxin Liu, Zirun Zhu, Ziqian Fan, Leyao Gu, Haomin Wang, Qi Wei, Jinhui Yin, Xue Yang, Zhihang Zhong, Qi Qin, Yi Xin, Bin Fu, Yihao Liu, Jiaye Ge, Qipeng Guo, Gen Luo, Hongsheng Li, Yu Qiao, Kai Chen, and Hongjie Zhang. InternVL-U: Democratizing unified multimodal models for understanding, reasoning, generation and editing. arXiv, 2026

[33] Rong-Cheng Tu, Wenhao Sun, Zhao Jin, Jingyi Liao, Jiaxing Huang, and Dacheng Tao. Spagent: Adaptive task decomposition and model selection for general video generation and editing. IEEE Transactions on Image Processing, 2026

[34] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

[35] Jiahao Wang, Yufeng Yuan, Rujie Zheng, Youtian Lin, Jian Gao, Lin-Zhuo Chen, Yajie Bao, Yi Zhang, Chang Zeng, Yanxi Zhou, Xiao-Xiao Long, Hao Zhu, Zhaoxiang Zhang, Xun Cao, and Yao Yao. SpatialVID: A large-scale video dataset with spatial annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026

[36] Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, and Wenhu Chen. UniVideo: Unified understanding, generation, and editing for videos. arXiv preprint arXiv:2510.08377, 2025

[37] Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, and Zheng Liu. OmniGen2: Towards instruction-aligned multimodal generation. arXiv preprint arXiv:2506.18871, 2025

[38] Yuhui Wu, Liyi Chen, Ruibin Li, Shihao Wang, Chenxi Xie, and Lei Zhang. InsViE-1M: Effective instruction-based video editing with elaborate dataset construction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16692–16701, 2025

[39] Xiangpeng Yang, Ji Xie, Yiyuan Yang, Yan Huang, Min Xu, and Qiang Wu. VideoCoF: Unified video editing with temporal reasoner. arXiv preprint arXiv:2512.07469, 2025

[40] Mingde Yao, Zhiyuan You, King-Man Tam, Menglu Wang, and Tianfan Xue. PhotoAgent: Agentic photo editing with exploratory visual aesthetic planning. arXiv preprint arXiv:2602.22809, 2026

[41] Danah Yatim, Rafail Fridman, Omer Bar-Tal, Yoni Kasten, and Tali Dekel. Space-time diffusion features for zero-shot text-driven motion transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8466–8476, 2024

[42] Ruijie Ye, Jiayi Zhang, Zhuoxin Liu, Zihao Zhu, Siyuan Yang, Li Li, Tianfu Fu, Franck Dernoncourt, Yue Zhao, Jiacheng Zhu, et al. Agent Banana: High-fidelity image editing with agentic thinking and tooling. arXiv preprint arXiv:2602.09084, 2026

[43] Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Jiebo Luo, and Li Yuan. OpenS2V-Nexus: A detailed benchmark and million-scale dataset for subject-to-video generation. Advances in Neural Information Processing Systems, 38, 2025

[44] Zhengqing Yuan, Ruoxi Chen, Zhaoxu Li, Haolong Jia, Lifang He, Chi Wang, and Lichao Sun. Mora: Enabling generalist video generation via a multi-agent framework. arXiv preprint arXiv:2403.13248, 2024

[45] Ziyun Zeng, Hang Hua, and Jiebo Luo. Mira: Multimodal iterative reasoning agent for image editing. arXiv preprint arXiv:2511.21087, 2025

[46] Zhongwei Zhang, Fuchen Long, Wei Li, Zhaofan Qiu, Wu Liu, Ting Yao, and Tao Mei. Region-constraint in-context generation for instructional video editing. arXiv preprint arXiv:2512.17650, 2025

[47] Haozhe Zhao, Xiaojian Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, and Baobao Chang. UltraEdit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems, 37, 2024

[48] Bojia Zi, Penghui Ruan, Marco Chen, Xianbiao Qi, Shaozhe Hao, Shihao Zhao, Youze Huang, Bin Liang, Rong Xiao, and Kam-Fai Wong. Señorita-2M: A high-quality instruction-based dataset for general video editing by video specialists. arXiv preprint arXiv:2502.06734, 2025

[Figure: comparison panels labeled UniVideo, KiwiEdit, Aurora (Ours), and Source / Reference; example instruction: "Replace the glowing sphere on the tab..."]

Instruction Following (0–3). Was the requested edit performed in the named region with the named target? 3: requested edit performed correctly in the right region. 2: edit performed but with a minor inaccuracy or omission. 1: edit partially performed or applied to the wrong region. 0: instruction ignored, or the opposite was done.

Edit Region Localization (0–3). Was the change confined to the specified region?

Temporal Consistency (0–3). Inferred from the AFTER frame, would the edited entity remain stable across the clip without ghosting or pop-in?

IP Presence (0–3). Is the named entity actually visible and recognizable in roughly the right region?

IP Identity Match (0–3). Does the visible entity match the specific real-world identity, brand colors, logo, and signature shape? You may rely only on your internal world knowledge of the brand or product.

Return your evaluation in exactly this format: Instruction Following: [score] - [one-sentence justification] Edit Region Localization: [score] - [one-s...

Instruction Following (0–3). Was the requested edit performed? 3: edit performed correctly. 2: edit performed with a minor inaccuracy. 1: edit partially performed or applied to the wrong region. 0: instruction ignored, or the opposite was done. Removal-only clause: if the model replaced the target with a new object instead of removing it, give at most 1.

Edit Region Localization (0–3). Was the change confined to the specified region or, for global edits, to the implied scope?

Source Preservation (0–3). Are subject motion, geometry and lighting outside the edit region preserved?

Visual Quality (0–3). Realism, seamless integration, lighting and shadow match, scale and perspective.

Temporal Consistency (0–3). Inferred from the AFTER frame, would the result remain stable across the clip?

Return your evaluation in exactly this format: Instruction Following: [score] - [one-sentence justification] Edit Region Localization: [score] - [one-sentence justification] Source Preservation: [score] - [one-sentence justification] Visual Quality: ...