arxiv: 2605.18748 · v1 · pith:2MMWD7RJnew · submitted 2026-05-18 · 💻 cs.CV

Aurora: Unified Video Editing with a Tool-Using Agent

Pith reviewed 2026-05-20 10:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords: video editing, vision-language model, agent, diffusion transformer, edit planning, reference selection, underspecification, AgentEdit-Bench

The pith

A trained tool-using vision-language agent converts raw user requests into structured plans for a unified video diffusion transformer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Aurora as a framework that adds a vision-language model agent to a video diffusion model for editing tasks. The agent uses tools to interpret incomplete instructions, create edit plans, and choose reference images so the diffusion model receives complete inputs. This targets the common problem that real user requests lack the precise text, visuals, and spatial details the model expects for operations like object replacement or style changes. Training combines supervised examples for planning with preference data to refine tool use. Results on a new benchmark and prior ones indicate gains over direct instruction methods and compatibility with other editing models.

Core claim

Aurora pairs a tool-augmented vision-language model agent with a unified video diffusion transformer. The agent maps a raw user request to a structured edit plan aligned with the transformer's conditioning channels, thereby resolving textual and visual underspecification before generation. The agent is trained with supervised data for complete edit planning and reference-image selection, together with preference pairs for robust tool use and instruction refinement.
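
As a rough illustration of what a structured edit plan aligned with the conditioning channels might look like, the sketch below defines a hypothetical schema in Python. The class and field names (`EditPlan`, `SpatialGrounding`, `refined_text`, and so on) and the rules about which operations need a reference image or spatial grounding are assumptions made for illustration; the paper's actual plan format is not given in this summary.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical schema: names and rules are illustrative, not taken from the paper.
VALID_OPERATIONS = {"replace", "remove", "style", "insert"}


@dataclass
class SpatialGrounding:
    frame_index: int
    box_xyxy: Tuple[float, float, float, float]  # normalized [0, 1] coordinates


@dataclass
class EditPlan:
    operation: str                      # one of VALID_OPERATIONS
    refined_text: str                   # model-ready instruction text
    reference_images: List[str] = field(default_factory=list)  # paths or URLs
    grounding: Optional[SpatialGrounding] = None


def missing_channels(plan: EditPlan) -> List[str]:
    """Return the conditioning channels that the plan still leaves unspecified."""
    missing = []
    if plan.operation not in VALID_OPERATIONS:
        missing.append("operation")
    if not plan.refined_text.strip():
        missing.append("text")
    # Assumption: reference-driven insertion needs at least one reference image.
    if plan.operation == "insert" and not plan.reference_images:
        missing.append("reference_images")
    # Assumption: local edits need grounding; style transfer is treated as global.
    if plan.operation in {"replace", "remove", "insert"} and plan.grounding is None:
        missing.append("grounding")
    return missing


underspecified = EditPlan(operation="insert", refined_text="Add a red mug on the table")
print(missing_channels(underspecified))
```

A check of this kind is one way an agent could decide whether another tool call (for example, reference retrieval or object localization) is needed before the plan is handed to the diffusion transformer.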

What carries the argument

The tool-augmented vision-language model agent that generates edit plans and selects reference images to align with the diffusion transformer's inputs.

If this is right

  • Aurora produces higher-quality edits than instruction-only baselines on AgentEdit-Bench and existing video editing benchmarks.
  • The VLM agent transfers to compatible frozen video editing models without retraining the diffusion component.
  • The framework supports replacement, removal, style transfer, and reference-driven insertion from natural language requests that omit model-ready details.
  • Structured planning before diffusion reduces failures caused by missing spatial grounding or reference images.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar agent layers could extend to other generative models that require precise conditioning from loose user input.
  • The separation of planning from generation may allow independent scaling of the agent for more complex multi-step edits.
  • Benchmarks focused on underspecification could become standard for testing agent-augmented creative tools.

Load-bearing premise

The trained VLM agent can reliably resolve textual and visual underspecification in raw user requests without introducing planning errors that degrade the downstream diffusion output.

What would settle it

If Aurora videos score lower than instruction-only baselines on human preference ratings or automated metrics across AgentEdit-Bench and the two existing benchmarks, the claimed improvement would not hold.

Original abstract

Recent video editing models have converged on a unified conditioning design: a single diffusion transformer jointly consumes text, source video, and reference images, and one set of weights covers replacement, removal, style transfer, and reference-driven insertion. The design is flexible, but it assumes that the user already provides model-ready text, reference images, and spatial grounding for local edits, which real requests often omit. We present Aurora, an agentic video editing framework that pairs a tool-augmented vision-language model (VLM) agent with a unified video diffusion transformer. The VLM agent maps a raw user request to a structured edit plan aligned with the transformer's conditioning channels, thereby resolving textual and visual underspecification before generation. We train the VLM agent with supervised data for complete edit planning and reference-image selection, together with preference pairs for robust tool use and instruction refinement. We introduce AgentEdit-Bench to evaluate agent-enhanced video editing under textual and visual underspecification. Experiments on AgentEdit-Bench and two existing video editing benchmarks show that Aurora improves over instruction-only baselines and that the VLM agent transfers to compatible frozen video editing models. Project page: https://yeates.github.io/Aurora-Page

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Aurora, a framework pairing a tool-augmented VLM agent with a unified video diffusion transformer. The agent converts raw user requests into structured edit plans (resolving textual/visual underspecification) via supervised training on planning/reference selection plus preference optimization; the system is evaluated on a new AgentEdit-Bench plus two prior benchmarks, reporting gains over instruction-only baselines and successful transfer to compatible frozen editing models.

Significance. If the empirical gains are shown to arise specifically from the agent's planning rather than ancillary factors, the work would meaningfully advance practical video editing by making unified diffusion models usable with underspecified natural-language inputs. The transfer results to frozen models and the introduction of AgentEdit-Bench for underspecification testing are concrete strengths that could influence follow-on agentic editing systems.

major comments (2)
  1. [§4] §4 (Experiments): aggregate improvements on AgentEdit-Bench and prior benchmarks are reported, but no per-example breakdown of agent planning accuracy, reference-selection errors, or correlation between those errors and downstream metrics (temporal consistency, edit fidelity) is provided. This leaves the central claim—that the supervised+preference-trained agent reliably resolves underspecification without harming the frozen diffusion output—unsupported by the necessary isolation.
  2. [§3] §3 (AgentEdit-Bench): the benchmark is introduced to evaluate underspecification handling, yet the manuscript supplies no dataset size, curation protocol for textual/visual underspecification, or statistical significance tests on the reported gains. Without these, the evaluation cannot be assessed for robustness or generalizability.

minor comments (2)
  1. [§2] Notation for the structured edit plan (text, reference images, spatial grounding) is introduced in §2 but never formalized with an explicit tuple or schema; adding a short definition would improve reproducibility.
  2. [Figure 3] Figure 3 (qualitative results) would benefit from side-by-side comparison with the instruction-only baseline on the same underspecified inputs to visually illustrate the agent's contribution.
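
The per-example isolation requested in major comment 1 could be approached along the following lines. This is a minimal sketch under stated assumptions: each benchmark example is assumed to carry a binary planning-error flag and a downstream score on a 0 to 3 scale, and the values below are toy numbers, not results from the paper. With a binary flag, the Pearson coefficient computed here is the point-biserial correlation.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy, hypothetical per-example records (not data from the paper):
# 1 = the agent's plan contained an error, 0 = the plan was judged correct.
plan_error = [0, 0, 1, 0, 1, 1, 0, 0]
# Downstream edit-fidelity score for the same examples.
edit_fidelity = [3, 3, 1, 2, 1, 0, 3, 2]

r = pearson(plan_error, edit_fidelity)
print(f"correlation between planning error and edit fidelity: {r:.3f}")
```

A clearly negative coefficient on real data would suggest that planning errors propagate into the diffusion output; a coefficient near zero would leave the source of the aggregate gains open.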

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important aspects of our evaluation that can be strengthened. We address each major comment below and will incorporate the suggested additions in the revised manuscript.

  1. Referee: [§4] §4 (Experiments): aggregate improvements on AgentEdit-Bench and prior benchmarks are reported, but no per-example breakdown of agent planning accuracy, reference-selection errors, or correlation between those errors and downstream metrics (temporal consistency, edit fidelity) is provided. This leaves the central claim—that the supervised+preference-trained agent reliably resolves underspecification without harming the frozen diffusion output—unsupported by the necessary isolation.

    Authors: We agree that isolating the agent's contribution through per-example analysis would strengthen the central claim. In the revised version we will add a dedicated analysis subsection that reports planning accuracy and reference-selection error rates on a representative sample of examples from AgentEdit-Bench, together with quantitative correlations between these agent-level metrics and downstream measures such as temporal consistency and edit fidelity. This will provide direct evidence that the observed gains arise from the agent's planning rather than ancillary factors.

    revision: yes

  2. Referee: [§3] §3 (AgentEdit-Bench): the benchmark is introduced to evaluate underspecification handling, yet the manuscript supplies no dataset size, curation protocol for textual/visual underspecification, or statistical significance tests on the reported gains. Without these, the evaluation cannot be assessed for robustness or generalizability.

    Authors: We acknowledge that these details are necessary for a complete assessment of the benchmark. We will expand the AgentEdit-Bench section to explicitly state the total number of examples, describe the curation protocol used to generate instances exhibiting textual and visual underspecification, and include statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) for the performance differences versus instruction-only baselines.

    revision: yes
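
The bootstrap confidence intervals mentioned in the second response could be computed roughly as follows. This is a sketch, not the authors' procedure: it assumes per-example scores are available for both systems on the same benchmark items, uses a percentile bootstrap over paired differences, and the function name, parameters, and toy scores are all hypothetical.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    agent: Sequence[float],
    baseline: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, lower bound, upper bound) for agent minus baseline."""
    if len(agent) != len(baseline) or len(agent) == 0:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    # Pairing: both systems are scored on the same examples, so resample differences.
    diffs = [a - b for a, b in zip(agent, baseline)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lower, upper


# Toy, hypothetical per-example scores (not data from the paper).
agent_scores = [2.5, 3.0, 2.0, 2.5, 3.0, 1.5, 2.5, 3.0]
baseline_scores = [2.0, 2.5, 2.0, 1.5, 2.5, 1.5, 2.0, 2.0]

mean_diff, lower, upper = paired_bootstrap_ci(agent_scores, baseline_scores)
print(f"mean gain {mean_diff:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

If the interval excludes zero, the gain over the instruction-only baseline is unlikely to be a resampling artifact at the chosen level; resampling paired differences rather than each system independently keeps per-example difficulty from inflating the variance.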

Circularity Check

0 steps flagged

No circularity: purely empirical framework with no derivations or self-referential reductions

The paper introduces an agentic video editing system pairing a trained VLM agent with a unified diffusion transformer, evaluated on a new benchmark (AgentEdit-Bench) and prior benchmarks. No equations, derivations, or formal proof chains appear in the abstract or described content. All central claims rest on aggregate experimental improvements over instruction-only baselines and transfer tests to frozen models. These results are externally falsifiable via the reported benchmarks and do not reduce by construction to fitted parameters, self-definitions, or self-citation chains. The work is self-contained as standard empirical CV research.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied systems paper with no mathematical derivations. No free parameters, axioms, or invented entities are specified in the abstract; training details for the VLM agent are described at a high level only.
