pith. machine review for the scientific record.

arxiv: 2605.06535 · v1 · submitted 2026-05-07 · 💻 cs.CV · cs.AI

Recognition: unknown

Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance

Guoqiang Liang, Mike Zheng Shou, Yiqi Lin, Ziyun Zeng

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 12:29 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords background · dataset · data · guidance · replacement · sparkle · task · video

The pith

A decoupled pipeline for generating foreground and background guidance enables high-quality datasets for instruction-guided video background replacement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current video editing datasets struggle with background replacement because they do not provide enough precise guidance when creating new scenes, leading to static and unrealistic results. The paper introduces a method to generate the foreground and background parts of videos separately, applying strict quality checks to ensure natural motion and interactions. This produces a large dataset of video pairs focused on background changes across common themes. Training models on this data leads to much better results in maintaining temporal consistency and accurate foreground-background blending compared to earlier approaches. Such improvements matter for practical uses in film and advertising where seamless scene changes are needed.
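To make the mechanism concrete, here is a minimal sketch of how such a decoupled synthesis step could be organized, following only the stages described above and in the paper's pipeline figure; every function name is an illustrative placeholder, not the paper's implementation.

```python
# Illustrative sketch only: each helper stands in for a component the pipeline
# is described as using (instruction-guided first-frame editing, foreground
# removal, image-to-video animation, BAIT-style foreground tracking).
def build_training_pair(source_video, instruction, *,
                        replace_background, remove_foreground,
                        animate_image, track_foreground, passes_quality_checks):
    """Return (source, background guidance, foreground guidance), or None if filtered out."""
    first_frame = source_video[0]

    # Foreground guidance: track the subject across the whole clip.
    fg_masks = track_foreground(source_video)

    # Background guidance, generated independently of the foreground: edit the
    # first frame per the instruction, strip the subject to get a clean plate,
    # then animate that plate into a background video.
    edited_frame = replace_background(first_frame, instruction)
    clean_plate = remove_foreground(edited_frame)
    bg_video = animate_image(clean_plate)

    # Strict filtering: discard pairs whose guidance is static or broken.
    if not passes_quality_checks(bg_video, fg_masks):
        return None
    return source_video, bg_video, fg_masks
```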

Core claim

The lack of precise background guidance in data synthesis causes state-of-the-art models to generate static, unnatural backgrounds in replacement tasks. A scalable pipeline that creates foreground and background guidance in a decoupled manner with strict quality filtering addresses this issue. Building on the pipeline yields a dataset of about 140,000 video pairs covering five common background-change themes and a dedicated evaluation benchmark for the task. Models trained using this dataset substantially outperform existing baselines on both prior and new benchmarks.

What carries the argument

Decoupled generation of foreground and background guidance combined with strict quality filtering in the data synthesis pipeline.

Load-bearing premise

That the lack of precise background guidance during data synthesis is the main cause of static, unnatural outputs in previous models, and that decoupling foreground and background guidance yields better training data without introducing new failure modes.

What would settle it

Run models trained on the new dataset on held-out cases whose instructions call for dynamic scenes. If the outputs still show static backgrounds, the approach has not resolved the core issue; consistently lively, temporally coherent backgrounds with accurate foreground blending would support it.
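One concrete way to run that check is to mask out the foreground and measure how much the remaining pixels change from frame to frame; a near-zero score on prompts that explicitly ask for motion would signal the failure mode described above. The threshold below is an arbitrary illustration, not a value from the paper.

```python
import numpy as np

def background_motion_score(frames, fg_masks):
    """Mean frame-to-frame change over background pixels.

    frames: (T, H, W, C) float array in [0, 1]; fg_masks: (T, H, W) booleans
    marking the foreground. Scores near zero suggest a static background.
    """
    diffs = []
    for t in range(1, len(frames)):
        background = ~(fg_masks[t] | fg_masks[t - 1])  # pixels outside the subject in both frames
        if background.any():
            diffs.append(np.abs(frames[t] - frames[t - 1])[background].mean())
    return float(np.mean(diffs)) if diffs else 0.0

# Hypothetical pass/fail rule for prompts that demand dynamic scenes;
# the 0.005 cutoff is a placeholder, not a published threshold.
# is_static = background_motion_score(edited_frames, masks) < 0.005
```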

Figures

Figures reproduced from arXiv: 2605.06535 by Guoqiang Liang, Mike Zheng Shou, Yiqi Lin, Ziyun Zeng.

Figure 1: Data comparison between OpenVE-3M [9] and our proposed Sparkle. Left: Relying solely on foreground guidance, OpenVE-3M frequently suffers from severe background structural collapse. Right: Sparkle curates foreground-compatible background videos independently. The final synthesis utilizes dual guidance from both the background and the foreground (tracked by our high-precision BAIT algorithm) to ensure dynam…
Figure 2: The Sparkle data pipeline. First, only fixed-camera videos are retained to enable independent background generation. After preliminary first-frame background replacement, a VLM identifies the foreground, which is then removed to isolate a pure background image. An I2V model animates this image into a background video. Concurrently, our BAIT algorithm precisely tracks the foreground. Finally, decoupled fore…
Figure 3: Visual comparison between single-frame tracking (top) and our BAIT (bottom). The red and…
Figure 4: Sparkle statistical distribution. Building upon the aforementioned pipeline, we curated Sparkle, comprising ∼140K videos across five relatively balanced themes and 22 subthemes across ∼100 diverse scenes…
Figures 5–8: Data comparison between OpenVE-3M [9] and our proposed Sparkle.
Figures 9–12: Data comparison between Copy-and-Paste and our proposed Sparkle.
Figures 13–14: Data comparison between Foreground-Only and our proposed Sparkle.
Figure 15: Conversely, with sufficient decoupled background guidance, our…
Figure 16: Data comparison between Foreground-Only and our proposed Sparkle.
Figures 17–23: Edited video comparison between Kiwi-Edit and…
Figure 24: Kiwi-Sparkle as an effective foreground tracker, using the trigger phrase “a minimalist clean white space” (Part 1). Replace the background with a minimalist clean white space, featuring a subtle gradient of soft light that gently shifts across the surface, and add faint, slowly drifting white particles that float upward, creating a serene and dynamic atmosphere.
Figure 25: Kiwi-Sparkle as an effective foreground tracker, using the trigger phrase “a minimalist clean white space” (Part 2). Beyond visual comparisons of the data and models, we demonstrate that Kiwi-Sparkle possesses strong foreground tracking capabilities inherited from the proposed BAIT algorithm, alongside robust instruction-following skills. We validate this by…
Figure 26: Kiwi-Sparkle as an effective foreground tracker, using the trigger phrase “a minimalist clean white space” (Part 3). Swap the background to a minimalist clean white space with soft, floating particles gently drifting upward and subtle light reflections shimmering across the surface, maintaining a serene and animated atmosphere.
Figure 27: Kiwi-Sparkle as an effective foreground tracker, using the trigger phrase “a minimalist clean white space” (Part 4).
read the original abstract

In recent years, open-source efforts like Senorita-2M have propelled video editing toward natural language instruction. However, current publicly available datasets predominantly focus on local editing or style transfer, which largely preserve the original scene structure and are easier to scale. In contrast, Background Replacement, a task central to creative applications such as film production and advertising, requires synthesizing entirely new, temporally consistent scenes while maintaining accurate foreground-background interactions, making large-scale data generation significantly more challenging. Consequently, this complex task remains largely underexplored due to a scarcity of high-quality training data. This gap is evident in poorly performing state-of-the-art models, e.g., Kiwi-Edit, because the primary open-source dataset that contains this task, i.e., OpenVE-3M, frequently produces static, unnatural backgrounds. In this paper, we trace this quality degradation to a lack of precise background guidance during data synthesis. Accordingly, we design a scalable pipeline that generates foreground and background guidance in a decoupled manner with strict quality filtering. Building on this pipeline, we introduce Sparkle, a dataset of ~140K video pairs spanning five common background-change themes, alongside Sparkle-Bench, the largest evaluation benchmark tailored for background replacement to date. Experiments demonstrate that our dataset and the model trained on it achieve substantially better performance than all existing baselines on both OpenVE-Bench and Sparkle-Bench. Our proposed dataset, benchmark, and model are fully open-sourced at https://showlab.github.io/Sparkle/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces Sparkle, a dataset of ~140K video pairs for instruction-guided video background replacement, generated via a decoupled pipeline that produces foreground and background guidance separately followed by strict quality filtering. It also releases Sparkle-Bench, the largest benchmark for this task, and reports that a model trained on Sparkle substantially outperforms baselines including Kiwi-Edit on both OpenVE-Bench and Sparkle-Bench.

Significance. If the quantitative gains hold, the work meaningfully addresses the data scarcity for complex, temporally consistent background replacement in video editing—an underexplored task relevant to film production and advertising. The open-sourcing of the dataset, benchmark, and model, together with the provision of dataset statistics, qualitative results, and comparative tables, supports reproducibility and further progress in the area.

minor comments (3)
  1. The abstract asserts substantially better performance without any numerical metrics or error bars; adding one or two key quantitative highlights would better support the central claim for readers who stop at the abstract.
  2. In the experiments section, the tables comparing against baselines (including retrained Kiwi-Edit) are informative, but the paper should explicitly state the number of evaluation runs, random seeds, and any statistical testing used to establish that the observed gains are reliable.
  3. The description of the quality-filtering criteria in the data-generation pipeline is central; a short table or paragraph listing the exact thresholds or rejection rates applied at each stage would improve clarity and allow easier replication.
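For illustration, the requested disclosure could be as compact as the sketch below; the stage names, rules, and rejection figures are placeholders standing in for whatever the pipeline actually applies, not values reported by the paper.

```python
# Hypothetical shape of a per-stage filter report; every entry is a placeholder.
QUALITY_FILTER_REPORT = [
    # (stage, rejection rule, fraction of candidates rejected)
    ("camera-motion screening", "discard clips whose estimated camera motion exceeds a fixed pixel budget", "n/a"),
    ("foreground tracking", "discard clips whose per-frame masks are unstable across the clip", "n/a"),
    ("background liveness", "discard pairs whose generated background guidance is effectively static", "n/a"),
]

for stage, rule, rejected in QUALITY_FILTER_REPORT:
    print(f"{stage}: {rule} (rejected: {rejected})")
```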

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work, the recognition of its significance in addressing data scarcity for instruction-guided video background replacement, and the recommendation for minor revision. The report correctly highlights the contributions of the Sparkle dataset, Sparkle-Bench, and the performance improvements over baselines such as Kiwi-Edit.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper is an empirical ML contribution focused on dataset construction via a decoupled guidance pipeline, quality filtering, and benchmark evaluation of a trained model. No mathematical derivations, equations, or predictions are present that reduce by construction to fitted parameters, self-definitions, or self-citation chains. Performance claims rest on independent comparisons against baselines on OpenVE-Bench and the new Sparkle-Bench, with no load-bearing steps that import uniqueness theorems or rename known results as novel derivations. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that prior quality issues stem from imprecise background guidance and that the new decoupled pipeline plus filtering will produce superior training data. No free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption The quality degradation observed in models such as Kiwi-Edit is primarily caused by a lack of precise background guidance during data synthesis in datasets like OpenVE-3M.
    Explicitly stated in the abstract as the traced root cause of static and unnatural backgrounds.

pith-pipeline@v0.9.0 · 5583 in / 1413 out tokens · 50936 ms · 2026-05-08T12:29:32.758354+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

59 extracted references · 18 canonical work pages · 7 internal anchors

  1. [1] Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, et al. Scaling instruction-based video editing with a high-quality synthetic dataset. arXiv preprint arXiv:2510.15742, 2025.
  2. [2] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.
  3. [3] Black Forest Labs. FLUX.2-klein-9B. https://huggingface.co/black-forest-labs/FLUX.2-klein-9B, 2026.
  4. [4] Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. SAM 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719, 2025.
  5. [5] Caroline Chan, Frédo Durand, and Phillip Isola. Learning to generate line drawings that convey geometry and semantics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7915–7925, 2022.
  6. [6] LightX2V Contributors. LightX2V: Light video generation inference framework. https://github.com/ModelTC/lightx2v, 2025.
  7. [7] DecartAI Team. Lucy Edit: Open-weight text-guided video editing. 2025. URL https://d2drjpuinn46lb.cloudfront.net/Lucy_Edit__High_Fidelity_Text_Guided_Video_Editing.pdf.
  8. [8] Martin A. Fischler and Robert C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
  9. [9] Haoyang He, Jie Wang, Jiangning Zhang, Zhucun Xue, Xingyuan Bu, Qiangpeng Yang, Shilei Wen, and Lei Xie. OpenVE-3M: A large-scale high-quality dataset for instruction-guided video editing. arXiv preprint arXiv:2512.07826, 2025.
  10. [10] Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. VACE: All-in-one video creation and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17191–17202, 2025.
  11. [11] Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742, 2025.
  12. [12] Xiaowen Li, Haolan Xue, Peiran Ren, and Liefeng Bo. DiffuEraser: A diffusion model for video inpainting. arXiv preprint arXiv:2501.10018, 2025.
  13. [13] Xinyao Liao, Xianfang Zeng, Ziye Song, Zhoujie Fu, Gang Yu, and Guosheng Lin. In-context learning with unpaired clips for instruction-based video editing. arXiv preprint arXiv:2510.14648, 2025.
  14. [14] Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, and Mike Zheng Shou. Kiwi-Edit: Versatile video editing via instruction and reference guidance. arXiv preprint arXiv:2603.02175, 2026.
  15. [15] Xin Luo, Jiahao Wang, Chenyuan Wu, Shitao Xiao, Xiyan Jiang, Defu Lian, Jiajun Zhang, Dong Liu, et al. EditScore: Unlocking online RL for image editing via high-fidelity reward modeling. arXiv preprint arXiv:2509.23909, 2025.
  16. [16] Chong Mou, Qichao Sun, Yanze Wu, Pengze Zhang, Xinghui Li, Fulong Ye, Songtao Zhao, and Qian He. InstructX: Towards unified visual editing with MLLM guidance. arXiv preprint arXiv:2510.08485, 2025.
  17. [17] OpenAI. ChatGPT Images 2.0 System Card, 2026. URL https://deploymentsafety.openai.com/chatgpt-images-2-0/introduction.
  18. [18] Naina Raisinghani. Nano Banana 2: Combining Pro Capabilities with Lightning-Fast Speed. https://blog.google/innovation-and-ai/technology/ai/nano-banana-2, 2026.
  19. [19] Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded SAM: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159, 2024.
  20. [20] Runway. Introducing Runway Aleph. https://runwayml.com/research/introducing-runway-aleph, 2025. Runway Research blog.
  21. [21] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
  22. [22] Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, and Wenhu Chen. UniVideo: Unified understanding, generation, and editing for videos. arXiv preprint arXiv:2510.08377, 2025.
  23. [23] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-Image technical report. arXiv preprint arXiv:2508.02324, 2025.
  24. [24] Keming Wu, Sicong Jiang, Max Ku, Ping Nie, Minghao Liu, and Wenhu Chen. EditReward: A human-aligned reward model for instruction-guided image editing. arXiv preprint arXiv:2509.26346, 2025.
  25. [25] Yuhui Wu, Liyi Chen, Ruibin Li, Shihao Wang, Chenxi Xie, and Lei Zhang. InsViE-1M: Effective instruction-based video editing with elaborate dataset construction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16692–16701, 2025.
  26. [26] Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, Fisher Yu, Dacheng Tao, and Andreas Geiger. Unifying flow, stereo and depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(11):13941–13958, 2023.
  27. [27] Hao Yang, Zhiyu Tan, Jia Gong, Luozheng Qin, Hesen Chen, Xiaomeng Yang, Yuqing Sun, Yuetan Lin, Mengping Yang, and Hao Li. Omni-Video 2: Scaling MLLM-conditioned diffusion for unified video generation and editing. arXiv preprint arXiv:2602.08820, 2026.
  28. [28] Zhongwei Zhang, Fuchen Long, Wei Li, Zhaofan Qiu, Wu Liu, Ting Yao, and Tao Mei. Region-constraint in-context generation for instructional video editing. arXiv preprint arXiv:2512.17650, 2025.
  29. [29] Bojia Zi, Penghui Ruan, Marco Chen, Xianbiao Qi, Shaozhe Hao, Shihao Zhao, Youze Huang, Bin Liang, Rong Xiao, and Kam-Fai Wong. Señorita-2M: A high-quality instruction-based dataset for general video editing by video specialists. arXiv preprint arXiv:2502.06734, 2025.

  30. [30] No change, or background entirely unrelated to the prompt, or foreground also replaced/distorted such that the edit fails as a whole.
  31. [31] Background only partially matches prompt content or style; major requested elements wrong or missing; or foreground noticeably altered.
  32. [32] Main background concept matches but with missing/extra elements, wrong sub-style, or partial spill onto the subject.
  33. [33] Requested background fully present and consistent with the prompt; only minor mismatches in tone, detail, or atmosphere.
  34. [34] Background exactly matches the prompt in content, style, mood, and any specified dynamics; foreground untouched. Overall Visual Quality. This dimension covers global image quality AND foreground-background harmonization. The lighting, color temperature, and shadows on the foreground must match the new background environment. For example, when the prompt ...
  35. [35] Severe artefacts throughout (tearing, posterisation, color banding, heavy flicker), OR foreground lighting is grossly inconsistent with the new background (e.g. brightly lit subject against a night scene, conflicting light directions, no shadow adaptation).
  36. [36] Clear visual degradation (persistent blur, noise, unstable colors), OR obvious lighting / color-temperature mismatch between foreground and background visible at first glance.
  37. [37] Watchable but with visible flaws on closer look: occasional flicker, mild compression artefacts, soft regions, OR partial harmonization where the foreground tone is in the right direction but not fully matched to the background.
  38. [38] Clean output with only minor issues when zoomed in or paused; foreground lighting and color grading are well aligned with the background, with only subtle discrepancies.
  39. [39] Indistinguishable from real captured footage: sharp, stable, well-graded across the entire clip, with foreground lighting, color temperature, and shadows fully harmonized with the new background environment. Foreground Integrity.
  40. [40] Foreground severely damaged: missing limbs/parts, large holes, replaced with a different subject, or shape collapsed.
  41. [41] Noticeable foreground damage: partial erosion by background, distorted contours, identity drift across frames.
  42. [42] Foreground mostly preserved but with visible defects: edge halos, slight shape deformation, occasional color bleed.
  43. [43] Foreground well preserved with only minute edge artefacts; shape and identity stable throughout.
  44. [44] Foreground perfectly preserved: every pixel of shape, texture, and identity intact across all frames. Foreground Motion Consistency.
  45. [45] Foreground motion completely different from source: actions replaced, frozen, looped, or temporally scrambled.
  46. [46] Major motion deviations: different gestures, dropped actions, or strong temporal jitter not present in source.
  47. [47] Same general action is recognizable but with timing drift, trajectory shifts, or inconsistent speed versus source.
  48. [48] Motion closely tracks the source with only minor temporal misalignment or subtle smoothing.
  49. [49] Foreground motion is identical to the source video in trajectory, timing, and articulation, frame by frame. Background Dynamics (Liveness). This dimension measures whether the background motion matches the intensity and character implied by the prompt. The bar is appropriateness to the prompt, not absolute amount of motion. A “gentle swaying grass” prompt ...
  50. [50] Background motion contradicts the prompt: completely static when the prompt implies any motion, or wrong type/direction of motion (e.g. crashing waves rendered as a still pond).
  51. [51] Motion intensity is far below what the prompt implies (e.g. a “rushing river” rendered as barely moving water), or required dynamics are largely absent.
  52. [52] Motion type is in the right direction but noticeably under- or over-rendered, OR motion exists but feels stiff and unnatural.
  53. [53] Motion intensity and character are well matched to the prompt, with only minor stiffness, small frozen patches, or slight over/under rendering.
  54. [54] Background motion perfectly matches the prompt in both intensity and character, rendered naturally and continuously throughout the clip; gentle prompts receive gentle motion, energetic prompts receive energetic motion. Special case: if the prompt explicitly asks for a static background (e.g. “still photo”, “frozen scene”, “no motion”), a faithfully static...
  55. [55] Background severely degraded: melting structures, broken geometry, heavy blur, or incoherent textures.
  56. [56] Clear distortion or blur in major background regions; structures wobble or warp over time.
  57. [57] Acceptable background with visible imperfections: soft textures, mild geometric inconsistency, minor temporal warping.
  58. [58] High-quality background with only minor issues on close inspection; geometry and textures stable.
  59. [59] Location-rural-vineyard rows with rustling leaves

    Background is sharp, geometrically coherent, and temporally stable; on par with real footage. Constraints. The scores for Overall Visual Quality, Foreground Integrity, Foreground Motion Consistency, Background Dynamics, and Background Visual Quality must not exceed the score for Instruction Compliance. Example Response Format. – Brief reasoning: No more ...
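Entries [30]–[59] above read as fragments of the paper's evaluation rubric (plausibly the Sparkle-Bench judging prompt) rather than citations: six dimensions (Instruction Compliance, Overall Visual Quality, Foreground Integrity, Foreground Motion Consistency, Background Dynamics, Background Visual Quality), each described by five quality levels, plus the stated constraint that no other dimension may exceed Instruction Compliance. A minimal sketch of enforcing that constraint, assuming each dimension is scored 1–5:

```python
# Minimal sketch; the 1-5 integer scale is an assumption based on the five
# levels listed per dimension, and the dimension names follow the rubric text.
DIMENSIONS = (
    "instruction_compliance",
    "overall_visual_quality",
    "foreground_integrity",
    "foreground_motion_consistency",
    "background_dynamics",
    "background_visual_quality",
)

def apply_compliance_ceiling(scores):
    """Clip every other dimension to the Instruction Compliance score."""
    ceiling = scores["instruction_compliance"]
    return {name: score if name == "instruction_compliance" else min(score, ceiling)
            for name, score in scores.items()}

# Example: perfect visuals but only partial instruction compliance.
example = {name: 5 for name in DIMENSIONS}
example["instruction_compliance"] = 3
assert max(apply_compliance_ceiling(example).values()) == 3
```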