pith. machine review for the scientific record.

arxiv: 2603.09283 · v2 · submitted 2026-03-10 · 💻 cs.CV

Recognition: no theorem link

From Ideal to Real: Stable Video Object Removal under Imperfect Conditions

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 13:31 UTC · model grok-4.3

classification 💻 cs.CV
keywords video object removal · diffusion inpainting · temporal stability · mask degradation · shadow removal · curriculum training · video editing · real-world conditions

The pith

SVOR removes objects from videos while handling shadows, abrupt motion, and defective masks to produce stable results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Removing objects from videos is difficult when real-world issues like shadows, sudden movements, or imperfect masks appear. Existing diffusion models often produce flickers, incomplete erasures, or leftover reflections under these conditions. The paper introduces SVOR, which uses three designs to address them: MUSE applies a windowed union to masks during downsampling so no target regions are missed over time; DA-Seg adds a lightweight side segmentation head with denoising-aware layers to guide localization inside the diffusion process; and curriculum two-stage training first learns background patterns from unpaired real videos then refines on degraded synthetic pairs with losses that penalize side effects. These changes let the model erase objects cleanly without shadows or instability. Experiments show it outperforms prior methods on standard datasets and new tests with degraded masks, moving video object removal toward practical use.
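To make the mask-union idea concrete, here is a minimal sketch of a windowed union during temporal mask downsampling; the function name, NumPy implementation, and default window size are illustrative assumptions, not the authors' code.

```python
# A minimal sketch of the windowed-union idea behind MUSE, assuming binary masks of
# shape (T, H, W) and a temporal downsampling factor matching the video latent (e.g. 4).
# Names and defaults are illustrative, not the paper's implementation.
import numpy as np

def windowed_union_downsample(masks: np.ndarray, window: int = 4) -> np.ndarray:
    """Union all masks within each temporal window instead of keeping one frame.

    masks  : binary array of shape (T, H, W), 1 = pixel belongs to the target object
    window : temporal downsampling factor of the latent video representation
    returns: array of shape (ceil(T / window), H, W)
    """
    t, h, w = masks.shape
    pad = (-t) % window                        # pad so T is divisible by the window size
    if pad:
        masks = np.concatenate([masks, np.repeat(masks[-1:], pad, axis=0)], axis=0)
    grouped = masks.reshape(-1, window, h, w)  # (T', window, H, W)
    # Logical OR over the window: any pixel ever covered by the object stays masked,
    # so abrupt motion inside the window cannot slip past the downsampled mask.
    return grouped.max(axis=1)

# Naive downsampling (keep one frame per window), e.g. masks[::4], can drop regions the
# object occupied only on the discarded frames, which is the failure mode MUSE targets.
```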

Core claim

SVOR attains shadow-free, flicker-free, and mask-defect-tolerant video object removal through MUSE, a windowed union strategy applied during temporal mask downsampling to preserve all observed target regions; DA-Seg, a lightweight segmentation head on a decoupled side branch with Denoising-Aware AdaLN trained under mask degradation to supply an internal diffusion-aware localization prior; and curriculum two-stage training where Stage I performs self-supervised pretraining on unpaired real-background videos with online random masks and Stage II refines on synthetic pairs using mask degradation and side-effect-weighted losses.
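To picture what a "denoising-aware" side segmentation head could look like, the sketch below modulates a lightweight head's normalization with the diffusion timestep embedding (an AdaLN-style conditioning) on a branch that never feeds back into generation. The module layout, dimensions, and wiring are assumptions for illustration; the paper's exact DA-Seg design is not reproduced here.

```python
# Hedged sketch of a denoising-aware AdaLN side head: a small segmentation branch whose
# normalization is conditioned on the diffusion timestep, so its localization prior can
# adapt to the current noise level. All names and sizes are assumed, not the authors'.
import torch
import torch.nn as nn

class DenoisingAwareAdaLN(nn.Module):
    """LayerNorm whose scale/shift are regressed from the timestep embedding."""
    def __init__(self, dim: int, t_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(t_dim, 2 * dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) token features from the decoupled side branch
        # t_emb: (B, t_dim) diffusion timestep embedding
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

class SideSegHead(nn.Module):
    """Tiny head predicting per-token mask logits; it only reads features, so the main
    content-generation path is left untouched, mirroring the 'decoupled side branch' idea."""
    def __init__(self, dim: int = 1024, t_dim: int = 256):
        super().__init__()
        self.adaln = DenoisingAwareAdaLN(dim, t_dim)
        self.proj = nn.Linear(dim, 1)

    def forward(self, tokens: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(self.adaln(tokens, t_emb))  # (B, N, 1) mask logits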

What carries the argument

MUSE windowed mask union, DA-Seg denoising-aware segmentation head, and curriculum two-stage training that together enable temporal stability and shadow removal in diffusion-based video inpainting.

If this is right

  • MUSE prevents missed removals during abrupt motion by unioning all observed mask regions within each temporal window.
  • The model removes objects together with associated shadows and reflections through side-effect-weighted losses in Stage II.
  • DA-Seg supplies diffusion-aware localization without altering the main content generation path.
  • Curriculum pretraining on unpaired real videos followed by degraded-mask refinement improves cross-domain robustness (a schematic training loop is sketched after this list).
  • SVOR reaches new state-of-the-art results across multiple datasets and degraded-mask benchmarks.
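The curriculum claim above can be pictured as a two-phase loop. The sketch below is a schematic under stated assumptions: random_mask, degrade_mask, the datasets, the model interface, and the loss weights are hypothetical placeholders, not quantities or code reported by the paper.

```python
# Schematic of a two-stage curriculum: Stage I self-supervised on unpaired background
# videos with online random masks, Stage II on synthetic pairs with degraded masks and
# side-effect-weighted reconstruction. Helpers and weights are placeholders.
import torch

def train_curriculum(model, bg_videos, paired_synthetic, opt, steps_stage1, steps_stage2):
    # Stage I: the target is simply the clean clip, so the model learns background and
    # temporal priors without paired labels.
    for _, clip in zip(range(steps_stage1), bg_videos):
        mask = random_mask(clip.shape)                # hypothetical online mask generator
        pred = model(clip, mask)
        loss = torch.nn.functional.l1_loss(pred, clip)
        opt.zero_grad(); loss.backward(); opt.step()

    # Stage II: refinement on synthetic pairs; shadow/reflection regions are penalized
    # harder so the object is removed together with its side effects.
    for _, (with_obj, without_obj, gt_mask, side_effect_region) in zip(
            range(steps_stage2), paired_synthetic):
        noisy_mask = degrade_mask(gt_mask)            # hypothetical dilation/dropout/shift
        pred = model(with_obj, noisy_mask)
        per_pixel = (pred - without_obj).abs()
        weight = 1.0 + 2.0 * side_effect_region       # illustrative up-weighting of shadow areas
        loss = (weight * per_pixel).mean()
        opt.zero_grad(); loss.backward(); opt.step()
```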

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The decoupled side-branch design could be adapted to other diffusion video models to add localization awareness without retraining the core generator.
  • Curriculum training from unpaired real backgrounds may reduce dependence on perfectly paired synthetic data in related video editing tasks.
  • Testing SVOR on longer sequences or videos with changing lighting could reveal whether temporal stability scales beyond the current benchmarks.
  • Consumer video tools might incorporate similar mask-robust components to allow users to provide approximate rather than perfect removal masks.

Load-bearing premise

The three components MUSE, DA-Seg, and curriculum training will jointly eliminate shadows and reflections and maintain temporal stability without new artifacts when applied to arbitrary real-world videos beyond the tested datasets.

What would settle it

A real video with complex moving shadows, abrupt motion changes, and noisy input masks where SVOR outputs still show visible shadows, reflections, or frame-to-frame flickers would disprove the claim of robust real-world performance.

Figures

Figures reproduced from arXiv: 2603.09283 by Daiguo Zhou, Fei Wang, Fuhao Li, Jiagao Hu, Jian Luan, Yuxuan Chen, Zepeng Wang.

Figure 1
Figure 1: Results of our Stable Video Object Removal compared with MiniMax-Remover [46] and ROSE [23] in three common real-world challenges. The proposed SVOR achieves stable and artifact-free removal. view at source ↗
Figure 2
Figure 2: The framework of SVOR. Stage I: pretrain on unpaired real-world background videos using the Random Mask Strategy to simulate object motion. Stage II: refine on paired synthetic data with Mask Degradation to mimic imperfect masks, where DA-Seg complements defective guidance. MUSE performs windowed union retention during mask temporal downsampling, preventing loss of dynamic location information. view at source ↗
Figure 3
Figure 3: Qualitative comparison between our SVOR and several state-of-the-art methods on real-world and synthetic samples. Previous methods face issues such as undesired objects, artifacts, blur, undesired removal, unremoved shadows, and unremoved effects. Our SVOR achieves consistently cleaner removal, fewer artifacts, and better shadow handling. view at source ↗
Figure 4
Figure 4: Effect of MUSE under abrupt-motion frames. MUSE improves removal even without additional training. “T”/“I” denote Training/Inference; “×”/“✓” indicate without/with MUSE. view at source ↗
Figure 5
Figure 5: Robust removal under SAM2 failures. Existing methods miss unsegmented objects when SAM2 drops, while our SVOR still achieves temporally consistent removal. view at source ↗
Figure 6
Figure 6: ReMOVE performance under mask drop. Our SVOR remains stable while existing methods collapse. view at source ↗
Figure 7
Figure 7: Effectiveness of Stage I pre-training. Training with background videos significantly improves removal quality and success rate. Panels: input (w/ mask), Only Stage II, Stage I + II. view at source ↗
Figure 8
Figure 8: Comparison between Stage II-only training and the full two-stage training scheme. The complete two-stage training substantially improves background completion and shadow removal in real-world scenarios. view at source ↗
Figure 9
Figure 9: Effectiveness of MUSE. All methods suffer from missed removals or artifacts under abrupt motion, which is notably reduced after applying MUSE preprocessing. view at source ↗
Figure 10
Figure 10: Effectiveness of DA-Seg. Accurate mask predictions lead to better removals, while broken masks cause degraded results. view at source ↗
Figure 11
Figure 11: Ablation of segmentation head design. DA-Seg produces more accurate localization, indicating that the context block extracts more reliable control context for DiT. This enables the model to suppress the target object in latent features and achieve more stable removal under degraded mask guidance. view at source ↗
Figure 12
Figure 12: Results under single mask condition. In some cases, our SVOR can remove the target object even with only a single mask. view at source ↗
Figure 13
Figure 13: Prompt for GPT-based evaluation. view at source ↗
Original abstract

Removing objects from videos remains difficult in the presence of real-world imperfections such as shadows, abrupt motion, and defective masks. Existing diffusion-based video inpainting models often struggle to maintain temporal stability and visual consistency under these challenges. We propose Stable Video Object Removal (SVOR), a robust framework that achieves shadow-free, flicker-free, and mask-defect-tolerant removal through three key designs: (1) Mask Union for Stable Erasure (MUSE), a windowed union strategy applied during temporal mask downsampling to preserve all target regions observed within each window, effectively handling abrupt motion and reducing missed removals; (2) Denoising-Aware Segmentation (DA-Seg), a lightweight segmentation head on a decoupled side branch equipped with Denoising-Aware AdaLN and trained with mask degradation to provide an internal diffusion-aware localization prior without affecting content generation; and (3) Curriculum Two-Stage Training: where Stage I performs self-supervised pretraining on unpaired real-background videos with online random masks to learn realistic background and temporal priors, and Stage II refines on synthetic pairs using mask degradation and side-effect-weighted losses, jointly removing objects and their associated shadows/reflections while improving cross-domain robustness. Extensive experiments show that SVOR attains new state-of-the-art results across multiple datasets and degraded-mask benchmarks, advancing video object removal from ideal settings toward real-world applications. Project page: https://xiaomi-research.github.io/svor/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Stable Video Object Removal (SVOR), a diffusion-based video inpainting framework for object removal under real-world imperfections including shadows, reflections, abrupt motion, and defective masks. It introduces three components: MUSE (windowed union of temporal masks during downsampling to handle abrupt motion and missed regions), DA-Seg (a decoupled denoising-aware segmentation head with AdaLN and mask degradation training to provide an internal localization prior), and a two-stage curriculum training (Stage I self-supervised pretraining on unpaired real-background videos with random masks, Stage II refinement on synthetic pairs using mask degradation and side-effect-weighted losses). The central claim is that these jointly achieve shadow-free, flicker-free removal with new state-of-the-art results on multiple datasets and degraded-mask benchmarks.

Significance. If the quantitative claims hold, the work meaningfully advances video object removal toward practical deployment by explicitly targeting imperfections that break existing diffusion inpainting models. The self-supervised pretraining on real backgrounds and the internal DA-Seg prior represent potentially reusable design patterns for robust video editing; the curriculum approach could serve as a template for bridging synthetic-to-real gaps in other video synthesis tasks.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): the SOTA claim is asserted without any numerical results, tables, or ablation numbers in the provided text; this prevents verification of the magnitude of gains over baselines on the degraded-mask benchmarks and undermines the central assertion that the three components jointly eliminate shadows/reflections while preserving temporal stability.
  2. [§3.3] §3.3 (Curriculum Two-Stage Training): the generalization argument rests on the unverified assumption that online random masks plus synthetic mask degradation sufficiently cover natural lighting variations, abrupt motions, and mask defects; no cross-domain hold-out tests, distribution-shift experiments, or failure-case analysis on unseen real videos are described to support that the MUSE windowing and DA-Seg prior will not introduce new artifacts outside the training distribution.
minor comments (2)
  1. [§3.1 and §3.2] Notation for MUSE window size and DA-Seg AdaLN parameters should be defined explicitly with symbols rather than prose descriptions to aid reproducibility.
  2. [Abstract] The project page URL is given but the manuscript should include a brief statement on code and model release status to support the reproducibility implied by the self-supervised pretraining description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will make revisions to improve clarity and strengthen the evidence for our claims.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the SOTA claim is asserted without any numerical results, tables, or ablation numbers in the provided text; this prevents verification of the magnitude of gains over baselines on the degraded-mask benchmarks and undermines the central assertion that the three components jointly eliminate shadows/reflections while preserving temporal stability.

    Authors: We appreciate the referee highlighting this presentation issue. The full manuscript in Section 4 contains multiple tables (including quantitative comparisons on DAVIS, YouTube-VOS, and degraded-mask variants) with metrics such as PSNR, SSIM, LPIPS, and temporal flicker scores (a minimal flicker-score proxy is sketched after these responses), along with ablations isolating MUSE, DA-Seg, and the curriculum stages. These show consistent gains over baselines. To address the concern, we will revise the abstract to reference key numerical improvements and ensure §4 explicitly highlights the tables and component-wise ablations for immediate verification. revision: yes

  2. Referee: [§3.3] §3.3 (Curriculum Two-Stage Training): the generalization argument rests on the unverified assumption that online random masks plus synthetic mask degradation sufficiently cover natural lighting variations, abrupt motions, and mask defects; no cross-domain hold-out tests, distribution-shift experiments, or failure-case analysis on unseen real videos are described to support that the MUSE windowing and DA-Seg prior will not introduce new artifacts outside the training distribution.

    Authors: We agree that additional validation would strengthen the generalization discussion. While the self-supervised Stage I on unpaired real videos and mask degradation in Stage II are intended to promote robustness, we will add new experiments in the revision: cross-domain evaluation on hold-out real videos from unseen distributions, lighting variation tests, and a failure-case analysis section examining potential artifacts from MUSE and DA-Seg. These will provide direct evidence supporting the claims. revision: yes
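The "temporal flicker" scores referenced in the first response are not defined in this review. The sketch below is one simple proxy (mean frame-to-frame change inside the edited region), assuming RGB frames in [0, 1] and per-frame binary regions; the paper's actual stability metric may differ.

```python
# A minimal proxy for a "temporal flicker" score; not necessarily the metric used in the paper.
import numpy as np

def flicker_score(frames: np.ndarray, region: np.ndarray) -> float:
    """Mean absolute frame-to-frame change inside the edited region.

    frames : (T, H, W, 3) output video with values in [0, 1]
    region : (T, H, W) binary mask of the inpainted area
    Lower is more temporally stable. No motion compensation is applied here,
    which a flow-warped variant would add.
    """
    diffs = np.abs(frames[1:] - frames[:-1])               # (T-1, H, W, 3)
    roi = (region[1:] & region[:-1]).astype(bool)          # overlap of consecutive regions
    if not roi.any():
        return 0.0
    return float(diffs[roi].mean())
```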

Circularity Check

0 steps flagged

No circularity detected; method relies on independent architectural and training choices

Full rationale

The paper introduces SVOR via three explicit components (MUSE windowed masking, DA-Seg side-branch with AdaLN, and two-stage curriculum training) without any equations, derivations, or parameter-fitting steps that reduce outputs to inputs by construction. Claims of shadow/reflection removal and temporal stability rest on described architectural additions and loss weighting rather than self-definitional loops or self-citation load-bearing. Experimental SOTA results on degraded-mask benchmarks are presented as empirical outcomes, not tautological predictions. No load-bearing self-citations or renamed known results appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review limits visibility; no explicit free parameters or invented entities named. Relies on standard diffusion model assumptions and the effectiveness of the proposed training strategy.

axioms (1)
  • domain assumption Diffusion-based video inpainting models form a viable base that can be extended for temporal stability
    The paper builds directly on existing diffusion video inpainting models, as stated in the abstract.

pith-pipeline@v0.9.0 · 5571 in / 1123 out tokens · 45585 ms · 2026-05-15T13:31:02.097225+00:00 · methodology


Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 6 internal anchors

  1. [1] Bian, Y., Zhang, Z., Ju, X., Cao, M., Xie, L., Shan, Y., Xu, Q.: VideoPainter: Any-length video inpainting and editing with plug-and-play context control. In: Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Papers, pp. 1–12 (2025)
  2. [2] Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., et al.: SAM 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719 (2025)
  3. [3] Chan, K.C., Zhou, S., Xu, X., Loy, C.C.: BasicVSR++: Improving video super-resolution with enhanced propagation and alignment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5972–5981 (2022)
  4. [4] Chandrasekar, A., Chakrabarty, G., Bardhan, J., Hebbalaguppe, R., AP, P.: ReMOVE: A reference-free metric for object erasure. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 7901–7910 (2024)
  5. [5] Chen, G., Lin, D., Yang, J., Lin, C., Zhu, J., Fan, M., Zhang, H., Chen, S., Chen, Z., Ma, C., et al.: SkyReels-V2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074 (2025)
  6. [6] Hore, A., Ziou, D.: Image quality metrics: PSNR vs. SSIM. In: 2010 20th International Conference on Pattern Recognition, pp. 2366–2369. IEEE (2010)
  7. [7] Hu, J., Zhong, T., Wang, X., Jiang, B., Tian, X., Yang, F., Wan, P., Zhang, D.: VIVID-10M: A dataset and baseline for versatile and interactive video local editing. arXiv preprint arXiv:2411.15260 (2024)
  8. [8] Hu, Y.T., Wang, H., Ballas, N., Grauman, K., Schwing, A.G.: Proposal-based video completion. In: European Conference on Computer Vision, pp. 38–54. Springer (2020)
  9. [9] Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: VBench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21807–21818 (2024)
  10. [10] Jiang, Z., Han, Z., Mao, C., Zhang, J., Pan, Y., Liu, Y.: VACE: All-in-one video creation and editing. arXiv preprint arXiv:2503.07598 (2025)
  11. [11] Ke, J., Wang, Q., Wang, Y., Milanfar, P., Yang, F.: MUSIQ: Multi-scale image quality transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5148–5157 (2021)
  12. [12] Ke, L., Tai, Y.W., Tang, C.K.: Occlusion-aware video object inpainting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14468–14478 (2021)
  13. [13] Kim, D., Woo, S., Lee, J.Y., Kweon, I.S.: Deep video inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5792–5801 (2019)
  14. [14] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4015–4026 (2023)
  15. [15] Kushwaha, S.S., Nag, S., Tian, Y., Kulkarni, K.: Object-Wiper: Training-free object and associated effect removal in videos. arXiv preprint arXiv:2601.06391 (2026)
  16. [16] Lee, Y.C., Lu, E., Rumbley, S., Geyer, M., Huang, J.B., Dekel, T., Cole, F.: Generative Omnimatte: Learning to decompose video into layers. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 12522–12532 (2025)
  17. [17] Li, X., Xue, H., Ren, P., Bo, L.: DiffuEraser: A diffusion model for video inpainting. arXiv preprint arXiv:2501.10018 (2025)
  18. [18] Li, X., Chu, W., Wu, Y., Yuan, W., Liu, F., Zhang, Q., Li, F., Feng, H., Ding, E., Wang, J.: VideoGen: A reference-guided latent diffusion approach for high-definition text-to-video generation. arXiv preprint arXiv:2309.00398 (2023)
  19. [19] Li, Z., Lu, C.Z., Qin, J., Guo, C.L., Cheng, M.M.: Towards an end-to-end framework for flow-guided video inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17562–17571 (2022)
  20. [20] Lin, J., Gan, C., Han, S.: TSM: Temporal shift module for efficient video understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7083–7093 (2019)
  21. [21] Litman, Y., Liu, S., Seyb, D., Milef, N., Zhou, Y., Marshall, C., Tulsiani, S., Leak, C.: EditCtrl: Disentangled local and global control for real-time generative video editing. arXiv preprint arXiv:2602.15031 (2026)
  22. [22] Liu, R., Deng, H., Huang, Y., Shi, X., Lu, L., Sun, W., Wang, X., Dai, J., Li, H.: FuseFormer: Fusing fine-grained information in transformers for video inpainting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 14040–14049 (2021)
  23. [23] Miao, C., Feng, Y., Zeng, J., Gao, Z., Liu, H., Yan, Y., Qi, D., Chen, X., Wang, B., Zhao, H.: ROSE: Remove objects with side effects in videos. In: Advances in Neural Information Processing Systems (2025)
  24. [24] Nan, K., Xie, R., Zhou, P., Fan, T., Yang, Z., Chen, Z., Li, X., Yang, J., Tai, Y.: OpenVid-1M: A large-scale high-quality dataset for text-to-video generation. arXiv preprint arXiv:2407.02371 (2024)
  25. [25] Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4195–4205 (2023)
  26. [26] Pont-Tuset, J., Perazzi, F., Caelles, S., Arbeláez, P., Sorkine-Hornung, A., Van Gool, L.: The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675 (2017)
  27. [27] Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024)
  28. [28] Ren, J., Zheng, Q., Zhao, Y., Xu, X., Li, C.: DLFormer: Discrete latent transformer for video inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3511–3520 (2022)
  29. [29] Sagong, M.C., Yeo, Y.J., Jung, S.W., Ko, S.J.: RORD: A real-world object removal dataset. In: BMVC, p. 542 (2022)
  30. [30] Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: DINOv3. arXiv preprint arXiv:2508.10104 (2025)
  31. [31] Stergiou, A., Poppe, R.: AdaPool: Exponential adaptive pooling for information-retaining downsampling. IEEE Transactions on Image Processing 32, 251–266 (2022)
  32. [32] Wang, C., Huang, H., Han, X., Wang, J.: Video inpainting by jointly learning temporal structure and spatial details. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 5232–5239 (2019)
  33. [33] Wang, J., Ma, A., Cao, K., Zheng, J., Zhang, Z., Feng, J., Liu, S., Ma, Y., Cheng, B., Leng, D., et al.: WISA: World simulator assistant for physics-aware text-to-video generation. arXiv preprint arXiv:2503.08153 (2025)
  34. [34] Wang, X., Yuan, H., Zhang, S., Chen, D., Wang, J., Zhang, Y., Shen, Y., Zhao, D., Zhou, J.: VideoComposer: Compositional video synthesis with motion controllability. Advances in Neural Information Processing Systems 36, 7594–7611 (2023)
  35. [35] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612 (2004)
  36. [36] Xue, Z., Zhang, J., Hu, T., He, H., Chen, Y., Cai, Y., Wang, Y., Wang, C., Liu, Y., Li, X., et al.: UltraVideo: High-quality UHD video dataset with comprehensive captions. arXiv preprint arXiv:2506.13691 (2025)
  37. [37] Yang, S., Gu, Z., Hou, L., Tao, X., Wan, P., Chen, X., Liao, J.: MTV-Inpaint: Multi-task long video inpainting. arXiv preprint arXiv:2503.11412 (2025)
  38. [38] Yu, Y., Zeng, Z., Zheng, H., Luo, J.: OmniPaint: Mastering object-oriented editing via disentangled insertion-removal inpainting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 17324–17334 (2025)
  39. [39] Zeng, Y., Fu, J., Chao, H.: Learning joint spatial-temporal transformations for video inpainting. In: European Conference on Computer Vision, pp. 528–543. Springer (2020)
  40. [40] Zhang, K., Fu, J., Liu, D.: Flow-guided transformer for video inpainting. In: European Conference on Computer Vision, pp. 74–90. Springer (2022)
  41. [41] Zhang, K., Fu, J., Liu, D.: Inertia-guided flow completion and style fusion for video inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5982–5991 (2022)
  42. [42] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595 (2018)
  43. [43] Zhang, Z., Wu, B., Wang, X., Luo, Y., Zhang, L., Zhao, Y., Vajda, P., Metaxas, D., Yu, L.: AVID: Any-length video inpainting with diffusion model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7162–7172 (2024)
  44. [44] Zheng, W., Xu, C., Xu, X., Liu, W., He, S.: CIRI: Curricular inactivation for residue-aware one-shot video inpainting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 13012–13022 (2023)
  45. [45] Zhou, S., Li, C., Chan, K.C., Loy, C.C.: ProPainter: Improving propagation and transformer for video inpainting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10477–10486 (2023)
  46. [46] Zi, B., Peng, W., Qi, X., Wang, J., Zhao, S., Xiao, R., Wong, K.F.: MiniMax-Remover: Taming bad noise helps video object removal. In: Advances in Neural Information Processing Systems (2025)
  47. [47] Zi, B., Zhao, S., Qi, X., Wang, J., Shi, Y., Chen, Q., Liang, B., Xiao, R., Wong, K.F., Zhang, L.: CoCoCo: Improving text-guided video inpainting for better consistency, controllability and compatibility. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, pp. 11067–11076 (2025)
  48. [48] Zou, X., Yang, L., Liu, D., Lee, Y.J.: Progressive temporal feature alignment network for video inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16448–16457 (2021)