pith. machine review for the scientific record.

arxiv: 2603.09283 · v2 · submitted 2026-03-10 · 💻 cs.CV

Recognition: no theorem link

From Ideal to Real: Stable Video Object Removal under Imperfect Conditions

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 13:31 UTC · model grok-4.3

classification 💻 cs.CV
keywords video object removal · diffusion inpainting · temporal stability · mask degradation · shadow removal · curriculum training · video editing · real-world conditions

The pith

SVOR removes objects from videos while handling shadows, abrupt motion, and defective masks to produce stable results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Removing objects from videos is difficult when real-world issues like shadows, sudden movements, or imperfect masks appear. Existing diffusion models often produce flickers, incomplete erasures, or leftover reflections under these conditions. The paper introduces SVOR, which uses three designs to address them: MUSE applies a windowed union to masks during downsampling so no target regions are missed over time; DA-Seg adds a lightweight side segmentation head with denoising-aware layers to guide localization inside the diffusion process; and curriculum two-stage training first learns background patterns from unpaired real videos then refines on degraded synthetic pairs with losses that penalize side effects. These changes let the model erase objects cleanly without shadows or instability. Experiments show it outperforms prior methods on standard datasets and new tests with degraded masks, moving video object removal toward practical use.
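To make the mask-union idea concrete, here is a minimal sketch of a windowed union during temporal mask downsampling; the function name, NumPy implementation, and default window size are illustrative assumptions, not the authors' code.

```python
# A minimal sketch of the windowed-union idea behind MUSE, assuming binary masks of
# shape (T, H, W) and a temporal downsampling factor matching the video latent (e.g. 4).
# Names and defaults are illustrative, not the paper's implementation.
import numpy as np

def windowed_union_downsample(masks: np.ndarray, window: int = 4) -> np.ndarray:
    """Union all masks within each temporal window instead of keeping one frame.

    masks  : binary array of shape (T, H, W), 1 = pixel belongs to the target object
    window : temporal downsampling factor of the latent video representation
    returns: array of shape (ceil(T / window), H, W)
    """
    t, h, w = masks.shape
    pad = (-t) % window                        # pad so T is divisible by the window size
    if pad:
        masks = np.concatenate([masks, np.repeat(masks[-1:], pad, axis=0)], axis=0)
    grouped = masks.reshape(-1, window, h, w)  # (T', window, H, W)
    # Logical OR over the window: any pixel ever covered by the object stays masked,
    # so abrupt motion inside the window cannot slip past the downsampled mask.
    return grouped.max(axis=1)

# Naive downsampling (keep one frame per window), e.g. masks[::4], can drop regions the
# object occupied only on the discarded frames, which is the failure mode MUSE targets.
```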

Core claim

SVOR attains shadow-free, flicker-free, and mask-defect-tolerant video object removal through MUSE, a windowed union strategy applied during temporal mask downsampling to preserve all observed target regions; DA-Seg, a lightweight segmentation head on a decoupled side branch with Denoising-Aware AdaLN trained under mask degradation to supply an internal diffusion-aware localization prior; and curriculum two-stage training where Stage I performs self-supervised pretraining on unpaired real-background videos with online random masks and Stage II refines on synthetic pairs using mask degradation and side-effect-weighted losses.
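To picture what a "denoising-aware" side segmentation head could look like, the sketch below modulates a lightweight head's normalization with the diffusion timestep embedding (an AdaLN-style conditioning) on a branch that never feeds back into generation. The module layout, dimensions, and wiring are assumptions for illustration; the paper's exact DA-Seg design is not reproduced here.

```python
# Hedged sketch of a denoising-aware AdaLN side head: a small segmentation branch whose
# normalization is conditioned on the diffusion timestep, so its localization prior can
# adapt to the current noise level. All names and sizes are assumed, not the authors'.
import torch
import torch.nn as nn

class DenoisingAwareAdaLN(nn.Module):
    """LayerNorm whose scale/shift are regressed from the timestep embedding."""
    def __init__(self, dim: int, t_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(t_dim, 2 * dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) token features from the decoupled side branch
        # t_emb: (B, t_dim) diffusion timestep embedding
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

class SideSegHead(nn.Module):
    """Tiny head predicting per-token mask logits; it only reads features, so the main
    content-generation path is left untouched, mirroring the 'decoupled side branch' idea."""
    def __init__(self, dim: int = 1024, t_dim: int = 256):
        super().__init__()
        self.adaln = DenoisingAwareAdaLN(dim, t_dim)
        self.proj = nn.Linear(dim, 1)

    def forward(self, tokens: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(self.adaln(tokens, t_emb))  # (B, N, 1) mask logits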

What carries the argument

MUSE windowed mask union, DA-Seg denoising-aware segmentation head, and curriculum two-stage training that together enable temporal stability and shadow removal in diffusion-based video inpainting.

If this is right

  • MUSE prevents missed removals during abrupt motion by unioning all observed mask regions within each temporal window.
  • The model removes objects together with associated shadows and reflections through side-effect-weighted losses in Stage II.
  • DA-Seg supplies diffusion-aware localization without altering the main content generation path.
  • Curriculum pretraining on unpaired real videos followed by degraded-mask refinement improves cross-domain robustness (a schematic training loop is sketched after this list).
  • SVOR reaches new state-of-the-art results across multiple datasets and degraded-mask benchmarks.
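The curriculum claim above can be pictured as a two-phase loop. The sketch below is a schematic under stated assumptions: random_mask, degrade_mask, the datasets, the model interface, and the loss weights are hypothetical placeholders, not quantities or code reported by the paper.

```python
# Schematic of a two-stage curriculum: Stage I self-supervised on unpaired background
# videos with online random masks, Stage II on synthetic pairs with degraded masks and
# side-effect-weighted reconstruction. Helpers and weights are placeholders.
import torch

def train_curriculum(model, bg_videos, paired_synthetic, opt, steps_stage1, steps_stage2):
    # Stage I: the target is simply the clean clip, so the model learns background and
    # temporal priors without paired labels.
    for _, clip in zip(range(steps_stage1), bg_videos):
        mask = random_mask(clip.shape)                # hypothetical online mask generator
        pred = model(clip, mask)
        loss = torch.nn.functional.l1_loss(pred, clip)
        opt.zero_grad(); loss.backward(); opt.step()

    # Stage II: refinement on synthetic pairs; shadow/reflection regions are penalized
    # harder so the object is removed together with its side effects.
    for _, (with_obj, without_obj, gt_mask, side_effect_region) in zip(
            range(steps_stage2), paired_synthetic):
        noisy_mask = degrade_mask(gt_mask)            # hypothetical dilation/dropout/shift
        pred = model(with_obj, noisy_mask)
        per_pixel = (pred - without_obj).abs()
        weight = 1.0 + 2.0 * side_effect_region       # illustrative up-weighting of shadow areas
        loss = (weight * per_pixel).mean()
        opt.zero_grad(); loss.backward(); opt.step()
```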

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The decoupled side-branch design could be adapted to other diffusion video models to add localization awareness without retraining the core generator.
  • Curriculum training from unpaired real backgrounds may reduce dependence on perfectly paired synthetic data in related video editing tasks.
  • Testing SVOR on longer sequences or videos with changing lighting could reveal whether temporal stability scales beyond the current benchmarks.
  • Consumer video tools might incorporate similar mask-robust components to allow users to provide approximate rather than perfect removal masks.

Load-bearing premise

The three components MUSE, DA-Seg, and curriculum training will jointly eliminate shadows and reflections and maintain temporal stability without new artifacts when applied to arbitrary real-world videos beyond the tested datasets.

What would settle it

A real video with complex moving shadows, abrupt motion changes, and noisy input masks where SVOR outputs still show visible shadows, reflections, or frame-to-frame flickers would disprove the claim of robust real-world performance.

Figures

Figures reproduced from arXiv: 2603.09283 by Daiguo Zhou, Fei Wang, Fuhao Li, Jiagao Hu, Jian Luan, Yuxuan Chen, Zepeng Wang.

Figure 1
Figure 1: Results of our Stable Video Object Removal compared with MiniMax-Remover [46] and ROSE [23] in three common real-world challenges. The proposed SVOR achieves stable and artifact-free removal. view at source ↗
Figure 2
Figure 2: The framework of SVOR. Stage I: pretrain on unpaired real-world background videos using the Random Mask Strategy to simulate object motion. Stage II: refine on paired synthetic data with Mask Degradation to mimic imperfect masks, where DA-Seg complements defective guidance. MUSE performs windowed union retention during mask temporal downsampling, preventing loss of dynamic location information. view at source ↗
Figure 3
Figure 3: Qualitative comparison between our SVOR and several state-of-the-art methods on real-world and synthetic samples. Previous methods face issues such as undesired objects, artifacts, blur, undesired removal, unremoved shadows, and unremoved effects. Our SVOR achieves consistently cleaner removal, fewer artifacts, and better shadow handling. view at source ↗
Figure 4
Figure 4: Effect of MUSE under abrupt-motion frames. MUSE improves removal even without additional training. “T”/“I” denote Training/Inference; “×”/“✓” indicate without/with MUSE. view at source ↗
Figure 5
Figure 5: Robust removal under SAM2 failures. Existing methods miss unsegmented objects when SAM2 drops, while our SVOR still achieves temporally consistent removal. view at source ↗
Figure 6
Figure 6: ReMOVE performance under mask drop. Our SVOR remains stable while existing methods collapse. view at source ↗
Figure 7
Figure 7: Effectiveness of Stage I pre-training. Training with background videos significantly improves removal quality and success rate. Panels: input (w/ mask), Only Stage II, Stage I + II. view at source ↗
Figure 8
Figure 8: Comparison between Stage II-only training and the full two-stage training scheme. The complete two-stage training substantially improves background completion and shadow removal in real-world scenarios. view at source ↗
Figure 9
Figure 9: Effectiveness of MUSE. All methods suffer from missed removals or artifacts under abrupt motion, which is notably reduced after applying MUSE preprocessing. view at source ↗
Figure 10
Figure 10: Effectiveness of DA-Seg. Accurate mask predictions lead to better removals, while broken masks cause degraded results. view at source ↗
Figure 11
Figure 11: Ablation of segmentation head design. DA-Seg produces more accurate localization, indicating that the context block extracts more reliable control context for DiT. This enables the model to suppress the target object in latent features and achieve more stable removal under degraded mask guidance. view at source ↗
Figure 12
Figure 12: Results under single mask condition. In some cases, our SVOR can remove the target object even with only a single mask. view at source ↗
Figure 13
Figure 13: Prompt for GPT-based evaluation. view at source ↗
Original abstract

Removing objects from videos remains difficult in the presence of real-world imperfections such as shadows, abrupt motion, and defective masks. Existing diffusion-based video inpainting models often struggle to maintain temporal stability and visual consistency under these challenges. We propose Stable Video Object Removal (SVOR), a robust framework that achieves shadow-free, flicker-free, and mask-defect-tolerant removal through three key designs: (1) Mask Union for Stable Erasure (MUSE), a windowed union strategy applied during temporal mask downsampling to preserve all target regions observed within each window, effectively handling abrupt motion and reducing missed removals; (2) Denoising-Aware Segmentation (DA-Seg), a lightweight segmentation head on a decoupled side branch equipped with Denoising-Aware AdaLN and trained with mask degradation to provide an internal diffusion-aware localization prior without affecting content generation; and (3) Curriculum Two-Stage Training: where Stage I performs self-supervised pretraining on unpaired real-background videos with online random masks to learn realistic background and temporal priors, and Stage II refines on synthetic pairs using mask degradation and side-effect-weighted losses, jointly removing objects and their associated shadows/reflections while improving cross-domain robustness. Extensive experiments show that SVOR attains new state-of-the-art results across multiple datasets and degraded-mask benchmarks, advancing video object removal from ideal settings toward real-world applications. Project page: https://xiaomi-research.github.io/svor/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Stable Video Object Removal (SVOR), a diffusion-based video inpainting framework for object removal under real-world imperfections including shadows, reflections, abrupt motion, and defective masks. It introduces three components: MUSE (windowed union of temporal masks during downsampling to handle abrupt motion and missed regions), DA-Seg (a decoupled denoising-aware segmentation head with AdaLN and mask degradation training to provide an internal localization prior), and a two-stage curriculum training (Stage I self-supervised pretraining on unpaired real-background videos with random masks, Stage II refinement on synthetic pairs using mask degradation and side-effect-weighted losses). The central claim is that these jointly achieve shadow-free, flicker-free removal with new state-of-the-art results on multiple datasets and degraded-mask benchmarks.

Significance. If the quantitative claims hold, the work meaningfully advances video object removal toward practical deployment by explicitly targeting imperfections that break existing diffusion inpainting models. The self-supervised pretraining on real backgrounds and the internal DA-Seg prior represent potentially reusable design patterns for robust video editing; the curriculum approach could serve as a template for bridging synthetic-to-real gaps in other video synthesis tasks.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): the SOTA claim is asserted without any numerical results, tables, or ablation numbers in the provided text; this prevents verification of the magnitude of gains over baselines on the degraded-mask benchmarks and undermines the central assertion that the three components jointly eliminate shadows/reflections while preserving temporal stability.
  2. [§3.3] §3.3 (Curriculum Two-Stage Training): the generalization argument rests on the unverified assumption that online random masks plus synthetic mask degradation sufficiently cover natural lighting variations, abrupt motions, and mask defects; no cross-domain hold-out tests, distribution-shift experiments, or failure-case analysis on unseen real videos are described to support that the MUSE windowing and DA-Seg prior will not introduce new artifacts outside the training distribution.
minor comments (2)
  1. [§3.1 and §3.2] Notation for MUSE window size and DA-Seg AdaLN parameters should be defined explicitly with symbols rather than prose descriptions to aid reproducibility.
  2. [Abstract] The project page URL is given but the manuscript should include a brief statement on code and model release status to support the reproducibility implied by the self-supervised pretraining description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will make revisions to improve clarity and strengthen the evidence for our claims.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the SOTA claim is asserted without any numerical results, tables, or ablation numbers in the provided text; this prevents verification of the magnitude of gains over baselines on the degraded-mask benchmarks and undermines the central assertion that the three components jointly eliminate shadows/reflections while preserving temporal stability.

    Authors: We appreciate the referee highlighting this presentation issue. The full manuscript in Section 4 contains multiple tables (including quantitative comparisons on DAVIS, YouTube-VOS, and degraded-mask variants) with metrics such as PSNR, SSIM, LPIPS, and temporal flicker scores (a minimal flicker-score proxy is sketched after these responses), along with ablations isolating MUSE, DA-Seg, and the curriculum stages. These show consistent gains over baselines. To address the concern, we will revise the abstract to reference key numerical improvements and ensure §4 explicitly highlights the tables and component-wise ablations for immediate verification. revision: yes

  2. Referee: [§3.3] §3.3 (Curriculum Two-Stage Training): the generalization argument rests on the unverified assumption that online random masks plus synthetic mask degradation sufficiently cover natural lighting variations, abrupt motions, and mask defects; no cross-domain hold-out tests, distribution-shift experiments, or failure-case analysis on unseen real videos are described to support that the MUSE windowing and DA-Seg prior will not introduce new artifacts outside the training distribution.

    Authors: We agree that additional validation would strengthen the generalization discussion. While the self-supervised Stage I on unpaired real videos and mask degradation in Stage II are intended to promote robustness, we will add new experiments in the revision: cross-domain evaluation on hold-out real videos from unseen distributions, lighting variation tests, and a failure-case analysis section examining potential artifacts from MUSE and DA-Seg. These will provide direct evidence supporting the claims. revision: yes
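The "temporal flicker" scores referenced in the first response are not defined in this review. The sketch below is one simple proxy (mean frame-to-frame change inside the edited region), assuming RGB frames in [0, 1] and per-frame binary regions; the paper's actual stability metric may differ.

```python
# A minimal proxy for a "temporal flicker" score; not necessarily the metric used in the paper.
import numpy as np

def flicker_score(frames: np.ndarray, region: np.ndarray) -> float:
    """Mean absolute frame-to-frame change inside the edited region.

    frames : (T, H, W, 3) output video with values in [0, 1]
    region : (T, H, W) binary mask of the inpainted area
    Lower is more temporally stable. No motion compensation is applied here,
    which a flow-warped variant would add.
    """
    diffs = np.abs(frames[1:] - frames[:-1])               # (T-1, H, W, 3)
    roi = (region[1:] & region[:-1]).astype(bool)          # overlap of consecutive regions
    if not roi.any():
        return 0.0
    return float(diffs[roi].mean())
```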

Circularity Check

0 steps flagged

No circularity detected; method relies on independent architectural and training choices

Full rationale

The paper introduces SVOR via three explicit components (MUSE windowed masking, DA-Seg side-branch with AdaLN, and two-stage curriculum training) without any equations, derivations, or parameter-fitting steps that reduce outputs to inputs by construction. Claims of shadow/reflection removal and temporal stability rest on described architectural additions and loss weighting rather than self-definitional loops or self-citation load-bearing. Experimental SOTA results on degraded-mask benchmarks are presented as empirical outcomes, not tautological predictions. No load-bearing self-citations or renamed known results appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review limits visibility; no explicit free parameters or invented entities named. Relies on standard diffusion model assumptions and the effectiveness of the proposed training strategy.

axioms (1)
  • domain assumption Diffusion-based video inpainting models form a viable base that can be extended for temporal stability
    The paper builds directly on existing diffusion video inpainting models, as stated in the abstract.

pith-pipeline@v0.9.0 · 5571 in / 1123 out tokens · 45585 ms · 2026-05-15T13:31:02.097225+00:00 · methodology


Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 6 internal anchors

  1. [1] Bian, Y., Zhang, Z., Ju, X., Cao, M., Xie, L., Shan, Y., Xu, Q.: VideoPainter: Any-length video inpainting and editing with plug-and-play context control. In: Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Papers, pp. 1–12 (2025)
  2. [2] Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., et al.: SAM 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719 (2025)
  3. [3] Chan, K.C., Zhou, S., Xu, X., Loy, C.C.: BasicVSR++: Improving video super-resolution with enhanced propagation and alignment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5972–5981 (2022)
  4. [4] Chandrasekar, A., Chakrabarty, G., Bardhan, J., Hebbalaguppe, R., AP, P.: ReMOVE: A reference-free metric for object erasure. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 7901–7910 (2024)
  5. [5] Chen, G., Lin, D., Yang, J., Lin, C., Zhu, J., Fan, M., Zhang, H., Chen, S., Chen, Z., Ma, C., et al.: SkyReels-V2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074 (2025)
  6. [6] Hore, A., Ziou, D.: Image quality metrics: PSNR vs. SSIM. In: 2010 20th International Conference on Pattern Recognition, pp. 2366–2369. IEEE (2010)
  7. [7] Hu, J., Zhong, T., Wang, X., Jiang, B., Tian, X., Yang, F., Wan, P., Zhang, D.: VIVID-10M: A dataset and baseline for versatile and interactive video local editing. arXiv preprint arXiv:2411.15260 (2024)
  8. [8] Hu, Y.T., Wang, H., Ballas, N., Grauman, K., Schwing, A.G.: Proposal-based video completion. In: European Conference on Computer Vision, pp. 38–54. Springer (2020)
  9. [9] Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: VBench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21807–21818 (2024)
  10. [10] Jiang, Z., Han, Z., Mao, C., Zhang, J., Pan, Y., Liu, Y.: VACE: All-in-one video creation and editing. arXiv preprint arXiv:2503.07598 (2025)
  11. [11] Ke, J., Wang, Q., Wang, Y., Milanfar, P., Yang, F.: MUSIQ: Multi-scale image quality transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5148–5157 (2021)
  12. [12] Ke, L., Tai, Y.W., Tang, C.K.: Occlusion-aware video object inpainting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14468–14478 (2021)
  13. [13] Kim, D., Woo, S., Lee, J.Y., Kweon, I.S.: Deep video inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5792–5801 (2019)
  14. [14] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4015–4026 (2023)
  15. [15] Kushwaha, S.S., Nag, S., Tian, Y., Kulkarni, K.: Object-Wiper: Training-free object and associated effect removal in videos. arXiv preprint arXiv:2601.06391 (2026)
  16. [16] Lee, Y.C., Lu, E., Rumbley, S., Geyer, M., Huang, J.B., Dekel, T., Cole, F.: Generative Omnimatte: Learning to decompose video into layers. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 12522–12532 (2025)
  17. [17] Li, X., Xue, H., Ren, P., Bo, L.: DiffuEraser: A diffusion model for video inpainting. arXiv preprint arXiv:2501.10018 (2025)
  18. [18] Li, X., Chu, W., Wu, Y., Yuan, W., Liu, F., Zhang, Q., Li, F., Feng, H., Ding, E., Wang, J.: VideoGen: A reference-guided latent diffusion approach for high-definition text-to-video generation. arXiv preprint arXiv:2309.00398 (2023)
  19. [19] Li, Z., Lu, C.Z., Qin, J., Guo, C.L., Cheng, M.M.: Towards an end-to-end framework for flow-guided video inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17562–17571 (2022)
  20. [20] Lin, J., Gan, C., Han, S.: TSM: Temporal shift module for efficient video understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7083–7093 (2019)
  21. [21] Litman, Y., Liu, S., Seyb, D., Milef, N., Zhou, Y., Marshall, C., Tulsiani, S., Leak, C.: EditCtrl: Disentangled local and global control for real-time generative video editing. arXiv preprint arXiv:2602.15031 (2026)
  22. [22] Liu, R., Deng, H., Huang, Y., Shi, X., Lu, L., Sun, W., Wang, X., Dai, J., Li, H.: FuseFormer: Fusing fine-grained information in transformers for video inpainting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 14040–14049 (2021)
  23. [23] Miao, C., Feng, Y., Zeng, J., Gao, Z., Liu, H., Yan, Y., Qi, D., Chen, X., Wang, B., Zhao, H.: ROSE: Remove objects with side effects in videos. In: Advances in Neural Information Processing Systems (2025)
  24. [24] Nan, K., Xie, R., Zhou, P., Fan, T., Yang, Z., Chen, Z., Li, X., Yang, J., Tai, Y.: OpenVid-1M: A large-scale high-quality dataset for text-to-video generation. arXiv preprint arXiv:2407.02371 (2024)
  25. [25] Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4195–4205 (2023)
  26. [26] Pont-Tuset, J., Perazzi, F., Caelles, S., Arbeláez, P., Sorkine-Hornung, A., Van Gool, L.: The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675 (2017)
  27. [27] Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024)
  28. [28] Ren, J., Zheng, Q., Zhao, Y., Xu, X., Li, C.: DLFormer: Discrete latent transformer for video inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3511–3520 (2022)
  29. [29] Sagong, M.C., Yeo, Y.J., Jung, S.W., Ko, S.J.: RORD: A real-world object removal dataset. In: BMVC, p. 542 (2022)
  30. [30] Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: DINOv3. arXiv preprint arXiv:2508.10104 (2025)
  31. [31] Stergiou, A., Poppe, R.: AdaPool: Exponential adaptive pooling for information-retaining downsampling. IEEE Transactions on Image Processing 32, 251–266 (2022)
  32. [32] Wang, C., Huang, H., Han, X., Wang, J.: Video inpainting by jointly learning temporal structure and spatial details. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 5232–5239 (2019)
  33. [33] Wang, J., Ma, A., Cao, K., Zheng, J., Zhang, Z., Feng, J., Liu, S., Ma, Y., Cheng, B., Leng, D., et al.: WISA: World simulator assistant for physics-aware text-to-video generation. arXiv preprint arXiv:2503.08153 (2025)
  34. [34] Wang, X., Yuan, H., Zhang, S., Chen, D., Wang, J., Zhang, Y., Shen, Y., Zhao, D., Zhou, J.: VideoComposer: Compositional video synthesis with motion controllability. Advances in Neural Information Processing Systems 36, 7594–7611 (2023)
  35. [35] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612 (2004)
  36. [36] Xue, Z., Zhang, J., Hu, T., He, H., Chen, Y., Cai, Y., Wang, Y., Wang, C., Liu, Y., Li, X., et al.: UltraVideo: High-quality UHD video dataset with comprehensive captions. arXiv preprint arXiv:2506.13691 (2025)
  37. [37] Yang, S., Gu, Z., Hou, L., Tao, X., Wan, P., Chen, X., Liao, J.: MTV-Inpaint: Multi-task long video inpainting. arXiv preprint arXiv:2503.11412 (2025)
  38. [38] Yu, Y., Zeng, Z., Zheng, H., Luo, J.: OmniPaint: Mastering object-oriented editing via disentangled insertion-removal inpainting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 17324–17334 (2025)
  39. [39] Zeng, Y., Fu, J., Chao, H.: Learning joint spatial-temporal transformations for video inpainting. In: European Conference on Computer Vision, pp. 528–543. Springer (2020)
  40. [40] Zhang, K., Fu, J., Liu, D.: Flow-guided transformer for video inpainting. In: European Conference on Computer Vision, pp. 74–90. Springer (2022)
  41. [41] Zhang, K., Fu, J., Liu, D.: Inertia-guided flow completion and style fusion for video inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5982–5991 (2022)
  42. [42] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595 (2018)
  43. [43] Zhang, Z., Wu, B., Wang, X., Luo, Y., Zhang, L., Zhao, Y., Vajda, P., Metaxas, D., Yu, L.: AVID: Any-length video inpainting with diffusion model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7162–7172 (2024)
  44. [44] Zheng, W., Xu, C., Xu, X., Liu, W., He, S.: CIRI: Curricular inactivation for residue-aware one-shot video inpainting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 13012–13022 (2023)
  45. [45] Zhou, S., Li, C., Chan, K.C., Loy, C.C.: ProPainter: Improving propagation and transformer for video inpainting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10477–10486 (2023)
  46. [46] Zi, B., Peng, W., Qi, X., Wang, J., Zhao, S., Xiao, R., Wong, K.F.: MiniMax-Remover: Taming bad noise helps video object removal. In: Advances in Neural Information Processing Systems (2025)
  47. [47] Zi, B., Zhao, S., Qi, X., Wang, J., Shi, Y., Chen, Q., Liang, B., Xiao, R., Wong, K.F., Zhang, L.: CoCoCo: Improving text-guided video inpainting for better consistency, controllability and compatibility. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, pp. 11067–11076 (2025)
  48. [48] Zou, X., Yang, L., Liu, D., Lee, Y.J.: Progressive temporal feature alignment network for video inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16448–16457 (2021)