From Ideal to Real: Stable Video Object Removal under Imperfect Conditions
Pith reviewed 2026-05-15 13:31 UTC · model grok-4.3
The pith
SVOR removes objects from videos while handling shadows, abrupt motion, and defective masks to produce stable results.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SVOR attains shadow-free, flicker-free, and mask-defect-tolerant video object removal through three components:
- MUSE, a windowed union strategy applied during temporal mask downsampling to preserve all observed target regions;
- DA-Seg, a lightweight segmentation head on a decoupled side branch with Denoising-Aware AdaLN, trained under mask degradation to supply an internal diffusion-aware localization prior;
- curriculum two-stage training, where Stage I performs self-supervised pretraining on unpaired real-background videos with online random masks and Stage II refines on synthetic pairs using mask degradation and side-effect-weighted losses.
What carries the argument
MUSE windowed mask union, DA-Seg denoising-aware segmentation head, and curriculum two-stage training that together enable temporal stability and shadow removal in diffusion-based video inpainting.
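The MUSE idea, taking the union of every mask observed inside a temporal window during downsampling rather than sampling a single frame's mask, can be sketched as follows. This is a minimal illustration under assumed shapes; the paper's actual downsampling operates inside the model's mask pipeline, and the function name is ours:

```python
import numpy as np

def muse_downsample(masks: np.ndarray, window: int) -> np.ndarray:
    """Windowed union for temporal mask downsampling (MUSE-style sketch).

    Instead of keeping one mask per temporal window, which can miss a
    fast-moving target, take the logical union (elementwise max) of all
    masks observed inside the window, so every observed target region
    survives the downsampling.

    masks: (T, H, W) binary array of per-frame target masks.
    window: temporal downsampling factor.
    Returns an array of shape (ceil(T / window), H, W).
    """
    unions = [masks[t:t + window].max(axis=0)
              for t in range(0, masks.shape[0], window)]
    return np.stack(unions)
```

Under abrupt motion, a target occupying disjoint regions in consecutive frames stays fully covered in the pooled mask, whereas naive stride sampling would drop one of the positions.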
If this is right
- MUSE prevents missed removals during abrupt motion by unioning all observed mask regions within each temporal window.
- The model removes objects together with associated shadows and reflections through side-effect-weighted losses in Stage II.
- DA-Seg supplies diffusion-aware localization without altering the main content generation path.
- Curriculum pretraining on unpaired real videos followed by degraded-mask refinement improves cross-domain robustness.
- SVOR reaches new state-of-the-art results across multiple datasets and degraded-mask benchmarks.
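One way to picture the mask-degradation signal behind the degraded-mask robustness claims: randomly dilating or eroding ground-truth masks so the model learns to tolerate defective inputs. A minimal sketch assuming simple morphological perturbation; the paper does not specify its exact degradation operators, so everything here is illustrative:

```python
import numpy as np

def degrade_mask(mask: np.ndarray, rng: np.random.Generator,
                 max_radius: int = 2) -> np.ndarray:
    """Randomly dilate or erode a binary mask to mimic defective inputs.

    Dilation is implemented as a union of shifted copies (np.roll wraps at
    borders, which is acceptable for masks with interior objects); erosion
    is dilation of the complement.
    """
    r = int(rng.integers(1, max_radius + 1))

    def dilate(m: np.ndarray) -> np.ndarray:
        out = m.copy()
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                out = np.maximum(out, np.roll(np.roll(m, dy, axis=0), dx, axis=1))
        return out

    if rng.random() < 0.5:
        return dilate(mask)          # over-segmented (inflated) mask
    return 1 - dilate(1 - mask)      # under-segmented (eroded) mask
```

Training against such perturbed masks is one plausible reading of how Stage II encourages tolerance to approximate user-provided masks.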
Where Pith is reading between the lines
- The decoupled side-branch design could be adapted to other diffusion video models to add localization awareness without retraining the core generator.
- Curriculum training from unpaired real backgrounds may reduce dependence on perfectly paired synthetic data in related video editing tasks.
- Testing SVOR on longer sequences or videos with changing lighting could reveal whether temporal stability scales beyond the current benchmarks.
- Consumer video tools might incorporate similar mask-robust components to allow users to provide approximate rather than perfect removal masks.
Load-bearing premise
The three components MUSE, DA-Seg, and curriculum training will jointly eliminate shadows and reflections and maintain temporal stability without new artifacts when applied to arbitrary real-world videos beyond the tested datasets.
What would settle it
A real video with complex moving shadows, abrupt motion changes, and noisy input masks where SVOR outputs still show visible shadows, reflections, or frame-to-frame flickers would disprove the claim of robust real-world performance.
Original abstract
Removing objects from videos remains difficult in the presence of real-world imperfections such as shadows, abrupt motion, and defective masks. Existing diffusion-based video inpainting models often struggle to maintain temporal stability and visual consistency under these challenges. We propose Stable Video Object Removal (SVOR), a robust framework that achieves shadow-free, flicker-free, and mask-defect-tolerant removal through three key designs: (1) Mask Union for Stable Erasure (MUSE), a windowed union strategy applied during temporal mask downsampling to preserve all target regions observed within each window, effectively handling abrupt motion and reducing missed removals; (2) Denoising-Aware Segmentation (DA-Seg), a lightweight segmentation head on a decoupled side branch equipped with Denoising-Aware AdaLN and trained with mask degradation to provide an internal diffusion-aware localization prior without affecting content generation; and (3) Curriculum Two-Stage Training: where Stage I performs self-supervised pretraining on unpaired real-background videos with online random masks to learn realistic background and temporal priors, and Stage II refines on synthetic pairs using mask degradation and side-effect-weighted losses, jointly removing objects and their associated shadows/reflections while improving cross-domain robustness. Extensive experiments show that SVOR attains new state-of-the-art results across multiple datasets and degraded-mask benchmarks, advancing video object removal from ideal settings toward real-world applications. Project page: https://xiaomi-research.github.io/svor/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Stable Video Object Removal (SVOR), a diffusion-based video inpainting framework for object removal under real-world imperfections including shadows, reflections, abrupt motion, and defective masks. It introduces three components: MUSE (windowed union of temporal masks during downsampling to handle abrupt motion and missed regions), DA-Seg (a decoupled denoising-aware segmentation head with AdaLN and mask degradation training to provide an internal localization prior), and a two-stage curriculum training (Stage I self-supervised pretraining on unpaired real-background videos with random masks, Stage II refinement on synthetic pairs using mask degradation and side-effect-weighted losses). The central claim is that these jointly achieve shadow-free, flicker-free removal with new state-of-the-art results on multiple datasets and degraded-mask benchmarks.
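The DA-Seg conditioning described above, AdaLN driven by the denoising state, follows a standard adaptive-normalization pattern. A hedged sketch of that pattern; the paper's exact parameterization is not given, and the shapes and names below are assumptions:

```python
import numpy as np

def adaln(x: np.ndarray, t_emb: np.ndarray, w_scale: np.ndarray,
          w_shift: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Adaptive LayerNorm modulated by a denoising-timestep embedding.

    x:       (N, D) token features on the segmentation side branch.
    t_emb:   (C,)   embedding of the current diffusion timestep.
    w_scale, w_shift: (C, D) linear maps predicting per-feature modulation.
    """
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mu) / np.sqrt(var + eps)
    scale = t_emb @ w_scale          # denoising-aware scale
    shift = t_emb @ w_shift          # denoising-aware shift
    return x_norm * (1.0 + scale) + shift
```

With zero-initialized modulation maps this reduces to plain LayerNorm, a common choice so the side branch starts as a near-identity modulation and learns timestep dependence gradually.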
Significance. If the quantitative claims hold, the work meaningfully advances video object removal toward practical deployment by explicitly targeting imperfections that break existing diffusion inpainting models. The self-supervised pretraining on real backgrounds and the internal DA-Seg prior represent potentially reusable design patterns for robust video editing; the curriculum approach could serve as a template for bridging synthetic-to-real gaps in other video synthesis tasks.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): the SOTA claim is asserted without any numerical results, tables, or ablation numbers in the provided text; this prevents verification of the magnitude of gains over baselines on the degraded-mask benchmarks and undermines the central assertion that the three components jointly eliminate shadows/reflections while preserving temporal stability.
- [§3.3] §3.3 (Curriculum Two-Stage Training): the generalization argument rests on the unverified assumption that online random masks plus synthetic mask degradation sufficiently cover natural lighting variations, abrupt motions, and mask defects; no cross-domain hold-out tests, distribution-shift experiments, or failure-case analysis on unseen real videos are described to support that the MUSE windowing and DA-Seg prior will not introduce new artifacts outside the training distribution.
minor comments (2)
- [§3.1 and §3.2] Notation for MUSE window size and DA-Seg AdaLN parameters should be defined explicitly with symbols rather than prose descriptions to aid reproducibility.
- [Abstract] The project page URL is given but the manuscript should include a brief statement on code and model release status to support the reproducibility implied by the self-supervised pretraining description.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will make revisions to improve clarity and strengthen the evidence for our claims.
Point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): the SOTA claim is asserted without any numerical results, tables, or ablation numbers in the provided text; this prevents verification of the magnitude of gains over baselines on the degraded-mask benchmarks and undermines the central assertion that the three components jointly eliminate shadows/reflections while preserving temporal stability.
Authors: We appreciate the referee highlighting this presentation issue. The full manuscript in Section 4 contains multiple tables (including quantitative comparisons on DAVIS, YouTube-VOS, and degraded-mask variants) with metrics such as PSNR, SSIM, LPIPS, and temporal flicker scores, along with ablations isolating MUSE, DA-Seg, and the curriculum stages. These show consistent gains over baselines. To address the concern, we will revise the abstract to reference key numerical improvements and ensure §4 explicitly highlights the tables and component-wise ablations for immediate verification. revision: yes
-
Referee: [§3.3] §3.3 (Curriculum Two-Stage Training): the generalization argument rests on the unverified assumption that online random masks plus synthetic mask degradation sufficiently cover natural lighting variations, abrupt motions, and mask defects; no cross-domain hold-out tests, distribution-shift experiments, or failure-case analysis on unseen real videos are described to support that the MUSE windowing and DA-Seg prior will not introduce new artifacts outside the training distribution.
Authors: We agree that additional validation would strengthen the generalization discussion. While the self-supervised Stage I on unpaired real videos and mask degradation in Stage II are intended to promote robustness, we will add new experiments in the revision: cross-domain evaluation on hold-out real videos from unseen distributions, lighting variation tests, and a failure-case analysis section examining potential artifacts from MUSE and DA-Seg. These will provide direct evidence supporting the claims. revision: yes
Circularity Check
No circularity detected; method relies on independent architectural and training choices
full rationale
The paper introduces SVOR via three explicit components (MUSE windowed masking, the DA-Seg side branch with AdaLN, and two-stage curriculum training) without equations, derivations, or parameter-fitting steps that would reduce outputs to inputs by construction. Claims of shadow/reflection removal and temporal stability rest on described architectural additions and loss weighting rather than self-definitional loops. Experimental SOTA results on degraded-mask benchmarks are presented as empirical outcomes, not tautological predictions, and no load-bearing self-citations or renamed known results appear in the provided text.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: diffusion-based video inpainting models form a viable base that can be extended for temporal stability.
Reference graph
Works this paper leans on
- [1] Bian, Y., Zhang, Z., Ju, X., Cao, M., Xie, L., Shan, Y., Xu, Q.: VideoPainter: Any-length video inpainting and editing with plug-and-play context control. In: ACM SIGGRAPH Conference Papers, pp. 1–12 (2025)
- [2] Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., et al.: SAM 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719 (2025)
- [3] Chan, K.C., Zhou, S., Xu, X., Loy, C.C.: BasicVSR++: Improving video super-resolution with enhanced propagation and alignment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5972–5981 (2022)
- [4] Chandrasekar, A., Chakrabarty, G., Bardhan, J., Hebbalaguppe, R., AP, P.: ReMOVE: A reference-free metric for object erasure. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 7901–7910 (2024)
- [5] Chen, G., Lin, D., Yang, J., Lin, C., Zhu, J., Fan, M., Zhang, H., Chen, S., Chen, Z., Ma, C., et al.: SkyReels-V2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074 (2025)
- [6] Hore, A., Ziou, D.: Image quality metrics: PSNR vs. SSIM. In: 2010 20th International Conference on Pattern Recognition, pp. 2366–2369. IEEE (2010)
- [7] Hu, J., Zhong, T., Wang, X., Jiang, B., Tian, X., Yang, F., Wan, P., Zhang, D.: VIVID-10M: A dataset and baseline for versatile and interactive video local editing. arXiv preprint arXiv:2411.15260 (2024)
- [8] Hu, Y.T., Wang, H., Ballas, N., Grauman, K., Schwing, A.G.: Proposal-based video completion. In: European Conference on Computer Vision, pp. 38–54. Springer (2020)
- [9] Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: VBench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21807–21818 (2024)
- [10] Jiang, Z., Han, Z., Mao, C., Zhang, J., Pan, Y., Liu, Y.: VACE: All-in-one video creation and editing. arXiv preprint arXiv:2503.07598 (2025)
- [11] Ke, J., Wang, Q., Wang, Y., Milanfar, P., Yang, F.: MUSIQ: Multi-scale image quality transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5148–5157 (2021)
- [12] Ke, L., Tai, Y.W., Tang, C.K.: Occlusion-aware video object inpainting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14468–14478 (2021)
- [13] Kim, D., Woo, S., Lee, J.Y., Kweon, I.S.: Deep video inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5792–5801 (2019)
- [14] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4015–4026 (2023)
- [15] Kushwaha, S.S., Nag, S., Tian, Y., Kulkarni, K.: Object-Wiper: Training-free object and associated effect removal in videos. arXiv preprint arXiv:2601.06391 (2026)
- [16] Lee, Y.C., Lu, E., Rumbley, S., Geyer, M., Huang, J.B., Dekel, T., Cole, F.: Generative omnimatte: Learning to decompose video into layers. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 12522–12532 (2025)
- [17] Li, X., Xue, H., Ren, P., Bo, L.: DiffuEraser: A diffusion model for video inpainting. arXiv preprint arXiv:2501.10018 (2025)
- [18] Li, X., Chu, W., Wu, Y., Yuan, W., Liu, F., Zhang, Q., Li, F., Feng, H., Ding, E., Wang, J.: VideoGen: A reference-guided latent diffusion approach for high-definition text-to-video generation. arXiv preprint arXiv:2309.00398 (2023)
- [19] Li, Z., Lu, C.Z., Qin, J., Guo, C.L., Cheng, M.M.: Towards an end-to-end framework for flow-guided video inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17562–17571 (2022)
- [20] Lin, J., Gan, C., Han, S.: TSM: Temporal shift module for efficient video understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7083–7093 (2019)
- [21] Litman, Y., Liu, S., Seyb, D., Milef, N., Zhou, Y., Marshall, C., Tulsiani, S., Leak, C.: EditCtrl: Disentangled local and global control for real-time generative video editing. arXiv preprint arXiv:2602.15031 (2026)
- [22] Liu, R., Deng, H., Huang, Y., Shi, X., Lu, L., Sun, W., Wang, X., Dai, J., Li, H.: FuseFormer: Fusing fine-grained information in transformers for video inpainting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 14040–14049 (2021)
- [23] Miao, C., Feng, Y., Zeng, J., Gao, Z., Liu, H., Yan, Y., Qi, D., Chen, X., Wang, B., Zhao, H.: ROSE: Remove objects with side effects in videos. In: Advances in Neural Information Processing Systems (2025)
- [24] Nan, K., Xie, R., Zhou, P., Fan, T., Yang, Z., Chen, Z., Li, X., Yang, J., Tai, Y.: OpenVid-1M: A large-scale high-quality dataset for text-to-video generation. arXiv preprint arXiv:2407.02371 (2024)
- [25] Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4195–4205 (2023)
- [26] Pont-Tuset, J., Perazzi, F., Caelles, S., Arbeláez, P., Sorkine-Hornung, A., Van Gool, L.: The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675 (2017)
- [27] Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024)
- [28] Ren, J., Zheng, Q., Zhao, Y., Xu, X., Li, C.: DLFormer: Discrete latent transformer for video inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3511–3520 (2022)
- [29]
- [30] Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: DINOv3. arXiv preprint arXiv:2508.10104 (2025)
- [31] Stergiou, A., Poppe, R.: AdaPool: Exponential adaptive pooling for information-retaining downsampling. IEEE Transactions on Image Processing 32, 251–266 (2022)
- [32] Wang, C., Huang, H., Han, X., Wang, J.: Video inpainting by jointly learning temporal structure and spatial details. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 5232–5239 (2019)
- [33] Wang, J., Ma, A., Cao, K., Zheng, J., Zhang, Z., Feng, J., Liu, S., Ma, Y., Cheng, B., Leng, D., et al.: WISA: World simulator assistant for physics-aware text-to-video generation. arXiv preprint arXiv:2503.08153 (2025)
- [34] Wang, X., Yuan, H., Zhang, S., Chen, D., Wang, J., Zhang, Y., Shen, Y., Zhao, D., Zhou, J.: VideoComposer: Compositional video synthesis with motion controllability. Advances in Neural Information Processing Systems 36, 7594–7611 (2023)
- [35] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612 (2004)
- [36] Xue, Z., Zhang, J., Hu, T., He, H., Chen, Y., Cai, Y., Wang, Y., Wang, C., Liu, Y., Li, X., et al.: UltraVideo: High-quality UHD video dataset with comprehensive captions. arXiv preprint arXiv:2506.13691 (2025)
- [37] Yang, S., Gu, Z., Hou, L., Tao, X., Wan, P., Chen, X., Liao, J.: MTV-Inpaint: Multi-task long video inpainting. arXiv preprint arXiv:2503.11412 (2025)
- [38] Yu, Y., Zeng, Z., Zheng, H., Luo, J.: OmniPaint: Mastering object-oriented editing via disentangled insertion-removal inpainting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 17324–17334 (2025)
- [39] Zeng, Y., Fu, J., Chao, H.: Learning joint spatial-temporal transformations for video inpainting. In: European Conference on Computer Vision, pp. 528–543. Springer (2020)
- [40] Zhang, K., Fu, J., Liu, D.: Flow-guided transformer for video inpainting. In: European Conference on Computer Vision, pp. 74–90. Springer (2022)
- [41] Zhang, K., Fu, J., Liu, D.: Inertia-guided flow completion and style fusion for video inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5982–5991 (2022)
- [42] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595 (2018)
- [43] Zhang, Z., Wu, B., Wang, X., Luo, Y., Zhang, L., Zhao, Y., Vajda, P., Metaxas, D., Yu, L.: AVID: Any-length video inpainting with diffusion model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7162–7172 (2024)
- [44] Zheng, W., Xu, C., Xu, X., Liu, W., He, S.: CIRI: Curricular inactivation for residue-aware one-shot video inpainting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 13012–13022 (2023)
- [45] Zhou, S., Li, C., Chan, K.C., Loy, C.C.: ProPainter: Improving propagation and transformer for video inpainting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10477–10486 (2023)
- [46] Zi, B., Peng, W., Qi, X., Wang, J., Zhao, S., Xiao, R., Wong, K.F.: MiniMax-Remover: Taming bad noise helps video object removal. In: Advances in Neural Information Processing Systems (2025)
- [47] Zi, B., Zhao, S., Qi, X., Wang, J., Shi, Y., Chen, Q., Liang, B., Xiao, R., Wong, K.F., Zhang, L.: CoCoCo: Improving text-guided video inpainting for better consistency, controllability and compatibility. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, pp. 11067–11076 (2025)
- [48] Zou, X., Yang, L., Liu, D., Lee, Y.J.: Progressive temporal feature alignment network for video inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16448–16457 (2021)
discussion (0)