pith. machine review for the scientific record. sign in

arxiv: 2604.10950 · v2 · submitted 2026-04-13 · 💻 cs.CV

Recognition: unknown

Bootstrapping Video Semantic Segmentation Model via Distillation-assisted Test-Time Adaptation

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords video semantic segmentationtest-time adaptationknowledge distillationSAM2temporal fusionannotation-free learningimage-to-video adaptationVSS
0
0 comments X

The pith

A pre-trained image segmentation model becomes a video segmentation model by distilling SAM2's temporal knowledge in a single annotation-free test-time adaptation pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to add temporal coherence to existing image segmentation models for video tasks without needing any labeled video data. It does this by having the model learn from SAM2 during a quick initial pass over the video, then using a simple module to combine information across frames. This matters because collecting dense annotations for videos is expensive and time-consuming, while this method allows practical use on new videos with minimal setup. It works even if only the first 10 percent of the video is used for adaptation. Experiments confirm it matches or beats fully supervised approaches on standard video segmentation benchmarks.

Core claim

DiTTA distills SAM2's temporal segmentation knowledge into a pre-trained image semantic segmentation model during a brief single-pass initialization phase on unlabeled video, then augments the model with a lightweight temporal fusion module to aggregate cross-frame context, enabling robust video semantic segmentation that generalizes well even from partial video inputs.

What carries the argument

The DiTTA framework, which performs distillation-assisted test-time adaptation to transfer temporal knowledge from SAM2 to an image model in one pass, combined with a temporal fusion module for cross-frame aggregation.

Load-bearing premise

That SAM2's temporal segmentation knowledge transfers reliably to any pre-trained image segmentation model via a single annotation-free adaptation pass and produces robust generalization across different videos and limited inputs.

What would settle it

Running the adapted model on a held-out set of video frames and finding its accuracy drops below that of the original frame-by-frame image model, or below zero-shot SAM2 refinement methods.

Figures

Figures reproduced from arXiv: 2604.10950 by Hoyong Kwon, Hyeokjun Kweon, Jihun Kim, Kuk-Jin Yoon.

Figure 1
Figure 1. Figure 1: VSS using ISS model and SAM2. I and S are frames and their semantic segmentation results, respectively. In (C), DiTTA adapts the ISS model to the VSS model at test time by distilling temporal knowledge from SAM2 over a few initial frames. The pink arrow denotes an temporal fusion add-on. as VSPW [37], whose construction demands extensive pixel-level labeling and substantial human effort. In contrast, Image… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of our DiTTA (Distillation-assisted Test-Time Adaptation) framework. It comprises a lightweight add-on for temporal [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative comparison of VSS results across various methods under W2F protocol. Only the initial 10% of frames are used for [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison on cross-dataset setting, [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparison on ADE20k [70] → VSPW [37]. (a) Frames, (b) ISS, (c) DiTTA (Ours), (d) GT. 6.3. Using ISS Model Trained on Non-video Dataset One may wonder whether DiTTA’s performance is partly due to the ISS model being pre-trained on VSPW [37], which is a video dataset. While the used ISS model does not access any temporal information during training, we want to clarify that DiTTA’s effectiveness … view at source ↗
read the original abstract

Fully supervised Video Semantic Segmentation (VSS) relies heavily on densely annotated video data, limiting practical applicability. Alternatively, applying pre-trained Image Semantic Segmentation (ISS) models frame-by-frame avoids annotation costs but ignores crucial temporal coherence. Recent foundation models such as SAM2 enable high-quality mask propagation yet remain impractical for direct VSS due to limited semantic understanding and computational overhead. In this paper, we propose DiTTA (Distillation-assisted Test-Time Adaptation), a novel framework that converts an ISS model into a temporally-aware VSS model through efficient test-time adaptation (TTA), without annotated videos. DiTTA distills SAM2's temporal segmentation knowledge into the ISS model during a brief, single-pass initialization phase, complemented by a lightweight temporal fusion module to aggregate cross-frame context. Crucially, DiTTA achieves robust generalization even when adapting with highly limited partial video snippets (e.g., initial 10%), significantly outperforming zero-shot refinement approaches that repeatedly invoke SAM2 during inference. Extensive experiments on VSPW and Cityscapes demonstrate DiTTA's effectiveness, achieving competitive or superior performance relative to fully-supervised VSS methods, thus providing a practical and annotation-free solution for real-world VSS tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces DiTTA, a framework that converts a pre-trained image semantic segmentation (ISS) model into a video semantic segmentation (VSS) model via distillation-assisted test-time adaptation (TTA) from SAM2 on unlabeled partial video snippets (e.g., initial 10%), augmented by a lightweight temporal fusion module for cross-frame context aggregation. It claims robust generalization to full videos and other domains, outperforming zero-shot SAM2 refinement approaches, and achieving competitive or superior results to fully-supervised VSS methods on VSPW and Cityscapes without requiring annotations.

Significance. If the performance and generalization claims hold, the work offers a practical annotation-free pathway to temporally coherent VSS by leveraging foundation models like SAM2 for efficient single-pass distillation, which could reduce reliance on dense video labeling in applications such as autonomous driving. The partial-snippet adaptation highlights computational efficiency gains over repeated inference-time SAM2 calls.

major comments (3)
  1. [§3.2] §3.2 (TTA initialization procedure): the single-pass adaptation on the initial 10% snippet lacks any described regularization, temporal consistency loss over the full sequence, or anti-overfitting term; without such a mechanism the claimed robust transfer of temporal coherence to remaining frames risks fitting transient snippet-specific statistics (lighting, camera motion) rather than generalizable semantics.
  2. [§4.2] §4.2 and Table 2 (quantitative comparisons): the reported competitive/superior mIoU versus fully-supervised VSS baselines requires explicit confirmation that those baselines use identical backbones and comparable training regimes; otherwise the interpretation of annotation-free gains is undermined.
  3. [§4.3] §4.3 (ablations): the contribution of the temporal fusion module versus pure SAM2 distillation must be isolated under the 10%-snippet regime, including controls for overfitting, to substantiate that the observed gains stem from the proposed components rather than dataset-specific artifacts.
minor comments (2)
  1. [§2] §2 (related work): a brief discussion of recent video-specific TTA techniques would strengthen the positioning of the distillation-assisted approach.
  2. [Figure 5] Figure 5 (qualitative results): adding error analysis or challenging cases (e.g., fast motion, occlusions) would provide a more balanced assessment of limitations.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and outline revisions to strengthen the manuscript while preserving its core contributions.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (TTA initialization procedure): the single-pass adaptation on the initial 10% snippet lacks any described regularization, temporal consistency loss over the full sequence, or anti-overfitting term; without such a mechanism the claimed robust transfer of temporal coherence to remaining frames risks fitting transient snippet-specific statistics (lighting, camera motion) rather than generalizable semantics.

    Authors: We agree that the current description of the TTA procedure does not include explicit regularization or a temporal consistency loss over the full sequence. The single-pass adaptation relies on the pre-trained ISS model's generalization combined with SAM2 distillation and the lightweight fusion module to promote coherence. However, to mitigate concerns about snippet-specific overfitting, we will revise §3.2 to include a brief discussion of this risk and add a new ablation (in §4.3) that incorporates a simple temporal consistency regularizer during the 10% initialization phase. This will demonstrate that performance remains stable without such terms but improves modestly with them, supporting the robustness claim. revision: partial

  2. Referee: [§4.2] §4.2 and Table 2 (quantitative comparisons): the reported competitive/superior mIoU versus fully-supervised VSS baselines requires explicit confirmation that those baselines use identical backbones and comparable training regimes; otherwise the interpretation of annotation-free gains is undermined.

    Authors: We appreciate this clarification request. The baselines in Table 2 follow the standard configurations reported in their original papers (e.g., ResNet-101 or Swin Transformer backbones with their respective training schedules on VSPW and Cityscapes). To eliminate ambiguity, we will revise the caption of Table 2 and the surrounding text in §4.2 to explicitly list the backbone and training regime for each baseline, confirming they match common fully-supervised setups. This will make the annotation-free performance gains more directly interpretable. revision: yes

  3. Referee: [§4.3] §4.3 (ablations): the contribution of the temporal fusion module versus pure SAM2 distillation must be isolated under the 10%-snippet regime, including controls for overfitting, to substantiate that the observed gains stem from the proposed components rather than dataset-specific artifacts.

    Authors: We concur that further isolation is valuable. The existing ablations compare variants but do not explicitly break down fusion versus distillation solely under the 10% snippet setting with overfitting controls. We will expand §4.3 with a dedicated table showing: (i) pure distillation (no fusion) on 10% snippets, (ii) full DiTTA with fusion, and (iii) a control varying snippet length (5%, 10%, 20%) to check for overfitting. These additions will confirm that the temporal fusion module provides the primary gains beyond distillation alone. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pipeline relies on external SAM2 distillation and standard TTA

full rationale

The paper describes an empirical method (DiTTA) that distills temporal knowledge from the external SAM2 foundation model into a pre-trained image segmentation model via single-pass test-time adaptation on partial video snippets, augmented by a lightweight temporal fusion module. No equations, derivations, or parameter-fitting steps appear in the provided abstract or summary; the approach does not define any quantity in terms of itself, rename fitted inputs as predictions, or rely on self-citation chains for load-bearing uniqueness claims. Results are validated through experiments on VSPW and Cityscapes against fully-supervised baselines, confirming the derivation chain is self-contained and externally grounded.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical computer-vision method paper; the abstract contains no explicit mathematical axioms, free parameters, or newly postulated physical entities. The central claim rests on the empirical effectiveness of the described adaptation pipeline.

pith-pipeline@v0.9.0 · 5524 in / 1203 out tokens · 41435 ms · 2026-05-10T16:16:27.935893+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

71 extracted references · 10 canonical work pages · 2 internal anchors

  1. [1]

    Temporal-aware hierarchical mask classi- fication for video semantic segmentation.arXiv preprint arXiv:2309.08020, 2023

    Zhaochong An, Guolei Sun, Zongwei Wu, Hao Tang, and Luc Van Gool. Temporal-aware hierarchical mask classi- fication for video semantic segmentation.arXiv preprint arXiv:2309.08020, 2023. 8

  2. [2]

    Contrastive test-time adaptation

    Dian Chen, Dequan Wang, Trevor Darrell, and Sayna Ebrahimi. Contrastive test-time adaptation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 295–305, 2022. 3, 4

  3. [3]

    Per- pixel classification is not all you need for semantic segmen- tation.Advances in neural information processing systems, 34:17864–17875, 2021

    Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per- pixel classification is not all you need for semantic segmen- tation.Advances in neural information processing systems, 34:17864–17875, 2021. 1

  4. [4]

    Masked-attention mask transformer for universal image segmentation

    Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexan- der Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1290–1299, 2022. 2

  5. [5]

    Finding meaning in points: Weakly super- vised semantic segmentation for event cameras

    Hoonhee Cho, Sung-Hoon Yoon, Hyeokjun Kweon, and Kuk-Jin Yoon. Finding meaning in points: Weakly super- vised semantic segmentation for event cameras. InEuropean Conference on Computer Vision, pages 266–286. Springer,

  6. [6]

    To adapt or not to adapt? real- time adaptation for semantic segmentation

    Marc Botet Colomer, Pier Luigi Dovesi, Theodoros Pana- giotakopoulos, Joao Frederico Carvalho, Linus H ¨arenstam- Nielsen, Hossein Azizpour, Hedvig Kjellstr ¨om, Daniel Cre- mers, and Matteo Poggi. To adapt or not to adapt? real- time adaptation for semantic segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 1...

  7. [7]

    The cityscapes dataset for semantic urban scene understanding

    Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016. 5, 7

  8. [8]

    Every frame counts: Joint learning of video segmentation and optical flow

    Mingyu Ding, Zhe Wang, Bolei Zhou, Jianping Shi, Zhiwu Lu, and Ping Luo. Every frame counts: Joint learning of video segmentation and optical flow. InProceedings of the AAAI conference on artificial intelligence, pages 10713– 10720, 2020. 2

  9. [9]

    Flownet: Learning optical flow with convolutional networks

    Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. InPro- ceedings of the IEEE international conference on computer vision, pages 2758–2766, 2015. 2

  10. [10]

    Uncertainty reduction for model adaptation in semantic segmentation

    Francois Fleuret et al. Uncertainty reduction for model adaptation in semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9613–9623, 2021. 3

  11. [11]

    Se- mantic video cnns through representation warping

    Raghudeep Gadde, Varun Jampani, and Peter V Gehler. Se- mantic video cnns through representation warping. InPro- ceedings of the IEEE International Conference on Computer Vision, pages 4453–4462, 2017. 2, 5

  12. [12]

    Video segmentation with superpixels

    Fabio Galasso, Roberto Cipolla, and Bernt Schiele. Video segmentation with superpixels. InComputer Vision–ACCV 2012: 11th Asian Conference on Computer Vision, Daejeon, Korea, November 5-9, 2012, Revised Selected Papers, Part I 11, pages 760–774. Springer, 2013. 2

  13. [13]

    Superpixel-based video ob- ject segmentation using perceptual organization and location prior

    Daniela Giordano, Francesca Murabito, Simone Palazzo, and Concetto Spampinato. Superpixel-based video ob- ject segmentation using perceptual organization and location prior. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4814–4822, 2015. 2

  14. [14]

    Vanishing-point-guided video semantic segmentation of driving scenes

    Diandian Guo, Deng-Ping Fan, Tongyu Lu, Christos Sakaridis, and Luc Van Gool. Vanishing-point-guided video semantic segmentation of driving scenes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3544–3553, 2024. 2

  15. [15]

    Exploiting temporal state space sharing for video se- mantic segmentation

    Syed Ariff Syed Hesham, Yun Liu, Guolei Sun, Henghui Ding, Jing Yang, Ender Konukoglu, Xue Geng, and Xudong Jiang. Exploiting temporal state space sharing for video se- mantic segmentation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24211–24221,

  16. [16]

    Temporally distributed networks for fast video semantic segmentation

    Ping Hu, Fabian Caba, Oliver Wang, Zhe Lin, Stan Sclaroff, and Federico Perazzi. Temporally distributed networks for fast video semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8818–8827, 2020. 2

  17. [17]

    Min- vis: A minimal video instance segmentation framework without video-based training.Advances in Neural Informa- tion Processing Systems, 35:31265–31277, 2022

    De-An Huang, Zhiding Yu, and Anima Anandkumar. Min- vis: A minimal video instance segmentation framework without video-based training.Advances in Neural Informa- tion Processing Systems, 35:31265–31277, 2022. 2

  18. [18]

    Efficient uncertainty estimation for se- mantic segmentation in videos

    Po-Yu Huang, Wan-Ting Hsu, Chun-Yueh Chiu, Ting-Fan Wu, and Min Sun. Efficient uncertainty estimation for se- mantic segmentation in videos. InProceedings of the Euro- pean Conference on Computer Vision (ECCV), pages 520– 535, 2018. 2

  19. [19]

    Accel: A corrective fusion network for efficient semantic segmenta- tion on video

    Samvit Jain, Xin Wang, and Joseph E Gonzalez. Accel: A corrective fusion network for efficient semantic segmenta- tion on video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8866– 8875, 2019. 2

  20. [20]

    Talos: Enhancing semantic scene completion via test-time adaptation on the line of sight.Advances in Neural Information Processing Systems, 37:74211–74232, 2024

    Hyun-Kurl Jang, Jihun Kim, Hyeokjun Kweon, and Kuk- Jin Yoon. Talos: Enhancing semantic scene completion via test-time adaptation on the line of sight.Advances in Neural Information Processing Systems, 37:74211–74232, 2024. 3

  21. [21]

    Video scene parsing with predictive feature learn- ing

    Xiaojie Jin, Xin Li, Huaxin Xiao, Xiaohui Shen, Zhe Lin, Jimei Yang, Yunpeng Chen, Jian Dong, Luoqi Liu, Zequn Jie, et al. Video scene parsing with predictive feature learn- ing. InProceedings of the IEEE International Conference on Computer Vision, pages 5580–5588, 2017. 2

  22. [22]

    Improved image boundaries for better video segmentation

    Anna Khoreva, Rodrigo Benenson, Fabio Galasso, Matthias Hein, and Bernt Schiele. Improved image boundaries for better video segmentation. InComputer Vision–ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part III 14, pages 773–788. Springer, 2016. 2

  23. [23]

    Dc-tta: Divide-and-conquer framework for test-time adaptation of interactive segmenta- tion

    Jihun Kim, Hoyong Kwon, Hyeokjun Kweon, Wooseong Jeong, and Kuk-Jin Yoon. Dc-tta: Divide-and-conquer framework for test-time adaptation of interactive segmenta- tion. InProceedings of the IEEE/CVF International Confer- ence on Computer Vision, pages 23279–23289, 2025. 3

  24. [24]

    Segment any- thing

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 4015–4026, 2023. 2

  25. [25]

    From sam to cams: Ex- ploring segment anything model for weakly supervised se- mantic segmentation

    Hyeokjun Kweon and Kuk-Jin Yoon. From sam to cams: Ex- ploring segment anything model for weakly supervised se- mantic segmentation. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 19499–19509, 2024. 2

  26. [26]

    Wish: Weakly super- vised instance segmentation using heterogeneous labels

    Hyeokjun Kweon and Kuk-Jin Yoon. Wish: Weakly super- vised instance segmentation using heterogeneous labels. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 25377–25387, 2025. 2

  27. [27]

    Unlocking the potential of ordinary classifier: Class-specific adversarial erasing frame- work for weakly supervised semantic segmentation

    Hyeokjun Kweon, Sung-Hoon Yoon, Hyeonseong Kim, Daehee Park, and Kuk-Jin Yoon. Unlocking the potential of ordinary classifier: Class-specific adversarial erasing frame- work for weakly supervised semantic segmentation. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 6994–7003, 2021. 1

  28. [28]

    Weakly supervised semantic segmentation via adversarial learning of classifier and reconstructor

    Hyeokjun Kweon, Sung-Hoon Yoon, and Kuk-Jin Yoon. Weakly supervised semantic segmentation via adversarial learning of classifier and reconstructor. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11329–11339, 2023

  29. [29]

    Phase concentration and shortcut suppression for weakly supervised semantic segmentation

    Hoyong Kwon, Jaeseok Jeong, Sung-Hoon Yoon, and Kuk- Jin Yoon. Phase concentration and shortcut suppression for weakly supervised semantic segmentation. InEuropean Conference on Computer Vision, pages 293–312. Springer,

  30. [30]

    Gsvnet: Guided spatially-varying convolution for fast semantic seg- mentation on video

    Shih-Po Lee, Si-Cun Chen, and Wen-Hsiao Peng. Gsvnet: Guided spatially-varying convolution for fast semantic seg- mentation on video. In2021 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2021. 2

  31. [31]

    Video semantic segmentation via sparse temporal transformer

    Jiangtong Li, Wentao Wang, Junjie Chen, Li Niu, Jianlou Si, Chen Qian, and Liqing Zhang. Video semantic segmentation via sparse temporal transformer. InProceedings of the 29th ACM International Conference on Multimedia, pages 59–68,

  32. [32]

    Low-latency video se- mantic segmentation

    Yule Li, Jianping Shi, and Dahua Lin. Low-latency video se- mantic segmentation. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 5997– 6005, 2018. 2

  33. [33]

    Do we really need to access the source data? source hypothesis transfer for un- supervised domain adaptation

    Jian Liang, Dapeng Hu, and Jiashi Feng. Do we really need to access the source data? source hypothesis transfer for un- supervised domain adaptation. InInternational conference on machine learning, pages 6028–6039. PMLR, 2020. 3

  34. [34]

    Lsvos challenge 3rd place report: Sam2 and cutie based vos.arXiv preprint arXiv:2408.10469, 2024

    Xinyu Liu, Jing Zhang, Kexin Zhang, Xu Liu, and Lingling Li. Lsvos challenge 3rd place report: Sam2 and cutie based vos.arXiv preprint arXiv:2408.10469, 2024. 2

  35. [35]

    Efficient semantic video segmentation with per-frame inference

    Yifan Liu, Chunhua Shen, Changqian Yu, and Jingdong Wang. Efficient semantic video segmentation with per-frame inference. InComputer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceed- ings, Part X 16, pages 352–368. Springer, 2020. 2

  36. [36]

    Spatio-temporal pixel- level contrastive learning-based source-free domain adapta- tion for video semantic segmentation

    Shao-Yuan Lo, Poojan Oza, Sumanth Chennupati, Alejan- dro Galindo, and Vishal M Patel. Spatio-temporal pixel- level contrastive learning-based source-free domain adapta- tion for video semantic segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10534–10543, 2023. 2

  37. [37]

    Vspw: A large-scale dataset for video scene parsing in the wild

    Jiaxu Miao, Yunchao Wei, Yu Wu, Chen Liang, Guangrui Li, and Yi Yang. Vspw: A large-scale dataset for video scene parsing in the wild. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 4133–4143, 2021. 1, 2, 5, 6, 7, 8

  38. [38]

    Semantic video segmentation by gated recurrent flow propagation

    David Nilsson and Cristian Sminchisescu. Semantic video segmentation by gated recurrent flow propagation. InPro- ceedings of the IEEE conference on computer vision and pat- tern recognition, pages 6819–6828, 2018. 2

  39. [39]

    Towards stable test-time adaptation in dynamic wild world

    Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Zhiquan Wen, Yaofo Chen, Peilin Zhao, and Mingkui Tan. Towards stable test-time adaptation in dynamic wild world. InInternetional Conference on Learning Representations, 2023. 3

  40. [40]

    Video object segmentation via sam 2: The 4th solution for lsvos challenge vos track.arXiv preprint arXiv:2408.10125, 2024

    Feiyu Pan, Hao Fang, Runmin Cong, Wei Zhang, and Xi- ankai Lu. Video object segmentation via sam 2: The 4th solution for lsvos challenge vos track.arXiv preprint arXiv:2408.10125, 2024. 2

  41. [41]

    Real-time, accurate, and consistent video semantic segmentation via unsupervised adaptation and cross-unit de- ployment on mobile device

    Hyojin Park, Alan Yessenbayev, Tushar Singhal, Navin Ku- mar Adhikari, Yizhe Zhang, Shubhankar Mangesh Borse, Hong Cai, Nilesh Prasad Pandey, Fei Yin, Frank Mayer, et al. Real-time, accurate, and consistent video semantic segmentation via unsupervised adaptation and cross-unit de- ployment on mobile device. InProceedings of the IEEE/CVF Conference on Comp...

  42. [42]

    Local memory attention for fast video semantic segmentation

    Matthieu Paul, Martin Danelljan, Luc Van Gool, and Radu Timofte. Local memory attention for fast video semantic segmentation. In2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1102–1109. IEEE, 2021. 2

  43. [43]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024. 2, 5

  44. [44]

    Motion-state alignment for video semantic segmentation

    Jinming Su, Ruihong Yin, Shuaibin Zhang, and Junfeng Luo. Motion-state alignment for video semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 3571–3580, 2023. 2

  45. [45]

    Coarse-to-fine feature mining for video se- mantic segmentation

    Guolei Sun, Yun Liu, Henghui Ding, Thomas Probst, and Luc Van Gool. Coarse-to-fine feature mining for video se- mantic segmentation. Inproceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 3126–3137, 2022. 2, 5, 6, 8

  46. [46]

    Mining relations among cross-frame affinities for video semantic segmentation

    Guolei Sun, Yun Liu, Hao Tang, Ajad Chhatkuli, Le Zhang, and Luc Van Gool. Mining relations among cross-frame affinities for video semantic segmentation. InEuropean Con- ference on Computer Vision, pages 522–539. Springer, 2022. 2, 5

  47. [47]

    Learning local and global temporal contexts for video semantic segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

    Guolei Sun, Yun Liu, Henghui Ding, Min Wu, and Luc Van Gool. Learning local and global temporal contexts for video semantic segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 5, 6, 8

  48. [48]

    Tesla: Test-time self-learning with automatic adversarial augmentation

    Devavrat Tomar, Guillaume Vray, Behzad Bozorgtabar, and Jean-Philippe Thiran. Tesla: Test-time self-learning with automatic adversarial augmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20341–20350, 2023. 3

  49. [49]

    The 2nd solution for lsvos challenge rvos track: Spatial-temporal refinement for consistent semantic segmen- tation.arXiv preprint arXiv:2408.12447, 2024

    Tuyen Tran. The 2nd solution for lsvos challenge rvos track: Spatial-temporal refinement for consistent semantic segmen- tation.arXiv preprint arXiv:2408.12447, 2024. 2

  50. [50]

    Unsupervised semantic seg- mentation by contrasting object mask proposals

    Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, and Luc Van Gool. Unsupervised semantic seg- mentation by contrasting object mask proposals. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision, pages 10052–10062, 2021. 4

  51. [51]

    Multiple hypothesis video segmentation from superpixel flows

    Amelio Vazquez-Reina, Shai Avidan, Hanspeter Pfister, and Eric Miller. Multiple hypothesis video segmentation from superpixel flows. InComputer Vision–ECCV 2010: 11th Eu- ropean Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part V 11, pages 268–281. Springer, 2010. 2

  52. [52]

    On the road to online adaptation for semantic image segmentation

    Riccardo V olpi, Pau De Jorge, Diane Larlus, and Gabriela Csurka. On the road to online adaptation for semantic image segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19184– 19195, 2022. 3

  53. [53]

    Tent: Fully Test-time Adaptation by Entropy Minimization

    Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Ol- shausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization.arXiv preprint arXiv:2006.10726,

  54. [54]

    Temporal memory attention for video semantic segmentation

    Hao Wang, Weining Wang, and Jing Liu. Temporal memory attention for video semantic segmentation. In2021 IEEE International Conference on Image Processing (ICIP), pages 2254–2258. IEEE, 2021. 2

  55. [55]

    Continual test-time domain adaptation

    Qin Wang, Olga Fink, Luc Van Gool, and Dengxin Dai. Continual test-time domain adaptation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7201–7211, 2022. 3, 6, 7, 8

  56. [56]

    Dynamically instance- guided adaptation: A backward-free approach for test-time domain adaptive semantic segmentation

    Wei Wang, Zhun Zhong, Weijie Wang, Xi Chen, Charles Ling, Boyu Wang, and Nicu Sebe. Dynamically instance- guided adaptation: A backward-free approach for test-time domain adaptive semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24090–24099, 2023. 3

  57. [57]

    Continual test-time domain adaptation via dynamic sample selection

    Yanshuo Wang, Jie Hong, Ali Cheraghian, Shafin Rahman, David Ahmedt-Aristizabal, Lars Petersson, and Mehrtash Harandi. Continual test-time domain adaptation via dynamic sample selection. InProceedings of the IEEE/CVF Win- ter Conference on Applications of Computer Vision, pages 1701–1710, 2024. 3

  58. [58]

    Mask propagation for efficient video semantic segmentation

    Yuetian Weng, Mingfei Han, Haoyu He, Mingjie Li, Lina Yao, Xiaojun Chang, and Bohan Zhuang. Mask propagation for efficient video semantic segmentation. InNeurIPS, 2023. 2

  59. [59]

    Segformer: Simple and efficient design for semantic segmentation with transform- ers.Advances in neural information processing systems, 34: 12077–12090, 2021

    Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transform- ers.Advances in neural information processing systems, 34: 12077–12090, 2021. 1, 5, 6, 8

  60. [60]

    Xin Yang, Yan Wending, Michael Bi Mi, Yuan Yuan, and Robby Tan. End-to-end video semantic segmentation in adverse weather using fusion blocks and temporal-spatial teacher-student learning.Advances in Neural Information Processing Systems, 37:141000–141020, 2024. 2

  61. [61]

    Entitysam: Segment everything in video

    Mingqiao Ye, Seoung Wug Oh, Lei Ke, and Joon-Young Lee. Entitysam: Segment everything in video. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24234–24243, 2025. 2

  62. [62]

    Adversarial erasing frame- work via triplet with gated pyramid pooling layer for weakly supervised semantic segmentation

    Sung-Hoon Yoon, Hyeokjun Kweon, Jegyeong Cho, Shin- jeong Kim, and Kuk-Jin Yoon. Adversarial erasing frame- work via triplet with gated pyramid pooling layer for weakly supervised semantic segmentation. InEuropean conference on computer vision, pages 326–344. Springer, 2022. 1

  63. [63]

    Diffusion-guided weakly super- vised semantic segmentation

    Sung-Hoon Yoon, Hoyong Kwon, Jaeseok Jeong, Daehee Park, and Kuk-Jin Yoon. Diffusion-guided weakly super- vised semantic segmentation. InEuropean Conference on Computer Vision, pages 393–411. Springer, 2024

  64. [64]

    Class tokens infusion for weakly supervised semantic segmentation

    Sung-Hoon Yoon, Hoyong Kwon, Hyeonseong Kim, and Kuk-Jin Yoon. Class tokens infusion for weakly supervised semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3595–3605, 2024. 1

  65. [65]

    Sam 2 in robotic surgery: An empirical evaluation for ro- bustness and generalization in surgical video segmentation

    Jieming Yu, An Wang, Wenzhen Dong, Mengya Xu, Mo- barakol Islam, Jie Wang, Long Bai, and Hongliang Ren. Sam 2 in robotic surgery: An empirical evaluation for ro- bustness and generalization in surgical video segmentation. arXiv preprint arXiv:2408.04593, 2024. 2

  66. [66]

    Object- contextual representations for semantic segmentation

    Yuhui Yuan, Xilin Chen, and Jingdong Wang. Object- contextual representations for semantic segmentation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VI 16, pages 173–190. Springer, 2020. 1, 8

  67. [67]

    Towards underwater camouflaged ob- ject tracking: An experimental evaluation of sam and sam 2

    Chunhui Zhang, Li Liu, Guanjie Huang, Hao Wen, Xi Zhou, and Yanfeng Wang. Towards underwater camouflaged ob- ject tracking: An experimental evaluation of sam and sam 2. arXiv preprint arXiv:2409.16902, 2024. 2

  68. [68]

    Unleashing the potential of sam2 for biomedical images and videos: A survey.arXiv preprint arXiv:2408.12889, 2024

    Yichi Zhang and Zhenrong Shen. Unleashing the potential of sam2 for biomedical images and videos: A survey.arXiv preprint arXiv:2408.12889, 2024. 2

  69. [69]

    Auxadapt: Stable and efficient test-time adaptation for temporally consistent video semantic segmentation

    Yizhe Zhang, Shubhankar Borse, Hong Cai, and Fatih Porikli. Auxadapt: Stable and efficient test-time adaptation for temporally consistent video semantic segmentation. In Proceedings of the IEEE/CVF Winter Conference on Appli- cations of Computer Vision, pages 2339–2348, 2022. 3, 7

  70. [70]

    Scene parsing through ade20k dataset

    Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641,

  71. [71]

    When sam2 meets video camouflaged object segmentation: A comprehensive evaluation and adaptation

    Yuli Zhou, Guolei Sun, Yawei Li, Luca Benini, and Ender Konukoglu. When sam2 meets video camouflaged object segmentation: A comprehensive evaluation and adaptation. arXiv preprint arXiv:2409.18653, 2024. 2