pith. machine review for the scientific record.

arxiv: 2605.12006 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: no theorem link

Robust Promptable Video Object Segmentation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 07:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords robust promptable video object segmentation · MoGA · memory conditioning · video corruptions · adverse conditions · temporal consistency · benchmark · object-specific adaptation

The pith

A memory-conditioned adaptation technique enables promptable video object segmentation to remain accurate despite input corruptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that promptable video object segmentation (PVOS) can be made robust to real-world corruptions by introducing MoGA, a new adaptation method, together with a dedicated evaluation benchmark. It claims that storing object-specific representations in memory across frames lets the model adapt its processing to each tracked object's degradation individually while preserving temporal consistency in the output masks. If correct, this would allow PVOS models to work reliably in safety-critical settings where videos routinely contain noise, weather effects, or other degradations. To support the claimed gains, the authors create synthetic training data with temporally varying corruptions and evaluate on new real-world datasets containing 351 clips and over 2,500 object masks.

Core claim

The central contribution is the Memory-object-conditioned Gated-rank Adaptation (MoGA) method for robust promptable video object segmentation. MoGA uses object-specific representations stored in memory across video frames to condition the adaptation process. This lets the model treat degradations differently for each tracked object while ensuring temporal consistency in the segmentation results. Experiments on a new benchmark with real-world adverse conditions and synthetic data show notable performance gains across many corruption types.
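To make the mechanism concrete, here is a minimal sketch, in PyTorch, of what a gated-rank adapter conditioned on per-object memory could look like. This is an editorial illustration, not the authors' implementation: the class name, shapes, and the sigmoid gate are assumptions; only the overall pattern (a frozen base projection plus a low-rank residual whose per-rank gates are computed from an object pointer read from the memory bank) follows the description above.

```python
import torch
import torch.nn as nn

class MoGALinearSketch(nn.Module):
    """Hypothetical gated-rank adapter conditioned on per-object memory.

    Illustrative only: names, shapes, and the sigmoid gate are assumptions,
    not the paper's code. A frozen base layer receives a low-rank residual
    whose per-rank gates depend on the tracked object's memory pointer.
    """

    def __init__(self, base: nn.Linear, rank: int = 8, mem_dim: int = 256):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # keep pretrained weights fixed
            p.requires_grad = False
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)        # adapter starts as a no-op
        self.gate = nn.Sequential(nn.Linear(mem_dim, rank), nn.Sigmoid())

    def forward(self, x: torch.Tensor, obj_ptr: torch.Tensor) -> torch.Tensor:
        # x:       (batch, tokens, in_features) features of one tracked object
        # obj_ptr: (batch, mem_dim) object pointer read from the memory bank
        gates = self.gate(obj_ptr).unsqueeze(1)   # (batch, 1, rank)
        return self.base(x) + self.up(gates * self.down(x))
```

On this reading, two objects in the same frame can receive different adaptations because their memory pointers differ, while each object's gating evolves with its memory across frames, which is how the paper motivates both per-object handling and temporal consistency.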

What carries the argument

Memory-object-conditioned Gated-rank Adaptation (MoGA), a technique that conditions robustification on per-object memory representations to achieve temporally consistent handling of degradations.

If this is right

  • MoGA delivers consistent and significant performance gains across diverse corruption types on both synthetic and real-world data.
  • The memory mechanism allows object-specific adaptation without sacrificing frame-to-frame consistency in the segmented objects.
  • The new benchmark with 351 real-world clips provides a standardized way to measure robustness in promptable video object segmentation.
  • Training on temporally varying synthetic corruptions prepares models to handle the range of degradations seen in practical video inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same per-object memory conditioning could be transferred to improve robustness in neighboring tasks such as video instance segmentation or multi-object tracking.
  • Existing promptable video object segmentation models could be upgraded by inserting the MoGA adaptation layer with only modest additional training.
  • Expanding the benchmark to include more video domains and sensor types would test whether the observed gains hold under broader real-world distributions.

Load-bearing premise

The assumption that the synthetic corruptions applied to existing VOS datasets and the new real-world benchmark sufficiently represent the distribution of adverse conditions encountered in actual deployments.
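For illustration, a generation pipeline in that spirit might drift a corruption's severity smoothly within each clip rather than fixing it per video. The sketch below is an editorial assumption (the severity-parameterized corruption callable and the sinusoidal schedule are invented here), not the paper's recipe, which is exactly the detail the referee below asks to see documented.

```python
import numpy as np

def temporally_varying_corruption(frames, corrupt, period=60, max_severity=5.0):
    """Apply a corruption whose severity drifts over time (editorial sketch).

    frames:  iterable of HxWx3 uint8 arrays
    corrupt: callable(frame, severity) -> corrupted frame
    The sinusoidal schedule and severity range are assumptions.
    """
    out = []
    for t, frame in enumerate(frames):
        severity = max_severity * 0.5 * (1.0 + np.sin(2.0 * np.pi * t / period))
        out.append(corrupt(frame, severity))
    return out

# Toy corruption: additive Gaussian noise scaled by severity.
def gaussian_noise(frame, severity):
    noisy = frame.astype(np.float32) + np.random.normal(0.0, 5.0 * severity, frame.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

clip = [np.full((64, 64, 3), 128, dtype=np.uint8) for _ in range(120)]
corrupted_clip = temporally_varying_corruption(clip, gaussian_noise)
```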

What would settle it

A controlled test in which MoGA shows no improvement, or reduced accuracy, on a fresh collection of real-world videos containing corruption types absent from the benchmark would falsify the claim that the approach generalizes.

Figures

Figures reproduced from arXiv: 2605.12006 by Christos Sakaridis, Konrad Schindler, Lukas Hoyer, Sohyun Lee, Suha Kwak, Yeho Gwon.

Figure 1: Overview of (a) RobustPVOS and (b) our benchmark.

Figure 2: Example images and annotated object masks of the real-world evaluation dataset.

Figure 3: Overview of MoGA integrated into SAM2 [37]. Top: an example input video under adverse weather conditions. Frames with orange outlines are previously processed and stored in the memory bank, while the yellow outline indicates the current frame. Bottom left: MoGA modules are integrated into the memory attention mechanism of SAM2. Bottom right: our memory-conditioned gating mechanism where object pointers fro…

Figure 4: Qualitative results on the real-world corrupted sequences of our benchmark. Each color indicates tracked objects: red for vehicles

Figure 5: Visualization of gating masks over time.
Original abstract

The performance of promptable video object segmentation (PVOS) models substantially degrades under input corruptions, which prevents PVOS deployment in safety-critical domains. This paper offers the first comprehensive study on robust PVOS (RobustPVOS). We first construct a new, comprehensive benchmark with two real-world evaluation datasets of 351 video clips and more than 2,500 object masks under real-world adverse conditions. At the same time, we generate synthetic training data by applying diverse and temporally varying corruptions to existing VOS datasets. Moreover, we present a new RobustPVOS method, dubbed Memory-object-conditioned Gated-rank Adaptation (MoGA). The key to successfully performing RobustPVOS is two-fold: effectively handling object-specific degradation and ensuring temporal consistency in predictions. MoGA leverages object-specific representations maintained in memory across frames to condition the robustification process, which allows the model to handle each tracked object differently in a temporally consistent way. Extensive experiments on our benchmark validate MoGA's efficacy, showing consistent and significant improvements across diverse corruption types on both synthetic and real-world datasets, establishing a strong baseline for future RobustPVOS research. Our benchmark is publicly available at https://sohyun-l.github.io/RobustPVOS_project_page/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the first comprehensive benchmark and method for Robust Promptable Video Object Segmentation (RobustPVOS). It creates real-world evaluation datasets (351 clips, >2500 masks under adverse conditions) and synthetic training data via temporally varying corruptions on existing VOS datasets. The proposed MoGA method uses object-specific representations maintained in memory to condition gated-rank adaptation, enabling per-object handling of degradation while preserving temporal consistency. Experiments report consistent improvements across corruption types on both synthetic and real data, positioning MoGA as a baseline for future work.

Significance. If the robustness gains hold under scrutiny, the work is significant for enabling PVOS deployment in safety-critical applications. The public benchmark release and focus on real-world adverse conditions address a clear gap. The object-specific memory conditioning idea is a plausible mechanism for handling heterogeneous corruptions, and the dual synthetic/real evaluation setup strengthens the claims relative to purely synthetic robustness studies.

major comments (3)
  1. [Experiments] Experiments section: No ablation isolates the contribution of object-specific memory conditioning. The paper must include a controlled variant that disables or freezes the memory module while retaining gated-rank adaptation (and vice versa) to substantiate the claim that memory conditioning, rather than adaptation alone, drives the per-object robustness and temporal consistency gains.
  2. [Benchmark Construction] Benchmark and data generation: The description of how temporally varying corruptions are applied to create synthetic training data lacks sufficient detail (e.g., per-frame corruption schedules, parameter ranges, and correlation with object masks) to allow reproduction or to verify that the synthetic distribution meaningfully approximates the real-world benchmark.
  3. [Experiments] Real-world results: Improvements on the new real-world datasets are reported without error bars, standard deviations across runs, or statistical tests. Given the modest number of clips (351), this makes it difficult to assess whether the gains are robust or could arise from dataset-specific biases.
minor comments (2)
  1. [Method] Method section: Provide a clearer diagram or pseudocode for the memory conditioning and gating mechanism to improve reproducibility.
  2. [Abstract] Abstract and introduction: The phrase 'object-specific representations maintained in memory' should be defined at first use with a forward reference to the relevant equation or figure.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where revisions are needed, we will incorporate them in the next version of the manuscript to improve clarity, reproducibility, and statistical rigor.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: No ablation isolates the contribution of object-specific memory conditioning. The paper must include a controlled variant that disables or freezes the memory module while retaining gated-rank adaptation (and vice versa) to substantiate the claim that memory conditioning, rather than adaptation alone, drives the per-object robustness and temporal consistency gains.

    Authors: We agree that a controlled ablation isolating the object-specific memory conditioning is essential to substantiate its role. In the revised manuscript, we will add experiments that (i) disable the memory module while retaining gated-rank adaptation and (ii) freeze the adaptation parameters while retaining the memory module. These variants will be evaluated on both synthetic and real-world data to quantify the specific contributions to per-object robustness and temporal consistency (a minimal sketch of these variants appears after these responses). revision: yes

  2. Referee: [Benchmark Construction] Benchmark and data generation: The description of how temporally varying corruptions are applied to create synthetic training data lacks sufficient detail (e.g., per-frame corruption schedules, parameter ranges, and correlation with object masks) to allow reproduction or to verify that the synthetic distribution meaningfully approximates the real-world benchmark.

    Authors: We acknowledge the need for greater detail to ensure reproducibility. We will expand the data generation section to specify the per-frame corruption schedules, the exact parameter ranges and sampling distributions for each corruption type, and the procedure for correlating corruptions with object masks (including any mask-aware application rules). This will also clarify how the synthetic distribution was designed to approximate the real-world adverse conditions. revision: yes

  3. Referee: [Experiments] Real-world results: Improvements on the new real-world datasets are reported without error bars, standard deviations across runs, or statistical tests. Given the modest number of clips (351), this makes it difficult to assess whether the gains are robust or could arise from dataset-specific biases.

    Authors: We agree that reporting variability and statistical significance is important given the dataset size. In the revision, we will rerun all real-world experiments across multiple random seeds, report standard deviations and error bars, and include statistical tests (e.g., paired t-tests or Wilcoxon signed-rank tests) to assess the significance of MoGA's improvements over baselines. revision: yes
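As a concrete reading of response 1, the two promised variants amount to selectively freezing parts of an adapter like the sketch under "Core claim" above. The attribute names below mirror that sketch and are assumptions, not the authors' code.

```python
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    """Stop gradient updates for every parameter in the module."""
    for p in module.parameters():
        p.requires_grad = False

def make_ablation_variant(adapter: nn.Module, variant: str) -> nn.Module:
    """Editorial sketch of the two controlled variants promised in response 1.

    Assumes the adapter exposes `gate` (memory-conditioned gating) and
    `down`/`up` (the low-rank path), as in the MoGA sketch above.
    """
    if variant == "frozen_memory_conditioning":
        freeze(adapter.gate)   # (i) gated-rank adaptation stays trainable,
                               #     but the memory-driven gate stops adapting
    elif variant == "frozen_adaptation":
        freeze(adapter.down)   # (ii) memory conditioning stays trainable,
        freeze(adapter.up)     #      but the low-rank path is frozen
    else:
        raise ValueError(f"unknown variant: {variant}")
    return adapter
```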
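For response 3, clip-level paired tests are simple to run once per-clip scores are collected. A minimal example with the Wilcoxon signed-rank test follows; the scores are made-up placeholders, not results from the paper, and the real test would pair all 351 clips.

```python
from scipy.stats import wilcoxon

# Hypothetical per-clip J&F scores for MoGA and a baseline on the same clips
# (illustrative values only).
moga     = [71.2, 65.8, 80.1, 58.4, 74.9, 69.3, 77.0]
baseline = [68.5, 66.1, 76.3, 55.0, 73.2, 67.8, 74.1]

stat, p_value = wilcoxon(moga, baseline)  # paired, non-parametric test
print(f"Wilcoxon signed-rank: statistic={stat:.1f}, p={p_value:.4f}")
```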

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper introduces a new benchmark (synthetic corruptions on existing VOS data plus two new real-world datasets) and a novel architecture MoGA whose core components (object-specific memory conditioning and gated-rank adaptation) are defined independently of the evaluation metrics. Performance gains are reported via empirical comparison on held-out synthetic and real data rather than any quantity that is fitted or defined in terms of itself. No equations, self-citations, or uniqueness theorems are invoked that would reduce the central claim to a tautology or to a parameter fit performed inside the same paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the premise that object-specific memory representations can be maintained and used to condition adaptation without introducing new free parameters beyond standard training; no explicit free parameters, axioms, or invented entities are named in the abstract.

pith-pipeline@v0.9.0 · 5527 in / 1186 out tokens · 31400 ms · 2026-05-13T07:13:40.167129+00:00 · methodology


Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages

  [1] Yuang Ai, Huaibo Huang, and Ran He. Lora-ir: Taming low-rank experts for efficient all-in-one image restoration. arXiv preprint arXiv:2410.15385, 2024.
  [2] Anurag Bagchi, Zhipeng Bao, Yu-Xiong Wang, Pavel Tokmakov, and Martial Hebert. Refereverything: Towards segmenting everything we can speak of in videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23221–23231, 2025.
  [3] Qi Bi, Shaodi You, and Theo Gevers. Generalized foggy-scene semantic segmentation by frequency decoupling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 1389–1399, 2024.
  [4] Wei-Ting Chen, Yu-Jiet Vong, Sy-Yen Kuo, Sizhou Ma, and Jian Wang. RobustSAM: Segment anything robustly on degraded images. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  [5] Ho Kei Cheng, Seoung Wug Oh, Brian Price, Joon-Young Lee, and Alexander Schwing. Putting the object back into video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3151–3161, 2024.
  [6] Thomas De Min, Massimiliano Mancini, Karteek Alahari, Xavier Alameda-Pineda, and Elisa Ricci. On the effectiveness of layernorm tuning for continual learning in vision transformers. In 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 3577–3586. IEEE, 2023.
  [7] Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Philip HS Torr, and Song Bai. MOSE: A new dataset for video object segmentation in complex scenes. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
  [8] Shuangrui Ding, Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Yuwei Guo, Dahua Lin, and Jiaqi Wang. Sam2long: Enhancing sam 2 for long video segmentation with a training-free memory tree. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
  [9] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.
  [10] Volodymyr Fedynyak, Yaroslav Romanus, Bohdan Hlovatskyi, Bohdan Sydor, Oles Dobosevych, Igor Babin, and Roman Riazantsev. Devos: Flow-guided deformable transformer for video object segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 240–249, 2024.
  [11] Diandian Guo, Deng-Ping Fan, Tongyu Lu, Christos Sakaridis, and Luc Van Gool. Vanishing-point-guided video semantic segmentation of driving scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3544–3553, 2024.
  [12] Pinxue Guo, Wanyun Li, Hao Huang, Lingyi Hong, Xinyu Zhou, Zhaoyu Chen, Jinglun Li, Kaixun Jiang, Wei Zhang, and Wenqiang Zhang. X-prompt: Multi-modal visual prompt for video object segmentation. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 5151–5160, 2024.
  [13] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In Proc. International Conference on Learning Representations (ICLR), 2019.
  [14] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In Proc. International Conference on Learning Representations (ICLR), 2022.
  [15] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In Proc. International Conference on Learning Representations (ICLR), 2017.
  [16] Wei Ji, Jingjing Li, Cheng Bian, Zongwei Zhou, Jiaying Zhao, Alan L. Yuille, and Li Cheng. Multispectral video semantic segmentation: A benchmark dataset and baseline. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  [17] Lei Ke, Mingqiao Ye, Martin Danelljan, Yifan Liu, Yu-Wing Tai, Chi-Keung Tang, and Fisher Yu. Segment anything in high quality. In Proc. Neural Information Processing Systems (NeurIPS), 2023.
  [18] Taewoo Kim, Jeongmin Lee, Lin Wang, and Kuk-Jin Yoon. Event-guided deblurring of unknown exposure time videos. In European Conference on Computer Vision, pages 519–
  [19] Taeoh Kim, Jinhyung Kim, Minho Shim, Sangdoo Yun, Myunggu Kang, Dongyoon Wee, and Sangyoun Lee. Exploring temporally dynamic data augmentation for video recognition. In Proc. International Conference on Learning Representations (ICLR), 2023.
  [20] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
  [21] Sohyun Lee, Taeyoung Son, and Suha Kwak. Fifo: Learning fog-invariant features for foggy scene segmentation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  [22] Sohyun Lee, Jaesung Rim, Boseung Jeong, Geonu Kim, Byungju Woo, Haechan Lee, Sunghyun Cho, and Suha Kwak. Human pose estimation in extremely low-light conditions. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  [23] Sohyun Lee, Namyup Kim, Sungyeon Kim, and Suha Kwak. Frest: Feature restoration for semantic segmentation under multiple adverse conditions. In Proc. European Conference on Computer Vision (ECCV). Springer, 2024.
  [24] Sohyun Lee, Yeho Gwon, Lukas Hoyer, and Suha Kwak. GaRA-SAM: Robustifying segment anything model with gated-rank adaptation. In Proc. Neural Information Processing Systems (NeurIPS), 2025.
  [25] Sohyun Lee, Nayeong Kim, Juwon Kang, Seong Joon Oh, and Suha Kwak. Testdg: Test-time domain generalization for continual test-time adaptation. arXiv preprint arXiv:2504.04981, 2025.
  [26] Boyun Li, Xiao Liu, Peng Hu, Zhongqin Wu, Jiancheng Lv, and Xi Peng. All-In-One Image Restoration for Unknown Corruption. In IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, 2022.
  [27] Hebei Li, Jin Wang, Jiahui Yuan, Yue Li, Wenming Weng, Yansong Peng, Yueyi Zhang, Zhiwei Xiong, and Xiaoyan Sun. Event-assisted low-light video object segmentation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  [28] Minghan Li, Shuai Li, Xindong Zhang, and Lei Zhang. UniVS: Unified and universal video segmentation with prompts as queries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3227–3238, 2024.
  [29] Xin Li, Deshui Miao, Zhenyu He, Yaowei Wang, Huchuan Lu, and Ming-Hsuan Yang. Learning spatial-semantic features for robust video object segmentation. In Proc. International Conference on Learning Representations (ICLR),
  [30] Yang Liu, Yufei Yin, Chenchen Jing, Muzhi Zhu, Hao Chen, Yuling Xi, Bo Feng, Hao Wang, Shiyu Li, and Chunhua Shen. Unified open-world segmentation with multi-modal prompts. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
  [31] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In Proc. International Conference on Learning Representations (ICLR), 2019.
  [32] Timo Lüddecke and Alexander Ecker. Image segmentation using text and image prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7086–7096, 2022.
  [33] Haiyang Mei, Pengyu Zhang, and Mike Zheng Shou. Sam-i2v: Upgrading sam to support promptable video segmentation with less than 0.2% training cost. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 3417–3426, 2025.
  [34] Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  [35] Vaishnav Potlapalli, Syed Waqas Zamir, Salman Khan, and Fahad Khan. PromptIR: Prompting for all-in-one image restoration. In Proc. Neural Information Processing Systems (NeurIPS), 2023.
  [36] Wang Qi, Yu-Ping Ruan, Yuan Zuo, and Taihao Li. Parameter-efficient tuning on layer normalization for pre-trained language models. arXiv preprint arXiv:2211.08682, 2022.
  [37] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollar, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. In Proc. International...
  [38] Christos Sakaridis, Dengxin Dai, and Luc Van Gool. ACDC: The adverse conditions dataset with correspondences for semantic driving scene understanding. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
  [39] Christos Sakaridis, Haoran Wang, Ke Li, René Zurbrügg, Arpit Jadon, Wim Abbeloos, Daniel Olmeda Reino, Luc Van Gool, and Dengxin Dai. ACDC: The adverse conditions dataset with correspondences for robust semantic driving scene perception. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
  [40] Hongje Seong, Junhyuk Hyun, and Euntai Kim. Kernelized memory network for video object segmentation. In Proc. European Conference on Computer Vision (ECCV), 2020.
  [41] Taeyoung Son, Juwon Kang, Namyup Kim, Sunghyun Cho, and Suha Kwak. Urie: Universal image enhancement for visual recognition in the wild. In Proc. European Conference on Computer Vision (ECCV), 2020.
  [42] Pavel Tokmakov, Karteek Alahari, and Cordelia Schmid. Learning video object segmentation with visual memory. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV), 2017.
  [43] Pavel Tokmakov, Karteek Alahari, and Cordelia Schmid. Learning motion patterns in videos. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  [44] Yi-Hsuan Tsai, Ming-Hsuan Yang, and Michael J Black. Video segmentation via object flow. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  [45] Taha ValizadehAslani and Hualou Liang. Layernorm: A key component in parameter-efficient fine-tuning. arXiv preprint arXiv:2403.20284, 2024.
  [46] Yunyang Xiong, Chong Zhou, Xiaoyu Xiang, Lemeng Wu, Chenchen Zhu, Zechun Liu, Saksham Suri, Balakrishnan Varadarajan, Ramya Akula, Forrest Iandola, et al. Efficient track anything. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
  [47] Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas Huang. Youtube-vos: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327, 2018.
  [48] Bingchen Zhao, Haoqin Tu, Chen Wei, Jieru Mei, and Cihang Xie. Tuning layernorm in attention: Towards efficient multi-modal llm finetuning. arXiv preprint arXiv:2312.11420, 2023.
  [49] Junbao Zhou, Ziqi Pang, and Yu-Xiong Wang. Rmem: Restricted memory banks improve video object segmentation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  [50] Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, and Yong Jae Lee. Segment everything everywhere all at once. In Proc. Neural Information Processing Systems (NeurIPS), 2023.