pith. machine review for the scientific record.

arxiv: 2605.12006 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: no theorem link

Robust Promptable Video Object Segmentation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 07:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords robust promptable video object segmentation · MoGA · memory conditioning · video corruptions · adverse conditions · temporal consistency · benchmark · object-specific adaptation

The pith

A memory-conditioned adaptation technique enables promptable video object segmentation to remain accurate despite input corruptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that promptable video object segmentation (PVOS) can be made robust to real-world corruptions by introducing MoGA, a new adaptation method, together with a dedicated evaluation benchmark. It claims that storing object-specific representations in memory across frames lets the model adapt its processing to each tracked object's degradation individually while preserving temporal consistency in the output masks. If correct, this would allow PVOS models to work reliably in safety-critical settings where videos routinely contain noise, weather effects, or other degradations. To support the claimed gains, the authors create synthetic training data with temporally varying corruptions and evaluate on new real-world datasets containing 351 clips and over 2,500 object masks.

Core claim

The central contribution is the Memory-object-conditioned Gated-rank Adaptation (MoGA) method for robust promptable video object segmentation. MoGA uses object-specific representations stored in memory across video frames to condition the adaptation process. This lets the model treat degradations differently for each tracked object while ensuring temporal consistency in the segmentation results. Experiments on a new benchmark with real-world adverse conditions and synthetic data show notable performance gains across many corruption types.
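To make the mechanism concrete, here is a minimal sketch, in PyTorch, of what a gated-rank adapter conditioned on per-object memory could look like. This is an editorial illustration, not the authors' implementation: the class name, shapes, and the sigmoid gate are assumptions; only the overall pattern (a frozen base projection plus a low-rank residual whose per-rank gates are computed from an object pointer read from the memory bank) follows the description above.

```python
import torch
import torch.nn as nn

class MoGALinearSketch(nn.Module):
    """Hypothetical gated-rank adapter conditioned on per-object memory.

    Illustrative only: names, shapes, and the sigmoid gate are assumptions,
    not the paper's code. A frozen base layer receives a low-rank residual
    whose per-rank gates depend on the tracked object's memory pointer.
    """

    def __init__(self, base: nn.Linear, rank: int = 8, mem_dim: int = 256):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # keep pretrained weights fixed
            p.requires_grad = False
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)        # adapter starts as a no-op
        self.gate = nn.Sequential(nn.Linear(mem_dim, rank), nn.Sigmoid())

    def forward(self, x: torch.Tensor, obj_ptr: torch.Tensor) -> torch.Tensor:
        # x:       (batch, tokens, in_features) features of one tracked object
        # obj_ptr: (batch, mem_dim) object pointer read from the memory bank
        gates = self.gate(obj_ptr).unsqueeze(1)   # (batch, 1, rank)
        return self.base(x) + self.up(gates * self.down(x))
```

On this reading, two objects in the same frame can receive different adaptations because their memory pointers differ, while each object's gating evolves with its memory across frames, which is how the paper motivates both per-object handling and temporal consistency.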

What carries the argument

Memory-object-conditioned Gated-rank Adaptation (MoGA), a technique that conditions robustification on per-object memory representations to achieve temporally consistent handling of degradations.

If this is right

  • MoGA delivers consistent and significant performance gains across diverse corruption types on both synthetic and real-world data.
  • The memory mechanism allows object-specific adaptation without sacrificing frame-to-frame consistency in the segmented objects.
  • The new benchmark with 351 real-world clips provides a standardized way to measure robustness in promptable video object segmentation.
  • Training on temporally varying synthetic corruptions prepares models to handle the range of degradations seen in practical video inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same per-object memory conditioning could be transferred to improve robustness in neighboring tasks such as video instance segmentation or multi-object tracking.
  • Existing promptable video object segmentation models could be upgraded by inserting the MoGA adaptation layer with only modest additional training.
  • Expanding the benchmark to include more video domains and sensor types would test whether the observed gains hold under broader real-world distributions.

Load-bearing premise

The assumption that the synthetic corruptions applied to existing VOS datasets and the new real-world benchmark sufficiently represent the distribution of adverse conditions encountered in actual deployments.
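For illustration, a generation pipeline in that spirit might drift a corruption's severity smoothly within each clip rather than fixing it per video. The sketch below is an editorial assumption (the severity-parameterized corruption callable and the sinusoidal schedule are invented here), not the paper's recipe, which is exactly the detail the referee below asks to see documented.

```python
import numpy as np

def temporally_varying_corruption(frames, corrupt, period=60, max_severity=5.0):
    """Apply a corruption whose severity drifts over time (editorial sketch).

    frames:  iterable of HxWx3 uint8 arrays
    corrupt: callable(frame, severity) -> corrupted frame
    The sinusoidal schedule and severity range are assumptions.
    """
    out = []
    for t, frame in enumerate(frames):
        severity = max_severity * 0.5 * (1.0 + np.sin(2.0 * np.pi * t / period))
        out.append(corrupt(frame, severity))
    return out

# Toy corruption: additive Gaussian noise scaled by severity.
def gaussian_noise(frame, severity):
    noisy = frame.astype(np.float32) + np.random.normal(0.0, 5.0 * severity, frame.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

clip = [np.full((64, 64, 3), 128, dtype=np.uint8) for _ in range(120)]
corrupted_clip = temporally_varying_corruption(clip, gaussian_noise)
```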

What would settle it

A controlled test in which MoGA shows no improvement, or reduced accuracy, on a fresh collection of real-world videos containing corruption types absent from the benchmark would falsify the claim that the approach generalizes.

Figures

Figures reproduced from arXiv: 2605.12006 by Christos Sakaridis, Konrad Schindler, Lukas Hoyer, Sohyun Lee, Suha Kwak, Yeho Gwon.

Figure 1: Overview of (a) RobustPVOS and (b) our benchmark.

Figure 2: Example images and annotated object masks of the real-world evaluation dataset.

Figure 3: Overview of MoGA integrated into SAM2 [37]. Top: an example input video under adverse weather conditions. Frames with orange outlines are previously processed and stored in the memory bank, while the yellow outline indicates the current frame. Bottom left: MoGA modules are integrated into the memory attention mechanism of SAM2. Bottom right: our memory-conditioned gating mechanism where object pointers fro…

Figure 4: Qualitative results on the real-world corrupted sequences of our benchmark. Each color indicates tracked objects: red for vehicles

Figure 5: Visualization of gating masks over time.
Original abstract

The performance of promptable video object segmentation (PVOS) models substantially degrades under input corruptions, which prevents PVOS deployment in safety-critical domains. This paper offers the first comprehensive study on robust PVOS (RobustPVOS). We first construct a new, comprehensive benchmark with two real-world evaluation datasets of 351 video clips and more than 2,500 object masks under real-world adverse conditions. At the same time, we generate synthetic training data by applying diverse and temporally varying corruptions to existing VOS datasets. Moreover, we present a new RobustPVOS method, dubbed Memory-object-conditioned Gated-rank Adaptation (MoGA). The key to successfully performing RobustPVOS is two-fold: effectively handling object-specific degradation and ensuring temporal consistency in predictions. MoGA leverages object-specific representations maintained in memory across frames to condition the robustification process, which allows the model to handle each tracked object differently in a temporally consistent way. Extensive experiments on our benchmark validate MoGA's efficacy, showing consistent and significant improvements across diverse corruption types on both synthetic and real-world datasets, establishing a strong baseline for future RobustPVOS research. Our benchmark is publicly available at https://sohyun-l.github.io/RobustPVOS_project_page/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the first comprehensive benchmark and method for Robust Promptable Video Object Segmentation (RobustPVOS). It creates real-world evaluation datasets (351 clips, >2500 masks under adverse conditions) and synthetic training data via temporally varying corruptions on existing VOS datasets. The proposed MoGA method uses object-specific representations maintained in memory to condition gated-rank adaptation, enabling per-object handling of degradation while preserving temporal consistency. Experiments report consistent improvements across corruption types on both synthetic and real data, positioning MoGA as a baseline for future work.

Significance. If the robustness gains hold under scrutiny, the work is significant for enabling PVOS deployment in safety-critical applications. The public benchmark release and focus on real-world adverse conditions address a clear gap. The object-specific memory conditioning idea is a plausible mechanism for handling heterogeneous corruptions, and the dual synthetic/real evaluation setup strengthens the claims relative to purely synthetic robustness studies.

major comments (3)
  1. [Experiments] Experiments section: No ablation isolates the contribution of object-specific memory conditioning. The paper must include a controlled variant that disables or freezes the memory module while retaining gated-rank adaptation (and vice versa) to substantiate the claim that memory conditioning, rather than adaptation alone, drives the per-object robustness and temporal consistency gains.
  2. [Benchmark Construction] Benchmark and data generation: The description of how temporally varying corruptions are applied to create synthetic training data lacks sufficient detail (e.g., per-frame corruption schedules, parameter ranges, and correlation with object masks) to allow reproduction or to verify that the synthetic distribution meaningfully approximates the real-world benchmark.
  3. [Experiments] Real-world results: Improvements on the new real-world datasets are reported without error bars, standard deviations across runs, or statistical tests. Given the modest number of clips (351), this makes it difficult to assess whether the gains are robust or could arise from dataset-specific biases.
minor comments (2)
  1. [Method] Method section: Provide a clearer diagram or pseudocode for the memory conditioning and gating mechanism to improve reproducibility.
  2. [Abstract] Abstract and introduction: The phrase 'object-specific representations maintained in memory' should be defined at first use with a forward reference to the relevant equation or figure.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where revisions are needed, we will incorporate them in the next version of the manuscript to improve clarity, reproducibility, and statistical rigor.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: No ablation isolates the contribution of object-specific memory conditioning. The paper must include a controlled variant that disables or freezes the memory module while retaining gated-rank adaptation (and vice versa) to substantiate the claim that memory conditioning, rather than adaptation alone, drives the per-object robustness and temporal consistency gains.

    Authors: We agree that a controlled ablation isolating the object-specific memory conditioning is essential to substantiate its role. In the revised manuscript, we will add experiments that (i) disable the memory module while retaining gated-rank adaptation and (ii) freeze the adaptation parameters while retaining the memory module. These variants will be evaluated on both synthetic and real-world data to quantify the specific contributions to per-object robustness and temporal consistency (a minimal sketch of these variants appears after these responses). revision: yes

  2. Referee: [Benchmark Construction] Benchmark and data generation: The description of how temporally varying corruptions are applied to create synthetic training data lacks sufficient detail (e.g., per-frame corruption schedules, parameter ranges, and correlation with object masks) to allow reproduction or to verify that the synthetic distribution meaningfully approximates the real-world benchmark.

    Authors: We acknowledge the need for greater detail to ensure reproducibility. We will expand the data generation section to specify the per-frame corruption schedules, the exact parameter ranges and sampling distributions for each corruption type, and the procedure for correlating corruptions with object masks (including any mask-aware application rules). This will also clarify how the synthetic distribution was designed to approximate the real-world adverse conditions. revision: yes

  3. Referee: [Experiments] Real-world results: Improvements on the new real-world datasets are reported without error bars, standard deviations across runs, or statistical tests. Given the modest number of clips (351), this makes it difficult to assess whether the gains are robust or could arise from dataset-specific biases.

    Authors: We agree that reporting variability and statistical significance is important given the dataset size. In the revision, we will rerun all real-world experiments across multiple random seeds, report standard deviations and error bars, and include statistical tests (e.g., paired t-tests or Wilcoxon signed-rank tests) to assess the significance of MoGA's improvements over baselines. revision: yes
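As a concrete reading of response 1, the two promised variants amount to selectively freezing parts of an adapter like the sketch under "Core claim" above. The attribute names below mirror that sketch and are assumptions, not the authors' code.

```python
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    """Stop gradient updates for every parameter in the module."""
    for p in module.parameters():
        p.requires_grad = False

def make_ablation_variant(adapter: nn.Module, variant: str) -> nn.Module:
    """Editorial sketch of the two controlled variants promised in response 1.

    Assumes the adapter exposes `gate` (memory-conditioned gating) and
    `down`/`up` (the low-rank path), as in the MoGA sketch above.
    """
    if variant == "frozen_memory_conditioning":
        freeze(adapter.gate)   # (i) gated-rank adaptation stays trainable,
                               #     but the memory-driven gate stops adapting
    elif variant == "frozen_adaptation":
        freeze(adapter.down)   # (ii) memory conditioning stays trainable,
        freeze(adapter.up)     #      but the low-rank path is frozen
    else:
        raise ValueError(f"unknown variant: {variant}")
    return adapter
```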
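For response 3, clip-level paired tests are simple to run once per-clip scores are collected. A minimal example with the Wilcoxon signed-rank test follows; the scores are made-up placeholders, not results from the paper, and the real test would pair all 351 clips.

```python
from scipy.stats import wilcoxon

# Hypothetical per-clip J&F scores for MoGA and a baseline on the same clips
# (illustrative values only).
moga     = [71.2, 65.8, 80.1, 58.4, 74.9, 69.3, 77.0]
baseline = [68.5, 66.1, 76.3, 55.0, 73.2, 67.8, 74.1]

stat, p_value = wilcoxon(moga, baseline)  # paired, non-parametric test
print(f"Wilcoxon signed-rank: statistic={stat:.1f}, p={p_value:.4f}")
```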

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper introduces a new benchmark (synthetic corruptions on existing VOS data plus two new real-world datasets) and a novel architecture MoGA whose core components (object-specific memory conditioning and gated-rank adaptation) are defined independently of the evaluation metrics. Performance gains are reported via empirical comparison on held-out synthetic and real data rather than any quantity that is fitted or defined in terms of itself. No equations, self-citations, or uniqueness theorems are invoked that would reduce the central claim to a tautology or to a parameter fit performed inside the same paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the premise that object-specific memory representations can be maintained and used to condition adaptation without introducing new free parameters beyond standard training; no explicit free parameters, axioms, or invented entities are named in the abstract.

pith-pipeline@v0.9.0 · 5527 in / 1186 out tokens · 31400 ms · 2026-05-13T07:13:40.167129+00:00 · methodology


Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages

  [1] Yuang Ai, Huaibo Huang, and Ran He. Lora-ir: Taming low-rank experts for efficient all-in-one image restoration. arXiv preprint arXiv:2410.15385, 2024.
  [2] Anurag Bagchi, Zhipeng Bao, Yu-Xiong Wang, Pavel Tokmakov, and Martial Hebert. Refereverything: Towards segmenting everything we can speak of in videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23221–23231, 2025.
  [3] Qi Bi, Shaodi You, and Theo Gevers. Generalized foggy-scene semantic segmentation by frequency decoupling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 1389–1399, 2024.
  [4] Wei-Ting Chen, Yu-Jiet Vong, Sy-Yen Kuo, Sizhou Ma, and Jian Wang. RobustSAM: Segment anything robustly on degraded images. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  [5] Ho Kei Cheng, Seoung Wug Oh, Brian Price, Joon-Young Lee, and Alexander Schwing. Putting the object back into video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3151–3161, 2024.
  [6] Thomas De Min, Massimiliano Mancini, Karteek Alahari, Xavier Alameda-Pineda, and Elisa Ricci. On the effectiveness of layernorm tuning for continual learning in vision transformers. In 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 3577–3586. IEEE, 2023.
  [7] Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Philip HS Torr, and Song Bai. MOSE: A new dataset for video object segmentation in complex scenes. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
  [8] Shuangrui Ding, Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Yuwei Guo, Dahua Lin, and Jiaqi Wang. Sam2long: Enhancing sam 2 for long video segmentation with a training-free memory tree. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
  [9] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.
  [10] Volodymyr Fedynyak, Yaroslav Romanus, Bohdan Hlovatskyi, Bohdan Sydor, Oles Dobosevych, Igor Babin, and Roman Riazantsev. Devos: Flow-guided deformable transformer for video object segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 240–249, 2024.
  [11] Diandian Guo, Deng-Ping Fan, Tongyu Lu, Christos Sakaridis, and Luc Van Gool. Vanishing-point-guided video semantic segmentation of driving scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3544–3553, 2024.
  [12] Pinxue Guo, Wanyun Li, Hao Huang, Lingyi Hong, Xinyu Zhou, Zhaoyu Chen, Jinglun Li, Kaixun Jiang, Wei Zhang, and Wenqiang Zhang. X-prompt: Multi-modal visual prompt for video object segmentation. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 5151–5160, 2024.
  [13] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In Proc. International Conference on Learning Representations (ICLR), 2019.
  [14] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In Proc. International Conference on Learning Representations (ICLR), 2022.
  [15] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In Proc. International Conference on Learning Representations (ICLR), 2017.
  [16] Wei Ji, Jingjing Li, Cheng Bian, Zongwei Zhou, Jiaying Zhao, Alan L. Yuille, and Li Cheng. Multispectral video semantic segmentation: A benchmark dataset and baseline. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  [17] Lei Ke, Mingqiao Ye, Martin Danelljan, Yifan Liu, Yu-Wing Tai, Chi-Keung Tang, and Fisher Yu. Segment anything in high quality. In Proc. Neural Information Processing Systems (NeurIPS), 2023.
  [18] Taewoo Kim, Jeongmin Lee, Lin Wang, and Kuk-Jin Yoon. Event-guided deblurring of unknown exposure time videos. In European Conference on Computer Vision, pages 519–
  [19] Taeoh Kim, Jinhyung Kim, Minho Shim, Sangdoo Yun, Myunggu Kang, Dongyoon Wee, and Sangyoun Lee. Exploring temporally dynamic data augmentation for video recognition. In Proc. International Conference on Learning Representations (ICLR), 2023.
  [20] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
  [21] Sohyun Lee, Taeyoung Son, and Suha Kwak. Fifo: Learning fog-invariant features for foggy scene segmentation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  [22] Sohyun Lee, Jaesung Rim, Boseung Jeong, Geonu Kim, Byungju Woo, Haechan Lee, Sunghyun Cho, and Suha Kwak. Human pose estimation in extremely low-light conditions. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  [23] Sohyun Lee, Namyup Kim, Sungyeon Kim, and Suha Kwak. Frest: Feature restoration for semantic segmentation under multiple adverse conditions. In Proc. European Conference on Computer Vision (ECCV). Springer, 2024.
  [24] Sohyun Lee, Yeho Gwon, Lukas Hoyer, and Suha Kwak. GaRA-SAM: Robustifying segment anything model with gated-rank adaptation. In Proc. Neural Information Processing Systems (NeurIPS), 2025.
  [25] Sohyun Lee, Nayeong Kim, Juwon Kang, Seong Joon Oh, and Suha Kwak. Testdg: Test-time domain generalization for continual test-time adaptation. arXiv preprint arXiv:2504.04981, 2025.
  [26] Boyun Li, Xiao Liu, Peng Hu, Zhongqin Wu, Jiancheng Lv, and Xi Peng. All-In-One Image Restoration for Unknown Corruption. In IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, 2022.
  [27] Hebei Li, Jin Wang, Jiahui Yuan, Yue Li, Wenming Weng, Yansong Peng, Yueyi Zhang, Zhiwei Xiong, and Xiaoyan Sun. Event-assisted low-light video object segmentation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  [28] Minghan Li, Shuai Li, Xindong Zhang, and Lei Zhang. UniVS: Unified and universal video segmentation with prompts as queries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3227–3238, 2024.
  [29] Xin Li, Deshui Miao, Zhenyu He, Yaowei Wang, Huchuan Lu, and Ming-Hsuan Yang. Learning spatial-semantic features for robust video object segmentation. In Proc. International Conference on Learning Representations (ICLR),
  [30] Yang Liu, Yufei Yin, Chenchen Jing, Muzhi Zhu, Hao Chen, Yuling Xi, Bo Feng, Hao Wang, Shiyu Li, and Chunhua Shen. Unified open-world segmentation with multi-modal prompts. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
  [31] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In Proc. International Conference on Learning Representations (ICLR), 2019.
  [32] Timo Lüddecke and Alexander Ecker. Image segmentation using text and image prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7086–7096, 2022.
  [33] Haiyang Mei, Pengyu Zhang, and Mike Zheng Shou. Sam-i2v: Upgrading sam to support promptable video segmentation with less than 0.2% training cost. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 3417–3426, 2025.
  [34] Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  [35] Vaishnav Potlapalli, Syed Waqas Zamir, Salman Khan, and Fahad Khan. PromptIR: Prompting for all-in-one image restoration. In Proc. Neural Information Processing Systems (NeurIPS), 2023.
  [36] Wang Qi, Yu-Ping Ruan, Yuan Zuo, and Taihao Li. Parameter-efficient tuning on layer normalization for pre-trained language models. arXiv preprint arXiv:2211.08682, 2022.
  [37] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollar, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. In Proc. International...
  [38] Christos Sakaridis, Dengxin Dai, and Luc Van Gool. ACDC: The adverse conditions dataset with correspondences for semantic driving scene understanding. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
  [39] Christos Sakaridis, Haoran Wang, Ke Li, René Zurbrügg, Arpit Jadon, Wim Abbeloos, Daniel Olmeda Reino, Luc Van Gool, and Dengxin Dai. ACDC: The adverse conditions dataset with correspondences for robust semantic driving scene perception. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
  [40] Hongje Seong, Junhyuk Hyun, and Euntai Kim. Kernelized memory network for video object segmentation. In Proc. European Conference on Computer Vision (ECCV), 2020.
  [41] Taeyoung Son, Juwon Kang, Namyup Kim, Sunghyun Cho, and Suha Kwak. Urie: Universal image enhancement for visual recognition in the wild. In Proc. European Conference on Computer Vision (ECCV), 2020.
  [42] Pavel Tokmakov, Karteek Alahari, and Cordelia Schmid. Learning video object segmentation with visual memory. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV), 2017.
  [43] Pavel Tokmakov, Karteek Alahari, and Cordelia Schmid. Learning motion patterns in videos. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  [44] Yi-Hsuan Tsai, Ming-Hsuan Yang, and Michael J Black. Video segmentation via object flow. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  [45] Taha ValizadehAslani and Hualou Liang. Layernorm: A key component in parameter-efficient fine-tuning. arXiv preprint arXiv:2403.20284, 2024.
  [46] Yunyang Xiong, Chong Zhou, Xiaoyu Xiang, Lemeng Wu, Chenchen Zhu, Zechun Liu, Saksham Suri, Balakrishnan Varadarajan, Ramya Akula, Forrest Iandola, et al. Efficient track anything. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
  [47] Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas Huang. Youtube-vos: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327, 2018.
  [48] Bingchen Zhao, Haoqin Tu, Chen Wei, Jieru Mei, and Cihang Xie. Tuning layernorm in attention: Towards efficient multi-modal llm finetuning. arXiv preprint arXiv:2312.11420, 2023.
  [49] Junbao Zhou, Ziqi Pang, and Yu-Xiong Wang. Rmem: Restricted memory banks improve video object segmentation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  [50] Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, and Yong Jae Lee. Segment everything everywhere all at once. In Proc. Neural Information Processing Systems (NeurIPS), 2023.