Pith · machine review for the scientific record

arxiv: 2605.11760 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: 2 theorem links


M⁴-SAM: Multi-Modal Mixture-of-Experts with Memory-Augmented SAM for RGB-D Video Salient Object Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:05 UTC · model grok-4.3

classification 💻 cs.CV
keywords RGB-D video salient object detection · SAM2 adaptation · mixture of experts · LoRA fine-tuning · gated feature fusion · memory augmentation · zero-shot segmentation

The pith

M⁴-SAM adapts SAM2 for RGB-D video salient object detection using modality-aware experts, gated multi-scale fusion, and prompt-free memory initialization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to extend the Segment Anything Model 2 to RGB-D video salient object detection by fixing three specific limitations of direct application. It introduces Modality-Aware MoE-LORA to handle multi-modal spatial modeling, Gated Multi-Level Feature Fusion to combine encoder scales effectively, and Pseudo-Guided Initialization to run without manual prompts. If these changes work as described, the model delivers state-of-the-art results across all metrics on three public RGB-D VSOD datasets while operating in a zero-shot setting.

Core claim

M⁴-SAM equips SAM2 with Modality-Aware MoE-LORA that uses convolutional experts and a modality dispatcher for efficient fine-tuning, Gated Multi-Level Feature Fusion that hierarchically aggregates multi-scale features via adaptive gating, and Pseudo-Guided Initialization that bootstraps the memory bank from a coarse mask, enabling effective prompt-free RGB-D VSOD and state-of-the-art performance on three public datasets.

What carries the argument

The three integrated components of M⁴-SAM: Modality-Aware MoE-LORA for spatial and multi-modal adaptation, Gated Multi-Level Feature Fusion for hierarchical feature balancing, and Pseudo-Guided Initialization for prompt-free memory setup.
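As a concreteness aid, the first component's core idea (a low-rank-style residual computed by convolutional experts and routed by a modality dispatcher) can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: the expert count, kernel size, and the dispatcher-as-per-modality-logits form are assumptions.

```python
import numpy as np

def conv3x3(x, k):
    # naive 'same' 3x3 convolution on a single-channel map (H, W)
    H, W = x.shape
    padded = np.pad(x, 1)
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * k)
    return out

rng = np.random.default_rng(0)
n_experts = 4
experts = [rng.normal(scale=0.1, size=(3, 3)) for _ in range(n_experts)]
# hypothetical modality dispatcher: one learned logit vector per modality
dispatch_logits = {"rgb": rng.normal(size=n_experts),
                   "depth": rng.normal(size=n_experts)}

def moe_lora_update(x, modality):
    # softmax over experts, selected by the input modality
    z = dispatch_logits[modality]
    w = np.exp(z - z.max())
    w /= w.sum()
    # weighted sum of convolutional expert outputs = spatially aware residual
    return sum(wi * conv3x3(x, k) for wi, k in zip(w, experts))

x = rng.normal(size=(8, 8))
frozen_out = x  # stand-in for the frozen SAM2 encoder layer
adapted = frozen_out + moe_lora_update(x, "depth")
```

The point of the sketch is the routing: the frozen backbone output is unchanged, and only a small modality-conditioned residual is trained, which is what makes the fine-tuning parameter-efficient.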

If this is right

  • RGB-D video salient object detection becomes feasible in a fully prompt-free manner using only a coarse mask to seed the memory bank.
  • Convolutional experts inside MoE-LoRA provide stronger local spatial priors than standard linear LoRA for video tasks involving depth.
  • Adaptive gating in multi-level fusion lets SAM2 features trade off detail and semantics without manual scale selection.
  • The overall design supports direct transfer to other multi-modal video segmentation tasks that currently rely on SAM2.
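The adaptive gating in the third bullet reduces, in its simplest form, to a per-pixel convex combination of a detail-rich scale and a semantic scale, folded in hierarchically from coarse to fine. A minimal sketch; `w_gate` and `b_gate` are hypothetical placeholders for learned gate parameters, and the scalar-gate form is an assumption:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fuse(f_fine, f_coarse, w_gate, b_gate):
    # per-pixel gate deciding how much detail vs. semantics to keep
    g = sigmoid(w_gate[0] * f_fine + w_gate[1] * f_coarse + b_gate)
    return g * f_fine + (1.0 - g) * f_coarse

rng = np.random.default_rng(1)
f_fine = rng.normal(size=(8, 8))    # high-resolution encoder scale (upsampled)
f_coarse = rng.normal(size=(8, 8))  # deep, semantic scale
fused = gated_fuse(f_fine, f_coarse, w_gate=(0.5, 0.5), b_gate=0.0)

# hierarchical aggregation: fold the scales in one level at a time
levels = [rng.normal(size=(8, 8)) for _ in range(3)]
agg = levels[0]
for f in levels[1:]:
    agg = gated_fuse(f, agg, w_gate=(0.5, 0.5), b_gate=0.0)
```

Because the gate is a sigmoid, each fused pixel lies between the two inputs, so no scale can be silently discarded; the trade-off is learned rather than hand-picked.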

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Memory banks seeded by coarse priors could reduce prompt engineering needs in other SAM-based video applications such as tracking or change detection.
  • The modality dispatcher mechanism might generalize to additional input types like thermal or event data if retrained on mixed datasets.
  • Hierarchical gated fusion may improve performance in any SAM2 downstream task where multi-scale encoder outputs are currently underutilized.
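The memory-seeding idea in the first bullet can be made concrete: threshold a coarse saliency score into a pseudo mask and store mask-weighted first-frame features as the initial memory entry, in place of a user prompt. A sketch under assumed shapes; the paper's actual memory format is not specified here:

```python
import numpy as np

def pseudo_guided_init(features, coarse_score, thresh=0.5):
    # seed a memory bank from a coarse saliency prior instead of a manual prompt
    pseudo_mask = (coarse_score > thresh).astype(features.dtype)  # (H, W), binary
    # store mask-weighted features as the first memory entry
    memory_bank = [features * pseudo_mask[None, :, :]]
    return pseudo_mask, memory_bank

rng = np.random.default_rng(2)
feats = rng.normal(size=(16, 8, 8))  # (C, H, W) first-frame features
coarse = rng.uniform(size=(8, 8))    # output of any cheap saliency prior
mask, bank = pseudo_guided_init(feats, coarse)
```

Subsequent frames would then read from and append to `bank` exactly as in prompt-driven memory-based segmentation; only the first entry's provenance changes.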

Load-bearing premise

That the three added components directly solve the listed challenges of linear LoRA, multi-scale underuse, and prompt dependence, and that the benchmark comparisons reflect genuine gains rather than dataset-specific tuning.

What would settle it

An ablation experiment on the same three datasets that removes any one of the three components and shows performance falling to or below existing RGB-D VSOD baselines would falsify the claim that the full combination is necessary for the reported gains.
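The leave-one-out protocol described above is mechanical to express: enumerate the variants with one component removed and report the performance drop against the full model. In the sketch below, `eval_variant` is a placeholder for actually training and benchmarking each variant on the three datasets; the scores are illustrative, not real results:

```python
components = ["MoE-LoRA", "GMLFF", "PGI"]

def eval_variant(active):
    # placeholder score: stands in for training + benchmarking the variant;
    # real values would come from the three RGB-D VSOD datasets
    return 0.80 + 0.03 * len(active)

full = eval_variant(set(components))
for dropped in components:
    variant = set(components) - {dropped}
    print(f"without {dropped}: drop = {full - eval_variant(variant):.3f}")
```

If any single removal left performance at or above existing baselines, the necessity claim for the full combination would fail even while the headline numbers stood.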

Figures

Figures reproduced from arXiv: 2605.11760 by Deyang Liu, Jia Lin, Jiyuan Liu, Runmin Cong, Xiaofei Zhou, Zhi Liu.

Figure 1. The overall architecture of the proposed Multi-Modal Mixture-of-Experts with Memory-Augmented SAM (M⁴-SAM).
Figure 2. The Gated Multi-Level Feature Fusion module.
Figure 3. Qualitative comparison with representative state-of-the-art models.
original abstract

The Segment Anything Model 2 (SAM2) has emerged as a foundation model for universal segmentation. Owing to its generalizable visual representations, SAM2 has been successfully applied to various downstream tasks. However, extending SAM2 to the RGB-D video salient object detection (RGB-D VSOD) task encounters three challenges including limited spatial modeling of linear LoRA, insufficient employment of SAM's multi-scale features, and dependence of initialization on explicit prompts. To address the issues, we present Multi-Modal Mixture-of-Experts with Memory-Augmented SAM (M$^4$-SAM), which equips SAM2 with modality-related PEFT, hierarchical feature fusion, and prompt-free memory initialization. Firstly, we inject Modality-Aware MoE-LORA, which employs convolutional experts to encode local spatial priors and introduces a modality dispatcher for efficient multi-modal fine-tuning, into SAM2's encoder. Secondly, we deploy Gated Multi-Level Feature Fusion, which hierarchically aggregates multi-scale encoder features with an adaptive gating mechanism, to balance spatial details and semantic context. Finally, to conduct zero-shot VSOD without manual prompts, we utilize a Pseudo-Guided Initialization, where a coarse mask is regarded as a pseudo prior and used to bootstrap the memory bank. Extensive experiments demonstrate that M$^4$-SAM achieves the state-of-the-art performance across all evaluation metrics on three public RGB-D VSOD datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper presents M⁴-SAM, an extension of SAM2 for RGB-D video salient object detection. It identifies three challenges (limited spatial modeling in linear LoRA, insufficient multi-scale feature use, and prompt dependence) and proposes three components to address them: Modality-Aware MoE-LORA (convolutional experts plus modality dispatcher for multi-modal PEFT), Gated Multi-Level Feature Fusion (adaptive hierarchical aggregation of encoder features), and Pseudo-Guided Initialization (coarse-mask bootstrapping of the memory bank for zero-shot operation). The manuscript claims this yields state-of-the-art results across all metrics on three public RGB-D VSOD datasets.

Significance. If the performance claims and component contributions are substantiated, the work would offer a concrete recipe for adapting SAM2-style foundation models to multi-modal video tasks, particularly by injecting spatial priors, gated multi-scale fusion, and prompt-free memory. This could influence downstream applications in video segmentation where depth and RGB must be jointly modeled without manual prompts.

major comments (3)
  1. [Abstract] The SOTA claim across all metrics on three datasets is stated without any quantitative tables, specific metric values, baseline numbers, or statistical significance tests, so the central empirical assertion cannot be evaluated or reproduced from the manuscript text.
  2. [Method] Method sections describing the three components: No ablation studies isolate the contribution of Modality-Aware MoE-LORA, Gated Multi-Level Feature Fusion, or Pseudo-Guided Initialization to the stated challenges, which is required to attribute any gains to the proposed innovations rather than other factors.
  3. [Experiments] The manuscript supplies no implementation details, training protocols, hyper-parameter settings, dataset splits, or confirmation that baselines were re-implemented under identical conditions, leaving open the possibility that reported improvements arise from experimental confounds rather than the architectural changes.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The three major comments identify areas where the manuscript can be strengthened for clarity, reproducibility, and attribution of results. We address each point below and will incorporate the suggested improvements in the revised version.

point-by-point responses
  1. Referee: [Abstract] The SOTA claim across all metrics on three datasets is stated without any quantitative tables, specific metric values, baseline numbers, or statistical significance tests, so the central empirical assertion cannot be evaluated or reproduced from the manuscript text.

    Authors: We agree that the abstract would be more informative with concrete numbers. In the revision we will add the key quantitative results (maximum F-measure, S-measure, mean absolute error) achieved by M⁴-SAM together with the strongest baseline on each of the three RGB-D VSOD datasets. This will allow readers to directly assess the magnitude of the reported gains without needing to consult the tables. revision: yes

  2. Referee: [Method] Method sections describing the three components: No ablation studies isolate the contribution of Modality-Aware MoE-LORA, Gated Multi-Level Feature Fusion, or Pseudo-Guided Initialization to the stated challenges, which is required to attribute any gains to the proposed innovations rather than other factors.

    Authors: We acknowledge that explicit ablations are necessary to link each component to the three identified challenges. Although the method section describes the modules, we will add a dedicated ablation subsection that incrementally activates Modality-Aware MoE-LORA, Gated Multi-Level Feature Fusion, and Pseudo-Guided Initialization on top of the SAM2 baseline and reports the resulting performance deltas on all three datasets. This will provide direct evidence for the contribution of each innovation. revision: yes

  3. Referee: [Experiments] The manuscript supplies no implementation details, training protocols, hyper-parameter settings, dataset splits, or confirmation that baselines were re-implemented under identical conditions, leaving open the possibility that reported improvements arise from experimental confounds rather than the architectural changes.

    Authors: We agree that full experimental transparency is required. We will insert a new “Implementation Details” subsection that specifies the optimizer, learning-rate schedule, batch size, number of epochs, hardware, exact train/validation/test splits for each dataset, and an explicit statement that all baselines were re-implemented and evaluated under the identical protocol and data splits used for M⁴-SAM. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical model design with benchmark evaluation

full rationale

The paper presents an empirical architecture extension of SAM2 for RGB-D VSOD, introducing three components (Modality-Aware MoE-LORA, Gated Multi-Level Feature Fusion, Pseudo-Guided Initialization) and claiming SOTA via dataset experiments. No equations, derivations, or first-principles results are shown that reduce any claimed performance or prediction to quantities defined by the authors' own fitted parameters, self-citations, or ansatzes. The central claim rests on experimental comparisons rather than any self-referential mathematical chain, making this a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim rests on the unverified effectiveness of three newly introduced components whose performance is asserted only at the abstract level.

axioms (1)
  • domain assumption SAM2 supplies generalizable visual representations that can be extended to RGB-D video salient object detection
    Stated as the starting point for applying SAM2 to the new task.
invented entities (3)
  • Modality-Aware MoE-LORA no independent evidence
    purpose: Inject convolutional experts and modality dispatcher into SAM2 encoder for multi-modal spatial modeling
    New PEFT module proposed to address limited spatial modeling of linear LoRA.
  • Gated Multi-Level Feature Fusion no independent evidence
    purpose: Hierarchically aggregate multi-scale encoder features with adaptive gating
    New fusion mechanism to address insufficient use of multi-scale features.
  • Pseudo-Guided Initialization no independent evidence
    purpose: Bootstrap memory bank with coarse mask as pseudo prior for prompt-free operation
    New initialization strategy to remove dependence on explicit prompts.

pith-pipeline@v0.9.0 · 5574 in / 1394 out tokens · 139448 ms · 2026-05-13T06:05:29.196719+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 1 internal anchor

  1. [1] Radhakrishna Achanta, Sheila Hemami, Francisco Estrada, and Sabine Süsstrunk. Frequency-tuned salient region detection. In CVPR, pages 1597–1604. IEEE, 2009.
  2. [2] Liuxin Bao, Xiaofei Zhou, Xiankai Lu, Yaoqi Sun, Haibing Yin, Zhenghui Hu, Jiyong Zhang, and Chenggang Yan. Quality-aware selective fusion network for VDT salient object detection. IEEE TIP, 33:3212–3226, 2024.
  3. [3] Ali Borji, Ming-Ming Cheng, Qibin Hou, Huaizu Jiang, and Jia Li. Salient object detection: A survey. CVM, 5(2):117–150, 2019.
  4. [4] Tianrun Chen, Ankang Lu, Lanyun Zhu, Chaotao Ding, Chunan Yu, Deyi Ji, Zejian Li, Lingyun Sun, Papa Mao, and Ying Zang. SAM2-Adapter: Evaluating & adapting Segment Anything 2 in downstream tasks: Camouflage, shadow, medical image segmentation, and more. arXiv preprint arXiv:2408.04579, 2024.
  5. [5] Ho Kei Cheng and Alexander G Schwing. XMem: Long-term video object segmentation with an Atkinson-Shiffrin memory model. In ECCV, pages 640–658. Springer, 2022.
  6. [6] Ho Kei Cheng, Yu-Wing Tai, and Chi-Keung Tang. Rethinking space-time networks with improved memory coverage for efficient video object segmentation. In NeurIPS, pages 11781–11794, 2021.
  7. [7] Suhwan Cho, Minhyeok Lee, Seunghoon Lee, Dogyoon Lee, Heeseung Choi, Ig-Jae Kim, and Sangyoun Lee. Dual prototype attention for unsupervised video object segmentation. In CVPR, pages 19238–19247, 2024.
  8. [8] Suhwan Cho, Minhyeok Lee, Jungho Lee, Sunghun Yang, and Sangyoun Lee. Transflow: Motion knowledge transfer from video diffusion models to video salient object detection. In ICCVW, pages 3803–3813, 2025.
  9. [9] Runmin Cong, Hongyu Liu, Chen Zhang, Wei Zhang, Feng Zheng, Ran Song, and Sam Kwong. Point-aware interaction and CNN-induced refinement network for RGB-D salient object detection. In ACM MM, pages 406–416, 2023.
  10. [10] Xiaolong Deng, Huisi Wu, Runhao Zeng, and Jing Qin. MemSAM: Taming Segment Anything Model for echocardiography video segmentation. In CVPR, pages 9622–9631, 2024.
  11. [11] Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, et al. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence, 5(3):220–235, 2023.
  12. [12] Deng-Ping Fan, Ming-Ming Cheng, Yun Liu, Tao Li, and Ali Borji. Structure-measure: A new way to evaluate foreground maps. In ICCV, pages 4548–4557, 2017.
  13. [13] Deng-Ping Fan, Cheng Gong, Yang Cao, Bo Ren, Ming-Ming Cheng, and Ali Borji. Enhanced-alignment measure for binary foreground map evaluation. In IJCAI, pages 698–704, 2018.
  14. [14] Deng-Ping Fan, Wenguan Wang, Ming-Ming Cheng, and Jianbing Shen. Shifting more attention to video salient object detection. In CVPR, pages 8554–8564, 2019.
  15. [15] Shixuan Gao, Pingping Zhang, Tianyu Yan, and Huchuan Lu. Multi-scale and detail-enhanced Segment Anything Model for salient object detection. In ACM MM, pages 9894–9903, 2024.
  16. [16] Junwei Han, Hao Chen, Nian Liu, Chenggang Yan, and Xuelong Li. CNNs-based RGB-D saliency detection via cross-view transfer and multiview fusion. IEEE TCYB, 48(11):3171–3183, 2017.
  17. [17] Qibin Hou, Ming-Ming Cheng, Xiaowei Hu, Ali Borji, Zhuowen Tu, and Philip HS Torr. Deeply supervised salient object detection with short connections. In CVPR, pages 3203–3212, 2017.
  18. [18] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In ICML, pages 2790–2799. PMLR, 2019.
  19. [19] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.
  20. [20] Wei Ji, Jingjing Li, Shuang Yu, Miao Zhang, Yongri Piao, Shunyu Yao, Qi Bi, Kai Ma, Yefeng Zheng, Huchuan Lu, et al. Calibrated RGB-D salient object detection. In CVPR, pages 9471–9481, 2021.
  21. [21] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment Anything. In ICCV, pages 4015–4026, 2023.
  22. [22] Jingjing Li, Wei Ji, Size Wang, Wenbo Li, and Li Cheng. DVSOD: RGB-D video salient object detection. In NeurIPS, pages 8774–8787, 2023.
  23. [23] Ping Li, Yu Zhang, Li Yuan, Huaxin Xiao, Binbin Lin, and Xianghua Xu. Efficient long-short temporal attention network for unsupervised video object segmentation. PR, 146:110078, 2024.
  24. [24] Xingyuan Li, Ruichao Hou, Tongwei Ren, and Gangshan Wu. KAN-SAM: Kolmogorov-Arnold network guided Segment Anything Model for RGB-T salient object detection. In ICME, pages 1–6. IEEE, 2025.
  25. [25] Junhao Lin, Lei Zhu, Jiaxing Shen, Huazhu Fu, Qing Zhang, and Liansheng Wang. ViDSOD-100: A new dataset and a baseline model for RGB-D video salient object detection. IJCV, 132(11):5173–5191, 2024.
  26. [26] Nian Liu, Ni Zhang, Kaiyuan Wan, Ling Shao, and Junwei Han. Visual saliency transformer. In ICCV, pages 4722–4732, 2021.
  27. [27] Songtao Liu, Di Huang, and Yunhong Wang. Receptive field block net for accurate and fast object detection. In ECCV, pages 385–400, 2018.
  28. [28] Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljačić, Thomas Y Hou, and Max Tegmark. KAN: Kolmogorov-Arnold networks. arXiv preprint arXiv:2404.19756, 2024.
  29. [29] Ao Mou, Yukang Lu, Jiahao He, Dingyao Min, Keren Fu, and Qijun Zhao. Salient object detection in RGB-D videos. IEEE TIP, 33:6660–6675, 2024.
  30. [30] Peter Ochs, Jitendra Malik, and Thomas Brox. Segmentation of moving objects by long term video analysis. IEEE TPAMI, 36(6):1187–1200, 2013.
  31. [31] Seoung Wug Oh, Joon-Young Lee, Ning Xu, and Seon Joo Kim. Video object segmentation using space-time memory networks. In ICCV, pages 9226–9235, 2019.
  32. [32] Youwei Pang, Xiaoqi Zhao, Lihe Zhang, and Huchuan Lu. Multi-scale interactive network for salient object detection. In CVPR, pages 9413–9422, 2020.
  33. [33] Houwen Peng, Bing Li, Weihua Xiong, Weiming Hu, and Rongrong Ji. RGBD salient object detection: A benchmark and algorithms. In ECCV, pages 92–109. Springer, 2014.
  34. [34] Xuebin Qin, Zichen Zhang, Chenyang Huang, Chao Gao, Masood Dehghan, and Martin Jagersand. BASNet: Boundary-aware salient object detection. In CVPR, pages 7479–7489, 2019.
  35. [35] Liangqiong Qu, Shengfeng He, Jiawei Zhang, Jiandong Tian, Yandong Tang, and Qingxiong Yang. RGBD salient object detection via deep fusion. IEEE TIP, 26(5):2274–2285, 2017.
  36. [36] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment Anything in images and videos. In ICLR, 2025.
  37. [37] Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, et al. Hiera: A hierarchical vision transformer without the bells-and-whistles. In ICML, pages 29441–29454. PMLR, 2023.
  38. [38] Tong Shao, Zhuotao Tian, Hang Zhao, and Jingyong Su. Explore the potential of CLIP for training-free open vocabulary semantic segmentation. In ECCV, pages 139–156. Springer, 2024.
  39. [39] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In ICLR, 2017.
  40. [40] Haoran Shen, Peixian Zhuang, Jiahao Kou, Yuxin Zeng, Haoying Xu, and Jiangyun Li. MGD-SAM2: Multi-view guided detail-enhanced segment anything model 2 for high-resolution class-agnostic segmentation. arXiv preprint arXiv:2503.23786, 2025.
  41. [41] Zhuo Su, Li Liu, Matthias Müller, Jiehua Zhang, Diana Wofk, Ming-Ming Cheng, and Matti Pietikäinen. Rapid salient object detection with difference convolutional neural networks. IEEE TPAMI, 47(10):9061–9077, 2025.
  42. [42] Daerji Suolang, Jiahao He, Wangchuk Tsering, Keren Fu, Xiaofeng Li, and Qijun Zhao. Lightweight multi-frequency enhancement network for RGB-D video salient object detection. In ICASSP, pages 1–5. IEEE, 2025.
  43. [43] Bin Tang, Zhengyi Liu, Yacheng Tan, and Qian He. HRTransNet: HRFormer-driven two-modality salient object detection. IEEE TCSVT, 33(2):728–742, 2022.
  44. [44] Zhengzheng Tu, Yan Ma, Zhun Li, Chenglong Li, Jieming Xu, and Yongtao Liu. RGBT salient object detection: A large-scale dataset and benchmark. IEEE TMM, 25:4163–4176, 2022.
  45. [45] Bin Wan, Xiaofei Zhou, Bolun Zheng, Haibing Yin, Zunjie Zhu, Hongkui Wang, Yaoqi Sun, Jiyong Zhang, and Chenggang Yan. LFRNet: Localizing, focus, and refinement network for salient object detection of surface defects. IEEE TIM, 72:1–12, 2023.
  46. [46] Ningning Wang and Xiaojin Gong. Adaptive fusion for RGB-D salient object detection. IEEE Access, 7:55277–55284, 2019.
  47. [47] Wenguan Wang, Jianbing Shen, and Ling Shao. Consistent video saliency using local gradient flow optimization and global refinement. IEEE TIP, 24(11):4185–4196, 2015.
  48. [48] Jun Wei, Shuhui Wang, and Qingming Huang. F3Net: Fusion, feedback and focus for salient object detection. In AAAI, pages 12321–12328, 2020.
  49. [49] Xinyu Xiong, Zihuang Wu, Shuangyi Tan, Wenxue Li, Feilong Tang, Ying Chen, Siying Li, Jie Ma, and Guanbin Li. SAM2-UNet: Segment Anything 2 makes strong encoder for natural and medical image segmentation. Visual Intelligence, 4(1):2, 2026.
  50. [50] Kaidong Zhang and Dong Liu. Customized Segment Anything Model for medical image segmentation. arXiv preprint arXiv:2304.13785, 2023.
  51. [51] Xiaoqi Zhao, Lihe Zhang, Youwei Pang, Huchuan Lu, and Lei Zhang. A single stream network for robust and real-time RGB-D salient object detection. In ECCV, pages 646–662. Springer, 2020.
  52. [52] Zihan Zhong, Zhiqiang Tang, Tong He, Haoyang Fang, and Chun Yuan. Convolution meets LoRA: Parameter efficient finetuning for Segment Anything Model. arXiv preprint arXiv:2401.17868, 2024.
  53. [53] Tao Zhou, Deng-Ping Fan, Ming-Ming Cheng, Jianbing Shen, and Ling Shao. RGB-D salient object detection: A survey. CVM, 7(1):37–69, 2021.
  54. [54] Xiaofei Zhou, Hao Fang, Zhi Liu, Bolun Zheng, Yaoqi Sun, Jiyong Zhang, and Chenggang Yan. Dense attention-guided cascaded network for salient object detection of strip steel surface defects. IEEE TIM, 71:1–14, 2021.
  55. [55] Xiaofei Zhou, Weipeng Cao, Hanxiao Gao, Zhong Ming, and Jiyong Zhang. STI-Net: Spatiotemporal integration network for video saliency detection. Information Sciences, 628:134–147, 2023.
  56. [56] Xiaofei Zhou, Songhe Wu, Ran Shi, Bolun Zheng, Shuai Wang, Haibing Yin, Jiyong Zhang, and Chenggang Yan. Transformer-based multi-scale feature integration network for video saliency prediction. IEEE TCSVT, 33(12):7696–7707, 2023.
  57. [57] Mingchen Zhuge, Deng-Ping Fan, Nian Liu, Dingwen Zhang, Dong Xu, and Ling Shao. Salient object detection via integrity learning. IEEE TPAMI, 45(3):3738–3752, 2022.