pith. machine review for the scientific record.

arxiv: 2604.12380 · v1 · submitted 2026-04-14 · 💻 cs.CV

Recognition: unknown

Modality-Agnostic Prompt Learning for Multi-Modal Camouflaged Object Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords camouflaged object detection · multi-modal prompt learning · segment anything model · modality-agnostic · semantic segmentation · auxiliary modalities · mask refinement

The pith

Modality-agnostic prompts let the Segment Anything Model adapt to any auxiliary sensor for camouflaged object detection without custom fusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a framework that produces a single set of prompts usable by SAM across arbitrary extra modalities such as depth, thermal, or polarization. It achieves this by letting a data-driven content domain interact with a knowledge-driven prompt domain to extract and unify task-relevant cues for mask decoding. A lightweight Mask Refine Module then sharpens the resulting boundaries. Existing multi-modal COD methods require separate architectures or fusion rules for each sensor pair, so a modality-agnostic route would simplify scaling to new combinations and reduce retraining costs. Experiments on three standard benchmarks are presented to show the gains.
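To make the mechanism concrete, here is a minimal PyTorch sketch of what such a content-prompt interaction could look like: learnable prompt tokens (the knowledge-driven prompt domain) cross-attend to fused RGB and auxiliary features (the data-driven content domain) and emit prompt embeddings for a frozen SAM mask decoder. The class name, dimensions, and single-block design are all assumptions; the paper's actual architecture is not specified in this summary.

```python
import torch
import torch.nn as nn

class ModalityAgnosticPromptGenerator(nn.Module):
    """Hypothetical dual-domain prompt generator (not the paper's code)."""

    def __init__(self, dim=256, num_prompts=8, num_heads=8):
        super().__init__()
        # Knowledge-driven prompt domain: learnable tokens shared across
        # all auxiliary modalities (depth, thermal, polarization, ...).
        self.prompt_tokens = nn.Parameter(torch.randn(1, num_prompts, dim))
        # Content-prompt interaction: prompts query the fused content features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, rgb_feats, aux_feats):
        # Data-driven content domain: concatenate tokens from both streams,
        # so the same interaction applies regardless of the auxiliary sensor.
        content = torch.cat([rgb_feats, aux_feats], dim=1)      # (B, N, dim)
        prompts = self.prompt_tokens.expand(content.size(0), -1, -1)
        attended, _ = self.cross_attn(prompts, content, content)
        prompts = self.norm(prompts + attended)
        return prompts + self.mlp(prompts)   # sparse prompt embeddings for SAM

# Only the generator (and a refine module) would be trained;
# SAM's image encoder and mask decoder stay frozen.
gen = ModalityAgnosticPromptGenerator()
rgb = torch.randn(2, 1024, 256)   # flattened RGB image-encoder tokens
aux = torch.randn(2, 1024, 256)   # flattened auxiliary-modality tokens
print(gen(rgb, aux).shape)        # torch.Size([2, 8, 256])
```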

Core claim

The central claim is that multi-modal learning for camouflaged object detection can be reduced to generating unified modality-agnostic prompts: interactions between a data-driven content domain and a knowledge-driven prompt domain distill complementary cues into prompts that SAM can decode directly, with a subsequent Mask Refine Module incorporating fine-grained prompt information to correct coarse segmentations and improve boundary accuracy.

What carries the argument

Modality-agnostic multi-modal prompts produced by domain interactions between data-driven content and knowledge-driven prompts, then fed to SAM for decoding.

If this is right

  • Performance improves on RGB-Depth, RGB-Thermal, and RGB-Polarization COD benchmarks relative to prior multi-modal methods.
  • Parameter-efficient adaptation becomes possible for new auxiliary modalities without retraining the entire model.
  • Coarse SAM outputs receive boundary corrections from the Mask Refine Module, which uses the same unified prompts (a sketch follows this list).
  • Customized fusion modules or modality-specific encoders are no longer required for each sensor combination.
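A minimal sketch of what that refinement step could look like, assuming the coarse SAM logits and pooled prompt features are its only inputs; the module's real architecture is not described in the text above, so everything here is illustrative.

```python
import torch
import torch.nn as nn

class MaskRefineModule(nn.Module):
    """Hypothetical lightweight refine head (not the paper's design)."""

    def __init__(self, prompt_dim=256, hidden=32):
        super().__init__()
        self.proj = nn.Linear(prompt_dim, hidden)   # compress prompt cues
        self.head = nn.Sequential(
            nn.Conv2d(1 + hidden, hidden, 3, padding=1), nn.GELU(),
            nn.Conv2d(hidden, 1, 3, padding=1))

    def forward(self, coarse_logits, prompts):
        # coarse_logits: (B, 1, H, W) from SAM; prompts: (B, P, prompt_dim)
        B, _, H, W = coarse_logits.shape
        cue = self.proj(prompts.mean(dim=1))          # (B, hidden)
        cue = cue[:, :, None, None].expand(B, -1, H, W)
        residual = self.head(torch.cat([coarse_logits, cue], dim=1))
        return coarse_logits + residual               # calibrated mask logits

refine = MaskRefineModule()
out = refine(torch.randn(2, 1, 64, 64), torch.randn(2, 8, 256))
print(out.shape)  # torch.Size([2, 1, 64, 64])
```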

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same prompt-generation pattern could be applied to other foundation segmentation models or to tasks outside camouflaged detection such as medical or remote-sensing imagery.
  • Real-world systems that switch between sensor suites could keep one prompt set and one decoder rather than maintaining separate pipelines.
  • Extending the content-prompt interaction to three or more simultaneous modalities would test whether the unification step remains stable.
  • If the prompts transfer across datasets collected under different lighting or weather, deployment cost for multi-sensor field systems would drop further.

Load-bearing premise

A single fixed set of prompts can extract and use useful complementary signals from any added visual modality without needing designs tailored to that modality.

What would settle it

Training the prompts on RGB-Depth and RGB-Thermal data, then testing on a held-out RGB-Polarization set or a fresh modality such as RGB-LiDAR, and measuring whether accuracy falls below a modality-specific baseline would directly test the claim.
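A sketch of that leave-one-modality-out protocol under stated assumptions: the `fit` method, the scalar `score` metric, and the dataset dictionary are placeholders, not the paper's code.

```python
def leave_one_modality_out(model, datasets, held_out, baseline_score, score):
    """Train shared prompts on all modality pairs except `held_out`,
    then evaluate zero-shot on the held-out pair."""
    for name, data in datasets.items():
        if name != held_out:
            model.fit(data)          # only prompts + refine module update
    zero_shot = score(model, datasets[held_out])
    # The modality-agnostic claim survives only if zero-shot accuracy
    # stays near the modality-specific baseline on the same test set.
    return zero_shot, zero_shot >= baseline_score

# Example: train on RGB-Depth and RGB-Thermal, hold out RGB-Polarization;
# `held_out` could equally be a genuinely new pairing such as RGB-LiDAR.
```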

Figures

Figures reproduced from arXiv: 2604.12380 by Baocai Yin, Hao Wang, Huibing Wang, Jiqing Zhang, Lu Jiang, Xin Yang, Zetian Mi.

Figure 1. The F_β^w (weighted F-measure) scores with respect to the trainable parameters of different methods on the COD10K (RGB-Depth), PCOD-1200 (RGB-Polarization), and VIAC (RGB-Thermal) datasets (a simplified metric sketch follows this list).
Figure 2. Overview of our proposed framework. The framework adopts a dual-stream architecture where RGB and auxiliary …
Figure 3. Qualitative comparison of detection results on RGB-D and RGB-P camouflaged object detection datasets.
Figure 4. Qualitative comparison of detection results on the …
Figure 5. Visual ablation comparison of different variants.
Figure 6. Attentional visualization of the transposed convolution …
Figure 7. Qualitative comparison of detection results on RGB-T …
Figure 9. A failure case: with insufficient complementary data, it becomes difficult for all methods to detect camouflaged targets.
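Figure 1 plots F_β^w against trainable parameters. For orientation, a simplified sketch of the F_β family of scores follows; the actual weighted F-measure additionally weights each pixel's error by its spatial location, which this version omits.

```python
import numpy as np

def f_beta(pred, gt, beta2=1.0, eps=1e-8):
    """Simplified F_beta score; beta2 = 1 is common for the weighted
    variant, 0.3 for the plain F-measure in COD benchmarks."""
    pred = pred.astype(np.float64)        # soft mask in [0, 1]
    gt = gt.astype(np.float64)            # binary ground truth
    tp = (pred * gt).sum()
    precision = tp / (pred.sum() + eps)
    recall = tp / (gt.sum() + eps)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)

print(f_beta(np.random.rand(64, 64), np.random.rand(64, 64) > 0.5))
```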
original abstract

Camouflaged Object Detection (COD) aims to segment objects that blend seamlessly into complex backgrounds, with growing interest in exploiting additional visual modalities to enhance robustness through complementary information. However, most existing approaches generally rely on modality-specific architectures or customized fusion strategies, which limit scalability and cross-modal generalization. To address this, we propose a novel framework that generates modality-agnostic multi-modal prompts for the Segment Anything Model (SAM), enabling parameter-efficient adaptation to arbitrary auxiliary modalities and significantly improving overall performance on COD tasks. Specifically, we model multi-modal learning through interactions between a data-driven content domain and a knowledge-driven prompt domain, distilling task-relevant cues into unified prompts for SAM decoding. We further introduce a lightweight Mask Refine Module to calibrate coarse predictions by incorporating fine-grained prompt cues, leading to more accurate camouflaged object boundaries. Extensive experiments on RGB-Depth, RGB-Thermal, and RGB-Polarization benchmarks validate the effectiveness and generalization of our modality-agnostic framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes a framework for multi-modal camouflaged object detection that generates modality-agnostic prompts for the Segment Anything Model (SAM). It models interactions between a data-driven content domain and a knowledge-driven prompt domain to distill task-relevant cues into unified prompts, introduces a lightweight Mask Refine Module for calibrating coarse predictions with fine-grained cues, and claims parameter-efficient adaptation to arbitrary auxiliary modalities with improved performance on COD tasks, validated on RGB-Depth, RGB-Thermal, and RGB-Polarization benchmarks.

Significance. If the central claims hold, the work offers a scalable alternative to modality-specific architectures in multi-modal segmentation by leveraging prompt-based adaptation of foundation models like SAM. This could enable more flexible incorporation of auxiliary modalities in challenging vision tasks such as camouflaged object detection without requiring custom fusion modules.

major comments (1)
  1. Abstract: The claim of enabling 'parameter-efficient adaptation to arbitrary auxiliary modalities' and a 'modality-agnostic' framework is not supported by the reported validation. Experiments are confined to three specific modality pairs (RGB-Depth, RGB-Thermal, RGB-Polarization), with no tests on unseen modalities, no evidence of a single shared encoder for truly arbitrary inputs, and no ablation confirming the absence of implicit modality-specific branches in the content-domain ingestion or prompt generator. This directly undermines the load-bearing assertion that the content-prompt domain interactions distill complementary cues without customized strategies.
minor comments (2)
  1. Abstract: The claim of 'significantly improving overall performance' is made without any quantitative metrics, baseline comparisons, or specific gains, which makes the strength of the empirical claims hard to assess at a glance.
  2. Abstract: The Mask Refine Module is introduced as 'lightweight' but without details on its parameter count, architecture, or integration point with SAM decoding, which would aid reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the opportunity to respond to the referee's comments. We address the major concern about the abstract claims as follows.

point-by-point responses
  1. Referee: Abstract: The claim of enabling 'parameter-efficient adaptation to arbitrary auxiliary modalities' and a 'modality-agnostic' framework is not supported by the reported validation. Experiments are confined to three specific modality pairs (RGB-Depth, RGB-Thermal, RGB-Polarization), with no tests on unseen modalities, no evidence of a single shared encoder for truly arbitrary inputs, and no ablation confirming the absence of implicit modality-specific branches in the content-domain ingestion or prompt generator. This directly undermines the load-bearing assertion that the content-prompt domain interactions distill complementary cues without customized strategies.

    Authors: We acknowledge that our experimental validation is limited to three modality pairs and does not include tests on entirely unseen modalities, which would provide stronger evidence for 'arbitrary' adaptation. However, the framework is modality-agnostic by design: the same prompt learning and interaction modules are used across all tested modalities without any customized fusion strategies or modality-specific components, as detailed in the method section. This is what enables parameter-efficient adaptation, where only a small number of parameters (prompts and the mask refine module) are learned for each new modality. We will revise the abstract to replace 'arbitrary' with 'diverse' or 'additional' auxiliary modalities to more accurately reflect the scope of our experiments. Additionally, we will add an ablation study demonstrating that the content-domain ingestion and prompt generator do not contain implicit modality-specific branches, by showing equivalent performance when using a unified encoder. We believe this addresses the concern while preserving the validity of the core claims. revision: partial

Circularity Check

0 steps flagged

No circularity: architectural proposal with independent empirical validation

full rationale

The paper proposes a framework that models multi-modal learning via interactions between a data-driven content domain and knowledge-driven prompt domain to distill cues into unified SAM prompts, plus a lightweight Mask Refine Module. This is a design choice and architectural contribution, not a mathematical derivation or parameter fit that reduces to its own inputs by construction. No equations, self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. Validation occurs on external RGB-Depth, RGB-Thermal, and RGB-Polarization benchmarks, which are independent of the internal claims. The modality-agnostic assertion is a generalization claim tested on three pairs rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Based on the abstract only; no specific free parameters or axioms can be identified in detail. The 'modality-agnostic multi-modal prompts' and 'Mask Refine Module' are introduced as part of the framework but lack independent evidence or details.

invented entities (2)
  • modality-agnostic multi-modal prompts · no independent evidence
    purpose: Enable parameter-efficient adaptation to arbitrary auxiliary modalities
    Core of the proposed framework but no independent validation or details provided.
  • Mask Refine Module · no independent evidence
    purpose: Calibrate coarse predictions using fine-grained prompt cues
    Lightweight addition for boundary accuracy but details absent.

pith-pipeline@v0.9.0 · 5484 in / 1258 out tokens · 92171 ms · 2026-05-10T16:25:01.528156+00:00 · methodology

