pith. machine review for the scientific record.

arxiv: 2604.12380 · v1 · submitted 2026-04-14 · 💻 cs.CV

Recognition: unknown

Modality-Agnostic Prompt Learning for Multi-Modal Camouflaged Object Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords camouflaged object detection · multi-modal prompt learning · segment anything model · modality-agnostic · semantic segmentation · auxiliary modalities · mask refinement

The pith

Modality-agnostic prompts let the Segment Anything Model adapt to any auxiliary sensor for camouflaged object detection without custom fusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a framework that produces a single set of prompts usable by SAM across arbitrary extra modalities such as depth, thermal, or polarization. It achieves this by letting a data-driven content domain interact with a knowledge-driven prompt domain to extract and unify task-relevant cues for mask decoding. A lightweight Mask Refine Module then sharpens the resulting boundaries. Existing multi-modal COD methods require separate architectures or fusion rules for each sensor pair, so a modality-agnostic route would simplify scaling to new combinations and reduce retraining costs. Experiments on three standard benchmarks are presented to show the gains.
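To make the mechanism concrete, here is a minimal PyTorch sketch of what such a content-prompt interaction could look like: learnable prompt tokens (the knowledge-driven prompt domain) cross-attend to fused RGB and auxiliary features (the data-driven content domain) and emit prompt embeddings for a frozen SAM mask decoder. The class name, dimensions, and single-block design are all assumptions; the paper's actual architecture is not specified in this summary.

```python
import torch
import torch.nn as nn

class ModalityAgnosticPromptGenerator(nn.Module):
    """Hypothetical dual-domain prompt generator (not the paper's code)."""

    def __init__(self, dim=256, num_prompts=8, num_heads=8):
        super().__init__()
        # Knowledge-driven prompt domain: learnable tokens shared across
        # all auxiliary modalities (depth, thermal, polarization, ...).
        self.prompt_tokens = nn.Parameter(torch.randn(1, num_prompts, dim))
        # Content-prompt interaction: prompts query the fused content features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, rgb_feats, aux_feats):
        # Data-driven content domain: concatenate tokens from both streams,
        # so the same interaction applies regardless of the auxiliary sensor.
        content = torch.cat([rgb_feats, aux_feats], dim=1)      # (B, N, dim)
        prompts = self.prompt_tokens.expand(content.size(0), -1, -1)
        attended, _ = self.cross_attn(prompts, content, content)
        prompts = self.norm(prompts + attended)
        return prompts + self.mlp(prompts)   # sparse prompt embeddings for SAM

# Only the generator (and a refine module) would be trained;
# SAM's image encoder and mask decoder stay frozen.
gen = ModalityAgnosticPromptGenerator()
rgb = torch.randn(2, 1024, 256)   # flattened RGB image-encoder tokens
aux = torch.randn(2, 1024, 256)   # flattened auxiliary-modality tokens
print(gen(rgb, aux).shape)        # torch.Size([2, 8, 256])
```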

Core claim

The central claim is that multi-modal learning for camouflaged object detection can be reduced to generating unified modality-agnostic prompts: interactions between a data-driven content domain and a knowledge-driven prompt domain distill complementary cues into prompts that SAM can decode directly, with a subsequent Mask Refine Module incorporating fine-grained prompt information to correct coarse segmentations and improve boundary accuracy.

What carries the argument

Modality-agnostic multi-modal prompts produced by domain interactions between data-driven content and knowledge-driven prompts, then fed to SAM for decoding.

If this is right

  • Performance improves on RGB-Depth, RGB-Thermal, and RGB-Polarization COD benchmarks relative to prior multi-modal methods.
  • Parameter-efficient adaptation becomes possible for new auxiliary modalities without retraining the entire model.
  • Coarse SAM outputs receive boundary corrections from the Mask Refine Module, which uses the same unified prompts (a sketch follows this list).
  • Customized fusion modules or modality-specific encoders are no longer required for each sensor combination.
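A minimal sketch of what that refinement step could look like, assuming the coarse SAM logits and pooled prompt features are its only inputs; the module's real architecture is not described in the text above, so everything here is illustrative.

```python
import torch
import torch.nn as nn

class MaskRefineModule(nn.Module):
    """Hypothetical lightweight refine head (not the paper's design)."""

    def __init__(self, prompt_dim=256, hidden=32):
        super().__init__()
        self.proj = nn.Linear(prompt_dim, hidden)   # compress prompt cues
        self.head = nn.Sequential(
            nn.Conv2d(1 + hidden, hidden, 3, padding=1), nn.GELU(),
            nn.Conv2d(hidden, 1, 3, padding=1))

    def forward(self, coarse_logits, prompts):
        # coarse_logits: (B, 1, H, W) from SAM; prompts: (B, P, prompt_dim)
        B, _, H, W = coarse_logits.shape
        cue = self.proj(prompts.mean(dim=1))          # (B, hidden)
        cue = cue[:, :, None, None].expand(B, -1, H, W)
        residual = self.head(torch.cat([coarse_logits, cue], dim=1))
        return coarse_logits + residual               # calibrated mask logits

refine = MaskRefineModule()
out = refine(torch.randn(2, 1, 64, 64), torch.randn(2, 8, 256))
print(out.shape)  # torch.Size([2, 1, 64, 64])
```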

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same prompt-generation pattern could be applied to other foundation segmentation models or to tasks outside camouflaged detection such as medical or remote-sensing imagery.
  • Real-world systems that switch between sensor suites could keep one prompt set and one decoder rather than maintaining separate pipelines.
  • Extending the content-prompt interaction to three or more simultaneous modalities would test whether the unification step remains stable.
  • If the prompts transfer across datasets collected under different lighting or weather, deployment cost for multi-sensor field systems would drop further.

Load-bearing premise

A single fixed set of prompts can extract and use useful complementary signals from any added visual modality without needing designs tailored to that modality.

What would settle it

Training the prompts on RGB-Depth and RGB-Thermal data, then testing on a held-out RGB-Polarization set or a fresh modality such as RGB-LiDAR, and measuring whether accuracy falls below a modality-specific baseline would directly test the claim.
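A sketch of that leave-one-modality-out protocol under stated assumptions: the `fit` method, the scalar `score` metric, and the dataset dictionary are placeholders, not the paper's code.

```python
def leave_one_modality_out(model, datasets, held_out, baseline_score, score):
    """Train shared prompts on all modality pairs except `held_out`,
    then evaluate zero-shot on the held-out pair."""
    for name, data in datasets.items():
        if name != held_out:
            model.fit(data)          # only prompts + refine module update
    zero_shot = score(model, datasets[held_out])
    # The modality-agnostic claim survives only if zero-shot accuracy
    # stays near the modality-specific baseline on the same test set.
    return zero_shot, zero_shot >= baseline_score

# Example: train on RGB-Depth and RGB-Thermal, hold out RGB-Polarization;
# `held_out` could equally be a genuinely new pairing such as RGB-LiDAR.
```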

Figures

Figures reproduced from arXiv: 2604.12380 by Baocai Yin, Hao Wang, Huibing Wang, Jiqing Zhang, Lu Jiang, Xin Yang, Zetian Mi.

Figure 1. The F_β^w (weighted F-measure) scores with respect to the trainable parameters of different methods on the COD10K (RGB-Depth), PCOD-1200 (RGB-Polarization), and VIAC (RGB-Thermal) datasets (a simplified metric sketch follows this list).
Figure 2. Overview of our proposed framework. The framework adopts a dual-stream architecture where RGB and auxiliary …
Figure 3. Qualitative comparison of detection results on RGB-D and RGB-P camouflaged object detection datasets.
Figure 4. Qualitative comparison of detection results on the …
Figure 5. Visual ablation comparison of different variants.
Figure 6. Attentional visualization of the transposed convolution …
Figure 7. Qualitative comparison of detection results on RGB-T …
Figure 9. A failure case: with insufficient complementary data, it becomes difficult for all methods to detect camouflaged targets.
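Figure 1 plots F_β^w against trainable parameters. For orientation, a simplified sketch of the F_β family of scores follows; the actual weighted F-measure additionally weights each pixel's error by its spatial location, which this version omits.

```python
import numpy as np

def f_beta(pred, gt, beta2=1.0, eps=1e-8):
    """Simplified F_beta score; beta2 = 1 is common for the weighted
    variant, 0.3 for the plain F-measure in COD benchmarks."""
    pred = pred.astype(np.float64)        # soft mask in [0, 1]
    gt = gt.astype(np.float64)            # binary ground truth
    tp = (pred * gt).sum()
    precision = tp / (pred.sum() + eps)
    recall = tp / (gt.sum() + eps)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)

print(f_beta(np.random.rand(64, 64), np.random.rand(64, 64) > 0.5))
```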
original abstract

Camouflaged Object Detection (COD) aims to segment objects that blend seamlessly into complex backgrounds, with growing interest in exploiting additional visual modalities to enhance robustness through complementary information. However, most existing approaches generally rely on modality-specific architectures or customized fusion strategies, which limit scalability and cross-modal generalization. To address this, we propose a novel framework that generates modality-agnostic multi-modal prompts for the Segment Anything Model (SAM), enabling parameter-efficient adaptation to arbitrary auxiliary modalities and significantly improving overall performance on COD tasks. Specifically, we model multi-modal learning through interactions between a data-driven content domain and a knowledge-driven prompt domain, distilling task-relevant cues into unified prompts for SAM decoding. We further introduce a lightweight Mask Refine Module to calibrate coarse predictions by incorporating fine-grained prompt cues, leading to more accurate camouflaged object boundaries. Extensive experiments on RGB-Depth, RGB-Thermal, and RGB-Polarization benchmarks validate the effectiveness and generalization of our modality-agnostic framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes a framework for multi-modal camouflaged object detection that generates modality-agnostic prompts for the Segment Anything Model (SAM). It models interactions between a data-driven content domain and a knowledge-driven prompt domain to distill task-relevant cues into unified prompts, introduces a lightweight Mask Refine Module for calibrating coarse predictions with fine-grained cues, and claims parameter-efficient adaptation to arbitrary auxiliary modalities with improved performance on COD tasks, validated on RGB-Depth, RGB-Thermal, and RGB-Polarization benchmarks.

Significance. If the central claims hold, the work offers a scalable alternative to modality-specific architectures in multi-modal segmentation by leveraging prompt-based adaptation of foundation models like SAM. This could enable more flexible incorporation of auxiliary modalities in challenging vision tasks such as camouflaged object detection without requiring custom fusion modules.

major comments (1)
  1. Abstract: The claim of enabling 'parameter-efficient adaptation to arbitrary auxiliary modalities' and a 'modality-agnostic' framework is not supported by the reported validation. Experiments are confined to three specific modality pairs (RGB-Depth, RGB-Thermal, RGB-Polarization), with no tests on unseen modalities, no evidence of a single shared encoder for truly arbitrary inputs, and no ablation confirming the absence of implicit modality-specific branches in the content-domain ingestion or prompt generator. This directly undermines the load-bearing assertion that the content-prompt domain interactions distill complementary cues without customized strategies.
minor comments (2)
  1. Abstract: The claim of 'significantly improving overall performance' is made without any quantitative metrics, baseline comparisons, or specific gains, which makes the strength of the empirical claims hard to assess at a glance.
  2. Abstract: The Mask Refine Module is introduced as 'lightweight' but without details on its parameter count, architecture, or integration point with SAM decoding, which would aid reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the opportunity to respond to the referee's comments. We address the major concern about the abstract claims as follows.

point-by-point responses
  1. Referee: Abstract: The claim of enabling 'parameter-efficient adaptation to arbitrary auxiliary modalities' and a 'modality-agnostic' framework is not supported by the reported validation. Experiments are confined to three specific modality pairs (RGB-Depth, RGB-Thermal, RGB-Polarization), with no tests on unseen modalities, no evidence of a single shared encoder for truly arbitrary inputs, and no ablation confirming the absence of implicit modality-specific branches in the content-domain ingestion or prompt generator. This directly undermines the load-bearing assertion that the content-prompt domain interactions distill complementary cues without customized strategies.

    Authors: We acknowledge that our experimental validation is limited to three modality pairs and does not include tests on entirely unseen modalities, which would provide stronger evidence for 'arbitrary' adaptation. However, the framework is modality-agnostic by design: the same prompt learning and interaction modules are used across all tested modalities without any customized fusion strategies or modality-specific components, as detailed in the method section. This is what enables parameter-efficient adaptation, where only a small number of parameters (prompts and the mask refine module) are learned for each new modality. We will revise the abstract to replace 'arbitrary' with 'diverse' or 'additional' auxiliary modalities to more accurately reflect the scope of our experiments. Additionally, we will add an ablation study demonstrating that the content-domain ingestion and prompt generator do not contain implicit modality-specific branches, by showing equivalent performance when using a unified encoder. We believe this addresses the concern while preserving the validity of the core claims. revision: partial

Circularity Check

0 steps flagged

No circularity: architectural proposal with independent empirical validation

full rationale

The paper proposes a framework that models multi-modal learning via interactions between a data-driven content domain and knowledge-driven prompt domain to distill cues into unified SAM prompts, plus a lightweight Mask Refine Module. This is a design choice and architectural contribution, not a mathematical derivation or parameter fit that reduces to its own inputs by construction. No equations, self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. Validation occurs on external RGB-Depth, RGB-Thermal, and RGB-Polarization benchmarks, which are independent of the internal claims. The modality-agnostic assertion is a generalization claim tested on three pairs rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Based on the abstract only; no specific free parameters or axioms can be identified in detail. The 'modality-agnostic multi-modal prompts' and 'Mask Refine Module' are introduced as part of the framework but lack independent evidence or details.

invented entities (2)
  • modality-agnostic multi-modal prompts · no independent evidence
    purpose: Enable parameter-efficient adaptation to arbitrary auxiliary modalities
    Core of the proposed framework but no independent validation or details provided.
  • Mask Refine Module · no independent evidence
    purpose: Calibrate coarse predictions using fine-grained prompt cues
    Lightweight addition for boundary accuracy but details absent.

pith-pipeline@v0.9.0 · 5484 in / 1258 out tokens · 92171 ms · 2026-05-10T16:25:01.528156+00:00 · methodology

