pith. machine review for the scientific record.

arxiv: 2605.13140 · v1 · submitted 2026-05-13 · 💻 cs.CV

Recognition: no theorem link

Multi-Modal Guided Multi-Source Domain Adaptation for Object Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:33 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-source domain adaptation · object detection · depth maps · prompt learning · multi-modal features · domain generalization · region proposals

The pith

Depth maps and text prompts let an object detector trained on multiple labeled source domains adapt to an unlabeled target domain.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MS-DePro, a method that processes multiple source domains separately for object detection while using depth maps and text as domain-agnostic signals. It generates region proposals from depth and aligns text embeddings with multi-modal features to handle domain shifts better than prior RGB-only approaches. A sympathetic reader would care because real-world detectors often fail when camera conditions or scenes change, and this technique aims to make training on several labeled sources transfer more reliably to an unlabeled target. The authors report state-of-the-art results on MSDA benchmarks along with ablations that isolate the value of each added component.

Core claim

MS-DePro consists of depth-guided localization, which produces domain-agnostic region proposals from depth maps, and multi-modal guided prompt learning, which integrates multi-modal features to align learnable text embeddings for classification. By leveraging these domain-agnostic inputs, the detector learns invariant characteristics while preserving the domain-specific information of each source, outperforming previous multi-source domain adaptation methods on standard benchmarks.

What carries the argument

MS-DePro's depth-guided localization and multi-modal guided prompt learning, which encode domain-agnostic characteristics from depth maps and text to produce region proposals and aligned embeddings for cross-domain object detection.
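
To make the carrier of the claim concrete, here is a minimal PyTorch sketch of the two components: a proposal head that sees only the depth map, and a CoOp-style classification head that scores region features against learnable prompt embeddings. Module names, shapes, and the averaging stand-in for a frozen text encoder are illustrative assumptions, not the authors' architecture.

```python
# Minimal sketch (not the authors' code) of the two pieces the claim rests on:
# (1) an objectness map predicted from the depth map alone, from which region
#     proposals would be decoded, and
# (2) region classification by cosine similarity to learnable prompt embeddings.
# Module names, shapes, and the averaging stand-in for a frozen text encoder are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DepthGuidedProposals(nn.Module):
    """Predicts a per-location objectness map from the depth input only."""

    def __init__(self, in_channels: int = 1, hidden: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )
        self.objectness = nn.Conv2d(hidden, 1, 1)

    def forward(self, depth: torch.Tensor) -> torch.Tensor:
        # depth: (B, 1, H, W) -> objectness logits (B, 1, H, W);
        # proposals would be decoded from peaks of this map.
        return self.objectness(self.backbone(depth))


class PromptClassifier(nn.Module):
    """CoOp-style head: learnable context vectors yield per-class embeddings."""

    def __init__(self, num_classes: int, embed_dim: int = 512, ctx_len: int = 4):
        super().__init__()
        self.context = nn.Parameter(torch.randn(ctx_len, embed_dim) * 0.02)
        self.class_tokens = nn.Parameter(torch.randn(num_classes, embed_dim) * 0.02)
        self.logit_scale = nn.Parameter(torch.tensor(1.0))

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (N, D) pooled features of N region proposals.
        # A real system would push [context; class token] through a frozen text
        # encoder (e.g. CLIP's); averaging here is only a stand-in.
        text_emb = F.normalize(self.context.mean(0) + self.class_tokens, dim=-1)
        region = F.normalize(region_feats, dim=-1)
        return self.logit_scale.exp() * region @ text_emb.t()  # (N, num_classes)


if __name__ == "__main__":
    depth = torch.rand(2, 1, 64, 64)        # depth maps for two images
    feats = torch.randn(10, 512)            # pooled features of 10 proposals
    print(DepthGuidedProposals()(depth).shape)            # torch.Size([2, 1, 64, 64])
    print(PromptClassifier(num_classes=7)(feats).shape)   # torch.Size([10, 7])
```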

If this is right

  • MS-DePro reaches state-of-the-art detection accuracy on existing multi-source domain adaptation benchmarks.
  • Ablation experiments isolate performance lifts from the depth-guided localization and the prompt-learning modules.
  • The method keeps domain-specific cues while extracting domain-agnostic signals from non-RGB modalities.
  • Separate processing of each source domain outperforms simple blending of all sources.
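
For readers unfamiliar with the last point, the MSDA-versus-blending contrast can be sketched as two loss functions over per-domain batches. This is illustrative only: the classification loss stands in for the detection losses actually used, and nothing here is the authors' training code.

```python
# Illustrative contrast between blending all sources into one pool (UDA-style)
# and keeping per-source batches with their own loss terms (MSDA-style), which
# the paper argues works better. Cross-entropy is a stand-in for detection losses.
import torch
import torch.nn.functional as F


def blended_loss(model, batches):
    """UDA-style: concatenate every source and treat it as one domain."""
    x = torch.cat([b["x"] for b in batches])
    y = torch.cat([b["y"] for b in batches])
    return F.cross_entropy(model(x), y)


def per_source_loss(model, batches):
    """MSDA-style: one loss per source domain, averaged, so each domain's
    statistics stay visible to the optimizer."""
    losses = [F.cross_entropy(model(b["x"]), b["y"]) for b in batches]
    return torch.stack(losses).mean()


if __name__ == "__main__":
    model = torch.nn.Linear(8, 3)
    batches = [{"x": torch.randn(4, 8), "y": torch.randint(0, 3, (4,))} for _ in range(2)]
    print(blended_loss(model, batches).item(), per_source_loss(model, batches).item())
```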

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could transfer to semantic segmentation if depth continues to supply reliable object boundaries across domains.
  • Text-prompt alignment opens a route for adding new object categories at test time by editing the text embeddings.
  • Performance would likely degrade in settings where depth sensors are absent or poorly calibrated.
  • Combining the same depth and prompt signals with infrared or event data might further widen the domain gap the method can bridge.

Load-bearing premise

Depth maps and text are reliably available as domain-agnostic inputs that improve region proposals and embeddings without introducing new biases or needing target-domain depth data.

What would settle it

Rerun the full MS-DePro pipeline on the MSDA benchmarks with the depth inputs replaced by random noise, or with the prompt-alignment step removed. If the accuracy gain over strong multi-source baselines disappears under these ablations, the claimed components are doing the work; if it survives, they are not.
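
A minimal sketch of the noise-replacement half of that test follows. The helper destroys scene structure in the depth input while keeping its value range; how it would be wired into the released training code is shown only as a commented, hypothetical call.

```python
# Ablation sketch: swap the depth input for shape-matched random noise.
import torch


def noise_like_depth(depth: torch.Tensor) -> torch.Tensor:
    """Replace a depth map with uniform noise drawn over its own value range."""
    lo, hi = float(depth.min()), float(depth.max())
    return torch.empty_like(depth).uniform_(lo, hi)


# Hypothetical use inside the training/eval loop of the released code:
# for images, depth, targets in loader:
#     depth = noise_like_depth(depth)   # ablation: no geometric information left
#     loss = model(images, depth, targets)

if __name__ == "__main__":
    d = torch.rand(1, 1, 4, 4)
    print(noise_like_depth(d).shape)  # same shape as the original depth map
```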

Figures

Figures reproduced from arXiv: 2605.13140 by Jeongmin Shin, Namil Kim, Sangin Lee, Seokjun Kwon, Yukyung Choi.

Figure 1. Comparison of MSDA methods. (a) Existing MSDA methods rely solely on RGB images to learn domain-agnostic features, while also encoding domain-specific features from them, which leads to a training conflict. (b) Our MS-DePro leverages additional domain-agnostic modalities, depth map and text, to encode domain-agnostic features. …
Figure 2. Overall Framework of MS-DePro. Following the mean-teacher framework, MS-DePro consists of an unlabeled target teacher (top) and a labeled multi-source student (bottom). We exploit depth maps to propose regions for localization, and our learnable prompt is encoded for classification. The teacher generates pseudo-labels for the target domain. The student learns from multiple sources and adapts to the target …
Figure 3. (a) t-SNE visualization of feature distribution for RGB and depth map. (b) …
Figure 4. Visualization of an RGB image and its corresponding low-quality depth map …
Figure 5. (a) Confidence scores of region proposals from RGB and depth maps. (b) Counts …
Figure 6. Impact of class frequency on performance gap. Under a class-specific configuration …
Figure 7. (Top) Detection visualization results in the cross-time domain adaptation. We …
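
Figure 2 describes a mean-teacher setup: the teacher is an exponential-moving-average copy of the student and supplies confidence-filtered pseudo-labels on the unlabeled target. A minimal sketch of those two operations, with placeholder models rather than the authors' detector:

```python
# Mean-teacher sketch under the usual assumptions of that framework: the teacher
# is an EMA copy of the student and provides confidence-thresholded pseudo-labels
# on target-domain detections. Models and tensors below are placeholders.
import copy
import torch


@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, m: float = 0.999):
    """teacher <- m * teacher + (1 - m) * student, parameter by parameter."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(m).add_(s_p, alpha=1.0 - m)


@torch.no_grad()
def pseudo_labels(scores: torch.Tensor, boxes: torch.Tensor, thr: float = 0.8):
    """Keep only the target-domain detections the teacher is confident about."""
    keep = scores >= thr
    return boxes[keep], scores[keep]


if __name__ == "__main__":
    student = torch.nn.Linear(4, 2)
    teacher = copy.deepcopy(student)
    ema_update(teacher, student)
    b, s = pseudo_labels(torch.tensor([0.95, 0.40, 0.85]), torch.rand(3, 4))
    print(b.shape, s)  # torch.Size([2, 4]) tensor([0.9500, 0.8500])
```
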
Original abstract

General object detection (OD) struggles to detect objects in the target domain that differ from the training distribution. To address this, recent studies demonstrate that training from multiple source domains and explicitly processing them separately for multi-source domain adaptation (MSDA) outperforms blending them for unsupervised domain adaptation (UDA). However, existing MSDA methods learn domain-agnostic features from domain-specific RGB images while preserving domain-specific information from the domain-agnostic feature map. To address this, we propose MS-DePro: Multi-Source Detector with Depth and Prompt, composed of (1) depth-guided localization and (2) multi-modal guided prompt learning. We leverage domain-agnostic input modalities, namely depth maps and text, to encode domain-agnostic characteristics. Specifically, we utilize depth maps to generate domain-agnostic region proposals for localization and integrate multi-modal features to align learnable text embeddings for classification. MS-DePro achieves state-of-the-art performance on MSDA benchmarks, and comprehensive ablations demonstrate the effectiveness of our contributions. Our code is available on https://github.com/sejong-rcv/Multi-Modal-Guided-Multi-Source-Domain-Adaptation-for-Object-Detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MS-DePro, a multi-source domain adaptation (MSDA) framework for object detection that integrates depth maps for domain-agnostic region proposals via depth-guided localization and text embeddings via multi-modal prompt learning for classification. It claims state-of-the-art performance on MSDA benchmarks with supporting ablations, leveraging depth and text as domain-agnostic modalities to improve cross-domain generalization over RGB-only approaches.

Significance. If the central claims hold under scrutiny of the depth source and target-domain handling, the work would be moderately significant for MSDA in object detection by showing how auxiliary modalities can produce more invariant proposals and aligned features. The public code release is a clear strength that supports reproducibility and follow-up work.

major comments (2)
  1. [Abstract] Abstract: The assertion that depth maps are domain-agnostic inputs enabling domain-agnostic region proposals is load-bearing for the central claim, yet the manuscript does not specify the depth source for target domains (e.g., Cityscapes/Foggy Cityscapes benchmarks supply only RGB). If monocular depth estimation is applied at test time, domain-specific errors could be injected that the prompt-learning stage cannot cancel, undermining the domain-invariance premise.
  2. [Methods] Methods section (depth-guided localization component): The description of how depth maps generate region proposals lacks explicit handling of target-domain depth at inference; without this, it is impossible to verify whether the method truly operates in a fully unsupervised MSDA setting or implicitly requires target depth data.
minor comments (2)
  1. [Abstract] The abstract states 'comprehensive ablations demonstrate the effectiveness of our contributions' but does not preview the specific ablation factors (e.g., depth vs. prompt removal) or report quantitative deltas; adding a one-sentence summary would improve clarity.
  2. [Introduction] Notation for multi-modal feature alignment is introduced without an accompanying equation or diagram reference in the early sections, making the prompt-learning description harder to follow on first reading.
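
For orientation, one plausible form of the prompt-alignment head, in the spirit of CoOp-style prompt learning; this is an illustration of the general technique, not the paper's notation.

```latex
% Illustrative CoOp-style prompt alignment (not the paper's exact formulation).
% Learnable context vectors v_1..v_M are concatenated with a class-name token,
% encoded by a frozen text encoder g, and region features f_r are classified by
% temperature-scaled cosine similarity.
\begin{align}
  t_c &= g\big([\,v_1, v_2, \dots, v_M, \mathrm{emb}(c)\,]\big) \\
  p(c \mid r) &= \frac{\exp\big(\cos(f_r,\, t_c)/\tau\big)}
                     {\sum_{c'} \exp\big(\cos(f_r,\, t_{c'})/\tau\big)}
\end{align}
```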

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for identifying the need for greater clarity on depth map handling. We agree that the current manuscript does not explicitly describe the depth source for target domains and will revise both the abstract and methods section to address this. Our responses to the major comments are provided below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The assertion that depth maps are domain-agnostic inputs enabling domain-agnostic region proposals is load-bearing for the central claim, yet the manuscript does not specify the depth source for target domains (e.g., Cityscapes/Foggy Cityscapes benchmarks supply only RGB). If monocular depth estimation is applied at test time, domain-specific errors could be injected that the prompt-learning stage cannot cancel, undermining the domain-invariance premise.

    Authors: We acknowledge that the manuscript does not explicitly state how depth maps are obtained for target domains that provide only RGB images. In the revised version we will clarify that depth maps are generated by applying a fixed, pre-trained monocular depth estimation model (e.g., MiDaS) to the RGB input in both source and target domains; this model is never fine-tuned on target data. While depth estimation errors are inevitable, our ablation studies demonstrate that the resulting proposals remain more domain-invariant than RGB-only baselines, yielding the reported performance gains. We will add a short discussion of robustness to depth noise and include the exact depth model and inference procedure in the methods section. revision: yes

  2. Referee: [Methods] Methods section (depth-guided localization component): The description of how depth maps generate region proposals lacks explicit handling of target-domain depth at inference; without this, it is impossible to verify whether the method truly operates in a fully unsupervised MSDA setting or implicitly requires target depth data.

    Authors: We agree that the current methods description is incomplete on this point. The revised manuscript will explicitly state that, at inference, depth maps for target-domain images are produced by the same pre-trained monocular depth estimator used during training, with no access to ground-truth depth or any target supervision. This keeps the setting fully unsupervised. We will also add a concise pipeline description (and optional pseudocode) that distinguishes training and inference stages to remove any ambiguity. revision: yes
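
A minimal sketch of the clarified pipeline, assuming a single frozen monocular depth estimator shared across domains. The estimator below is a placeholder module (the rebuttal names MiDaS-style models as one option); the same call serves labeled source batches, unlabeled target batches, and test images, with no target supervision.

```python
# Frozen monocular depth estimation applied identically in every domain, at
# training and inference alike. The estimator is a placeholder, not MiDaS itself.
import torch


class FrozenDepthEstimator(torch.nn.Module):
    """Stand-in for a pre-trained monocular depth network; never fine-tuned."""

    def __init__(self):
        super().__init__()
        self.net = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)
        for p in self.parameters():
            p.requires_grad_(False)   # frozen: no gradients, no target adaptation

    @torch.no_grad()
    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        return self.net(rgb)          # (B, 3, H, W) -> (B, 1, H, W)


def depth_for(batch_rgb: torch.Tensor, estimator: FrozenDepthEstimator) -> torch.Tensor:
    """Same call for source images, unlabeled target images, and test images."""
    estimator.eval()
    return estimator(batch_rgb)


if __name__ == "__main__":
    est = FrozenDepthEstimator()
    source = torch.rand(2, 3, 32, 32)   # labeled source batch
    target = torch.rand(2, 3, 32, 32)   # unlabeled target batch (no labels used)
    print(depth_for(source, est).shape, depth_for(target, est).shape)
```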

Circularity Check

0 steps flagged

Empirical architecture; no derivation reduces to fitted inputs or self-citations by construction

full rationale

The paper introduces MS-DePro as a composite detector that applies depth maps for region proposals and text prompts for embedding alignment within a multi-source domain adaptation pipeline. All performance claims rest on external MSDA benchmarks rather than internal equations that equate outputs to training losses or parameters by definition. No self-citation chain is invoked to establish uniqueness, and the method is presented as an engineering combination whose effectiveness is measured empirically, not derived tautologically from its own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that depth maps and text are domain-agnostic signals that can be leveraged without target labels. Free parameters include all network weights and prompt embeddings trained on source data. No invented physical entities are introduced.

free parameters (1)
  • network weights and prompt embeddings
    Learned during training on source domains; their values determine localization and classification performance.
axioms (2)
  • domain assumption Depth maps provide domain-agnostic information for generating region proposals
    Invoked to justify the depth-guided localization module.
  • domain assumption Multi-modal features can align learnable text embeddings for classification across domains
    Invoked to justify the prompt-learning component.

pith-pipeline@v0.9.0 · 5518 in / 1416 out tokens · 65005 ms · 2026-05-14T19:33:24.068655+00:00 · methodology

discussion (0)

