Multi-Modal Guided Multi-Source Domain Adaptation for Object Detection
Pith reviewed 2026-05-14 19:33 UTC · model grok-4.3
The pith
Depth maps and text prompts let a detector trained on multiple source domains adapt to a new target domain without target labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MS-DePro consists of depth-guided localization that produces domain-agnostic region proposals from depth maps and multi-modal guided prompt learning that integrates multi-modal features to align learnable text embeddings for classification. By leveraging these domain-agnostic inputs, the detector learns domain-agnostic characteristics while preserving domain-specific information, outperforming previous multi-source domain adaptation methods on standard benchmarks.
What carries the argument
MS-DePro's depth-guided localization and multi-modal guided prompt learning, which encode domain-agnostic characteristics from depth maps and text to produce region proposals and aligned embeddings for cross-domain object detection.
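How depth might yield domain-agnostic proposals can be illustrated with a toy, hand-written sketch. MS-DePro's actual localization module is learned, and every name below (`depth_proposals`, `max_gap`) is hypothetical; the sketch only captures the intuition that pixels at similar depth tend to belong to one surface, so boxes around depth-coherent regions approximate object extents regardless of RGB appearance.

```python
def depth_proposals(depth, max_gap=0.5):
    """Toy proposal generator: flood-fill regions whose neighboring depths
    differ by less than max_gap, then return each region's bounding box
    as (x_min, y_min, x_max, y_max). Not MS-DePro's learned module."""
    h, w = len(depth), len(depth[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if seen[sy][sx]:
                continue
            # flood-fill one depth-coherent region
            stack, region = [(sy, sx)], []
            seen[sy][sx] = True
            while stack:
                y, x = stack.pop()
                region.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < h and 0 <= nx < w and not seen[ny][nx] \
                            and abs(depth[ny][nx] - depth[y][x]) < max_gap:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            ys = [p[0] for p in region]
            xs = [p[1] for p in region]
            boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes

# A 4x6 depth map: a near object (depth ~1) on a far background (depth ~5).
depth = [
    [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],
    [5.0, 1.0, 1.1, 5.0, 5.0, 5.0],
    [5.0, 1.2, 1.0, 5.0, 5.0, 5.0],
    [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],
]
print(depth_proposals(depth))  # → [(0, 0, 5, 3), (1, 1, 2, 2)]
```

The same depth map would be produced whether the RGB scene is clear, foggy, or synthetic, which is the domain-invariance argument in miniature.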
If this is right
- MS-DePro reaches state-of-the-art detection accuracy on existing multi-source domain adaptation benchmarks.
- Ablation experiments isolate performance lifts from the depth-guided localization and the prompt-learning modules.
- The method keeps domain-specific cues while extracting domain-agnostic signals from non-RGB modalities.
- Separate processing of each source domain outperforms simple blending of all sources.
Where Pith is reading between the lines
- The approach could transfer to semantic segmentation if depth continues to supply reliable object boundaries across domains.
- Text-prompt alignment opens a route for adding new object categories at test time by editing the text embeddings.
- Performance would likely degrade in settings where depth sensors are absent or poorly calibrated.
- Combining the same depth and prompt signals with infrared or event data might further widen the domain gap the method can bridge.
Load-bearing premise
Depth maps and text are reliably available as domain-agnostic inputs that improve region proposals and embeddings without introducing new biases or needing target-domain depth data.
What would settle it
Running the full MS-DePro pipeline on the MSDA benchmarks but replacing depth inputs with random noise or removing the prompt alignment step yields no accuracy gain over strong multi-source baselines.
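The noise-control arm of that experiment can be sketched as an input-swap harness; `ablation_depth_inputs` and its arguments are hypothetical names, and the detector itself is omitted. One would run the full pipeline once per arm and compare mAP: if the "noise" arm matches the "real" arm, depth carries no usable signal.

```python
import random

def ablation_depth_inputs(depth, mode, seed=0):
    """Build one ablation arm's depth input. 'real' keeps the map;
    'noise' redraws every value uniformly from the map's own range,
    matching scale while destroying spatial structure.
    Hypothetical helper, not from the paper."""
    if mode == "real":
        return depth
    rng = random.Random(seed)  # fixed seed so the arm is reproducible
    flat = [v for row in depth for v in row]
    lo, hi = min(flat), max(flat)
    return [[rng.uniform(lo, hi) for _ in row] for row in depth]

clean = [[1.0, 5.0], [2.0, 4.0]]
noisy = ablation_depth_inputs(clean, "noise")
# Feed each arm to the full detector and compare benchmark mAP; equal
# scores would mean the depth branch contributes nothing beyond its
# mere presence in the architecture.
```

Matching the value range matters: otherwise a score gap could be explained by out-of-distribution input statistics rather than by the loss of depth structure.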
Original abstract
General object detection (OD) struggles to detect objects in the target domain that differ from the training distribution. To address this, recent studies demonstrate that training from multiple source domains and explicitly processing them separately for multi-source domain adaptation (MSDA) outperforms blending them for unsupervised domain adaptation (UDA). However, existing MSDA methods learn domain-agnostic features from domain-specific RGB images while preserving domain-specific information from the domain-agnostic feature map. To address this, we propose MS-DePro: Multi-Source Detector with Depth and Prompt, composed of (1) depth-guided localization and (2) multi-modal guided prompt learning. We leverage domain-agnostic input modalities, namely depth maps and text, to encode domain-agnostic characteristics. Specifically, we utilize depth maps to generate domain-agnostic region proposals for localization and integrate multi-modal features to align learnable text embeddings for classification. MS-DePro achieves state-of-the-art performance on MSDA benchmarks, and comprehensive ablations demonstrate the effectiveness of our contributions. Our code is available on https://github.com/sejong-rcv/Multi-Modal-Guided-Multi-Source-Domain-Adaptation-for-Object-Detection.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MS-DePro, a multi-source domain adaptation (MSDA) framework for object detection that integrates depth maps for domain-agnostic region proposals via depth-guided localization and text embeddings via multi-modal prompt learning for classification. It claims state-of-the-art performance on MSDA benchmarks with supporting ablations, leveraging depth and text as domain-agnostic modalities to improve cross-domain generalization over RGB-only approaches.
Significance. If the central claims survive scrutiny of the depth source and the target-domain handling, the work would be moderately significant for MSDA in object detection, showing how auxiliary modalities can produce more invariant proposals and aligned features. The public code release is a clear strength that supports reproducibility and follow-up work.
major comments (2)
- [Abstract] The assertion that depth maps are domain-agnostic inputs enabling domain-agnostic region proposals is load-bearing for the central claim, yet the manuscript does not specify the depth source for target domains (e.g., the Cityscapes/Foggy Cityscapes benchmarks supply only RGB). If monocular depth estimation is applied at test time, domain-specific errors could be injected that the prompt-learning stage cannot cancel, undermining the domain-invariance premise.
- [Methods] Methods section (depth-guided localization component): The description of how depth maps generate region proposals lacks explicit handling of target-domain depth at inference; without this, it is impossible to verify whether the method truly operates in a fully unsupervised MSDA setting or implicitly requires target depth data.
minor comments (2)
- [Abstract] The abstract states 'comprehensive ablations demonstrate the effectiveness of our contributions' but does not preview the specific ablation factors (e.g., depth vs. prompt removal) or report quantitative deltas; adding a one-sentence summary would improve clarity.
- [Introduction] Notation for multi-modal feature alignment is introduced without an accompanying equation or diagram reference in the early sections, making the prompt-learning description harder to follow on first reading.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for identifying the need for greater clarity on depth map handling. We agree that the current manuscript does not explicitly describe the depth source for target domains and will revise both the abstract and methods section to address this. Our responses to the major comments are provided below.
Point-by-point responses
- Referee: [Abstract] The assertion that depth maps are domain-agnostic inputs enabling domain-agnostic region proposals is load-bearing for the central claim, yet the manuscript does not specify the depth source for target domains (e.g., the Cityscapes/Foggy Cityscapes benchmarks supply only RGB). If monocular depth estimation is applied at test time, domain-specific errors could be injected that the prompt-learning stage cannot cancel, undermining the domain-invariance premise.
Authors: We acknowledge that the manuscript does not explicitly state how depth maps are obtained for target domains that provide only RGB images. In the revised version we will clarify that depth maps are generated by applying a fixed, pre-trained monocular depth estimation model (e.g., MiDaS) to the RGB input in both source and target domains; this model is never fine-tuned on target data. While depth estimation errors are inevitable, our ablation studies demonstrate that the resulting proposals remain more domain-invariant than RGB-only baselines, yielding the reported performance gains. We will add a short discussion of robustness to depth noise and include the exact depth model and inference procedure in the methods section. revision: yes
- Referee: [Methods] Methods section (depth-guided localization component): The description of how depth maps generate region proposals lacks explicit handling of target-domain depth at inference; without this, it is impossible to verify whether the method truly operates in a fully unsupervised MSDA setting or implicitly requires target depth data.
Authors: We agree that the current methods description is incomplete on this point. The revised manuscript will explicitly state that, at inference, depth maps for target-domain images are produced by the same pre-trained monocular depth estimator used during training, with no access to ground-truth depth or any target supervision. This keeps the setting fully unsupervised. We will also add a concise pipeline description (and optional pseudocode) that distinguishes training and inference stages to remove any ambiguity. revision: yes
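The training/inference distinction the rebuttal promises could be sketched as follows, under its stated assumptions: a frozen monocular depth estimator (e.g., MiDaS) applied to RGB in both source and target domains, with no ground-truth depth and no target supervision. The stub classes and function names below are hypothetical stand-ins, not the released code.

```python
class StubDetector:
    """Stand-in for the trained detector; the real MS-DePro heads differ."""

    def propose(self, depth):
        # depth-guided localization, reduced here to one box covering the map
        return [(0, 0, len(depth[0]) - 1, len(depth) - 1)]

    def classify(self, rgb, proposals):
        # prompt-aligned classification head, reduced to a constant label
        return [("object", box) for box in proposals]

def frozen_depth_estimator(rgb):
    # stand-in for a pre-trained monocular depth model, never fine-tuned
    # on target data; returns a depth map matching the input's spatial size
    return [[1.0 for _ in row] for row in rgb]

def infer(rgb_image, detector, depth_estimator):
    """Target-domain inference: depth is estimated from RGB by the same
    frozen model used during training, so no target depth data is needed."""
    depth = depth_estimator(rgb_image)
    proposals = detector.propose(depth)  # localization sees depth only
    return detector.classify(rgb_image, proposals)

rgb = [[0, 0, 0], [0, 0, 0]]  # a 2x3 placeholder "image"
print(infer(rgb, StubDetector(), frozen_depth_estimator))
# → [('object', (0, 0, 2, 1))]
```

Because the estimator is shared and frozen across stages, the setting stays fully unsupervised with respect to the target domain, which is exactly the point the referee asked the authors to pin down.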
Circularity Check
Empirical architecture: no derivation reduces its outputs to fitted inputs by construction, and no claim rests on a self-citation chain.
full rationale
The paper introduces MS-DePro as a composite detector that applies depth maps for region proposals and text prompts for embedding alignment within a multi-source domain adaptation pipeline. All performance claims rest on external MSDA benchmarks rather than internal equations that equate outputs to training losses or parameters by definition. No self-citation chain is invoked to establish uniqueness, and the method is presented as an engineering combination whose effectiveness is measured empirically, not derived tautologically from its own inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- network weights and prompt embeddings
axioms (2)
- domain assumption: Depth maps provide domain-agnostic information for generating region proposals
- domain assumption: Multi-modal features can align learnable text embeddings for classification across domains
Reference graph
Works this paper leans on
- [1] X. Wei, S. Liu, Y. Xiang, Z. Duan, C. Zhao, Y. Lu, Incremental learning based multi-domain adaptation for object detection, Knowledge-Based Systems 210 (2020) 106420.
- [2] J. Deng, W. Li, Y. Chen, L. Duan, Unbiased mean teacher for cross-domain object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4091–4101.
- [3] Y.-J. Li, X. Dai, C.-Y. Ma, Y.-C. Liu, K. Chen, B. Wu, Z. He, K. Kitani, P. Vajda, Cross-domain adaptive teacher for object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7581–7590.
- [4] Y. Bai, C. Liu, R. Yang, X. Li, Misalignment-resistant domain adaptive learning for one-stage object detection, Knowledge-Based Systems 305 (2024) 112605.
- [5] X. Yao, S. Zhao, P. Xu, J. Yang, Multi-source domain adaptation for object detection, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3273–3282.
- [6] J. Wu, J. Chen, M. He, Y. Wang, B. Li, B. Ma, W. Gan, W. Wu, Y. Wang, D. Huang, Target-relevant knowledge preservation for multi-source domain adaptive object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5301–5310.
- [7]
- [8]
- [9] C. Ge, R. Huang, M. Xie, Z. Lai, S. Song, S. Li, G. Huang, Domain adaptation via prompt learning, IEEE Transactions on Neural Networks and Learning Systems (2023).
- [10]
- [11] S. Addepalli, A. R. Asokan, L. Sharma, R. V. Babu, Leveraging vision-language models for improving domain generalization in image classification, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 23922–23932.
- [12] L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, H. Zhao, Depth anything v2, in: Advances in Neural Information Processing Systems, Vol. 37, 2024, pp. 21875–21911.
- [13] A. Bochkovskii, A. Delaunoy, H. Germain, M. Santos, Y. Zhou, S. R. Richter, V. Koltun, Depth pro: Sharp monocular metric depth in less than a second, arXiv preprint arXiv:2410.02073 (2024).
- [14] Y. Lu, H. Huang, X. Hu, Z. Lai, Multiple adaptation network for multi-source and multi-target domain adaptation, IEEE Transactions on Multimedia 27 (2025) 5731–5745.
- [15] Y. Lu, Y. Yang, W. K. Wong, A. Toomey, Z. Lai, X. Li, Energy-driven explicit alignment network: A blended-target domain adaptation approach, IEEE Transactions on Multimedia (2026).
- [16] Y. Lu, Y. Lan, H. Yang, Z. Lai, X. Li, Exploring generic knowledge and reactivating source model for source-free universal domain adaptation, IEEE Transactions on Multimedia (2026).
- [17] Y. Lu, H. Huang, W. K. Wong, X. Hu, Z. Lai, X. Li, Adaptive dispersal and collaborative clustering for few-shot unsupervised domain adaptation, IEEE Transactions on Image Processing (2025).
- [18] C. Ouyang, C. Chen, S. Li, Z. Li, C. Qin, W. Bai, D. Rueckert, Causality-inspired single-source domain generalization for medical image segmentation, IEEE Transactions on Medical Imaging 42 (4) (2022) 1095–1106.
- [19] J. Song, H. Chen, Y. Lyu, W. Nie, A.-A. Liu, Causality-inspired unsupervised domain adaptation with target style imitation for medical image segmentation, IEEE Transactions on Circuits and Systems for Video Technology (2025).
- [20] K. Zhou, M. Jiang, B. Gabrys, Y. Xu, Learning causal representations based on a gae embedded autoencoder, IEEE Transactions on Knowledge and Data Engineering (2025).
- [21] J. Wang, Y. Chen, Z. Dong, M. Gao, H. Lin, Q. Miao, Sabv-depth: A biologically inspired deep learning network for monocular depth estimation, Knowledge-Based Systems 263 (2023) 110301.
- [22]
- [23] K. Zhou, J. Yang, C. C. Loy, Z. Liu, Learning to prompt for vision-language models, International Journal of Computer Vision 130 (9) (2022) 2337–2348.
- [24] K. Zhou, J. Yang, C. C. Loy, Z. Liu, Conditional prompt learning for vision-language models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16816–16825.
- [25] Y. Du, F. Wei, Z. Zhang, M. Shi, Y. Gao, G. Li, Learning to prompt for open-vocabulary object detection with vision-language model, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 14084–14093.
- [26] H. Li, R. Zhang, H. Yao, X. Song, Y. Hao, Y. Zhao, L. Li, Y. Chen, Learning domain-aware detection head with prompt tuning, Advances in Neural Information Processing Systems 36 (2023) 4248–4262.
- [27]
- [28] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6) (2016) 1137–1149.
- [29] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
- [30] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: Proceedings of the 38th International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.
- [31] A. Tarvainen, H. Valpola, Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results, Advances in Neural Information Processing Systems 30 (2017).
- [32] R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, F. A. Wichmann, Shortcut learning in deep neural networks, Nature Machine Intelligence 2 (11) (2020) 665–673.
- [33] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask r-cnn, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2017, pp. 2961–2969.
- [34] X. Pan, P. Luo, J. Shi, X. Tang, Two at once: Enhancing learning and generalization capacities via ibn-net, in: Proceedings of the European Conference on Computer Vision, 2018, pp. 464–479.
- [35] T.-Y. Ross, G. Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017, pp. 2980–2988.
- [36]
- [37] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, R. Girshick, Detectron2, https://github.com/facebookresearch/detectron2 (2019).
- [38] F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, T. Darrell, Bdd100k: A diverse driving dataset for heterogeneous multitask learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2636–2645.
- [39] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. Schiele, The cityscapes dataset for semantic urban scene understanding, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.
- [40]
- [41] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft coco: Common objects in context, in: Proceedings of the European Conference on Computer Vision, Springer, 2014, pp. 740–755.
- [42] M. Wrenninge, J. Unger, Synscapes: A photorealistic synthetic dataset for street scene parsing, arXiv preprint arXiv:1810.08705 (2018).
- [43] H. Zhao, S. Zhang, G. Wu, J. M. Moura, J. P. Costeira, G. J. Gordon, Adversarial multiple source domain adaptation, Advances in Neural Information Processing Systems 31 (2018).
- [44] X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, B. Wang, Moment matching for multi-source domain adaptation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1406–1415.
- [45]
- [46] D. Li, A. Wu, Y. Wang, Y. Han, Prompt-driven dynamic object-centric learning for single domain generalization, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 17606–17615.
- [47] M. S. Danish, M. H. Khan, M. A. Munir, M. S. Sarfraz, M. Ali, Improving single domain-generalized object detection: A focus on diversification and alignment, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 17732–17742.
- [48] F. Wu, J. Gao, L. Hong, X. Wang, C. Zhou, N. Ye, G-nas: Generalizable neural architecture search for single domain generalization object detection, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, 2024, pp. 5958–5966.
- [49] W. Lee, D. Hong, H. Lim, H. Myung, Object-aware domain generalization for object detection, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, 2024, pp. 2947–2955.
- [50] X. Xu, J. Yang, W. Shi, S. Ding, L. Luo, J. Liu, Physaug: A physical-guided and frequency-based data augmentation for single-domain generalized object detection, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, 2025, pp. 21815–21823.
- [51] B. He, Y. Ji, Q. Ye, Z. Tan, L. Wu, Generalized diffusion detector: Mining robust features from diffusion models for domain-generalized detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 9921–9932.
- [52] C. Sakaridis, D. Dai, L. Van Gool, Semantic foggy scene understanding with synthetic data, International Journal of Computer Vision 126 (2018) 973–992.
- [53] M. Hassaballah, M. A. Kenk, K. Muhammad, S. Minaee, Vehicle detection and tracking in adverse weather using a deep learning framework, IEEE Transactions on Intelligent Transportation Systems 22 (7) (2020) 4230–4242.
- [54] A. Wu, R. Liu, Y. Han, L. Zhu, Y. Yang, Vector-decomposed disentanglement for domain-invariant object detection, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9342–9351.
- [55] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, A. Zisserman, The pascal visual object classes challenge: A retrospective, International Journal of Computer Vision 111 (1) (2015) 98–136.
- [56]
- [57] A. Wu, C. Deng, Single-domain generalized object detection in urban scene via cyclic-disentangled self-distillation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 847–856.
- [58] P. Sharma, N. Ding, S. Goodman, R. Soricut, Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 2556–2565.
- [59] A. Eftekhar, A. Sax, J. Malik, A. Zamir, Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10786–10796.
- [60] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, A. Torralba, Learning aligned cross-modal representations from weakly aligned data, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 2940–2949.
- [61] M. J. Wilber, C. Fang, H. Jin, A. Hertzmann, J. Collomosse, S. Belongie, Bam! the behance artistic media dataset for recognition beyond photography, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2017, pp. 1202–1211.
discussion (0)