pith. machine review for the scientific record.

arxiv: 2604.21502 · v1 · submitted 2026-04-23 · 💻 cs.CV

Recognition: unknown

VFM⁴SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 21:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords single-domain generalized object detection · vision foundation models · domain generalization · missed detections · relational stability · query enhancement · DETR detectors

The pith

A frozen vision foundation model supplies stability priors that cut missed detections in single-domain object detectors facing unseen conditions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that single-domain-trained detectors mainly fail in new environments because object-background and inter-instance relations lose stability during feature encoding and because query representations lose semantic-spatial alignment during decoding. This matters because real scenes constantly change in lighting, weather, and camera settings, so a method that fixes the detector's internal mechanisms could let models trained on one set of images work reliably on many others without extra data collection. The authors introduce a dual-prior framework that freezes a vision foundation model and distills its cross-domain stability into the detector's relational modeling and query enhancement steps. Experiments on standard benchmarks and DETR-style detectors show consistent gains over prior single-domain generalization techniques.

Core claim

Performance degradation under domain shift is driven by rising missed detections that stem from unstable object-background and inter-instance relations in the encoding stage together with harder semantic-spatial alignment of queries in the decoding stage. The authors therefore propose VFM4SDG, a dual-prior learning method that inserts a frozen vision foundation model as a transferable stability source: Cross-domain Stable Relational Prior Distillation strengthens relational modeling in encoding, while Semantic-Contextual Prior-based Query Enhancement injects category-level semantic prototypes and global visual context into queries during decoding.
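
The claim names both modules without their formulas. As a rough, non-authoritative illustration of what a CSRPD-style relational distillation could look like, the sketch below matches pairwise token-relation matrices between a frozen VFM and the detector encoder; the function names, the cosine relations, and the MSE objective are assumptions rather than the paper's definitions.

```python
# Hedged sketch of a CSRPD-style relational prior distillation term (assumed form,
# not the paper's exact loss): match pairwise token relations between a frozen VFM
# and the detector encoder so the encoder inherits the VFM's relational structure.
import torch
import torch.nn.functional as F

def relation_matrix(tokens: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine relations between tokens; (B, N, D) -> (B, N, N)."""
    normed = F.normalize(tokens, dim=-1)
    return normed @ normed.transpose(-1, -2)

def relation_distill_loss(encoder_tokens: torch.Tensor,
                          vfm_tokens: torch.Tensor) -> torch.Tensor:
    """Penalize divergence of encoder relations from frozen-VFM relations."""
    with torch.no_grad():  # the VFM is frozen: treated purely as a prior, no gradients
        target = relation_matrix(vfm_tokens)
    return F.mse_loss(relation_matrix(encoder_tokens), target)

# Toy usage: relation matrices are N x N, so the encoder (256-d) and the VFM (768-d)
# need no shared feature dimension.
enc = torch.randn(2, 100, 256, requires_grad=True)
vfm = torch.randn(2, 100, 768)
relation_distill_loss(enc, vfm).backward()
```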

What carries the argument

Dual-prior learning framework that freezes a vision foundation model and injects its stability into relational distillation in the encoder and semantic-contextual enhancement of queries in the decoder.
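
On the decoder side, the minimal sketch below shows one way category-level prototypes and a pooled global context could be injected into queries. The QueryEnhancer name, the learnable prototypes, the soft assignment, and the additive fusion are assumptions standing in for the paper's Semantic-Contextual Prior-based Query Enhancement, whose exact design may differ.

```python
# Hedged sketch of an SCPQE-style query enhancement step (assumed design):
# learnable category prototypes are soft-assigned to each decoder query, and a
# pooled global context from the frozen VFM is added on top.
import torch
import torch.nn as nn

class QueryEnhancer(nn.Module):
    def __init__(self, num_classes: int, dim: int):
        super().__init__()
        self.prototypes = nn.Embedding(num_classes, dim)  # category-level semantic prototypes
        self.proto_proj = nn.Linear(dim, dim)
        self.ctx_proj = nn.Linear(dim, dim)

    def forward(self, queries: torch.Tensor, vfm_tokens: torch.Tensor) -> torch.Tensor:
        # queries: (B, Q, D) decoder queries; vfm_tokens: (B, N, D) frozen-VFM features
        protos = self.prototypes.weight                                        # (C, D)
        assign = torch.softmax(queries @ protos.t() / protos.shape[-1] ** 0.5, dim=-1)
        semantic = self.proto_proj(assign @ protos)                            # (B, Q, D)
        global_ctx = self.ctx_proj(vfm_tokens.mean(dim=1, keepdim=True))       # (B, 1, D)
        return queries + semantic + global_ctx

enhancer = QueryEnhancer(num_classes=8, dim=256)  # 8 classes, as in common urban-scene setups
queries = torch.randn(2, 300, 256)                # 300 object queries, a typical DETR setting
vfm_tokens = torch.randn(2, 100, 256)             # assumes VFM features already projected to 256-d
enhanced = enhancer(queries, vfm_tokens)          # (2, 300, 256)
```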

If this is right

  • Object-background and inter-instance relations remain more stable when the same detector is tested on images from unseen domains.
  • Query representations gain improved semantic recognition and spatial localization, lowering the rate of missed detections.
  • The same dual-prior additions raise accuracy on two mainstream DETR-based detectors across existing single-domain generalized object detection (SDGOD) evaluation sets.
  • No additional source-domain images or target-domain labels are required beyond the single training domain.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same prior-injection pattern could be tested on detection heads other than DETR to check whether the stability benefit is architecture-specific.
  • Applying the relational and query priors to video object detection might reduce frame-to-frame missed detections when scene conditions drift over time.
  • Measuring whether the frozen VFM priors also improve localization precision on small or occluded objects would reveal additional side benefits.

Load-bearing premise

The stability priors taken from the frozen vision foundation model transfer to the detector without creating new failure modes or domain-specific biases that would increase errors on unseen data.

What would settle it

Running the proposed method on a standard single-domain generalization benchmark and observing that the number of missed detections does not decrease relative to the unmodified baseline detector would falsify the claim that the VFM priors improve cross-domain stability.
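
Read concretely, that test amounts to counting ground-truth boxes that no prediction covers on an unseen-domain split, before and after adding the VFM priors. The sketch below is one hypothetical way to run the comparison; the 0.5 IoU threshold, the per-box matching rule, and the toy boxes are assumptions, not the benchmark's official protocol.

```python
# Hedged sketch of the falsification check: count ground-truth boxes that no
# prediction covers at IoU >= 0.5 on an unseen-domain image, for the baseline and
# the VFM-prior variant.
def iou(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def missed_detections(gt_boxes, pred_boxes, thresh=0.5):
    """Ground-truth boxes left uncovered by every prediction at the IoU threshold."""
    return sum(1 for g in gt_boxes if all(iou(g, p) < thresh for p in pred_boxes))

gt = [(10, 10, 50, 50), (60, 60, 100, 100)]
baseline_preds = [(12, 11, 49, 52)]                # hypothetical baseline: misses one object
vfm_preds = [(12, 11, 49, 52), (58, 62, 99, 101)]  # hypothetical VFM-prior detector
print(missed_detections(gt, baseline_preds))       # 1
print(missed_detections(gt, vfm_preds))            # 0 -> a drop like this supports the claim
```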

Figures

Figures reproduced from arXiv: 2604.21502 by Liang Wan, Ningnan Guo, Ruize Han, Song Wang, Wei Feng, Yupeng Zhang.

Figure 1. Evolution of Error Types under Increasing Domain Shift (based on …).
Figure 2. Overall framework of VFM4SDG. Built upon a DETR-based detector, VFM4SDG leverages a frozen VFM as a cross-domain structural visual prior for single-domain generalized object detection. At the encoding stage, Cross-domain Stable Relational Prior Distillation (CSRPD) transfers cross-domain stable inter-instance relational structures from VFM to the encoder, yielding a representation space with enhanced relat…
Figure 3. Qualitative comparisons under diverse domain conditions. We compare our detection results with state-of-the-art methods, with different categories …
Figure 4. Visualization of Encoder Feature Responses (Layer-2) under Domain Shift. From left to right: source image, Co-DETR encoder features, DINOv3 …
Figure 5. Representative failure cases of VFM4SDG (Co-DETR-based) under challenging domain conditions. From left to right: GT and VFM4SDG predictions. The examples include scenarios with heavy rain, nighttime illumination, dense fog, motion blur, small-scale objects, and severe occlusion. While the proposed method maintains robust detection performance in most cases, certain missed detections still occur under extr…
original abstract

In real-world scenarios, continual changes in weather, illumination, and imaging conditions cause significant domain shifts, leading detectors trained on a single source domain to degrade severely in unseen environments. Existing single-domain generalized object detection (SDGOD) methods mainly rely on data augmentation or domain-invariant representation learning, but pay limited attention to detector mechanisms, leaving clear limitations under complex domain shifts. Through analytical experiments, we find that performance degradation is dominated by increasing missed detections, which fundamentally arises from reduced cross-domain stability of the detector: object-background and inter-instance relations become less stable in the encoding stage, while semantic-spatial alignment of query representations also becomes harder to maintain in the decoding stage. To this end, we propose VFM$^{4}$SDG, a dual-prior learning framework for SDGOD, which introduces a frozen vision foundation model (VFM) as a transferable cross-domain stability prior into detector representation learning and query modeling. In the encoding stage, we propose Cross-domain Stable Relational Prior Distillation to enhance the robustness of object-background and inter-instance relational modeling. In the decoding stage, we propose Semantic-Contextual Prior-based Query Enhancement, which injects category-level semantic prototypes and global visual context into queries to improve their semantic recognition and spatial localization stability in unseen domains. Extensive experiments show that the proposed method consistently outperforms existing SOTA methods on standard SDGOD benchmarks and two mainstream DETR-based detectors, demonstrating its effectiveness, robustness, and generality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents VFM⁴SDG, a dual-prior learning framework for single-domain generalized object detection (SDGOD). Through analytical experiments, it identifies that performance degradation under domain shifts is dominated by missed detections arising from reduced cross-domain stability in object-background and inter-instance relations (encoding stage) and semantic-spatial query alignment (decoding stage). The method injects transferable stability priors from a frozen vision foundation model (VFM) via two modules: Cross-domain Stable Relational Prior Distillation and Semantic-Contextual Prior-based Query Enhancement. Extensive experiments are reported to show consistent gains over existing SOTA methods on standard SDGOD benchmarks with two mainstream DETR-based detectors.

Significance. If the central mechanism holds, the work offers a promising direction for SDGOD by moving beyond data augmentation and domain-invariant representations to leverage frozen VFMs for explicit stability priors. This could improve robustness to real-world shifts (weather, illumination) while remaining computationally efficient. The reported generality across DETR detectors and benchmark gains, if mechanistically validated, would be a substantive contribution to the field.

major comments (2)
  1. [Analytical Experiments and Results sections] The analytical experiments diagnose missed-detection dominance and reduced relational/query stability as the core failure mode, yet the results provide no direct quantitative measurements (e.g., pre/post stability scores for object-background relations or query alignment) demonstrating that the proposed distillation and enhancement modules measurably restore those specific quantities. Without this link, benchmark gains could stem from generic feature enrichment rather than the claimed stability mechanism.
  2. [Experiments] The claim that the framework demonstrates 'generality' for arbitrary DETR-based detectors rests on experiments with only two mainstream detectors. The manuscript should either expand the detector testbed or provide a concrete argument (e.g., via ablation on query/relation components) showing why the VFM priors transfer independently of specific DETR architecture choices.
minor comments (2)
  1. [Method] Clarify the exact formulation of the relational prior distillation loss (e.g., which layers of the VFM are used and how the stability objective is defined) to allow reproducibility.
  2. [Introduction and Method] Ensure that all stability-related terms (e.g., 'cross-domain relational stability') are formally defined with equations or metrics early in the paper rather than only described qualitatively. (An illustrative stability metric is sketched just below.)
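
On the stability measurements the report asks for, one hedged way to operationalize a relation-consistency score is sketched below: compare the pairwise token relations an encoder produces for a source image and for a domain-shifted version of the same scene, and report the score for the baseline and the prior-distilled encoder. The function names, the cosine-of-relations metric, and the synthetic shifted tensor are illustrative assumptions, not measurements from the paper.

```python
# Hedged sketch of one possible relation-consistency score (an assumption, not the
# paper's metric): how similar the pairwise token relations of the same scene stay
# between its source and domain-shifted versions. Higher means more stable.
import torch
import torch.nn.functional as F

def relation_consistency(src_tokens: torch.Tensor, shifted_tokens: torch.Tensor) -> float:
    """Cosine similarity between flattened relation matrices; tokens are (N, D)."""
    def relations(t: torch.Tensor) -> torch.Tensor:
        t = F.normalize(t, dim=-1)
        return (t @ t.t()).flatten()
    return F.cosine_similarity(relations(src_tokens), relations(shifted_tokens), dim=0).item()

# Report the score for the baseline encoder and the prior-distilled encoder on the
# same image pair; the delta is the pre/post stability measurement the report requests.
src = torch.randn(100, 256)
shifted = src + 0.3 * torch.randn(100, 256)   # stand-in for a foggy or night-time rendering
print(round(relation_consistency(src, shifted), 3))
```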

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We address each major comment below with our responses and indicate where revisions will be made to strengthen the manuscript.

point-by-point responses
  1. Referee: [Analytical Experiments and Results sections] The analytical experiments diagnose missed-detection dominance and reduced relational/query stability as the core failure mode, yet the results provide no direct quantitative measurements (e.g., pre/post stability scores for object-background relations or query alignment) demonstrating that the proposed distillation and enhancement modules measurably restore those specific quantities. Without this link, benchmark gains could stem from generic feature enrichment rather than the claimed stability mechanism.

    Authors: We appreciate the referee's emphasis on establishing a direct causal link. Section 4.1 presents quantitative diagnostics of missed-detection increases and stability degradation via relation consistency and query alignment metrics across domains. However, we agree that explicit pre/post measurements tied specifically to the distillation and enhancement modules would more rigorously rule out generic feature enrichment. We will add these direct quantitative comparisons (e.g., stability score deltas before and after each module) in the revised manuscript. revision: yes

  2. Referee: [Experiments] The claim that the framework demonstrates 'generality' for arbitrary DETR-based detectors rests on experiments with only two mainstream detectors. The manuscript should either expand the detector testbed or provide a concrete argument (e.g., via ablation on query/relation components) showing why the VFM priors transfer independently of specific DETR architecture choices.

    Authors: We acknowledge the experiments use two mainstream DETR-based detectors. The manuscript already includes component-wise ablations isolating the relational prior distillation (encoder) and semantic-contextual query enhancement (decoder). These ablations show consistent gains from targeting core DETR elements—relational modeling and query semantics—that are shared across DETR variants, providing evidence that the VFM priors operate independently of specific architectural choices. This constitutes our concrete argument for generality. We can expand the testbed with a third DETR variant if the referee considers it necessary. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's chain proceeds from analytical experiments diagnosing missed-detection dominance and relational instability, to an independent proposal of two new modules (Cross-domain Stable Relational Prior Distillation in encoding and Semantic-Contextual Prior-based Query Enhancement in decoding) that inject frozen VFM priors. Central performance claims rest on external benchmark comparisons with SOTA methods and two DETR detectors, not on any fitted parameter renamed as prediction, self-definitional loop, or load-bearing self-citation. The derivation introduces new architectural components whose effectiveness is measured separately from the diagnostic observations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that vision foundation models encode transferable cross-domain stability that can be distilled without fine-tuning; no free parameters or invented entities are visible in the abstract.

axioms (1)
  • domain assumption Vision foundation models supply transferable cross-domain stability priors for object-background and inter-instance relations.
    The framework treats this property as given when freezing the VFM and distilling from it.

pith-pipeline@v0.9.0 · 5579 in / 1258 out tokens · 37500 ms · 2026-05-09T21:34:41.991160+00:00 · methodology

discussion (0)

