pith. machine review for the scientific record.

arxiv: 2605.02258 · v1 · submitted 2026-05-04 · 💻 cs.CV

Recognition: unknown

SpectraDINO: Bridging the Spectral Gap in Vision Foundation Models via Lightweight Adapters

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 16:07 UTC · model grok-4.3

classification 💻 cs.CV
keywords multispectral imaging · vision foundation models · parameter-efficient adaptation · knowledge distillation · infrared object detection · semantic segmentation · cross-modal alignment · DINOv2

The pith

SpectraDINO extends frozen DINOv2 backbones to multispectral modalities using per-modality adapters and staged distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that RGB-pretrained vision foundation models can be adapted to near-infrared, short-wave infrared, and long-wave infrared data without retraining the core network from scratch. It inserts lightweight bottleneck adapters for each new modality and trains them through a curriculum of losses that align the student to the frozen teacher while preserving local structure in the feature space. This setup is tested on object detection and semantic segmentation benchmarks, where the resulting models reach state-of-the-art accuracy under standard fusion strategies. If the approach holds, existing large-scale RGB models become reusable bases for a broader range of sensing wavelengths rather than requiring separate pretraining for each spectrum.

Core claim

SpectraDINO bridges the spectral gap by keeping a DINOv2 ViT backbone frozen and attaching small per-modality bottleneck adapters. A multi-stage teacher-student protocol guides training with cosine distillation, symmetric contrastive loss, patch-level alignment, and a neighborhood-structure-preservation loss, allowing cross-modal alignment while avoiding catastrophic forgetting of the original RGB representations. When evaluated on multispectral object detection and semantic segmentation tasks across NIR, SWIR, and LWIR datasets, the adapted models achieve state-of-the-art results with common fusion methods.
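The abstract names the four loss terms but not their equations. As a non-authoritative sketch, the first three admit standard readings (the neighborhood term is sketched separately under the referee report below); every function form, temperature, and weight here is an assumption, not the paper's specification:

# Hedged sketch of three of the four training losses named in the abstract.
# The exact formulations and the temperature tau are assumptions.
import torch
import torch.nn.functional as F

def cosine_distill(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    # Global-feature distillation: pull the spectral student's embedding
    # toward the frozen RGB teacher's embedding of the paired view.
    return 1.0 - F.cosine_similarity(student, teacher, dim=-1).mean()

def symmetric_contrastive(student, teacher, tau: float = 0.07):
    # InfoNCE in both directions over a batch of paired embeddings (B, D);
    # matched cross-modal pairs sit on the diagonal of the logits matrix.
    s = F.normalize(student, dim=-1)
    t = F.normalize(teacher, dim=-1)
    logits = s @ t.T / tau
    labels = torch.arange(s.size(0), device=s.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.T, labels))

def patch_alignment(s_patches, t_patches):
    # Patch-level alignment: match each spatial token (B, N, D) of the
    # student to the teacher's token at the same location.
    return 1.0 - F.cosine_similarity(s_patches, t_patches, dim=-1).mean()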

What carries the argument

Lightweight per-modality bottleneck adapters inserted into a frozen DINOv2 backbone, trained via a multi-stage distillation curriculum that includes a neighborhood-structure-preservation loss.
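A minimal sketch of that adapter pattern, assuming the standard bottleneck design from the parameter-efficient adaptation literature; the width, serial placement after each block, and zero initialization of the up-projection are illustrative choices, not the authors' implementation:

# Minimal sketch of a bottleneck adapter wrapped around a frozen ViT block.
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as identity: the frozen
        nn.init.zeros_(self.up.bias)    # backbone's output is untouched

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

class AdaptedBlock(nn.Module):
    """One frozen transformer block plus a trainable per-modality adapter."""
    def __init__(self, block: nn.Module, dim: int, bottleneck: int = 64):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad = False  # backbone stays frozen during adaptation
        self.adapter = BottleneckAdapter(dim, bottleneck)

    def forward(self, x):
        return self.adapter(self.block(x))

Only the adapter parameters (roughly 2 × dim × bottleneck per block) receive gradients, which is what makes the per-modality extension cheap relative to pretraining a spectral backbone from scratch.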

If this is right

  • Multispectral perception pipelines can reuse large RGB foundation-model weights with only modest added parameters.
  • The same backbone can serve as a general-purpose feature extractor for visible and beyond-visible spectra without separate large-scale pretraining.
  • Adverse-condition robustness improves because complementary spectral channels become accessible through simple adapter attachment.
  • Modality-specific fine-tuning becomes cheaper and faster, lowering the barrier to deploying vision models on new sensors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The adapter-plus-distillation pattern could be tested on other non-RGB modalities such as radar or hyperspectral cubes to check how far the domain-gap closure generalizes.
  • If the neighborhood-structure loss proves critical, similar structure-preserving terms might help when adapting foundation models across other large distribution shifts.
  • The modular design suggests that a single frozen backbone could eventually host adapters for many sensor types simultaneously, enabling runtime switching between modalities with low overhead.

Load-bearing premise

The frozen RGB-pretrained DINOv2 model already holds priors that are close enough to spectral data that small adapters plus distillation can close the remaining gap without large drops in representation quality.

What would settle it

Training SpectraDINO on one set of spectral bands and then measuring whether its performance on an entirely unseen spectral band falls below that of a randomly initialized model trained only on the new band.

Figures

Figures reproduced from arXiv: 2605.02258 by Hyeongjin Ju, Incheol Park, Sanghyeop Yeo, Shiho Kim, Yagiz Nalcakan, Youngwan Jin.

Figure 1: Overview of the SpectraDINO architecture.

Figure 2: Training stage progression. Stage I trains only the modality-specific stems and adapters with the backbone frozen. Stage II introduces the neighborhood KL loss L_A and populates the teacher queue while keeping the backbone frozen. Stage III unfreezes 50% of backbone blocks (25% for ViT-G) and rebalances loss weights to emphasize cross-modal consistency and spatial alignment. The purpose of Stage I is to lea… (A config-style reading of this schedule follows Figure 3.)

Figure 3: 3-dimensional t-SNE projections of 100 randomly sampled RGB–infrared representation pairs across the three training stages. Lines connect corresponding cross-modal pairs.
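Read literally, the Figure 2 caption implies a schedule like the one below. The field names are ours; only the frozen/unfrozen states, the Stage II additions, and the unfreeze fractions come from the caption itself:

# The three-stage schedule from the Figure 2 caption, as a plain config.
STAGES = [
    {"stage": "I",
     "trainable": ["modality stems", "adapters"],
     "backbone": "frozen"},
    {"stage": "II",
     "trainable": ["modality stems", "adapters"],
     "backbone": "frozen",
     "adds": ["neighborhood KL loss L_A", "teacher queue"]},
    {"stage": "III",
     "trainable": ["modality stems", "adapters", "unfrozen backbone blocks"],
     "backbone": "50% of blocks unfrozen (25% for ViT-G)",
     "adds": ["reweighted losses: cross-modal consistency, spatial alignment"]},
]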
read the original abstract

Vision Foundation Models (VFMs) pretrained on large-scale RGB data have demonstrated remarkable representation quality, yet their applicability to multispectral imaging spanning Near-Infrared (NIR), Short-Wave Infrared (SWIR), and Long-Wave Infrared (LWIR) remains largely unexplored. These spectral modalities offer complementary sensing capabilities critical for robust perception in adverse conditions, but present a fundamental domain gap relative to RGB-centric pretrained models. We present SpectraDINO, a multispectral VFM that bridges this spectral gap by extending DINOv2 ViT backbones to beyond-visible modalities through lightweight, per-modality bottleneck adapters, while preserving the rich representations of the frozen RGB backbone. We introduce a multi-stage teacher-student training protocol in which a frozen DINOv2 teacher guides a spectral student via cosine distillation, symmetric contrastive loss, patch-level alignment, and a novel neighborhood-structure-preservation loss. This staged curriculum enables strong cross-modal alignment without catastrophic forgetting of RGB priors. We evaluate SpectraDINO on multispectral object detection and semantic segmentation across challenging NIR, SWIR, and LWIR benchmarks using widely adopted fusion strategies. SpectraDINO achieves state-of-the-art performance across most benchmarks, validating its effectiveness as a general-purpose backbone for spectral generalization. The code and weights for model variants are available at https://github.com/Yonsei-STL/SpectraDINO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces SpectraDINO, which adapts frozen DINOv2 ViT backbones to NIR, SWIR, and LWIR modalities via lightweight per-modality bottleneck adapters while preserving RGB priors. It employs a multi-stage teacher-student distillation protocol using cosine distillation, symmetric contrastive loss, patch-level alignment, and a novel neighborhood-structure-preservation loss to achieve cross-modal alignment. The approach is evaluated on multispectral object detection and semantic segmentation benchmarks, claiming state-of-the-art performance across most of them.

Significance. If the empirical results hold with proper controls, this provides a practical, parameter-efficient route to generalize RGB-pretrained vision foundation models to non-visible spectra without full retraining or catastrophic forgetting. The public code and weights release supports reproducibility and potential adoption in domains requiring robust perception under adverse conditions.

major comments (1)
  1. The central SOTA claim on NIR/SWIR/LWIR detection and segmentation benchmarks is load-bearing, yet the abstract supplies no quantitative numbers, baselines, error bars, or dataset details; if the experiments section does not provide these with statistical rigor, the claim cannot be assessed.
minor comments (2)
  1. The neighborhood-structure-preservation loss is described as novel but would benefit from an explicit equation and comparison to standard contrastive or structure-preserving losses to clarify its contribution; one plausible form is sketched after this report.
  2. Ablation studies isolating the effect of each distillation term and the adapter bottleneck dimension would help substantiate the design choices, even if not strictly required for the main claim.
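The abstract and the Figure 2 caption name the term ("neighborhood KL loss L_A") without giving its equation in the text available here. One plausible reading, offered strictly as an assumption, compares each sample's softmax-similarity distribution over a queue of teacher features across the two modalities:

# One plausible form of a neighborhood-structure-preservation loss; an
# assumption, since the paper's equation is not in the text we have.
import torch
import torch.nn.functional as F

def neighborhood_kl(student, teacher, queue, tau: float = 0.1):
    """KL between teacher-side and student-side neighborhood distributions.

    student, teacher: (B, D) embeddings of the same images in the two
    modalities; queue: (K, D) bank of past teacher features (Fig. 2's
    "teacher queue"). Each sample's neighborhood is its softmax-similarity
    distribution over the queue; matching them preserves local structure.
    """
    s = F.normalize(student, dim=-1)
    t = F.normalize(teacher, dim=-1)
    q = F.normalize(queue, dim=-1)
    p_teacher = F.softmax(t @ q.T / tau, dim=-1)          # target structure
    log_p_student = F.log_softmax(s @ q.T / tau, dim=-1)  # student structure
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")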

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful and constructive feedback. We address the single major comment below and have prepared revisions to improve the clarity and self-containment of our SOTA claims.

read point-by-point responses
  1. Referee: The central SOTA claim on NIR/SWIR/LWIR detection and segmentation benchmarks is load-bearing, yet the abstract supplies no quantitative numbers, baselines, error bars, or dataset details; if the experiments section does not provide these with statistical rigor, the claim cannot be assessed.

    Authors: We agree that the abstract would benefit from greater specificity to make the central claim immediately verifiable. The experiments section (Section 4) already contains the requested details: comprehensive tables reporting mAP for object detection and mIoU for semantic segmentation across multiple NIR, SWIR, and LWIR benchmarks, direct comparisons against baselines including the frozen DINOv2 backbone and prior adaptation methods, and results averaged over repeated runs with standard deviations to indicate statistical stability. To address the referee's concern directly, we will revise the abstract to incorporate key quantitative results, baseline references, and dataset identifiers while preserving its concise style. This change will strengthen the manuscript without requiring alterations to the experimental design or results. revision: yes

Circularity Check

0 steps flagged

No significant circularity: purely empirical method with external benchmarks

full rationale

The paper describes an engineering approach: lightweight per-modality adapters on a frozen DINOv2 backbone, trained via a multi-stage distillation protocol (cosine, contrastive, patch alignment, and neighborhood-structure-preservation losses) and evaluated on NIR/SWIR/LWIR detection and segmentation benchmarks. No equations, uniqueness theorems, or first-principles derivations appear in the provided text. Performance claims rest on reported experimental results against external datasets and baselines, not on quantities defined by the fitted components themselves. No self-citation chains, ansatz smuggling, or renaming of known results are load-bearing. The evidential chain is grounded in external benchmarks rather than in constructs the method itself defines.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The approach depends on several free hyperparameters for adapter size and loss balancing, plus the core domain assumption that RGB priors transfer via adapters; a new loss term is introduced without external validation.

free parameters (2)
  • adapter bottleneck dimension
    Size of the lightweight per-modality bottleneck adapters is a tunable hyperparameter required for the method to function.
  • loss weighting coefficients
    Relative weights among cosine distillation, contrastive, patch alignment, and neighborhood-structure losses must be chosen to balance the curriculum.
axioms (1)
  • domain assumption: the frozen DINOv2 RGB backbone contains transferable priors sufficient for spectral modalities when augmented by lightweight adapters
    The entire training protocol and claim of no catastrophic forgetting rest on this transferability premise stated in the abstract.
invented entities (1)
  • neighborhood-structure-preservation loss (no independent evidence)
    purpose: To enforce preservation of local feature neighborhoods during cross-modal distillation
    New loss term introduced as part of the multi-stage training protocol; no independent evidence of its necessity or generality is supplied.
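For concreteness, the ledger's two free parameters as a hypothetical configuration; the values shown are common defaults in adapter-tuning work, not taken from the paper:

# Hypothetical settings for the ledger's two free parameters.
HPARAMS = {
    "adapter_bottleneck_dim": 64,   # size of each per-modality adapter
    "loss_weights": {               # curriculum balance across the four terms
        "cosine_distill": 1.0,
        "symmetric_contrastive": 1.0,
        "patch_alignment": 1.0,
        "neighborhood_kl": 1.0,
    },
}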

pith-pipeline@v0.9.0 · 5568 in / 1419 out tokens · 35157 ms · 2026-05-09T16:07:03.916959+00:00 · methodology

