SpectraDINO: Bridging the Spectral Gap in Vision Foundation Models via Lightweight Adapters
Pith reviewed 2026-05-09 16:07 UTC · model grok-4.3
The pith
SpectraDINO extends frozen DINOv2 backbones to multispectral modalities using per-modality adapters and staged distillation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SpectraDINO bridges the spectral gap by keeping a DINOv2 ViT backbone frozen and attaching small per-modality bottleneck adapters. A multi-stage teacher-student protocol trains the adapters with cosine distillation, a symmetric contrastive loss, patch-level alignment, and a neighborhood-structure-preservation loss, enabling cross-modal alignment while avoiding catastrophic forgetting of the original RGB representations. Evaluated on multispectral object detection and semantic segmentation across NIR, SWIR, and LWIR benchmarks with widely adopted fusion strategies, the adapted models are reported to achieve state-of-the-art results on most benchmarks.
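As a rough illustration of two of the loss terms named above, the sketch below computes a cosine distillation loss and a symmetric InfoNCE-style contrastive loss on toy embeddings. The function names, the temperature of 0.07, and the exact loss forms are assumptions chosen for illustration; the paper's own definitions may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distillation_loss(student, teacher):
    """Mean of (1 - cosine similarity) over paired student/teacher embeddings."""
    return sum(1.0 - cosine(s, t) for s, t in zip(student, teacher)) / len(student)

def _cross_entropy_diag(logits):
    """Cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(student, teacher, temperature=0.07):
    """Average of the student-to-teacher and teacher-to-student InfoNCE terms."""
    logits = [[cosine(s, t) / temperature for t in teacher] for s in student]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(transposed))

# Toy batch: the teacher sees RGB, the student sees the paired spectral image.
teacher = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # student matches the teacher
swapped = [[0.0, 1.0], [1.0, 0.0]]   # student pairs each sample with the wrong one
```

Both losses are near zero for the `aligned` student and large for the `swapped` one, which is the behavior a distillation objective of this kind relies on.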
What carries the argument
Lightweight per-modality bottleneck adapters inserted into a frozen DINOv2 backbone, trained via a multi-stage distillation curriculum that includes a neighborhood-structure-preservation loss.
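A minimal sketch of what a per-modality bottleneck adapter could look like, written in plain Python to stay dependency-free. The ReLU nonlinearity, the zero-initialized up-projection, the residual placement, and the class name `BottleneckAdapter` are assumptions drawn from common adapter designs, not details confirmed by the paper.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

class BottleneckAdapter:
    """Residual adapter computing x + W_up * relu(W_down * x) (assumed form)."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection (bottleneck x dim) with small random weights.
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(bottleneck)]
        # Up-projection (dim x bottleneck) starts at zero, so the adapter is
        # an identity map before training and leaves backbone features intact.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.w_down, x)]
        delta = matvec(self.w_up, hidden)
        return [xi + di for xi, di in zip(x, delta)]

# One adapter per modality; the frozen backbone weights are never modified.
adapters = {m: BottleneckAdapter(dim=8, bottleneck=2) for m in ("nir", "swir", "lwir")}
token = [0.5] * 8
out = adapters["swir"](token)
```

With the zero-initialized up-projection, `out` equals `token` at the start of training, so any departure from the RGB features has to be learned through the distillation losses.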
If this is right
- Multispectral perception pipelines can reuse large RGB foundation-model weights with only modest added parameters.
- The same backbone can serve as a general-purpose feature extractor for visible and beyond-visible spectra without separate large-scale pretraining.
- Adverse-condition robustness improves because complementary spectral channels become accessible through simple adapter attachment.
- Modality-specific fine-tuning becomes cheaper and faster, lowering the barrier to deploying vision models on new sensors.
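To put "modest added parameters" in perspective, the back-of-the-envelope count below assumes a ViT-B backbone (12 blocks, width 768, roughly 86M parameters), one adapter per block, and a bottleneck width of 64. The bottleneck width and adapter placement are hypothetical; the paper's actual values are not stated in this review.

```python
def bottleneck_adapter_params(dim, bottleneck):
    """Weights and biases of a down-projection plus an up-projection."""
    down = dim * bottleneck + bottleneck
    up = bottleneck * dim + dim
    return down + up

# ViT-B geometry: 12 transformer blocks, embedding width 768.
DIM, DEPTH = 768, 12
BACKBONE_PARAMS = 86_000_000  # approximate size of a ViT-B backbone
BOTTLENECK = 64               # hypothetical; not taken from the paper

per_adapter = bottleneck_adapter_params(DIM, BOTTLENECK)
per_modality = per_adapter * DEPTH  # assumes one adapter per block
overhead = per_modality / BACKBONE_PARAMS
print(f"{per_modality:,} adapter parameters per modality ({overhead:.2%} of the backbone)")
```

Under these assumptions each extra modality costs on the order of one percent of the backbone, which is what makes hosting several modalities on one frozen model plausible.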
Where Pith is reading between the lines
- The adapter-plus-distillation pattern could be tested on other non-RGB modalities such as radar or hyperspectral cubes to check how far the domain-gap closure generalizes.
- If the neighborhood-structure loss proves critical, similar structure-preserving terms might help when adapting foundation models across other large distribution shifts.
- The modular design suggests that a single frozen backbone could eventually host adapters for many sensor types simultaneously, enabling runtime switching between modalities with low overhead.
Load-bearing premise
The frozen RGB-pretrained DINOv2 model already holds priors that are close enough to spectral data that small adapters plus distillation can close the remaining gap without large drops in representation quality.
What would settle it
Training SpectraDINO on one set of spectral bands and then measuring whether its performance on an entirely unseen spectral band falls below that of a randomly initialized model trained only on the new band.
Vision Foundation Models (VFMs) pretrained on large-scale RGB data have demonstrated remarkable representation quality, yet their applicability to multispectral imaging spanning Near-Infrared (NIR), Short-Wave Infrared (SWIR), and Long-Wave Infrared (LWIR) remains largely unexplored. These spectral modalities offer complementary sensing capabilities critical for robust perception in adverse conditions, but present a fundamental domain gap relative to RGB-centric pretrained models. We present SpectraDINO, a multispectral VFM that bridges this spectral gap by extending DINOv2 ViT backbones to beyond-visible modalities through lightweight, per-modality bottleneck adapters, while preserving the rich representations of the frozen RGB backbone. We introduce a multi-stage teacher-student training protocol in which a frozen DINOv2 teacher guides a spectral student via cosine distillation, symmetric contrastive loss, patch-level alignment, and a novel neighborhood-structure-preservation loss. This staged curriculum enables strong cross-modal alignment without catastrophic forgetting of RGB priors. We evaluate SpectraDINO on multispectral object detection and semantic segmentation across challenging NIR, SWIR, and LWIR benchmarks using widely adopted fusion strategies. SpectraDINO achieves state-of-the-art performance across most benchmarks, validating its effectiveness as a general-purpose backbone for spectral generalization. The code and weights for model variants are available at https://github.com/Yonsei-STL/SpectraDINO.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SpectraDINO, which adapts frozen DINOv2 ViT backbones to NIR, SWIR, and LWIR modalities via lightweight per-modality bottleneck adapters while preserving RGB priors. It employs a multi-stage teacher-student distillation protocol using cosine distillation, symmetric contrastive loss, patch-level alignment, and a novel neighborhood-structure-preservation loss to achieve cross-modal alignment. The approach is evaluated on multispectral object detection and semantic segmentation benchmarks, claiming state-of-the-art performance across most of them.
Significance. If the empirical results hold with proper controls, this provides a practical, parameter-efficient route to generalize RGB-pretrained vision foundation models to non-visible spectra without full retraining or catastrophic forgetting. The public code and weights release supports reproducibility and potential adoption in domains requiring robust perception under adverse conditions.
major comments (1)
- The central SOTA claim on NIR/SWIR/LWIR detection and segmentation benchmarks is load-bearing, yet the abstract supplies no quantitative numbers, baselines, error bars, or dataset details; if the experiments section does not provide these with statistical rigor, the claim cannot be assessed.
minor comments (2)
- The neighborhood-structure-preservation loss is described as novel but would benefit from an explicit equation and comparison to standard contrastive or structure-preserving losses to clarify its contribution.
- Ablation studies isolating the effect of each distillation term and the adapter bottleneck dimension would help substantiate the design choices, even if not strictly required for the main claim.
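To make the first minor comment concrete, here is one generic structure-preserving term of the kind the referee alludes to: it matches the pairwise cosine-similarity matrices of student and teacher embeddings within a batch. This is a stand-in for illustration only and is not the paper's neighborhood-structure-preservation loss, whose equation is not available in this review; all function names are hypothetical.

```python
import math

def _normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def similarity_matrix(embeddings):
    """Pairwise cosine similarities within a batch."""
    unit = [_normalize(e) for e in embeddings]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]

def structure_preservation_loss(student, teacher):
    """Mean squared difference between student and teacher similarity matrices."""
    s_sim = similarity_matrix(student)
    t_sim = similarity_matrix(teacher)
    n = len(student)
    return sum((s_sim[i][j] - t_sim[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rotated = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]   # teacher rotated by 90 degrees
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]   # all samples mapped to one point
```

The `rotated` student incurs no penalty because its neighborhood relations are unchanged, whereas the `collapsed` student is penalized. A term like this constrains relative geometry only, which is why it would complement rather than replace a pointwise cosine distillation loss.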
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive feedback. We address the single major comment below and have prepared revisions to improve the clarity and self-containment of our SOTA claims.
- Referee: The central SOTA claim on NIR/SWIR/LWIR detection and segmentation benchmarks is load-bearing, yet the abstract supplies no quantitative numbers, baselines, error bars, or dataset details; if the experiments section does not provide these with statistical rigor, the claim cannot be assessed.
  Authors: We agree that the abstract would benefit from greater specificity to make the central claim immediately verifiable. The experiments section (Section 4) already contains the requested details: comprehensive tables reporting mAP for object detection and mIoU for semantic segmentation across multiple NIR, SWIR, and LWIR benchmarks, direct comparisons against baselines including the frozen DINOv2 backbone and prior adaptation methods, and results averaged over repeated runs with standard deviations to indicate statistical stability. To address the referee's concern directly, we will revise the abstract to incorporate key quantitative results, baseline references, and dataset identifiers while preserving its concise style. This change will strengthen the manuscript without requiring alterations to the experimental design or results.
  Revision: yes
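The "averaged over repeated runs with standard deviations" reporting mentioned in the response can be sketched as follows. The scores are made-up placeholders, not results from the paper, and `summarize_runs` is a hypothetical helper name.

```python
from statistics import mean, stdev

def summarize_runs(scores):
    """Return (mean, sample standard deviation) over repeated runs."""
    return mean(scores), stdev(scores)

# Placeholder scores for three random seeds; NOT results from the paper.
placeholder_miou = [0.60, 0.62, 0.61]
avg, spread = summarize_runs(placeholder_miou)
print(f"mIoU = {avg:.3f} ± {spread:.3f} over {len(placeholder_miou)} runs")
```

Reporting the spread alongside the mean lets a reader judge whether a gap between two methods exceeds seed-to-seed variation.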
Circularity Check
No significant circularity: purely empirical method with external benchmarks
full rationale
The paper describes an engineering approach: lightweight per-modality adapters on a frozen DINOv2 backbone, trained via a multi-stage distillation protocol (cosine, contrastive, patch alignment, and neighborhood-structure-preservation losses) and evaluated on NIR/SWIR/LWIR detection and segmentation benchmarks. No equations, uniqueness theorems, or first-principles derivations appear in the provided text. Performance claims rest on reported experimental results against external datasets and baselines, not on quantities defined by the fitted components themselves. No self-citation chains, ansatz smuggling, or renaming of known results are load-bearing. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- adapter bottleneck dimension
- loss weighting coefficients
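The loss weighting coefficients enter as a weighted sum over the four loss terms. The sketch below shows that combination with hypothetical names and toy values; setting a weight to zero is one assumed way a staged curriculum could switch a term off in a given stage, and the paper's actual schedule is not described in this review.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LossWeights:
    """Hypothetical weighting coefficients for the four loss terms."""
    cosine: float = 1.0
    contrastive: float = 1.0
    patch: float = 1.0
    neighborhood: float = 1.0

def total_loss(terms, weights):
    """Weighted sum of per-term losses, given as a dict keyed by term name."""
    return sum(getattr(weights, name) * value for name, value in terms.items())

# Toy per-term loss values, for illustration only.
terms = {"cosine": 0.5, "contrastive": 2.0, "patch": 0.25, "neighborhood": 0.125}
# Example stage: contrastive term down-weighted, neighborhood term disabled.
stage_weights = LossWeights(contrastive=0.5, neighborhood=0.0)
loss = total_loss(terms, stage_weights)
```

Each coefficient is a free parameter in the ledger's sense: it must be tuned, and ablating it is the natural way to test how much each term contributes.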
axioms (1)
- domain assumption: Frozen DINOv2 RGB backbone contains transferable priors sufficient for spectral modalities when augmented by lightweight adapters
invented entities (1)
- neighborhood-structure-preservation loss (no independent evidence)