FreqKD: Frequency-Decoupled Cross-Modal Knowledge Distillation for Infrared Object Detection

Abdalmalek Aburaddaha; Keval Thaker; Samir A. Rawashdeh; Venkatraman Narayanan

arxiv: 2606.11572 · v1 · pith:YPD4CEERnew · submitted 2026-06-10 · 💻 cs.CV

FreqKD: Frequency-Decoupled Cross-Modal Knowledge Distillation for Infrared Object Detection

Keval Thaker , Venkatraman Narayanan , Abdalmalek Aburaddaha , Samir A. Rawashdeh This is my paper

Pith reviewed 2026-06-27 10:51 UTC · model grok-4.3

classification 💻 cs.CV

keywords knowledge distillationinfrared object detectionfrequency decouplingcross-modal transfermultispectral pedestrian detectionKAIST datasetDINOv2

0 comments

The pith

Frequency-decoupled distillation improves RGB-to-IR object detection by aligning shared structure while tolerating texture differences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the gap between RGB and infrared features varies by spatial frequency, with low-frequency shape and layout aligning more closely than high-frequency texture and edges. It introduces FreqKD to apply strict MSE supervision only to the low-frequency band and a relaxed, down-weighted log-MSE loss to the high-frequency band. This asymmetric treatment yields 64.1 mAP50 on the KAIST multispectral pedestrian detection benchmark, a 2.4-point gain over a DINOv2 baseline. The same representation also improves results when transferred to the FLIR ADAS dataset, MFNet segmentation, and a ResNet-50 backbone.

Core claim

Spectral analysis of 500 paired RGB-IR samples reveals that high-frequency feature divergence exceeds low-frequency divergence by a factor of 2.4 on average across transformer layers. FreqKD exploits this by enforcing strict mean-squared error alignment on low-frequency components to preserve shared structural information and applying a 0.1-weighted log-MSE loss on high-frequency components to supply edge guidance without forcing alignment of modality-specific texture.

What carries the argument

Frequency-decoupled distillation that splits features into low- and high-frequency bands and applies asymmetric losses (strict MSE on low, relaxed weighted log-MSE on high) according to measured cross-modal consistency.

If this is right

Raises KAIST mAP50 from the DINOv2 baseline to 64.1, a 2.4-point absolute gain.
Transfers the learned representation to FLIR ADAS with a 2.1 mAP50 improvement.
Improves mean intersection-over-union by 1.85 on MFNet segmentation.
Delivers a 1.0 mAP50 gain when the student uses a ResNet-50 backbone instead of the original transformer.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same frequency split could be tested on other cross-modal pairs such as RGB-to-depth or RGB-to-event data where texture statistics also differ.
The fixed 0.1 weighting might be replaced by a small validation sweep or learned scalar without changing the core decoupling idea.
Because the method operates on intermediate features, it could be combined with existing KD techniques that already separate content and style.

Load-bearing premise

The extra divergence measured in high-frequency bands arises mainly from modality-specific characteristics that should be tolerated rather than aligned.

What would settle it

A controlled experiment in which uniform MSE applied across all frequencies on the same backbone and data produces equal or higher mAP50 than the frequency-decoupled version.

Figures

Figures reproduced from arXiv: 2606.11572 by Abdalmalek Aburaddaha, Keval Thaker, Samir A. Rawashdeh, Venkatraman Narayanan.

**Figure 1.** Figure 1: Overview of FreqKD. Stage 1 (left): a frozen RGB DINOv2 ViT-L teacher and an IR DINOv2 student with rank-64 LoRA adapters process registered RGB–IR pairs; only the LoRA parameters are trainable. Frequency decomposition module (centre): at each matched block l ∈ {7,15,19,21,23}, teacher and student features are channel-wise normalised, transformed by 2D FFT, and split at radial cut-off rc=0.50 into a share… view at source ↗

**Figure 2.** Figure 2: Qualitative results across the three transfer settings. Columns show ground truth, [PITH_FULL_IMAGE:figures/full_fig_p013_2.png] view at source ↗

read the original abstract

Transfer learning from large-scale RGB foundation models to infrared (IR) imagery through knowledge distillation (KD) remains challenging due to fundamental differences in image formation physics. We investigate the spectral structure of the RGB--IR modality gap and observe that feature divergence is not uniform across spatial frequencies: low-frequency components (shape, layout) show greater cross-modal alignment than high-frequency components (texture, fine edges), which reflect modality-specific characteristics. Based on this analysis, we propose FreqKD, a frequency-decoupled distillation framework that applies asymmetric supervision adapted to each band's cross-modal consistency. The method employs strict mean squared error (MSE) on the low-frequency band to preserve shared structural information and a relaxed log-MSE loss (weighted at 0.1) on the high-frequency band to provide edge guidance while tolerating texture differences. Spectral divergence analysis on 500 paired samples shows that high-frequency divergence exceeds low-frequency divergence by a factor of 2.4x on average across all analysed transformer layers. On KAIST multispectral pedestrian detection, FreqKD achieves 64.1 mAP50, improving 2.4 points over the DINOv2 baseline. The learned representation transfers across datasets (FLIR ADAS, +2.1 mAP50), tasks (MFNet segmentation, +1.85 mean intersection-over-union), and architectures (ResNet-50, +1.0 mAP50). Code is available at: https://anonymous.4open.science/r/freq_decoupled_kd-5E5A

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FreqKD adds a frequency split with asymmetric losses for RGB-to-IR distillation and reports modest gains, but the 0.1 weight and divergence interpretation rest on thin controls.

read the letter

The core idea is to measure that high-frequency features diverge more between RGB and IR than low-frequency ones, then apply strict MSE on the low band and a light log-MSE (weight 0.1) on the high band during distillation from DINOv2. This produces 64.1 mAP50 on KAIST, a 2.4-point lift over the baseline, plus smaller gains on FLIR (+2.1), MFNet segmentation (+1.85 mIoU), and even when swapping to ResNet-50 (+1.0). The transfer across datasets and tasks is the most concrete evidence that the learned features are useful.

The frequency-decoupled asymmetric loss is the actual novelty here; prior cross-modal KD work does not appear to use this exact band-specific relaxation. The spectral analysis on 500 paired samples is straightforward and the code release helps reproducibility.

The main weakness is that the load-bearing choices lack supporting checks. The 0.1 weight is fixed with no ablation, the 2.4x divergence figure has no error bars or comparison to same-modality pairs, and there is no test showing that forcing high-frequency alignment actually degrades detection. Without those, the causal story tying the divergence measurement to the relaxed loss remains under-determined. The results are consistent but preliminary.

This is for people already working on RGB-to-IR distillation or multispectral detection who want a simple tweak to try. It is not a broad theoretical advance. A serious editor should send it to review because the empirical results are positive on standard benchmarks, the method is easy to implement, and the gaps are addressable with additional experiments rather than fundamental flaws.

Referee Report

4 major / 1 minor

Summary. The paper proposes FreqKD, a frequency-decoupled knowledge distillation method for transferring RGB foundation models (DINOv2) to infrared object detection. Based on spectral analysis of 500 paired RGB-IR samples showing 2.4x higher feature divergence in high-frequency bands than low-frequency bands across transformer layers, it applies strict MSE to low-frequency components and a relaxed 0.1-weighted log-MSE to high-frequency components. This yields 64.1 mAP50 on KAIST multispectral pedestrian detection (+2.4 over baseline), with reported transfers to FLIR ADAS (+2.1 mAP50), MFNet segmentation (+1.85 mIoU), and ResNet-50 backbone (+1.0 mAP50).

Significance. If the frequency-specific asymmetry is shown to be robust and causally responsible for the gains, the method could offer a practical way to handle modality gaps in cross-modal KD without forcing alignment on texture differences. The cross-dataset, cross-task, and cross-architecture transfer results suggest potential generality beyond the KAIST benchmark.

major comments (4)

[Abstract / §3] Abstract and §3 (spectral analysis): the 2.4x high-frequency divergence claim is based on 500 paired samples but provides no description of the frequency decomposition procedure, no per-layer or per-sample variance, and no error bars; this measurement directly motivates the asymmetric loss design and must be reproducible.
[§4] §4 (loss formulation): the 0.1 weighting factor on the high-frequency log-MSE term is presented as fixed without ablation against uniform weighting, alternative factors, or learned weights; because the central claim attributes the 2.4 mAP50 gain to this specific relaxation, the lack of sensitivity analysis makes the result under-determined.
[§5] §5 (experiments): no control experiments test whether enforcing high-frequency alignment (e.g., equal MSE on both bands) actually harms downstream detection, nor whether the observed divergence gap persists in same-modality (RGB-RGB or IR-IR) pairs after controlling for general high-frequency noise.
[§5] §5 (KAIST results): the 64.1 mAP50 figure and +2.4 improvement are reported without standard deviations across runs or statistical significance tests, weakening the claim that the frequency-decoupled losses are responsible for the observed gain.

minor comments (1)

[Abstract] The code link is given as anonymous; a permanent repository with the exact spectral-analysis script would strengthen reproducibility.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the thoughtful comments on our manuscript. We address each of the major comments below and will incorporate revisions to improve clarity, reproducibility, and experimental rigor.

read point-by-point responses

Referee: [Abstract / §3] the 2.4x high-frequency divergence claim is based on 500 paired samples but provides no description of the frequency decomposition procedure, no per-layer or per-sample variance, and no error bars; this measurement directly motivates the asymmetric loss design and must be reproducible.

Authors: We agree that additional details are necessary for reproducibility. In the revised manuscript, we will provide a full description of the frequency decomposition procedure in §3, including the use of 2D Fourier transforms with specific low-pass and high-pass filters. We will also report per-layer and per-sample variance along with error bars on the 2.4x factor to quantify the consistency of the observation across the 500 samples. revision: yes
Referee: [§4] the 0.1 weighting factor on the high-frequency log-MSE term is presented as fixed without ablation against uniform weighting, alternative factors, or learned weights; because the central claim attributes the 2.4 mAP50 gain to this specific relaxation, the lack of sensitivity analysis makes the result under-determined.

Authors: We acknowledge this point and will add a comprehensive ablation study in the revised §5. This will include results for different weighting factors (0.01, 0.05, 0.1, 0.5, 1.0) on the high-frequency term, as well as uniform weighting across both bands. We will also explore a learned weight variant if feasible. These experiments will help substantiate the choice of 0.1 and its contribution to the performance gains. revision: yes
Referee: [§5] no control experiments test whether enforcing high-frequency alignment (e.g., equal MSE on both bands) actually harms downstream detection, nor whether the observed divergence gap persists in same-modality (RGB-RGB or IR-IR) pairs after controlling for general high-frequency noise.

Authors: We agree that such controls would provide stronger evidence for the frequency-specific approach. We commit to adding these experiments in the revision: (1) a variant with equal MSE on high-frequency components to measure any degradation in detection performance, and (2) divergence analysis on same-modality pairs to demonstrate that the observed gap is modality-specific. These will be presented in §5. revision: yes
Referee: [§5] the 64.1 mAP50 figure and +2.4 improvement are reported without standard deviations across runs or statistical significance tests, weakening the claim that the frequency-decoupled losses are responsible for the observed gain.

Authors: We will address this by conducting multiple runs (at least 5 seeds) and reporting mean performance with standard deviations for the KAIST results. We will also perform and report a statistical significance test (e.g., t-test) to support the significance of the +2.4 mAP50 improvement. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method defined independently of results

full rationale

The paper measures spectral divergence empirically on 500 paired samples (2.4x factor) to motivate an asymmetric loss design (strict MSE on low-frequency, log-MSE weighted at 0.1 on high-frequency), then reports downstream mAP gains on public benchmarks. No equation shows the 0.1 weight or performance numbers reducing to the divergence measurement by construction, nor any self-citation chain, uniqueness theorem, or ansatz smuggling that would make the central claim tautological. The derivation chain remains self-contained against external data.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on one explicit free parameter (the 0.1 high-frequency weight) and one domain assumption about frequency-specific cross-modal alignment; no new entities are postulated.

free parameters (1)

high-frequency loss weight = 0.1
The factor 0.1 is introduced to relax supervision on the high-frequency band.

axioms (1)

domain assumption Low-frequency components exhibit greater cross-modal alignment than high-frequency components.
This premise directly motivates the asymmetric loss design and is supported only by the internal spectral analysis on 500 samples.

pith-pipeline@v0.9.1-grok · 5830 in / 1336 out tokens · 34024 ms · 2026-06-27T10:51:13.571672+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

48 extracted references · 2 linked inside Pith

[1]

Multimodal object detection via probabilistic ensembling

Yi-Ting Chen, Jinghao Shi, Zelin Ye, Christoph Mertz, Deva Ramanan, and Shu Kong. Multimodal object detection via probabilistic ensembling. InECCV, 2022

2022
[2]

Cross-modality fusion transformer for multispectral object detection.arXiv:2111.00273, 2021

Qingyun Fang, Dapeng Han, and Zhaokui Wang. Cross-modality fusion transformer for multispectral object detection.arXiv:2111.00273, 2021

arXiv 2021
[3]

Dantas, Luigi Di Caro, and Dino Ienco

Roger Ferrod, Cássio F. Dantas, Luigi Di Caro, and Dino Ienco. Revisiting cross-modal knowledge distillation: A disentanglement approach for RGBD semantic segmentation. InECML-PKDD, 2025

2025
[4]

Domain-adversarial train- ing of neural networks

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial train- ing of neural networks. InJMLR, volume 17, pages 1–35, 2016

2016
[5]

Dual-stream spectral decoupling distillation for remote sensing object detection.IEEE Trans

Xiangyi Gao, Danpei Zhao, Bo Yuan, and Wentao Li. Dual-stream spectral decoupling distillation for remote sensing object detection.IEEE Trans. Geosci. Remote Sens., 2025

2025
[6]

ImageBind: One embedding space to bind them all

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Al- wala, Armand Joulin, and Ishan Misra. ImageBind: One embedding space to bind them all. InCVPR, 2023

2023
[7]

A kernel two-sample test.JMLR, 13:723–773, 2012

Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test.JMLR, 13:723–773, 2012

2012
[8]

MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes

Qishen Ha, Kohei Watanabe, Takumi Karasawa, Yoshitaka Ushiku, and Tatsuya Harada. MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. InIROS, 2017

2017
[9]

Distilling the knowledge in a neural network.arXiv:1503.02531, 2015

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.arXiv:1503.02531, 2015

Pith/arXiv arXiv 2015
[10]

LoRA: Low-rank adaptation of large language models

Edward J Hu, Shen Yelong, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InICLR, 2022

2022
[11]

C2KD: Bridg- ing the modality gap for cross-modal knowledge distillation

Fushuo Huo, Wenchao Xu, Jingcai Guo, Haozhao Wang, and Song Guo. C2KD: Bridg- ing the modality gap for cross-modal knowledge distillation. InCVPR, 2024

2024
[12]

KAIST multispectral pedestrian dataset.IEEE Trans

Soonmin Hwang, Jaesik Park, Namil Kim, Yukyung Choi, and In So Kweon. KAIST multispectral pedestrian dataset.IEEE Trans. Intell. Transp. Syst., 19(3), 2018

2018
[13]

LLVIP: A visible- infrared paired dataset for low-light vision

Xinyu Jia, Chuang Zhu, Minzhen Li, Wenqi Tang, and Wenli Zhou. LLVIP: A visible- infrared paired dataset for low-light vision. InICCV Workshops, 2021

2021
[14]

Contrast-guided cross-modal distillation for thermal object detection.arXiv:2511.01435, 2025

SiWoo Kim and JhongHyun An. Contrast-guided cross-modal distillation for thermal object detection.arXiv:2511.01435, 2025. 16THAKER, NARA Y ANAN, ABURADDAHA, RAW ASHDEH: FREQKD CROSS-MODAL KD

arXiv 2025
[15]

Seg- ment anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Seg- ment anything. InICCV, 2023

2023
[16]

Similarity of neural network representations revisited

Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. InICML, 2019

2019
[17]

Multispectral pedestrian detection via simultaneous detection and segmentation

Chengyang Li, Dan Song, Ruofeng Tong, and Min Tang. Multispectral pedestrian detection via simultaneous detection and segmentation. InBMVC, 2018

2018
[18]

Multi-teacher knowledge distillation with triplet loss for cross-modal object tracking

Yi Li, Lei Liu, Mengya Zhang, and Chenglong Li. Multi-teacher knowledge distillation with triplet loss for cross-modal object tracking. InInt. Conf. Brain Inspired Cognitive Systems (BICS), 2024

2024
[19]

Distilling cross- modal knowledge via feature disentanglement

Junhong Liu, Yuan Zhang, Tao Huang, Wenchao Xu, and Renyu Yang. Distilling cross- modal knowledge via feature disentanglement. InAAAI, 2026

2026
[20]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InICLR, 2019

2019
[21]

Guerrero Pena, Masih Aminbeidokhti, Thomas Dubail, Eric Granger, and Marco Pedersoli

Heitor Rapela Medeiros, Fidel A. Guerrero Pena, Masih Aminbeidokhti, Thomas Dubail, Eric Granger, and Marco Pedersoli. HalluciDet: Hallucinating RGB modal- ity for person detection through privileged information. InWACV, 2024

2024
[22]

DINOv2: Learning robust visual features without supervision.Transactions on Machine Learning Research, 2024

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision.Transactions on Machine Learning Research, 2024

2024
[23]

How do vision transformers work? InICLR, 2022

Namuk Park and Songkuk Kim. How do vision transformers work? InICLR, 2022

2022
[24]

Frequency attention for knowledge distillation

Cuong Pham, Van-Anh Nguyen, Trung Le, Dinh Phung, Gustavo Carneiro, and Thanh- Toan Do. Frequency attention for knowledge distillation. InWACV, 2024

2024
[25]

FcaNet: Frequency channel attention networks

Zequn Qin, Pengyi Zhang, Fei Wu, and Xi Li. FcaNet: Frequency channel attention networks. InICCV, 2021

2021
[26]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InICML, 2021

2021
[27]

PHI-S: Distribution balancing for label-free multi-teacher distillation

Mike Ranzinger, Jon Barker, Greg Heinrich, Pavlo Molchanov, Bryan Catanzaro, and Andrew Tao. PHI-S: Distribution balancing for label-free multi-teacher distillation. arXiv:2410.01680, 2024

arXiv 2024
[28]

AM-RADIO: Ag- glomerative vision foundation model reduce all domains into one

Mike Ranzinger, Greg Heinrich, Jan Kautz, and Pavlo Molchanov. AM-RADIO: Ag- glomerative vision foundation model reduce all domains into one. InCVPR, pages 12490–12500, 2024

2024
[29]

Global filter networks for image classification

Yongming Rao, Wenliang Zhao, Zheng Zhu, Jiwen Lu, and Jie Zhou. Global filter networks for image classification. InNeurIPS, 2021. THAKER, NARA Y ANAN, ABURADDAHA, RAW ASHDEH: FREQKD CROSS-MODAL KD17

2021
[30]

SAM 2: Segment anything in images and videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment anything in images and videos. InICLR, 2025

2025
[31]

Faster R-CNN: Towards real-time object detection with region proposal networks

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. InNeurIPS, 2015

2015
[32]

FitNets: Hints for thin deep nets

Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. InICLR, 2015

2015
[33]

Return of frustratingly easy domain adaptation

Baochen Sun and Kate Saenko. Return of frustratingly easy domain adaptation. In AAAI, 2016

2016
[34]

CRC Press, 2016

Glenn J Tattersall.Infrared Thermography: A Complete Laboratory and Field Manual. CRC Press, 2016

2016
[35]

FLIR thermal dataset for algorithm training.https://oem.flir

Teledyne FLIR. FLIR thermal dataset for algorithm training.https://oem.flir. com/solutions/automotive/adas-dataset-form/, 2018

2018
[36]

SigLIP 2: Multilingual vision-language encoders with im- proved semantic understanding, localization, and dense features.arXiv:2502.14786, 2025

Michael Tschannen et al. SigLIP 2: Multilingual vision-language encoders with im- proved semantic understanding, localization, and dense features.arXiv:2502.14786, 2025

Pith/arXiv arXiv 2025
[37]

SAMamba: Adaptive state space modeling with hierarchical vision for infrared small target detection.Information Fusion, 124, 2025

Wenhao Xu, Shuchen Zheng, Changwei Wang, Zherui Zhang, Chuan Ren, Rongtao Xu, and Shibiao Xu. SAMamba: Adaptive state space modeling with hierarchical vision for infrared small target detection.Information Fusion, 124, 2025

2025
[38]

DistillMatch: Leveraging knowledge distillation from vision foundation model for multimodal image matching.arXiv:2509.16017, 2025

Meng Yang, Fan Fan, Zizhuo Li, Songchu Deng, Yong Ma, and Jiayi Ma. DistillMatch: Leveraging knowledge distillation from vision foundation model for multimodal image matching.arXiv:2509.16017, 2025

arXiv 2025
[39]

Focal and global knowledge distillation for detectors

Zhendong Yang, Zhe Li, Xiaohu Jiang, Yuan Gong, Zehuan Yuan, Danpei Zhao, and Chun Yuan. Focal and global knowledge distillation for detectors. InCVPR, 2022

2022
[40]

Masked generative distillation

Zhendong Yang, Zhe Li, Mingqi Shao, Dachuan Shi, Zehuan Yuan, and Chun Yuan. Masked generative distillation. InECCV, 2022

2022
[41]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InICCV, 2023

2023
[42]

DINO: DETR with improved denoising anchor boxes for end-to- end object detection

Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. DINO: DETR with improved denoising anchor boxes for end-to- end object detection. InICLR, 2023

2023
[43]

Wavelet knowledge distillation: Towards efficient image-to-image translation

Linfeng Zhang, Xin Chen, Xiaobing Tu, Pengfei Wan, Ning Xu, and Kaisheng Ma. Wavelet knowledge distillation: Towards efficient image-to-image translation. In CVPR, 2022

2022
[44]

Efficient RGB-T tracking via cross-modality distillation

Tianlu Zhang, Hongyuan Guo, Qiang Jiao, Qiang Zhang, and Jungong Han. Efficient RGB-T tracking via cross-modality distillation. InCVPR, pages 5404–5413, 2023

2023
[45]

SS- DC: Spatial-spectral decoupling and coupling across visible-infrared gap for domain adaptive object detection.arXiv:2507.12017, 2025

Xiwei Zhang, Chunjin Yang, Yiming Xiao, Runtong Zhang, and Fanman Meng. SS- DC: Spatial-spectral decoupling and coupling across visible-infrared gap for domain adaptive object detection.arXiv:2507.12017, 2025. 18THAKER, NARA Y ANAN, ABURADDAHA, RAW ASHDEH: FREQKD CROSS-MODAL KD

arXiv 2025
[46]

FreeKD: Knowledge distillation via semantic frequency prompt

Yuan Zhang, Tao Huang, Jiaming Liu, Tao Jiang, Kuan Cheng, and Shanghang Zhang. FreeKD: Knowledge distillation via semantic frequency prompt. InCVPR, 2024

2024
[47]

Unveiling the potential of segment anything model 2 for RGB-thermal semantic segmentation with language guidance

Jiayi Zhao, Fei Teng, Kai Luo, Guoqiang Zhao, Zhiyong Li, Xu Zheng, and Kailun Yang. Unveiling the potential of segment anything model 2 for RGB-thermal semantic segmentation with language guidance. InIROS, 2025

2025
[48]

Improving multispectral pedestrian detection by addressing modality imbalance problems

Kailai Zhou, Linsen Chen, and Xun Cao. Improving multispectral pedestrian detection by addressing modality imbalance problems. InECCV, 2020

2020

[1] [1]

Multimodal object detection via probabilistic ensembling

Yi-Ting Chen, Jinghao Shi, Zelin Ye, Christoph Mertz, Deva Ramanan, and Shu Kong. Multimodal object detection via probabilistic ensembling. InECCV, 2022

2022

[2] [2]

Cross-modality fusion transformer for multispectral object detection.arXiv:2111.00273, 2021

Qingyun Fang, Dapeng Han, and Zhaokui Wang. Cross-modality fusion transformer for multispectral object detection.arXiv:2111.00273, 2021

arXiv 2021

[3] [3]

Dantas, Luigi Di Caro, and Dino Ienco

Roger Ferrod, Cássio F. Dantas, Luigi Di Caro, and Dino Ienco. Revisiting cross-modal knowledge distillation: A disentanglement approach for RGBD semantic segmentation. InECML-PKDD, 2025

2025

[4] [4]

Domain-adversarial train- ing of neural networks

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial train- ing of neural networks. InJMLR, volume 17, pages 1–35, 2016

2016

[5] [5]

Dual-stream spectral decoupling distillation for remote sensing object detection.IEEE Trans

Xiangyi Gao, Danpei Zhao, Bo Yuan, and Wentao Li. Dual-stream spectral decoupling distillation for remote sensing object detection.IEEE Trans. Geosci. Remote Sens., 2025

2025

[6] [6]

ImageBind: One embedding space to bind them all

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Al- wala, Armand Joulin, and Ishan Misra. ImageBind: One embedding space to bind them all. InCVPR, 2023

2023

[7] [7]

A kernel two-sample test.JMLR, 13:723–773, 2012

Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test.JMLR, 13:723–773, 2012

2012

[8] [8]

MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes

Qishen Ha, Kohei Watanabe, Takumi Karasawa, Yoshitaka Ushiku, and Tatsuya Harada. MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. InIROS, 2017

2017

[9] [9]

Distilling the knowledge in a neural network.arXiv:1503.02531, 2015

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.arXiv:1503.02531, 2015

Pith/arXiv arXiv 2015

[10] [10]

LoRA: Low-rank adaptation of large language models

Edward J Hu, Shen Yelong, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InICLR, 2022

2022

[11] [11]

C2KD: Bridg- ing the modality gap for cross-modal knowledge distillation

Fushuo Huo, Wenchao Xu, Jingcai Guo, Haozhao Wang, and Song Guo. C2KD: Bridg- ing the modality gap for cross-modal knowledge distillation. InCVPR, 2024

2024

[12] [12]

KAIST multispectral pedestrian dataset.IEEE Trans

Soonmin Hwang, Jaesik Park, Namil Kim, Yukyung Choi, and In So Kweon. KAIST multispectral pedestrian dataset.IEEE Trans. Intell. Transp. Syst., 19(3), 2018

2018

[13] [13]

LLVIP: A visible- infrared paired dataset for low-light vision

Xinyu Jia, Chuang Zhu, Minzhen Li, Wenqi Tang, and Wenli Zhou. LLVIP: A visible- infrared paired dataset for low-light vision. InICCV Workshops, 2021

2021

[14] [14]

Contrast-guided cross-modal distillation for thermal object detection.arXiv:2511.01435, 2025

SiWoo Kim and JhongHyun An. Contrast-guided cross-modal distillation for thermal object detection.arXiv:2511.01435, 2025. 16THAKER, NARA Y ANAN, ABURADDAHA, RAW ASHDEH: FREQKD CROSS-MODAL KD

arXiv 2025

[15] [15]

Seg- ment anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Seg- ment anything. InICCV, 2023

2023

[16] [16]

Similarity of neural network representations revisited

Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. InICML, 2019

2019

[17] [17]

Multispectral pedestrian detection via simultaneous detection and segmentation

Chengyang Li, Dan Song, Ruofeng Tong, and Min Tang. Multispectral pedestrian detection via simultaneous detection and segmentation. InBMVC, 2018

2018

[18] [18]

Multi-teacher knowledge distillation with triplet loss for cross-modal object tracking

Yi Li, Lei Liu, Mengya Zhang, and Chenglong Li. Multi-teacher knowledge distillation with triplet loss for cross-modal object tracking. InInt. Conf. Brain Inspired Cognitive Systems (BICS), 2024

2024

[19] [19]

Distilling cross- modal knowledge via feature disentanglement

Junhong Liu, Yuan Zhang, Tao Huang, Wenchao Xu, and Renyu Yang. Distilling cross- modal knowledge via feature disentanglement. InAAAI, 2026

2026

[20] [20]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InICLR, 2019

2019

[21] [21]

Guerrero Pena, Masih Aminbeidokhti, Thomas Dubail, Eric Granger, and Marco Pedersoli

Heitor Rapela Medeiros, Fidel A. Guerrero Pena, Masih Aminbeidokhti, Thomas Dubail, Eric Granger, and Marco Pedersoli. HalluciDet: Hallucinating RGB modal- ity for person detection through privileged information. InWACV, 2024

2024

[22] [22]

DINOv2: Learning robust visual features without supervision.Transactions on Machine Learning Research, 2024

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision.Transactions on Machine Learning Research, 2024

2024

[23] [23]

How do vision transformers work? InICLR, 2022

Namuk Park and Songkuk Kim. How do vision transformers work? InICLR, 2022

2022

[24] [24]

Frequency attention for knowledge distillation

Cuong Pham, Van-Anh Nguyen, Trung Le, Dinh Phung, Gustavo Carneiro, and Thanh- Toan Do. Frequency attention for knowledge distillation. InWACV, 2024

2024

[25] [25]

FcaNet: Frequency channel attention networks

Zequn Qin, Pengyi Zhang, Fei Wu, and Xi Li. FcaNet: Frequency channel attention networks. InICCV, 2021

2021

[26] [26]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InICML, 2021

2021

[27] [27]

PHI-S: Distribution balancing for label-free multi-teacher distillation

Mike Ranzinger, Jon Barker, Greg Heinrich, Pavlo Molchanov, Bryan Catanzaro, and Andrew Tao. PHI-S: Distribution balancing for label-free multi-teacher distillation. arXiv:2410.01680, 2024

arXiv 2024

[28] [28]

AM-RADIO: Ag- glomerative vision foundation model reduce all domains into one

Mike Ranzinger, Greg Heinrich, Jan Kautz, and Pavlo Molchanov. AM-RADIO: Ag- glomerative vision foundation model reduce all domains into one. InCVPR, pages 12490–12500, 2024

2024

[29] [29]

Global filter networks for image classification

Yongming Rao, Wenliang Zhao, Zheng Zhu, Jiwen Lu, and Jie Zhou. Global filter networks for image classification. InNeurIPS, 2021. THAKER, NARA Y ANAN, ABURADDAHA, RAW ASHDEH: FREQKD CROSS-MODAL KD17

2021

[30] [30]

SAM 2: Segment anything in images and videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment anything in images and videos. InICLR, 2025

2025

[31] [31]

Faster R-CNN: Towards real-time object detection with region proposal networks

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. InNeurIPS, 2015

2015

[32] [32]

FitNets: Hints for thin deep nets

Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. InICLR, 2015

2015

[33] [33]

Return of frustratingly easy domain adaptation

Baochen Sun and Kate Saenko. Return of frustratingly easy domain adaptation. In AAAI, 2016

2016

[34] [34]

CRC Press, 2016

Glenn J Tattersall.Infrared Thermography: A Complete Laboratory and Field Manual. CRC Press, 2016

2016

[35] [35]

FLIR thermal dataset for algorithm training.https://oem.flir

Teledyne FLIR. FLIR thermal dataset for algorithm training.https://oem.flir. com/solutions/automotive/adas-dataset-form/, 2018

2018

[36] [36]

SigLIP 2: Multilingual vision-language encoders with im- proved semantic understanding, localization, and dense features.arXiv:2502.14786, 2025

Michael Tschannen et al. SigLIP 2: Multilingual vision-language encoders with im- proved semantic understanding, localization, and dense features.arXiv:2502.14786, 2025

Pith/arXiv arXiv 2025

[37] [37]

SAMamba: Adaptive state space modeling with hierarchical vision for infrared small target detection.Information Fusion, 124, 2025

Wenhao Xu, Shuchen Zheng, Changwei Wang, Zherui Zhang, Chuan Ren, Rongtao Xu, and Shibiao Xu. SAMamba: Adaptive state space modeling with hierarchical vision for infrared small target detection.Information Fusion, 124, 2025

2025

[38] [38]

DistillMatch: Leveraging knowledge distillation from vision foundation model for multimodal image matching.arXiv:2509.16017, 2025

Meng Yang, Fan Fan, Zizhuo Li, Songchu Deng, Yong Ma, and Jiayi Ma. DistillMatch: Leveraging knowledge distillation from vision foundation model for multimodal image matching.arXiv:2509.16017, 2025

arXiv 2025

[39] [39]

Focal and global knowledge distillation for detectors

Zhendong Yang, Zhe Li, Xiaohu Jiang, Yuan Gong, Zehuan Yuan, Danpei Zhao, and Chun Yuan. Focal and global knowledge distillation for detectors. InCVPR, 2022

2022

[40] [40]

Masked generative distillation

Zhendong Yang, Zhe Li, Mingqi Shao, Dachuan Shi, Zehuan Yuan, and Chun Yuan. Masked generative distillation. InECCV, 2022

2022

[41] [41]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InICCV, 2023

2023

[42] [42]

DINO: DETR with improved denoising anchor boxes for end-to- end object detection

Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. DINO: DETR with improved denoising anchor boxes for end-to- end object detection. InICLR, 2023

2023

[43] [43]

Wavelet knowledge distillation: Towards efficient image-to-image translation

Linfeng Zhang, Xin Chen, Xiaobing Tu, Pengfei Wan, Ning Xu, and Kaisheng Ma. Wavelet knowledge distillation: Towards efficient image-to-image translation. In CVPR, 2022

2022

[44] [44]

Efficient RGB-T tracking via cross-modality distillation

Tianlu Zhang, Hongyuan Guo, Qiang Jiao, Qiang Zhang, and Jungong Han. Efficient RGB-T tracking via cross-modality distillation. InCVPR, pages 5404–5413, 2023

2023

[45] [45]

SS- DC: Spatial-spectral decoupling and coupling across visible-infrared gap for domain adaptive object detection.arXiv:2507.12017, 2025

Xiwei Zhang, Chunjin Yang, Yiming Xiao, Runtong Zhang, and Fanman Meng. SS- DC: Spatial-spectral decoupling and coupling across visible-infrared gap for domain adaptive object detection.arXiv:2507.12017, 2025. 18THAKER, NARA Y ANAN, ABURADDAHA, RAW ASHDEH: FREQKD CROSS-MODAL KD

arXiv 2025

[46] [46]

FreeKD: Knowledge distillation via semantic frequency prompt

Yuan Zhang, Tao Huang, Jiaming Liu, Tao Jiang, Kuan Cheng, and Shanghang Zhang. FreeKD: Knowledge distillation via semantic frequency prompt. InCVPR, 2024

2024

[47] [47]

Unveiling the potential of segment anything model 2 for RGB-thermal semantic segmentation with language guidance

Jiayi Zhao, Fei Teng, Kai Luo, Guoqiang Zhao, Zhiyong Li, Xu Zheng, and Kailun Yang. Unveiling the potential of segment anything model 2 for RGB-thermal semantic segmentation with language guidance. InIROS, 2025

2025

[48] [48]

Improving multispectral pedestrian detection by addressing modality imbalance problems

Kailai Zhou, Linsen Chen, and Xun Cao. Improving multispectral pedestrian detection by addressing modality imbalance problems. InECCV, 2020

2020