pith. machine review for the scientific record.

arxiv: 2605.10130 · v1 · submitted 2026-05-11 · 💻 cs.CV

Recognition: no theorem link

Thermal-Det: Language-Guided Cross-Modal Distillation for Open-Vocabulary Thermal Object Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords open-vocabulary detection · thermal object detection · cross-modal distillation · language-guided detection · synthetic thermal dataset · RGB-thermal pairs · modality fusion · thermal perception

The pith

Thermal-Det establishes open-vocabulary detection for thermal images by distilling language-guided supervision from an RGB teacher on synthetic thermal data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that open-vocabulary detectors, which recognize objects from text descriptions rather than fixed labels, can be made to work on thermal images despite their low texture and visual cues that differ from RGB. It does so by converting an existing large RGB grounding dataset into the thermal domain, filtering out RGB-only terms to create over one million training samples, and then training a thermal model that receives geometric and semantic cues from a frozen RGB teacher on paired but unlabeled images. A sympathetic reader would care because thermal cameras are already deployed in low-light and night settings where standard vision systems lose effectiveness, and language-driven detection would let users query for arbitrary objects without collecting new labeled thermal data. The resulting model is fully fine-tuned on thermal contrast patterns while keeping its language alignment intact, through joint detection, captioning, and distillation losses plus specialized alignment and fusion modules.

Core claim

Thermal-Det is the first large language model supervised open-vocabulary detector tailored for thermal images. It is enabled by a synthetic dataset of over one million thermally aligned samples obtained by converting GroundingCap-1M into the thermal domain and removing RGB-specific terms from the captions. The detector jointly optimizes detection, captioning, and cross-modal distillation objectives, letting a frozen RGB teacher supply geometric and semantic pseudo-supervision on paired but unlabeled RGB-thermal data. It adds a Thermal-Text Alignment Head for text calibration and a Modality-Fused Cross-Attention module for dual-modality reasoning. Unlike prior domain-adaptation approaches, it is fully fine-tuned to internalize thermal contrast patterns while preserving language alignment.
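The abstract names the three training objectives but gives no equation; one plausible form of the joint loss, with the weighting coefficients (flagged as unreported free parameters in the ledger below) written as λ terms, is:

```latex
% Assumed form only: the paper names detection, captioning, and
% distillation losses but does not publish the equation, so the
% weights \lambda are hypothetical free parameters.
\mathcal{L}_{\mathrm{total}} =
    \lambda_{\mathrm{det}}\,\mathcal{L}_{\mathrm{det}}
  + \lambda_{\mathrm{cap}}\,\mathcal{L}_{\mathrm{cap}}
  + \lambda_{\mathrm{dist}}\,\mathcal{L}_{\mathrm{dist}}
```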

What carries the argument

Cross-modal distillation pipeline that transfers open-vocabulary knowledge from a frozen RGB teacher to a thermal student via pseudo-supervision on paired unlabeled images and synthetic thermally aligned captions.
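A minimal sketch of how such a distillation step could look: a frozen RGB teacher produces pseudo-boxes, region features, and class logits on the RGB half of an unlabeled pair, and the thermal student is penalized for deviating on the thermal half. All names (`rgb_teacher`, `thermal_student`, the returned triple) are hypothetical; the paper does not publish this interface.

```python
import torch
import torch.nn.functional as F

def distillation_step(rgb_teacher, thermal_student, rgb_img, thermal_img):
    """One pseudo-supervision step on a paired but unlabeled RGB-thermal
    image pair. The teacher is frozen; only the student gets gradients.
    Both models are assumed to return (boxes, region_features, logits)."""
    with torch.no_grad():  # teacher supplies targets, never learns
        t_boxes, t_feats, t_logits = rgb_teacher(rgb_img)

    s_boxes, s_feats, s_logits = thermal_student(thermal_img)

    # Geometric pseudo-supervision: student boxes should match the
    # teacher's boxes, since the pair is pixel-aligned.
    geom_loss = F.l1_loss(s_boxes, t_boxes)

    # Semantic pseudo-supervision: align region features and class
    # distributions so open-vocabulary knowledge transfers.
    sem_loss = 1.0 - F.cosine_similarity(s_feats, t_feats, dim=-1).mean()
    kd_loss = F.kl_div(
        F.log_softmax(s_logits, dim=-1),
        F.softmax(t_logits, dim=-1),
        reduction="batchmean",
    )
    return geom_loss + sem_loss + kd_loss
```

The split into a geometric term on boxes and semantic terms on features and logits mirrors the "geometric and semantic pseudo-supervision" the abstract describes, but the particular losses are illustrative choices.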

If this is right

  • Open-vocabulary queries become usable for locating arbitrary objects in thermal footage without retraining the detector for each new class.
  • Large-scale training of thermal detectors no longer requires manual thermal annotations because existing RGB datasets can be converted and paired data can supply the missing supervision.
  • The detector fully adapts to thermal-specific contrast patterns while preserving its ability to align with language descriptions.
  • Consistent 2-4 percent AP improvements appear on standard thermal object-detection benchmarks relative to prior open-vocabulary methods.
  • Language-driven thermal perception is now possible as a practical starting point for applications that rely on heat imagery.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same conversion-plus-distillation recipe could be tried on other data-scarce modalities such as depth maps or event-camera streams whenever paired RGB data exists.
  • If the caption-filtering step works reliably, the method lowers the cost of building detectors for any visual domain that lacks large labeled corpora.
  • Zero-shot detection of novel objects described only in text might become feasible in thermal imagery once the language alignment is strong enough.
  • The performance edge could be tested further by replacing the frozen RGB teacher with a jointly trained multimodal teacher or by collecting a small set of real thermal captions for comparison.

Load-bearing premise

Converting RGB captions and bounding boxes into the thermal domain while filtering RGB-specific terms still produces semantically valid training examples whose boxes and descriptions match what thermal images actually show.
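The paper does not publish its filtering criteria; as a concreteness check on this premise, here is a deliberately naive sketch that strips a hand-picked list of RGB-only appearance words. The vocabulary and the token-level matching are illustrative assumptions, not the authors' pipeline.

```python
import re

# Hypothetical RGB-only vocabulary; the paper does not publish its
# filter list, so this set is an illustrative stand-in.
RGB_ONLY_TERMS = {
    "red", "green", "blue", "yellow", "orange", "purple", "pink",
    "brown", "colorful", "shiny", "glossy",
}

def filter_caption(caption: str) -> str:
    """Drop RGB-specific appearance words from a caption so the text
    stays valid for the thermal rendering of the same scene."""
    tokens = caption.split()
    kept = [t for t in tokens
            if re.sub(r"\W", "", t).lower() not in RGB_ONLY_TERMS]
    return " ".join(kept)

print(filter_caption("a man in a red jacket walking a brown dog"))
# -> "a man in a jacket walking a dog"
```

Whether a word-level filter of this kind suffices is exactly what the load-bearing premise asserts: the residual caption must still describe what the thermal image shows.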

What would settle it

Running the trained Thermal-Det on a held-out thermal benchmark that uses only human-verified thermal annotations, and finding that its average precision falls at or below that of the best RGB open-vocabulary detectors on the same data.

Figures

Figures reproduced from arXiv: 2605.10130 by Christopher Funk, Elim Schenck, Florence Yellin, Shuowen Hu, Vishal M. Patel, Yasiru Ranasinghe.

Figure 1: Our method preserves full open-vocabulary reasoning in the thermal domain, outperforming both RGB-trained open-vocabulary … [figure not reproduced here]
Figure 2: Thermal-Det combines a large language model (LLM) with a dual-stream RGB–thermal detector for zero-shot thermal object … [figure not reproduced here]
Figure 3: Thermal-Det zero-shot performance across various … [figure not reproduced here]
Original abstract

Existing open-vocabulary detectors focus on RGB images and fail to generalize to thermal imagery, where low texture and emissivity variations challenge RGB-based semantics. We present Thermal-Det, the first large language model (LLM) supervised open-vocabulary detector tailored for thermal images. To enable large-scale training, we develop a synthetic dataset by converting GroundingCap-1M into the thermal domain and filtering captions to remove RGB-specific terms, yielding over one million thermally aligned samples with bounding boxes, grounding texts, and detailed captions. Thermal-Det jointly optimizes detection, captioning, and cross-modal distillation objectives. A frozen RGB teacher provides geometric and semantic pseudo-supervision for paired but unlabeled RGB-thermal data, transferring open-vocabulary knowledge without manual annotation. The model further employs a Thermal-Text Alignment Head for text calibration and a Modality-Fused Cross-Attention module for dual-modality reasoning. Unlike prior domain-adaptation methods, the detector is fully fine-tuned to internalize thermal contrast patterns while preserving language alignment. Experiments on public benchmarks show consistent 2-4% AP gains over existing open-vocabulary detectors, establishing a strong foundation for scalable, language-driven thermal perception.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents Thermal-Det, an open-vocabulary object detector for thermal images that uses LLM supervision via cross-modal distillation from a frozen RGB teacher model. It constructs a synthetic training set of over 1M samples by converting GroundingCap-1M into the thermal domain and filtering RGB-specific caption terms, then jointly optimizes detection, captioning, and distillation losses with a Thermal-Text Alignment Head and Modality-Fused Cross-Attention module. The central empirical claim is consistent 2-4% AP gains over prior open-vocabulary detectors on public benchmarks.

Significance. If the synthetic dataset preserves semantically valid bounding boxes and captions under the domain shift and the reported gains prove robust, the approach would offer a scalable route to language-guided thermal perception without requiring large-scale manual thermal annotations. The distillation strategy and joint optimization objectives represent a reasonable extension of RGB open-vocabulary methods to the thermal domain.

major comments (2)
  1. [§3.1] Dataset Construction: The conversion of GroundingCap-1M and subsequent RGB-term filtering are presented as producing >1M thermally aligned samples with valid bounding boxes and captions, yet no quantitative validation (human agreement rates, thermal-specific IoU, or caption similarity metrics) is reported to confirm that transferred boxes align with actual thermal contrast regions and that filtered captions accurately describe emissivity patterns rather than RGB appearance.
  2. [§4] Experiments: The claim of consistent 2-4% AP gains is stated without specifying the exact baselines, the number of random seeds or runs used to compute the gains, error bars, statistical significance tests, or ablation studies that isolate the contribution of the synthetic dataset versus the distillation objectives and new modules.
minor comments (2)
  1. [Abstract] The abstract and §1 refer to 'public benchmarks' without naming the datasets (e.g., FLIR, KAIST, or others) or the evaluation protocol (open-vocabulary split details).
  2. [§3.2] Notation for the loss weighting coefficients (mentioned as free parameters) is introduced without an explicit equation or table showing their values or sensitivity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating the changes we will make in the revised manuscript.

Point-by-point responses
  1. Referee: [§3.1] Dataset Construction: The conversion of GroundingCap-1M and subsequent RGB-term filtering are presented as producing >1M thermally aligned samples with valid bounding boxes and captions, yet no quantitative validation (human agreement rates, thermal-specific IoU, or caption similarity metrics) is reported to confirm that transferred boxes align with actual thermal contrast regions and that filtered captions accurately describe emissivity patterns rather than RGB appearance.

    Authors: We agree that explicit quantitative validation of the synthetic dataset would strengthen the paper. The original manuscript describes the conversion pipeline from GroundingCap-1M and the RGB-term filtering process but does not report agreement rates or similarity metrics. In the revision we will add a dedicated validation subsection to §3.1 that includes: (i) human evaluation on a sampled subset of 500 images reporting inter-annotator agreement for bounding-box alignment with thermal contrast regions and caption accuracy with respect to emissivity patterns, and (ii) CLIP-based caption similarity scores between original RGB captions and the filtered thermal versions. We note that direct thermal-specific IoU is difficult because the source dataset lacks paired thermal ground truth; the human study will serve as the primary validation proxy. These additions will appear in the main text and supplementary material. revision: yes
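For concreteness, a minimal sketch of the CLIP-based similarity score the rebuttal proposes, using the Hugging Face transformers API. The checkpoint choice is an assumption; any public CLIP text encoder would serve.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

# A common public checkpoint, chosen here only for illustration.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def caption_similarity(original: str, filtered: str) -> float:
    """Cosine similarity between CLIP text embeddings of the original
    RGB caption and its thermally filtered version."""
    inputs = tokenizer([original, filtered], padding=True,
                       return_tensors="pt")
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return float(emb[0] @ emb[1])

score = caption_similarity(
    "a man in a red jacket walking a brown dog",
    "a man in a jacket walking a dog",
)
```

A high score would indicate that filtering preserved the caption's semantics; a low score would flag pairs where removing RGB terms gutted the description.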

  2. Referee: [§4] Experiments: The claim of consistent 2-4% AP gains is stated without specifying the exact baselines, the number of random seeds or runs used to compute the gains, error bars, statistical significance tests, or ablation studies that isolate the contribution of the synthetic dataset versus the distillation objectives and new modules.

    Authors: We acknowledge that the experimental section would benefit from greater reproducibility details and component-wise analysis. The reported gains compare against standard open-vocabulary RGB detectors (GroundingDINO, OWL-ViT) fine-tuned on thermal data, yet the manuscript omits seed counts, variance, and targeted ablations. In the revised version we will: (1) explicitly list all baselines and their exact configurations, (2) report mean AP ± standard deviation over five random seeds together with error bars in all tables, (3) include paired t-test p-values to assess statistical significance of the 2-4% gains, and (4) add a new ablation subsection (4.3) that isolates the synthetic dataset size, distillation loss weight, Thermal-Text Alignment Head, and Modality-Fused Cross-Attention module. Updated tables, figures, and statistical results will be provided in the main paper and appendix. revision: yes
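A minimal sketch of the promised seed-level reporting, with hypothetical AP values standing in for the numbers the revision would supply; scipy's paired t-test matches item (3) of the response.

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed AP values (five seeds per method); purely
# illustrative placeholders, not results from the paper.
ap_thermal_det = np.array([0.312, 0.318, 0.309, 0.315, 0.311])
ap_baseline    = np.array([0.285, 0.290, 0.283, 0.288, 0.286])

print(f"Thermal-Det: {ap_thermal_det.mean():.3f} "
      f"± {ap_thermal_det.std(ddof=1):.3f}")
print(f"Baseline:    {ap_baseline.mean():.3f} "
      f"± {ap_baseline.std(ddof=1):.3f}")

# Paired t-test across seeds; pairing assumes each seed shares the
# same data split between the two methods.
t_stat, p_value = stats.ttest_rel(ap_thermal_det, ap_baseline)
print(f"paired t-test: t={t_stat:.2f}, p={p_value:.4f}")
```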

Circularity Check

0 steps flagged

No significant circularity in Thermal-Det derivation chain

Full rationale

The paper's core method creates a synthetic training set by converting an external dataset (GroundingCap-1M) and applies cross-modal distillation from a frozen external RGB teacher; performance gains are reported from benchmark experiments rather than being forced by construction from fitted parameters or self-referential definitions. No self-definitional equations, fitted-input predictions, load-bearing self-citations, or ansatzes that reduce to the target result appear in the provided claims or abstract.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 2 invented entities

The central claim rests on the validity of the synthetic thermal data generation process and the assumption that RGB-derived pseudo-labels transfer meaningfully to thermal images. Several training hyperparameters for the joint optimization are left unspecified.

free parameters (1)
  • loss weighting coefficients
    Weights balancing detection, captioning, and distillation losses are required for joint optimization but not reported in the abstract.
axioms (2)
  • domain assumption Paired but unlabeled RGB-thermal images exist and can be used for pseudo-supervision without introducing harmful domain-shift artifacts.
    The cross-modal distillation step relies on this assumption to transfer open-vocabulary knowledge.
  • domain assumption Caption filtering successfully removes all RGB-specific terms while preserving thermal-relevant semantics.
    The synthetic dataset construction depends on this step to produce valid training samples.
invented entities (2)
  • Thermal-Text Alignment Head (no independent evidence)
    purpose: Calibrates thermal features to text embeddings
    New module introduced to improve language alignment in the thermal domain.
  • Modality-Fused Cross-Attention module (no independent evidence)
    purpose: Enables reasoning across RGB and thermal modalities
    New architectural component for dual-modality fusion; a speculative sketch of both modules follows this list.
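Neither module's internals appear in the abstract, so the following is one plausible reading of each one-line description. The dimensions, the residual fusion, and the learned logit scale are all assumptions, not the authors' design.

```python
import torch
import torch.nn as nn

class ThermalTextAlignmentHead(nn.Module):
    """Hypothetical reading of the TTAH: project thermal region features
    into the text-embedding space and score them against class prompts."""
    def __init__(self, vis_dim: int = 256, txt_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(vis_dim, txt_dim)
        self.logit_scale = nn.Parameter(torch.tensor(1.0))

    def forward(self, region_feats, text_embeds):
        v = nn.functional.normalize(self.proj(region_feats), dim=-1)
        t = nn.functional.normalize(text_embeds, dim=-1)
        return self.logit_scale * v @ t.T  # (regions, classes) scores

class ModalityFusedCrossAttention(nn.Module):
    """Hypothetical reading of the MFCA: thermal tokens attend to RGB
    tokens so the detector can reason over both modalities."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, thermal_tokens, rgb_tokens):
        fused, _ = self.attn(thermal_tokens, rgb_tokens, rgb_tokens)
        return self.norm(thermal_tokens + fused)  # residual fusion
```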

pith-pipeline@v0.9.0 · 5530 in / 1688 out tokens · 44838 ms · 2026-05-12T03:20:17.844917+00:00 · methodology

