arxiv: 2606.00673 · v1 · pith:VFMDBEOCnew · submitted 2026-05-30 · 💻 cs.CV

T-CLIP: Enabling Thermal Perception for Contrastive Language-Image Pretraining

Pith reviewed 2026-06-28 18:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords thermal imaging · CLIP adaptation · cross-modal retrieval · LoRA · thermal captioning · vision-language models · IR-Cap dataset

The pith

Decoupled dual-LoRA adaptation aligns thermal images with text by separating scene context from object heat signatures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that CLIP cannot match thermal images to text descriptions because global scene context and object-level heat signatures interfere when forced into one embedding space. It builds the IR-Cap dataset of physics-aware captions that supply both broad scene and fine-grained object descriptions, then applies T-CLIP, a framework with two independent low-rank adaptations that handle the two levels separately. This produces better results on cross-modal retrieval across three thermal benchmarks and supports text-guided thermal image generation. A reader would care because thermal cameras function when visible light fails, so language models that understand them could support tasks such as night search or weather-independent monitoring.

Core claim

T-CLIP is a decoupled dual-LoRA framework that independently adapts CLIP for scene-level and object-level thermal understanding, paired with the IR-Cap dataset providing complementary global and fine-grained thermal descriptions, resulting in consistent improvements over baselines across three thermal benchmarks in cross-modal retrieval.

What carries the argument

The decoupled dual-LoRA framework that adapts CLIP separately for global scene context and object-level heat signatures to resolve their representational conflict.
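
As a rough illustration of what "decoupled dual-LoRA" can mean in practice, the sketch below attaches two independent low-rank adapters to one frozen weight matrix. It is a minimal toy in pure Python, not the authors' implementation: the class name `LoRALinear`, the rank, the scaling `alpha / rank`, the toy matrix `W`, and the choice of a single linear layer are all assumptions made for illustration. The initialization (small random `A`, zero `B`) follows the usual LoRA convention, so each branch starts out identical to the frozen backbone.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update scale * (B @ A)."""

    def __init__(self, W, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(W), len(W[0])
        self.W = W  # shared, frozen backbone weight (never updated here)
        # Usual LoRA init: A small random, B zero, so the update starts at 0.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)]
                  for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]


# Hypothetical toy backbone weight (2 outputs, 3 inputs).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
global_branch = LoRALinear(W, rank=2, seed=1)  # scene-level adapter
fine_branch = LoRALinear(W, rank=2, seed=2)    # object-level adapter
```

Because each branch owns its own `A` and `B`, updating one adapter leaves the other untouched, which is the property the decoupling argument relies on.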

If this is right

  • Consistent gains in cross-modal retrieval performance on three thermal benchmarks.
  • Direct applicability to text-conditioned thermal image generation.
  • Fills the gap in captioned thermal data through the IR-Cap pipeline.
  • Enables standard models to reason about thermal phenomena via targeted separate adaptations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation tactic might help other imaging domains where wide context and local detail clash, such as medical or satellite data.
  • Adding more than two independent adapters could capture additional thermal scales like material properties.
  • Deployment on edge devices could allow language-based control of thermal sensors in low-visibility environments.

Load-bearing premise

Global scene context and object-level heat signatures fundamentally conflict when learned together in a single embedding space.

What would settle it

A single joint LoRA trained on the same IR-Cap data matching or exceeding the dual-LoRA retrieval scores on the three benchmarks would undermine the need for separate adaptations.
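
The comparison described above needs a retrieval score computed the same way for both models. The page does not say which metric the paper reports, so the sketch below assumes Recall@K, a common choice for cross-modal retrieval, and assumes that query `i` has exactly one correct match at index `i`. The function name `recall_at_k` and the toy similarity matrix are hypothetical.

```python
def recall_at_k(sim, k):
    """Fraction of queries whose true match (same index) is in the top k.

    sim[i][j] is the similarity between query i and candidate j.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Toy text-to-image similarity matrix (3 captions x 3 thermal images).
toy_sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.7, 0.6],
]
r_at_1 = recall_at_k(toy_sim, 1)
r_at_2 = recall_at_k(toy_sim, 2)
```

Running the same function on similarity matrices from a joint-LoRA model and from the dual-LoRA model, on identical test splits, would give the head-to-head numbers this test calls for.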

Figures

Figures reproduced from arXiv: 2606.00673 by Ayush Maheshwari, Brejesh Lall, Prerana Mukherjee, Tayeba Qazi.

Figure 1: Mean image-text cosine similarity of matched thermal image-caption pairs on the KAIST test…
Figure 2: IR-Cap captioning pipeline and dataset. (Top): The IR-Cap captioning pipeline leverages paired…
Figure 3: Attribute coverage of IR-Cap captions. Values indicate the percentage of captions within each…
Figure 4: Feature geometry of independently trained global and fine-grained LoRA branches on…
Figure 5: T-CLIP dual LoRA training pipeline. Two independent LoRA modules on a frozen CLIP back…
Figure 6: T-CLIP inference. Features from the global and fine-grained LoRA branches are combined via a…
Figure 7: Fusion weight α ablation (KAIST (Hwang et al., 2015)). Peak performance at α = 0.8 confirms the complementary contribution of global and fine-grained branches, with steeper degradation toward α = 0.0 reflecting the dominance of global thermal context.
Figure 8: T-CLIP as a plug-and-play replacement for the text encoder in SDXL for thermal image genera…
Figure 9: Instruction prompts for IR-Cap caption generation pipeline. Instruction Prompt 1 generates…
Figure 10: FID (lower is better) and CLIP Score (higher is better) for zero-shot SDXL vs T-CLIP + SDXL…
Figure 11: Generated thermal image samples using captions from KAIST (Hwang et al., 2015), FLIR (FLIR…
Figure 12: T-CLIP + SDXL performance under challenging conditions from FMB (Liu et al., 2023) dataset.
Figure 13: T-CLIP training and inference pseudocode. The global LoRA (…
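
The captions for Figures 6 and 7 indicate that the two branches are combined at inference with a fusion weight α, with α = 0.8 performing best on KAIST and α = 0.0 performing worst. The exact fusion rule is cut off in the caption, so the sketch below assumes the simplest reading: a convex combination of L2-normalized features in which α weights the global branch, followed by renormalization. The function names and the toy vectors are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale v to unit length; a zero vector has no direction to keep."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def fuse(global_feat, fine_feat, alpha=0.8):
    """Assumed rule: alpha * global + (1 - alpha) * fine, then renormalize."""
    g = l2_normalize(global_feat)
    f = l2_normalize(fine_feat)
    mixed = [alpha * a + (1.0 - alpha) * b for a, b in zip(g, f)]
    return l2_normalize(mixed)


# Toy 2-D features standing in for the two branch embeddings.
fused = fuse([2.0, 0.0], [0.0, 3.0], alpha=0.8)
```

Under this assumption α = 1.0 returns the global feature alone and α = 0.0 the fine-grained feature alone, so sweeping α between the two reproduces the kind of ablation the Figure 7 caption describes.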
Original abstract

Thermal imaging offers a powerful alternative to visible-spectrum vision under challenging conditions such as low illumination and adverse weather, yet foundational vision-language models like CLIP fail to align thermal images with textual descriptions due to a fundamental thermal perception gap. We identify three major challenges: the lack of captioned thermal datasets, the inability of standard LLMs to reason about thermal phenomena, and a key representational challenge in thermal imaging where global scene context and object-level heat signatures conflict when learned together in a single embedding space. To address these, we introduce IR-Cap, the first physics-aware thermal captioning pipeline and dataset providing complementary global and fine-grained thermal descriptions across three public benchmarks, and T-CLIP, a decoupled dual-LoRA framework that independently adapts CLIP for scene-level and object-level thermal understanding. T-CLIP achieves consistent improvements over all baselines across three thermal benchmarks in cross-modal retrieval, and we provide an exploratory demonstration of its applicability to text-conditioned thermal image generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper identifies three challenges for thermal vision-language alignment (lack of captioned datasets, LLM reasoning limitations on thermal phenomena, and a representational conflict between global scene context and object-level heat signatures in a single embedding space). It introduces the IR-Cap dataset via a physics-aware captioning pipeline and proposes T-CLIP, a decoupled dual-LoRA architecture that adapts CLIP separately for scene-level and object-level thermal understanding. The central empirical claim is that T-CLIP yields consistent improvements over baselines in cross-modal retrieval across three thermal benchmarks, with an exploratory demonstration on text-conditioned thermal image generation.

Significance. A validated method for bridging the thermal perception gap in CLIP-style models would be useful for low-illumination and adverse-weather applications. The introduction of IR-Cap as the first physics-aware thermal caption dataset is a concrete contribution; however, the significance of the dual-LoRA architectural choice is undercut by the absence of evidence that the claimed representational conflict actually occurs or that decoupling is required.

major comments (2)
  1. [Abstract] The claim that 'global scene context and object-level heat signatures conflict when learned together in a single embedding space' is presented as a key representational challenge motivating the decoupled dual-LoRA design, yet no quantitative evidence (retrieval metrics, embedding similarity analysis, or performance degradation under joint training) is supplied to substantiate the conflict.
  2. [Method / Experiments] No ablation is reported that compares the proposed decoupled dual-LoRA against a single joint LoRA (or standard fine-tuning) on IR-Cap; without this comparison the attribution of gains to the decoupling remains unsupported and the architectural novelty cannot be assessed.
minor comments (1)
  1. [Abstract] The abstract states 'consistent improvements over all baselines' but supplies no numerical values, standard deviations, or baseline identities, making it impossible to judge effect size or statistical significance from the provided text.
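
One form the "embedding similarity analysis" requested in major comment 1 could take is the statistic named in the Figure 1 caption: the mean cosine similarity between each thermal image embedding and the embedding of its own caption. The sketch below is a minimal pure-Python version; the function names and toy embeddings are hypothetical, and it assumes the two lists are aligned so that index `i` in each refers to the same image-caption pair.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_matched_cosine(image_embs, text_embs):
    """Average cosine similarity over matched (image_i, caption_i) pairs."""
    sims = [cosine(i, t) for i, t in zip(image_embs, text_embs)]
    return sum(sims) / len(sims)


# Toy embeddings: the first pair is perfectly aligned, the second orthogonal.
toy_images = [[1.0, 0.0], [0.0, 2.0]]
toy_texts = [[1.0, 0.0], [1.0, 0.0]]
score = mean_matched_cosine(toy_images, toy_texts)
```

Reporting this score for a jointly trained adapter and for each decoupled branch, on the same held-out pairs, would be one way to put a number on the claimed conflict.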

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The claim that 'global scene context and object-level heat signatures conflict when learned together in a single embedding space' is presented as a key representational challenge motivating the decoupled dual-LoRA design, yet no quantitative evidence (retrieval metrics, embedding similarity analysis, or performance degradation under joint training) is supplied to substantiate the conflict.

    Authors: We acknowledge that the manuscript presents the representational conflict as a motivating challenge without direct quantitative substantiation such as embedding similarity metrics or joint-training degradation results. The claim is grounded in the distinct physics of thermal imaging, where global temperature distributions and localized heat signatures can interfere in a shared space, as indirectly supported by the consistent gains of T-CLIP over baselines. To address this, we will add an embedding analysis and a joint-versus-decoupled comparison in the revised version. revision: yes

  2. Referee: [Method / Experiments] No ablation is reported that compares the proposed decoupled dual-LoRA against a single joint LoRA (or standard fine-tuning) on IR-Cap; without this comparison the attribution of gains to the decoupling remains unsupported and the architectural novelty cannot be assessed.

    Authors: We agree that the current experiments do not include an ablation isolating the decoupled dual-LoRA from a single joint LoRA on IR-Cap, which limits direct attribution of gains to the decoupling. The reported improvements are over external baselines, but an internal comparison is needed to assess the architectural choice. We will add this ablation study in the revision. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical result on held-out benchmarks

Full rationale

The paper introduces a new dataset (IR-Cap) and architecture (decoupled dual-LoRA T-CLIP) and reports empirical improvements on three thermal benchmarks for cross-modal retrieval. No equations, fitted parameters renamed as predictions, or derivation chain appear in the provided text. The central claim is framed as an experimental outcome rather than a first-principles result that reduces to its inputs by construction. The stated representational challenge is presented as motivation, not as a derived theorem. Self-citations, if present in the full text, are not load-bearing for the reported gains. This is a standard empirical ML contribution with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities. The work implicitly relies on standard contrastive-learning assumptions and the premise that thermal images can be meaningfully captioned by LLMs once physics-aware prompts are supplied.
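
For readers unfamiliar with the "standard contrastive-learning assumptions" mentioned above, the sketch below shows the symmetric cross-entropy objective used by the original CLIP: each image should score highest against its own caption, and each caption against its own image. Whether T-CLIP trains its branches with exactly this loss is not stated on this page, so treat that as an assumption. The function name is hypothetical, and the fixed temperature of 0.07 is used only for illustration.

```python
import math


def clip_contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE loss over a square image-text similarity matrix.

    sim[i][j] is the similarity of image i and caption j; pair (i, i) matches.
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]

    def cross_entropy_diag(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for a stable log-sum-exp
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / n

    columns = [list(col) for col in zip(*logits)]  # caption-to-image view
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(columns))


aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]])
uninformative = clip_contrastive_loss([[0.0, 0.0], [0.0, 0.0]])
```

With an uninformative similarity matrix the loss equals log(n), and it falls toward zero as matched pairs separate from mismatched ones.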

pith-pipeline@v0.9.1-grok · 5706 in / 1157 out tokens · 16780 ms · 2026-06-28T18:51:04.559845+00:00 · methodology


Reference graph

Works this paper leans on

183 extracted references · 19 canonical work pages · 10 internal anchors

  1. [1] Bengio, Yoshua and LeCun, Yann. Scaling Learning Algorithms Towards
  2. [2] Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye. A Fast Learning Algorithm for Deep Belief Nets.
  3. [3] Deep learning. 2016.
  4. [4] Denoising Diffusion Probabilistic Models. NeurIPS.
  5. [5] Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems.
  6. [6] Improving multispectral pedestrian detection by addressing modality imbalance problems. Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16. 2020.
  7. [7] Multispectral pedestrian detection: Benchmark dataset and baseline. Proceedings of the IEEE conference on computer vision and pattern recognition.
  8. [8] Vehicle detection in aerial imagery: A small target detection benchmark. Journal of Visual Communication and Image Representation. 2016.
  9. [9] Image-to-Image Translation with Conditional Adversarial Networks. CVPR.
  10. [10] A u-net based discriminator for generative adversarial networks. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
  11. [11] InfraGAN: A GAN architecture to transfer visible images to infrared domain. Pattern Recognition Letters. 2022.
  12. [12] Seasons in drift: A long-term thermal imaging dataset for studying concept drift. Thirty-fifth Conference on Neural Information Processing Systems.
  13. [13] Diffusion Models Beat GANs on Image Synthesis. NeurIPS.
  14. [14] Palette: Image-to-image diffusion models. ACM SIGGRAPH 2022 Conference Proceedings.
  15. [15] Multi-branch semantic GAN for infrared image generation from optical image. Intelligence Science and Big Data Engineering. Visual Data Engineering: 9th International Conference, IScIDE 2019, Nanjing, China, October 17–20, 2019, Proceedings, Part I 9. 2019.
  16. [16] I-GANs for Synthetical Infrared Images Generation. 2022 International Conference on Machine Vision and Image Processing (MVIP). 2022.
  17. [17] Sparse gans for thermal infrared image generation from optical image. IEEE Access. 2020.
  18. [18] Analyzing and improving the image quality of stylegan. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
  19. [19] Taming transformers for high-resolution image synthesis. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
  20. [20] Understanding diffusion models: A unified perspective. arXiv preprint arXiv:2208.11970. https://doi.org/10.48550/arXiv.2208.11970
  21. [21] Imagenet large scale visual recognition challenge. International journal of computer vision. 2015.
  22. [22] A survey on audio diffusion models: Text to speech synthesis and enhancement in generative ai. arXiv preprint arXiv:2303.13336.
  23. [23] Recent Advances in Thermal Imaging and its Applications using Machine Learning: A Review. IEEE Sensors Journal.
  24. [24] Thermal cameras and applications: a survey. Machine vision and applications. 2014.
  25. [25] Infrared thermal imaging in medicine. Physiological measurement. 2012.
  26. [26] Virtual sensors determined through machine learning. 2018 World Automation Congress (WAC). 2018.
  27. [27] Mu-net: Deep learning-based thermal ir image estimation from rgb image. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops.
  28. [28] On the Role of Thermal Imaging in Automotive Applications: A Critical Review. IEEE Access.
  29. [29] Multicolor and dual-band IR camera for missile warning and automatic target recognition. Targets and Backgrounds VIII: Characterization and Representation. 2002.
  30. [30] Hybrid dual-color MWIR detector for airborne missile warning systems. Infrared Technology and Applications XXXVIII. 2012.
  31. [31] A review of modern thermal imaging sensor technology and applications for autonomous aerial navigation. Journal of Imaging. 2021.
  32. [32] Pedestrian detection at day/night time with visible and FIR cameras: A comparison. Sensors. 2016.
  33. [33] nuscenes: A multimodal dataset for autonomous driving. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
  34. [34] Infrared thermal imaging. Computer Vision: A Reference Guide. 2020.
  35. [35] Infrared thermal imaging: fundamentals, research and applications. 2018.
  36. [36] Unpaired thermal to visible spectrum transfer using adversarial training. Proceedings of the European Conference on Computer Vision (ECCV) Workshops.
  37. [37] A survey of infrared and visual image fusion methods. Infrared Physics & Technology. 2017.
  38. [38] Mars Global Surveyor Thermal Emission Spectrometer experiment: investigation description and surface science results. Journal of Geophysical Research: Planets. 2001.
  39. [39] A dataset for breast cancer histopathological image classification. IEEE transactions on biomedical engineering. 2015.
  40. [40] A method for synthesizing thermal images using GAN multi-layered approach. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences. 2021.
  41. [41] Thermalnet: a deep convolutional network for synthetic thermal image generation. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences. 2017.
  42. [42] Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. Journal of big Data. 2021.
  43. [43] U-net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. 2015.
  44. [44] Generative adversarial networks. Communications of the ACM. 2020.
  45. [45] Improved denoising diffusion probabilistic models. International Conference on Machine Learning. 2021.
  46. [46] Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303.
  47. [47] Unsupervised medical image translation with adversarial diffusion models. IEEE Transactions on Medical Imaging.
  48. [48] BBDM: Image-to-image translation with Brownian bridge diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  49. [49] Dual diffusion implicit bridges for image-to-image translation. arXiv preprint arXiv:2203.08382.
  50. [50] Unit-ddpm: Unpaired image translation with denoising diffusion probabilistic models. arXiv preprint arXiv:2104.05358.
  51. [51] Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2022.
  52. [52] Transient Attributes for High-Level Understanding and Editing of Outdoor Scenes. ACM Transactions on Graphics (proceedings of SIGGRAPH).
  53. [53] Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. ICCV.
  54. [54] Colorful image colorization. Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14. 2016.
  55. [55] Human-level control through deep reinforcement learning. Nature. 2015.
  56. [56] Deepface: Closing the gap to human-level performance in face verification. Proceedings of the IEEE conference on computer vision and pattern recognition.
  57. [57] Deep speech 2: End-to-end speech recognition in english and mandarin. International conference on machine learning. 2016.
  58. [58] Scaling egocentric vision: The epic-kitchens dataset. Proceedings of the European conference on computer vision (ECCV).
  59. [59] Scalability in perception for autonomous driving: Waymo open dataset. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
  60. [60] Internet explorer: Targeted representation learning on the open web. International Conference on Machine Learning. 2023.
  61. [61] Integrated multilevel image fusion and match score fusion of visible and infrared face images for robust face recognition. Pattern Recognition. 2008.
  62. [62] Fusion of color and infrared video for moving human detection. Pattern Recognition. 2007.
  63. [63] Bi-directional center-constrained top-ranking for visible thermal person re-identification. IEEE Transactions on Information Forensics and Security. 2019.
  64. [64] Visible thermal person re-identification via dual-constrained top-ranking. IJCAI.
  65. [65] A deep thermal-guided approach for effective low-light visible image enhancement. Neurocomputing. 2023.
  66. [66] Infrared target detection and location for visual surveillance using fusion scheme of visible and infrared images. Mathematical Problems in Engineering. 2013.
  67. [67] RTFNet: RGB-thermal fusion network for semantic segmentation of urban scenes. IEEE Robotics and Automation Letters. 2019.
  68. [68] FuseSeg: Semantic segmentation of urban scenes based on RGB and thermal data fusion. IEEE Transactions on Automation Science and Engineering. 2020.
  69. [69] A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Transactions on image processing. 2006.
  70. [70] Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing. 2004.
  71. [71] The unreasonable effectiveness of deep features as a perceptual metric. Proceedings of the IEEE conference on computer vision and pattern recognition.
  72. [72] Image quality metrics: PSNR vs. SSIM. 2010 20th international conference on pattern recognition. 2010.
  73. [73] High-resolution image synthesis with latent diffusion models. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
  74. [74] Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100.
  75. [75] End-to-end object detection with transformers. European conference on computer vision. 2020.
  76. [76] Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems.
  77. [77] Real-time flying object detection with YOLOv8. arXiv preprint arXiv:2305.09972.
  78. [78] Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605.
  79. [79] Multi-interactive feature learning and a full-time multi-modality benchmark for image fusion and segmentation. Proceedings of the IEEE/CVF international conference on computer vision.
  80. [80] Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159.

Showing first 80 references.