pith. machine review for the scientific record.

arxiv: 2605.13621 · v1 · submitted 2026-05-13 · 💻 cs.CV

Recognition: unknown

WD-FQDet: Multispectral Detection Transformer via Wavelet Decomposition and Frequency-aware Query Learning


Pith reviewed 2026-05-14 19:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords multispectral object detection · infrared-visible fusion · wavelet decomposition · frequency domain · detection transformer · modality-shared features · modality-specific features · query selection

The pith

Wavelet decomposition decouples shared low-frequency and specific high-frequency features from infrared and visible images to improve object detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a detection framework that uses wavelet decomposition to separate infrared and visible image features into low-frequency components common to both modalities and high-frequency components unique to each. This separation lets a cross-modal attention module align the shared low-frequency information while a gradient consistency loss retains the modality-specific high-frequency details. A hybrid enhancement module incorporates spatial information, and a query selection module adjusts the balance between shared and specific features depending on the scene. The result is less bias in the shared features and less loss of modality-specific detail than in prior fusion approaches.

Core claim

WD-FQDet explicitly decouples modality-shared and modality-specific information from infrared and visible modalities in the new view of low- and high-frequency domains via wavelet decomposition. A low-frequency homogeneity alignment module aligns shared features across modalities via cross-modal attention, a high-frequency specificity retention module preserves modality-specific features through multi-scale gradient consistency loss, a hybrid feature enhancement module incorporates spatial cues, and a frequency-aware query selection module dynamically regulates their contributions, yielding state-of-the-art performance on the FLIR, LLVIP, and M3FD datasets.
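The multi-scale gradient consistency loss is named but not written out in this review. A minimal sketch of one plausible form, assuming Sobel gradients and an element-wise-max fusion target (both choices are illustrative, not the paper's stated formulation):

```python
import numpy as np

# Sobel kernels for horizontal and vertical image gradients.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def conv2d_valid(img, kernel):
    """Plain 'valid'-mode 2D correlation of a single-channel image."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def gradient_magnitude(img):
    gx = conv2d_valid(img, SOBEL_X)
    gy = conv2d_valid(img, SOBEL_Y)
    return np.sqrt(gx ** 2 + gy ** 2)

def gradient_consistency_loss(fused, ir, vis, scales=(1, 2)):
    """L1 distance between the fused gradient map and the element-wise
    maximum of the two source gradient maps, averaged over strided
    downsamplings as a stand-in for 'multi-scale'."""
    total = 0.0
    for s in scales:
        f, a, b = fused[::s, ::s], ir[::s, ::s], vis[::s, ::s]
        target = np.maximum(gradient_magnitude(a), gradient_magnitude(b))
        total += np.mean(np.abs(gradient_magnitude(f) - target))
    return total / len(scales)
```

The loss is zero when the fused features already carry the stronger of the two source gradients at every position, which is one way to penalize loss of modality-specific edges.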

What carries the argument

Wavelet decomposition that splits inputs into low-frequency modality-shared and high-frequency modality-specific domains, paired with alignment, retention, and frequency-aware query selection modules.
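The decomposition itself is easy to make concrete. A one-level 2D Haar transform (one common wavelet basis; the review does not state which basis WD-FQDet uses) splits an image into the low-frequency approximation the paper treats as modality-shared and three high-frequency detail subbands treated as modality-specific:

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2D Haar wavelet decomposition of a single-channel image.

    Returns (LL, (LH, HL, HH)): the low-frequency approximation and the
    three high-frequency detail subbands, each at half resolution.
    """
    x = x[: x.shape[0] // 2 * 2, : x.shape[1] // 2 * 2]  # crop to even size
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0  # low-pass in both directions
    lh = (a - b + c - d) / 2.0  # horizontal detail
    hl = (a + b - c - d) / 2.0  # vertical detail
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return ll, (lh, hl, hh)

# A constant image has all its energy in LL; the detail subbands vanish.
ll, (lh, hl, hh) = haar_dwt2(np.ones((8, 8)))
```

With this normalization the transform is orthonormal, so total energy is preserved across the four subbands, which matters if alignment and retention losses are to be balanced against each other.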

If this is right

  • Shared low-frequency features can be aligned across modalities to reduce bias without losing complementary details.
  • Modality-specific high-frequency features are retained via gradient consistency to address insufficiency in fusion.
  • Dynamic query selection adapts the weight of homogeneous versus specific features to different detection scenarios.
  • State-of-the-art results appear across multiple metrics on the FLIR, LLVIP, and M3FD benchmarks.
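The review gives no equations for the frequency-aware query selection module; the sketch below only illustrates the underlying idea of scene-dependent re-weighting, with hypothetical learned prototype vectors `w_shared` and `w_specific` standing in for whatever scoring the paper actually uses:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def frequency_aware_weights(queries, w_shared, w_specific):
    """Score each detection query against two hypothetical learned
    prototype vectors and softmax the scores into mixing weights."""
    scores = np.stack([queries @ w_shared, queries @ w_specific], axis=-1)
    return softmax(scores, axis=-1)  # shape (num_queries, 2), rows sum to 1

def mix_frequency_features(queries, f_low, f_high, w_shared, w_specific):
    """Blend low-frequency (shared) and high-frequency (specific)
    features per query according to the gating weights."""
    w = frequency_aware_weights(queries, w_shared, w_specific)
    return w[:, :1] * f_low + w[:, 1:] * f_high
```

Because the weights are computed per query, a night scene can lean on thermal high-frequency detail for some objects while daytime queries lean on shared structure, which is the adaptivity the bullet above describes.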

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The frequency split could be tested on additional modality pairs such as RGB and depth to check whether low-frequency alignment generalizes beyond infrared-visible cases.
  • If the separation works reliably, it may reduce reliance on separate backbone designs for each modality in future multispectral detectors.
  • Efficiency measurements on embedded hardware would show whether the added wavelet and query modules support real-time use.

Load-bearing premise

Wavelet decomposition cleanly separates modality-shared low-frequency features from modality-specific high-frequency features without introducing artifacts or bias that the alignment and retention modules cannot correct.

What would settle it

Retraining and testing the model on the same datasets after removing the wavelet decomposition step; if accuracy falls to levels matching or below standard fusion baselines, the frequency-decoupling premise does not hold.

Figures

Figures reproduced from arXiv: 2605.13621 by Chunjin Yang, Fanman Meng, Xiwei Zhang, Yiming Xiao.

Figure 1: (a) The backbone-specific method introduces severe bias …
Figure 2: The overall architecture of our WD-FQDet, which comprises three key modules. First, features extracted by a modality …
Figure 3: Illustration of the frequency-aware query-selection module.
Figure 4: Visualization results on the FLIR dataset. False positives …
Figure 5: (a) represents the visible image; (b) represents the corresponding …
Original abstract

Infrared-visible object detection improves detection performance by combining complementary features from multispectral images. Existing backbone-specific and backbone-shared approaches still suffer from the problems of severe bias of modality-shared features and the insufficiency of modality-specific features. To address these issues, we propose a novel detection framework WD-FQDet that explicitly decouples modality-shared and modality-specific information from infrared and visible modalities in the new view of low- and high-frequency domains, allowing fusion strategies tailored to their frequency characteristics. Specifically, a low-frequency homogeneity alignment module is proposed to align modality-shared features across modalities via a cross-modal attention mechanism, and a high-frequency specificity retention module is proposed to preserve modality-specific features through the multi-scale gradient consistency loss. To reinforce the feature representation in the frequency domain, we propose a hybrid feature enhancement module that incorporates spatial cues. Furthermore, considering that the contributions of homogeneous and modality-specific features to object detection vary across scenarios, we propose a frequency-aware query selection module to dynamically regulate their contributions. Experimental results on the FLIR, LLVIP, and M3FD datasets demonstrate that WD-FQDet achieves state-of-the-art performance across multiple evaluation metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents WD-FQDet, a multispectral detection transformer for infrared-visible object detection. It claims to explicitly decouple modality-shared and modality-specific information via wavelet decomposition into low- and high-frequency domains, enabling tailored fusion through a low-frequency homogeneity alignment module (cross-modal attention), a high-frequency specificity retention module (multi-scale gradient consistency loss), a hybrid feature enhancement module, and a frequency-aware query selection module. The work reports state-of-the-art performance across multiple metrics on the FLIR, LLVIP, and M3FD datasets.

Significance. If the frequency-domain decoupling proves effective and the modules mitigate bias and insufficiency without introducing artifacts, the framework could offer a principled advance over backbone-specific or shared fusion methods by exploiting complementary frequency characteristics in multispectral imagery. The dynamic query selection and gradient consistency loss represent potentially useful mechanisms for scenario-adaptive fusion.

major comments (2)
  1. [Method (wavelet decomposition and frequency modules)] The central claim in the abstract and method description rests on the assumption that wavelet decomposition (presumably DWT) maps modality-shared information predominantly to low-frequency subbands and modality-specific information to high-frequency subbands. No quantitative validation—such as cross-modal mutual information, cosine similarity, or correlation metrics computed per subband on FLIR/LLVIP/M3FD—is reported to confirm the separation is sufficiently clean; low-frequency components can encode modality-specific biases (thermal gradients vs. illumination) while high-frequency edges may be shared, undermining the subsequent alignment and retention modules.
  2. [Experiments] The experimental claims of SOTA performance lack supporting details: no full baseline tables, module-wise ablations, statistical significance tests, or error analysis (e.g., failure cases under varying illumination) are described, making it impossible to assess whether the reported gains are attributable to the frequency-aware components or to implementation specifics.
minor comments (1)
  1. [Method] Clarify the exact wavelet basis and decomposition levels used, and provide equations for the multi-scale gradient consistency loss and frequency-aware query selection to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review of our manuscript on WD-FQDet. We have carefully considered each major comment and provide point-by-point responses below. We agree that additional validation and experimental details will strengthen the paper and will incorporate the suggested changes in the revised version.

Point-by-point responses
  1. Referee: [Method (wavelet decomposition and frequency modules)] The central claim in the abstract and method description rests on the assumption that wavelet decomposition (presumably DWT) maps modality-shared information predominantly to low-frequency subbands and modality-specific information to high-frequency subbands. No quantitative validation—such as cross-modal mutual information, cosine similarity, or correlation metrics computed per subband on FLIR/LLVIP/M3FD—is reported to confirm the separation is sufficiently clean; low-frequency components can encode modality-specific biases (thermal gradients vs. illumination) while high-frequency edges may be shared, undermining the subsequent alignment and retention modules.

    Authors: We appreciate this observation regarding the need for explicit validation of the frequency-domain separation. The design of WD-FQDet is motivated by the established frequency separation properties of discrete wavelet transform (DWT), where low-frequency subbands typically encode shared structural information and high-frequency subbands capture modality-specific details. However, we acknowledge that direct quantitative metrics were not reported in the original submission. In the revised manuscript, we will add cross-modal mutual information, cosine similarity, and correlation analyses computed per subband on the FLIR, LLVIP, and M3FD datasets to empirically confirm the decoupling quality and address potential concerns about modality-specific biases in low-frequency components. revision: yes
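The per-subband similarity analysis promised here can be sketched as follows (Haar basis and cosine similarity assumed; the authors' actual protocol may differ). Under the paper's premise, paired infrared-visible images should score high on the LL subband and lower on the detail subbands:

```python
import numpy as np

def haar_subbands(x):
    """One-level Haar decomposition returning the four subbands by name."""
    x = x[: x.shape[0] // 2 * 2, : x.shape[1] // 2 * 2]
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    return {"LL": (a + b + c + d) / 2, "LH": (a - b + c - d) / 2,
            "HL": (a + b - c - d) / 2, "HH": (a - b - c + d) / 2}

def cosine(u, v):
    u, v = u.ravel(), v.ravel()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def subband_similarity(ir, vis):
    """Cosine similarity between corresponding subbands of a paired
    infrared/visible image. High LL similarity and lower LH/HL/HH
    similarity would support the decoupling premise."""
    bi, bv = haar_subbands(ir), haar_subbands(vis)
    return {k: cosine(bi[k], bv[k]) for k in bi}
```

Averaged over a dataset, these four numbers would directly test whether low frequencies are in fact the modality-shared component, which is the validation the referee asks for.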

  2. Referee: [Experiments] The experimental claims of SOTA performance lack supporting details: no full baseline tables, module-wise ablations, statistical significance tests, or error analysis (e.g., failure cases under varying illumination) are described, making it impossible to assess whether the reported gains are attributable to the frequency-aware components or to implementation specifics.

    Authors: We agree that expanded experimental details are required for full transparency. While the original manuscript presents SOTA results and some ablation studies, we will revise the experimental section to include complete baseline comparison tables, comprehensive module-wise ablations, statistical significance testing (e.g., paired t-tests across multiple runs), and a dedicated error analysis subsection examining failure cases under varying illumination and other conditions. These additions will better isolate the contributions of the frequency-aware modules. revision: yes

Circularity Check

0 steps flagged

No significant circularity in WD-FQDet derivation chain

full rationale

The paper introduces an architectural framework using wavelet decomposition to separate low- and high-frequency components, followed by explicitly defined modules (low-frequency homogeneity alignment via cross-modal attention, high-frequency specificity retention via gradient consistency loss, hybrid enhancement, and frequency-aware query selection). These are presented as design choices motivated by the frequency-domain view rather than derived from or reducing to the final outputs. No equations or claims reduce predictions to fitted parameters by construction, and no load-bearing self-citations or uniqueness theorems are invoked. Performance is assessed via standard empirical evaluation on external datasets (FLIR, LLVIP, M3FD), keeping the central claims grounded in evidence external to the framework's own design choices.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no mathematical formulations, derivations, or implementation details, so specific free parameters, axioms, or invented entities cannot be identified or audited.

pith-pipeline@v0.9.0 · 5510 in / 1181 out tokens · 42356 ms · 2026-05-14T19:11:48.767464+00:00 · methodology

