pith. machine review for the scientific record.

arxiv: 2605.13621 · v1 · submitted 2026-05-13 · 💻 cs.CV

Recognition: unknown

WD-FQDet: Multispectral Detection Transformer via Wavelet Decomposition and Frequency-aware Query Learning


Pith reviewed 2026-05-14 19:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords multispectral object detection · infrared-visible fusion · wavelet decomposition · frequency domain · detection transformer · modality-shared features · modality-specific features · query selection

The pith

Wavelet decomposition decouples shared low-frequency and specific high-frequency features from infrared and visible images to improve object detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a detection framework that uses wavelet decomposition to separate infrared and visible image features into low-frequency components common to both modalities and high-frequency components unique to each. This separation lets a cross-modal attention module align the shared low-frequency information while a gradient consistency loss retains the modality-specific high-frequency details. A hybrid enhancement module incorporates spatial information, and a query selection module adjusts the balance between shared and specific features depending on the scene. The result is less bias in the shared features and less loss of modality-specific detail than in prior fusion approaches.

Core claim

WD-FQDet explicitly decouples modality-shared and modality-specific information from infrared and visible modalities in the new view of low- and high-frequency domains via wavelet decomposition. A low-frequency homogeneity alignment module aligns shared features across modalities via cross-modal attention, a high-frequency specificity retention module preserves modality-specific features through multi-scale gradient consistency loss, a hybrid feature enhancement module incorporates spatial cues, and a frequency-aware query selection module dynamically regulates their contributions, yielding state-of-the-art performance on the FLIR, LLVIP, and M3FD datasets.
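The multi-scale gradient consistency loss is named but not written out in this review. A minimal sketch of one plausible form, assuming Sobel gradients and an element-wise-max fusion target (both choices are illustrative, not the paper's stated formulation):

```python
import numpy as np

# Sobel kernels for horizontal and vertical image gradients.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def conv2d_valid(img, kernel):
    """Plain 'valid'-mode 2D correlation of a single-channel image."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def gradient_magnitude(img):
    gx = conv2d_valid(img, SOBEL_X)
    gy = conv2d_valid(img, SOBEL_Y)
    return np.sqrt(gx ** 2 + gy ** 2)

def gradient_consistency_loss(fused, ir, vis, scales=(1, 2)):
    """L1 distance between the fused gradient map and the element-wise
    maximum of the two source gradient maps, averaged over strided
    downsamplings as a stand-in for 'multi-scale'."""
    total = 0.0
    for s in scales:
        f, a, b = fused[::s, ::s], ir[::s, ::s], vis[::s, ::s]
        target = np.maximum(gradient_magnitude(a), gradient_magnitude(b))
        total += np.mean(np.abs(gradient_magnitude(f) - target))
    return total / len(scales)
```

The loss is zero when the fused features already carry the stronger of the two source gradients at every position, which is one way to penalize loss of modality-specific edges.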

What carries the argument

Wavelet decomposition that splits inputs into low-frequency modality-shared and high-frequency modality-specific domains, paired with alignment, retention, and frequency-aware query selection modules.
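The decomposition itself is easy to make concrete. A one-level 2D Haar transform (one common wavelet basis; the review does not state which basis WD-FQDet uses) splits an image into the low-frequency approximation the paper treats as modality-shared and three high-frequency detail subbands treated as modality-specific:

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2D Haar wavelet decomposition of a single-channel image.

    Returns (LL, (LH, HL, HH)): the low-frequency approximation and the
    three high-frequency detail subbands, each at half resolution.
    """
    x = x[: x.shape[0] // 2 * 2, : x.shape[1] // 2 * 2]  # crop to even size
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0  # low-pass in both directions
    lh = (a - b + c - d) / 2.0  # horizontal detail
    hl = (a + b - c - d) / 2.0  # vertical detail
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return ll, (lh, hl, hh)

# A constant image has all its energy in LL; the detail subbands vanish.
ll, (lh, hl, hh) = haar_dwt2(np.ones((8, 8)))
```

With this normalization the transform is orthonormal, so total energy is preserved across the four subbands, which matters if alignment and retention losses are to be balanced against each other.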

If this is right

  • Shared low-frequency features can be aligned across modalities to reduce bias without losing complementary details.
  • Modality-specific high-frequency features are retained via gradient consistency to address insufficiency in fusion.
  • Dynamic query selection adapts the weight of homogeneous versus specific features to different detection scenarios.
  • State-of-the-art results appear across multiple metrics on the FLIR, LLVIP, and M3FD benchmarks.
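The review gives no equations for the frequency-aware query selection module; the sketch below only illustrates the underlying idea of scene-dependent re-weighting, with hypothetical learned prototype vectors `w_shared` and `w_specific` standing in for whatever scoring the paper actually uses:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def frequency_aware_weights(queries, w_shared, w_specific):
    """Score each detection query against two hypothetical learned
    prototype vectors and softmax the scores into mixing weights."""
    scores = np.stack([queries @ w_shared, queries @ w_specific], axis=-1)
    return softmax(scores, axis=-1)  # shape (num_queries, 2), rows sum to 1

def mix_frequency_features(queries, f_low, f_high, w_shared, w_specific):
    """Blend low-frequency (shared) and high-frequency (specific)
    features per query according to the gating weights."""
    w = frequency_aware_weights(queries, w_shared, w_specific)
    return w[:, :1] * f_low + w[:, 1:] * f_high
```

Because the weights are computed per query, a night scene can lean on thermal high-frequency detail for some objects while daytime queries lean on shared structure, which is the adaptivity the bullet above describes.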

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The frequency split could be tested on additional modality pairs such as RGB and depth to check whether low-frequency alignment generalizes beyond infrared-visible cases.
  • If the separation works reliably, it may reduce reliance on separate backbone designs for each modality in future multispectral detectors.
  • Efficiency measurements on embedded hardware would show whether the added wavelet and query modules support real-time use.

Load-bearing premise

Wavelet decomposition cleanly separates modality-shared low-frequency features from modality-specific high-frequency features without introducing artifacts or bias that the alignment and retention modules cannot correct.

What would settle it

Retraining and testing the model on the same datasets after removing the wavelet decomposition step; if accuracy falls to levels matching or below standard fusion baselines, the frequency-decoupling premise does not hold.

Figures

Figures reproduced from arXiv: 2605.13621 by Chunjin Yang, Fanman Meng, Xiwei Zhang, Yiming Xiao.

Figure 1: (a) The backbone-specific method introduces severe bias …
Figure 2: The overall architecture of our WD-FQDet, which comprises three key modules. First, features extracted by a modality …
Figure 3: Illustration of the frequency-aware query-selection module.
Figure 4: Visualization results on the FLIR dataset. False positives …
Figure 5: (a) represents the visible image; (b) represents the corresponding …
Original abstract

Infrared-visible object detection improves detection performance by combining complementary features from multispectral images. Existing backbone-specific and backbone-shared approaches still suffer from the problems of severe bias of modality-shared features and the insufficiency of modality-specific features. To address these issues, we propose a novel detection framework WD-FQDet that explicitly decouples modality-shared and modality-specific information from infrared and visible modalities in the new view of low- and high-frequency domains, allowing fusion strategies tailored to their frequency characteristics. Specifically, a low-frequency homogeneity alignment module is proposed to align modality-shared features across modalities via a cross-modal attention mechanism, and a high-frequency specificity retention module is proposed to preserve modality-specific features through the multi-scale gradient consistency loss. To reinforce the feature representation in the frequency domain, we propose a hybrid feature enhancement module that incorporates spatial cues. Furthermore, considering that the contributions of homogeneous and modality-specific features to object detection vary across scenarios, we propose a frequency-aware query selection module to dynamically regulate their contributions. Experimental results on the FLIR, LLVIP, and M3FD datasets demonstrate that WD-FQDet achieves state-of-the-art performance across multiple evaluation metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents WD-FQDet, a multispectral detection transformer for infrared-visible object detection. It claims to explicitly decouple modality-shared and modality-specific information via wavelet decomposition into low- and high-frequency domains, enabling tailored fusion through a low-frequency homogeneity alignment module (cross-modal attention), a high-frequency specificity retention module (multi-scale gradient consistency loss), a hybrid feature enhancement module, and a frequency-aware query selection module. The work reports state-of-the-art performance across multiple metrics on the FLIR, LLVIP, and M3FD datasets.

Significance. If the frequency-domain decoupling proves effective and the modules mitigate bias and insufficiency without introducing artifacts, the framework could offer a principled advance over backbone-specific or shared fusion methods by exploiting complementary frequency characteristics in multispectral imagery. The dynamic query selection and gradient consistency loss represent potentially useful mechanisms for scenario-adaptive fusion.

major comments (2)
  1. [Method (wavelet decomposition and frequency modules)] The central claim in the abstract and method description rests on the assumption that wavelet decomposition (presumably DWT) maps modality-shared information predominantly to low-frequency subbands and modality-specific information to high-frequency subbands. No quantitative validation—such as cross-modal mutual information, cosine similarity, or correlation metrics computed per subband on FLIR/LLVIP/M3FD—is reported to confirm the separation is sufficiently clean; low-frequency components can encode modality-specific biases (thermal gradients vs. illumination) while high-frequency edges may be shared, undermining the subsequent alignment and retention modules.
  2. [Experiments] The experimental claims of SOTA performance lack supporting details: no full baseline tables, module-wise ablations, statistical significance tests, or error analysis (e.g., failure cases under varying illumination) are described, making it impossible to assess whether the reported gains are attributable to the frequency-aware components or to implementation specifics.
minor comments (1)
  1. [Method] Clarify the exact wavelet basis and decomposition levels used, and provide equations for the multi-scale gradient consistency loss and frequency-aware query selection to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review of our manuscript on WD-FQDet. We have carefully considered each major comment and provide point-by-point responses below. We agree that additional validation and experimental details will strengthen the paper and will incorporate the suggested changes in the revised version.

Point-by-point responses
  1. Referee: [Method (wavelet decomposition and frequency modules)] The central claim in the abstract and method description rests on the assumption that wavelet decomposition (presumably DWT) maps modality-shared information predominantly to low-frequency subbands and modality-specific information to high-frequency subbands. No quantitative validation—such as cross-modal mutual information, cosine similarity, or correlation metrics computed per subband on FLIR/LLVIP/M3FD—is reported to confirm the separation is sufficiently clean; low-frequency components can encode modality-specific biases (thermal gradients vs. illumination) while high-frequency edges may be shared, undermining the subsequent alignment and retention modules.

    Authors: We appreciate this observation regarding the need for explicit validation of the frequency-domain separation. The design of WD-FQDet is motivated by the established frequency separation properties of discrete wavelet transform (DWT), where low-frequency subbands typically encode shared structural information and high-frequency subbands capture modality-specific details. However, we acknowledge that direct quantitative metrics were not reported in the original submission. In the revised manuscript, we will add cross-modal mutual information, cosine similarity, and correlation analyses computed per subband on the FLIR, LLVIP, and M3FD datasets to empirically confirm the decoupling quality and address potential concerns about modality-specific biases in low-frequency components. revision: yes
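The per-subband similarity analysis promised here can be sketched as follows (Haar basis and cosine similarity assumed; the authors' actual protocol may differ). Under the paper's premise, paired infrared-visible images should score high on the LL subband and lower on the detail subbands:

```python
import numpy as np

def haar_subbands(x):
    """One-level Haar decomposition returning the four subbands by name."""
    x = x[: x.shape[0] // 2 * 2, : x.shape[1] // 2 * 2]
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    return {"LL": (a + b + c + d) / 2, "LH": (a - b + c - d) / 2,
            "HL": (a + b - c - d) / 2, "HH": (a - b - c + d) / 2}

def cosine(u, v):
    u, v = u.ravel(), v.ravel()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def subband_similarity(ir, vis):
    """Cosine similarity between corresponding subbands of a paired
    infrared/visible image. High LL similarity and lower LH/HL/HH
    similarity would support the decoupling premise."""
    bi, bv = haar_subbands(ir), haar_subbands(vis)
    return {k: cosine(bi[k], bv[k]) for k in bi}
```

Averaged over a dataset, these four numbers would directly test whether low frequencies are in fact the modality-shared component, which is the validation the referee asks for.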

  2. Referee: [Experiments] The experimental claims of SOTA performance lack supporting details: no full baseline tables, module-wise ablations, statistical significance tests, or error analysis (e.g., failure cases under varying illumination) are described, making it impossible to assess whether the reported gains are attributable to the frequency-aware components or to implementation specifics.

    Authors: We agree that expanded experimental details are required for full transparency. While the original manuscript presents SOTA results and some ablation studies, we will revise the experimental section to include complete baseline comparison tables, comprehensive module-wise ablations, statistical significance testing (e.g., paired t-tests across multiple runs), and a dedicated error analysis subsection examining failure cases under varying illumination and other conditions. These additions will better isolate the contributions of the frequency-aware modules. revision: yes

Circularity Check

0 steps flagged

No significant circularity in WD-FQDet derivation chain

full rationale

The paper introduces an architectural framework using wavelet decomposition to separate low- and high-frequency components, followed by explicitly defined modules (low-frequency homogeneity alignment via cross-modal attention, high-frequency specificity retention via gradient consistency loss, hybrid enhancement, and frequency-aware query selection). These are presented as design choices motivated by the frequency-domain view rather than derived from or reducing to the final outputs. No equations or claims reduce predictions to fitted parameters by construction, and no load-bearing self-citations or uniqueness theorems are invoked. Performance is assessed via standard empirical evaluation on external datasets (FLIR, LLVIP, M3FD), keeping the central claims grounded in evidence external to the framework's own design choices.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no mathematical formulations, derivations, or implementation details, so specific free parameters, axioms, or invented entities cannot be identified or audited.

pith-pipeline@v0.9.0 · 5510 in / 1181 out tokens · 42356 ms · 2026-05-14T19:11:48.767464+00:00 · methodology

