LER-YOLO: Reliability-Aware Expert Routing for Misaligned RGB-Infrared UAV Detection

Hexiang Hao; Ji Wang; Liming Hou; Wei Tang; Xin Ying; Xuekai Zhang; Yubo He; Yueping Peng; Zecong Ye

arxiv: 2605.20667 · v1 · pith:PEWOS2I7new · submitted 2026-05-20 · 💻 cs.CV

Liming Hou, Yueping Peng, Hexiang Hao, Ji Wang, Xuekai Zhang, Wei Tang, Zecong Ye, Xin Ying, Yubo He

Pith reviewed 2026-05-21 05:46 UTC · model grok-4.3

classification 💻 cs.CV

keywords: UAV detection, RGB-infrared fusion, mixture of experts, spatial misalignment, reliability map, target alignment, remote sensing

The pith

A spatial reliability map from target alignment lets sparse MoE fusion suppress unreliable RGB-infrared matches for UAV detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Detecting small UAVs in RGB-infrared remote sensing is difficult because spatial misalignment between the two sensors creates local mismatches that standard fusion methods propagate into the detector. The paper establishes that first resampling RGB features to an infrared reference and estimating a per-location trustworthiness map allows the system to know which cross-sensor correspondences are safe to use. This map then controls a sparse mixture-of-experts fusion block that picks among RGB-dominant, infrared-dominant, and interactive experts on a per-region basis. The result is trustworthy cross-modal interaction without letting mismatch artifacts reach the detection head. If the approach holds, detectors can handle real-world misalignment more gracefully while keeping model size comparable to a standard YOLOv5s.

Core claim

The central claim is that an Uncertainty-Aware Target Alignment module produces a spatial reliability map by resampling visible features toward the infrared reference, and a Reliability-Guided Sparse MoE Fusion module then uses this map to adaptively route to k experts drawn from RGB-dominant, infrared-dominant, and interactive fusion experts, enabling suppression of unreliable fusion while preserving useful information and yielding 89.7 percent average AP50 on the MBU benchmark.

What carries the argument

The Uncertainty-Aware Target Alignment module that generates the spatial reliability map, combined with the Reliability-Guided Sparse MoE Fusion module that uses the map to select and weight experts.
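A minimal sketch of how a reliability value could steer sparse top-k routing at a single spatial location. Everything here is an assumption made for illustration: the `route` and `fuse` helpers, the way reliability biases only the interactive expert's logit, and the toy expert formulas are hypothetical and are not taken from the paper.

```python
import math

EXPERTS = ("rgb_dominant", "ir_dominant", "interactive")

def route(logits, reliability, k=2, eps=1e-6):
    """Return {expert_name: weight} for one spatial location.

    Assumption: reliability in [0, 1] biases only the interactive expert's
    logit through log(reliability + eps), so a low value suppresses
    cross-modal interaction. The paper's actual rule may differ.
    """
    biased = list(logits)
    biased[2] += math.log(reliability + eps)
    # Sparse selection: keep the indices of the k largest biased logits.
    top = sorted(range(len(biased)), key=lambda i: biased[i], reverse=True)[:k]
    # Softmax restricted to the selected experts.
    peak = max(biased[i] for i in top)
    exps = {i: math.exp(biased[i] - peak) for i in top}
    total = sum(exps.values())
    return {EXPERTS[i]: exps[i] / total for i in top}

def fuse(f_rgb, f_ir, weights):
    """Weighted sum of toy scalar experts (hypothetical expert definitions)."""
    outputs = {
        "rgb_dominant": 0.75 * f_rgb + 0.25 * f_ir,
        "ir_dominant": 0.25 * f_rgb + 0.75 * f_ir,
        "interactive": f_rgb * f_ir,
    }
    return sum(w * outputs[name] for name, w in weights.items())

gate_logits = [0.0, 0.2, 1.0]  # hypothetical router output
reliable = route(gate_logits, reliability=0.95)
unreliable = route(gate_logits, reliability=0.05)
fused_unreliable = fuse(1.0, 2.0, unreliable)
```

With these toy numbers the interactive expert is selected where the correspondence is trusted and drops out of the top-k set where it is not, leaving only the two single-modality experts.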

If this is right

Detection reaches 89.7 ± 0.2 percent AP50 across three independent seeds, with a best run of 89.9 percent.
Gains arise from the reliability-guided routing mechanism rather than from added model capacity.
Unreliable cross-modal interactions are suppressed while useful information from either modality is retained.
Performance remains stable under synthetic spatial shifts that simulate varying degrees of misalignment.
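To make the three-seed summary concrete, the sketch below shows how a mean, a spread, and a best run are typically derived from per-seed scores. The per-seed values are hypothetical, chosen only because they are consistent with the reported mean of 89.7 and best of 89.9; the actual per-seed results are not given on this page.

```python
import statistics

# Hypothetical per-seed AP50 values (NOT from the paper); picked so that the
# summary matches the reported mean and best run.
ap50_by_seed = [89.5, 89.7, 89.9]

mean_ap50 = statistics.mean(ap50_by_seed)
sample_std = statistics.stdev(ap50_by_seed)       # divides by n - 1
population_std = statistics.pstdev(ap50_by_seed)  # divides by n
best = max(ap50_by_seed)

print(f"{mean_ap50:.1f} +/- {sample_std:.1f} (best {best:.1f})")
```

With only three runs, the sample and population standard deviations differ noticeably, so it matters which one a reported spread refers to.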

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reliability-guided routing could be tested on other misaligned multi-modal tasks such as visible-thermal pedestrian detection or satellite-ground fusion.
If the reliability map correlates with actual geometric error, the method might reduce the need for hardware-level sensor calibration in field deployments.
Applying the routing layer to backbones other than YOLOv5s would test whether the benefit is tied to the particular detection architecture.

Load-bearing premise

The spatial reliability map produced by the Uncertainty-Aware Target Alignment module accurately reflects the trustworthiness of local cross-sensor correspondence and can be used to safely suppress unreliable fusion without discarding useful information.

What would settle it

Replace the learned spatial reliability map with a uniform or random map of equal average value and check whether the AP50 gain over a parameter-matched baseline disappears on the MBU benchmark.
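The control described above needs substitute maps with the same average value as the learned one. A minimal sketch, assuming the reliability map is a rectangular list of rows with values in [0, 1]; the helper names and the example map are hypothetical. Shuffling is used for the random control because it preserves the mean exactly.

```python
import random

def flatten(rel_map):
    return [v for row in rel_map for v in row]

def mean_of(rel_map):
    flat = flatten(rel_map)
    return sum(flat) / len(flat)

def uniform_control(rel_map):
    """Constant map whose single value equals the mean of the learned map."""
    mean = mean_of(rel_map)
    return [[mean for _ in row] for row in rel_map]

def shuffled_control(rel_map, seed=0):
    """Same values at random positions: identical mean, no spatial meaning."""
    flat = flatten(rel_map)
    random.Random(seed).shuffle(flat)
    width = len(rel_map[0])
    return [flat[i:i + width] for i in range(0, len(flat), width)]

# Hypothetical learned reliability map (2 rows x 3 columns).
learned = [[0.9, 0.8, 0.1],
           [0.7, 0.2, 0.1]]
uniform = uniform_control(learned)
shuffled = shuffled_control(learned)
```

If detection accuracy with `uniform` or `shuffled` matched accuracy with `learned`, the spatial content of the map would not be what drives the gain.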

Figures

Figures reproduced from arXiv: 2605.20667 by Hexiang Hao, Ji Wang, Liming Hou, Wei Tang, Xin Ying, Xuekai Zhang, Yubo He, Yueping Peng, Zecong Ye.

**Figure 1.** Overall architecture of MoE-MBUDet, a YOLOv8-based anti-drone detection framework formed … [Diagram labels recovered from the figure: visible and infrared input images; CSPDarknet backbones with non-shared weights; P3, P4, P5 feature levels {Frgb} and {Fir}; aligned feature F'rgb; dynamic gating router; top-k expert fusion over experts E1 … EK with weights w1 … wK; YOLOv5 detection head for tiny UAV localization and classification.]

**Figure 2.** Detailed architecture of the Mixture-of-Experts Fusion mechanism (MoEFusion) for anti-drone detection.

**Figure 6.** Representative RGB-infrared examples and measured expert weights under daytime, dark, and strong-backlight scenes.

Excerpt from §4.10 (Discussion and Limitations): The experimental results support three observations. First, RGB-only and infrared-only detection both provide useful single-modality evidence on the MBU benchmark, with infrared remaining slightly stronger under the infrared-reference protocol. This is consiste…

Original abstract

Detecting small unmanned aerial vehicles from RGB-infrared remote-sensing pairs remains challenging due to tiny target scale, cluttered backgrounds, and spatial misalignment between heterogeneous sensors. Existing bimodal detectors often align or fuse features without assessing the reliability of local cross-sensor correspondence, allowing mismatch artifacts to propagate into the detection head. To address this issue, we propose LER-YOLO, a reliability-aware sparse mixture-of-experts framework for misaligned RGB-infrared UAV detection. LER-YOLO first introduces an Uncertainty-Aware Target Alignment module that resamples visible features toward the infrared reference and estimates a spatial reliability map. This reliability prior is then used by a Reliability-Guided Sparse MoE Fusion module to adaptively select k experts from RGB-dominant, infrared-dominant, and interactive fusion experts, enabling trustworthy cross-modal interaction while suppressing unreliable fusion. Experiments on the public MBU benchmark under a YOLOv5s-family protocol show that LER-YOLO achieves 89.7 ± 0.2% AP50 over three independent seeds, with a best result of 89.9%. Extensive ablations, parameter-matched comparisons, synthetic-shift evaluations, and complexity analysis demonstrate that the gains mainly come from reliability-guided expert routing rather than increased model capacity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LER-YOLO's reliability-guided MoE routing delivers measurable gains on the MBU benchmark for misaligned RGB-IR UAV detection, though the reliability map itself receives no independent accuracy evaluation.

LER-YOLO uses a reliability map from an uncertainty-aware alignment module to control expert selection in a sparse MoE fusion block, targeting spatial misalignment between RGB and infrared images in UAV detection. It reaches 89.7% AP50 on the MBU benchmark, and the ablations support the claim that the routing helps. The novelty lies in linking the estimated reliability directly to the choice among RGB-dominant, infrared-dominant, and interactive experts so that unreliable fusion is suppressed. This is a reasonable way to deal with the mismatch artifacts that plague bimodal detectors, and the parameter-matched comparisons and synthetic-shift evaluations do a good job of separating the effect of guided routing from added model capacity.

On the downside, the reliability map is never checked on its own. No numbers show how well it predicts actual alignment quality or matches the synthetic misalignment masks. The end-to-end detection improvements are reported, but without that check the gains could trace back to the resampling or the MoE structure alone rather than to accurate reliability guidance.

This kind of work is useful for researchers focused on practical multimodal detection in remote sensing or UAV applications where sensor misalignment is common. A reader looking for ideas on adaptive fusion under uncertainty would get value from the specific routing mechanism and the benchmark results. The experiments are concrete and the problem is well motivated, so the paper deserves a serious referee even though additional validation of the map would strengthen it. I would recommend sending it for peer review.

Referee Report

1 major / 2 minor

Summary. The paper proposes LER-YOLO, a reliability-aware sparse mixture-of-experts framework for detecting small UAVs from spatially misaligned RGB-infrared remote-sensing pairs. It introduces an Uncertainty-Aware Target Alignment module that resamples RGB features to the IR reference while producing a spatial reliability map, which then guides a Reliability-Guided Sparse MoE Fusion module to select k experts (RGB-dominant, IR-dominant, and interactive) for trustworthy cross-modal interaction. On the public MBU benchmark under a YOLOv5s-family protocol, LER-YOLO reports 89.7 ± 0.2% AP50 (best run 89.9%) over three seeds; ablations, parameter-matched baselines, synthetic-shift tests, and complexity analysis are used to attribute gains primarily to the reliability-guided routing rather than added capacity.

Significance. If the reliability map is shown to be accurate, the approach provides a concrete mechanism for suppressing mismatch artifacts in bimodal UAV detection without discarding useful cross-modal information. The parameter-matched comparisons and synthetic-shift evaluations strengthen the case that the routing mechanism, rather than model size, drives the reported AP50 improvement. Reproducibility via multiple seeds and public benchmark use are positive; the work could influence future multimodal remote-sensing detectors if the map's trustworthiness is directly validated.

major comments (1)

Abstract and §4 (Experiments): The central claim that gains derive from reliability-guided expert routing (rather than capacity or alignment alone) depends on the spatial reliability map correctly identifying trustworthy local RGB-IR correspondence. No direct quantitative validation of the map is reported: there is no precision/recall against ground-truth alignment labels, no correlation analysis with synthetic-shift masks, and no ablation isolating map accuracy from the MoE architecture. This leaves open the possibility that end-to-end AP50 improvements arise from resampling or expert selection mechanics irrespective of map trustworthiness.

minor comments (2)

§3.2: The exact selection criterion for the k experts and the formulation of the reliability prior (e.g., how the map is thresholded or normalized before routing) should be stated with an equation or pseudocode for reproducibility.
Table 2 and Figure 4: Include standard deviations for all compared methods (not only LER-YOLO) and clarify whether the synthetic-shift tests use the same misalignment distribution as the MBU test set.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the constructive and detailed review. The major comment concerns the absence of direct quantitative validation for the spatial reliability map. We respond point-by-point below, clarifying the evidence already present in the manuscript while acknowledging where additional analysis can be provided.

Referee: Abstract and §4 (Experiments): The central claim that gains derive from reliability-guided expert routing (rather than capacity or alignment alone) depends on the spatial reliability map correctly identifying trustworthy local RGB-IR correspondence. No direct quantitative validation of the map is reported: there is no precision/recall against ground-truth alignment labels, no correlation analysis with synthetic-shift masks, and no ablation isolating map accuracy from the MoE architecture. This leaves open the possibility that end-to-end AP50 improvements arise from resampling or expert selection mechanics irrespective of map trustworthiness.

Authors: We agree that direct validation of the reliability map would strengthen the central claim. The MBU benchmark does not provide ground-truth local alignment labels, so precision/recall against such labels cannot be computed without new annotations. However, the synthetic-shift experiments introduce controlled, known misalignment patterns and show that performance gains appear specifically when the reliability map is used to guide expert routing; removing this guidance while retaining the MoE structure and alignment module leads to measurable drops. Parameter-matched baselines further isolate the routing mechanism from capacity increases. We will add a correlation analysis between the reliability maps and the synthetic-shift masks, plus an explicit ablation that disables only the reliability weighting inside the MoE, to the revised §4. This addresses the concern as far as the available data allow.

revision: partial
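A correlation analysis between reliability maps and synthetic-shift masks could take roughly the following shape. This is a sketch under stated assumptions: the `pearson` helper, the flattened six-location example, and the convention that 1 marks a synthetically shifted location are all hypothetical and not drawn from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / math.sqrt(vx * vy)

# Hypothetical flattened data for six locations.
shift_mask = [0, 0, 0, 1, 1, 1]                 # 1 = inside the shifted region
reliability = [0.9, 0.8, 0.85, 0.2, 0.3, 0.1]   # predicted trustworthiness

r = pearson(reliability, shift_mask)
```

A map that tracks misalignment should give a clearly negative `r` under this convention, since reliability should fall where the shift was applied; a value near zero would suggest the map carries little information about the injected shift.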

standing simulated objections not resolved

Direct precision/recall evaluation of the reliability map against ground-truth alignment labels, because the MBU benchmark provides no such per-pixel or per-region alignment annotations.

Circularity Check

0 steps flagged

No significant circularity; performance measured on external benchmark

The paper proposes an architectural change to YOLOv5s (Uncertainty-Aware Target Alignment plus Reliability-Guided Sparse MoE Fusion) and reports AP50 on the public MBU benchmark. The central claim that gains arise from reliability-guided routing rather than capacity is supported by parameter-matched ablations and synthetic-shift tests whose metrics are computed from standard detection evaluation protocols. No equation, module definition, or self-citation reduces the reported 89.7% AP50 or the routing decisions to a fitted parameter or prior result by construction; the reliability map is an internal estimate, and the paper does not claim that the final detection score proves its accuracy.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard deep-learning assumptions plus one domain-specific premise about alignment reliability and one routing hyperparameter.

free parameters (1)

k (experts selected per location)
Top-k routing hyperparameter in the sparse MoE; value not stated in abstract.

axioms (1)

Domain assumption: A spatial reliability map can be estimated from the alignment resampling process that meaningfully indicates cross-sensor trustworthiness.
Invoked when the reliability prior is fed to the MoE routing decision.
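The page quotes a self-supervised U-TA reliability loss whose visible part has the form R_ij ||F_ir - F~_rgb||_1 - lambda log(R_ij + eps), with the leading factor elided. The sketch below looks only at that per-location term to show why such a loss could yield a meaningful map: the first part rewards low reliability where features disagree, while the log part penalizes pushing reliability toward zero everywhere. The values of `lam` and `eps`, the grid search, and the reading of the second feature as the resampled RGB feature are assumptions for illustration.

```python
import math

def uta_term(r, d, lam=0.1, eps=1e-6):
    """Per-location term r * d - lam * log(r + eps).

    d stands for the L1 discrepancy between the infrared feature and the
    resampled RGB feature at one location (assumed interpretation).
    """
    return r * d - lam * math.log(r + eps)

def best_reliability(d, lam=0.1, steps=1000):
    """Grid-search the r in (0, 1] that minimizes the term for a fixed d."""
    grid = [i / steps for i in range(1, steps + 1)]
    return min(grid, key=lambda r: uta_term(r, d, lam))

small_gap = best_reliability(d=0.05)  # features agree: r is pushed to 1
large_gap = best_reliability(d=2.0)   # features disagree: r settles near lam / d
```

Ignoring `eps` and the upper bound, setting the derivative d - lam / r to zero gives r = lam / d, so under these assumptions the preferred reliability shrinks in inverse proportion to the local discrepancy.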

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Uncertainty-Aware Target Alignment module that resamples visible features toward the infrared reference and estimates a spatial reliability map... Reliability-Guided Sparse MoE Fusion module to adaptively select k experts"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "self-supervised U-TA reliability loss $\mathcal{L}_{\mathrm{uta}} = \dots\, R_{ij}\,\lVert F_{ir} - \tilde{F}_{rgb} \rVert_1 - \lambda \log(R_{ij} + \epsilon)$"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 1 internal anchor

[1]

You Only Look Once: Unified, Real-Time Object Detection

Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV , USA, 27–30 June 2016; pp. 779–788

work page 2016
[2]

Anti-UAV: A Large Multi-Modal Benchmark for UAV Tracking.arXiv2021, arXiv:2101.08466

Jiang, N.; Wang, K.; Peng, X.; Yu, X.; Wang, Q.; Xing, J.; Li, G.; Zhao, J.; Guo, G.; Han, Z.; et al. Anti-UAV: A Large Multi-Modal Benchmark for UAV Tracking.arXiv2021, arXiv:2101.08466

work page arXiv
[3]

MBUDet: Misaligned Bimodal UAV Target Detection via Target Offset Label Generation.Inf

Ye, Z.; Hao, H.; Peng, Y.; Tang, W.; Zhang, X.; Han, B.; Zhai, H. MBUDet: Misaligned Bimodal UAV Target Detection via Target Offset Label Generation.Inf. Fusion2026,127, 103756. https://doi.org/10.1016/j.inffus.2025.103756

work page doi:10.1016/j.inffus.2025.103756 2025
[4]

Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions.arXiv2025, arXiv:2504.11967

Dong, Y.; Wu, F.; Zhang, S.; Chen, G.; Hu, Y.; Yano, M.; Sun, J.; Huang, S.; Liu, F.; Dai, Q.; et al. Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions.arXiv2025, arXiv:2504.11967

work page arXiv
[5]

Infrared and Visible Camera Integration for Detection and Tracking of Small UAVs: Systematic Evaluation.Drones2024,8, 650

Pereira, A.; Warwick, S.; Moutinho, A.; Suleman, A. Infrared and Visible Camera Integration for Detection and Tracking of Small UAVs: Systematic Evaluation.Drones2024,8, 650. https://doi.org/10.3390/drones8110650

work page doi:10.3390/drones8110650
[6]

Drone Detection and Tracking in Real-Time by Fusion of Different Sensing Modalities.Drones2022,6, 317

Svanstrom, F.; Alonso-Fernandez, F.; Englund, C. Drone Detection and Tracking in Real-Time by Fusion of Different Sensing Modalities.Drones2022,6, 317. https://doi.org/10.3390/drones6110317

work page doi:10.3390/drones6110317
[7]

ITD-YOLOv8: An Infrared Target Detection Model Based on YOLOv8 for Unmanned Aerial Vehicles.Drones2024,8, 161

Zhao, X.; Zhang, W.; Zhang, H.; Zheng, C.; Ma, J.; Zhang, Z. ITD-YOLOv8: An Infrared Target Detection Model Based on YOLOv8 for Unmanned Aerial Vehicles.Drones2024,8, 161. https://doi.org/10.3390/drones8040161

work page doi:10.3390/drones8040161
[8]

G-YOLO: A Lightweight Infrared Aerial Remote Sensing Target Detection Model for UAVs Based on YOLOv8.Drones2024,8, 495

Zhao, X.; Zhang, W.; Xia, Y.; Zhang, H.; Zheng, C.; Ma, J.; Zhang, Z. G-YOLO: A Lightweight Infrared Aerial Remote Sensing Target Detection Model for UAVs Based on YOLOv8.Drones2024,8, 495. https://doi.org/10.3390/drones8090495

work page doi:10.3390/drones8090495
[9]

A Lightweight Real-Time Infrared Object Detection Model Based on YOLOv8 for Unmanned Aerial Vehicles.Drones2024,8, 479

Ding, B.; Zhang, Y.; Ma, S. A Lightweight Real-Time Infrared Object Detection Model Based on YOLOv8 for Unmanned Aerial Vehicles.Drones2024,8, 479. https://doi.org/10.3390/drones8090479

work page doi:10.3390/drones8090479
[10]

An All-Time Detection Algorithm for UAV Images in Urban Low Altitude.Drones2024,8,

Huang, Y.; Qu, J.; Wang, H.; Yang, J. An All-Time Detection Algorithm for UAV Images in Urban Low Altitude.Drones2024,8,

work page
[11]

https://doi.org/10.3390/drones8070332

work page doi:10.3390/drones8070332
[12]

MDDFA-Net: Multi-Scale Dynamic Feature Extraction from Drone- Acquired Thermal Infrared Imagery.Drones2025,9, 224

Wang, Z.; Dang, C.; Zhang, R.; Wang, L.; He, Y.; Wu, R. MDDFA-Net: Multi-Scale Dynamic Feature Extraction from Drone- Acquired Thermal Infrared Imagery.Drones2025,9, 224. https://doi.org/10.3390/drones9030224

work page doi:10.3390/drones9030224
[13]

Drone-Based Visible-Thermal Object Detection with Transformers and Prompt Tuning

Chen, R.; Li, D.; Gao, Z.; Kuai, Y.; Wang, C. Drone-Based Visible-Thermal Object Detection with Transformers and Prompt Tuning. Drones2024,8, 451. https://doi.org/10.3390/drones8090451

work page doi:10.3390/drones8090451
[14]

Single-Stage UAV Detection and Classification with YOLOv5: Mosaic Data Augmentation and PANet

Dadboud, F.; Patel, V .; Mehta, V .; Bolic, M.; Mantegh, I. Single-Stage UAV Detection and Classification with YOLOv5: Mosaic Data Augmentation and PANet. InProceedings of the 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Washington, DC, USA, 16–19 November 2021; pp. 1–8

work page 2021
[16]

Overview of UAV Target Detection Algorithms Based on Deep Learning

Dai, J.; Wu, L.; Wang, P . Overview of UAV Target Detection Algorithms Based on Deep Learning. InProceedings of the 2021 IEEE 2nd International Conference on Information Technology, Big Data and Artificial Intelligence (ICIBA), Chongqing, China, 17–19 December 2021; pp. 736–745

work page 2021
[17]

A Real-Time and Lightweight Method for Tiny Airborne Object Detection

Lyu, Y.; Liu, Z.; Li, H.; Guo, D.; Fu, Y. A Real-Time and Lightweight Method for Tiny Airborne Object Detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 18–22 June 2023; pp. 3016–3025

work page 2023
[18]

Investigation of UAV Detection in Images with Complex Backgrounds and Rainy Artifacts

Munir, A.; Siddiqui, A.J.; Anwar, S. Investigation of UAV Detection in Images with Complex Backgrounds and Rainy Artifacts. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), Waikoloa, HI, USA, 3–8 January 2024; pp. 232–241

work page 2024
[19]

Enhanced Thermal-RGB Fusion for Robust Object Detection

El Ahmar, W.; Massoud, Y.; Kolhatkar, D.; AlGhamdi, H.; Alja’Afreh, M.; Laganiere, R.; Hammoud, R. Enhanced Thermal-RGB Fusion for Robust Object Detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 18–22 June 2023; pp. 365–374

work page 2023
[20]

Cross-Modal Transformers for Infrared and Visible Image Fusion.IEEE Trans

Park, S.; Vien, A.G.; Lee, C. Cross-Modal Transformers for Infrared and Visible Image Fusion.IEEE Trans. Circuits Syst. Video Technol.2024,34, 770–785

work page 2024
[21]

Cross-Modal and Cross-Level Attention Interaction Network for Salient Object Detection.IEEE Trans

Wang, F.; Su, Y.; Wang, R.; Sun, J.; Sun, F.; Li, H. Cross-Modal and Cross-Level Attention Interaction Network for Salient Object Detection.IEEE Trans. Artif. Intell.2024,5, 2907–2920

work page 2024
[22]

Task-Customized Mixture of Adapters for General Image Fusion

Zhu, P .; Sun, Y.; Cao, B.; Hu, Q. Task-Customized Mixture of Adapters for General Image Fusion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 7099–7108

work page 2024
[23]

Weakly Misalignment-Free Adaptive Feature Alignment for UAVs-Based Multimodal Object Detection

Chen, C.; Qi, J.; Liu, X.; Bin, K.; Fu, R.; Hu, X.; Zhong, P . Weakly Misalignment-Free Adaptive Feature Alignment for UAVs-Based Multimodal Object Detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 26826–26835

work page 2024
[24]

Cross-Modal Offset-Guided Dynamic Alignment and Fusion for Weakly Aligned UAV Object Detection.arXiv2025, arXiv:2506.16737

Liu, Z.; Luo, H.; Wang, Z.; Wei, Y.; Zuo, H.; Zhang, J. Cross-Modal Offset-Guided Dynamic Alignment and Fusion for Weakly Aligned UAV Object Detection.arXiv2025, arXiv:2506.16737

work page arXiv
[25]

GM-DETR: Generalized Multispectral Detection Transformer with Efficient Fusion Encoder for Visible-Infrared Detection

Xiao, Y.; Meng, F.; Wu, Q.; Xu, L.; He, M.; Li, H. GM-DETR: Generalized Multispectral Detection Transformer with Efficient Fusion Encoder for Visible-Infrared Detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 17–21 June 2024; pp. 5541–5549

work page 2024
[26]

Deformable Convolutional Networks

Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. InProceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773

work page 2017
[27]

Spatial Transformer Networks

Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial Transformer Networks. InProceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 7–12 December 2015; pp. 2017–2025

work page 2015
[28]

Misaligned RGB-Infrared Object Detection via Adaptive Dual-Discrepancy Calibration.Remote Sens.2023,15, 4887

He, M.; Wu, Q.; Ngan, K.N.; Jiang, F.; Meng, F.; Xu, L. Misaligned RGB-Infrared Object Detection via Adaptive Dual-Discrepancy Calibration.Remote Sens.2023,15, 4887

work page 2023
[29]

Weakly Alignment-Free RGBT Salient Object Detection with Deep Correlation Network.IEEE Trans

Tu, Z.; Li, Z.; Li, C.; Tang, J. Weakly Alignment-Free RGBT Salient Object Detection with Deep Correlation Network.IEEE Trans. Image Process.2022,31, 3752–3764

work page 2022
[30]

Modality Registration and Object Search Framework for UAV-Based Unregistered RGB-T Image Salient Object Detection.IEEE Trans

Song, K.; Wen, H.; Xue, X.; Huang, L.; Ji, Y.; Yan, Y. Modality Registration and Object Search Framework for UAV-Based Unregistered RGB-T Image Salient Object Detection.IEEE Trans. Geosci. Remote Sens.2023,61, 1–15

work page 2023
[31]

Long-Term TalkingFace Generation via Motion-Prior Conditional Diffusion Model

Shen, F.; Wang, C.; Gao, J.; Guo, Q.; Dang, J.; Tang, J.; Chua, T.-S. Long-Term TalkingFace Generation via Motion-Prior Conditional Diffusion Model. InProceedings of the Forty-Second International Conference on Machine Learning (ICML), 2025

work page 2025
[32]

ImagPose: A Unified Conditional Framework for Pose-Guided Person Generation.Advances in Neural Information Processing Systems2024,37, 6246–6266

Shen, F.; Tang, J. ImagPose: A Unified Conditional Framework for Pose-Guided Person Generation.Advances in Neural Information Processing Systems2024,37, 6246–6266

work page
[33]

ImagDressing-v1: Customizable Virtual Dressing

Shen, F.; Jiang, X.; He, X.; Ye, H.; Wang, C.; Du, X.; Li, Z.; Tang, J. ImagDressing-v1: Customizable Virtual Dressing. InProceedings of the AAAI Conference on Artificial Intelligence, 2025; Volume 39, Number 7, pp. 6795–6804

work page 2025
[34]

Advancing Pose-Guided Image Synthesis with Progressive Conditional Diffusion Models

Shen, F.; Ye, H.; Zhang, J.; Wang, C.; Han, X.; Wei, Y. Advancing Pose-Guided Image Synthesis with Progressive Conditional Diffusion Models. InProceedings of the International Conference on Learning Representations (ICLR), 2024

work page 2024
[35]

Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models

Shen, F.; Ye, H.; Liu, S.; Zhang, J.; Wang, C.; Han, X.; Wei, Y. Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models. InProceedings of the AAAI Conference on Artificial Intelligence, 2025; Volume 39, Number 7, pp. 6785–6794

work page 2025
[36]

Kendall, A.; Gal, Y. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? InProceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 5574–5584

work page 2017
[37]

Cocoon: Robust Multi-Modal Perception with Uncertainty-Aware Sensor Fusion.arXiv2024, arXiv:2410.12592

Cho, M.; Cao, Y.; Sun, J.; Zhang, Q.; Pavone, M.; Park, J.J.; Yang, H.; Mao, Z.M. Cocoon: Robust Multi-Modal Perception with Uncertainty-Aware Sensor Fusion.arXiv2024, arXiv:2410.12592

work page arXiv
[39]

AdaMV-MoE: Adaptive Multi-Task Vision Mixture-of- Experts

Chen, T.; Chen, X.; Du, X.; Rashwan, A.; Yang, F.; Chen, H.; Wang, Z.; Li, Y. AdaMV-MoE: Adaptive Multi-Task Vision Mixture-of- Experts. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 17300–17311

work page 2023
[40]

YOLO Meets Mixture-of-Experts: Adaptive Expert Routing for Robust Object Detection.arXiv 2025, arXiv:2511.13344

Meiraz, O.; Shalev, S.; Weizman, A. YOLO Meets Mixture-of-Experts: Adaptive Expert Routing for Robust Object Detection.arXiv 2025, arXiv:2511.13344

work page arXiv 2025
[41]

MoE3D: Mixture of Experts Meets Multi-Modal 3D Understanding.arXiv 2025, arXiv:2511.22103

Li, Y.; Hou, Y.; Wei, Y.; Zhu, X.; Ma, Y.; Shao, W.; Guo, Y. MoE3D: Mixture of Experts Meets Multi-Modal 3D Understanding.arXiv 2025, arXiv:2511.22103

work page arXiv 2025
[42]

AW-MoE: All-Weather Mixture of Experts for Robust Multi-Modal 3D Object Detection

Lin, H.; Huang, X.; Wen, C.; Wang, C. AW-MoE: All-Weather Mixture of Experts for Robust Multi-Modal 3D Object Detection. arXiv2026, arXiv:2603.16261

work page arXiv
[43]

SuperYOLO: Super Resolution Assisted Object Detection in Multimodal Remote Sensing Imagery.IEEE Trans

Zhang, J.; Lei, J.; Xie, W.; Fang, Z.; Li, Y.; Du, Q. SuperYOLO: Super Resolution Assisted Object Detection in Multimodal Remote Sensing Imagery.IEEE Trans. Geosci. Remote Sens.2023,61, 1–15

work page 2023
[44]

DEYOLO: Dual-Feature-Enhancement YOLO for Cross-Modality Object Detection

Chen, Y.; Wang, B.; Guo, X.; Zhu, W.; He, J.; Liu, X.; Yuan, J. DEYOLO: Dual-Feature-Enhancement YOLO for Cross-Modality Object Detection. InProceedings of the International Conference on Pattern Recognition, Kolkata, India, 1–5 December 2024; pp. 236–252

work page 2024
[45]

CDC-YOLOFusion: Leveraging Cross-Scale Dynamic Convolution Fusion for Visible-Infrared Object Detection.IEEE Trans

Wang, Z.; Liao, X.; Yuan, J.; Yao, Y.; Li, Z. CDC-YOLOFusion: Leveraging Cross-Scale Dynamic Convolution Fusion for Visible-Infrared Object Detection.IEEE Trans. Intell. Veh.2024,10, 2080–2093

work page 2024
[46]

Cross-Modality Fusion Transformer for Multispectral Object Detection.arXiv2021, arXiv:2111.00273

Qingyun, F.; Dapeng, H.; Zhaokui, W. Cross-Modality Fusion Transformer for Multispectral Object Detection.arXiv2021, arXiv:2111.00273

work page arXiv
[47]

ICAFusion: Iterative Cross-Attention Guided Feature Fusion for Multispectral Object Detection.Pattern Recognit.2024,145, 109913

Shen, J.; Chen, Y.; Liu, Y.; Zuo, X.; Fan, H.; Yang, W. ICAFusion: Iterative Cross-Attention Guided Feature Fusion for Multispectral Object Detection.Pattern Recognit.2024,145, 109913

work page 2024
[48]

Multimodal Object Detection via Probabilistic Ensembling

Chen, Y.T.; Shi, J.; Ye, Z.; Mertz, C.; Ramanan, D.; Kong, S. Multimodal Object Detection via Probabilistic Ensembling. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 139–158

work page 2022
[49]

Xiao, Y.; Xu, T.; Xin, Y.; Li, J. FBRT-YOLO: Faster and Better for Real-Time Aerial Image Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; pp. 8673–8681.

[50]

Guo, J.; Gao, C.; Liu, F.; Meng, D. DPDETR: Decoupled Position Detection Transformer for Infrared-Visible Object Detection. arXiv 2025, arXiv:2408.06123.

[51]

Qin, H.; Xu, T.; Li, T.; Chen, Z.; Feng, T.; Li, J. MUST: The First Dataset and Unified Framework for Multispectral UAV Single Object Tracking. arXiv 2025, arXiv:2503.17699.

[52]

Lee, C.; Anderson, M.; Raganathan, N.; Zuo, X.; Do, K.; Gkioxari, G.; Chung, S.J. Caltech Aerial RGB-Thermal Dataset in the Wild. arXiv 2024, arXiv:2403.08997.

[53]

Bin, K.; Chen, C.; Hu, T.; Qi, J.; Zhong, P. ATR-UMMIM: A Benchmark Dataset for UAV-Based Multimodal Image Registration under Complex Imaging Conditions. arXiv 2025, arXiv:2507.20764.

[54]

Chen, C.; Bin, K.; Hu, T.; Qi, J.; Liu, X.; Liu, T.; Liu, Z.; Liu, Y.; Zhong, P. Fusion Meets Diverse Conditions: A High-Diversity Benchmark and Baseline for UAV-Based Multimodal Object Detection with Condition Cues. arXiv 2025, arXiv:2510.13620.

[55]

Huang, L.; Li, Y.; Zhang, S. High-Altitude Infrared Thermal Object Detection for UAVs Based on an Improved RT-DETR. In Proceedings of the 2025 International Conference on Computer and Information Processing Technology, 2025; pp. 316–321.

[56]

Xie, B.; Zhang, C.; Wang, F.; Liu, P.; Lu, F.; Chen, Z.; Hu, W. CST Anti-UAV: A Thermal Infrared Benchmark for Tiny UAV Tracking in Complex Scenes. arXiv 2025, arXiv:2507.23473.

[57]

Yang, Z.; Yu, H.; Zhang, J.; Tang, Q.; Mian, A. Deep Learning Based Infrared Small Object Segmentation: Challenges and Future Directions. Inf. Fusion 2025, 118, 103007.

[58]

Ying, X.; Xiao, C.; An, W.; Li, R.; He, X.; Li, B.; Cao, X.; Li, Z.; Wang, Y.; Hu, M.; et al. Visible-Thermal Tiny Object Detection: A Benchmark Dataset and Baselines. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 6088–6096.

[59]

Li, S.; Liu, Z.; Hong, Z.; Zhou, Z.; Cao, X. DEPFusion: Dual-Domain Enhancement and Priority-Guided Mamba Fusion for UAV Multispectral Object Detection. arXiv 2025, arXiv:2509.07327.

[60]

Liu, C.; Ma, X.; Yang, X.; Zhang, Y.; Dong, Y. COMO: Cross-Mamba Interaction and Offset-Guided Fusion for Multimodal Object Detection. arXiv 2024, arXiv:2412.18076.

[61]

Nguyen, T.L.; Tran, C.T.; Nguyen Thi, H.V. Cf-Yolo: Cross-Modal Fusion for Weakly Aligned RGB-IR UAV Object Detection. In Proceedings of the 2025 International Symposium on Communications and Information Technologies, 2025; pp. 254–259.

[62]

Zuo, X.; Qu, C.; Zhan, H.; Shen, J.; Yang, W. SFFR: Spatial-Frequency Feature Reconstruction for Multispectral Aerial Object Detection. arXiv 2025, arXiv:2511.06298.

[63]

Li, H.; Liu, J.; Zhang, Y.; Liu, Y. A Deep Learning Framework for Infrared and Visible Image Fusion without Strict Registration. Int. J. Comput. Vis. 2024, 132, 1625–1644.

[65]

Xie, H.; Zhang, Y.; Qiu, J.; Zhai, X.; Liu, X.; Yang, Y.; Zhao, S.; Luo, Y.; Zhong, J. Semantics Lead All: Towards Unified Image Registration and Fusion from a Semantic Perspective. Inf. Fusion 2023, 98, 101835.

[66]

Yuan, M.; Wei, X.; Xingxing. C2Former: Calibrated and Complementary Transformer for RGB-Infrared Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–12.

[67]

Yuan, M.; Shi, X.; Wang, N.; Wang, Y.; Wei, X. Improving RGB-Infrared Object Detection with Cascade Alignment-Guided Transformer. Inf. Fusion 2024, 105, 102246.

[68]

Yuan, M.; Wang, Y.; Wei, X. Translation, Scale and Rotation: Cross-Modal Alignment Meets RGB-Infrared Vehicle Detection. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 509–525.

[69]

Law, H.; Deng, J. CornerNet: Detecting Objects as Paired Keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750.

[70]

Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint Triplets for Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578.

[71]

Hashemi, N.S.; Aghdam, R.B.; Ghiasi, A.S.B.; Fatemi, P. Template Matching Advances and Applications in Image Analysis. arXiv 2016, arXiv:1610.07231.

[72]

Gross, M.; Matha, S.B.; Song, R.; Muthuveerappan, V.; Christoph, C.; Huber, J.; Cremers, D. SegFly: A 2D-3D-2D Paradigm for Aerial RGB-Thermal Semantic Segmentation at Scale. arXiv 2026, arXiv:2603.17920.
