arxiv: 2605.20667 · v1 · pith:PEWOS2I7new · submitted 2026-05-20 · 💻 cs.CV

LER-YOLO: Reliability-Aware Expert Routing for Misaligned RGB-Infrared UAV Detection

Pith reviewed 2026-05-21 05:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords UAV detection · RGB-infrared fusion · mixture of experts · spatial misalignment · reliability map · target alignment · remote sensing

The pith

A spatial reliability map from target alignment lets sparse MoE fusion suppress unreliable RGB-infrared matches for UAV detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Detecting small UAVs in RGB-infrared remote sensing is difficult because spatial misalignment between the two sensors creates local mismatches that standard fusion methods propagate into the detector. The paper establishes that first resampling RGB features to an infrared reference and estimating a per-location trustworthiness map allows the system to know which cross-sensor correspondences are safe to use. This map then controls a sparse mixture-of-experts fusion block that picks among RGB-dominant, infrared-dominant, and interactive experts on a per-region basis. The result is trustworthy cross-modal interaction without letting mismatch artifacts reach the detection head. If the approach holds, detectors can handle real-world misalignment more gracefully while keeping model size comparable to a standard YOLOv5s.

Core claim

The central claim is that an Uncertainty-Aware Target Alignment module produces a spatial reliability map by resampling visible features toward the infrared reference, and a Reliability-Guided Sparse MoE Fusion module then uses this map to adaptively route to k experts drawn from RGB-dominant, infrared-dominant, and interactive fusion experts, enabling suppression of unreliable fusion while preserving useful information and yielding 89.7 percent average AP50 on the MBU benchmark.

What carries the argument

The Uncertainty-Aware Target Alignment module that generates the spatial reliability map, combined with the Reliability-Guided Sparse MoE Fusion module that uses the map to select and weight experts.
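
As a rough illustration of how a reliability value could steer sparse routing at a single spatial location, the sketch below picks the top-k of three experts and renormalizes their weights. The paper's actual gating formula is not reproduced on this page, so the function name `route_experts`, the `penalty` term, and the rule that low reliability lowers only the interactive expert's score are all assumptions made for illustration.

```python
import math

EXPERTS = ("rgb", "ir", "interactive")

def route_experts(logits, reliability, k=2, penalty=4.0):
    """Select k experts at one location and return their normalized weights.

    logits: raw router scores ordered as in EXPERTS.
    reliability: value in [0, 1] read from the reliability map.
    penalty: hypothetical strength of the reliability prior (an assumption).
    """
    adjusted = list(logits)
    # Assumption: low reliability discourages the cross-modal expert only.
    adjusted[2] -= penalty * (1.0 - reliability)
    # Keep the indices of the k highest adjusted scores.
    top = sorted(range(len(adjusted)), key=lambda i: adjusted[i], reverse=True)[:k]
    # Softmax restricted to the selected experts (sparse gating).
    peak = max(adjusted[i] for i in top)
    exps = {i: math.exp(adjusted[i] - peak) for i in top}
    total = sum(exps.values())
    return {EXPERTS[i]: exps[i] / total for i in top}

# Same router scores, two different reliability readings.
aligned = route_experts([0.5, 0.6, 1.0], reliability=1.0)
misaligned = route_experts([0.5, 0.6, 1.0], reliability=0.0)
```

With these toy scores, the interactive expert is selected where the map reports full reliability and dropped in favor of the two single-modality experts where it reports none. That is the qualitative behavior the page describes, not a claim about the paper's exact weights.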

If this is right

  • Detection reaches 89.7 percent AP50 with 0.2 percent standard deviation across three independent seeds and a best run of 89.9 percent.
  • Gains arise from the reliability-guided routing mechanism rather than from added model capacity.
  • Unreliable cross-modal interactions are suppressed while useful information from either modality is retained.
  • Performance remains stable under synthetic spatial shifts that simulate varying degrees of misalignment.
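
The shift protocol used in the paper is not described on this page. One simple way to simulate misalignment, assuming integer pixel translations with constant fill, is sketched below; `shift_grid` and its parameters are hypothetical names.

```python
def shift_grid(grid, dx, dy, fill=0.0):
    """Translate a 2D grid right by dx and down by dy, filling exposed cells."""
    height, width = len(grid), len(grid[0])
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            src_y, src_x = y - dy, x - dx
            # Copy only where the source pixel lies inside the original grid.
            if 0 <= src_y < height and 0 <= src_x < width:
                out[y][x] = grid[src_y][src_x]
    return out

rgb_channel = [[1, 2, 3],
               [4, 5, 6]]
shifted_right = shift_grid(rgb_channel, dx=1, dy=0)
shifted_down = shift_grid(rgb_channel, dx=0, dy=1)
```

Applying such a shift to only one modality and sweeping `dx` and `dy` gives a controlled range of misalignment levels against which a stability claim can be checked.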

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reliability-guided routing could be tested on other misaligned multi-modal tasks such as visible-thermal pedestrian detection or satellite-ground fusion.
  • If the reliability map correlates with actual geometric error, the method might reduce the need for hardware-level sensor calibration in field deployments.
  • Applying the routing layer to backbones other than YOLOv5s would test whether the benefit is tied to the particular detection architecture.

Load-bearing premise

The spatial reliability map produced by the Uncertainty-Aware Target Alignment module accurately reflects the trustworthiness of local cross-sensor correspondence and can be used to safely suppress unreliable fusion without discarding useful information.

What would settle it

Replace the learned spatial reliability map with a uniform or random map of equal average value and check whether the AP50 gain over a parameter-matched baseline disappears on the MBU benchmark.
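
A minimal sketch of the two control maps this test calls for, assuming the reliability map is available as a 2D list of floats. The helper names are hypothetical, and shuffling is only one way to build a "random" map: it preserves every value (and hence the mean) while destroying the spatial structure.

```python
import random

def map_mean(rel_map):
    """Average value over all locations of a 2D map."""
    return sum(sum(row) for row in rel_map) / (len(rel_map) * len(rel_map[0]))

def uniform_control(rel_map):
    """Constant map with the same mean as the learned map."""
    mean = map_mean(rel_map)
    return [[mean] * len(row) for row in rel_map]

def shuffled_control(rel_map, seed=0):
    """Same values as the learned map, randomly reassigned to locations."""
    flat = [value for row in rel_map for value in row]
    random.Random(seed).shuffle(flat)
    width = len(rel_map[0])
    return [flat[i:i + width] for i in range(0, len(flat), width)]

# Toy map for illustration only; these are not values from the paper.
rel = [[0.9, 0.8, 0.1],
       [0.7, 0.2, 0.95]]
uniform = uniform_control(rel)
shuffled = shuffled_control(rel)
```

If the detector fed with `uniform` or `shuffled` matches the one fed with the learned map, the spatial content of the map is not what drives the gain.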

Figures

Figures reproduced from arXiv: 2605.20667 by Hexiang Hao, Ji Wang, Liming Hou, Wei Tang, Xin Ying, Xuekai Zhang, Yubo He, Yueping Peng, Zecong Ye.

Figure 1. Overall architecture of MoE-MBUDet, a YOLOv8-based anti-drone detection framework formed …
Diagram labels recoverable from the extraction: visible and infrared input images, CSPDarknet backbones with non-shared weights, P3 to P5 feature levels, dynamic gating router, experts E1 to EK with weights w1 to wK, top-k expert fusion, YOLOv5 detection head, tiny UAV localization and classification.
Figure 2. Detailed architecture of the Mixture-of-Experts Fusion mechanism (MoEFusion) for anti-drone detection. Reliability-Guided Sparse MoE Fusion: the gating router uses the reliability prior to …
Figure 6. Representative RGB-infrared examples and measured expert weights under daytime, dark, and strong-backlight scenes.
Adjacent text captured with the figure (§4.10, Discussion and Limitations): The experimental results support three observations. First, RGB-only and infrared-only detection both provide useful single-modality evidence on the MBU benchmark, with infrared remaining slightly stronger under the infrared-reference protocol. This is consiste…
Original abstract

Detecting small unmanned aerial vehicles from RGB-infrared remote-sensing pairs remains challenging due to tiny target scale, cluttered backgrounds, and spatial misalignment between heterogeneous sensors. Existing bimodal detectors often align or fuse features without assessing the reliability of local cross-sensor correspondence, allowing mismatch artifacts to propagate into the detection head. To address this issue, we propose LER-YOLO, a reliability-aware sparse mixture-of-experts framework for misaligned RGB-infrared UAV detection. LER-YOLO first introduces an Uncertainty-Aware Target Alignment module that resamples visible features toward the infrared reference and estimates a spatial reliability map. This reliability prior is then used by a Reliability-Guided Sparse MoE Fusion module to adaptively select k experts from RGB-dominant, infrared-dominant, and interactive fusion experts, enabling trustworthy cross-modal interaction while suppressing unreliable fusion. Experiments on the public MBU benchmark under a YOLOv5s-family protocol show that LER-YOLO achieves 89.7 ± 0.2% AP50 over three independent seeds, with a best result of 89.9%. Extensive ablations, parameter-matched comparisons, synthetic-shift evaluations, and complexity analysis demonstrate that the gains mainly come from reliability-guided expert routing rather than increased model capacity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes LER-YOLO, a reliability-aware sparse mixture-of-experts framework for detecting small UAVs from spatially misaligned RGB-infrared remote-sensing pairs. It introduces an Uncertainty-Aware Target Alignment module that resamples RGB features to the IR reference while producing a spatial reliability map, which then guides a Reliability-Guided Sparse MoE Fusion module to select k experts (RGB-dominant, IR-dominant, and interactive) for trustworthy cross-modal interaction. On the public MBU benchmark under a YOLOv5s-family protocol, LER-YOLO reports 89.7 ± 0.2% AP50 (best run 89.9%) over three seeds; ablations, parameter-matched baselines, synthetic-shift tests, and complexity analysis are used to attribute gains primarily to the reliability-guided routing rather than added capacity.

Significance. If the reliability map is shown to be accurate, the approach provides a concrete mechanism for suppressing mismatch artifacts in bimodal UAV detection without discarding useful cross-modal information. The parameter-matched comparisons and synthetic-shift evaluations strengthen the case that the routing mechanism, rather than model size, drives the reported AP50 improvement. Reproducibility via multiple seeds and public benchmark use are positive; the work could influence future multimodal remote-sensing detectors if the map's trustworthiness is directly validated.

major comments (1)
  1. [Abstract and §4] Abstract and §4 (Experiments): The central claim that gains derive from reliability-guided expert routing (rather than capacity or alignment alone) depends on the spatial reliability map correctly identifying trustworthy local RGB-IR correspondence. No direct quantitative validation of the map is reported—e.g., no precision/recall against ground-truth alignment labels, no correlation analysis with synthetic-shift masks, and no ablation isolating map accuracy from the MoE architecture—leaving open the possibility that end-to-end AP50 improvements arise from resampling or expert selection mechanics irrespective of map trustworthiness.
minor comments (2)
  1. [§3.2] §3.2: The exact selection criterion for the k experts and the formulation of the reliability prior (e.g., how the map is thresholded or normalized before routing) should be stated with an equation or pseudocode for reproducibility.
  2. [Table 2 and Figure 4] Table 2 and Figure 4: Include standard deviations for all compared methods (not only LER-YOLO) and clarify whether the synthetic-shift tests use the same misalignment distribution as the MBU test set.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the constructive and detailed review. The major comment concerns the absence of direct quantitative validation for the spatial reliability map. We respond point-by-point below, clarifying the evidence already present in the manuscript while acknowledging where additional analysis can be provided.

  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): The central claim that gains derive from reliability-guided expert routing (rather than capacity or alignment alone) depends on the spatial reliability map correctly identifying trustworthy local RGB-IR correspondence. No direct quantitative validation of the map is reported—e.g., no precision/recall against ground-truth alignment labels, no correlation analysis with synthetic-shift masks, and no ablation isolating map accuracy from the MoE architecture—leaving open the possibility that end-to-end AP50 improvements arise from resampling or expert selection mechanics irrespective of map trustworthiness.

    Authors: We agree that direct validation of the reliability map would strengthen the central claim. The MBU benchmark does not provide ground-truth local alignment labels, so precision/recall against such labels cannot be computed without new annotations. However, the synthetic-shift experiments introduce controlled, known misalignment patterns and show that performance gains appear specifically when the reliability map is used to guide expert routing; removing this guidance while retaining the MoE structure and alignment module leads to measurable drops. Parameter-matched baselines further isolate the routing mechanism from capacity increases. We will add a correlation analysis between the reliability maps and the synthetic-shift masks, plus an explicit ablation that disables only the reliability weighting inside the MoE, to the revised §4. This addresses the concern as far as the available data allow. revision: partial
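
The correlation analysis promised in this response could take a form like the sketch below, assuming the reliability map and the synthetic-shift mask are flattened to equal-length lists with mask value 1 marking shifted locations. The function is a plain Pearson correlation written out by hand, and the toy values are illustrative only, not results from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / math.sqrt(var_x * var_y)

# Toy example: reliability is high where nothing was shifted (mask 0)
# and low where a synthetic shift was applied (mask 1).
reliability = [0.9, 0.8, 0.2, 0.1]
shift_mask = [0, 0, 1, 1]
r = pearson(reliability, shift_mask)
```

A strongly negative `r` on held-out synthetic shifts would be evidence that the map tracks misalignment. A value near zero would support the referee's concern that the detection gain does not depend on map accuracy.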

standing simulated objections not resolved
  • Direct precision/recall evaluation of the reliability map against ground-truth alignment labels, because the MBU benchmark provides no such per-pixel or per-region alignment annotations.

Circularity Check

0 steps flagged

No significant circularity; performance measured on external benchmark

The paper proposes an architectural change to YOLOv5s (Uncertainty-Aware Target Alignment plus Reliability-Guided Sparse MoE Fusion) and reports AP50 on the public MBU benchmark. The central claim that gains arise from reliability-guided routing rather than capacity is supported by parameter-matched ablations and synthetic-shift tests whose metrics are computed from standard detection evaluation protocols. No equation, module definition, or self-citation reduces the reported 89.7% AP50 or the routing decisions to a fitted parameter or prior result by construction; the reliability map is an internal estimate whose accuracy is not claimed to be proven by the final detection score itself.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard deep-learning assumptions plus one domain-specific premise about alignment reliability.

free parameters (1)
  • k (experts selected per location)
    Top-k routing hyperparameter in the sparse MoE; value not stated in abstract.
axioms (1)
  • domain assumption: A spatial reliability map can be estimated from the alignment resampling process that meaningfully indicates cross-sensor trustworthiness.
    Invoked when the reliability prior is fed to the MoE routing decision.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 1 internal anchor

  1. [1]

    You Only Look Once: Unified, Real-Time Object Detection

    Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV , USA, 27–30 June 2016; pp. 779–788

  2. [2]

    Anti-UAV: A Large Multi-Modal Benchmark for UAV Tracking.arXiv2021, arXiv:2101.08466

    Jiang, N.; Wang, K.; Peng, X.; Yu, X.; Wang, Q.; Xing, J.; Li, G.; Zhao, J.; Guo, G.; Han, Z.; et al. Anti-UAV: A Large Multi-Modal Benchmark for UAV Tracking.arXiv2021, arXiv:2101.08466

  3. [3]

    MBUDet: Misaligned Bimodal UAV Target Detection via Target Offset Label Generation.Inf

    Ye, Z.; Hao, H.; Peng, Y.; Tang, W.; Zhang, X.; Han, B.; Zhai, H. MBUDet: Misaligned Bimodal UAV Target Detection via Target Offset Label Generation.Inf. Fusion2026,127, 103756. https://doi.org/10.1016/j.inffus.2025.103756

  4. [4]

    Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions.arXiv2025, arXiv:2504.11967

    Dong, Y.; Wu, F.; Zhang, S.; Chen, G.; Hu, Y.; Yano, M.; Sun, J.; Huang, S.; Liu, F.; Dai, Q.; et al. Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions.arXiv2025, arXiv:2504.11967

  5. [5]

    Infrared and Visible Camera Integration for Detection and Tracking of Small UAVs: Systematic Evaluation.Drones2024,8, 650

    Pereira, A.; Warwick, S.; Moutinho, A.; Suleman, A. Infrared and Visible Camera Integration for Detection and Tracking of Small UAVs: Systematic Evaluation.Drones2024,8, 650. https://doi.org/10.3390/drones8110650

  6. [6]

    Drone Detection and Tracking in Real-Time by Fusion of Different Sensing Modalities.Drones2022,6, 317

    Svanstrom, F.; Alonso-Fernandez, F.; Englund, C. Drone Detection and Tracking in Real-Time by Fusion of Different Sensing Modalities.Drones2022,6, 317. https://doi.org/10.3390/drones6110317

  7. [7]

    ITD-YOLOv8: An Infrared Target Detection Model Based on YOLOv8 for Unmanned Aerial Vehicles.Drones2024,8, 161

    Zhao, X.; Zhang, W.; Zhang, H.; Zheng, C.; Ma, J.; Zhang, Z. ITD-YOLOv8: An Infrared Target Detection Model Based on YOLOv8 for Unmanned Aerial Vehicles.Drones2024,8, 161. https://doi.org/10.3390/drones8040161

  8. [8]

    G-YOLO: A Lightweight Infrared Aerial Remote Sensing Target Detection Model for UAVs Based on YOLOv8.Drones2024,8, 495

    Zhao, X.; Zhang, W.; Xia, Y.; Zhang, H.; Zheng, C.; Ma, J.; Zhang, Z. G-YOLO: A Lightweight Infrared Aerial Remote Sensing Target Detection Model for UAVs Based on YOLOv8.Drones2024,8, 495. https://doi.org/10.3390/drones8090495

  9. [9]

    A Lightweight Real-Time Infrared Object Detection Model Based on YOLOv8 for Unmanned Aerial Vehicles.Drones2024,8, 479

    Ding, B.; Zhang, Y.; Ma, S. A Lightweight Real-Time Infrared Object Detection Model Based on YOLOv8 for Unmanned Aerial Vehicles.Drones2024,8, 479. https://doi.org/10.3390/drones8090479

  10. [10]

    An All-Time Detection Algorithm for UAV Images in Urban Low Altitude.Drones2024,8,

    Huang, Y.; Qu, J.; Wang, H.; Yang, J. An All-Time Detection Algorithm for UAV Images in Urban Low Altitude.Drones2024,8,

  11. [11]

    https://doi.org/10.3390/drones8070332

  12. [12]

    MDDFA-Net: Multi-Scale Dynamic Feature Extraction from Drone- Acquired Thermal Infrared Imagery.Drones2025,9, 224

    Wang, Z.; Dang, C.; Zhang, R.; Wang, L.; He, Y.; Wu, R. MDDFA-Net: Multi-Scale Dynamic Feature Extraction from Drone- Acquired Thermal Infrared Imagery.Drones2025,9, 224. https://doi.org/10.3390/drones9030224

  13. [13]

    Drone-Based Visible-Thermal Object Detection with Transformers and Prompt Tuning

    Chen, R.; Li, D.; Gao, Z.; Kuai, Y.; Wang, C. Drone-Based Visible-Thermal Object Detection with Transformers and Prompt Tuning. Drones2024,8, 451. https://doi.org/10.3390/drones8090451

  14. [14]

    Single-Stage UAV Detection and Classification with YOLOv5: Mosaic Data Augmentation and PANet

    Dadboud, F.; Patel, V .; Mehta, V .; Bolic, M.; Mantegh, I. Single-Stage UAV Detection and Classification with YOLOv5: Mosaic Data Augmentation and PANet. InProceedings of the 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Washington, DC, USA, 16–19 November 2021; pp. 1–8

  15. [16]

    Overview of UAV Target Detection Algorithms Based on Deep Learning

    Dai, J.; Wu, L.; Wang, P . Overview of UAV Target Detection Algorithms Based on Deep Learning. InProceedings of the 2021 IEEE 2nd International Conference on Information Technology, Big Data and Artificial Intelligence (ICIBA), Chongqing, China, 17–19 December 2021; pp. 736–745

  16. [17]

    A Real-Time and Lightweight Method for Tiny Airborne Object Detection

    Lyu, Y.; Liu, Z.; Li, H.; Guo, D.; Fu, Y. A Real-Time and Lightweight Method for Tiny Airborne Object Detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 18–22 June 2023; pp. 3016–3025

  17. [18]

    Investigation of UAV Detection in Images with Complex Backgrounds and Rainy Artifacts

    Munir, A.; Siddiqui, A.J.; Anwar, S. Investigation of UAV Detection in Images with Complex Backgrounds and Rainy Artifacts. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), Waikoloa, HI, USA, 3–8 January 2024; pp. 232–241

  18. [19]

    Enhanced Thermal-RGB Fusion for Robust Object Detection

    El Ahmar, W.; Massoud, Y.; Kolhatkar, D.; AlGhamdi, H.; Alja’Afreh, M.; Laganiere, R.; Hammoud, R. Enhanced Thermal-RGB Fusion for Robust Object Detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 18–22 June 2023; pp. 365–374

  19. [20]

    Cross-Modal Transformers for Infrared and Visible Image Fusion.IEEE Trans

    Park, S.; Vien, A.G.; Lee, C. Cross-Modal Transformers for Infrared and Visible Image Fusion.IEEE Trans. Circuits Syst. Video Technol.2024,34, 770–785

  20. [21]

    Cross-Modal and Cross-Level Attention Interaction Network for Salient Object Detection.IEEE Trans

    Wang, F.; Su, Y.; Wang, R.; Sun, J.; Sun, F.; Li, H. Cross-Modal and Cross-Level Attention Interaction Network for Salient Object Detection.IEEE Trans. Artif. Intell.2024,5, 2907–2920

  21. [22]

    Task-Customized Mixture of Adapters for General Image Fusion

    Zhu, P .; Sun, Y.; Cao, B.; Hu, Q. Task-Customized Mixture of Adapters for General Image Fusion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 7099–7108

  22. [23]

    Weakly Misalignment-Free Adaptive Feature Alignment for UAVs-Based Multimodal Object Detection

    Chen, C.; Qi, J.; Liu, X.; Bin, K.; Fu, R.; Hu, X.; Zhong, P . Weakly Misalignment-Free Adaptive Feature Alignment for UAVs-Based Multimodal Object Detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 26826–26835

  23. [24]

    Cross-Modal Offset-Guided Dynamic Alignment and Fusion for Weakly Aligned UAV Object Detection.arXiv2025, arXiv:2506.16737

    Liu, Z.; Luo, H.; Wang, Z.; Wei, Y.; Zuo, H.; Zhang, J. Cross-Modal Offset-Guided Dynamic Alignment and Fusion for Weakly Aligned UAV Object Detection.arXiv2025, arXiv:2506.16737

  24. [25]

    GM-DETR: Generalized Multispectral Detection Transformer with Efficient Fusion Encoder for Visible-Infrared Detection

    Xiao, Y.; Meng, F.; Wu, Q.; Xu, L.; He, M.; Li, H. GM-DETR: Generalized Multispectral Detection Transformer with Efficient Fusion Encoder for Visible-Infrared Detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 17–21 June 2024; pp. 5541–5549

  25. [26]

    Deformable Convolutional Networks

    Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. InProceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773

  26. [27]

    Spatial Transformer Networks

    Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial Transformer Networks. InProceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 7–12 December 2015; pp. 2017–2025

  27. [28]

    Misaligned RGB-Infrared Object Detection via Adaptive Dual-Discrepancy Calibration.Remote Sens.2023,15, 4887

    He, M.; Wu, Q.; Ngan, K.N.; Jiang, F.; Meng, F.; Xu, L. Misaligned RGB-Infrared Object Detection via Adaptive Dual-Discrepancy Calibration.Remote Sens.2023,15, 4887

  28. [29]

    Weakly Alignment-Free RGBT Salient Object Detection with Deep Correlation Network.IEEE Trans

    Tu, Z.; Li, Z.; Li, C.; Tang, J. Weakly Alignment-Free RGBT Salient Object Detection with Deep Correlation Network.IEEE Trans. Image Process.2022,31, 3752–3764

  29. [30]

    Modality Registration and Object Search Framework for UAV-Based Unregistered RGB-T Image Salient Object Detection.IEEE Trans

    Song, K.; Wen, H.; Xue, X.; Huang, L.; Ji, Y.; Yan, Y. Modality Registration and Object Search Framework for UAV-Based Unregistered RGB-T Image Salient Object Detection.IEEE Trans. Geosci. Remote Sens.2023,61, 1–15

  30. [31]

    Long-Term TalkingFace Generation via Motion-Prior Conditional Diffusion Model

    Shen, F.; Wang, C.; Gao, J.; Guo, Q.; Dang, J.; Tang, J.; Chua, T.-S. Long-Term TalkingFace Generation via Motion-Prior Conditional Diffusion Model. InProceedings of the Forty-Second International Conference on Machine Learning (ICML), 2025

  31. [32]

    ImagPose: A Unified Conditional Framework for Pose-Guided Person Generation.Advances in Neural Information Processing Systems2024,37, 6246–6266

    Shen, F.; Tang, J. ImagPose: A Unified Conditional Framework for Pose-Guided Person Generation.Advances in Neural Information Processing Systems2024,37, 6246–6266

  32. [33]

    ImagDressing-v1: Customizable Virtual Dressing

    Shen, F.; Jiang, X.; He, X.; Ye, H.; Wang, C.; Du, X.; Li, Z.; Tang, J. ImagDressing-v1: Customizable Virtual Dressing. InProceedings of the AAAI Conference on Artificial Intelligence, 2025; Volume 39, Number 7, pp. 6795–6804

  33. [34]

    Advancing Pose-Guided Image Synthesis with Progressive Conditional Diffusion Models

    Shen, F.; Ye, H.; Zhang, J.; Wang, C.; Han, X.; Wei, Y. Advancing Pose-Guided Image Synthesis with Progressive Conditional Diffusion Models. InProceedings of the International Conference on Learning Representations (ICLR), 2024

  34. [35]

    Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models

    Shen, F.; Ye, H.; Liu, S.; Zhang, J.; Wang, C.; Han, X.; Wei, Y. Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models. InProceedings of the AAAI Conference on Artificial Intelligence, 2025; Volume 39, Number 7, pp. 6785–6794

  35. [36]

    Kendall, A.; Gal, Y. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? InProceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 5574–5584

  36. [37]

    Cocoon: Robust Multi-Modal Perception with Uncertainty-Aware Sensor Fusion.arXiv2024, arXiv:2410.12592

    Cho, M.; Cao, Y.; Sun, J.; Zhang, Q.; Pavone, M.; Park, J.J.; Yang, H.; Mao, Z.M. Cocoon: Robust Multi-Modal Perception with Uncertainty-Aware Sensor Fusion.arXiv2024, arXiv:2410.12592

  37. [39]

    AdaMV-MoE: Adaptive Multi-Task Vision Mixture-of- Experts

    Chen, T.; Chen, X.; Du, X.; Rashwan, A.; Yang, F.; Chen, H.; Wang, Z.; Li, Y. AdaMV-MoE: Adaptive Multi-Task Vision Mixture-of- Experts. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 17300–17311

  38. [40]

    YOLO Meets Mixture-of-Experts: Adaptive Expert Routing for Robust Object Detection.arXiv 2025, arXiv:2511.13344

    Meiraz, O.; Shalev, S.; Weizman, A. YOLO Meets Mixture-of-Experts: Adaptive Expert Routing for Robust Object Detection.arXiv 2025, arXiv:2511.13344

  39. [41]

    MoE3D: Mixture of Experts Meets Multi-Modal 3D Understanding.arXiv 2025, arXiv:2511.22103

    Li, Y.; Hou, Y.; Wei, Y.; Zhu, X.; Ma, Y.; Shao, W.; Guo, Y. MoE3D: Mixture of Experts Meets Multi-Modal 3D Understanding.arXiv 2025, arXiv:2511.22103

  40. [42]

    AW-MoE: All-Weather Mixture of Experts for Robust Multi-Modal 3D Object Detection

    Lin, H.; Huang, X.; Wen, C.; Wang, C. AW-MoE: All-Weather Mixture of Experts for Robust Multi-Modal 3D Object Detection. arXiv2026, arXiv:2603.16261

  41. [43]

    SuperYOLO: Super Resolution Assisted Object Detection in Multimodal Remote Sensing Imagery.IEEE Trans

    Zhang, J.; Lei, J.; Xie, W.; Fang, Z.; Li, Y.; Du, Q. SuperYOLO: Super Resolution Assisted Object Detection in Multimodal Remote Sensing Imagery.IEEE Trans. Geosci. Remote Sens.2023,61, 1–15

  42. [44]

    DEYOLO: Dual-Feature-Enhancement YOLO for Cross-Modality Object Detection

    Chen, Y.; Wang, B.; Guo, X.; Zhu, W.; He, J.; Liu, X.; Yuan, J. DEYOLO: Dual-Feature-Enhancement YOLO for Cross-Modality Object Detection. InProceedings of the International Conference on Pattern Recognition, Kolkata, India, 1–5 December 2024; pp. 236–252

  43. [45]

    CDC-YOLOFusion: Leveraging Cross-Scale Dynamic Convolution Fusion for Visible-Infrared Object Detection.IEEE Trans

    Wang, Z.; Liao, X.; Yuan, J.; Yao, Y.; Li, Z. CDC-YOLOFusion: Leveraging Cross-Scale Dynamic Convolution Fusion for Visible-Infrared Object Detection.IEEE Trans. Intell. Veh.2024,10, 2080–2093

  44. [46]

    Cross-Modality Fusion Transformer for Multispectral Object Detection.arXiv2021, arXiv:2111.00273

    Qingyun, F.; Dapeng, H.; Zhaokui, W. Cross-Modality Fusion Transformer for Multispectral Object Detection.arXiv2021, arXiv:2111.00273

  45. [47]

    ICAFusion: Iterative Cross-Attention Guided Feature Fusion for Multispectral Object Detection.Pattern Recognit.2024,145, 109913

    Shen, J.; Chen, Y.; Liu, Y.; Zuo, X.; Fan, H.; Yang, W. ICAFusion: Iterative Cross-Attention Guided Feature Fusion for Multispectral Object Detection.Pattern Recognit.2024,145, 109913

  46. [48]

    Multimodal Object Detection via Probabilistic Ensembling

    Chen, Y.T.; Shi, J.; Ye, Z.; Mertz, C.; Ramanan, D.; Kong, S. Multimodal Object Detection via Probabilistic Ensembling. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 139–158

  47. [49]

    FBRT-YOLO: Faster and Better for Real-Time Aerial Image Detection

    Xiao, Y.; Xu, T.; Xin, Y.; Li, J. FBRT-YOLO: Faster and Better for Real-Time Aerial Image Detection. InProceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; pp. 8673–8681

  48. [50]

    DPDETR: Decoupled Position Detection Transformer for Infrared-Visible Object Detection

    Guo, J.; Gao, C.; Liu, F.; Meng, D. DPDETR: Decoupled Position Detection Transformer for Infrared-Visible Object Detection. arXiv2025, arXiv:2408.06123

  49. [51]

    MUST: The First Dataset and Unified Framework for Multispectral UAV Single Object Tracking.arXiv2025, arXiv:2503.17699

    Qin, H.; Xu, T.; Li, T.; Chen, Z.; Feng, T.; Li, J. MUST: The First Dataset and Unified Framework for Multispectral UAV Single Object Tracking.arXiv2025, arXiv:2503.17699

  50. [52]

    Caltech Aerial RGB-Thermal Dataset in the Wild

    Lee, C.; Anderson, M.; Raganathan, N.; Zuo, X.; Do, K.; Gkioxari, G.; Chung, S.J. Caltech Aerial RGB-Thermal Dataset in the Wild. arXiv2024, arXiv:2403.08997

  51. [53]

    ATR-UMMIM: A Benchmark Dataset for UAV-Based Multimodal Image Registration under Complex Imaging Conditions.arXiv2025, arXiv:2507.20764

    Bin, K.; Chen, C.; Hu, T.; Qi, J.; Zhong, P . ATR-UMMIM: A Benchmark Dataset for UAV-Based Multimodal Image Registration under Complex Imaging Conditions.arXiv2025, arXiv:2507.20764

  52. [54]

    Fusion Meets Diverse Conditions: A High-Diversity Benchmark and Baseline for UAV-Based Multimodal Object Detection with Condition Cues.arXiv2025, arXiv:2510.13620

    Chen, C.; Bin, K.; Hu, T.; Qi, J.; Liu, X.; Liu, T.; Liu, Z.; Liu, Y.; Zhong, P . Fusion Meets Diverse Conditions: A High-Diversity Benchmark and Baseline for UAV-Based Multimodal Object Detection with Condition Cues.arXiv2025, arXiv:2510.13620

  53. [55]

    High-Altitude Infrared Thermal Object Detection for UAVs Based on an Improved RT-DETR

    Huang, L.; Li, Y.; Zhang, S. High-Altitude Infrared Thermal Object Detection for UAVs Based on an Improved RT-DETR. In Proceedings of the 2025 International Conference on Computer and Information Processing Technology, 2025; pp. 316–321

  54. [56]

    CST Anti-UAV: A Thermal Infrared Benchmark for Tiny UAV Tracking in Complex Scenes.arXiv2025, arXiv:2507.23473

    Xie, B.; Zhang, C.; Wang, F.; Liu, P .; Lu, F.; Chen, Z.; Hu, W. CST Anti-UAV: A Thermal Infrared Benchmark for Tiny UAV Tracking in Complex Scenes.arXiv2025, arXiv:2507.23473

  55. [57]

    Deep Learning Based Infrared Small Object Segmentation: Challenges and Future Directions.Inf

    Yang, Z.; Yu, H.; Zhang, J.; Tang, Q.; Mian, A. Deep Learning Based Infrared Small Object Segmentation: Challenges and Future Directions.Inf. Fusion2025,118, 103007

  56. [58]

    Visible-Thermal Tiny Object Detection: A Benchmark Dataset and Baselines.IEEE Trans

    Ying, X.; Xiao, C.; An, W.; Li, R.; He, X.; Li, B.; Cao, X.; Li, Z.; Wang, Y.; Hu, M.; et al. Visible-Thermal Tiny Object Detection: A Benchmark Dataset and Baselines.IEEE Trans. Pattern Anal. Mach. Intell.2025,47, 6088–6096

  57. [59]

    DEPFusion: Dual-Domain Enhancement and Priority-Guided Mamba Fusion for UAV Multispectral Object Detection

    Li, S.; Liu, Z.; Hong, Z.; Zhou, Z.; Cao, X. DEPFusion: Dual-Domain Enhancement and Priority-Guided Mamba Fusion for UAV Multispectral Object Detection. arXiv 2025, arXiv:2509.07327

  58. [60]

    COMO: Cross-Mamba Interaction and Offset-Guided Fusion for Multimodal Object Detection

    Liu, C.; Ma, X.; Yang, X.; Zhang, Y.; Dong, Y. COMO: Cross-Mamba Interaction and Offset-Guided Fusion for Multimodal Object Detection. arXiv 2024, arXiv:2412.18076

  59. [61]

    Cf-Yolo: Cross-Modal Fusion for Weakly Aligned RGB-IR UAV Object Detection

    Nguyen, T.L.; Tran, C.T.; Nguyen Thi, H.V. Cf-Yolo: Cross-Modal Fusion for Weakly Aligned RGB-IR UAV Object Detection. In Proceedings of the 2025 International Symposium on Communications and Information Technologies, 2025; pp. 254–259

  60. [62]

    SFFR: Spatial-Frequency Feature Reconstruction for Multispectral Aerial Object Detection

    Zuo, X.; Qu, C.; Zhan, H.; Shen, J.; Yang, W. SFFR: Spatial-Frequency Feature Reconstruction for Multispectral Aerial Object Detection. arXiv 2025, arXiv:2511.06298

  61. [63]

    A Deep Learning Framework for Infrared and Visible Image Fusion without Strict Registration

    Li, H.; Liu, J.; Zhang, Y.; Liu, Y. A Deep Learning Framework for Infrared and Visible Image Fusion without Strict Registration. Int. J. Comput. Vis. 2024, 132, 1625–1644

  62. [65]

    Semantics Lead All: Towards Unified Image Registration and Fusion from a Semantic Perspective

    Xie, H.; Zhang, Y.; Qiu, J.; Zhai, X.; Liu, X.; Yang, Y.; Zhao, S.; Luo, Y.; Zhong, J. Semantics Lead All: Towards Unified Image Registration and Fusion from a Semantic Perspective. Inf. Fusion 2023, 98, 101835

  63. [66]

    C2Former: Calibrated and Complementary Transformer for RGB-Infrared Object Detection

    Yuan, M.; Wei, X. C2Former: Calibrated and Complementary Transformer for RGB-Infrared Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–12

  64. [67]

    Improving RGB-Infrared Object Detection with Cascade Alignment-Guided Transformer

    Yuan, M.; Shi, X.; Wang, N.; Wang, Y.; Wei, X. Improving RGB-Infrared Object Detection with Cascade Alignment-Guided Transformer. Inf. Fusion 2024, 105, 102246

  65. [68]

    Translation, Scale and Rotation: Cross-Modal Alignment Meets RGB-Infrared Vehicle Detection

    Yuan, M.; Wang, Y.; Wei, X. Translation, Scale and Rotation: Cross-Modal Alignment Meets RGB-Infrared Vehicle Detection. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 509–525

  66. [69]

    CornerNet: Detecting Objects as Paired Keypoints

    Law, H.; Deng, J. CornerNet: Detecting Objects as Paired Keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750

  67. [70]

    CenterNet: Keypoint Triplets for Object Detection

    Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint Triplets for Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578

  68. [71]

    Template Matching Advances and Applications in Image Analysis

    Hashemi, N.S.; Aghdam, R.B.; Ghiasi, A.S.B.; Fatemi, P. Template Matching Advances and Applications in Image Analysis. arXiv 2016, arXiv:1610.07231

  69. [72]

    SegFly: A 2D-3D-2D Paradigm for Aerial RGB-Thermal Semantic Segmentation at Scale

    Gross, M.; Matha, S.B.; Song, R.; Muthuveerappan, V.; Christoph, C.; Huber, J.; Cremers, D. SegFly: A 2D-3D-2D Paradigm for Aerial RGB-Thermal Semantic Segmentation at Scale. arXiv 2026, arXiv:2603.17920