Towards UAV Detection in the Real World: A New Multispectral Dataset UAVNet-MS and a New Method

Chao Xiao; Gaowei Guo; Hongge Li; Jun Chen; Li Liu; Longguang Wang; Miao Li; Nuo Chen; Qiang Ling; Wei An

arxiv: 2605.20963 · v1 · pith:22JYKM6Jnew · submitted 2026-05-20 · 💻 cs.CV

Towards UAV Detection in the Real World: A New Multispectral Dataset UAVNet-MS and a New Method

Yihang Luo , Jun Chen , Chao Xiao , Yingqian Wang , Zhaoxu Li , Qiang Ling , Xu He , Nuo Chen

show 8 more authors

Gaowei Guo Hongge Li Miao Li Longguang Wang Yulan Guo Li Liu Wei An Zhijie Chen

This is my paper

Pith reviewed 2026-05-21 05:04 UTC · model grok-4.3

classification 💻 cs.CV

keywords UAV detectionmultispectral imagingsmall object detectiondatasetsensor fusiondeep learninglow contrast detection

0 comments

The pith

Multispectral imaging supplies material signatures that raise small-UAV detection accuracy by 6.2 percent over the best RGB-only detectors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the first dedicated multispectral dataset for fine-grained detection of small unmanned aerial vehicles, containing more than fifteen thousand synchronized RGB and multispectral image cubes with very small annotated targets. It shows that existing RGB systems lose effectiveness because they depend only on spatial patterns that blur at low resolution and low contrast. A dual-stream network called MFDNet is introduced to align the two modalities and fuse their information. Experiments across twenty detectors establish that adding the spectral channel produces a clear performance gain, indicating that material-aware cues supply evidence RGB images alone cannot provide.

Core claim

UAVNet-MS supplies 15,618 temporally synchronized RGB-MSI data cubes of 1440 by 1080 pixels with bounding-box labels; 93.7 percent of the UAVs occupy 32 squared pixels or less. MFDNet processes the two streams to correct array-induced parallax and performs spatial-spectral fusion. When evaluated under RGB-only, MSI-only, and combined protocols, MFDNet raises AP50 by 6.2 percent relative to the strongest RGB baseline, confirming that multispectral signatures furnish complementary material evidence for separating UAVs from clutter.

What carries the argument

MFDNet, a dual-stream network that aligns array-induced parallax between RGB and multispectral channels and fuses spatial features with spectral signatures for small-object classification.

If this is right

Spectral channels can be added to existing RGB pipelines to improve detection of tiny, low-contrast objects without requiring higher spatial resolution.
The UAVNet-MS benchmark allows direct comparison of future multispectral fusion methods against the reported dual-stream baseline.
Material-aware detection reduces false positives in cluttered outdoor scenes where spatial appearance alone is ambiguous.
The dataset supports training of detectors that generalize across UAV types sharing similar materials but differing shapes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same spectral-fusion strategy could be tested on other small-object categories such as birds or insects where material contrast differs from the background.
Extending the dataset with additional spectral bands or temporal sequences might further reduce reliance on spatial cues.
Real-time deployment on UAV platforms would require measuring the computational overhead of the dual-stream architecture under onboard power limits.

Load-bearing premise

The multispectral signatures collected for UAVs stay distinct enough from background materials to aid detection even when the objects occupy only a few dozen pixels and appear at low contrast.

What would settle it

A controlled experiment in which new scenes contain background materials whose spectral reflectance closely matches that of the UAV airframes, causing the reported AP50 gain to disappear or reverse.

Figures

Figures reproduced from arXiv: 2605.20963 by Chao Xiao, Gaowei Guo, Hongge Li, Jun Chen, Li Liu, Longguang Wang, Miao Li, Nuo Chen, Qiang Ling, Wei An, Xu He, Yihang Luo, Yingqian Wang, Yulan Guo, Zhaoxu Li, Zhijie Chen.

**Figure 1.** Figure 1: Three key challenges in fine-grained small-UAV detection and the motivation for spectral cues. Each case compares RGB with an [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗

**Figure 2.** Figure 2: AMIS imaging system and spectral separability. (a) Example of [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Environmental diversity in UAVNet-MS dataset. Two circular markers [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Statistics of the UAVNet-MS dataset. (a) Local peak-contrast SNR (LPC-SNR) distribution across different scene attributes. Black boxes mark the [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Inter-type spectral separability across UAV scales. Boxplots show the [PITH_FULL_IMAGE:figures/full_fig_p004_5.png] view at source ↗

**Figure 6.** Figure 6: Overview of MFDNet. ArrayCode, dual-stream feature extraction, and fine-scale fusion with semantic decoupling are the three key components. First, [PITH_FULL_IMAGE:figures/full_fig_p005_6.png] view at source ↗

**Figure 7.** Figure 7: Qualitative comparison under key challenges: extremely tiny objects, [PITH_FULL_IMAGE:figures/full_fig_p006_7.png] view at source ↗

**Figure 8.** Figure 8: Ablation study on alignment strategies. TABLE 3 Ablation of spectral-branch design choices under the MSI-only setting. Method mAP AP50 AP75 APET APT APS APM 2DConv-SH 0.5 1.8 0.1 0.2 0.4 1.5 2.9 3DGDeform-OR 2.7 8.4 0.7 4.0 2.5 0.1 1.5 2DConv-OR 4.5 15.8 1.1 3.9 3.3 6.9 7.3 3DGAT-OR 4.7 14.4 1.1 6.0 3.7 4.1 0.5 BandSelect-OR 5.1 15.5 1.5 6.0 4.2 5.2 1.1 ArrayCode-OR 7.1 23.9 1.6 9.2 6.0 7.5 7.4 robust alig… view at source ↗

**Figure 9.** Figure 9: Robustness of MFDNet across conditions. 5.3.3 Fusion stage and mechanism Finally, we examine how fusion design affects RGB–MSI small object detection, in terms of both where to fuse (stage) and how to fuse (mechanism). Impact of fusion level [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗

read the original abstract

The proliferation of unmanned aerial vehicles (UAVs) has created urgent demand for precise UAV monitoring. Existing RGB-based systems rely on spatial cues that degrade at small scales, particularly with high inter-type similarity, target-clutter ambiguity, and low contrast. Multispectral imaging (MSI) encodes material-aware spectral signatures, yet MSI-based fine-grained small-UAV detection remains underexplored due to lack of dedicated datasets. We introduce UAVNet-MS, the first multispectral dataset for fine-grained small-UAV detection, comprising 15,618 temporally synchronized RGB-MSI data cubes (1440x1080) with bounding box annotations. The dataset features challenging small objects (93.7% <= 32^2 pixels, average 18^2 pixels, ~0.02% image area) under low contrast. We propose MFDNet, a dual-stream baseline addressing array-induced parallax and spatial-spectral fusion. Extensive evaluation under RGB-only, MSI-only, and RGB+MSI protocols against 20 detectors shows MFDNet achieves +6.2% AP50 improvement over best RGB-only methods, demonstrating spectral cues provide complementary material evidence beyond spatial cues. This work provides foundational dataset, strong baseline, and benchmark for multispectral UAV monitoring research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

New multispectral UAV dataset plus a fusion baseline that reports gains, but the source of those gains is not isolated.

read the letter

The main thing to know is that this paper puts out the first public multispectral dataset built specifically for fine-grained small-UAV detection and pairs it with a dual-stream network that handles parallax and fuses the streams. The dataset has 15k synchronized RGB-MSI cubes with a heavy emphasis on tiny objects under low contrast, which matches real monitoring needs better than most existing RGB collections. The network, MFDNet, is presented as a reproducible baseline and shows a 6.2% AP50 edge over the strongest RGB-only detectors across the reported protocols. That dataset release is the clearest practical step forward here, since prior work stayed in RGB and lacked material signatures for this scale of target-clutter problem. The evaluation against 20 detectors under RGB-only, MSI-only, and combined settings is straightforward and gives readers a starting point for comparison. The architecture choices for parallax correction and fusion are sensible given the sensor array setup. The soft spot is the missing link between the numerical gain and the claimed spectral material evidence. With 93.7% of objects at or below 32 squared pixels and average size around 18 squared, any per-pixel or small-region spectral difference is easily diluted by mixing, noise, or residual misalignment. The paper does not include band-wise separability checks, signature plots, or a controlled ablation that keeps the architecture fixed while replacing MSI channels with noise or duplicated RGB. Without those, the improvement could trace to extra capacity or registration handling rather than actual material discrimination. The central claim therefore rests on the assumption that the collected MSI vectors stay useful under the stated conditions, and that assumption is not directly tested. This work is aimed at researchers who need a multispectral benchmark for small-object UAV tasks or who are extending fusion methods in security and monitoring applications. Readers looking for a new dataset to train or compare against will get immediate value even if they treat the network as one possible baseline. The combination of a public dataset and a working fusion method is enough to justify sending it for peer review, though any referee will want clearer evidence on why the spectral channels help at these sizes.

Referee Report

3 major / 2 minor

Summary. The paper introduces UAVNet-MS, the first multispectral dataset for fine-grained small-UAV detection comprising 15,618 temporally synchronized RGB-MSI data cubes (1440x1080) with bounding-box annotations. It emphasizes challenging conditions with 93.7% of objects ≤32² pixels (average 18² pixels, ~0.02% image area) under low contrast. The authors propose MFDNet, a dual-stream baseline that corrects array-induced parallax and fuses spatial-spectral features, reporting a +6.2% AP50 improvement over the best of 20 RGB-only detectors and attributing the gain to complementary material evidence from spectral signatures.

Significance. If the numerical gain is shown to arise from genuine spectral separability rather than capacity or registration effects, the dataset and baseline would provide a valuable foundation for multispectral UAV monitoring research, extending beyond RGB limitations in small-object, low-contrast regimes. The work supplies both a new resource and an initial benchmark that future methods can build upon.

major comments (3)

Results section: the headline +6.2% AP50 improvement is presented without error bars, standard deviations across runs, or statistical significance tests. Given that 93.7% of targets are ≤32² pixels (mean ~18² pixels), performance variance is expected to be high; the absence of these statistics makes it impossible to judge whether the reported margin is reliable or reproducible.
Ablation / fusion analysis: no controlled experiment replaces the MSI channels with duplicated RGB or additive noise while freezing the dual-stream architecture. Without this isolation, the observed gain cannot be confidently attributed to material-aware spectral cues rather than increased model capacity or the parallax-correction module itself.
Dataset characterization: the manuscript contains no band-wise signature plots, per-class separability metrics (e.g., Bhattacharyya distance between UAV and clutter distributions), or even simple mean/variance statistics per spectral band for the small-object subset. This leaves the core assumption—that MSI signatures remain distinct and useful under the stated low-contrast, sub-32²-pixel regime—unverified.

minor comments (2)

The abstract states evaluation against “20 detectors” yet the main text does not provide a single consolidated table listing all baselines with their exact AP50 scores under the RGB-only protocol; adding such a table would improve clarity and reproducibility.
Notation for the dual-stream fusion module is introduced without an accompanying equation or diagram that explicitly shows how the parallax-corrected MSI features are combined with RGB features; a concise mathematical description would aid readers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our results and the characterization of UAVNet-MS. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

read point-by-point responses

Referee: Results section: the headline +6.2% AP50 improvement is presented without error bars, standard deviations across runs, or statistical significance tests. Given that 93.7% of targets are ≤32² pixels (mean ~18² pixels), performance variance is expected to be high; the absence of these statistics makes it impossible to judge whether the reported margin is reliable or reproducible.

Authors: We agree that statistical validation is essential given the small target sizes and expected variance. In the revised manuscript we will report results from five independent training runs with different random seeds, include error bars showing mean and standard deviation for AP50, and add paired t-tests to establish statistical significance of the observed gains over the RGB-only baselines. revision: yes
Referee: Ablation / fusion analysis: no controlled experiment replaces the MSI channels with duplicated RGB or additive noise while freezing the dual-stream architecture. Without this isolation, the observed gain cannot be confidently attributed to material-aware spectral cues rather than increased model capacity or the parallax-correction module itself.

Authors: This is a fair criticism. We will add a new ablation study that freezes the dual-stream architecture and parallax-correction module while replacing the MSI input channels with either duplicated RGB channels or Gaussian noise matched to the original channel statistics. The results will be reported alongside the existing experiments to isolate the contribution of genuine spectral material cues. revision: yes
Referee: Dataset characterization: the manuscript contains no band-wise signature plots, per-class separability metrics (e.g., Bhattacharyya distance between UAV and clutter distributions), or even simple mean/variance statistics per spectral band for the small-object subset. This leaves the core assumption—that MSI signatures remain distinct and useful under the stated low-contrast, sub-32²-pixel regime—unverified.

Authors: We acknowledge the value of explicit spectral characterization. The revised manuscript will include (i) mean spectral signature plots for UAV versus background pixels on the small-object subset, (ii) per-band mean and variance statistics, and (iii) Bhattacharyya distance and Jeffries-Matusita separability metrics computed between UAV and clutter distributions in the sub-32²-pixel regime. These additions will directly support the claim that MSI provides complementary material evidence. revision: yes

Circularity Check

0 steps flagged

Empirical evaluation on newly collected dataset with no derivation chain

full rationale

The manuscript introduces UAVNet-MS as a new dataset and MFDNet as a dual-stream detector, then reports protocol-wise AP50 numbers from direct comparisons against 20 existing detectors. No equations, predictions, or uniqueness claims are present that reduce by construction to fitted inputs, self-citations, or renamed empirical patterns. The +6.2% gain is presented strictly as an observed experimental outcome on the held-out test split, rendering the reported chain self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that multispectral bands supply material signatures orthogonal to spatial appearance; no free parameters or invented entities are enumerated in the abstract.

axioms (1)

domain assumption Multispectral imaging encodes material-aware spectral signatures that are complementary to spatial cues in RGB images for small-object discrimination.
Invoked to justify why MSI should improve detection when RGB fails on size, contrast, and inter-type similarity.

pith-pipeline@v0.9.0 · 5810 in / 1112 out tokens · 34046 ms · 2026-05-21T05:04:23.138280+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 3 internal anchors

[1]

Detection and tracking meet drones challenge,

P. Zhu, L. Wen, D. Du, X. Bian, H. Fan, Q. Hu, and H. Ling, “Detection and tracking meet drones challenge,”IEEE TPAMI, vol. 44, no. 11, pp. 7380–7399, 2022

work page 2022
[2]

Anti-uav410: A thermal infrared benchmarkandcustomizedschemefortrackingdronesinthewild,

B. Huang, J. Li, J. Chen, G. Wang, J. Zhao, and T. Xu, “Anti-uav410: A thermal infrared benchmarkandcustomizedschemefortrackingdronesinthewild,”IEEETPAMI,vol.46, no. 5, pp. 2852–2865, 2024

work page 2024
[3]

Webuav-3m: A benchmark for unveiling the power of million-scale deep uav tracking,

C.Zhang,G.Huang,L.Liu,S.Huang,Y.Yang,X.Wan,S.Ge,andD.Tao,“Webuav-3m: A benchmark for unveiling the power of million-scale deep uav tracking,”IEEE TPAMI, vol. 45, no. 7, pp. 9186–9205, 2023

work page 2023
[4]

Overview on autonomous aircraft technology and its application to low-altitude economy,

C. Lin, M. Zhiqiang, W. Xiangke, C. Mou, D. Haibin, and W. Yaonan, “Overview on autonomous aircraft technology and its application to low-altitude economy,” inRobot, vol. 47, no. 3, 2025, pp. 470–496

work page 2025
[5]

ATRNet-STAR: A large dataset and benchmark towards remote sensing object recognition in the wild,

Y. Liu, W. Li, L. Liu, J. Zhou, B. Peng, Y. Song, X. Xiong, W. Yang, T. Liu, Z. Liu, and X. Li, “ATRNet-STAR: A large dataset and benchmark towards remote sensing object recognition in the wild,”IEEE TPAMI, pp. 1–18, 2026

work page 2026
[6]

Rgb-t object tracking: Benchmark and baseline,

C. Li, X. Liang, Y. Lu, N. Zhao, and J. Tang, “Rgb-t object tracking: Benchmark and baseline,”PR, vol. 96, p. 106977, 2019

work page 2019
[7]

Visible-thermal uav tracking: A large- scale benchmark and new baseline,

P. Zhang, J. Zhao, D. Wang, H. Lu, and X. Ruan, “Visible-thermal uav tracking: A large- scale benchmark and new baseline,” inCVPR, 2022, pp. 8886–8895

work page 2022
[8]

Anti-uav: A large-scale benchmark for vision-based uav tracking,

N. Jiang, K. Wang, X. Peng, X. Yu, Q. Wang, J. Xing, G. Li, G. Guo, Q. Ye, J. Jiaoet al., “Anti-uav: A large-scale benchmark for vision-based uav tracking,”IEEE TMM, vol. 25, pp. 486–500, 2021

work page 2021
[9]

Material based object tracking in hyperspectral videos,

F. Xiong, J. Zhou, and Y. Qian, “Material based object tracking in hyperspectral videos,” IEEE TIP, vol. 29, pp. 3719–3733, 2020

work page 2020
[10]

Unsupervised deep hyperspectral video target tracking and high spectral-spatial-temporal resolution (h 3) benchmark dataset,

Z. Liu, Y. Zhong, X. Wang, M. Shu, and L. Zhang, “Unsupervised deep hyperspectral video target tracking and high spectral-spatial-temporal resolution (h 3) benchmark dataset,”IEEE TGRS, vol. 60, pp. 1–14, 2021

work page 2021
[11]

Must: The first dataset and unified framework for multispectral uav single object tracking,

H. Qin, T. Xu, T. Li, Z. Chen, T. Feng, and J. Li, “Must: The first dataset and unified framework for multispectral uav single object tracking,” inCVPR, 2025, pp. 16882– 16891

work page 2025
[12]

A benchmark and simulator for uav tracking

M. Mueller, N. Smith, and B. Ghanem, “A benchmark and simulator for uav tracking",” inECCV, 2016, pp. 445–461

work page 2016
[13]

Visualobjecttrackingforunmannedaerialvehicles:Abenchmark and new motion models,

S.LiandD.-Y.Yeung,“Visualobjecttrackingforunmannedaerialvehicles:Abenchmark and new motion models,” inAAAI, vol. 31, no. 1, 2017

work page 2017
[14]

Visdrone-det2021: The vision meets drone object detection challenge results,

Y. Cao, Z. He, L. Wang, W. Wang, Y. Yuan, D. Zhang, J. Zhang, P. Zhu, L. Van Gool, J. Hanet al., “Visdrone-det2021: The vision meets drone object detection challenge results,” inICCV, 2021, pp. 2847–2854

work page 2021
[15]

The unmanned aerial vehicle benchmark: Object detection and tracking,

D. Du, Y. Qi, H. Yu, Y. Yang, K. Duan, G. Li, W. Zhang, Q. Huang, and Q. Tian, “The unmanned aerial vehicle benchmark: Object detection and tracking,” inECCV, 2018, pp. 370–386

work page 2018
[16]

Learningsocialetiquette:Human trajectory understanding in crowded scenes,

A.Robicquet,A.Sadeghian,A.Alahi,andS.Savarese,“Learningsocialetiquette:Human trajectory understanding in crowded scenes,” inECCV, 2016, pp. 549–565

work page 2016
[17]

Adaptive inattentional framework for video object detection with reward-conditional training,

A. Rodriguez-Ramos, J. Rodriguez-Vazquez, C. Sampedro, and P. Campoy, “Adaptive inattentional framework for video object detection with reward-conditional training,” IEEE Access, vol. 8, pp. 124451–124466, 2020

work page 2020
[18]

Tju-dhd: A diverse high-resolution dataset for object detection,

Y. Pang, J. Cao, Y. Li, J. Xie, H. Sun, and J. Gong, “Tju-dhd: A diverse high-resolution dataset for object detection,”IEEE TIP, vol. 30, pp. 207–219, 2020

work page 2020
[19]

Multispectralpedestriandetection: Benchmark dataset and baseline,

S.Hwang,J.Park,N.Kim,Y.Choi,andI.SoKweon,“Multispectralpedestriandetection: Benchmark dataset and baseline,” inCVPR, 2015, pp. 1037–1045

work page 2015
[20]

Multispectral fusion for object detection with cyclic fuse-and-refine blocks,

H. Zhang, E. Fromont, S. Lefevre, and B. Avignon, “Multispectral fusion for object detection with cyclic fuse-and-refine blocks,” inICIP, 2020, pp. 276–280

work page 2020
[21]

Llvip: A visible-infrared paired dataset for low-light vision,

X. Jia, C. Zhu, M. Li, W. Tang, and W. Zhou, “Llvip: A visible-infrared paired dataset for low-light vision,” inICCV, 2021, pp. 3496–3504

work page 2021
[22]

Lasher: A large-scale high- diversity benchmark for rgbt tracking,

C. Li, W. Xue, Y. Jia, Z. Qu, B. Luo, J. Tang, and D. Sun, “Lasher: A large-scale high- diversity benchmark for rgbt tracking,”IEEE TIP, vol. 31, pp. 392–404, 2021

work page 2021
[23]

Visible-thermaltinyobjectdetection:Abenchmarkdatasetandbaselines,

X. Ying, C. Xiao, W. An, R. Li, X. He, B. Li, X. Cao, Z. Li, Y. Wang, M. Huet al., “Visible-thermaltinyobjectdetection:Abenchmarkdatasetandbaselines,”IEEETPAMI, vol. 47, no. 7, pp. 6088–6096, 2025

work page 2025
[24]

Vehicle detection in aerial imagery: A small target detection benchmark,

S. Razakarivony and F. Jurie, “Vehicle detection in aerial imagery: A small target detection benchmark,”JVCIR, vol. 34, pp. 187–203, 2016

work page 2016
[25]

Semi-supervised hyperspectral object detection challenge results-pbvs 2022,

A. Rangnekar, Z. Mulhollan, A. Vodacek, M. Hoffman, A. D. Sappa, E. Blasch, J. Yu, L. Zhang, S. Du, H. Changet al., “Semi-supervised hyperspectral object detection challenge results-pbvs 2022,” inCVPRW, 2022, pp. 390–398

work page 2022
[26]

Figvcl: Fine-grained benchmark and method for video copy localization,

W. Luo, Y. Liu, B. Li, W. Hu, and S. Maybank, “Figvcl: Fine-grained benchmark and method for video copy localization,”IEEE TPAMI, vol. 47, no. 11, pp. 10457–10474, 2025

work page 2025
[27]

Advances in multimodal adaptation and generalization: From traditional approaches to foundation models,

H. Dong, M. Liu, K. Zhou, E. Chatzi, J. Kannala, C. Stachniss, and O. Fink, “Advances in multimodal adaptation and generalization: From traditional approaches to foundation models,”IEEE TPAMI, pp. 1–20, 2026

work page 2026
[28]

Faster r-cnn: Towards real-time object detection with region proposal networks,

S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,”NeurIPS, vol. 28, 2015

work page 2015
[29]

Fully convolutional one-stage object detection,

Z. Tian, C. Shen, H. Chen, and T. He, “Fully convolutional one-stage object detection,” inICCV, 2019, pp. 9626–9635

work page 2019
[30]

Motion and appearance decoupling representation for event cameras,

N. Chen, B. Li, Y. Wang, X. Ying, L. Wang, C. Zhang, Y. Guo, M. Li, and W. An, “Motion and appearance decoupling representation for event cameras,”IEEE TIP, 2025. SUBMITTED TO IEEE TRANSACTIONS ON PATTERN ANAL YSIS AND MACHINE INTELLIGENCE 9

work page 2025
[31]

Objects as Points

X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,”arXiv, 2019, arXiv:1904.07850

work page internal anchor Pith review Pith/arXiv arXiv 2019
[32]

Deformable DETR: Deformable Transformers for End-to-End Object Detection

X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,”arXiv, 2020, arXiv:2010.04159

work page internal anchor Pith review Pith/arXiv arXiv 2020
[33]

DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. M. Ni, and H.-Y. Shum, “Dino: Detr with improved denoising anchor boxes for end-to-end object detection,”arXiv, 2022, arXiv:2203.03605

work page internal anchor Pith review Pith/arXiv arXiv 2022
[34]

Slicingaidedhyperinferenceandfine-tuning for small object detection,

F.C.Akyon,S.O.Altinuc,andA.Temizel,“Slicingaidedhyperinferenceandfine-tuning for small object detection,” inIEEE ICIP, 2022, pp. 966–970

work page 2022
[35]

Parameter-inverted image pyramid networks for visual perception and multimodal understanding,

Z. Wang, X. Zhu, X. Yang, G. Luo, H. Li, C. Tian, W. Dou, J. Ge, L. Lu, Y. Qiao, and J. Dai, “Parameter-inverted image pyramid networks for visual perception and multimodal understanding,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 11, pp. 10142–10159, 2025

work page 2025
[36]

Direction-coded temporal u-shape module for multiframe infrared small target detection,

R. Li, W. An, C. Xiao, B. Li, Y. Wang, M. Li, and Y. Guo, “Direction-coded temporal u-shape module for multiframe infrared small target detection,”IEEE TNNLS, vol. 36, no. 1, pp. 555–568, 2025

work page 2025
[37]

Specdetr: A transformer-based hyperspectral point object detection network,

Z. Li, W. An, G. Guo, L. Wang, Y. Wang, and Z. Lin, “Specdetr: A transformer-based hyperspectral point object detection network,”ISPRS, vol. 226, pp. 221–246, 2025

work page 2025
[38]

Cascade r-cnn: Delving into high quality object detection,

Z. Cai and N. Vasconcelos, “Cascade r-cnn: Delving into high quality object detection,” inCVPR, 2018, pp. 6154–6162

work page 2018
[39]

Tood: Task-aligned one-stage object detection,

C. Feng, Y. Zhong, Y. Gao, M. R. Scott, and W. Huang, “Tood: Task-aligned one-stage object detection,” inICCV, 2021, pp. 3490–3499

work page 2021
[40]

Ultralytics YOLO,

G. Jocher, A. Chaurasia, and J. Qiu, “Ultralytics YOLO,” 2023. [Online]. Available: https://github.com/ultralytics/ultralytics

work page 2023
[41]

Yolov9: Learning what you want to learn using programmable gradient information,

C.-Y. Wang, I.-H. Yeh, and H.-Y. Mark Liao, “Yolov9: Learning what you want to learn using programmable gradient information,” inECCV, 2024, pp. 1–21

work page 2024
[42]

Anchordetr:Querydesignfortransformer-based detector,

Y.Wang,X.Zhang,T.Yang,andJ.Sun,“Anchordetr:Querydesignfortransformer-based detector,” inAAAI, vol. 36, no. 3, 2022, pp. 2567–2575

work page 2022
[43]

Oriented tiny object detection: A dataset, benchmark, and dynamic unbiased learning,

C. Xu, R. Zhang, W. Yang, H. Zhu, F. Xu, J. Ding, and G.-S. Xia, “Oriented tiny object detection: A dataset, benchmark, and dynamic unbiased learning,”IEEE TPAMI, vol. 48, no. 3, pp. 3167–3184, 2026

work page 2026

[1] [1]

Detection and tracking meet drones challenge,

P. Zhu, L. Wen, D. Du, X. Bian, H. Fan, Q. Hu, and H. Ling, “Detection and tracking meet drones challenge,”IEEE TPAMI, vol. 44, no. 11, pp. 7380–7399, 2022

work page 2022

[2] [2]

Anti-uav410: A thermal infrared benchmarkandcustomizedschemefortrackingdronesinthewild,

B. Huang, J. Li, J. Chen, G. Wang, J. Zhao, and T. Xu, “Anti-uav410: A thermal infrared benchmarkandcustomizedschemefortrackingdronesinthewild,”IEEETPAMI,vol.46, no. 5, pp. 2852–2865, 2024

work page 2024

[3] [3]

Webuav-3m: A benchmark for unveiling the power of million-scale deep uav tracking,

C.Zhang,G.Huang,L.Liu,S.Huang,Y.Yang,X.Wan,S.Ge,andD.Tao,“Webuav-3m: A benchmark for unveiling the power of million-scale deep uav tracking,”IEEE TPAMI, vol. 45, no. 7, pp. 9186–9205, 2023

work page 2023

[4] [4]

Overview on autonomous aircraft technology and its application to low-altitude economy,

C. Lin, M. Zhiqiang, W. Xiangke, C. Mou, D. Haibin, and W. Yaonan, “Overview on autonomous aircraft technology and its application to low-altitude economy,” inRobot, vol. 47, no. 3, 2025, pp. 470–496

work page 2025

[5] [5]

ATRNet-STAR: A large dataset and benchmark towards remote sensing object recognition in the wild,

Y. Liu, W. Li, L. Liu, J. Zhou, B. Peng, Y. Song, X. Xiong, W. Yang, T. Liu, Z. Liu, and X. Li, “ATRNet-STAR: A large dataset and benchmark towards remote sensing object recognition in the wild,”IEEE TPAMI, pp. 1–18, 2026

work page 2026

[6] [6]

Rgb-t object tracking: Benchmark and baseline,

C. Li, X. Liang, Y. Lu, N. Zhao, and J. Tang, “Rgb-t object tracking: Benchmark and baseline,”PR, vol. 96, p. 106977, 2019

work page 2019

[7] [7]

Visible-thermal uav tracking: A large- scale benchmark and new baseline,

P. Zhang, J. Zhao, D. Wang, H. Lu, and X. Ruan, “Visible-thermal uav tracking: A large- scale benchmark and new baseline,” inCVPR, 2022, pp. 8886–8895

work page 2022

[8] [8]

Anti-uav: A large-scale benchmark for vision-based uav tracking,

N. Jiang, K. Wang, X. Peng, X. Yu, Q. Wang, J. Xing, G. Li, G. Guo, Q. Ye, J. Jiaoet al., “Anti-uav: A large-scale benchmark for vision-based uav tracking,”IEEE TMM, vol. 25, pp. 486–500, 2021

work page 2021

[9] [9]

Material based object tracking in hyperspectral videos,

F. Xiong, J. Zhou, and Y. Qian, “Material based object tracking in hyperspectral videos,” IEEE TIP, vol. 29, pp. 3719–3733, 2020

work page 2020

[10] [10]

Unsupervised deep hyperspectral video target tracking and high spectral-spatial-temporal resolution (h 3) benchmark dataset,

Z. Liu, Y. Zhong, X. Wang, M. Shu, and L. Zhang, “Unsupervised deep hyperspectral video target tracking and high spectral-spatial-temporal resolution (h 3) benchmark dataset,”IEEE TGRS, vol. 60, pp. 1–14, 2021

work page 2021

[11] [11]

Must: The first dataset and unified framework for multispectral uav single object tracking,

H. Qin, T. Xu, T. Li, Z. Chen, T. Feng, and J. Li, “Must: The first dataset and unified framework for multispectral uav single object tracking,” inCVPR, 2025, pp. 16882– 16891

work page 2025

[12] [12]

A benchmark and simulator for uav tracking

M. Mueller, N. Smith, and B. Ghanem, “A benchmark and simulator for uav tracking",” inECCV, 2016, pp. 445–461

work page 2016

[13] [13]

Visualobjecttrackingforunmannedaerialvehicles:Abenchmark and new motion models,

S.LiandD.-Y.Yeung,“Visualobjecttrackingforunmannedaerialvehicles:Abenchmark and new motion models,” inAAAI, vol. 31, no. 1, 2017

work page 2017

[14] [14]

Visdrone-det2021: The vision meets drone object detection challenge results,

Y. Cao, Z. He, L. Wang, W. Wang, Y. Yuan, D. Zhang, J. Zhang, P. Zhu, L. Van Gool, J. Hanet al., “Visdrone-det2021: The vision meets drone object detection challenge results,” inICCV, 2021, pp. 2847–2854

work page 2021

[15] [15]

The unmanned aerial vehicle benchmark: Object detection and tracking,

D. Du, Y. Qi, H. Yu, Y. Yang, K. Duan, G. Li, W. Zhang, Q. Huang, and Q. Tian, “The unmanned aerial vehicle benchmark: Object detection and tracking,” inECCV, 2018, pp. 370–386

work page 2018

[16] [16]

Learningsocialetiquette:Human trajectory understanding in crowded scenes,

A.Robicquet,A.Sadeghian,A.Alahi,andS.Savarese,“Learningsocialetiquette:Human trajectory understanding in crowded scenes,” inECCV, 2016, pp. 549–565

work page 2016

[17] [17]

Adaptive inattentional framework for video object detection with reward-conditional training,

A. Rodriguez-Ramos, J. Rodriguez-Vazquez, C. Sampedro, and P. Campoy, “Adaptive inattentional framework for video object detection with reward-conditional training,” IEEE Access, vol. 8, pp. 124451–124466, 2020

work page 2020

[18] [18]

Tju-dhd: A diverse high-resolution dataset for object detection,

Y. Pang, J. Cao, Y. Li, J. Xie, H. Sun, and J. Gong, “Tju-dhd: A diverse high-resolution dataset for object detection,”IEEE TIP, vol. 30, pp. 207–219, 2020

work page 2020

[19] [19]

Multispectralpedestriandetection: Benchmark dataset and baseline,

S.Hwang,J.Park,N.Kim,Y.Choi,andI.SoKweon,“Multispectralpedestriandetection: Benchmark dataset and baseline,” inCVPR, 2015, pp. 1037–1045

work page 2015

[20] [20]

Multispectral fusion for object detection with cyclic fuse-and-refine blocks,

H. Zhang, E. Fromont, S. Lefevre, and B. Avignon, “Multispectral fusion for object detection with cyclic fuse-and-refine blocks,” inICIP, 2020, pp. 276–280

work page 2020

[21] [21]

Llvip: A visible-infrared paired dataset for low-light vision,

X. Jia, C. Zhu, M. Li, W. Tang, and W. Zhou, “Llvip: A visible-infrared paired dataset for low-light vision,” inICCV, 2021, pp. 3496–3504

work page 2021

[22] [22]

Lasher: A large-scale high- diversity benchmark for rgbt tracking,

C. Li, W. Xue, Y. Jia, Z. Qu, B. Luo, J. Tang, and D. Sun, “Lasher: A large-scale high- diversity benchmark for rgbt tracking,”IEEE TIP, vol. 31, pp. 392–404, 2021

work page 2021

[23] [23]

Visible-thermaltinyobjectdetection:Abenchmarkdatasetandbaselines,

X. Ying, C. Xiao, W. An, R. Li, X. He, B. Li, X. Cao, Z. Li, Y. Wang, M. Huet al., “Visible-thermaltinyobjectdetection:Abenchmarkdatasetandbaselines,”IEEETPAMI, vol. 47, no. 7, pp. 6088–6096, 2025

work page 2025

[24] [24]

Vehicle detection in aerial imagery: A small target detection benchmark,

S. Razakarivony and F. Jurie, “Vehicle detection in aerial imagery: A small target detection benchmark,”JVCIR, vol. 34, pp. 187–203, 2016

work page 2016

[25] [25]

Semi-supervised hyperspectral object detection challenge results-pbvs 2022,

A. Rangnekar, Z. Mulhollan, A. Vodacek, M. Hoffman, A. D. Sappa, E. Blasch, J. Yu, L. Zhang, S. Du, H. Changet al., “Semi-supervised hyperspectral object detection challenge results-pbvs 2022,” inCVPRW, 2022, pp. 390–398

work page 2022

[26] [26]

Figvcl: Fine-grained benchmark and method for video copy localization,

W. Luo, Y. Liu, B. Li, W. Hu, and S. Maybank, “Figvcl: Fine-grained benchmark and method for video copy localization,”IEEE TPAMI, vol. 47, no. 11, pp. 10457–10474, 2025

work page 2025

[27] [27]

Advances in multimodal adaptation and generalization: From traditional approaches to foundation models,

H. Dong, M. Liu, K. Zhou, E. Chatzi, J. Kannala, C. Stachniss, and O. Fink, “Advances in multimodal adaptation and generalization: From traditional approaches to foundation models,”IEEE TPAMI, pp. 1–20, 2026

work page 2026

[28] [28]

Faster r-cnn: Towards real-time object detection with region proposal networks,

S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,”NeurIPS, vol. 28, 2015

work page 2015

[29] [29]

Fully convolutional one-stage object detection,

Z. Tian, C. Shen, H. Chen, and T. He, “Fully convolutional one-stage object detection,” inICCV, 2019, pp. 9626–9635

work page 2019

[30] [30]

Motion and appearance decoupling representation for event cameras,

N. Chen, B. Li, Y. Wang, X. Ying, L. Wang, C. Zhang, Y. Guo, M. Li, and W. An, “Motion and appearance decoupling representation for event cameras,”IEEE TIP, 2025. SUBMITTED TO IEEE TRANSACTIONS ON PATTERN ANAL YSIS AND MACHINE INTELLIGENCE 9

work page 2025

[31] [31]

Objects as Points

X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,”arXiv, 2019, arXiv:1904.07850

work page internal anchor Pith review Pith/arXiv arXiv 2019

[32] [32]

Deformable DETR: Deformable Transformers for End-to-End Object Detection

X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,”arXiv, 2020, arXiv:2010.04159

work page internal anchor Pith review Pith/arXiv arXiv 2020

[33] [33]

DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. M. Ni, and H.-Y. Shum, “Dino: Detr with improved denoising anchor boxes for end-to-end object detection,”arXiv, 2022, arXiv:2203.03605

work page internal anchor Pith review Pith/arXiv arXiv 2022

[34] [34]

Slicingaidedhyperinferenceandfine-tuning for small object detection,

F.C.Akyon,S.O.Altinuc,andA.Temizel,“Slicingaidedhyperinferenceandfine-tuning for small object detection,” inIEEE ICIP, 2022, pp. 966–970

work page 2022

[35] [35]

Parameter-inverted image pyramid networks for visual perception and multimodal understanding,

Z. Wang, X. Zhu, X. Yang, G. Luo, H. Li, C. Tian, W. Dou, J. Ge, L. Lu, Y. Qiao, and J. Dai, “Parameter-inverted image pyramid networks for visual perception and multimodal understanding,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 11, pp. 10142–10159, 2025

work page 2025

[36] [36]

Direction-coded temporal u-shape module for multiframe infrared small target detection,

R. Li, W. An, C. Xiao, B. Li, Y. Wang, M. Li, and Y. Guo, “Direction-coded temporal u-shape module for multiframe infrared small target detection,”IEEE TNNLS, vol. 36, no. 1, pp. 555–568, 2025

work page 2025

[37] [37]

Specdetr: A transformer-based hyperspectral point object detection network,

Z. Li, W. An, G. Guo, L. Wang, Y. Wang, and Z. Lin, “Specdetr: A transformer-based hyperspectral point object detection network,”ISPRS, vol. 226, pp. 221–246, 2025

work page 2025

[38] [38]

Cascade r-cnn: Delving into high quality object detection,

Z. Cai and N. Vasconcelos, “Cascade r-cnn: Delving into high quality object detection,” inCVPR, 2018, pp. 6154–6162

work page 2018

[39] [39]

Tood: Task-aligned one-stage object detection,

C. Feng, Y. Zhong, Y. Gao, M. R. Scott, and W. Huang, “Tood: Task-aligned one-stage object detection,” inICCV, 2021, pp. 3490–3499

work page 2021

[40] [40]

Ultralytics YOLO,

G. Jocher, A. Chaurasia, and J. Qiu, “Ultralytics YOLO,” 2023. [Online]. Available: https://github.com/ultralytics/ultralytics

work page 2023

[41] [41]

Yolov9: Learning what you want to learn using programmable gradient information,

C.-Y. Wang, I.-H. Yeh, and H.-Y. Mark Liao, “Yolov9: Learning what you want to learn using programmable gradient information,” inECCV, 2024, pp. 1–21

work page 2024

[42] [42]

Anchordetr:Querydesignfortransformer-based detector,

Y.Wang,X.Zhang,T.Yang,andJ.Sun,“Anchordetr:Querydesignfortransformer-based detector,” inAAAI, vol. 36, no. 3, 2022, pp. 2567–2575

work page 2022

[43] [43]

Oriented tiny object detection: A dataset, benchmark, and dynamic unbiased learning,

C. Xu, R. Zhang, W. Yang, H. Zhu, F. Xu, J. Ding, and G.-S. Xia, “Oriented tiny object detection: A dataset, benchmark, and dynamic unbiased learning,”IEEE TPAMI, vol. 48, no. 3, pp. 3167–3184, 2026

work page 2026