UHR-DETR: Efficient End-to-End Small Object Detection for Ultra-High-Resolution Remote Sensing Imagery
Pith reviewed 2026-05-09 22:11 UTC · model grok-4.3
The pith
UHR-DETR detects small objects in ultra-high-resolution remote sensing images by focusing computation on key regions rather than processing entire scenes uniformly.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UHR-DETR is an efficient transformer-based detector for ultra-high-resolution remote sensing imagery. It uses a Coverage-Maximizing Sparse Encoder to dynamically allocate limited compute to the most informative high-resolution regions, maximizing small-object coverage while minimizing spatial redundancy, and a Global-Local Decoupled Decoder that fuses macroscopic scene context with microscopic object details to reduce semantic ambiguity and scene fragmentation.
What carries the argument
Coverage-Maximizing Sparse Encoder, which dynamically identifies and prioritizes high-resolution image regions containing small objects to allocate finite computational resources for maximum coverage at minimal redundancy.
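For intuition, here is a minimal sketch of saliency-driven sparse patch selection under a fixed compute budget. It illustrates the general idea only; the scoring pass, patch size, and budget are assumptions for illustration, not the paper's actual encoder.

```python
# Minimal sketch (not the authors' implementation): a cheap low-resolution
# pass scores fixed-size patches, and only the top-k highest-scoring patches
# are forwarded to the full-resolution encoder.
import numpy as np

def select_patches(saliency: np.ndarray, patch: int, budget: int):
    """saliency: (H, W) score map from a cheap low-resolution pass.
    Returns (row, col) pixel origins of the `budget` non-overlapping
    patches with the highest mean saliency."""
    H, W = saliency.shape
    rows, cols = H // patch, W // patch
    # Mean saliency per non-overlapping patch.
    scores = saliency[: rows * patch, : cols * patch] \
        .reshape(rows, patch, cols, patch).mean(axis=(1, 3))
    top = np.argsort(scores, axis=None)[::-1][:budget]  # top-k patch ids
    return [(int(i // cols) * patch, int(i % cols) * patch) for i in top]

# Example: a 4096x4096 scene, 256-px patches, budget of 64 patches
# (64 of 256 candidates, i.e. 25% of the scene at full resolution).
saliency = np.random.rand(4096, 4096).astype(np.float32)
origins = select_patches(saliency, patch=256, budget=64)
print(len(origins), origins[0])
```

Under this reading, the budget caps how much of the scene is ever processed at full resolution, which is the lever that trades coverage against memory and speed.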
If this is right
- Ultra-high-resolution images can be analyzed end-to-end at full resolution, without downsampling or memory overflow, on consumer GPUs.
- Small-object detection accuracy improves by 2.8% mAP on the STAR benchmark while inference runs 10× faster than sliding-window approaches.
- Semantic ambiguities arising from scene fragmentation are reduced by combining global context with local detail in the decoder.
- The same model architecture works on multiple ultra-high-resolution remote sensing datasets including STAR and SODA-A under identical hardware constraints.
Where Pith is reading between the lines
- The sparse allocation approach could scale object detection to even larger image sizes in domains such as aerial surveillance or large-scale mapping.
- Decoupling global and local processing in the decoder offers a template for handling multi-scale features in other transformer-based vision tasks (a speculative sketch follows this list).
- If the encoder generalizes, practitioners may shift from specialized high-memory hardware to standard GPUs for routine remote sensing analysis.
- Further tests on scenes with denser small-object clusters or lower contrast would clarify the limits of coverage maximization.
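To make that decoder template concrete, here is a speculative PyTorch sketch of one global-local decoupled decoding step. It is an assumption about one plausible realization, not the authors' implementation: each object query cross-attends separately to tokens from a downsampled global view and to tokens from the selected high-resolution patches, and the two streams are summed.

```python
# Speculative sketch of a global-local decoupled decoding step; the class
# name, token sources, and additive fusion are assumptions for illustration.
import torch
import torch.nn as nn

class GlobalLocalDecoderLayer(nn.Module):
    def __init__(self, d=256, heads=8):
        super().__init__()
        self.global_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm = nn.LayerNorm(d)

    def forward(self, queries, global_tokens, local_tokens):
        # queries: (B, Q, d); global_tokens: (B, G, d) from the downsampled
        # scene; local_tokens: (B, L, d) from selected high-res patches.
        g, _ = self.global_attn(queries, global_tokens, global_tokens)
        l, _ = self.local_attn(queries, local_tokens, local_tokens)
        return self.norm(queries + g + l)  # fuse both context streams

layer = GlobalLocalDecoderLayer()
q = torch.randn(2, 100, 256)
out = layer(q, torch.randn(2, 64, 256), torch.randn(2, 1024, 256))
print(out.shape)  # torch.Size([2, 100, 256])
```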
Load-bearing premise
The sparse encoder can reliably locate every important high-resolution region that contains small objects across varied scenes without missing critical instances.
What would settle it
Running UHR-DETR on a new ultra-high-resolution dataset with small objects placed in patterns that the encoder consistently overlooks, and measuring whether its mean average precision then falls below that of a standard sliding-window baseline.
Original abstract
Ultra-High-Resolution (UHR) imagery has become essential for modern remote sensing, offering unprecedented spatial coverage. However, detecting small objects in such vast scenes presents a critical dilemma: retaining the original resolution for small objects causes prohibitive memory bottlenecks. Conversely, conventional compromises like image downsampling or patch cropping either erase small objects or destroy context. To break this dilemma, we propose UHR-DETR, an efficient end-to-end transformer-based detector designed for UHR imagery. First, we introduce a Coverage-Maximizing Sparse Encoder that dynamically allocates finite computational resources to informative high-resolution regions, ensuring maximum object coverage with minimal spatial redundancy. Second, we design a Global-Local Decoupled Decoder. By integrating macroscopic scene awareness with microscopic object details, this module resolves semantic ambiguities and prevents scene fragmentation. Extensive experiments on the UHR imagery datasets (e.g., STAR and SODA-A) demonstrate the superiority of UHR-DETR under strict hardware constraints (e.g., a single 24GB RTX 3090). It achieves a 2.8% mAP improvement while delivering a 10× inference speedup compared to standard sliding-window baselines on the STAR dataset. Our codes and models will be available at https://github.com/Li-JingFang/UHR-DETR.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes UHR-DETR, a transformer-based end-to-end detector for small objects in ultra-high-resolution remote sensing imagery. It introduces a Coverage-Maximizing Sparse Encoder that dynamically allocates computational resources to informative high-resolution regions for maximum object coverage with minimal redundancy, and a Global-Local Decoupled Decoder that combines macroscopic scene context with microscopic object details to reduce semantic ambiguity. Experiments on the STAR and SODA-A datasets report a 2.8% mAP gain and 10× inference speedup versus sliding-window baselines under a single 24 GB RTX 3090 constraint.
Significance. If the performance claims hold after verification, the approach could meaningfully advance practical small-object detection in UHR remote-sensing scenes by mitigating memory bottlenecks without sacrificing coverage, offering a concrete efficiency-accuracy trade-off for hardware-limited deployments.
Major comments (2)
- [Coverage-Maximizing Sparse Encoder description and experimental validation] The central claims of 2.8% mAP improvement and 10× speedup on STAR rest on the Coverage-Maximizing Sparse Encoder's premise of complete coverage of informative regions containing small objects. No quantitative coverage metric (e.g., recall of ground-truth small-object instances inside selected patches versus the full image, or saliency-threshold sensitivity analysis) is reported to bound the risk of omitting low-saliency instances across scene variations. This directly affects both the mAP and speedup figures.
- [Experiments and results] The experimental section reports aggregate gains on STAR and SODA-A but provides insufficient detail on exact baseline implementations (including sliding-window patch size, overlap, and post-processing), full ablation results isolating the Sparse Encoder and Decoupled Decoder contributions, error bars across multiple runs, or statistical significance tests. These omissions make it difficult to assess whether the reported margins are robust under the stated 24 GB hardware constraint.
Minor comments (1)
- [Abstract] The abstract states that code will be released but does not specify the exact license or repository structure; adding this would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and will incorporate revisions to improve the clarity and rigor of the manuscript.
Point-by-point responses
- Referee: [Coverage-Maximizing Sparse Encoder description and experimental validation] The central claims of 2.8% mAP improvement and 10× speedup on STAR rest on the Coverage-Maximizing Sparse Encoder's premise of complete coverage of informative regions containing small objects. No quantitative coverage metric (e.g., recall of ground-truth small-object instances inside selected patches versus the full image, or saliency-threshold sensitivity analysis) is reported to bound the risk of omitting low-saliency instances across scene variations. This directly affects both the mAP and speedup figures.
  Authors: We agree that explicitly quantifying coverage would strengthen the validation of the Sparse Encoder. In the revised manuscript, we will add a coverage analysis reporting the recall of ground-truth small-object instances contained within the dynamically selected high-resolution patches relative to the full image. We will also include a saliency-threshold sensitivity study across multiple scenes to demonstrate that the risk of omitting low-saliency objects remains low and does not undermine the reported mAP gains or speedup. These additions will be presented with supporting figures or tables; a hedged sketch of one such coverage metric appears after these responses.
  Revision: yes
- Referee: [Experiments and results] The experimental section reports aggregate gains on STAR and SODA-A but provides insufficient detail on exact baseline implementations (including sliding-window patch size, overlap, and post-processing), full ablation results isolating the Sparse Encoder and Decoupled Decoder contributions, error bars across multiple runs, or statistical significance tests. These omissions make it difficult to assess whether the reported margins are robust under the stated 24 GB hardware constraint.
  Authors: We acknowledge the need for greater experimental transparency. In the revision, we will expand the experimental section to specify the exact sliding-window baseline settings (patch size, overlap ratio, and post-processing such as NMS), provide full ablation tables isolating the Sparse Encoder and Decoupled Decoder, report error bars from multiple runs with different random seeds, and include statistical significance tests (e.g., paired t-tests) confirming the improvements. All results will remain under the single 24 GB RTX 3090 constraint to ensure fair comparison; a sketch of such a sliding-window baseline also appears below.
  Revision: yes
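To make the promised coverage analysis concrete, here is a hedged sketch of one plausible coverage metric: the recall of ground-truth small-object centers that fall inside at least one selected high-resolution patch. The function name, box format, and center-in-patch criterion are illustrative assumptions, not the authors' definition.

```python
# Hedged sketch of a coverage metric: fraction of ground-truth boxes whose
# centers land inside any selected high-resolution patch. Assumed formats,
# not the paper's implementation.
import numpy as np

def coverage_recall(gt_boxes: np.ndarray, origins, patch: int) -> float:
    """gt_boxes: (N, 4) as (x1, y1, x2, y2); origins: list of (row, col)
    patch corners; returns recall of GT centers covered by any patch."""
    if len(gt_boxes) == 0:
        return 1.0
    cx = (gt_boxes[:, 0] + gt_boxes[:, 2]) / 2.0
    cy = (gt_boxes[:, 1] + gt_boxes[:, 3]) / 2.0
    covered = np.zeros(len(gt_boxes), dtype=bool)
    for r, c in origins:  # (row, col) = (y, x) origin of a patch
        covered |= (cx >= c) & (cx < c + patch) & (cy >= r) & (cy < r + patch)
    return float(covered.mean())
```

And a minimal sketch of the kind of sliding-window baseline being specified, under assumed settings (patch size, overlap ratio, greedy NMS merge); `detector` is a placeholder for any per-patch model, not a specific API.

```python
# Minimal sliding-window baseline sketch under assumed settings; real
# pipelines also pad or add a final window so right/bottom margins are
# covered, omitted here for brevity.
import numpy as np

def nms(boxes, scores, thr):
    """Greedy non-maximum suppression; boxes are (N, 4) x1,y1,x2,y2."""
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = scores.argsort()[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter + 1e-9)
        order = order[1:][iou <= thr]
    return np.array(keep, dtype=int)

def sliding_window_detect(image, detector, patch=1024, overlap=0.2, iou_thr=0.5):
    """Run `detector` (patch -> (boxes, scores)) on overlapping tiles,
    shift detections to global coordinates, and merge with greedy NMS."""
    H, W = image.shape[:2]
    stride = int(patch * (1 - overlap))
    all_boxes, all_scores = [], []
    for y in range(0, max(H - patch, 0) + 1, stride):
        for x in range(0, max(W - patch, 0) + 1, stride):
            b, s = detector(image[y:y + patch, x:x + patch])
            if len(b):
                all_boxes.append(b + np.array([x, y, x, y]))  # to global coords
                all_scores.append(s)
    if not all_boxes:
        return np.zeros((0, 4)), np.zeros(0)
    boxes, scores = np.vstack(all_boxes), np.concatenate(all_scores)
    keep = nms(boxes, scores, iou_thr)
    return boxes[keep], scores[keep]
```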
Circularity Check
No significant circularity; empirical claims rest on dataset comparisons
Full rationale
The paper introduces an architectural proposal (Coverage-Maximizing Sparse Encoder + Global-Local Decoupled Decoder) for UHR small-object detection and reports empirical gains (2.8% mAP, 10× speedup) on public benchmarks (STAR, SODA-A) under fixed hardware. No equations, uniqueness theorems, or predictions are presented that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The central claims are falsifiable via external baselines and do not rely on internal parameter renaming or ansatz smuggling. This is the expected non-circular outcome for an applied CV architecture paper.
Reference graph
Works this paper leans on
- [1] Y. Zhang, M. Ye, G. Zhu, Y. Liu, P. Guo, and J. Yan, "FFCA-YOLO for small object detection in remote sensing images," IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–15, 2024.
- [2] J. Song, M. Zhou, J. Luo, H. Pu, Y. Feng, X. Wei, and W. Jia, "Boundary-aware feature fusion with dual-stream attention for remote sensing small object detection," IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1–13, 2025.
- [3] Y. Li, L. Wang, T. Wang, X. Yang, J. Luo, Q. Wang, Y. Deng, W. Wang, X. Sun, H. Li et al., "STAR: A first-ever dataset and a large-scale benchmark for scene graph generation in large-size satellite imagery," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 3, pp. 1832–1849, 2025.
- [4] J. Ding, N. Xue, G.-S. Xia, X. Bai, W. Yang, M. Y. Yang, S. Belongie, J. Luo, M. Datcu, M. Pelillo et al., "Object detection in aerial images: A large-scale benchmark and challenges," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 7778–7796, 2021.
- [5] F. Wang, M. Chen, Y. Li, D. Wang, H. Wang, Z. Guo, Z. Wang, S. Boqi, L. Lan, Y. Wang, H. Wang, W. Yang, B. Du, and J. Zhang, "GeoLLaVA-8k: Scaling remote-sensing multimodal large language models to 8k resolution," in Advances in Neural Information Processing Systems, 2025. [Online]. Available: https://openreview.net/forum?id=LTgUInLTbP
- [6] C. Thavamani, M. Li, N. Cebron, and D. Ramanan, "FOVEA: Foveated image magnification for autonomous navigation," in IEEE International Conference on Computer Vision, 2021, pp. 15519–15528.
- [7] C. Thavamani, M. Li, F. Ferroni, and D. Ramanan, "Learning to zoom and unzoom," in IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 5086–5095.
- [8] T. Wang, C. Lin, C. Tang, J. Zhou, D. Xiong, J. Li, J. Zhao, and J. Lv, "Adaptive image zoom-in with bounding box transformation for UAV object detection," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 233, pp. 452–466, 2026.
- [9] F. Yang, H. Fan, P. Chu, E. Blasch, and H. Ling, "Clustered object detection in aerial images," in IEEE International Conference on Computer Vision, 2019, pp. 8310–8319.
- [10] Z. Zhang, W. Xue, K. Zhang, and S. Chen, "'Skimming-perusal' detection: A simple object detection baseline in gigapixel-level images," in IEEE International Conference on Multimedia and Expo, 2023, pp. 2471–2476.
- [11] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
- [12] Y. Zhao, W. Lv, S. Xu, J. Wei, G. Wang, Q. Dang, Y. Liu, and J. Chen, "DETRs beat YOLOs on real-time object detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 16965–16974.
- [13] G. Cheng, X. Yuan, X. Yao, K. Yan, Q. Zeng, X. Xie, and J. Han, "Towards large-scale small object detection: Survey and benchmarks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 11, pp. 13467–13488, 2023.
- [14] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, "Deformable DETR: Deformable transformers for end-to-end object detection," in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=gZ9hCDWe6ke
- [15] A. Siddique, L. Zhengzhou, A. Azeem, Z. Yuting, and Y. Li, "Dynamic adaptive region transformer for tiny-object detection in remote sensing," IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1–18, 2025.
- [16] G. Cao, W. Huang, X. Lan, J. Zhang, D. Jiang, and Y. Wang, "Cross-DINO: Cross the deep MLP and transformer for small object detection," IEEE Transactions on Multimedia, vol. 27, pp. 7369–7379, 2025.
- [17] Z. Wen, Z. Yang, X. Bao, L. Zhang, X. Xiang, W. Li, and Y. Liu, "D3R-DETR: DETR with dual-domain density refinement for tiny object detection in aerial images," 2026. [Online]. Available: https://arxiv.org/abs/2601.02747
- [18] Z. Zong, G. Song, and Y. Liu, "DETRs with collaborative hybrid assignments training," in IEEE International Conference on Computer Vision, 2023, pp. 6748–6758.
- [19] Z. Tian, C. Shen, H. Chen, and T. He, "FCOS: Fully convolutional one-stage object detection," in IEEE International Conference on Computer Vision, 2019, pp. 9626–9635.
- [20] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li, "Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection," in IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 9759–9768.
- [21] Y.-X. Huang, H.-I. Liu, H.-H. Shuai, and W.-H. Cheng, "DQ-DETR: DETR with dynamic query for tiny object detection," in European Conference on Computer Vision. Springer, 2024, pp. 290–305.
- [22] X. Ye, C. Xu, H. Zhu, F. Xu, H. Zhang, and W. Yang, "Density-aware DETR with dynamic query for end-to-end tiny object detection," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 18, pp. 13554–13569, 2025.
- [23] Z. Hu, P. Wu, J. Chen, H. Zhu, Y. Wang, Y. Peng, H. Li, and X. Sun, "Dome-DETR: DETR with density-oriented feature-query manipulation for efficient tiny object detection," in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 101–110.
- [24] H. Zhu, W. Yang, G. Yang, C. Xu, R. Zhang, F. Xu, H. Zhang, and G.-S. Xia, "Generalized small object detection: A point-prompted paradigm and benchmark," 2026. [Online]. Available: https://arxiv.org/abs/2604.02773
- [25] P. Zhu, L. Wen, D. Du, X. Bian, H. Fan, Q. Hu, and H. Ling, "Detection and tracking meet drones challenge," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 7380–7399, 2021.
- [26] D. Du, Y. Qi, H. Yu, Y. Yang, K. Duan, G. Li, W. Zhang, Q. Huang, and Q. Tian, "The unmanned aerial vehicle benchmark: Object detection and tracking," in European Conference on Computer Vision, 2018, pp. 370–386.
- [27] X. Wang, X. Zhang, Y. Zhu, Y. Guo, X. Yuan, L. Xiang, Z. Wang, G. Ding, D. Brady, Q. Dai et al., "PANDA: A gigapixel-level human-centric video dataset," in IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 3268–3278.
- [28] Y. Wang, Y. Yang, and X. Zhao, "Object detection using clustering algorithm adaptive searching regions in aerial images," in European Conference on Computer Vision Workshops. Springer, 2020, pp. 651–664.
- [29] J. Fan, H. Liu, W. Yang, J. See, A. Zhang, and W. Lin, "Speed up object detection on gigapixel-level images with patch arrangement," in IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 4643–4653.
- [30] C. Li, T. Yang, S. Zhu, C. Chen, and S. Guan, "Density map guided object detection in aerial images," in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 737–746.
- [31] W. Li, R. Zhang, H. Lin, Y. Guo, C. Ma, and X. Yang, "SaccadeDet: A novel dual-stage architecture for rapid and accurate detection in gigapixel images," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2024, pp. 392–408.
- [32] K. Chen, Z. Wang, X. Wang, D. Gong, L. Yu, Y. Guo, and G. Ding, "Towards real-time object detection in gigapixel-level video," Neurocomputing, vol. 477, pp. 14–24, 2022.
- [33] B. Ehteshami Bejnordi, A. Habibian, F. Porikli, and A. Ghodrati, "SALISA: Saliency-based input sampling for efficient video object detection," in European Conference on Computer Vision. Springer, 2022, pp. 300–316.
- [34] W. Li, Y. Guo, J. Zheng, H. Lin, C. Ma, L. Fang, and X. Yang, "SparseFormer: Detecting objects in HRW shots via sparse vision transformer," in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 4851–4860.
- [35] X. Li, W. Li, Y. Wang, C. Lyu, H. Lin, G. Ding, and Y. Guo, "GigaMoE: Sparsity-guided mixture of experts for efficient gigapixel object detection," in AAAI Conference on Artificial Intelligence, vol. 40, no. 21, 2026, pp. 17553–17561.
- [36] B. Du, Y. Huang, J. Chen, and D. Huang, "Adaptive sparse convolutional networks with global context enhancement for faster object detection on drone images," in IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 13435–13444.
- [37] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
- [38] X. Li, W. Wang, L. Wu, S. Chen, X. Hu, J. Li, J. Tang, and J. Yang, "Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection," in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 21002–21012.
- [39] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 6000–6010.
- [40] S. Liu, F. Li, H. Zhang, X. Yang, X. Qi, H. Su, J. Zhu, and L. Zhang, "DAB-DETR: Dynamic anchor boxes are better queries for DETR," in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=oMI9PjOb9Jl
- [41] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2016.
- [42] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, "YOLOX: Exceeding YOLO series in 2021," 2021. [Online]. Available: https://arxiv.org/abs/2107.08430
- [44] K. Chen et al., "MMDetection: Open MMLab detection toolbox and benchmark," 2019. [Online]. Available: https://arxiv.org/abs/1906.07155
- [45] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
- [46] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=Bkg6RiCqY7