BMCR: Adaptive Backbone Module Composition via Reinforcement Learning for Remote Sensing Object Detection

Ping Zhong; Wenlin Liu; Xikun Hu

arxiv: 2606.05586 · v1 · pith:OTDF7CTPnew · submitted 2026-06-04 · 💻 cs.CV · cs.MM

BMCR: Adaptive Backbone Module Composition via Reinforcement Learning for Remote Sensing Object Detection

Wenlin Liu , Xikun Hu , Ping Zhong This is my paper

Pith reviewed 2026-06-28 02:49 UTC · model grok-4.3

classification 💻 cs.CV cs.MM

keywords remote sensingobject detectionbackbone compositionreinforcement learningoptimal transportCNNViTadaptive inference

0 comments

The pith

BMCR dynamically composes CNN and ViT modules via reinforcement learning for adaptive remote sensing object detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that remote sensing object detectors can exploit the complementary local-detail strengths of CNNs and global-context strengths of ViTs by dynamically assembling reusable modules from existing backbones rather than relying on any single fixed architecture. This matters for inputs of varying complexity because manually designed hybrids cannot adapt on the fly. The method builds an extensible module toolbox, adds an Optimal Transport interface to align grid and token features, and trains a policy network with AMCO to choose task-relevant sequences. If the approach holds, detectors gain accuracy on standard benchmarks while preserving efficiency without requiring entirely new backbone designs.

Core claim

BMCR decomposes representative CNN and ViT backbones into reusable functional modules encapsulated with structural, semantic, and computational metadata, bridges them via a lightweight Optimal Transport transition interface that aligns grid-based and token-based representations in a distribution-aware way, and formulates composition as a sequential decision process solved by a policy network that selects modules according to intermediate multi-scale observations, with Adaptive Module Cooperative Optimization coordinating updates to achieve 79.31 percent, 73.41 percent and 71.86 percent mAP on DOTA-v1.0, DOTA-v1.5 and DIOR-R.

What carries the argument

The reinforcement learning policy network that progressively selects task-relevant modules from the extensible toolbox, coordinated by AMCO and enabled by the Optimal Transport transition interface for cross-family alignment.

Load-bearing premise

The Optimal Transport based transition interface can align grid-based CNN features with token-based ViT representations in a distribution-aware manner that preserves spatial consistency and enables effective cross-family module composition without introducing errors that offset the adaptive gains.

What would settle it

An experiment showing BMCR mAP on DOTA-v1.0, DOTA-v1.5 or DIOR-R falling below the strongest static or dynamic baseline by the reported margins would falsify the performance claim.

Figures

Figures reproduced from arXiv: 2606.05586 by Ping Zhong, Wenlin Liu, Xikun Hu.

**Figure 2.** Figure 2: Overview of the proposed BMCR framework. (a) The routing agent observes intermediate features, selects task-relevant feature-extraction modules [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Illustration of the proposed AMCO algorithm. Stage 1 warms up the [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative comparison of oriented object detection results on large remote sensing images. Compared with representative static backbones, BMCR [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Accuracy–efficiency trade-off of BMCR under different computational [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: Visualization of BMCR’s dynamic routing behavior on large-scale remote sensing imagery. Cold colors indicate shallow routing in simple regions, [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗

**Figure 9.** Figure 9: Inference latency breakdown of BMCR under different routing [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗

**Figure 8.** Figure 8: Average inference time for large-image inference. BMCR reduces the [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗

read the original abstract

In remote sensing object detection, Convolutional Neural Networks (CNNs) excel at capturing local details while Vision Transformers (ViTs) are better at global context modeling. However, existing detectors typically rely on a single fixed backbone or a manually designed hybrid architecture, and thus fail to adaptively exploit these complementary strengths across inputs of diverse complexity. To address this limitation, we propose Backbone Module Composition via Reinforcement Learning (BMCR). BMCR dynamically assembles input-adaptive inference paths from reusable modules decomposed from off-the-shelf CNN and ViT backbones. To enable such cross-family composition, we first construct an extensible module toolbox. Specifically, we decompose representative CNN and ViT backbones into reusable functional modules and encapsulate each module with explicit structural, semantic, and computational metadata for compatibility-aware assembly. To bridge the gap between grid-based CNN features and token-based ViT representations, we design a lightweight Optimal Transport (OT) based transition interface that ensures distribution-aware alignment while respecting spatial consistency. The backbone composition process is then formulated as a sequential decision problem, in which a policy network progressively selects task-relevant modules according to intermediate multi-scale observations. To stabilize the joint optimization of reusable modules and the routing policy, we further develop an Adaptive Module Cooperative Optimization (AMCO) strategy that coordinates module updating, routing exploration, and reward assignment during training. On DOTA-v1.0, DOTA-v1.5 and DIOR-R, BMCR achieves 79.31\%, 73.41\% and 71.86\% mAP, respectively, surpassing strong static and dynamic baselines by up to 2.5 points while maintaining competitive efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BMCR uses RL to route between decomposed CNN and ViT modules for remote sensing detection and reports modest mAP gains, but the abstract gives almost no implementation or validation detail.

read the letter

The core claim is that reinforcement learning can dynamically assemble input-specific backbones by picking reusable modules from both CNN and ViT families, with an optimal transport layer to handle the grid-to-token mismatch. That synthesis is the main thing a colleague should note.

What stands out as new is the explicit decomposition of off-the-shelf backbones into a shared module toolbox plus the AMCO training loop that tries to coordinate module updates with the routing policy. The reported numbers on DOTA-v1.0, DOTA-v1.5, and DIOR-R (79.31 %, 73.41 %, 71.86 % mAP) beat the listed static and dynamic baselines by up to 2.5 points while staying competitive on speed. Those are concrete, if narrow, results.

The description stays high-level. No equations appear for the OT alignment, the policy network, or the reward function, and the abstract supplies no ablation tables, statistical tests, or protocol details. It is therefore impossible to judge whether the OT step actually preserves spatial consistency or whether the joint optimization avoids circular reward signals. Those gaps are real and limit how far the claims can be trusted right now.

The work is aimed at researchers already working on hybrid or adaptive detectors in aerial imagery. Anyone looking for a practical RL routing scheme in that niche could get something out of the full paper once the missing controls are added.

It is worth sending to peer review. The idea is coherent enough and the benchmarks are standard, so referees can ask for the necessary ablations and equations without starting from zero.

Referee Report

2 major / 0 minor

Summary. The paper proposes BMCR for remote sensing object detection, which decomposes off-the-shelf CNN and ViT backbones into reusable modules stored in an extensible toolbox, employs a lightweight Optimal Transport based transition interface to align grid-based CNN features with token-based ViT representations, formulates backbone composition as a sequential decision process solved by a policy network, and introduces an Adaptive Module Cooperative Optimization (AMCO) strategy to jointly train the modules and routing policy. It reports mAP values of 79.31%, 73.41% and 71.86% on DOTA-v1.0, DOTA-v1.5 and DIOR-R respectively, claiming gains of up to 2.5 points over strong static and dynamic baselines while preserving efficiency.

Significance. If the central claim of effective cross-family adaptive composition is substantiated, the work would be significant for remote sensing detection by providing a principled mechanism to exploit complementary local-detail and global-context strengths on a per-input basis rather than relying on fixed or manually designed hybrids.

major comments (2)

[Abstract] Abstract: the performance claims (79.31% mAP on DOTA-v1.0 etc.) and superiority statements are presented without any experimental protocol, baseline definitions, statistical tests, ablation studies, or implementation details, so the central empirical claim cannot be evaluated.
[Abstract] Abstract: the Optimal Transport based transition interface and AMCO strategy are described only at high level with no equations, pseudocode, or formal definitions, preventing assessment of whether the alignment preserves spatial consistency or whether the reward signal is independent of fitted quantities.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments on our submission. We provide point-by-point responses to the major comments below.

read point-by-point responses

Referee: [Abstract] Abstract: the performance claims (79.31% mAP on DOTA-v1.0 etc.) and superiority statements are presented without any experimental protocol, baseline definitions, statistical tests, ablation studies, or implementation details, so the central empirical claim cannot be evaluated.

Authors: The abstract is designed to be a concise summary of the paper's contributions and results, adhering to typical length constraints. The full experimental protocol, baseline definitions, statistical tests, ablation studies, and implementation details are thoroughly described in Sections 4.1 through 4.4 of the manuscript. We believe the central claims can be evaluated from the complete paper, and the abstract highlights the key outcomes. revision: no
Referee: [Abstract] Abstract: the Optimal Transport based transition interface and AMCO strategy are described only at high level with no equations, pseudocode, or formal definitions, preventing assessment of whether the alignment preserves spatial consistency or whether the reward signal is independent of fitted quantities.

Authors: Similar to the performance claims, the abstract provides a high-level overview of the proposed OT-based transition interface and AMCO strategy. Detailed equations, formal definitions, pseudocode (including Algorithm 1), and discussions on spatial consistency and reward signal independence are presented in Sections 3.2 and 3.4 of the manuscript. These sections allow for full assessment of the technical aspects. revision: no

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The abstract and available text describe BMCR at a conceptual level (module decomposition from CNN/ViT backbones, OT-based transition interface, policy network for sequential decisions, and AMCO for joint optimization) without presenting any equations, fitted parameters, or derivations. No self-citations, uniqueness theorems, or ansatzes are quoted that reduce a claimed prediction or result to an input by construction. The reported mAP gains are presented as empirical outcomes on DOTA and DIOR-R datasets rather than tautological outputs of the method's own definitions. Absent specific technical details or equations from the full manuscript that exhibit reduction (e.g., reward signal equaling a fitted quantity), no load-bearing circular steps exist.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no concrete free parameters, axioms, or invented entities; the method implicitly assumes that modules remain functionally independent after decomposition and that the OT interface introduces negligible distortion, but these cannot be audited without the full text.

pith-pipeline@v0.9.1-grok · 5832 in / 1176 out tokens · 39716 ms · 2026-06-28T02:49:17.281134+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

59 extracted references · 1 linked inside Pith

[1]

Artificial intelligence to advance earth observation: A review of models, recent trends, and pathways forward,

D. Tuia, K. Schindler, B. Demir, X. X. Zhu, M. Kochupillai, S. D ˇzeroski, J. N. van Rijn, H. H. Hoos, F. Del Frate, M. Datcuet al., “Artificial intelligence to advance earth observation: A review of models, recent trends, and pathways forward,”IEEE Geoscience and Remote Sensing Magazine, 2024

2024
[2]

Open high-resolution satellite imagery: The worldstrat dataset–with application to super-resolution,

J. Cornebise, I. Or ˇsoli´c, and F. Kalaitzis, “Open high-resolution satellite imagery: The worldstrat dataset–with application to super-resolution,” in Advances in Neural Information Processing Systems, vol. 35, 2022, pp. 25 979–25 991

2022
[3]

Skysense: A multi-modal remote sensing foundation model towards universal interpretation for earth observation imagery,

X. Guo, J. Lao, B. Dang, Y . Zhang, L. Yu, L. Ru, L. Zhong, Z. Huang, K. Wu, D. Hu, H. He, J. Wang, J. Chen, M. Yang, Y . Zhang, and Y . Li, “Skysense: A multi-modal remote sensing foundation model towards universal interpretation for earth observation imagery,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2024...

2024
[4]

Ringmo: A remote sensing foundation model with masked image modeling,

X. Sun, P. Wang, W. Lu, Z. Zhu, X. Lu, Q. He, J. Li, X. Rong, Z. Yang, H. Chang, Q. He, G. Yang, R. Wang, J. Lu, and K. Fu, “Ringmo: A remote sensing foundation model with masked image modeling,”IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–22, 2023

2023
[5]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778

2016
[6]

An image is worth 16x16 words: Transformers for image recognition at scale,

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” inInternational Conference on Learning Representations, 2021

2021
[7]

A battle of network structures: An empirical study of CNN, transformer, and MLP,

Y . Zhao, G. Wang, C. Tang, C. Luo, W. Zeng, and Z.-J. Zha, “A battle of network structures: An empirical study of CNN, transformer, and MLP,” arXiv preprint arXiv:2108.13002, 2021

arXiv 2021
[8]

Dynamicvis: An efficient and general visual foundation model for remote sensing image understanding,

K. Chen, C. Liu, B. Chen, W. Li, Z. Zou, and Z. Shi, “Dynamicvis: An efficient and general visual foundation model for remote sensing image understanding,”arXiv preprint arXiv:2503.16426, 2025

arXiv 2025
[9]

Fastervit: Fast vision transformers with hierarchical attention,

A. Hatamizadeh, G. Heinrich, H. Yin, A. Tao, J. M. Alvarez, J. Kautz, and P. Molchanov, “Fastervit: Fast vision transformers with hierarchical attention,” inInternational Conference on Learning Representations, B. Kim, Y . Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y . Sun, Eds., vol. 2024, 2024, pp. 29 368–29 391

2024
[10]

Path-restore: Learning network path selection for image restoration,

K. Yu, X. Wang, C. Dong, X. Tang, and C. C. Loy, “Path-restore: Learning network path selection for image restoration,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 10, pp. 7078–7092, 2022

2022
[11]

Pathnet: Path- selective point cloud denoising,

Z. Wei, H. Chen, L. Nan, J. Wang, J. Qin, and M. Wei, “Pathnet: Path- selective point cloud denoising,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 6, pp. 4426–4442, 2024

2024
[12]

Deep learning in multimodal remote sensing data fusion: A comprehen- sive review,

J. Li, D. Hong, L. Gao, J. Yao, K. Zheng, B. Zhang, and J. Chanussot, “Deep learning in multimodal remote sensing data fusion: A comprehen- sive review,”International Journal of Applied Earth Observation and Geoinformation, vol. 112, p. 102926, 2022

2022
[13]

Efficient adaptive feature fusion network for remote-sensing image super-resolution,

S. Hao, S. Liu, X. Jia, H. Lu, and Y . He, “Efficient adaptive feature fusion network for remote-sensing image super-resolution,”IEEE Signal Processing Letters, 2024

2024
[14]

DOTA: A large-scale dataset for object detection in aerial images,

G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, and L. Zhang, “DOTA: A large-scale dataset for object detection in aerial images,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 3974–3983

2018
[15]

Branchynet: Fast inference via early exiting from deep neural networks,

S. Teerapittayanon, B. McDanel, and H.-T. Kung, “Branchynet: Fast inference via early exiting from deep neural networks,” inInternational Conference on Pattern Recognition. IEEE, 2016, pp. 2464–2469

2016
[16]

Beem: Boosting performance of early exit dnns using multi-exit classifiers as experts,

D. J. Bajpai and M. K. Hanawal, “Beem: Boosting performance of early exit dnns using multi-exit classifiers as experts,” inInternational Conference on Learning Representations, Y . Yue, A. Garg, N. Peng, F. Sha, and R. Yu, Eds., vol. 2025, 2025, pp. 62 520–62 535

2025
[17]

Skipnet: Learning dynamic routing in convolutional networks,

X. Wang, F. Yu, Z.-Y . Dou, T. Darrell, and J. E. Gonzalez, “Skipnet: Learning dynamic routing in convolutional networks,” inProceedings of the European Conference on Computer Vision, 2018, pp. 409–424

2018
[18]

Not all layers of llms are necessary during inference,

S. Fan, X. Jiang, X. Li, X. Meng, P. Han, S. Shang, A. Sun, Y . Wang, and Z. Wang, “Not all layers of llms are necessary during inference,” arXiv preprint arXiv:2403.02181, 2024

arXiv 2024
[19]

Skipdiff: Adaptive skip diffusion model for high-fidelity perceptual image super-resolution,

X. Luo, Y . Xie, Y . Qu, and Y . Fu, “Skipdiff: Adaptive skip diffusion model for high-fidelity perceptual image super-resolution,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 5, 2024, pp. 4017–4025

2024
[20]

Dynamic convolution: Attention over convolution kernels,

Y . Chen, X. Dai, M. Liu, D. Chen, L. Yuan, and Z. Liu, “Dynamic convolution: Attention over convolution kernels,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11 030–11 039

2020
[21]

Deformable convolutional networks,

J. Dai, H. Qi, Y . Xiong, Y . Li, G. Zhang, H. Hu, and Y . Wei, “Deformable convolutional networks,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2017, pp. 764–773

2017
[22]

Internimage: Exploring large-scale vision foundation models with deformable convolutions,

W. Wang, J. Dai, Z. Chen, Z. Huang, Z. Li, X. Zhu, X. Hu, T. Lu, L. Lu, H. Liet al., “Internimage: Exploring large-scale vision foundation models with deformable convolutions,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14 408–14 419

2023
[23]

Lsknet: Large selective kernel network for remote sensing object detection,

Y . Li, Q. Hou, and Z. Zheng, “Lsknet: Large selective kernel network for remote sensing object detection,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4–6

2023
[24]

Transxnet: learning both global and local dynamics with a dual dynamic token mixer for visual recognition,

M. Lou, S. Zhang, H.-Y . Zhou, S. Yang, C. Wu, and Y . Yu, “Transxnet: learning both global and local dynamics with a dual dynamic token mixer for visual recognition,”IEEE Transactions on Neural Networks and Learning Systems, 2025

2025
[25]

Vision transformer adapter for dense predictions,

Z. Chen, Y . Duan, W. Wang, J. He, T. Lu, J. Dai, and Y . Qiao, “Vision transformer adapter for dense predictions,” inInternational Conference on Learning Representations, 2023

2023
[26]

Conformer: Local features coupling global representations for visual recognition,

Z. Peng, W. Huang, S. Gu, L. Xie, Y . Wang, J. Jiao, and Q. Ye, “Conformer: Local features coupling global representations for visual recognition,”arXiv preprint arXiv:2105.03889, 2021

arXiv 2021
[27]

Coatnet: Marrying convolution and attention for all data sizes,

Z. Dai, H. Liu, Q. V . Le, and M. Tan, “Coatnet: Marrying convolution and attention for all data sizes,”Advances in Neural Information Processing Systems, vol. 34, pp. 3965–3977, 2021

2021
[28]

Next-ViT: Next generation vision transformer for efficient deployment in realistic industrial scenarios,

J. Li, X. Xia, W. Li, H. Li, X. Wang, X. Xiao, R. Wang, M. Zheng, and X. Pan, “Next-ViT: Next generation vision transformer for efficient deployment in realistic industrial scenarios,”CoRR, 2022

2022
[29]

Cta-net: A CNN-transformer aggregation network for improving multi-scale feature extraction,

C. Meng, J. Yang, W. Lin, B. Liu, H. Zhang, Z. Ganet al., “Cta-net: A CNN-transformer aggregation network for improving multi-scale feature extraction,”arXiv preprint arXiv:2410.11428, 2024

arXiv 2024
[30]

Learning when and where to zoom with deep reinforcement learning,

B. Uzkent and S. Ermon, “Learning when and where to zoom with deep reinforcement learning,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 12 345–12 354

2020
[31]

Deep reinforcement learning for band selection in hyperspectral image classification,

L. Mou, S. Saha, Y . Hua, F. Bovolo, L. Bruzzone, and X. X. Zhu, “Deep reinforcement learning for band selection in hyperspectral image classification,”IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–14, 2021

2021
[32]

Seeing beyond the patch: Scale-adaptive semantic segmentation of high-resolution remote sensing imagery based on reinforcement learning,

Y . Liu, S. Shi, J. Wang, and Y . Zhong, “Seeing beyond the patch: Scale-adaptive semantic segmentation of high-resolution remote sensing imagery based on reinforcement learning,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 16 868–16 878. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 14

2023
[33]

Scale-aware deep reinforcement learning for high resolution remote sensing imagery classification,

Y . Liu, Y . Zhong, S. Shi, and L. Zhang, “Scale-aware deep reinforcement learning for high resolution remote sensing imagery classification,”ISPRS Journal of Photogrammetry and Remote Sensing, vol. 209, pp. 296–311, 2024

2024
[34]

On learning intrinsic rewards for policy gradient methods,

Z. Zheng, J. Oh, and S. Singh, “On learning intrinsic rewards for policy gradient methods,” inAdvances in Neural Information Processing Systems, 2018, pp. 4649–4659

2018
[35]

Swin transformer: Hierarchical vision transformer using shifted windows,

Z. Liu, Y . Lin, Y . Cao, H. Hu, Y . Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10 012–10 022

2021
[36]

Swin transformer v2: Scaling up capacity and resolution,

Z. Liu, H. Hu, Y . Lin, Z. Yao, Z. Xie, Y . Wei, J. Ning, Y . Cao, Z. Zhang, L. Donget al., “Swin transformer v2: Scaling up capacity and resolution,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12 009–12 019

2022
[37]

Vitae: Vision transformer advanced by exploring intrinsic inductive bias,

Y . Xu, Q. Zhang, J. Zhang, and D. Tao, “Vitae: Vision transformer advanced by exploring intrinsic inductive bias,”Advances in Neural Information Processing Systems, vol. 34, pp. 28 522–28 535, 2021

2021
[38]

An empirical study of remote sensing pretraining,

D. Wang, J. Zhang, B. Du, G.-S. Xia, and D. Tao, “An empirical study of remote sensing pretraining,”IEEE Transactions on Geoscience and Remote Sensing, vol. 61, 2022

2022
[39]

Squeeze-and-excitation networks,

J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141

2018
[40]

Coordinate attention for efficient mobile network design,

Q. Hou, C. Wang, D. Cheng, X. Cai, G. Xu, and Y . Wang, “Coordinate attention for efficient mobile network design,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14 310–14 320

2021
[41]

Prox- imal policy optimization algorithms,

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox- imal policy optimization algorithms,”arXiv preprint arXiv:1707.06347, 2017

Pith/arXiv arXiv 2017
[42]

Object detection in aerial images: A large-scale benchmark and challenges,

J. Ding, N. Xue, G.-S. Xia, X. Bai, W. Yang, M. Y . Yang, S. Belongie, J. Luo, M. Datcu, M. Pelillo, and L. Zhang, “Object detection in aerial images: A large-scale benchmark and challenges,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 7778–7796, 2022

2022
[43]

Object detection in optical remote sensing images: A survey and a new benchmark,

K. Li, G. Wan, G. Cheng, L. Meng, and J. Han, “Object detection in optical remote sensing images: A survey and a new benchmark,”ISPRS Journal of Photogrammetry and Remote Sensing, vol. 159, pp. 296–307, 2020

2020
[44]

FAIR1M: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery,

X. Sun, P. Wang, Z. Yan, F. Xu, R. Wang, W. Diao, J. Chen, J. Li, Y . Feng, T. Xuet al., “FAIR1M: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery,”ISPRS Journal of Photogrammetry and Remote Sensing, vol. 184, pp. 116–130, 2022

2022
[45]

Emo2-DETR: Efficient-matching oriented object detection with transformers,

Z. Hu, K. Gao, X. Zhang, J. Wang, H. Wang, Z. Yang, C. Li, and W. Li, “Emo2-DETR: Efficient-matching oriented object detection with transformers,”IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–14, 2023

2023
[46]

Oriented R-CNN for object detection,

X. Xie, G. Cheng, J. Wang, X. Yao, and J. Han, “Oriented R-CNN for object detection,” inProceedings of the IEEE/CVF International Conference on Computer Vision, October 2021, pp. 3520–3529

2021
[47]

Orientedformer: An end-to-end transformer-based oriented object detector in remote sensing images,

J. Zhao, Z. Ding, Y . Zhou, H. Zhu, W.-L. Du, R. Yao, and A. El Sad- dik, “Orientedformer: An end-to-end transformer-based oriented object detector in remote sensing images,”IEEE Transactions on Geoscience and Remote Sensing, 2024

2024
[48]

Imagenet large scale visual recognition challenge,

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernsteinet al., “Imagenet large scale visual recognition challenge,”International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015

2015
[49]

Poly kernel inception network for remote sensing detection,

X. Cai, Q. Lai, Y . Wang, W. Wang, Z. Sun, and Y . Yao, “Poly kernel inception network for remote sensing detection,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 27 706–27 716

2024
[50]

Advancing plain vision transformer toward remote sensing foundation model,

D. Wang, Q. Zhang, Y . Xu, J. Zhang, B. Du, D. Tao, and L. Zhang, “Advancing plain vision transformer toward remote sensing foundation model,”IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–15, 2022

2022
[51]

Adaptive rotated convolution for rotated object detection,

Y . Pu, Y . Wang, Z. Xia, Y . Han, Y . Wang, W. Gan, Z. Wang, S. Song, and G. Huang, “Adaptive rotated convolution for rotated object detection,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6589–6600

2023
[52]

Learning roi transformer for oriented object detection in aerial images,

J. Ding, N. Xue, Y . Long, G.-S. Xia, and Q. Lu, “Learning roi transformer for oriented object detection in aerial images,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2844–2853

2019
[53]

A billion-scale foundation model for remote sensing images,

K. Cha, J. Seo, and T. Lee, “A billion-scale foundation model for remote sensing images,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2024

2024
[54]

The kfiou loss for rotated object detection,

X. Yang, Y . Zhou, G. Zhang, J. Yang, W. Wang, J. Yan, X. Zhang, and Q. Tian, “The kfiou loss for rotated object detection,”arXiv preprint arXiv:2201.12558, 2022

arXiv 2022
[55]

Rqformer: Rotated query transformer for end-to-end oriented object detection,

J. Zhao, Z. Ding, Y . Zhou, H. Zhu, W.-L. Du, R. Yao, and A. El Saddik, “Rqformer: Rotated query transformer for end-to-end oriented object detection,”Expert Systems with Applications, vol. 266, p. 126034, 2025

2025
[56]

A unified remote sensing object detector based on fourier contour parametric learning,

T. Zhang, Y . Zhuang, G. Wang, H. Chen, L. Li, and J. Li, “A unified remote sensing object detector based on fourier contour parametric learning,”IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1–25, 2025

2025
[57]

Ars-DETR: Aspect ratio-sensitive detection transformer for aerial oriented object detection,

Y . Zeng, Y . Chen, X. Yang, Q. Li, and J. Yan, “Ars-DETR: Aspect ratio-sensitive detection transformer for aerial oriented object detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–15, 2024

2024
[58]

Strip r-CNN: Large strip convolution for remote sensing object detection,

X. Yuan, Z. Zheng, Y . Li, X. Liu, L. Liu, X. Li, Q. Hou, and M.- M. Cheng, “Strip r-CNN: Large strip convolution for remote sensing object detection,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 15, 2026, pp. 12 259–12 267

2026
[59]

Stochastic beams and where to find them: The gumbel-top-k trick for sampling sequences without replacement,

W. Kool, H. Van Hoof, and M. Welling, “Stochastic beams and where to find them: The gumbel-top-k trick for sampling sequences without replacement,” inProceedings of the International Conference on Machine Learning. PMLR, 2019, pp. 3499–3508. Wenlin Liureceived his BS degree from the Nanjing University of Aeronautics and Astronautics in 2021, and he is cur...

2019

[1] [1]

Artificial intelligence to advance earth observation: A review of models, recent trends, and pathways forward,

D. Tuia, K. Schindler, B. Demir, X. X. Zhu, M. Kochupillai, S. D ˇzeroski, J. N. van Rijn, H. H. Hoos, F. Del Frate, M. Datcuet al., “Artificial intelligence to advance earth observation: A review of models, recent trends, and pathways forward,”IEEE Geoscience and Remote Sensing Magazine, 2024

2024

[2] [2]

Open high-resolution satellite imagery: The worldstrat dataset–with application to super-resolution,

J. Cornebise, I. Or ˇsoli´c, and F. Kalaitzis, “Open high-resolution satellite imagery: The worldstrat dataset–with application to super-resolution,” in Advances in Neural Information Processing Systems, vol. 35, 2022, pp. 25 979–25 991

2022

[3] [3]

Skysense: A multi-modal remote sensing foundation model towards universal interpretation for earth observation imagery,

X. Guo, J. Lao, B. Dang, Y . Zhang, L. Yu, L. Ru, L. Zhong, Z. Huang, K. Wu, D. Hu, H. He, J. Wang, J. Chen, M. Yang, Y . Zhang, and Y . Li, “Skysense: A multi-modal remote sensing foundation model towards universal interpretation for earth observation imagery,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2024...

2024

[4] [4]

Ringmo: A remote sensing foundation model with masked image modeling,

X. Sun, P. Wang, W. Lu, Z. Zhu, X. Lu, Q. He, J. Li, X. Rong, Z. Yang, H. Chang, Q. He, G. Yang, R. Wang, J. Lu, and K. Fu, “Ringmo: A remote sensing foundation model with masked image modeling,”IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–22, 2023

2023

[5] [5]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778

2016

[6] [6]

An image is worth 16x16 words: Transformers for image recognition at scale,

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” inInternational Conference on Learning Representations, 2021

2021

[7] [7]

A battle of network structures: An empirical study of CNN, transformer, and MLP,

Y . Zhao, G. Wang, C. Tang, C. Luo, W. Zeng, and Z.-J. Zha, “A battle of network structures: An empirical study of CNN, transformer, and MLP,” arXiv preprint arXiv:2108.13002, 2021

arXiv 2021

[8] [8]

Dynamicvis: An efficient and general visual foundation model for remote sensing image understanding,

K. Chen, C. Liu, B. Chen, W. Li, Z. Zou, and Z. Shi, “Dynamicvis: An efficient and general visual foundation model for remote sensing image understanding,”arXiv preprint arXiv:2503.16426, 2025

arXiv 2025

[9] [9]

Fastervit: Fast vision transformers with hierarchical attention,

A. Hatamizadeh, G. Heinrich, H. Yin, A. Tao, J. M. Alvarez, J. Kautz, and P. Molchanov, “Fastervit: Fast vision transformers with hierarchical attention,” inInternational Conference on Learning Representations, B. Kim, Y . Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y . Sun, Eds., vol. 2024, 2024, pp. 29 368–29 391

2024

[10] [10]

Path-restore: Learning network path selection for image restoration,

K. Yu, X. Wang, C. Dong, X. Tang, and C. C. Loy, “Path-restore: Learning network path selection for image restoration,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 10, pp. 7078–7092, 2022

2022

[11] [11]

Pathnet: Path- selective point cloud denoising,

Z. Wei, H. Chen, L. Nan, J. Wang, J. Qin, and M. Wei, “Pathnet: Path- selective point cloud denoising,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 6, pp. 4426–4442, 2024

2024

[12] [12]

Deep learning in multimodal remote sensing data fusion: A comprehen- sive review,

J. Li, D. Hong, L. Gao, J. Yao, K. Zheng, B. Zhang, and J. Chanussot, “Deep learning in multimodal remote sensing data fusion: A comprehen- sive review,”International Journal of Applied Earth Observation and Geoinformation, vol. 112, p. 102926, 2022

2022

[13] [13]

Efficient adaptive feature fusion network for remote-sensing image super-resolution,

S. Hao, S. Liu, X. Jia, H. Lu, and Y . He, “Efficient adaptive feature fusion network for remote-sensing image super-resolution,”IEEE Signal Processing Letters, 2024

2024

[14] [14]

DOTA: A large-scale dataset for object detection in aerial images,

G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, and L. Zhang, “DOTA: A large-scale dataset for object detection in aerial images,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 3974–3983

2018

[15] [15]

Branchynet: Fast inference via early exiting from deep neural networks,

S. Teerapittayanon, B. McDanel, and H.-T. Kung, “Branchynet: Fast inference via early exiting from deep neural networks,” inInternational Conference on Pattern Recognition. IEEE, 2016, pp. 2464–2469

2016

[16] [16]

Beem: Boosting performance of early exit dnns using multi-exit classifiers as experts,

D. J. Bajpai and M. K. Hanawal, “Beem: Boosting performance of early exit dnns using multi-exit classifiers as experts,” inInternational Conference on Learning Representations, Y . Yue, A. Garg, N. Peng, F. Sha, and R. Yu, Eds., vol. 2025, 2025, pp. 62 520–62 535

2025

[17] [17]

Skipnet: Learning dynamic routing in convolutional networks,

X. Wang, F. Yu, Z.-Y . Dou, T. Darrell, and J. E. Gonzalez, “Skipnet: Learning dynamic routing in convolutional networks,” inProceedings of the European Conference on Computer Vision, 2018, pp. 409–424

2018

[18] [18]

Not all layers of llms are necessary during inference,

S. Fan, X. Jiang, X. Li, X. Meng, P. Han, S. Shang, A. Sun, Y . Wang, and Z. Wang, “Not all layers of llms are necessary during inference,” arXiv preprint arXiv:2403.02181, 2024

arXiv 2024

[19] [19]

Skipdiff: Adaptive skip diffusion model for high-fidelity perceptual image super-resolution,

X. Luo, Y . Xie, Y . Qu, and Y . Fu, “Skipdiff: Adaptive skip diffusion model for high-fidelity perceptual image super-resolution,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 5, 2024, pp. 4017–4025

2024

[20] [20]

Dynamic convolution: Attention over convolution kernels,

Y . Chen, X. Dai, M. Liu, D. Chen, L. Yuan, and Z. Liu, “Dynamic convolution: Attention over convolution kernels,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11 030–11 039

2020

[21] [21]

Deformable convolutional networks,

J. Dai, H. Qi, Y . Xiong, Y . Li, G. Zhang, H. Hu, and Y . Wei, “Deformable convolutional networks,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2017, pp. 764–773

2017

[22] [22]

Internimage: Exploring large-scale vision foundation models with deformable convolutions,

W. Wang, J. Dai, Z. Chen, Z. Huang, Z. Li, X. Zhu, X. Hu, T. Lu, L. Lu, H. Liet al., “Internimage: Exploring large-scale vision foundation models with deformable convolutions,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14 408–14 419

2023

[23] [23]

Lsknet: Large selective kernel network for remote sensing object detection,

Y . Li, Q. Hou, and Z. Zheng, “Lsknet: Large selective kernel network for remote sensing object detection,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4–6

2023

[24] [24]

Transxnet: learning both global and local dynamics with a dual dynamic token mixer for visual recognition,

M. Lou, S. Zhang, H.-Y . Zhou, S. Yang, C. Wu, and Y . Yu, “Transxnet: learning both global and local dynamics with a dual dynamic token mixer for visual recognition,”IEEE Transactions on Neural Networks and Learning Systems, 2025

2025

[25] [25]

Vision transformer adapter for dense predictions,

Z. Chen, Y . Duan, W. Wang, J. He, T. Lu, J. Dai, and Y . Qiao, “Vision transformer adapter for dense predictions,” inInternational Conference on Learning Representations, 2023

2023

[26] [26]

Conformer: Local features coupling global representations for visual recognition,

Z. Peng, W. Huang, S. Gu, L. Xie, Y . Wang, J. Jiao, and Q. Ye, “Conformer: Local features coupling global representations for visual recognition,”arXiv preprint arXiv:2105.03889, 2021

arXiv 2021

[27] [27]

Coatnet: Marrying convolution and attention for all data sizes,

Z. Dai, H. Liu, Q. V . Le, and M. Tan, “Coatnet: Marrying convolution and attention for all data sizes,”Advances in Neural Information Processing Systems, vol. 34, pp. 3965–3977, 2021

2021

[28] [28]

Next-ViT: Next generation vision transformer for efficient deployment in realistic industrial scenarios,

J. Li, X. Xia, W. Li, H. Li, X. Wang, X. Xiao, R. Wang, M. Zheng, and X. Pan, “Next-ViT: Next generation vision transformer for efficient deployment in realistic industrial scenarios,”CoRR, 2022

2022

[29] [29]

Cta-net: A CNN-transformer aggregation network for improving multi-scale feature extraction,

C. Meng, J. Yang, W. Lin, B. Liu, H. Zhang, Z. Ganet al., “Cta-net: A CNN-transformer aggregation network for improving multi-scale feature extraction,”arXiv preprint arXiv:2410.11428, 2024

arXiv 2024

[30] [30]

Learning when and where to zoom with deep reinforcement learning,

B. Uzkent and S. Ermon, “Learning when and where to zoom with deep reinforcement learning,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 12 345–12 354

2020

[31] [31]

Deep reinforcement learning for band selection in hyperspectral image classification,

L. Mou, S. Saha, Y . Hua, F. Bovolo, L. Bruzzone, and X. X. Zhu, “Deep reinforcement learning for band selection in hyperspectral image classification,”IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–14, 2021

2021

[32] [32]

Seeing beyond the patch: Scale-adaptive semantic segmentation of high-resolution remote sensing imagery based on reinforcement learning,

Y . Liu, S. Shi, J. Wang, and Y . Zhong, “Seeing beyond the patch: Scale-adaptive semantic segmentation of high-resolution remote sensing imagery based on reinforcement learning,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 16 868–16 878. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 14

2023

[33] [33]

Scale-aware deep reinforcement learning for high resolution remote sensing imagery classification,

Y . Liu, Y . Zhong, S. Shi, and L. Zhang, “Scale-aware deep reinforcement learning for high resolution remote sensing imagery classification,”ISPRS Journal of Photogrammetry and Remote Sensing, vol. 209, pp. 296–311, 2024

2024

[34] [34]

On learning intrinsic rewards for policy gradient methods,

Z. Zheng, J. Oh, and S. Singh, “On learning intrinsic rewards for policy gradient methods,” inAdvances in Neural Information Processing Systems, 2018, pp. 4649–4659

2018

[35] [35]

Swin transformer: Hierarchical vision transformer using shifted windows,

Z. Liu, Y . Lin, Y . Cao, H. Hu, Y . Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10 012–10 022

2021

[36] [36]

Swin transformer v2: Scaling up capacity and resolution,

Z. Liu, H. Hu, Y . Lin, Z. Yao, Z. Xie, Y . Wei, J. Ning, Y . Cao, Z. Zhang, L. Donget al., “Swin transformer v2: Scaling up capacity and resolution,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12 009–12 019

2022

[37] [37]

Vitae: Vision transformer advanced by exploring intrinsic inductive bias,

Y . Xu, Q. Zhang, J. Zhang, and D. Tao, “Vitae: Vision transformer advanced by exploring intrinsic inductive bias,”Advances in Neural Information Processing Systems, vol. 34, pp. 28 522–28 535, 2021

2021

[38] [38]

An empirical study of remote sensing pretraining,

D. Wang, J. Zhang, B. Du, G.-S. Xia, and D. Tao, “An empirical study of remote sensing pretraining,”IEEE Transactions on Geoscience and Remote Sensing, vol. 61, 2022

2022

[39] [39]

Squeeze-and-excitation networks,

J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141

2018

[40] [40]

Coordinate attention for efficient mobile network design,

Q. Hou, C. Wang, D. Cheng, X. Cai, G. Xu, and Y . Wang, “Coordinate attention for efficient mobile network design,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14 310–14 320

2021

[41] [41]

Prox- imal policy optimization algorithms,

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox- imal policy optimization algorithms,”arXiv preprint arXiv:1707.06347, 2017

Pith/arXiv arXiv 2017

[42] [42]

Object detection in aerial images: A large-scale benchmark and challenges,

J. Ding, N. Xue, G.-S. Xia, X. Bai, W. Yang, M. Y . Yang, S. Belongie, J. Luo, M. Datcu, M. Pelillo, and L. Zhang, “Object detection in aerial images: A large-scale benchmark and challenges,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 7778–7796, 2022

2022

[43] [43]

Object detection in optical remote sensing images: A survey and a new benchmark,

K. Li, G. Wan, G. Cheng, L. Meng, and J. Han, “Object detection in optical remote sensing images: A survey and a new benchmark,”ISPRS Journal of Photogrammetry and Remote Sensing, vol. 159, pp. 296–307, 2020

2020

[44] [44]

FAIR1M: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery,

X. Sun, P. Wang, Z. Yan, F. Xu, R. Wang, W. Diao, J. Chen, J. Li, Y . Feng, T. Xuet al., “FAIR1M: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery,”ISPRS Journal of Photogrammetry and Remote Sensing, vol. 184, pp. 116–130, 2022

2022

[45] [45]

Emo2-DETR: Efficient-matching oriented object detection with transformers,

Z. Hu, K. Gao, X. Zhang, J. Wang, H. Wang, Z. Yang, C. Li, and W. Li, “Emo2-DETR: Efficient-matching oriented object detection with transformers,”IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–14, 2023

2023

[46] [46]

Oriented R-CNN for object detection,

X. Xie, G. Cheng, J. Wang, X. Yao, and J. Han, “Oriented R-CNN for object detection,” inProceedings of the IEEE/CVF International Conference on Computer Vision, October 2021, pp. 3520–3529

2021

[47] [47]

Orientedformer: An end-to-end transformer-based oriented object detector in remote sensing images,

J. Zhao, Z. Ding, Y . Zhou, H. Zhu, W.-L. Du, R. Yao, and A. El Sad- dik, “Orientedformer: An end-to-end transformer-based oriented object detector in remote sensing images,”IEEE Transactions on Geoscience and Remote Sensing, 2024

2024

[48] [48]

Imagenet large scale visual recognition challenge,

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernsteinet al., “Imagenet large scale visual recognition challenge,”International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015

2015

[49] [49]

Poly kernel inception network for remote sensing detection,

X. Cai, Q. Lai, Y . Wang, W. Wang, Z. Sun, and Y . Yao, “Poly kernel inception network for remote sensing detection,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 27 706–27 716

2024

[50] [50]

Advancing plain vision transformer toward remote sensing foundation model,

D. Wang, Q. Zhang, Y . Xu, J. Zhang, B. Du, D. Tao, and L. Zhang, “Advancing plain vision transformer toward remote sensing foundation model,”IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–15, 2022

2022

[51] [51]

Adaptive rotated convolution for rotated object detection,

Y . Pu, Y . Wang, Z. Xia, Y . Han, Y . Wang, W. Gan, Z. Wang, S. Song, and G. Huang, “Adaptive rotated convolution for rotated object detection,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6589–6600

2023

[52] [52]

Learning roi transformer for oriented object detection in aerial images,

J. Ding, N. Xue, Y . Long, G.-S. Xia, and Q. Lu, “Learning roi transformer for oriented object detection in aerial images,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2844–2853

2019

[53] [53]

A billion-scale foundation model for remote sensing images,

K. Cha, J. Seo, and T. Lee, “A billion-scale foundation model for remote sensing images,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2024

2024

[54] [54]

The kfiou loss for rotated object detection,

X. Yang, Y . Zhou, G. Zhang, J. Yang, W. Wang, J. Yan, X. Zhang, and Q. Tian, “The kfiou loss for rotated object detection,”arXiv preprint arXiv:2201.12558, 2022

arXiv 2022

[55] [55]

Rqformer: Rotated query transformer for end-to-end oriented object detection,

J. Zhao, Z. Ding, Y . Zhou, H. Zhu, W.-L. Du, R. Yao, and A. El Saddik, “Rqformer: Rotated query transformer for end-to-end oriented object detection,”Expert Systems with Applications, vol. 266, p. 126034, 2025

2025

[56] [56]

A unified remote sensing object detector based on fourier contour parametric learning,

T. Zhang, Y . Zhuang, G. Wang, H. Chen, L. Li, and J. Li, “A unified remote sensing object detector based on fourier contour parametric learning,”IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1–25, 2025

2025

[57] [57]

Ars-DETR: Aspect ratio-sensitive detection transformer for aerial oriented object detection,

Y . Zeng, Y . Chen, X. Yang, Q. Li, and J. Yan, “Ars-DETR: Aspect ratio-sensitive detection transformer for aerial oriented object detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–15, 2024

2024

[58] [58]

Strip r-CNN: Large strip convolution for remote sensing object detection,

X. Yuan, Z. Zheng, Y . Li, X. Liu, L. Liu, X. Li, Q. Hou, and M.- M. Cheng, “Strip r-CNN: Large strip convolution for remote sensing object detection,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 15, 2026, pp. 12 259–12 267

2026

[59] [59]

Stochastic beams and where to find them: The gumbel-top-k trick for sampling sequences without replacement,

W. Kool, H. Van Hoof, and M. Welling, “Stochastic beams and where to find them: The gumbel-top-k trick for sampling sequences without replacement,” inProceedings of the International Conference on Machine Learning. PMLR, 2019, pp. 3499–3508. Wenlin Liureceived his BS degree from the Nanjing University of Aeronautics and Astronautics in 2021, and he is cur...

2019