pith. machine review for the scientific record.

arxiv: 2604.10210 · v1 · submitted 2026-04-11 · 💻 cs.CV · cs.AI · cs.LG

Recognition: unknown

A3-FPN: Asymptotic Content-Aware Pyramid Attention Network for Dense Visual Prediction

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:55 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords feature pyramid network · dense prediction · attention mechanism · multi-scale features · asymptotic disentanglement · object detection · semantic segmentation · content-aware resampling

The pith

A3-FPN augments feature pyramids with an asymptotically disentangled column network and content-aware attention to capture more discriminative multi-scale features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing feature pyramid networks struggle to extract truly useful details at every scale and often miss small objects because their mixing of information stays too local and uniform. The paper introduces A3-FPN to fix this through a horizontally spread column network that gradually connects features across the entire pyramid while keeping each scale distinct, plus attention modules that use actual image content to decide how to resample and reweight features during fusion and reassembly. A reader would care because the design slots directly into current CNN and transformer models without major changes. If the gains hold, standard backbones would produce stronger results on object detection and scene segmentation without needing entirely new architectures.
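The content-aware resampling idea can be made concrete with a toy sketch. This is not the paper's implementation: here the per-position offset and weight maps are supplied by hand (in A3-FPN they are predicted by attention modules from image content), and a single-channel feature map stands in for a full tensor.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Sample a (H, W) map at continuous coords (x, y) with bilinear interpolation."""
    H, W = feat.shape
    x = np.clip(x, 0, W - 1)
    y = np.clip(y, 0, H - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * feat[y0, x0] + wx * (1 - wy) * feat[y0, x1]
            + (1 - wx) * wy * feat[y1, x0] + wx * wy * feat[y1, x1])

def content_aware_resample(feat, offsets, weights):
    """At each output position, sample the input at that position plus a
    predicted (dx, dy) offset, then scale by a predicted sampling weight.
    offsets: (H, W, 2); weights: (H, W)."""
    H, W = feat.shape
    out = np.empty_like(feat)
    for yy in range(H):
        for xx in range(W):
            dx, dy = offsets[yy, xx]
            out[yy, xx] = weights[yy, xx] * bilinear_sample(feat, xx + dx, yy + dy)
    return out

# sanity check: zero offsets and unit weights reduce to the identity
feat = np.arange(16, dtype=float).reshape(4, 4)
same = content_aware_resample(feat, np.zeros((4, 4, 2)), np.ones((4, 4)))
```

The point of the design is that the sampling pattern varies with position and content, unlike the fixed grids of bilinear interpolation or strided convolution that the paper criticizes.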

Core claim

We propose A3-FPN to augment multi-scale feature representation via the asymptotically disentangled framework and content-aware attention modules. Specifically, A3-FPN employs a horizontally-spread column network that enables asymptotically global feature interaction and disentangles each level from all hierarchical representations. In feature fusion, it collects supplementary content from the adjacent level to generate position-wise offsets and weights for context-aware resampling, and learns deep context reweights to improve intra-category similarity. In feature reassembly, it further strengthens intra-scale discriminative feature learning and reassembles redundant features based on information content and spatial variation of feature maps.

What carries the argument

The horizontally-spread column network for asymptotic global interaction and level disentanglement, together with content-aware attention modules that generate offsets, weights, and reweights during feature fusion and reassembly.
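One way to read "reweights" is as per-level fusion weights predicted from pooled feature context. The sketch below is a loose interpretation, not the paper's module: `proj` stands in for a hypothetical learned projection that maps each level's global context to a single logit, and the softmax over levels supplies the reweighting.

```python
import numpy as np

def context_reweighted_fusion(levels, proj):
    """Fuse L same-resolution feature maps (each (C, H, W)) using per-level
    weights predicted from their global context.
    proj: (L, C) hypothetical learned projection, one logit per level."""
    ctx = np.stack([f.mean(axis=(1, 2)) for f in levels])  # (L, C) pooled context
    logits = (ctx * proj).sum(axis=1)                      # (L,) one logit per level
    w = np.exp(logits - logits.max())
    w /= w.sum()                                           # softmax reweights
    return sum(wi * f for wi, f in zip(w, levels)), w

# with a zero projection the logits tie, so the fusion falls back to averaging
levels = [np.ones((2, 3, 3)), 3 * np.ones((2, 3, 3))]
fused, w = context_reweighted_fusion(levels, np.zeros((2, 2)))
```

A trained `proj` would instead shift weight toward whichever level's context pattern is most informative at that point, which is the behavior the review credits for improved intra-category similarity.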

If this is right

  • A3-FPN integrates directly into existing CNN and transformer architectures for dense prediction tasks.
  • It delivers measurable gains on MS COCO, VisDrone2019-DET, and Cityscapes.
  • Paired with OneFormer and Swin-L it reaches 49.6 mask AP on MS COCO.
  • The same pairing reaches 85.6 mIoU on Cityscapes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reassembly step that prunes features by information content could reduce memory use when scaling the same idea to higher-resolution inputs.
  • The horizontal column structure may transfer to video or 3D dense prediction where scale varies over time or depth.
  • If the disentanglement step proves essential, future work could isolate its contribution by ablating only that component across more backbones.

Load-bearing premise

The asymptotic disentanglement and content-aware modules truly improve feature discriminability across scales rather than merely adding parameters that fit the training distributions of the tested benchmarks.

What would settle it

Inserting A3-FPN into a standard backbone on a fresh dense-prediction dataset and measuring no accuracy gain or a drop relative to the original feature pyramid network would show the claimed benefits do not generalize.

Figures

Figures reproduced from arXiv: 2604.10210 by Meng'en Qin, Quanling Zhao, Xiaodong Yang, Xiaohui Yang, Yingtao Che, Yu Song.

Figure 1
Figure 1. Illustration of FPN [9] and PAFPN [10]. (a) Top-down multi-scale feature fusion path in FPN. (b) Extra bottom-up path aggregation in PAFPN. Both methods have some defects: (1) information loss, (2) context-agnostic sampling, (3) pattern inconsistency. view at source ↗
Figure 2
Figure 2. (a) Average precision, parameters, and FLOPs of various feature pyramid networks evaluated on COCO val2017 [17]. Bubble area scales with model GFLOPs; (b) Inference latency vs. performance on COCO val2017 for feature pyramid models. All models are trained for 12 epochs using Faster R-CNN with ResNet-50 as the baseline. Inference latency is measured on a single NVIDIA RTX 4090 GPU. view at source ↗
Figure 3
Figure 3. Overall architecture of A3-FPN. (a) The bottom-up asymptotically disentangled fusion framework consisting of m columns; (b) Multi-scale Context-aware Attention module for feature fusion; (c) Intra-scale Content-aware Attention module for feature reassembly. view at source ↗
Figure 4
Figure 4. Offset generator and context weight generator in the multi-scale context-aware attention module. (a) The offset generator gathers context information to produce position-wise coordinate offset maps and sampling weight maps for the subsequent Resampler; (b) the context weight generator learns the relationship among different representation patterns and assigns the corresponding context weight to different-level features. view at source ↗
Figure 5
Figure 5. Comparison of different multi-scale designs, including (a) FPN [9] (layer-wise framework), (b) Gold-YOLO [33] (global convolutional framework), (c) Deformable Attention [40] (global attention framework), (d) A3-FPN (asymptotically disentangled framework). view at source ↗
Figure 6
Figure 6. In the first row, H/8 × W/8 feature maps are upsampled to H/4 × W/4 by bilinear interpolation, DySample [12] and Context-aware Resampler (ours); in the second row, we downsample H/4 × W/4 feature maps to H/8 × W/8 through strided convolution, CARAFE++ [13] and Context-aware Resampler (ours). view at source ↗
Figure 7
Figure 7. Visualization of detection results, the corresponding feature maps and heatmaps. The Resampler refines coarsely sampled features and diminishes object displacement. The context weight generator learns the significance relationship of different feature patterns, decreasing misclassifications and missed detections. ICAtten further enhances discriminative features and alleviates complex background interference. view at source ↗
Figure 8
Figure 8. Qualitative evaluation of various feature pyramid networks for object detection on the MS COCO validation set, including FPN [9], PAFPN [10], NAS-FPN [34], AFPN [15] and our A3-FPN. Odd rows are the detection results, and the others are the corresponding AblationCAM [56] visualizations. The object category in the images is sheep. view at source ↗
Figure 9
Figure 9. Qualitative evaluation. (a) Instance segmentation results of Mask R-CNN [5] with different feature fusion approaches, including FPN [9], FPT [48], DySample [12], CARAFE [13] and our A3-FPN; (b) and (d) compare unified transformer-based models (Mask2Former [31] and Mask DINO [8]) with and without integrating A3-FPN on the instance and semantic segmentation tasks respectively; (c) Sem… view at source ↗
read the original abstract

Learning multi-scale representations is the common strategy to tackle object scale variation in dense prediction tasks. Although existing feature pyramid networks have greatly advanced visual recognition, inherent design defects inhibit them from capturing discriminative features and recognizing small objects. In this work, we propose Asymptotic Content-Aware Pyramid Attention Network (A3-FPN), to augment multi-scale feature representation via the asymptotically disentangled framework and content-aware attention modules. Specifically, A3-FPN employs a horizontally-spread column network that enables asymptotically global feature interaction and disentangles each level from all hierarchical representations. In feature fusion, it collects supplementary content from the adjacent level to generate position-wise offsets and weights for context-aware resampling, and learns deep context reweights to improve intra-category similarity. In feature reassembly, it further strengthens intra-scale discriminative feature learning and reassembles redundant features based on information content and spatial variation of feature maps. Extensive experiments on MS COCO, VisDrone2019-DET and Cityscapes demonstrate that A3-FPN can be easily integrated into state-of-the-art CNN and Transformer-based architectures, yielding remarkable performance gains. Notably, when paired with OneFormer and Swin-L backbone, A3-FPN achieves 49.6 mask AP on MS COCO and 85.6 mIoU on Cityscapes. Codes are available at https://github.com/mason-ching/A3-FPN.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript proposes the Asymptotic Content-Aware Pyramid Attention Network (A3-FPN) to improve multi-scale feature representation in dense visual prediction tasks. It introduces a horizontally-spread column network for asymptotically global feature interaction and disentanglement, along with content-aware attention modules for position-wise resampling using offsets and weights, and deep context reweights for intra-category similarity in fusion, plus reassembly based on information content. The paper claims easy integration into CNN and Transformer architectures, with significant performance improvements on MS COCO, VisDrone2019-DET, and Cityscapes, including 49.6 mask AP and 85.6 mIoU when combined with OneFormer and Swin-L.

Significance. Should the empirical gains prove robust and attributable to the proposed mechanisms, A3-FPN would represent a meaningful advance in feature pyramid designs for handling scale variations and small objects in detection and segmentation. The public availability of the code at the provided GitHub link is a notable strength that facilitates verification and extension by the community.

major comments (3)
  1. The reported benchmark results, such as the 49.6 mask AP on MS COCO, do not include ablation studies that hold parameter count, FLOPs, and training schedule fixed while isolating the contributions of the asymptotically disentangled column network and the content-aware resampling/reweighting modules. This is load-bearing for the central claim that these components yield genuine improvements in discriminative feature capture.
  2. The description of the content-aware attention modules in feature fusion and reassembly lacks analysis of the computational overhead introduced by generating position-wise offsets, weights, and intra-category reweights, which is necessary to evaluate whether the 'remarkable performance gains' come at an acceptable cost compared to standard FPNs.
  3. No error bars, results from multiple random seeds, or statistical tests are provided for the performance numbers on the three datasets, reducing confidence that the gains over baselines are statistically significant rather than due to variance or post-hoc tuning.
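The budget-matched ablation asked for in comment 1 starts from parameter accounting. A minimal sketch, assuming a plain stack of k×k convolutions as the widened baseline head (the real networks contain more layer types, so this is illustrative only; `match_width` is a hypothetical helper, not from the paper or its code):

```python
def conv_params(c_in, c_out, k=3):
    # parameters of a single k x k convolution with bias
    return c_in * c_out * k * k + c_out

def match_width(target_params, depth, k=3):
    """Largest channel width c such that a stack of `depth` k x k convs
    (c -> c) stays within target_params."""
    c = 1
    while depth * conv_params(c + 1, c + 1, k) <= target_params:
        c += 1
    return c

# e.g. match a hypothetical 1.2M-parameter attention head with a 4-conv baseline
c = match_width(1_200_000, depth=4)
budget = 4 * conv_params(c, c)
```

Holding the schedule fixed and widening the baseline this way is what separates "the modules help" from "more parameters help."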

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments highlight important aspects of empirical validation that will strengthen the paper. We provide point-by-point responses below and commit to revisions that address the concerns while preserving the core contributions of A3-FPN.

read point-by-point responses
  1. Referee: The reported benchmark results, such as the 49.6 mask AP on MS COCO, do not include ablation studies that hold parameter count, FLOPs, and training schedule fixed while isolating the contributions of the asymptotically disentangled column network and the content-aware resampling/reweighting modules. This is load-bearing for the central claim that these components yield genuine improvements in discriminative feature capture.

    Authors: We agree that controlled ablations with matched parameter counts, FLOPs, and training schedules are essential to isolate the contributions of the horizontally-spread column network and the content-aware modules. Our existing ablations demonstrate component-wise gains but do not enforce strict budget matching. In the revised manuscript we will add new ablation tables that scale baseline FPN variants (e.g., by adjusting channel widths) to match the exact parameter and FLOP counts of each A3-FPN configuration while keeping the training schedule identical. These results will be reported alongside the original numbers to directly support the claim of genuine improvements. revision: yes

  2. Referee: The description of the content-aware attention modules in feature fusion and reassembly lacks analysis of the computational overhead introduced by generating position-wise offsets, weights, and intra-category reweights, which is necessary to evaluate whether the 'remarkable performance gains' come at an acceptable cost compared to standard FPNs.

    Authors: We acknowledge the need for explicit overhead analysis. The current manuscript reports overall model FLOPs but does not break down the incremental cost of the offset/weight generation and reweighting operations. In the revision we will add a dedicated table and accompanying text that quantifies the additional parameters and FLOPs attributable to each content-aware module (resampling, reweighting, and reassembly) relative to a standard FPN. We will also compare these costs against recent pyramid variants (e.g., BiFPN, CARAFE) to demonstrate that the observed gains remain favorable on a performance-per-FLOP basis. revision: yes

  3. Referee: No error bars, results from multiple random seeds, or statistical tests are provided for the performance numbers on the three datasets, reducing confidence that the gains over baselines are statistically significant rather than due to variance or post-hoc tuning.

    Authors: We recognize that reporting variability across random seeds strengthens confidence in the results. Our experiments used a fixed seed for reproducibility, consistent with common practice in the field, but did not include multi-seed statistics. In the revised manuscript we will rerun the primary experiments (COCO detection/segmentation, Cityscapes, VisDrone) with at least three independent random seeds, reporting mean and standard deviation for all key metrics. We will also note that the consistent gains across three distinct datasets and multiple backbone architectures already provide supporting evidence, but the added statistics will allow formal assessment of significance. revision: yes
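The multi-seed reporting proposed in response 3 is mechanically simple. A minimal sketch with illustrative numbers (not results from the paper):

```python
import statistics

def seed_summary(metric_by_seed):
    """Mean and sample standard deviation of a metric across seeds, as a
    revision reporting e.g. mAP as mean +/- std would use."""
    mean = statistics.fmean(metric_by_seed)
    std = statistics.stdev(metric_by_seed) if len(metric_by_seed) > 1 else 0.0
    return mean, std

# hypothetical mAP values from three independent seeds
runs = [41.8, 42.1, 41.9]
mean, std = seed_summary(runs)
```

Three seeds per configuration, as the authors propose, is enough for a mean and standard deviation but not for a well-powered significance test; reporting both the per-seed numbers and the summary is the safer disclosure.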

Circularity Check

0 steps flagged

No circularity: empirical architecture proposal with benchmark validation

full rationale

The paper proposes an empirical neural architecture (A3-FPN) consisting of a horizontally-spread column network and content-aware attention modules for multi-scale feature fusion and reassembly. It describes the design choices in prose and reports performance on MS COCO, VisDrone2019-DET, and Cityscapes when integrated with existing backbones. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted parameters or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior work by the same authors are invoked to justify the core mechanisms. The central claims rest on empirical gains rather than any closed-loop mathematical reduction, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 2 invented entities

The central claim rests on standard deep-learning assumptions about multi-scale feature utility plus newly introduced architectural components whose benefit is shown only through empirical results on three datasets.

free parameters (2)
  • position-wise offsets and weights in content-aware resampling
    Learned parameters introduced by the proposed modules; their values are fitted during training on the target datasets.
  • deep context reweights
    Additional learned reweighting factors for intra-category similarity.
axioms (1)
  • domain assumption Multi-scale representations are the common strategy to tackle object scale variation in dense prediction tasks
    Stated explicitly in the opening sentence of the abstract as background.
invented entities (2)
  • Asymptotically disentangled framework with horizontally-spread column network no independent evidence
    purpose: To enable global feature interaction while disentangling each level from hierarchical representations
    Newly proposed architectural component without independent evidence outside this work.
  • Content-aware attention modules for resampling and reassembly no independent evidence
    purpose: To generate position-wise offsets/weights and strengthen intra-scale discriminative features
    Newly proposed modules whose effectiveness is claimed via benchmark gains.

pith-pipeline@v0.9.0 · 5567 in / 1571 out tokens · 39043 ms · 2026-05-10T16:55:41.755210+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. StomaD2: An All-in-One System for Intelligent Stomatal Phenotype Analysis via Diffusion-Based Restoration Detection Network

    cs.CV 2026-04 unverdicted novelty 5.0

    StomaD2 integrates diffusion-based image restoration with a specialized rotated detection network to achieve high-accuracy stomatal phenotyping across more than 130 plant species.

Reference graph

Works this paper leans on

66 extracted references · 1 canonical work pages · cited by 1 Pith paper

  1. [1]

    M. Chen, L. Zhang, R. Feng, X. Xue, J. Feng, Rethinking local and global feature representation for dense prediction, Pattern Recognition 135 (2023) 109168

  2. [2]

    G. Zhang, Z. Li, C. Tang, J. Li, X. Hu, Cednet: A cascade encoder–decoder network for dense prediction, Pattern Recognition 158 (2025) 111072

  3. [3]

    Y. Chen, Z. Zhang, Y. Cao, L. Wang, S. Lin, H. Hu, Reppoints v2: Verification meets regression for object detection, in: Advances in Neural Information Processing Systems, Vol. 33, 2020, pp. 5621–5631

  4. [4]

    X. Ding, R. Zhang, Q. Liu, Y. Yang, Real-time small object detection using adaptive weighted fusion of efficient positional features, Pattern Recognition 167 (2025) 111717

  5. [5]

    K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask r-cnn, in: Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2017, pp. 2961–2969

  6. [6]

    R. Li, C. He, S. Li, Y. Zhang, L. Zhang, Dynamask: dynamic mask selection for instance segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11279–11288

  7. [7]

    J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440

  8. [8]

    F. Li, H. Zhang, H. Xu, S. Liu, L. Zhang, L. M. Ni, H.-Y. Shum, Mask dino: Towards a unified transformer-based framework for object detection and segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3041–3050

  9. [9]

    T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125

  10. [10]

    S. Liu, L. Qi, H. Qin, J. Shi, J. Jia, Path aggregation network for instance segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8759–8768

  11. [11]

    M. Tan, R. Pang, Q. V. Le, Efficientdet: Scalable and efficient object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10781–10790

  12. [12]

    W. Liu, H. Lu, H. Fu, Z. Cao, Learning to upsample by learning to sample, in: Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2023, pp. 6027–6037

  13. [13]

    J. Wang, K. Chen, R. Xu, Z. Liu, C. C. Loy, D. Lin, Carafe++: Unified content-aware reassembly of features, IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (9) (2022) 4674–4687

  14. [14]

    S. Huang, Z. Lu, R. Cheng, C. He, Fapn: Feature-aligned pyramid network for dense image prediction, in: Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2021, pp. 864–873

  15. [15]

    G. Yang, J. Lei, H. Tian, Z. Feng, R. Liang, Asymptotic feature pyramid network for labeling pixels and regions, IEEE Transactions on Circuits and Systems for Video Technology 34 (9) (2024) 7820–7829

  16. [16]

    L. Chen, Y. Fu, L. Gu, C. Yan, T. Harada, G. Huang, Frequency-aware feature fusion for dense image prediction, IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)

  17. [17]

    T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft coco: Common objects in context, in: Proceedings of the European Conference on Computer Vision, Springer, 2014, pp. 740–755

  18. [18]

    M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. Schiele, The cityscapes dataset for semantic urban scene understanding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223

  19. [19]

    H. Zhao, J. Shi, X. Qi, X. Wang, J. Jia, Pyramid scene parsing network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2881–2890

  20. [20]

    H. Zhao, Y. Zhang, S. Liu, J. Shi, C. C. Loy, D. Lin, J. Jia, Psanet: Point-wise spatial attention network for scene parsing, in: Proceedings of the European Conference on Computer Vision, 2018, pp. 267–283

  21. [21]

    T. Xiao, Y. Liu, B. Zhou, Y. Jiang, J. Sun, Unified perceptual parsing for scene understanding, in: Proceedings of the European Conference on Computer Vision, 2018, pp. 418–434

  22. [22]

    A. Kirillov, Y. Wu, K. He, R. Girshick, Pointrend: Image segmentation as rendering, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9799–9808

  23. [23]

    M.-H. Guo, C.-Z. Lu, Q. Hou, Z. Liu, M.-M. Cheng, S.-M. Hu, Segnext: Rethinking convolutional attention design for semantic segmentation, in: Advances in Neural Information Processing Systems, Vol. 35, 2022, pp. 1140–1156

  24. [24]

    S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr, et al., Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6881–6890

  25. [25]

    E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, P. Luo, Segformer: Simple and efficient design for semantic segmentation with transformers, in: Advances in Neural Information Processing Systems, Vol. 34, 2021, pp. 12077–12090

  26. [26]

    S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: towards real-time object detection with region proposal networks, in: Advances in Neural Information Processing Systems, 2015, pp. 91–99

  27. [27]

    Z. Cai, N. Vasconcelos, Cascade r-cnn: Delving into high quality object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6154–6162

  28. [28]

    A. Wang, H. Chen, L. Liu, K. Chen, Z. Lin, J. Han, et al., Yolov10: Real-time end-to-end object detection, in: Advances in Neural Information Processing Systems, 2024, pp. 107984–108011

  29. [29]

    N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection with transformers, in: Proceedings of the European Conference on Computer Vision, Springer, 2020, pp. 213–229

  30. [30]

    Y. Zhao, W. Lv, S. Xu, J. Wei, G. Wang, Q. Dang, Y. Liu, J. Chen, Detrs beat yolos on real-time object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 16965–16974

  31. [31]

    B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, R. Girdhar, Masked-attention mask transformer for universal image segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1280–1289

  32. [32]

    J. Jain, J. Li, M. T. Chiu, A. Hassani, N. Orlov, H. Shi, Oneformer: One transformer to rule universal image segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2989–2998

  33. [33]

    C. Wang, W. He, Y. Nie, J. Guo, C. Liu, Y. Wang, K. Han, Gold-yolo: Efficient object detector via gather-and-distribute mechanism, in: Advances in Neural Information Processing Systems, Vol. 36, 2023, pp. 51094–51112

  34. [34]

    G. Ghiasi, T.-Y. Lin, Q. V. Le, Nas-fpn: Learning scalable feature pyramid architecture for object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7036–7045

  35. [35]

    W. Weng, M. Wei, J. Ren, F. Shen, Enhancing aerial object detection with selective frequency interaction network, IEEE Transactions on Artificial Intelligence 5 (12) (2024) 6109–6120

  36. [36]

    H. Li, R. Zhang, Y. Pan, J. Ren, F. Shen, Lr-fpn: Enhancing remote sensing object detection with location refined feature pyramid network, in: 2024 International Joint Conference on Neural Networks (IJCNN), 2024, pp. 1–8

  37. [37]

    G. Zhao, W. Ge, Y. Yu, Graphfpn: Graph feature pyramid network for object detection, in: Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2021, pp. 2763–2772

  38. [38]

    M. Hu, Y. Li, L. Fang, S. Wang, A2-fpn: Attention aggregation based feature pyramid network for instance segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15343–15352

  39. [39]

    D. Liu, J. Liang, T. Geng, A. Loui, T. Zhou, Tripartite feature enhanced pyramid network for dense prediction, IEEE Transactions on Image Processing 32 (2023) 2678–2692

  40. [40]

    X. Zhu, W. Su, L. Lu, B. Li, X. Wang, J. Dai, Deformable detr: Deformable transformers for end-to-end object detection, in: International Conference on Learning Representations, 2021

  41. [41]

    J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, W. Liu, B. Xiao, Deep high-resolution representation learning for visual recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (10) (2021) 3349–3364

  42. [42]

    G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of control, signals and systems 2 (4) (1989) 303–314

  43. [43]

    Z. Lu, H. Pu, F. Wang, Z. Hu, L. Wang, The expressive power of neural networks: A view from the width, in: Advances in Neural Information Processing Systems, 2017, pp. 6232–6240

  44. [44]

    Y. Xiong, Z. Li, Y. Chen, F. Wang, X. Zhu, J. Luo, W. Wang, T. Lu, H. Li, Y. Qiao, et al., Efficient deformable convnets: Rethinking dynamic and sparse operator for vision applications, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 5652–5661

  45. [45]

    Y. Wu, K. He, Group normalization, in: Proceedings of the European Conference on Computer Vision, 2018, pp. 3–19

  46. [46]

    X. Wang, S. Zhang, Z. Yu, L. Feng, W. Zhang, Scale-equalizing pyramid convolution for object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13359–13368

  47. [47]

    C. Guo, B. Fan, Q. Zhang, S. Xiang, C. Pan, Augfpn: Improving multi-scale feature learning for object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12595–12604

  48. [48]

    D. Zhang, H. Zhang, J. Tang, M. Wang, X. Hua, Q. Sun, Feature pyramid transformer, in: Proceedings of the European Conference on Computer Vision, Springer, 2020, pp. 323–339

  49. [49]

    Z. Zong, Q. Cao, B. Leng, Rcnet: Reverse feature pyramid and cross-scale shift network for object detection, in: Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 5637–5645

  50. [50]

    M. Everingham, L. Van Gool, C. K. Williams, J. Winn, A. Zisserman, The pascal visual object classes (voc) challenge, International Journal of Computer Vision 88 (2010) 303–338

  51. [51]

    D. Du, P. Zhu, L. Wen, X. Bian, et al., Visdrone-det2019: The vision meets drone object detection in image challenge results, in: 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 2019, pp. 213–226

  52. [52]

    T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2017, pp. 2980–2988

  53. [53]

    Z. Du, Z. Hu, G. Zhao, Y. Jin, H. Ma, Cross-layer feature pyramid transformer for small object detection in aerial images, IEEE Transactions on Geoscience and Remote Sensing 63 (2025) 1–14

  54. [54]

    K. Chen, J. Wang, J. Pang, Y . Cao, Y . Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, et al., Mmdetection: Open mmlab detection toolbox and benchmark, arXiv preprint arXiv:1906.07155 (2019)

  55. [55]

    M. Contributors, MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark, https://github.com/open-mmlab/mmsegmentation (2020)

  56. [56]

    H. G. Ramaswamy, et al., Ablation-cam: Visual explanations for deep convolutional network via gradient-free localization, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 983–991

  57. [57]

    Z. Tian, C. Shen, H. Chen, T. He, Fcos: Fully convolutional one-stage object detection, in: Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2019, pp. 9627–9636

  58. [58]

    X. Li, W. Wang, X. Hu, J. Li, J. Tang, J. Yang, Generalized focal loss v2: Learning reliable localization quality estimation for dense object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11632–11641

  59. [59]

    Y. Peng, H. Li, P. Wu, Y. Zhang, X. Sun, F. Wu, D-FINE: Redefine regression task of DETRs as fine-grained distribution refinement, in: The Thirteenth International Conference on Learning Representations, 2025

  60. [60]

    K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778

  61. [61]

    R. Strudel, R. Garcia, I. Laptev, C. Schmid, Segmenter: Transformer for semantic segmentation, in: Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2021, pp. 7262–7272

  62. [62]

    W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, Z. Wang, Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883

  63. [63]

    Y. Dai, H. Lu, C. Shen, Learning affinity-aware upsampling for deep image matting, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6841–6850

  64. [64]

    H. Lu, W. Liu, Z. Ye, H. Fu, Y. Liu, Z. Cao, Sapa: Similarity-aware point affiliation for feature upsampling, in: Advances in Neural Information Processing Systems, 2022, pp. 20889–20901

  65. [65]

    H. Lu, W. Liu, H. Fu, Z. Cao, Fade: Fusing the assets of decoder and encoder for task-agnostic upsampling, in: Proceedings of the European Conference on Computer Vision, Springer, 2022, pp. 231–247

  66. [66]

    J. Wang, K. Chen, R. Xu, Z. Liu, C. C. Loy, D. Lin, Carafe: Content-aware reassembly of features, in: Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2019, pp. 3007–3016