pith. machine review for the scientific record.

arxiv: 2604.10210 · v1 · submitted 2026-04-11 · 💻 cs.CV · cs.AI · cs.LG

Recognition: unknown

A3-FPN: Asymptotic Content-Aware Pyramid Attention Network for Dense Visual Prediction

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:55 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords feature pyramid network · dense prediction · attention mechanism · multi-scale features · asymptotic disentanglement · object detection · semantic segmentation · content-aware resampling

The pith

A3-FPN augments feature pyramids with an asymptotically disentangled column network and content-aware attention to capture more discriminative multi-scale features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing feature pyramid networks struggle to extract truly useful details at every scale and often miss small objects because their mixing of information stays too local and uniform. The paper introduces A3-FPN to fix this through a horizontally spread column network that gradually connects features across the entire pyramid while keeping each scale distinct, plus attention modules that use actual image content to decide how to resample and reweight features during fusion and reassembly. A reader would care because the design slots directly into current CNN and transformer models without major changes. If the gains hold, standard backbones would produce stronger results on object detection and scene segmentation without needing entirely new architectures.
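The content-aware resampling idea can be made concrete with a toy sketch. This is not the paper's implementation: here the per-position offset and weight maps are supplied by hand (in A3-FPN they are predicted by attention modules from image content), and a single-channel feature map stands in for a full tensor.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Sample a (H, W) map at continuous coords (x, y) with bilinear interpolation."""
    H, W = feat.shape
    x = np.clip(x, 0, W - 1)
    y = np.clip(y, 0, H - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * feat[y0, x0] + wx * (1 - wy) * feat[y0, x1]
            + (1 - wx) * wy * feat[y1, x0] + wx * wy * feat[y1, x1])

def content_aware_resample(feat, offsets, weights):
    """At each output position, sample the input at that position plus a
    predicted (dx, dy) offset, then scale by a predicted sampling weight.
    offsets: (H, W, 2); weights: (H, W)."""
    H, W = feat.shape
    out = np.empty_like(feat)
    for yy in range(H):
        for xx in range(W):
            dx, dy = offsets[yy, xx]
            out[yy, xx] = weights[yy, xx] * bilinear_sample(feat, xx + dx, yy + dy)
    return out

# sanity check: zero offsets and unit weights reduce to the identity
feat = np.arange(16, dtype=float).reshape(4, 4)
same = content_aware_resample(feat, np.zeros((4, 4, 2)), np.ones((4, 4)))
```

The point of the design is that the sampling pattern varies with position and content, unlike the fixed grids of bilinear interpolation or strided convolution that the paper criticizes.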

Core claim

We propose A3-FPN to augment multi-scale feature representation via the asymptotically disentangled framework and content-aware attention modules. Specifically, A3-FPN employs a horizontally-spread column network that enables asymptotically global feature interaction and disentangles each level from all hierarchical representations. In feature fusion, it collects supplementary content from the adjacent level to generate position-wise offsets and weights for context-aware resampling, and learns deep context reweights to improve intra-category similarity. In feature reassembly, it further strengthens intra-scale discriminative feature learning and reassembles redundant features based on information content and spatial variation of feature maps.

What carries the argument

The horizontally-spread column network for asymptotic global interaction and level disentanglement, together with content-aware attention modules that generate offsets, weights, and reweights during feature fusion and reassembly.
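One way to read "reweights" is as per-level fusion weights predicted from pooled feature context. The sketch below is a loose interpretation, not the paper's module: `proj` stands in for a hypothetical learned projection that maps each level's global context to a single logit, and the softmax over levels supplies the reweighting.

```python
import numpy as np

def context_reweighted_fusion(levels, proj):
    """Fuse L same-resolution feature maps (each (C, H, W)) using per-level
    weights predicted from their global context.
    proj: (L, C) hypothetical learned projection, one logit per level."""
    ctx = np.stack([f.mean(axis=(1, 2)) for f in levels])  # (L, C) pooled context
    logits = (ctx * proj).sum(axis=1)                      # (L,) one logit per level
    w = np.exp(logits - logits.max())
    w /= w.sum()                                           # softmax reweights
    return sum(wi * f for wi, f in zip(w, levels)), w

# with a zero projection the logits tie, so the fusion falls back to averaging
levels = [np.ones((2, 3, 3)), 3 * np.ones((2, 3, 3))]
fused, w = context_reweighted_fusion(levels, np.zeros((2, 2)))
```

A trained `proj` would instead shift weight toward whichever level's context pattern is most informative at that point, which is the behavior the review credits for improved intra-category similarity.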

If this is right

  • A3-FPN integrates directly into existing CNN and transformer architectures for dense prediction tasks.
  • It delivers measurable gains on MS COCO, VisDrone2019-DET, and Cityscapes.
  • Paired with OneFormer and Swin-L it reaches 49.6 mask AP on MS COCO.
  • The same pairing reaches 85.6 mIoU on Cityscapes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reassembly step that prunes features by information content could reduce memory use when scaling the same idea to higher-resolution inputs.
  • The horizontal column structure may transfer to video or 3D dense prediction where scale varies over time or depth.
  • If the disentanglement step proves essential, future work could isolate its contribution by ablating only that component across more backbones.

Load-bearing premise

The asymptotic disentanglement and content-aware modules truly improve feature discriminability across scales rather than merely adding parameters that fit the training distributions of the tested benchmarks.

What would settle it

Inserting A3-FPN into a standard backbone on a fresh dense-prediction dataset and measuring no accuracy gain or a drop relative to the original feature pyramid network would show the claimed benefits do not generalize.

Figures

Figures reproduced from arXiv: 2604.10210 by Meng'en Qin, Quanling Zhao, Xiaodong Yang, Xiaohui Yang, Yingtao Che, Yu Song.

Figure 1
Figure 1. Illustration of FPN [9] and PAFPN [10]. (a) Top-down multi-scale feature fusion path in FPN. (b) Extra bottom-up path aggregation in PAFPN. Both methods have some defects: (1) information loss, (2) context-agnostic sampling, (3) pattern inconsistency. view at source ↗
Figure 2
Figure 2. (a) Average precision, parameters, and FLOPs of various feature pyramid networks evaluated on COCO val2017 [17]. Bubble area scales with model GFLOPs; (b) Inference latency vs. performance on COCO val2017 for feature pyramid models. All models are trained for 12 epochs using Faster R-CNN with ResNet-50 as the baseline. Inference latency is measured on a single NVIDIA RTX 4090 GPU. view at source ↗
Figure 3
Figure 3. Overall architecture of A3-FPN. (a) The bottom-up asymptotically disentangled fusion framework consisting of m columns; (b) Multi-scale Context-aware Attention module for feature fusion; (c) Intra-scale Content-aware Attention module for feature reassembly. view at source ↗
Figure 4
Figure 4. Offset generator and context weight generator in the multi-scale context-aware attention module. (a) The offset generator gathers context information to produce position-wise coordinate offset maps and sampling weight maps for the subsequent Resampler; (b) the context weight generator learns the relationship among different representation patterns and assigns the corresponding context weight to different-level features. view at source ↗
Figure 5
Figure 5. Comparison of different multi-scale designs, including (a) FPN [9] (layer-wise framework), (b) Gold-YOLO [33] (global convolutional framework), (c) Deformable Attention [40] (global attention framework), (d) A3-FPN (asymptotically disentangled framework). view at source ↗
Figure 6
Figure 6. In the first row, H/8 × W/8 feature maps are upsampled to H/4 × W/4 by bilinear interpolation, DySample [12] and Context-aware Resampler (ours); in the second row, we downsample H/4 × W/4 feature maps to H/8 × W/8 through strided convolution, CARAFE++ [13] and Context-aware Resampler (ours). view at source ↗
Figure 7
Figure 7. Visualization of detection results, the corresponding feature maps and heatmaps. The Resampler refines coarsely sampled features and diminishes object displacement. The context weight generator learns the significance relationship of different feature patterns, decreasing misclassifications and missed detections. ICAtten further enhances discriminative features and alleviates complex background interference. view at source ↗
Figure 8
Figure 8. Qualitative evaluation of various feature pyramid networks for object detection on the MS COCO validation set, including FPN [9], PAFPN [10], NAS-FPN [34], AFPN [15] and our A3-FPN. Odd rows are the detection results, and the others are the corresponding AblationCAM [56] visualizations. The object category in the images is sheep. view at source ↗
Figure 9
Figure 9. Qualitative evaluation. (a) Instance segmentation results of Mask R-CNN [5] with different feature fusion approaches, including FPN [9], FPT [48], DySample [12], CARAFE [13] and our A3-FPN; (b) and (d) compare unified transformer-based models (Mask2Former [31] and Mask DINO [8]) with and without integrating A3-FPN on the instance and semantic segmentation tasks respectively; (c) Sem… view at source ↗
read the original abstract

Learning multi-scale representations is the common strategy to tackle object scale variation in dense prediction tasks. Although existing feature pyramid networks have greatly advanced visual recognition, inherent design defects inhibit them from capturing discriminative features and recognizing small objects. In this work, we propose Asymptotic Content-Aware Pyramid Attention Network (A3-FPN), to augment multi-scale feature representation via the asymptotically disentangled framework and content-aware attention modules. Specifically, A3-FPN employs a horizontally-spread column network that enables asymptotically global feature interaction and disentangles each level from all hierarchical representations. In feature fusion, it collects supplementary content from the adjacent level to generate position-wise offsets and weights for context-aware resampling, and learns deep context reweights to improve intra-category similarity. In feature reassembly, it further strengthens intra-scale discriminative feature learning and reassembles redundant features based on information content and spatial variation of feature maps. Extensive experiments on MS COCO, VisDrone2019-DET and Cityscapes demonstrate that A3-FPN can be easily integrated into state-of-the-art CNN and Transformer-based architectures, yielding remarkable performance gains. Notably, when paired with OneFormer and Swin-L backbone, A3-FPN achieves 49.6 mask AP on MS COCO and 85.6 mIoU on Cityscapes. Codes are available at https://github.com/mason-ching/A3-FPN.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript proposes the Asymptotic Content-Aware Pyramid Attention Network (A3-FPN) to improve multi-scale feature representation in dense visual prediction tasks. It introduces a horizontally-spread column network for asymptotically global feature interaction and disentanglement, along with content-aware attention modules for position-wise resampling using offsets and weights, and deep context reweights for intra-category similarity in fusion, plus reassembly based on information content. The paper claims easy integration into CNN and Transformer architectures, with significant performance improvements on MS COCO, VisDrone2019-DET, and Cityscapes, including 49.6 mask AP and 85.6 mIoU when combined with OneFormer and Swin-L.

Significance. Should the empirical gains prove robust and attributable to the proposed mechanisms, A3-FPN would represent a meaningful advance in feature pyramid designs for handling scale variations and small objects in detection and segmentation. The public availability of the code at the provided GitHub link is a notable strength that facilitates verification and extension by the community.

major comments (3)
  1. The reported benchmark results, such as the 49.6 mask AP on MS COCO, do not include ablation studies that hold parameter count, FLOPs, and training schedule fixed while isolating the contributions of the asymptotically disentangled column network and the content-aware resampling/reweighting modules. This is load-bearing for the central claim that these components yield genuine improvements in discriminative feature capture.
  2. The description of the content-aware attention modules in feature fusion and reassembly lacks analysis of the computational overhead introduced by generating position-wise offsets, weights, and intra-category reweights, which is necessary to evaluate whether the 'remarkable performance gains' come at an acceptable cost compared to standard FPNs.
  3. No error bars, results from multiple random seeds, or statistical tests are provided for the performance numbers on the three datasets, reducing confidence that the gains over baselines are statistically significant rather than due to variance or post-hoc tuning.
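The budget-matched ablation asked for in comment 1 starts from parameter accounting. A minimal sketch, assuming a plain stack of k×k convolutions as the widened baseline head (the real networks contain more layer types, so this is illustrative only; `match_width` is a hypothetical helper, not from the paper or its code):

```python
def conv_params(c_in, c_out, k=3):
    # parameters of a single k x k convolution with bias
    return c_in * c_out * k * k + c_out

def match_width(target_params, depth, k=3):
    """Largest channel width c such that a stack of `depth` k x k convs
    (c -> c) stays within target_params."""
    c = 1
    while depth * conv_params(c + 1, c + 1, k) <= target_params:
        c += 1
    return c

# e.g. match a hypothetical 1.2M-parameter attention head with a 4-conv baseline
c = match_width(1_200_000, depth=4)
budget = 4 * conv_params(c, c)
```

Holding the schedule fixed and widening the baseline this way is what separates "the modules help" from "more parameters help."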

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments highlight important aspects of empirical validation that will strengthen the paper. We provide point-by-point responses below and commit to revisions that address the concerns while preserving the core contributions of A3-FPN.

read point-by-point responses
  1. Referee: The reported benchmark results, such as the 49.6 mask AP on MS COCO, do not include ablation studies that hold parameter count, FLOPs, and training schedule fixed while isolating the contributions of the asymptotically disentangled column network and the content-aware resampling/reweighting modules. This is load-bearing for the central claim that these components yield genuine improvements in discriminative feature capture.

    Authors: We agree that controlled ablations with matched parameter counts, FLOPs, and training schedules are essential to isolate the contributions of the horizontally-spread column network and the content-aware modules. Our existing ablations demonstrate component-wise gains but do not enforce strict budget matching. In the revised manuscript we will add new ablation tables that scale baseline FPN variants (e.g., by adjusting channel widths) to match the exact parameter and FLOP counts of each A3-FPN configuration while keeping the training schedule identical. These results will be reported alongside the original numbers to directly support the claim of genuine improvements. revision: yes

  2. Referee: The description of the content-aware attention modules in feature fusion and reassembly lacks analysis of the computational overhead introduced by generating position-wise offsets, weights, and intra-category reweights, which is necessary to evaluate whether the 'remarkable performance gains' come at an acceptable cost compared to standard FPNs.

    Authors: We acknowledge the need for explicit overhead analysis. The current manuscript reports overall model FLOPs but does not break down the incremental cost of the offset/weight generation and reweighting operations. In the revision we will add a dedicated table and accompanying text that quantifies the additional parameters and FLOPs attributable to each content-aware module (resampling, reweighting, and reassembly) relative to a standard FPN. We will also compare these costs against recent pyramid variants (e.g., BiFPN, CARAFE) to demonstrate that the observed gains remain favorable on a performance-per-FLOP basis. revision: yes

  3. Referee: No error bars, results from multiple random seeds, or statistical tests are provided for the performance numbers on the three datasets, reducing confidence that the gains over baselines are statistically significant rather than due to variance or post-hoc tuning.

    Authors: We recognize that reporting variability across random seeds strengthens confidence in the results. Our experiments used a fixed seed for reproducibility, consistent with common practice in the field, but did not include multi-seed statistics. In the revised manuscript we will rerun the primary experiments (COCO detection/segmentation, Cityscapes, VisDrone) with at least three independent random seeds, reporting mean and standard deviation for all key metrics. We will also note that the consistent gains across three distinct datasets and multiple backbone architectures already provide supporting evidence, but the added statistics will allow formal assessment of significance. revision: yes
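The multi-seed reporting proposed in response 3 is mechanically simple. A minimal sketch with illustrative numbers (not results from the paper):

```python
import statistics

def seed_summary(metric_by_seed):
    """Mean and sample standard deviation of a metric across seeds, as a
    revision reporting e.g. mAP as mean +/- std would use."""
    mean = statistics.fmean(metric_by_seed)
    std = statistics.stdev(metric_by_seed) if len(metric_by_seed) > 1 else 0.0
    return mean, std

# hypothetical mAP values from three independent seeds
runs = [41.8, 42.1, 41.9]
mean, std = seed_summary(runs)
```

Three seeds per configuration, as the authors propose, is enough for a mean and standard deviation but not for a well-powered significance test; reporting both the per-seed numbers and the summary is the safer disclosure.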

Circularity Check

0 steps flagged

No circularity: empirical architecture proposal with benchmark validation

full rationale

The paper proposes an empirical neural architecture (A3-FPN) consisting of a horizontally-spread column network and content-aware attention modules for multi-scale feature fusion and reassembly. It describes the design choices in prose and reports performance on MS COCO, VisDrone2019-DET, and Cityscapes when integrated with existing backbones. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted parameters or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior work by the same authors are invoked to justify the core mechanisms. The central claims rest on empirical gains rather than any closed-loop mathematical reduction, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 2 invented entities

The central claim rests on standard deep-learning assumptions about multi-scale feature utility plus newly introduced architectural components whose benefit is shown only through empirical results on three datasets.

free parameters (2)
  • position-wise offsets and weights in content-aware resampling
    Learned parameters introduced by the proposed modules; their values are fitted during training on the target datasets.
  • deep context reweights
    Additional learned reweighting factors for intra-category similarity.
axioms (1)
  • domain assumption Multi-scale representations are the common strategy to tackle object scale variation in dense prediction tasks
    Stated explicitly in the opening sentence of the abstract as background.
invented entities (2)
  • Asymptotically disentangled framework with horizontally-spread column network no independent evidence
    purpose: To enable global feature interaction while disentangling each level from hierarchical representations
    Newly proposed architectural component without independent evidence outside this work.
  • Content-aware attention modules for resampling and reassembly no independent evidence
    purpose: To generate position-wise offsets/weights and strengthen intra-scale discriminative features
    Newly proposed modules whose effectiveness is claimed via benchmark gains.

pith-pipeline@v0.9.0 · 5567 in / 1571 out tokens · 39043 ms · 2026-05-10T16:55:41.755210+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. StomaD2: An All-in-One System for Intelligent Stomatal Phenotype Analysis via Diffusion-Based Restoration Detection Network

    cs.CV 2026-04 unverdicted novelty 5.0

    StomaD2 integrates diffusion-based image restoration with a specialized rotated detection network to achieve high-accuracy stomatal phenotyping across more than 130 plant species.

Reference graph

Works this paper leans on

66 extracted references · 1 canonical work pages · cited by 1 Pith paper

  1. [1]

    M. Chen, L. Zhang, R. Feng, X. Xue, J. Feng, Rethinking local and global feature representation for dense prediction, Pattern Recognition 135 (2023) 109168

  2. [2]

    G. Zhang, Z. Li, C. Tang, J. Li, X. Hu, Cednet: A cascade encoder–decoder network for dense prediction, Pattern Recognition 158 (2025) 111072

  3. [3]

    Y. Chen, Z. Zhang, Y. Cao, L. Wang, S. Lin, H. Hu, Reppoints v2: Verification meets regression for object detection, in: Advances in Neural Information Processing Systems, Vol. 33, 2020, pp. 5621–5631

  4. [4]

    X. Ding, R. Zhang, Q. Liu, Y. Yang, Real-time small object detection using adaptive weighted fusion of efficient positional features, Pattern Recognition 167 (2025) 111717

  5. [5]

    K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask r-cnn, in: Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2017, pp. 2961–2969

  6. [6]

    R. Li, C. He, S. Li, Y. Zhang, L. Zhang, Dynamask: dynamic mask selection for instance segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11279–11288

  7. [7]

    J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440

  8. [8]

    F. Li, H. Zhang, H. Xu, S. Liu, L. Zhang, L. M. Ni, H.-Y. Shum, Mask dino: Towards a unified transformer-based framework for object detection and segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3041–3050

  9. [9]

    T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125

  10. [10]

    S. Liu, L. Qi, H. Qin, J. Shi, J. Jia, Path aggregation network for instance segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8759–8768

  11. [11]

    M. Tan, R. Pang, Q. V. Le, Efficientdet: Scalable and efficient object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10781–10790

  12. [12]

    W. Liu, H. Lu, H. Fu, Z. Cao, Learning to upsample by learning to sample, in: Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2023, pp. 6027–6037

  13. [13]

    J. Wang, K. Chen, R. Xu, Z. Liu, C. C. Loy, D. Lin, Carafe++: Unified content-aware reassembly of features, IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (9) (2022) 4674–4687

  14. [14]

    S. Huang, Z. Lu, R. Cheng, C. He, Fapn: Feature-aligned pyramid network for dense image prediction, in: Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2021, pp. 864–873

  15. [15]

    G. Yang, J. Lei, H. Tian, Z. Feng, R. Liang, Asymptotic feature pyramid network for labeling pixels and regions, IEEE Transactions on Circuits and Systems for Video Technology 34 (9) (2024) 7820–7829

  16. [16]

    L. Chen, Y. Fu, L. Gu, C. Yan, T. Harada, G. Huang, Frequency-aware feature fusion for dense image prediction, IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)

  17. [17]

    T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft coco: Common objects in context, in: Proceedings of the European Conference on Computer Vision, Springer, 2014, pp. 740–755

  18. [18]

    M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. Schiele, The cityscapes dataset for semantic urban scene understanding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223

  19. [19]

    H. Zhao, J. Shi, X. Qi, X. Wang, J. Jia, Pyramid scene parsing network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2881–2890

  20. [20]

    H. Zhao, Y. Zhang, S. Liu, J. Shi, C. C. Loy, D. Lin, J. Jia, Psanet: Point-wise spatial attention network for scene parsing, in: Proceedings of the European Conference on Computer Vision, 2018, pp. 267–283

  21. [21]

    T. Xiao, Y. Liu, B. Zhou, Y. Jiang, J. Sun, Unified perceptual parsing for scene understanding, in: Proceedings of the European Conference on Computer Vision, 2018, pp. 418–434

  22. [22]

    A. Kirillov, Y. Wu, K. He, R. Girshick, Pointrend: Image segmentation as rendering, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9799–9808

  23. [23]

    M.-H. Guo, C.-Z. Lu, Q. Hou, Z. Liu, M.-M. Cheng, S.-M. Hu, Segnext: Rethinking convolutional attention design for semantic segmentation, in: Advances in Neural Information Processing Systems, Vol. 35, 2022, pp. 1140–1156

  24. [24]

    S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr, et al., Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6881–6890

  25. [25]

    E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, P. Luo, Segformer: Simple and efficient design for semantic segmentation with transformers, in: Advances in Neural Information Processing Systems, Vol. 34, 2021, pp. 12077–12090

  26. [26]

    S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: towards real-time object detection with region proposal networks, in: Advances in Neural Information Processing Systems, 2015, pp. 91–99

  27. [27]

    Z. Cai, N. Vasconcelos, Cascade r-cnn: Delving into high quality object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6154–6162

  28. [28]

    A. Wang, H. Chen, L. Liu, K. Chen, Z. Lin, J. Han, et al., Yolov10: Real-time end-to-end object detection, in: Advances in Neural Information Processing Systems, 2024, pp. 107984–108011

  29. [29]

    N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection with transformers, in: Proceedings of the European Conference on Computer Vision, Springer, 2020, pp. 213–229

  30. [30]

    Y. Zhao, W. Lv, S. Xu, J. Wei, G. Wang, Q. Dang, Y. Liu, J. Chen, Detrs beat yolos on real-time object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 16965–16974

  31. [31]

    B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, R. Girdhar, Masked-attention mask transformer for universal image segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1280–1289

  32. [32]

    J. Jain, J. Li, M. T. Chiu, A. Hassani, N. Orlov, H. Shi, Oneformer: One transformer to rule universal image segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2989–2998

  33. [33]

    C. Wang, W. He, Y. Nie, J. Guo, C. Liu, Y. Wang, K. Han, Gold-yolo: Efficient object detector via gather-and-distribute mechanism, in: Advances in Neural Information Processing Systems, Vol. 36, 2023, pp. 51094–51112

  34. [34]

    G. Ghiasi, T.-Y. Lin, Q. V. Le, Nas-fpn: Learning scalable feature pyramid architecture for object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7036–7045

  35. [35]

    W. Weng, M. Wei, J. Ren, F. Shen, Enhancing aerial object detection with selective frequency interaction network, IEEE Transactions on Artificial Intelligence 5 (12) (2024) 6109–6120

  36. [36]

    H. Li, R. Zhang, Y. Pan, J. Ren, F. Shen, Lr-fpn: Enhancing remote sensing object detection with location refined feature pyramid network, in: 2024 International Joint Conference on Neural Networks (IJCNN), 2024, pp. 1–8

  37. [37]

    G. Zhao, W. Ge, Y. Yu, Graphfpn: Graph feature pyramid network for object detection, in: Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2021, pp. 2763–2772

  38. [38]

    M. Hu, Y. Li, L. Fang, S. Wang, A2-fpn: Attention aggregation based feature pyramid network for instance segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15343–15352

  39. [39]

    D. Liu, J. Liang, T. Geng, A. Loui, T. Zhou, Tripartite feature enhanced pyramid network for dense prediction, IEEE Transactions on Image Processing 32 (2023) 2678–2692

  40. [40]

    X. Zhu, W. Su, L. Lu, B. Li, X. Wang, J. Dai, Deformable detr: Deformable transformers for end-to-end object detection, in: International Conference on Learning Representations, 2021

  41. [41]

    J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, W. Liu, B. Xiao, Deep high-resolution representation learning for visual recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (10) (2021) 3349–3364

  42. [42]

    G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of control, signals and systems 2 (4) (1989) 303–314

  43. [43]

    Z. Lu, H. Pu, F. Wang, Z. Hu, L. Wang, The expressive power of neural networks: A view from the width, in: Advances in Neural Information Processing Systems, 2017, pp. 6232–6240

  44. [44]

    Y. Xiong, Z. Li, Y. Chen, F. Wang, X. Zhu, J. Luo, W. Wang, T. Lu, H. Li, Y. Qiao, et al., Efficient deformable convnets: Rethinking dynamic and sparse operator for vision applications, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 5652–5661

  45. [45]

    Y. Wu, K. He, Group normalization, in: Proceedings of the European Conference on Computer Vision, 2018, pp. 3–19

  46. [46]

    X. Wang, S. Zhang, Z. Yu, L. Feng, W. Zhang, Scale-equalizing pyramid convolution for object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13359–13368

  47. [47]

    C. Guo, B. Fan, Q. Zhang, S. Xiang, C. Pan, Augfpn: Improving multi-scale feature learning for object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12595–12604

  48. [48]

    D. Zhang, H. Zhang, J. Tang, M. Wang, X. Hua, Q. Sun, Feature pyramid transformer, in: Proceedings of the European Conference on Computer Vision, Springer, 2020, pp. 323–339

  49. [49]

    Z. Zong, Q. Cao, B. Leng, Rcnet: Reverse feature pyramid and cross-scale shift network for object detection, in: Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 5637–5645

  50. [50]

    M. Everingham, L. Van Gool, C. K. Williams, J. Winn, A. Zisserman, The pascal visual object classes (voc) challenge, International Journal of Computer Vision 88 (2010) 303–338

  51. [51]

    D. Du, P. Zhu, L. Wen, X. Bian, et al., Visdrone-det2019: The vision meets drone object detection in image challenge results, in: 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 2019, pp. 213–226

  52. [52]

    T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2017, pp. 2980–2988

  53. [53]

    Z. Du, Z. Hu, G. Zhao, Y. Jin, H. Ma, Cross-layer feature pyramid transformer for small object detection in aerial images, IEEE Transactions on Geoscience and Remote Sensing 63 (2025) 1–14

  54. [54]

    K. Chen, J. Wang, J. Pang, Y . Cao, Y . Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, et al., Mmdetection: Open mmlab detection toolbox and benchmark, arXiv preprint arXiv:1906.07155 (2019)

  55. [55]

    M. Contributors, MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark, https://github.com/open-mmlab/mmsegmentation (2020)

  56. [56]

    H. G. Ramaswamy, et al., Ablation-cam: Visual explanations for deep convolutional network via gradient-free localization, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 983–991

  57. [57]

    Z. Tian, C. Shen, H. Chen, T. He, Fcos: Fully convolutional one-stage object detection, in: Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2019, pp. 9627–9636

  58. [58]

    X. Li, W. Wang, X. Hu, J. Li, J. Tang, J. Yang, Generalized focal loss v2: Learning reliable localization quality estimation for dense object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11632–11641

  59. [59]

    Y. Peng, H. Li, P. Wu, Y. Zhang, X. Sun, F. Wu, D-FINE: Redefine regression task of DETRs as fine-grained distribution refinement, in: The Thirteenth International Conference on Learning Representations, 2025

  60. [60]

    K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778

  61. [61]

    R. Strudel, R. Garcia, I. Laptev, C. Schmid, Segmenter: Transformer for semantic segmentation, in: Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2021, pp. 7262–7272

  62. [62]

    W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, Z. Wang, Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883

  63. [63]

    Y. Dai, H. Lu, C. Shen, Learning affinity-aware upsampling for deep image matting, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6841–6850

  64. [64]

    H. Lu, W. Liu, Z. Ye, H. Fu, Y. Liu, Z. Cao, Sapa: Similarity-aware point affiliation for feature upsampling, in: Advances in Neural Information Processing Systems, 2022, pp. 20889–20901

  65. [65]

    H. Lu, W. Liu, H. Fu, Z. Cao, Fade: Fusing the assets of decoder and encoder for task-agnostic upsampling, in: Proceedings of the European Conference on Computer Vision, Springer, 2022, pp. 231–247

  66. [66]

    J. Wang, K. Chen, R. Xu, Z. Liu, C. C. Loy, D. Lin, Carafe: Content-aware reassembly of features, in: Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2019, pp. 3007–3016