pith. machine review for the scientific record.

arxiv: 2604.19233 · v1 · submitted 2026-04-21 · 💻 cs.CV

Recognition: unknown

Adaptive Slicing-Assisted Hyper Inference for Enhanced Small Object Detection in High-Resolution Imagery

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords adaptive slicing · small object detection · high-resolution imagery · aerial imagery · object detection · inference optimization · VisDrone · xView

The pith

By choosing the number of overlapping patches adaptively from image resolution, ASAHI improves small object detection in high-resolution aerial imagery while cutting inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that fixed slicing wastes computation on high-resolution images and that switching to a resolution-based choice of 6 or 12 patches removes most of that waste without losing detection quality. A reader should care because small-object detectors are used in drone surveillance, disaster mapping, and traffic monitoring, where both accuracy and speed matter. The work adds a fine-tuning step that mixes full and sliced images during training and replaces standard duplicate removal with a combined clustering and distance-aware step. Experiments on two large aerial datasets show the new pipeline reaches the highest reported scores while running 20 to 25 percent faster than the fixed-slice baseline.
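
In code terms, that fine-tuning mixture is essentially a dataset-construction step. The sketch below is a minimal illustration, not the authors' implementation; the `build_saf_mixture` name and `tile_fn` helper are assumptions, and the real SAF strategy also remaps box annotations into each tile's coordinate frame, which is omitted here:

```python
def build_saf_mixture(images, tile_fn, keep_full=True):
    """Assemble a slicing-assisted fine-tuning (SAF) style training set:
    each full-resolution image plus all of its overlapping tiles.

    images:  iterable of training images (labels omitted for brevity)
    tile_fn: maps one image to its list of overlapping patches
    """
    samples = []
    for img in images:
        if keep_full:
            samples.append(img)       # global context, large objects
        samples.extend(tile_fn(img))  # enlarged small objects
    return samples
```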

Core claim

ASAHI determines the optimal number of slices according to image resolution, thereby generating either 6 or 12 overlapping patches via a learned threshold; it augments training with both full-resolution and sliced patches and applies Cluster-DIoU-NMS to merge detections in crowded scenes, yielding 56.8 percent mAP on VisDrone2019-DET-val and 22.7 percent on xView-test at reduced inference cost.

What carries the argument

The adaptive resolution-aware slicing algorithm that uses a learned threshold to decide between 6 and 12 overlapping patches, replacing fixed slice sizes to balance overlap and redundant computation.
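
The summary gives the rule's shape but not its parameters. As a rough sketch under stated assumptions — the pixel-count threshold, the 20% overlap ratio, and the 3×2 / 4×3 grid layouts are placeholders, not values from the paper — the adaptive tiling step might look like this:

```python
import math

def choose_num_slices(width: int, height: int,
                      pixel_threshold: int = 4_000_000) -> int:
    """Pick 6 or 12 overlapping patches from image resolution.

    `pixel_threshold` stands in for the paper's learned threshold;
    the value here is an arbitrary placeholder.
    """
    return 6 if width * height <= pixel_threshold else 12

def make_tiles(width: int, height: int, overlap: float = 0.2):
    """Tile the image into a grid of overlapping patches.

    6 patches -> 3x2 grid, 12 patches -> 4x3 grid (one plausible layout).
    Returns (x0, y0, x1, y1) boxes in pixel coordinates.
    """
    n = choose_num_slices(width, height)
    cols, rows = (3, 2) if n == 6 else (4, 3)
    # Patch size is enlarged so neighbours overlap by `overlap`.
    tile_w = math.ceil(width / (cols - (cols - 1) * overlap))
    tile_h = math.ceil(height / (rows - (rows - 1) * overlap))
    step_x = int(tile_w * (1 - overlap))
    step_y = int(tile_h * (1 - overlap))
    boxes = []
    for r in range(rows):
        for c in range(cols):
            x0 = min(c * step_x, width - tile_w)
            y0 = min(r * step_y, height - tile_h)
            boxes.append((max(x0, 0), max(y0, 0),
                          min(x0 + tile_w, width), min(y0 + tile_h, height)))
    return boxes
```

On a hypothetical 4000×3000 frame this rule picks 12 tiles of roughly 1177×1154 pixels; the point is only that the tile count, not the tile size, is the quantity the method holds fixed.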

If this is right

  • Detection accuracy reaches state-of-the-art levels on the VisDrone and xView benchmarks.
  • Inference time drops 20-25 percent compared with fixed-size slicing baselines.
  • Training on the mixture of full and sliced patches improves the detector's handling of small targets.
  • Cluster-DIoU-NMS reduces duplicate boxes more reliably in dense object clusters than either component alone.
  • Beneficial patch overlap is retained while overall redundant computation is lowered.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same resolution-driven patch count rule could be applied to video streams where object scale changes frame to frame.
  • The learned threshold might be re-optimized per camera type or altitude to match typical object densities.
  • Similar adaptive slicing could speed up high-resolution segmentation or instance counting tasks outside aerial domains.
  • Lower compute cost opens the possibility of running these detectors on embedded hardware without accuracy loss.

Load-bearing premise

Dynamically selecting 6 or 12 patches from a learned resolution threshold will always keep enough overlap to find small objects while trimming unnecessary work across every scene type and density level.

What would settle it

A test set of aerial images at varied resolutions would settle it: if the adaptive method misses more small objects than fixed slicing, or shows no measured reduction in inference time, the central efficiency claim is falsified.

Figures

Figures reproduced from arXiv: 2604.19233 by Francesco Moretti, Guiqin Mario, Yi Jin.

Figure 1
Figure 1. Overview of the proposed ASAHI detection framework. The input image is simultaneously processed through two complementary pathways: Full Inference (FI) for global context and large object detection, and ASAHI adaptive slicing for enhanced small object detection. The Cluster-DIoU-NMS (CDN) module merges and refines predictions from both pathways. During training, the SAF strategy constructs the fine-tuning …
original abstract

Deep learning-based object detectors have achieved remarkable success across numerous computer vision applications, yet they continue to struggle with small object detection in high-resolution aerial and satellite imagery, where dense object distributions, variable shooting angles, diminutive target sizes, and substantial inter-class variability pose formidable challenges. Existing slicing strategies that partition high-resolution images into manageable patches have demonstrated promising results for enlarging the effective receptive field of small targets; however, their reliance on fixed slice dimensions introduces significant redundant computation, inflating inference cost and undermining detection speed. In this paper, we propose Adaptive Slicing-Assisted Hyper Inference (ASAHI), a novel slicing framework that shifts the paradigm from prescribing a fixed slice size to adaptively determining the optimal number of slices according to image resolution, thereby substantially mitigating redundant computation while preserving beneficial overlap between adjacent patches. ASAHI integrates three synergistic components: (1)an adaptive resolution-aware slicing algorithm that dynamically generates 6 or 12 overlapping patches based on a learned threshold, (2)a slicing-assisted fine-tuning (SAF) strategy that constructs augmented training data comprising both full-resolution and sliced image patches, and (3)a Cluster-DIoU-NMS (CDN) post-processing module that combines the geometric merging efficiency of Cluster-NMS with the center-distance-aware suppression of DIoU-NMS to achieve robust duplicate elimination in crowded scenes. Extensive experiments on VisDrone2019 and xView demonstrate that ASAHI achieves state-of-the-art performance with 56.8% on VisDrone2019-DET-val and 22.7% on xView-test, while reducing inference time by 20-25% compared to the baseline SAHI method.
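
The abstract names the CDN module but not its equations. The center-distance half is standard DIoU-NMS (Zheng et al., reference [74] below): a detection is suppressed when its IoU with a higher-scoring box, minus a normalized squared center distance, exceeds the threshold. A minimal greedy sketch follows, assuming plain NumPy boxes; the `beta` weight and thresholds are placeholders, and the authors' Cluster-NMS batching is not shown:

```python
import numpy as np

def diou_nms(boxes: np.ndarray, scores: np.ndarray,
             iou_thresh: float = 0.5, beta: float = 1.0) -> list:
    """Greedy NMS with a DIoU penalty: suppression uses IoU minus a
    normalized center-distance term, so overlapping boxes whose centers
    are far apart are more likely to survive (helpful in dense scenes).

    boxes: (N, 4) array of (x1, y1, x2, y2); returns kept indices.
    """
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Pairwise IoU of the top box against the remainder.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        # Squared center distance, normalized by the squared diagonal of
        # the smallest box enclosing each pair (the DIoU penalty term).
        cx_i, cy_i = (boxes[i, 0] + boxes[i, 2]) / 2, (boxes[i, 1] + boxes[i, 3]) / 2
        cx_r, cy_r = (boxes[rest, 0] + boxes[rest, 2]) / 2, (boxes[rest, 1] + boxes[rest, 3]) / 2
        ex1 = np.minimum(boxes[i, 0], boxes[rest, 0])
        ey1 = np.minimum(boxes[i, 1], boxes[rest, 1])
        ex2 = np.maximum(boxes[i, 2], boxes[rest, 2])
        ey2 = np.maximum(boxes[i, 3], boxes[rest, 3])
        diag2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + 1e-9
        dist2 = (cx_i - cx_r) ** 2 + (cy_i - cy_r) ** 2
        diou = iou - beta * dist2 / diag2
        order = rest[diou <= iou_thresh]
    return keep
```

Cluster-NMS reaches the same fixed point with batched matrix iterations instead of this sequential loop; the paper's CDN presumably swaps the IoU criterion inside that scheme for the DIoU form above.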

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces Adaptive Slicing-Assisted Hyper Inference (ASAHI) for small-object detection in high-resolution aerial imagery. It replaces fixed-size slicing with an adaptive resolution-aware algorithm that selects either 6 or 12 overlapping patches according to a learned threshold, augments training via slicing-assisted fine-tuning (SAF), and applies Cluster-DIoU-NMS (CDN) for post-processing. The central empirical claim is that ASAHI reaches 56.8% mAP on VisDrone2019-DET-val and 22.7% mAP on xView-test while cutting inference time 20-25% relative to the SAHI baseline.

Significance. If the reported gains are reproducible and the adaptive threshold generalizes, the work would offer a practical improvement over fixed slicing methods by reducing redundant computation while maintaining overlap for small targets. The combination of SAF and CDN also targets training and crowded-scene issues that are common in aerial detection benchmarks.

major comments (3)
  1. [Abstract] The headline SOTA numbers (56.8% VisDrone, 22.7% xView) and the 20-25% speed-up are stated without any reference to the base detector, full baseline list, or ablation results that isolate the contribution of the adaptive slicing component versus SAF and CDN.
  2. [Abstract, adaptive resolution-aware slicing paragraph] The method relies on a learned threshold that selects between exactly two discrete patch counts (6 or 12); no value of the threshold, training procedure for it, or density-conditioned ablation is supplied, leaving the central assumption that resolution alone suffices to preserve overlap and avoid missed small objects untested.
  3. [Abstract] The manuscript reports no error analysis or failure-case study on scenes with varying object density, which directly bears on whether the claimed accuracy and speed improvements hold when the fixed 6/12 choices under-slice dense regions or over-slice sparse ones.
minor comments (1)
  1. [Abstract] Formatting inconsistency in the component list ('(1)an' lacks a space after the parenthesis).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We agree that additional context will improve clarity and will revise the abstract accordingly while preserving its conciseness. We address each point below.

point-by-point responses
  1. Referee: [Abstract] The headline SOTA numbers (56.8% VisDrone, 22.7% xView) and the 20-25% speed-up are stated without any reference to the base detector, full baseline list, or ablation results that isolate the contribution of the adaptive slicing component versus SAF and CDN.

    Authors: The abstract summarizes key outcomes with reference to the SAHI baseline. The base detector, complete baseline comparisons, and component-wise ablations (isolating adaptive slicing, SAF, and CDN) are provided in Sections 4.1–4.3 and Table 3. We will revise the abstract to explicitly name the base detector and note that ablations confirm the individual contributions. revision: yes

  2. Referee: [Abstract, adaptive resolution-aware slicing paragraph] The method relies on a learned threshold that selects between exactly two discrete patch counts (6 or 12); no value of the threshold, training procedure for it, or density-conditioned ablation is supplied, leaving the central assumption that resolution alone suffices to preserve overlap and avoid missed small objects untested.

    Authors: The learned threshold and its training procedure are detailed in Section 3.1. The abstract omits the numerical value and procedure for brevity. We will add a concise clause specifying the threshold and training approach. Our experiments already evaluate performance across datasets with varying densities; we will incorporate an explicit density-conditioned ablation summary into the revised abstract and experiments section. revision: yes

  3. Referee: [Abstract] The manuscript reports no error analysis or failure-case study on scenes with varying object density, which directly bears on whether the claimed accuracy and speed improvements hold when the fixed 6/12 choices under-slice dense regions or over-slice sparse ones.

    Authors: The abstract is space-constrained and therefore omits detailed error analysis. The manuscript provides qualitative results and density-stratified quantitative evaluations in Section 4.4 and the supplementary material. We will revise the abstract to include a brief summary of robustness across densities and expand the failure-case discussion in the main text. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical claims rest on external benchmarks

full rationale

The paper proposes ASAHI with three components—an adaptive resolution-aware slicing algorithm using a learned threshold to select 6 or 12 patches, a slicing-assisted fine-tuning strategy, and a Cluster-DIoU-NMS post-processor—and reports empirical SOTA results (56.8% on VisDrone2019-DET-val, 22.7% on xView-test) plus 20-25% faster inference than SAHI. These performance numbers are obtained via standard evaluation on held-out external datasets rather than any derivation, equation, or fitted parameter that reduces to the method's own inputs by construction. No mathematical first-principles chain, self-definitional quantities, or load-bearing self-citations appear in the provided text; the learned threshold is an internal design choice whose effect is measured externally, not presupposed.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on standard deep-learning training assumptions plus one learned decision threshold for choosing slice count; no new physical entities are postulated.

free parameters (1)
  • learned threshold for slice count
    Decides between generating 6 or 12 patches according to image resolution; value is learned during training.
axioms (1)
  • domain assumption: Training on a mixture of full-resolution and sliced patches improves small-object detection performance
    Invoked by the slicing-assisted fine-tuning component.

pith-pipeline@v0.9.0 · 5608 in / 1225 out tokens · 42643 ms · 2026-05-10T02:11:38.790249+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

78 extracted references · 9 canonical work pages · 1 internal anchor

  1. [1]

    Slicing aided hyper inference and fine-tuning for small object detection

    Fatih Cagatay Akyon, Sinan Onur Altinuc, and Alptekin Temizel. Slicing aided hyper inference and fine-tuning for small object detection. In Proceedings of the IEEE International Conference on Image Processing (ICIP), pages 966–

  2. [2]

    SOD-MTGAN: Small object detection via multi-task generative adversarial network

    Yancheng Bai, Yongqiang Zhang, Mingli Ding, and Bernard Ghanem. SOD-MTGAN: Small object detection via multi-task generative adversarial network. In Proceedings of the European Conference on Computer Vision (ECCV), pages 210–226. Springer, 2018

  3. [3]

    YOLOv4: Optimal Speed and Accuracy of Object Detection

    Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020

  4. [4]

    Soft-NMS – improving object detection with one line of code

    Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S Davis. Soft-NMS – improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 5562–5570. IEEE, 2017

  5. [5]

    VisDrone-DET2021: The vision meets drone object detection challenge results

    Yuren Cao, Zhijian He, Longjia Wang, Wengao Wang, Yixian Yuan, Daiwei Zhang, Jialiang Zhang, Pengfei Zhu, Luc Van Gool, Junwei Han, et al. VisDrone-DET2021: The vision meets drone object detection challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 2847–2854. IEEE, 2021

  6. [6]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), pages 213–229. Springer, 2020

  7. [7]

    DiffusionDet: Diffusion model for object detection

    Shoufa Chen, Peize Sun, Yibing Song, and Ping Luo. DiffusionDet: Diffusion model for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19830–19843. IEEE, 2023

  8. [8]

    YOLOv6 v3.0: A full-scale reloading

    Chuyi Cheng, Yifeng Song, Jian Li, Biao Wang, Aiguo Tao, Zeyu Chen, Jiayan Yuan, Chu Fan, Zhongyu Rong, et al. YOLOv6 v3.0: A full-scale reloading. arXiv preprint arXiv:2301.05586, 2023

  9. [9]

    A global-local self-adaptive network for drone-view object detection

    Sutao Deng, Shuai Li, Kai Xie, Wenfeng Song, Xiangwen Liao, Aimin Hao, and Hong Qin. A global-local self-adaptive network for drone-view object detection. IEEE Transactions on Image Processing, 30:1556–1569, 2021

  10. [10]

    CSWin Transformer: A general vision transformer backbone with cross-shaped windows

    Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. CSWin Transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12114–12124. IEEE, 2022

  11. [11]

    An image is worth 16×16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16×16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations (ICLR), 2021

  12. [12]

    VisDrone-DET2019: The vision meets drone object detection in image challenge results

    Dawei Du, Pengfei Zhu, Longyin Wen, Xiao Bian, Haibin Lin, Qinghua Hu, Tao Peng, Jiangyu Zheng, Xiangyu Wang, Yifan Zhang, et al. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 213–226. IEEE, 2019

  13. [13]

    Advancing sequential numerical prediction in autoregressive models

    Xingjian Fei, Jinghui Lu, Quan Sun, Hao Feng, Yanjie Wang, Wei Shi, Anlu Wang, Jingqun Tang, and Can Huang. Advancing sequential numerical prediction in autoregressive models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2025

  14. [14]

    TOOD: Task-aligned one-stage object detection

    Chengjian Feng, Yujie Zhong, Yu Gao, Matthew R Scott, and Weilin Huang. TOOD: Task-aligned one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3490–3499. IEEE, 2021

  15. [15]

    DocPedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding

    Hao Feng, Qi Liu, Hao Liu, Jingqun Tang, Wenqing Zhou, Hezhi Li, and Can Huang. DocPedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding. 2024

  16. [16]

    Dolphin-v2: Universal document parsing via scalable anchor prompting

    Hao Feng, Wei Shi, Kai Zhang, Xingjian Fei, Liangtao Liao, Dingkang Yang, Yingying Du, Xuecheng Wu, Jingqun Tang, Yang Liu, et al. Dolphin-v2: Universal document parsing via scalable anchor prompting. arXiv preprint arXiv:2602.05384, 2026

  17. [17]

    UniDoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding

    Hao Feng, Zijian Wang, Jingqun Tang, Jinghui Lu, Wenqing Zhou, Hezhi Li, and Can Huang. UniDoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding. 2023

  18. [18]

    Dolphin: Document image parsing via heterogeneous anchor prompting

    Hao Feng, Shuai Wei, Xingjian Fei, Wei Shi, Yunhao Han, Liangtao Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, Jingqun Tang, et al. Dolphin: Document image parsing via heterogeneous anchor prompting. In Findings of the Association for Computational Linguistics: ACL, pages 21919–21936, 2025

  19. [19]

    Rich feature hierarchies for accurate object detection and semantic segmentation

    Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 580–587. IEEE, 2014

  20. [20]

    Mask R-CNN

    Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2980–

  21. [21]

    Spatial pyramid pooling in deep convolutional networks for visual recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1904–1916, 2015

  22. [22]

    Coordinate attention for efficient mobile network design

    Qibin Hou, Daquan Zhou, and Jiashi Feng. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13708–13717. IEEE, 2021

  23. [23]

    MinDEV: Multi-modal integrated diffusion framework for video reconstruction from EEG signals

    Shuo Huang, Yuxuan Wang, Hanchi Luo, Huanyu Jing, Chao Qin, and Jingqun Tang. MinDEV: Multi-modal integrated diffusion framework for video reconstruction from EEG signals. In Proceedings of the ACM International Conference on Multimedia, pages 3350–3359, 2025

  24. [24]

    Ultralytics YOLOv8

    Glenn Jocher, Ayush Chaurasia, and Jing Qiu. Ultralytics YOLOv8. 2023. https://github.com/ultralytics/ultralytics

  25. [25]

    Ultralytics/YOLOv5: v5.0 – YOLOv5-P6 1280 models, AWS, Supervise.ly and YouTube integrations

    Glenn Jocher, Alex Stoken, Jiri Borovec, et al. Ultralytics/YOLOv5: v5.0 – YOLOv5-P6 1280 models, AWS, Supervise.ly and YouTube integrations, 2021

  26. [26]

    Augmentation for small object detection

    Máté Kisantal, Zbigniew Wojna, Jakub Muber, Jakub Jezierski, and Jarosław Kowalczyk. Augmentation for small object detection. In Proceedings of the International Conference on Advances in Computer Vision, 2019

  27. [27]

    Focus-and-detect: A small object detection framework for aerial images

    Onur Can Koyun, Ramazan Kadir Keser, İbrahim Batuhan Akkaya, and Buğra Ufuk Töreyin. Focus-and-detect: A small object detection framework for aerial images. Signal Processing: Image Communication, 104:116675, 2022

  28. [28]

    xView: Objects in context in overhead imagery

    Darius Lam, Richard Kuzma, Kevin McGee, Samuel Dooley, Michael Laielli, Matthew Klaric, Yaroslav Bulatov, and Brendan McCord. xView: Objects in context in overhead imagery. arXiv preprint arXiv:1802.07856, 2018

  29. [29]

    Density map guided object detection in aerial images

    Changlin Li, Taojiannan Yang, Sijie Zhu, Chen Chen, and Shanyue Guan. Density map guided object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 737–746. IEEE, 2020

  30. [30]

    DN-DETR: Accelerate DETR training by introducing query denoising

    Feng Li, Hao Zhang, Shilong Liu, Jian Guo, Lionel M Ni, and Lei Zhang. DN-DETR: Accelerate DETR training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13619–13627. IEEE, 2022

  31. [31]

    Feature pyramid networks for object detection

    Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 936–944. IEEE, 2017

  32. [32]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2980–2988. IEEE, 2017

  33. [33]

    Path aggregation network for instance segmentation

    Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8759–8768. IEEE, 2018

  34. [34]

    Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 38–55. Springer, 2024

  35. [35]

    SSD: Single shot multibox detector

    Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), pages 21–37. Springer, 2016

  36. [36]

    SPTS v2: Single-point scene text spotting

    Yuliang Liu, Jiaxin Zhang, Dezhi Peng, Mingxin Huang, Xinyu Wang, Jingqun Tang, Can Huang, Dahua Lin, et al. SPTS v2: Single-point scene text spotting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):15047–15063, 2023

  37. [37]

    Swin Transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10012–10022. IEEE, 2021

  38. [38]

    A bounding box is worth one token: Interleaving layout and text in a large language model for document understanding

    Jinghui Lu, Haiyang Yu, Yanjie Wang, Yongjie Ye, Jingqun Tang, Ziwei Yang, Binghong Wu, Qi Liu, Hao Feng, Han Wang, et al. A bounding box is worth one token: Interleaving layout and text in a large language model for document understanding. In Findings of the Association for Computational Linguistics: ACL, pages 7252–7273, 2025

  39. [39]

    Meta-DermDiagnosis: Few-shot skin disease identification using meta-learning

    Kushagra Mahajan, Monika Sharma, and Lovekesh Vig. Meta-DermDiagnosis: Few-shot skin disease identification using meta-learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 730–731. IEEE, 2020

  40. [40]

    Better to follow, follow to be better: Towards precise supervision of feature super-resolution for small object detection

    Junhyug Noh, Wonho Bae, Wonhee Lee, Jinhwan Seo, and Gunhee Kim. Better to follow, follow to be better: Towards precise supervision of feature super-resolution for small object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9725–

  41. [41]

    Power of tiling and merging in small object detection

    Fatih Özge Unel, Burak Osman Ozkalayci, and Cevahir Cigla. Power of tiling and merging in small object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 0–0. IEEE, 2019

  42. [42]

    DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution

    Siyuan Qiao, Liang-Chieh Chen, and Alan Yuille. DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10208–10219. IEEE, 2021

  43. [43]

    You only look once: Unified, real-time object detection

    Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–

  44. [44]

    YOLO9000: Better, faster, stronger

    Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7263–

  45. [45]

    Faster R-CNN: Towards real-time object detection with region proposal networks

    Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2017

  46. [46]

    MCTBench: Multimodal cognition towards text-rich visual scenes benchmark

    Biluo Shan, Xingjian Fei, Wei Shi, Anlu Wang, Guozhi Tang, Liangtao Liao, Jingqun Tang, Xiang Bai, and Can Huang. MCTBench: Multimodal cognition towards text-rich visual scenes benchmark. arXiv preprint arXiv:2410.11538, 2024

  47. [47]

    Weighted boxes fusion: Ensembling boxes from different object detection models

    Roman Solovyev, Weimin Wang, and Tatiana Gabruseva. Weighted boxes fusion: Ensembling boxes from different object detection models. Image and Vision Computing, 107:104117, 2021

  48. [48]

    HIT-UAV: A high-altitude infrared thermal dataset for unmanned aerial vehicle-based object detection

    Jiashun Suo, Tianyi Wang, Xingzhou Zhang, Haiyang Chen, Wei Zhou, and Weisong Shi. HIT-UAV: A high-altitude infrared thermal dataset for unmanned aerial vehicle-based object detection. Scientific Data, 10(1):227, 2023

  49. [49]

    EfficientDet: Scalable and efficient object detection

    Mingxing Tan, Ruoming Pang, and Quoc V Le. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10778–10787. IEEE, 2020

  50. [50]

    Character recognition competition for street view shop signs

    Jingqun Tang, Wei Du, Bochao Wang, Wenqing Zhou, Song Mei, Tao Xue, Xiang Xu, and Haoyue Zhang. Character recognition competition for street view shop signs. National Science Review, 10(6):nwad141, 2023

  51. [51]

    TextSquare: Scaling up text-centric visual instruction tuning

    Jingqun Tang, Chunhui Lin, Zhen Zhao, Shuai Wei, Binghong Wu, Qi Liu, Yang He, Kang Lu, Hao Feng, Yu Li, et al. TextSquare: Scaling up text-centric visual instruction tuning. arXiv preprint arXiv:2404.12803, 2024

  52. [52]

    MTVQA: Benchmarking multilingual text-centric visual question answering

    Jingqun Tang, Qi Liu, Yongjie Ye, Jinghui Lu, Shuai Wei, Anlu Wang, Chunhui Lin, Hao Feng, Zhen Zhao, et al. MTVQA: Benchmarking multilingual text-centric visual question answering. In Findings of the Association for Computational Linguistics: ACL, pages 7748–7763, 2025

  53. [53]

    Optimal boxes: Boosting end-to-end scene text recognition by adjusting annotated bounding boxes via reinforcement learning

    Jingqun Tang, Wenqiang Qian, Lei Song, Xiaobin Dong, Lingxi Li, and Xiang Bai. Optimal boxes: Boosting end-to-end scene text recognition by adjusting annotated bounding boxes via reinforcement learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 233–

  54. [54]

    You can even annotate text with voice: Transcription-only-supervised text spotting

    Jingqun Tang, Shaobo Qiao, Benlei Cui, Yuhang Ma, Shengjie Zhang, and Dimitrios Kanoulas. You can even annotate text with voice: Transcription-only-supervised text spotting. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4154–4163, 2022

  55. [55]

    Few could be better than all: Feature sampling and grouping for scene text detection

    Jingqun Tang, Wenqing Zhang, Hao Liu, Min-Ke Yang, Bo Jiang, Guangliang Hu, and Xiang Bai. Few could be better than all: Feature sampling and grouping for scene text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4563–

  56. [56]

    Recent advances in small object detection based on deep learning: A review

    Kang Tong, Yiquan Wu, and Fei Zhou. Recent advances in small object detection based on deep learning: A review. Image and Vision Computing, 97:103910, 2020

  57. [57]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017

  58. [58]

    PARGO: Bridging vision-language with partial and global views

    Anlu Wang, Biluo Shan, Wei Shi, Kai-Yun Lin, Xingjian Fei, Guozhi Tang, Liangtao Liao, Jingqun Tang, Can Huang, et al. PARGO: Bridging vision-language with partial and global views. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025

  59. [59]

    WildDoc: How far are we from achieving comprehensive and robust document understanding in the wild?

    Anlu Wang, Jingqun Tang, Liangtao Liao, Hao Feng, Qi Liu, Xingjian Fei, Jinghui Lu, Han Wang, Hao Liu, Yang Liu, et al. WildDoc: How far are we from achieving comprehensive and robust document understanding in the wild? In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025

  60. [60]

    YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors

    Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7464–7475. IEEE, 2023

  61. [61]

    A normalized Gaussian Wasserstein distance for tiny object detection

    Jinwang Wang, Chang Xu, Wen Yang, and Lei Yu. A normalized Gaussian Wasserstein distance for tiny object detection. arXiv preprint arXiv:2110.13389, 2021

  62. [62]

    A fast and robust convolutional neural network-based defect detection model in product quality control

    Tao Wang, Yang Chen, Munan Qiao, and Hichem Snoussi. A fast and robust convolutional neural network-based defect detection model in product quality control. The International Journal of Advanced Manufacturing Technology, 94(9):3465–3471, 2018

  63. [63]

    A-Fast-RCNN: Hard positive generation via adversary for object detection

    Xiaolong Wang, Abhinav Shrivastava, and Abhinav Gupta. A-Fast-RCNN: Hard positive generation via adversary for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3039–3048. IEEE, 2017

  64. [64]

    Object detection using clustering algorithm adaptive searching regions in aerial images

    Yang Wang, Yuliang Yang, and Xin Zhao. Object detection using clustering algorithm adaptive searching regions in aerial images. In Proceedings of the European Conference on Computer Vision Workshops (ECCVW), pages 651–664. Springer, 2020

  65. [65]

    CBAM: Convolutional block attention module

    Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19. Springer, 2018

  66. [66]

    PP-YOLOE: An evolved version of YOLO

    Shangliang Xu, Xinxin Wang, Wenyu Lv, Qinyao Chang, Cheng Cui, Kaipeng Deng, Guanzhong Wang, Qingqing Dang, Shengyu Wei, Yuning Du, et al. PP-YOLOE: An evolved version of YOLO. arXiv preprint arXiv:2203.16250, 2022

  67. [67]

    QueryDet: Cascaded sparse query for accelerating high-resolution small object detection

    Chenhongyi Yang, Zehao Huang, and Naiyan Wang. QueryDet: Cascaded sparse query for accelerating high-resolution small object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13658–13667. IEEE, 2022

  68. [68]

    Clustered object detection in aerial images

    Fan Yang, Heng Fan, Peng Chu, Erik Blasch, and Haibin Ling. Clustered object detection in aerial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8310–8319. IEEE, 2019

  69. [69]

    DINO: DETR with improved denoising anchor boxes for end-to-end object detection

    Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. DINO: DETR with improved denoising anchor boxes for end-to-end object detection. In Proceedings of the International Conference on Learning Representations (ICLR), 2023

  70. [70]

    FFCA-YOLO for small object detection in remote sensing images

    Yuting Zhang, Mang Ye, Jianbing Zhu, Siming Liu, Lei Zhang, and Bo Du. FFCA-YOLO for small object detection in remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 62:5611215, 2024

  71. [71]

    TabPedia: Towards comprehensive visual table understanding with concept synergy

    Weichao Zhao, Hao Feng, Qi Liu, Jingqun Tang, Shuai Wei, Binghong Wu, Liangtao Liao, Yongjie Ye, Hao Liu, Wenqing Zhou, et al. TabPedia: Towards comprehensive visual table understanding with concept synergy. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  72. [72]

    Multi-modal in-context learning makes an ego-evolving scene text recognizer

    Zhen Zhao, Jingqun Tang, Binghong Wu, Chunhui Lin, Hao Liu, Zhizhong Zhang, Xin Tan, Can Huang, and Yuan Xie. Multi-modal in-context learning makes an ego-evolving scene text recognizer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15230–15241. IEEE, 2024

  73. [73]

    Harmonizing visual text comprehension and generation

    Zhen Zhao, Jingqun Tang, Binghong Wu, Chunhui Lin, Shuai Wei, Hao Liu, Xin Tan, Zhizhong Zhang, Can Huang, et al. Harmonizing visual text comprehension and generation. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  74. [74]

    Distance-IoU loss: Faster and better learning for bounding box regression

    Zhaohui Zheng, Ping Wang, Wei Liu, Jinze Li, Rongguang Ye, and Dongwei Ren. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 12993–13000, 2020

  75. [75]

    Enhancing geometric factors in model learning and inference for object detection and instance segmentation

    Zhaohui Zheng, Ping Wang, Dongwei Ren, Wei Liu, Rongguang Ye, Qinghua Hu, and Wangmeng Zuo. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Transactions on Cybernetics, 52(8):8574–8586, 2022

  76. [76]

    SSA-CNN: Semantic self-attention CNN for pedestrian detection

    Chengji Zhou, Meiqing Wu, and Siew-Kei Lam. SSA-CNN: Semantic self-attention CNN for pedestrian detection. arXiv preprint arXiv:1902.09080, 2019

  77. [77]

    TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios

    Xingkui Zhu, Shuchang Lyu, Xu Wang, and Qi Zhao. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 2778–2788. IEEE, 2021

  78. [78]

    Deformable DETR: Deformable transformers for end-to-end object detection

    Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. In Proceedings of the International Conference on Learning Representations (ICLR), 2021