pith. machine review for the scientific record.

arxiv: 2604.19233 · v1 · submitted 2026-04-21 · 💻 cs.CV

Recognition: unknown

Adaptive Slicing-Assisted Hyper Inference for Enhanced Small Object Detection in High-Resolution Imagery

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords adaptive slicing · small object detection · high-resolution imagery · aerial imagery · object detection · inference optimization · VisDrone · xView

The pith

By choosing the number of overlapping patches adaptively from image resolution, ASAHI improves small object detection in high-resolution aerial imagery while cutting inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that fixed slicing wastes computation on high-resolution images and that switching to a resolution-based choice of 6 or 12 patches removes most of that waste without losing detection quality. A reader should care because small-object detectors are used in drone surveillance, disaster mapping, and traffic monitoring, where both accuracy and speed matter. The work adds a fine-tuning step that mixes full and sliced images during training and replaces standard duplicate removal with a combined clustering and distance-aware step. Experiments on two large aerial datasets show the new pipeline reaches the highest reported scores while running 20 to 25 percent faster than the fixed-slice baseline.
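
In code terms, that fine-tuning mixture is essentially a dataset-construction step. The sketch below is a minimal illustration, not the authors' implementation; the `build_saf_mixture` name and `tile_fn` helper are assumptions, and the real SAF strategy also remaps box annotations into each tile's coordinate frame, which is omitted here:

```python
def build_saf_mixture(images, tile_fn, keep_full=True):
    """Assemble a slicing-assisted fine-tuning (SAF) style training set:
    each full-resolution image plus all of its overlapping tiles.

    images:  iterable of training images (labels omitted for brevity)
    tile_fn: maps one image to its list of overlapping patches
    """
    samples = []
    for img in images:
        if keep_full:
            samples.append(img)       # global context, large objects
        samples.extend(tile_fn(img))  # enlarged small objects
    return samples
```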

Core claim

ASAHI determines the optimal number of slices according to image resolution, thereby generating either 6 or 12 overlapping patches via a learned threshold; it augments training with both full-resolution and sliced patches and applies Cluster-DIoU-NMS to merge detections in crowded scenes, yielding 56.8 percent mAP on VisDrone2019-DET-val and 22.7 percent on xView-test at reduced inference cost.

What carries the argument

The adaptive resolution-aware slicing algorithm that uses a learned threshold to decide between 6 and 12 overlapping patches, replacing fixed slice sizes to balance overlap and redundant computation.
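
The summary gives the rule's shape but not its parameters. As a rough sketch under stated assumptions — the pixel-count threshold, the 20% overlap ratio, and the 3×2 / 4×3 grid layouts are placeholders, not values from the paper — the adaptive tiling step might look like this:

```python
import math

def choose_num_slices(width: int, height: int,
                      pixel_threshold: int = 4_000_000) -> int:
    """Pick 6 or 12 overlapping patches from image resolution.

    `pixel_threshold` stands in for the paper's learned threshold;
    the value here is an arbitrary placeholder.
    """
    return 6 if width * height <= pixel_threshold else 12

def make_tiles(width: int, height: int, overlap: float = 0.2):
    """Tile the image into a grid of overlapping patches.

    6 patches -> 3x2 grid, 12 patches -> 4x3 grid (one plausible layout).
    Returns (x0, y0, x1, y1) boxes in pixel coordinates.
    """
    n = choose_num_slices(width, height)
    cols, rows = (3, 2) if n == 6 else (4, 3)
    # Patch size is enlarged so neighbours overlap by `overlap`.
    tile_w = math.ceil(width / (cols - (cols - 1) * overlap))
    tile_h = math.ceil(height / (rows - (rows - 1) * overlap))
    step_x = int(tile_w * (1 - overlap))
    step_y = int(tile_h * (1 - overlap))
    boxes = []
    for r in range(rows):
        for c in range(cols):
            x0 = min(c * step_x, width - tile_w)
            y0 = min(r * step_y, height - tile_h)
            boxes.append((max(x0, 0), max(y0, 0),
                          min(x0 + tile_w, width), min(y0 + tile_h, height)))
    return boxes
```

On a hypothetical 4000×3000 frame this rule picks 12 tiles of roughly 1177×1154 pixels; the point is only that the tile count, not the tile size, is the quantity the method holds fixed.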

If this is right

  • Detection accuracy reaches state-of-the-art levels on the VisDrone and xView benchmarks.
  • Inference time drops 20-25 percent compared with fixed-size slicing baselines.
  • Training on the mixture of full and sliced patches improves the detector's handling of small targets.
  • Cluster-DIoU-NMS reduces duplicate boxes more reliably in dense object clusters than either component alone.
  • Beneficial patch overlap is retained while overall redundant computation is lowered.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same resolution-driven patch count rule could be applied to video streams where object scale changes frame to frame.
  • The learned threshold might be re-optimized per camera type or altitude to match typical object densities.
  • Similar adaptive slicing could speed up high-resolution segmentation or instance counting tasks outside aerial domains.
  • Lower compute cost opens the possibility of running these detectors on embedded hardware without accuracy loss.

Load-bearing premise

Dynamically selecting 6 or 12 patches from a learned resolution threshold will always keep enough overlap to find small objects while trimming unnecessary work across every scene type and density level.

What would settle it

A test set of aerial images at varied resolutions would settle it: if the adaptive method misses more small objects than fixed slicing, or shows no measured reduction in inference time, the central efficiency claim is falsified.

Figures

Figures reproduced from arXiv: 2604.19233 by Francesco Moretti, Guiqin Mario, Yi Jin.

Figure 1
Figure 1. Overview of the proposed ASAHI detection framework. The input image is simultaneously processed through two complementary pathways: Full Inference (FI) for global context and large object detection, and ASAHI adaptive slicing for enhanced small object detection. The Cluster-DIoU-NMS (CDN) module merges and refines predictions from both pathways. During training, the SAF strategy constructs the fine-tuning …
original abstract

Deep learning-based object detectors have achieved remarkable success across numerous computer vision applications, yet they continue to struggle with small object detection in high-resolution aerial and satellite imagery, where dense object distributions, variable shooting angles, diminutive target sizes, and substantial inter-class variability pose formidable challenges. Existing slicing strategies that partition high-resolution images into manageable patches have demonstrated promising results for enlarging the effective receptive field of small targets; however, their reliance on fixed slice dimensions introduces significant redundant computation, inflating inference cost and undermining detection speed. In this paper, we propose Adaptive Slicing-Assisted Hyper Inference (ASAHI), a novel slicing framework that shifts the paradigm from prescribing a fixed slice size to adaptively determining the optimal number of slices according to image resolution, thereby substantially mitigating redundant computation while preserving beneficial overlap between adjacent patches. ASAHI integrates three synergistic components: (1)an adaptive resolution-aware slicing algorithm that dynamically generates 6 or 12 overlapping patches based on a learned threshold, (2)a slicing-assisted fine-tuning (SAF) strategy that constructs augmented training data comprising both full-resolution and sliced image patches, and (3)a Cluster-DIoU-NMS (CDN) post-processing module that combines the geometric merging efficiency of Cluster-NMS with the center-distance-aware suppression of DIoU-NMS to achieve robust duplicate elimination in crowded scenes. Extensive experiments on VisDrone2019 and xView demonstrate that ASAHI achieves state-of-the-art performance with 56.8% on VisDrone2019-DET-val and 22.7% on xView-test, while reducing inference time by 20-25% compared to the baseline SAHI method.
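
The abstract names the CDN module but not its equations. The center-distance half is standard DIoU-NMS (Zheng et al., reference [74] below): a detection is suppressed when its IoU with a higher-scoring box, minus a normalized squared center distance, exceeds the threshold. A minimal greedy sketch follows, assuming plain NumPy boxes; the `beta` weight and thresholds are placeholders, and the authors' Cluster-NMS batching is not shown:

```python
import numpy as np

def diou_nms(boxes: np.ndarray, scores: np.ndarray,
             iou_thresh: float = 0.5, beta: float = 1.0) -> list:
    """Greedy NMS with a DIoU penalty: suppression uses IoU minus a
    normalized center-distance term, so overlapping boxes whose centers
    are far apart are more likely to survive (helpful in dense scenes).

    boxes: (N, 4) array of (x1, y1, x2, y2); returns kept indices.
    """
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Pairwise IoU of the top box against the remainder.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        # Squared center distance, normalized by the squared diagonal of
        # the smallest box enclosing each pair (the DIoU penalty term).
        cx_i, cy_i = (boxes[i, 0] + boxes[i, 2]) / 2, (boxes[i, 1] + boxes[i, 3]) / 2
        cx_r, cy_r = (boxes[rest, 0] + boxes[rest, 2]) / 2, (boxes[rest, 1] + boxes[rest, 3]) / 2
        ex1 = np.minimum(boxes[i, 0], boxes[rest, 0])
        ey1 = np.minimum(boxes[i, 1], boxes[rest, 1])
        ex2 = np.maximum(boxes[i, 2], boxes[rest, 2])
        ey2 = np.maximum(boxes[i, 3], boxes[rest, 3])
        diag2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + 1e-9
        dist2 = (cx_i - cx_r) ** 2 + (cy_i - cy_r) ** 2
        diou = iou - beta * dist2 / diag2
        order = rest[diou <= iou_thresh]
    return keep
```

Cluster-NMS reaches the same fixed point with batched matrix iterations instead of this sequential loop; the paper's CDN presumably swaps the IoU criterion inside that scheme for the DIoU form above.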

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces Adaptive Slicing-Assisted Hyper Inference (ASAHI) for small-object detection in high-resolution aerial imagery. It replaces fixed-size slicing with an adaptive resolution-aware algorithm that selects either 6 or 12 overlapping patches according to a learned threshold, augments training via slicing-assisted fine-tuning (SAF), and applies Cluster-DIoU-NMS (CDN) for post-processing. The central empirical claim is that ASAHI reaches 56.8% mAP on VisDrone2019-DET-val and 22.7% mAP on xView-test while cutting inference time 20-25% relative to the SAHI baseline.

Significance. If the reported gains are reproducible and the adaptive threshold generalizes, the work would offer a practical improvement over fixed slicing methods by reducing redundant computation while maintaining overlap for small targets. The combination of SAF and CDN also targets training and crowded-scene issues that are common in aerial detection benchmarks.

major comments (3)
  1. [Abstract] The headline SOTA numbers (56.8% VisDrone, 22.7% xView) and the 20-25% speed-up are stated without any reference to the base detector, full baseline list, or ablation results that isolate the contribution of the adaptive slicing component versus SAF and CDN.
  2. [Abstract, adaptive resolution-aware slicing paragraph] The method relies on a learned threshold that selects between exactly two discrete patch counts (6 or 12); no value of the threshold, training procedure for it, or density-conditioned ablation is supplied, leaving the central assumption that resolution alone suffices to preserve overlap and avoid missed small objects untested.
  3. [Abstract] The manuscript reports no error analysis or failure-case study on scenes with varying object density, which directly bears on whether the claimed accuracy and speed improvements hold when the fixed 6/12 choices under-slice dense regions or over-slice sparse ones.
minor comments (1)
  1. [Abstract] Formatting inconsistency in the component list ('(1)an' lacks a space after the parenthesis).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We agree that additional context will improve clarity and will revise the abstract accordingly while preserving its conciseness. We address each point below.

point-by-point responses
  1. Referee: [Abstract] The headline SOTA numbers (56.8% VisDrone, 22.7% xView) and the 20-25% speed-up are stated without any reference to the base detector, full baseline list, or ablation results that isolate the contribution of the adaptive slicing component versus SAF and CDN.

    Authors: The abstract summarizes key outcomes with reference to the SAHI baseline. The base detector, complete baseline comparisons, and component-wise ablations (isolating adaptive slicing, SAF, and CDN) are provided in Sections 4.1–4.3 and Table 3. We will revise the abstract to explicitly name the base detector and note that ablations confirm the individual contributions. revision: yes

  2. Referee: [Abstract, adaptive resolution-aware slicing paragraph] The method relies on a learned threshold that selects between exactly two discrete patch counts (6 or 12); no value of the threshold, training procedure for it, or density-conditioned ablation is supplied, leaving the central assumption that resolution alone suffices to preserve overlap and avoid missed small objects untested.

    Authors: The learned threshold and its training procedure are detailed in Section 3.1. The abstract omits the numerical value and procedure for brevity. We will add a concise clause specifying the threshold and training approach. Our experiments already evaluate performance across datasets with varying densities; we will incorporate an explicit density-conditioned ablation summary into the revised abstract and experiments section. revision: yes

  3. Referee: [Abstract] The manuscript reports no error analysis or failure-case study on scenes with varying object density, which directly bears on whether the claimed accuracy and speed improvements hold when the fixed 6/12 choices under-slice dense regions or over-slice sparse ones.

    Authors: The abstract is space-constrained and therefore omits detailed error analysis. The manuscript provides qualitative results and density-stratified quantitative evaluations in Section 4.4 and the supplementary material. We will revise the abstract to include a brief summary of robustness across densities and expand the failure-case discussion in the main text. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical claims rest on external benchmarks

full rationale

The paper proposes ASAHI with three components—an adaptive resolution-aware slicing algorithm using a learned threshold to select 6 or 12 patches, a slicing-assisted fine-tuning strategy, and a Cluster-DIoU-NMS post-processor—and reports empirical SOTA results (56.8% on VisDrone2019-DET-val, 22.7% on xView-test) plus 20-25% faster inference than SAHI. These performance numbers are obtained via standard evaluation on held-out external datasets rather than any derivation, equation, or fitted parameter that reduces to the method's own inputs by construction. No mathematical first-principles chain, self-definitional quantities, or load-bearing self-citations appear in the provided text; the learned threshold is an internal design choice whose effect is measured externally, not presupposed.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on standard deep-learning training assumptions plus one learned decision threshold for choosing slice count; no new physical entities are postulated.

free parameters (1)
  • learned threshold for slice count
    Decides between generating 6 or 12 patches according to image resolution; value is learned during training.
axioms (1)
  • domain assumption: Training on a mixture of full-resolution and sliced patches improves small-object detection performance
    Invoked by the slicing-assisted fine-tuning component.

pith-pipeline@v0.9.0 · 5608 in / 1225 out tokens · 42643 ms · 2026-05-10T02:11:38.790249+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

78 extracted references · 9 canonical work pages · 1 internal anchor

  1. [1]

    Slicing aided hyper inference and fine-tuning for small object detection

    Fatih Cagatay Akyon, Sinan Onur Altinuc, and Alptekin Temizel. Slicing aided hyper inference and fine-tuning for small object detection. In Proceedings of the IEEE International Conference on Image Processing (ICIP), pages 966–

  2. [2]

    SOD-MTGAN: Small object detection via multi-task generative adversarial network

    Yancheng Bai, Yongqiang Zhang, Mingli Ding, and Bernard Ghanem. SOD-MTGAN: Small object detection via multi-task generative adversarial network. In Proceedings of the European Conference on Computer Vision (ECCV), pages 210–226. Springer, 2018

  3. [3]

    YOLOv4: Optimal Speed and Accuracy of Object Detection

    Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020

  4. [4]

    Soft-NMS – improving object detection with one line of code

    Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S Davis. Soft-NMS – improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 5562–5570. IEEE, 2017

  5. [5]

    VisDrone-DET2021: The vision meets drone object detection challenge results

    Yuren Cao, Zhijian He, Longjia Wang, Wengao Wang, Yixian Yuan, Daiwei Zhang, Jialiang Zhang, Pengfei Zhu, Luc Van Gool, Junwei Han, et al. VisDrone-DET2021: The vision meets drone object detection challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 2847–2854. IEEE, 2021

  6. [6]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), pages 213–229. Springer, 2020

  7. [7]

    DiffusionDet: Diffusion model for object detection

    Shoufa Chen, Peize Sun, Yibing Song, and Ping Luo. DiffusionDet: Diffusion model for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19830–19843. IEEE, 2023

  8. [8]

    YOLOv6 v3.0: A full-scale reloading

    Chuyi Cheng, Yifeng Song, Jian Li, Biao Wang, Aiguo Tao, Zeyu Chen, Jiayan Yuan, Chu Fan, Zhongyu Rong, et al. YOLOv6 v3.0: A full-scale reloading. arXiv preprint arXiv:2301.05586, 2023

  9. [9]

    A global-local self-adaptive network for drone-view object detection

    Sutao Deng, Shuai Li, Kai Xie, Wenfeng Song, Xiangwen Liao, Aimin Hao, and Hong Qin. A global-local self-adaptive network for drone-view object detection. IEEE Transactions on Image Processing, 30:1556–1569, 2021

  10. [10]

    CSWin Transformer: A general vision transformer backbone with cross-shaped windows

    Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. CSWin Transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12114–12124. IEEE, 2022

  11. [11]

    An image is worth 16×16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16×16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations (ICLR), 2021

  12. [12]

    VisDrone-DET2019: The vision meets drone object detection in image challenge results

    Dawei Du, Pengfei Zhu, Longyin Wen, Xiao Bian, Haibin Lin, Qinghua Hu, Tao Peng, Jiangyu Zheng, Xiangyu Wang, Yifan Zhang, et al. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 213–226. IEEE, 2019

  13. [13]

    Advancing sequential numerical prediction in autoregressive models

    Xingjian Fei, Jinghui Lu, Quan Sun, Hao Feng, Yanjie Wang, Wei Shi, Anlu Wang, Jingqun Tang, and Can Huang. Advancing sequential numerical prediction in autoregressive models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2025

  14. [14]

    TOOD: Task-aligned one-stage object detection

    Chengjian Feng, Yujie Zhong, Yu Gao, Matthew R Scott, and Weilin Huang. TOOD: Task-aligned one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3490–3499. IEEE, 2021

  15. [15]

    DocPedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding

    Hao Feng, Qi Liu, Hao Liu, Jingqun Tang, Wenqing Zhou, Hezhi Li, and Can Huang. DocPedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding. 2024

  16. [16]

    Dolphin-v2: Universal document parsing via scalable anchor prompting

    Hao Feng, Wei Shi, Kai Zhang, Xingjian Fei, Liangtao Liao, Dingkang Yang, Yingying Du, Xuecheng Wu, Jingqun Tang, Yang Liu, et al. Dolphin-v2: Universal document parsing via scalable anchor prompting. arXiv preprint arXiv:2602.05384, 2026

  17. [17]

    UniDoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding

    Hao Feng, Zijian Wang, Jingqun Tang, Jinghui Lu, Wenqing Zhou, Hezhi Li, and Can Huang. UniDoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding. 2023

  18. [18]

    Dolphin: Document image parsing via heterogeneous anchor prompting

    Hao Feng, Shuai Wei, Xingjian Fei, Wei Shi, Yunhao Han, Liangtao Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, Jingqun Tang, et al. Dolphin: Document image parsing via heterogeneous anchor prompting. In Findings of the Association for Computational Linguistics: ACL, pages 21919–21936, 2025

  19. [19]

    Rich feature hierarchies for accurate object detection and semantic segmentation

    Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 580–587. IEEE, 2014

  20. [20]

    Mask R-CNN

    Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2980–

  21. [21]

    Spatial pyramid pooling in deep convolutional networks for visual recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1904–1916, 2015

  22. [22]

    Coordinate attention for efficient mobile network design

    Qibin Hou, Daquan Zhou, and Jiashi Feng. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13708–13717. IEEE, 2021

  23. [23]

    MinDEV: Multi-modal integrated diffusion framework for video reconstruction from EEG signals

    Shuo Huang, Yuxuan Wang, Hanchi Luo, Huanyu Jing, Chao Qin, and Jingqun Tang. MinDEV: Multi-modal integrated diffusion framework for video reconstruction from EEG signals. In Proceedings of the ACM International Conference on Multimedia, pages 3350–3359, 2025

  24. [24]

    Ultralytics YOLOv8

    Glenn Jocher, Ayush Chaurasia, and Jing Qiu. Ultralytics YOLOv8. 2023. https://github.com/ultralytics/ultralytics

  25. [25]

    Ultralytics/YOLOv5: v5.0 – YOLOv5-P6 1280 models, AWS, Supervise.ly and YouTube integrations

    Glenn Jocher, Alex Stoken, Jiri Borovec, et al. Ultralytics/YOLOv5: v5.0 – YOLOv5-P6 1280 models, AWS, Supervise.ly and YouTube integrations, 2021

  26. [26]

    Augmentation for small object detection

    Máté Kisantal, Zbigniew Wojna, Jakub Muber, Jakub Jezierski, and Jarosław Kowalczyk. Augmentation for small object detection. In Proceedings of the International Conference on Advances in Computer Vision, 2019

  27. [27]

    Focus-and-detect: A small object detection framework for aerial images

    Onur Can Koyun, Ramazan Kadir Keser, İbrahim Batuhan Akkaya, and Buğra Ufuk Töreyin. Focus-and-detect: A small object detection framework for aerial images. Signal Processing: Image Communication, 104:116675, 2022

  28. [28]

    xView: Objects in context in overhead imagery

    Darius Lam, Richard Kuzma, Kevin McGee, Samuel Dooley, Michael Laielli, Matthew Klaric, Yaroslav Bulatov, and Brendan McCord. xView: Objects in context in overhead imagery. arXiv preprint arXiv:1802.07856, 2018

  29. [29]

    Density map guided object detection in aerial images

    Changlin Li, Taojiannan Yang, Sijie Zhu, Chen Chen, and Shanyue Guan. Density map guided object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 737–746. IEEE, 2020

  30. [30]

    DN-DETR: Accelerate DETR training by introducing query denoising

    Feng Li, Hao Zhang, Shilong Liu, Jian Guo, Lionel M Ni, and Lei Zhang. DN-DETR: Accelerate DETR training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13619–13627. IEEE, 2022

  31. [31]

    Feature pyramid networks for object detection

    Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 936–944. IEEE, 2017

  32. [32]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2980–2988. IEEE, 2017

  33. [33]

    Path aggregation network for instance segmentation

    Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8759–8768. IEEE, 2018

  34. [34]

    Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 38–55. Springer, 2024

  35. [35]

    SSD: Single shot multibox detector

    Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), pages 21–37. Springer, 2016

  36. [36]

    SPTS v2: Single-point scene text spotting

    Yuliang Liu, Jiaxin Zhang, Dezhi Peng, Mingxin Huang, Xinyu Wang, Jingqun Tang, Can Huang, Dahua Lin, et al. SPTS v2: Single-point scene text spotting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):15047–15063, 2023

  37. [37]

    Swin Transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10012–10022. IEEE, 2021

  38. [38]

    A bounding box is worth one token: Interleaving layout and text in a large language model for document understanding

    Jinghui Lu, Haiyang Yu, Yanjie Wang, Yongjie Ye, Jingqun Tang, Ziwei Yang, Binghong Wu, Qi Liu, Hao Feng, Han Wang, et al. A bounding box is worth one token: Interleaving layout and text in a large language model for document understanding. In Findings of the Association for Computational Linguistics: ACL, pages 7252–7273, 2025

  39. [39]

    Meta-DermDiagnosis: Few-shot skin disease identification using meta-learning

    Kushagra Mahajan, Monika Sharma, and Lovekesh Vig. Meta-DermDiagnosis: Few-shot skin disease identification using meta-learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 730–731. IEEE, 2020

  40. [40]

    Better to follow, follow to be better: Towards precise supervision of feature super-resolution for small object detection

    Junhyug Noh, Wonho Bae, Wonhee Lee, Jinhwan Seo, and Gunhee Kim. Better to follow, follow to be better: Towards precise supervision of feature super-resolution for small object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9725–

  41. [41]

    Power of tiling and merging in small object detection

    Fatih Özge Unel, Burak Osman Ozkalayci, and Cevahir Cigla. Power of tiling and merging in small object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 0–0. IEEE, 2019

  42. [42]

    DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution

    Siyuan Qiao, Liang-Chieh Chen, and Alan Yuille. DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10208–10219. IEEE, 2021

  43. [43]

    You only look once: Unified, real-time object detection

    Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–

  44. [44]

    YOLO9000: Better, faster, stronger

    Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7263–

  45. [45]

    Faster R-CNN: Towards real-time object detection with region proposal networks

    Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2017

  46. [46]

    MCTBench: Multimodal cognition towards text-rich visual scenes benchmark

    Biluo Shan, Xingjian Fei, Wei Shi, Anlu Wang, Guozhi Tang, Liangtao Liao, Jingqun Tang, Xiang Bai, and Can Huang. MCTBench: Multimodal cognition towards text-rich visual scenes benchmark. arXiv preprint arXiv:2410.11538, 2024

  47. [47]

    Weighted boxes fusion: Ensembling boxes from different object detection models

    Roman Solovyev, Weimin Wang, and Tatiana Gabruseva. Weighted boxes fusion: Ensembling boxes from different object detection models. Image and Vision Computing, 107:104117, 2021

  48. [48]

    HIT-UAV: A high-altitude infrared thermal dataset for unmanned aerial vehicle-based object detection

    Jiashun Suo, Tianyi Wang, Xingzhou Zhang, Haiyang Chen, Wei Zhou, and Weisong Shi. HIT-UAV: A high-altitude infrared thermal dataset for unmanned aerial vehicle-based object detection. Scientific Data, 10(1):227, 2023

  49. [49]

    EfficientDet: Scalable and efficient object detection

    Mingxing Tan, Ruoming Pang, and Quoc V Le. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10778–10787. IEEE, 2020

  50. [50]

    Character recognition competition for street view shop signs

    Jingqun Tang, Wei Du, Bochao Wang, Wenqing Zhou, Song Mei, Tao Xue, Xiang Xu, and Haoyue Zhang. Character recognition competition for street view shop signs. National Science Review, 10(6):nwad141, 2023

  51. [51]

    TextSquare: Scaling up text-centric visual instruction tuning

    Jingqun Tang, Chunhui Lin, Zhen Zhao, Shuai Wei, Binghong Wu, Qi Liu, Yang He, Kang Lu, Hao Feng, Yu Li, et al. TextSquare: Scaling up text-centric visual instruction tuning. arXiv preprint arXiv:2404.12803, 2024

  52. [52]

    MTVQA: Benchmarking multilingual text-centric visual question answering

    Jingqun Tang, Qi Liu, Yongjie Ye, Jinghui Lu, Shuai Wei, Anlu Wang, Chunhui Lin, Hao Feng, Zhen Zhao, et al. MTVQA: Benchmarking multilingual text-centric visual question answering. In Findings of the Association for Computational Linguistics: ACL, pages 7748–7763, 2025

  53. [53]

    Optimal boxes: Boosting end-to-end scene text recognition by adjusting annotated bounding boxes via reinforcement learning

    Jingqun Tang, Wenqiang Qian, Lei Song, Xiaobin Dong, Lingxi Li, and Xiang Bai. Optimal boxes: Boosting end-to-end scene text recognition by adjusting annotated bounding boxes via reinforcement learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 233–

  54. [54]

    You can even annotate text with voice: Transcription-only-supervised text spotting

    Jingqun Tang, Shaobo Qiao, Benlei Cui, Yuhang Ma, Shengjie Zhang, and Dimitrios Kanoulas. You can even annotate text with voice: Transcription-only-supervised text spotting. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4154–4163, 2022

  55. [55]

    Few could be better than all: Feature sampling and grouping for scene text detection

    Jingqun Tang, Wenqing Zhang, Hao Liu, Min-Ke Yang, Bo Jiang, Guangliang Hu, and Xiang Bai. Few could be better than all: Feature sampling and grouping for scene text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4563–

  56. [56]

    Recent advances in small object detection based on deep learning: A review

    Kang Tong, Yiquan Wu, and Fei Zhou. Recent advances in small object detection based on deep learning: A review. Image and Vision Computing, 97:103910, 2020

  57. [57]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017

  58. [58]

    PARGO: Bridging vision-language with partial and global views

    Anlu Wang, Biluo Shan, Wei Shi, Kai-Yun Lin, Xingjian Fei, Guozhi Tang, Liangtao Liao, Jingqun Tang, Can Huang, et al. PARGO: Bridging vision-language with partial and global views. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025

  59. [59]

    WildDoc: How far are we from achieving comprehensive and robust document understanding in the wild?

    Anlu Wang, Jingqun Tang, Liangtao Liao, Hao Feng, Qi Liu, Xingjian Fei, Jinghui Lu, Han Wang, Hao Liu, Yang Liu, et al. WildDoc: How far are we from achieving comprehensive and robust document understanding in the wild? In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025

  60. [60]

    YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors

    Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7464–7475. IEEE, 2023

  61. [61]

    A normalized Gaussian Wasserstein distance for tiny object detection

    Jinwang Wang, Chang Xu, Wen Yang, and Lei Yu. A normalized Gaussian Wasserstein distance for tiny object detection. arXiv preprint arXiv:2110.13389, 2021

  62. [62]

    A fast and robust convolutional neural network-based defect detection model in product quality control

    Tao Wang, Yang Chen, Munan Qiao, and Hichem Snoussi. A fast and robust convolutional neural network-based defect detection model in product quality control. The International Journal of Advanced Manufacturing Technology, 94(9):3465–3471, 2018

  63. [63]

    A-Fast-RCNN: Hard positive generation via adversary for object detection

    Xiaolong Wang, Abhinav Shrivastava, and Abhinav Gupta. A-Fast-RCNN: Hard positive generation via adversary for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3039–3048. IEEE, 2017

  64. [64]

    Object detection using clustering algorithm adaptive searching regions in aerial images

    Yang Wang, Yuliang Yang, and Xin Zhao. Object detection using clustering algorithm adaptive searching regions in aerial images. In Proceedings of the European Conference on Computer Vision Workshops (ECCVW), pages 651–664. Springer, 2020

  65. [65]

    CBAM: Convolutional block attention module

    Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19. Springer, 2018

  66. [66]

    PP-YOLOE: An evolved version of YOLO

    Shangliang Xu, Xinxin Wang, Wenyu Lv, Qinyao Chang, Cheng Cui, Kaipeng Deng, Guanzhong Wang, Qingqing Dang, Shengyu Wei, Yuning Du, et al. PP-YOLOE: An evolved version of YOLO. arXiv preprint arXiv:2203.16250, 2022

  67. [67]

    QueryDet: Cascaded sparse query for accelerating high-resolution small object detection

    Chenhongyi Yang, Zehao Huang, and Naiyan Wang. QueryDet: Cascaded sparse query for accelerating high-resolution small object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13658–13667. IEEE, 2022

  68. [68]

    Clustered object detection in aerial images

    Fan Yang, Heng Fan, Peng Chu, Erik Blasch, and Haibin Ling. Clustered object detection in aerial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8310–8319. IEEE, 2019

  69. [69]

    DINO: DETR with improved denoising anchor boxes for end-to-end object detection

    Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. DINO: DETR with improved denoising anchor boxes for end-to-end object detection. In Proceedings of the International Conference on Learning Representations (ICLR), 2023

  70. [70]

    FFCA-YOLO for small object detection in remote sensing images

    Yuting Zhang, Mang Ye, Jianbing Zhu, Siming Liu, Lei Zhang, and Bo Du. FFCA-YOLO for small object detection in remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 62:5611215, 2024

  71. [71]

    TabPedia: Towards comprehensive visual table understanding with concept synergy

    Weichao Zhao, Hao Feng, Qi Liu, Jingqun Tang, Shuai Wei, Binghong Wu, Liangtao Liao, Yongjie Ye, Hao Liu, Wenqing Zhou, et al. TabPedia: Towards comprehensive visual table understanding with concept synergy. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  72. [72]

    Multi-modal in-context learning makes an ego-evolving scene text recognizer

    Zhen Zhao, Jingqun Tang, Binghong Wu, Chunhui Lin, Hao Liu, Zhizhong Zhang, Xin Tan, Can Huang, and Yuan Xie. Multi-modal in-context learning makes an ego-evolving scene text recognizer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15230–15241. IEEE, 2024

  73. [73]

    Harmonizing visual text comprehension and generation

    Zhen Zhao, Jingqun Tang, Binghong Wu, Chunhui Lin, Shuai Wei, Hao Liu, Xin Tan, Zhizhong Zhang, Can Huang, et al. Harmonizing visual text comprehension and generation. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  74. [74]

    Distance-IoU loss: Faster and better learning for bounding box regression

    Zhaohui Zheng, Ping Wang, Wei Liu, Jinze Li, Rongguang Ye, and Dongwei Ren. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 12993–13000, 2020

  75. [75]

    Enhancing geometric factors in model learning and inference for object detection and instance segmentation

    Zhaohui Zheng, Ping Wang, Dongwei Ren, Wei Liu, Rongguang Ye, Qinghua Hu, and Wangmeng Zuo. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Transactions on Cybernetics, 52(8):8574–8586, 2022

  76. [76]

    SSA-CNN: Semantic self-attention CNN for pedestrian detection

    Chengji Zhou, Meiqing Wu, and Siew-Kei Lam. SSA-CNN: Semantic self-attention CNN for pedestrian detection. arXiv preprint arXiv:1902.09080, 2019

  77. [77]

    TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios

    Xingkui Zhu, Shuchang Lyu, Xu Wang, and Qi Zhao. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 2778–2788. IEEE, 2021

  78. [78]

    Deformable DETR: Deformable transformers for end-to-end object detection

    Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. In Proceedings of the International Conference on Learning Representations (ICLR), 2021