Pith · machine review for the scientific record

arxiv: 2604.09996 · v1 · submitted 2026-04-11 · 💻 cs.CV

Recognition: unknown

A Comparative Study of Modern Object Detectors for Robust Apple Detection in Orchard Imagery


Pith reviewed 2026-05-10 16:18 UTC · model grok-4.3

classification 💻 cs.CV
keywords apple detection · object detection · orchard imagery · YOLO · RT-DETR · benchmark · mAP · fruit localization

The pith

YOLO11n records the highest strict localization mAP for single-class apple detection on a fixed split of the AppleBBCH81 orchard dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a reproducible benchmark that applies the same training, validation, and test split plus identical evaluation rules to six object detectors on public orchard images. It measures performance with COCO-style mAP at two IoU thresholds and examines how precision, recall, and F1-score shift when the confidence threshold changes. YOLO11n leads on the stricter localization metric while YOLOv10n and RT-DETR-L trade off differently at low thresholds. The comparison matters because orchard tasks such as yield estimation and robotic picking need detectors that handle leaves, clusters, and varying light without frequent retraining. Readers can therefore select a model according to whether their downstream system values tight bounding boxes or maximum fruit recall.
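The two IoU thresholds in the evaluation protocol are what separate the strict metric from the lenient one: a predicted box that overlaps a ground-truth apple well enough for mAP@0.5 may still fail at 0.95. A minimal sketch of the intersection-over-union computation behind that matching (illustrative only, not the paper's evaluation code):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction shifted two pixels off a 10x10 ground-truth apple:
# IoU ~0.667 -- a true positive at the 0.5 threshold, a miss at 0.95.
print(iou((0, 0, 10, 10), (2, 0, 12, 10)))
```

COCO-style mAP@0.5:0.95 averages average precision over IoU thresholds from 0.5 to 0.95 in steps of 0.05, which is why it rewards tight boxes far more than mAP@0.5 does.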

Core claim

On the validation split, YOLO11n attains the best strict localization performance with mAP@0.5:0.95 = 0.6065 and mAP@0.5 = 0.9620, followed closely by RT-DETR-L and YOLOv10n. At a fixed operating point of confidence >= 0.05, YOLOv10n reaches the highest F1-score while RT-DETR-L produces very high recall accompanied by many low-confidence false positives. The study concludes that orchard deployment decisions must weigh localization-aware accuracy against threshold robustness and the needs of the specific application.
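The fixed-operating-point behavior described above is mechanical: lowering the confidence threshold raises recall and admits low-confidence false positives. A minimal sketch with toy detections (not the paper's data) showing how precision, recall, and F1 are read off at a threshold:

```python
def prf1(detections, num_gt, conf_thresh):
    """Precision, recall, F1 over (confidence, is_true_positive) detections
    kept at or above a confidence threshold, against num_gt ground-truth apples."""
    kept = [is_tp for conf, is_tp in detections if conf >= conf_thresh]
    tp = sum(kept)
    fp = len(kept) - tp
    precision = tp / (tp + fp) if kept else 0.0
    recall = tp / num_gt if num_gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy detections, sorted by confidence; 4 ground-truth apples in the image.
dets = [(0.9, True), (0.8, True), (0.6, False), (0.1, True), (0.05, False)]
print(prf1(dets, num_gt=4, conf_thresh=0.5))    # higher precision, lower recall
print(prf1(dets, num_gt=4, conf_thresh=0.05))   # higher recall, lower precision
```

This is the RT-DETR-L pattern in miniature: at confidence >= 0.05 almost every apple is kept, at the cost of precision.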

What carries the argument

A single deterministic train-validation-test split of the AppleBBCH81 dataset together with a unified COCO-style mAP protocol and precision-recall analysis applied identically to all six detectors.
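A deterministic split of this kind is straightforward to reproduce. A sketch under assumed conventions — the paper states only that the split is deterministic, so the seed and the 70/15/15 fractions below are illustrative, not the authors':

```python
import random

def deterministic_split(image_ids, seed=42, frac=(0.70, 0.15, 0.15)):
    """Shuffle once with a fixed seed so every detector trains and is
    evaluated on the identical train/val/test partition."""
    ids = sorted(image_ids)               # canonical order before shuffling
    random.Random(seed).shuffle(ids)      # seeded, hence repeatable
    n = len(ids)
    n_train = int(frac[0] * n)
    n_val = int(frac[1] * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = deterministic_split([f"img_{i:04d}" for i in range(100)])
# Re-running yields byte-identical partitions:
assert deterministic_split([f"img_{i:04d}" for i in range(100)]) == (train, val, test)
```

The sort-before-shuffle step matters: it removes any dependence on filesystem enumeration order, which is the usual way "fixed" splits silently drift between machines.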

If this is right

  • Applications that require precise fruit localization should prefer YOLO11n on the basis of its leading mAP@0.5:0.95 score.
  • Systems that tolerate more false positives in exchange for catching nearly all apples can consider RT-DETR-L at low confidence thresholds.
  • No detector dominates every operating point, so final selection must incorporate the precision-recall needs of the downstream yield or harvesting task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Real orchards often contain greater variability than a single public dataset can represent, so models may still require site-specific fine-tuning.
  • Future comparisons could add runtime speed and memory measurements to guide edge-device deployment in the field.
  • Extending the benchmark to video sequences would reveal whether the top models maintain consistent detections across consecutive frames.

Load-bearing premise

The single fixed split and the AppleBBCH81 images already capture enough real-world orchard variability in lighting, leaf clutter, dense clusters, and partial occlusions.

What would settle it

Re-evaluating the same six detectors on a new orchard dataset collected under different seasons, camera angles, or fruit varieties and obtaining a different performance ranking.

Figures

Figures reproduced from arXiv: 2604.09996 by Ajai Kumar Gautam, Mohammed Asad, Priyanshu Dhiman, Rishi Raj Prajapati.

Figure 1: Proposed workflow for apple detection on AppleBBCH81. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png]
Figure 2: Model taxonomy of the evaluated detectors. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png]
Figure 3: Precision-recall curves for all models on the validation split at IoU = 0.5. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png]
Figure 4: Object-level confusion summary for YOLO11n on the validation split. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png]
Original abstract

Accurate apple detection in orchard images is important for yield prediction, fruit counting, robotic harvesting, and crop monitoring. However, changing illumination, leaf clutter, dense fruit clusters, and partial occlusion make detection difficult. To provide a fair and reproducible comparison, this study establishes a controlled benchmark for single-class apple detection on the public AppleBBCH81 dataset using one deterministic train, validation, and test split and a unified evaluation protocol across six representative detectors: YOLOv10n, YOLO11n, RT-DETR-L, Faster R-CNN (ResNet50-FPN), FCOS (ResNet50-FPN), and SSDLite320 (MobileNetV3-Large). Performance is evaluated primarily using COCO-style mAP@0.5 and mAP@0.5:0.95, and threshold-dependent behavior is further analyzed using precision-recall curves and fixed-threshold precision, recall, and F1-score at IoU = 0.5. On the validation split, YOLO11n achieves the best strict localization performance with mAP@0.5:0.95 = 0.6065 and mAP@0.5 = 0.9620, followed closely by RT-DETR-L and YOLOv10n. At a fixed operating point with confidence >= 0.05, YOLOv10n attains the highest F1-score, whereas RT-DETR-L achieves very high recall but low precision because of many false positives at low confidence. These findings show that detector selection for orchard deployment should be guided not only by localization-aware accuracy but also by threshold robustness and the requirements of the downstream task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents a benchmark comparison of six object detectors (YOLOv10n, YOLO11n, RT-DETR-L, Faster R-CNN with ResNet50-FPN, FCOS with ResNet50-FPN, and SSDLite320 with MobileNetV3-Large) for single-class apple detection on the public AppleBBCH81 dataset. It uses one deterministic train-validation-test split and a unified COCO-style evaluation protocol, reporting mAP@0.5 and mAP@0.5:0.95 values along with precision-recall curves and fixed-threshold F1 scores. The primary result is that YOLO11n attains the highest validation mAP@0.5:0.95 of 0.6065 (and mAP@0.5 of 0.9620), with secondary observations on threshold robustness and downstream-task implications for orchard applications.

Significance. If the reported detector ordering proves stable, the work supplies a useful, reproducible reference point for agricultural vision systems, clarifying trade-offs between strict localization accuracy and operating-point behavior under realistic orchard variability. The public dataset and standard metrics support direct follow-on comparisons.

major comments (1)
  1. [Abstract and evaluation protocol] The central claim that YOLO11n achieves the best strict localization performance (mAP@0.5:0.95 = 0.6065 on the validation split) rests on a single deterministic train-validation-test partition with only point estimates. No standard deviations across random seeds, no k-fold results, and no sensitivity analysis to split choice are reported. Given the documented high variability from illumination, occlusions, and clustering, the observed ranking versus RT-DETR-L and YOLOv10n could reverse under a different partition, rendering the comparative conclusions unreliable.
minor comments (1)
  1. The abstract states that a 'unified evaluation protocol' is used, yet the precise training hyperparameters, augmentation strategy, and optimizer settings are not summarized; these should be tabulated in the methods section to enable exact reproduction.
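The multi-seed reporting the referee asks for is inexpensive to summarize once the runs exist. A sketch with hypothetical mAP values (invented for illustration — the paper reports only single-run point estimates):

```python
import statistics

def summarize_runs(map_scores):
    """Mean and sample standard deviation of a metric across training seeds --
    the spread that would qualify a single-split ranking."""
    return statistics.mean(map_scores), statistics.stdev(map_scores)

# Hypothetical mAP@0.5:0.95 from five seeds of one detector:
mean, std = summarize_runs([0.601, 0.612, 0.598, 0.607, 0.603])
print(f"mAP@0.5:0.95 = {mean:.4f} +/- {std:.4f}")
```

If the gap between two detectors' means is smaller than a couple of these standard deviations, the single-split ordering should not be read as a ranking.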

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address the single major comment below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract and evaluation protocol] The central claim that YOLO11n achieves the best strict localization performance (mAP@0.5:0.95 = 0.6065 on the validation split) rests on a single deterministic train-validation-test partition with only point estimates. No standard deviations across random seeds, no k-fold results, and no sensitivity analysis to split choice are reported. Given the documented high variability from illumination, occlusions, and clustering, the observed ranking versus RT-DETR-L and YOLOv10n could reverse under a different partition, rendering the comparative conclusions unreliable.

    Authors: We agree that reliance on a single deterministic split yields only point estimates and does not quantify variability across partitions or random seeds. This is a genuine limitation, especially given the orchard-specific sources of variability we discuss in the introduction. Our rationale for the fixed split was to guarantee identical training and evaluation conditions for all six detectors, thereby enabling a strictly controlled and fully reproducible benchmark on the public AppleBBCH81 dataset. In the revised manuscript we will (1) add an explicit limitations paragraph in the Discussion section that states the ranking is tied to one partition and may change under different splits, (2) recommend that practitioners perform multi-seed or k-fold validation before deployment, and (3) note that the released code and dataset make such sensitivity analyses straightforward for follow-up work. We will not claim statistical superiority of the observed ordering but will present the results as a reproducible reference point under the stated protocol. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical benchmark on public dataset

full rationale

The paper reports performance of standard detectors (YOLO variants, RT-DETR, Faster R-CNN, etc.) trained and evaluated on one fixed train-val-test split of the public AppleBBCH81 dataset. All claims are point estimates of COCO mAP@0.5 and mAP@0.5:0.95 plus PR curves and F1 scores at fixed thresholds. No equations, derivations, fitted parameters renamed as predictions, uniqueness theorems, or self-citations appear in the load-bearing steps. Results are straightforward measurements using off-the-shelf implementations and metrics; they do not reduce to their own inputs by construction. The single-split design is a methodological limitation but does not create circularity in the reported numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical comparative study with no free parameters, axioms, or invented entities in a mathematical sense; performance metrics are standard COCO-style evaluations.

pith-pipeline@v0.9.0 · 5621 in / 1069 out tokens · 72155 ms · 2026-05-10T16:18:07.055214+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

27 extracted references · 6 canonical work pages · 1 internal anchor

  1. [1] A. Wang, H. Chen, L. Liu, K. Chen, Z. Lin, J. Han, and G. Ding, "YOLOv10: Real-time end-to-end object detection," arXiv preprint arXiv:2405.14458, 2024. Available: https://arxiv.org/abs/2405.14458
  2. [2] Ultralytics, "Ultralytics YOLO11 documentation," online documentation, 2024, accessed 2026-03-04. Available: https://docs.ultralytics.com/models/yolo11/
  3. [3] Y. Zhao, W. Lv, S. Xu, J. Wei, G. Wang, Q. Dang, Y. Liu, and J. Chen, "DETRs beat YOLOs on real-time object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. Available: https://openaccess.thecvf.com/content/CVPR2024/papers/Zhao_DETRs_Beat_YOLOs_on_Real-time_Object_Detection_CVPR_2024...
  4. [4] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems (NeurIPS), 2015. Available: https://arxiv.org/abs/1506.01497
  5. [5] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  6. [6] Available: https://openaccess.thecvf.com/content_cvpr_2017/papers/Lin_Feature_Pyramid_Networks_CVPR_2017_paper.pdf (URL for reference [5], split into its own entry during extraction)
  7. [7] Z. Tian, C. Shen, H. Chen, and T. He, "FCOS: Fully convolutional one-stage object detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019. Available: https://openaccess.thecvf.com/content_ICCV_2019/papers/Tian_FCOS_Fully_Convolutional_One-Stage_Object_Detection_ICCV_2019_paper.pdf
  8. [8] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in Computer Vision – ECCV 2016, 2016, pp. 21–37. Available: https://arxiv.org/abs/1512.02325
  9. [9] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, Q. V. Le, and H. Adam, "Searching for MobileNetV3," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019. Available: https://openaccess.thecvf.com/content_ICCV_2019/papers/Howard_Searching_for_MobileNetV3_ICCV...
  10. [10] PyTorch Team, "Torchvision model docs: ssdlite320_mobilenet_v3_large," online documentation, 2024, accessed 2026-03-04. Available: https://docs.pytorch.org/vision/main/models/generated/torchvision.models.detection.ssdlite320_mobilenet_v3_large.html
  11. [11] I. Sa, Z. Ge, F. Dayoub, B. Upcroft, T. Perez, and C. McCool, "DeepFruits: A fruit detection system using deep neural networks," Sensors, vol. 16, no. 8, p. 1222, 2016. Available: https://www.mdpi.com/1424-8220/16/8/1222
  12. [12] S. Bargoti and J. Underwood, "Deep fruit detection in orchards," in 2017 IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 3626–3633.
  13. [13] N. Häni, P. Roy, and V. Isler, "MinneApple: A benchmark dataset for apple detection and segmentation," IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 852–858, 2020.
  14. [14] S. Kodors, I. Zarembo, G. Lācis, L. Litavniece, I. Apeināns, M. Sondors, and A. Pacejs, "Autonomous yield estimation system for small commercial orchards using UAV and AI," Drones, vol. 8, no. 12, p. 734.
  15. [15] Available: https://www.mdpi.com/2504-446X/8/12/734 (URL for reference [14], split into its own entry during extraction)
  16. [16] projectlzp201910094, "AppleBBCH81 dataset (apple fruits, BBCH 81-85, YOLO format)," Kaggle dataset, 2024, accessed 2026-03-04. Available: https://www.kaggle.com/datasets/projectlzp201910094/applebbch81
  17. [17] M. Sun, L. Xu, X. Chen, Z. Ji, Y. Zheng, and W. Jia, "BFP Net: Balanced feature pyramid network for small apple detection in complex orchard environment," Plant Phenomics, vol. 2022, p. 9892464, 2022.
  18. [18] L. Ma, L. Zhao, Z. Wang, J. Zhang, and G. Chen, "Detection and counting of small target apples under complicated environments by using improved YOLOv7-tiny," Agronomy, vol. 13, no. 5, p. 1419, 2023. Available: https://www.mdpi.com/2073-4395/13/5/1419
  19. [19] M. Ferrer-Ferrer, J. Ruiz-Hidalgo, E. Gregorio, V. Vilaplana, J.-R. Morros, and J. Gené-Mola, "Simultaneous fruit detection and size estimation using multitask deep neural networks," Biosystems Engineering, vol. 233, pp. 63–75, 2023.
  20. [20] S. Kodors, M. Sondors, I. Apeinans, I. Zarembo, G. Lacis, E. Rubauskis, and K. Karklina, "Importance of mosaic augmentation for agricultural image dataset," Agronomy Research, vol. 22, no. 1, pp. 168–179, 2024.
  21. [21] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. Available: https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf
  22. [22] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779–788. Available: https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Redmon_You_Only_Look_CVPR_2016_paper.pdf
  23. [23] Ultralytics, "Ultralytics YOLO repository," GitHub repository, 2026, accessed 2026-03-04. Available: https://github.com/ultralytics/ultralytics
  24. [24] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. Available: https://openaccess.thecvf.com/content_cvpr_2018/papers/Sandler_MobileNetV2_Inverted_Residuals_CVPR_2018_paper.pdf
  25. [25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems (NeurIPS), 2017. Available: https://arxiv.org/abs/1706.03762
  26. [26] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," arXiv preprint arXiv:2005.12872, 2020. Available: https://arxiv.org/abs/2005.12872
  27. [27] N. Häni, P. Roy, and V. Isler, "A comparative study of fruit detection and counting methods for yield mapping in apple orchards," Journal of Field Robotics, vol. 37, no. 2, pp. 263–282, 2020. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/rob.21902