Development of embedded target detection system based on FPGA and YOLOv3-Tiny
Pith reviewed 2026-05-11 01:03 UTC · model grok-4.3
The pith
An FPGA system deploys an optimized YOLOv3-Tiny for target detection, achieving 0.211-second inference latency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through low-bit quantization, batch normalization fusion, and table lookup mapping to reduce model size and computation, together with a pipelined FPGA hardware accelerator featuring modular design and on-chip cache, the system achieves an inference latency of 0.211 seconds, a power efficiency of 10.11 GOPS/W, and up to 51.94% reduction in hardware resource utilization on the ZYNQ-XC7Z035 platform compared to similar designs.
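Table-lookup mapping, as named in the claim, typically replaces an activation function's runtime arithmetic with a table precomputed over every possible quantized input code. A minimal NumPy sketch, assuming 8-bit symmetric quantization, a fixed scale of 0.05, and a LeakyReLU activation (the paper does not state its exact bit-width, scale, or activation choice):

```python
import numpy as np

# Precompute the activation for all 256 possible int8 codes once,
# then replace the runtime multiply with a single array index.
SCALE = 0.05                                   # assumed quantization step
codes = np.arange(-128, 128, dtype=np.int16)   # every int8 input code
real = codes * SCALE                           # dequantized values
act = np.where(real > 0, real, 0.1 * real)     # LeakyReLU, slope 0.1
LUT = np.clip(np.round(act / SCALE), -128, 127).astype(np.int8)

def leaky_relu_lut(x_q):
    """x_q: int8 array of quantized activations -> int8 outputs via LUT."""
    return LUT[x_q.astype(np.int16) + 128]     # shift index into [0, 255]
```

On an FPGA the same table would sit in on-chip block RAM, turning the activation into one memory read per value instead of a multiply.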
What carries the argument
The pipelined FPGA hardware accelerator with on-chip cache optimization and modular design, integrated with model compression via quantization and fusion techniques.
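Of the compression steps listed, batch-normalization fusion is the most mechanical: the BN scale and shift fold into the preceding convolution's weights and bias, so the BN layer disappears at inference time. A sketch of the standard folding algebra (not the paper's own code):

```python
import numpy as np

def fuse_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm parameters into the preceding convolution.

    w: conv weights, shape (out_ch, in_ch, kh, kw)
    b: conv bias, shape (out_ch,)
    gamma, beta, mean, var: per-channel BN parameters, shape (out_ch,)
    """
    scale = gamma / np.sqrt(var + eps)           # per-output-channel scale
    w_fused = w * scale[:, None, None, None]     # scale each output filter
    b_fused = (b - mean) * scale + beta          # fold mean/shift into bias
    return w_fused, b_fused
```

Because BN(conv(x)) = scale * (w.x + b - mean) + beta, the fused layer computes the identical output with one fewer pass over the feature map.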
If this is right
- The optimized system outperforms comparable designs by 75.58% in inference speed.
- It achieves at least 29.45% better power efficiency at 10.11 GOPS/W.
- Hardware resource utilization drops by as much as 51.94%, allowing deployment on smaller or cheaper FPGAs.
- Off-chip data transmission is minimized, improving overall system efficiency for embedded AI.
- These optimizations enable practical use of deep learning models in resource-constrained embedded applications.
Where Pith is reading between the lines
- Applying the same quantization and pipelined accelerator approach to other lightweight CNNs could yield similar efficiency gains in edge computing.
- Improved power efficiency might allow longer battery life in portable detection devices like drones or robots.
- Verification on additional FPGA platforms would test if the gains are architecture-specific.
Load-bearing premise
The low-bit quantization, batch-norm fusion, and table-lookup mapping preserve enough detection accuracy for the target application while the FPGA metrics are measured under comparable conditions to other designs.
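Part of this premise can be probed in isolation: symmetric uniform quantization bounds the per-weight reconstruction error at half a quantization step, so the open question is how those bounded errors compound into mAP loss. A sketch assuming 8-bit symmetric per-tensor quantization (the paper's actual bit-width and scheme are not stated):

```python
import numpy as np

def quantize_symmetric(x, n_bits=8):
    """Symmetric uniform quantization to n_bits (an assumed scheme;
    the paper does not specify its bit-width)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.1, size=1000)            # stand-in for a weight tensor
q, s = quantize_symmetric(w, n_bits=8)
err = np.max(np.abs(w - dequantize(q, s)))   # worst-case rounding error
```

The per-weight error is at most s/2; whether the accumulated effect across layers stays within a usable accuracy budget is exactly what the missing mAP figures would show.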
What would settle it
Reproducing the system on a ZYNQ-XC7Z035 platform and measuring whether the inference time reaches or exceeds 0.211 seconds and power efficiency hits 10.11 GOPS/W, or checking if accuracy drops below usable levels on a standard object detection benchmark.
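The headline numbers can also be cross-checked for internal consistency. Assuming YOLOv3-Tiny's commonly cited ~5.56 G operations per 416×416 frame (a figure not given in the paper), the reported latency and efficiency together imply a low-single-digit-watt power draw, plausible for a ZYNQ-class device:

```python
# Back-of-envelope check relating the paper's headline numbers.
# Assumption (not from the paper): YOLOv3-Tiny at 416x416 needs
# roughly 5.56e9 operations per frame, a commonly cited figure.
ops_per_frame = 5.56e9        # assumed operation count per inference
latency_s = 0.211             # reported inference latency
gops_per_w = 10.11            # reported power efficiency

throughput_gops = ops_per_frame / latency_s / 1e9   # effective GOPS
implied_power_w = throughput_gops / gops_per_w      # implied power draw
```

If the real workload differs from the assumed operation count, the implied power scales proportionally, so this is a plausibility check rather than a verification.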
read the original abstract
Computational complexity and storage requirements are crucial factors influencing the performance and efficiency of convolutional neural networks (CNNs) in resource-constrained environments. This paper presents a high-performance embedded target detection system based on FPGA and YOLOv3-Tiny, specifically designed for embedded artificial intelligence applications. By integrating lightweight CNN optimization techniques with hardware accelerator design, significant improvements are made in both computational efficiency and resource utilization. Key optimizations, including low-bit quantization, batch normalization fusion, and table lookup mapping, reduce model parameters and computational complexity. Additionally, an FPGA hardware accelerator with a pipelined architecture is developed to enhance the efficiency of convolution operations while minimizing off-chip data transmission through modular design and on-chip cache optimization. On the ZYNQ-XC7Z035 platform, the system achieves an inference latency of 0.211 seconds, outperforming comparable designs by 75.58% in speed. The system achieves an power efficiency of 10.11 GOPS/W, surpassing comparable designs by at least 29.45%. Furthermore, hardware resource utilization is reduced by up to 51.94% compared to similar systems. This study offers innovative design methodologies and practical application examples for the efficient deployment of deep learning models on embedded platforms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper describes the development of an embedded target detection system using YOLOv3-Tiny on FPGA hardware. It applies optimizations including low-bit quantization, batch normalization fusion, and table-lookup mapping to reduce model complexity, then implements a pipelined hardware accelerator on the ZYNQ-XC7Z035 platform. The reported outcomes are an inference latency of 0.211 s (75.58% faster than comparables), power efficiency of 10.11 GOPS/W (at least 29.45% better), and up to 51.94% lower hardware resource utilization.
Significance. If the optimizations preserve usable detection accuracy, the work would offer concrete, reproducible design patterns for deploying lightweight CNN detectors on resource-limited FPGAs, with quantified gains in latency, energy efficiency, and area that could inform similar embedded-AI projects.
major comments (2)
- [Abstract] Abstract and experimental results: No mAP, AP50, precision, recall, or any other detection accuracy figures are supplied for the final low-bit quantized model, nor any comparison against the floating-point YOLOv3-Tiny baseline on COCO, VOC, or a custom dataset. Without these figures, the speed and efficiency claims cannot be interpreted as improvements to a functioning detector.
- [Results / Experimental evaluation] The weakest assumption—that low-bit quantization, batch-norm fusion, and table-lookup mapping preserve sufficient accuracy—is never tested or quantified, leaving the central claim that the system is a “viable target detector” unsupported by evidence.
minor comments (2)
- [Abstract] Abstract contains a grammatical error: “an power efficiency” should read “a power efficiency.”
- [Methods / Implementation] The manuscript would benefit from explicit statements of the bit-widths used in quantization, the datasets employed for accuracy verification (even if only in supplementary material), and direct side-by-side tables comparing resource, latency, and accuracy against the cited prior FPGA designs.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback emphasizing the necessity of accuracy metrics. We agree that the current manuscript lacks explicit quantification of detection performance for the quantized model, which is required to substantiate the viability of the embedded detector. We will revise the paper to include these results.
read point-by-point responses
-
Referee: [Abstract] Abstract and experimental results: No mAP, AP50, precision, recall, or any other detection accuracy figures are supplied for the final low-bit quantized model, nor any comparison against the floating-point YOLOv3-Tiny baseline on COCO, VOC, or a custom dataset. Without these figures, the speed and efficiency claims cannot be interpreted as improvements to a functioning detector.
Authors: We acknowledge the absence of accuracy metrics in the abstract and experimental sections. The optimizations (low-bit quantization, batch-norm fusion, table-lookup mapping) were designed to maintain functional detection capability, but the manuscript does not report mAP, AP50, precision, recall, or baseline comparisons. In the revised manuscript we will add these metrics for the final quantized model versus the floating-point baseline on the custom target-detection dataset used in the work, enabling direct assessment of any accuracy trade-offs. revision: yes
-
Referee: [Results / Experimental evaluation] The weakest assumption—that low-bit quantization, batch-norm fusion, and table-lookup mapping preserve sufficient accuracy—is never tested or quantified, leaving the central claim that the system is a “viable target detector” unsupported by evidence.
Authors: We agree that the manuscript does not explicitly test or report the impact of the optimizations on detection accuracy, leaving the viability claim without direct supporting data. The paper prioritizes hardware metrics (latency, efficiency, resource utilization) on the ZYNQ-XC7Z035 platform. We will revise the experimental evaluation section to include accuracy quantification and comparisons, thereby addressing this gap. revision: yes
Circularity Check
No circularity: paper reports measured FPGA implementation results
full rationale
The manuscript describes an engineering implementation of YOLOv3-Tiny optimizations (quantization, batch-norm fusion, table lookup) on a ZYNQ FPGA, followed by direct hardware measurements of latency, power efficiency, and resource use. No derivation chain, fitted parameters, predictions, or load-bearing self-citations are present; all performance numbers are empirical outputs from the built system. The absence of post-optimization accuracy metrics is a completeness issue, not circularity, and the reported figures are compared against external designs rather than derived from the paper's own assumptions.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Guo, H.: Object detection: From traditional methods to deep learning. Emerg. Sci. Technol. 3(2), 128–145 (2024). https://doi.org/10.12405/j.issn.2097-1486.2024.02.002
-
[2]
Wang, T., Wang, C., Zhou, X., Chen, H.: An overview of FPGA based deep learning accelerators: challenges and opportunities. In: Proceedings of the 2019 IEEE HPCC/SmartCity/DSS, pp. 1674–. IEEE, Zhangjiajie (2019). https://doi.org/10.1109/HPCC/SmartCity/DSS.2019.00229
-
[4]
Zhang, R., Ji, T., Dong, F.: Lightweight face detection network improved based on YOLO target detection algorithm. In: Proceedings of the 2020 2nd International Conference on Big Data and Artificial Intelligence (ISBDAI '20), pp. 415–420. Association for Computing Machinery, New York (2021). https://doi.org/10.1145/3436286.3436429
-
[5]
Wang, W., Cheng, Y., Zhou, Y., et al.: Research on lightweight network for rapid detection of remote sensing image targets based on YOLO. Remote Sens. Technol. Appl. 39(3), 547–556 (2024). https://doi.org/10.11873/j.issn.1004-0323.2024.3.0547
-
[6]
Bi, F., Yang, J.: Target detection system design and FPGA implementation based on YOLO v2 algorithm. In: Proceedings of the 2019 3rd International Conference on Imaging, Signal Processing and Communication (ICISPC), pp. 10–14. IEEE, Singapore (2019). https://doi.org/10.1109/ICISPC.2019.8935783
-
[7]
Zhang, L.H., Cai, J.J.: Target detection system based on lightweight Yolov5 algorithm. Comput. Technol. Dev. 32(11), 134–139 (2022). https://doi.org/10.3969/j.issn.1673-629X.2022.11.020
-
[8]
Ren, P., Xu, X., Huang, A., et al.: Optimizing the objective detection for RISC-V architecture. Artif. Intell. Secur. 3(3), 21–33 (2024). https://doi.org/10.12407/j.issn.2097-2075.2024.03.021
-
[9]
Jacob, B., et al.: Quantization and training of neural networks for efficient integer-arithmetic-only inference. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2704–2713. IEEE, Salt Lake City (2018). https://doi.org/10.1109/CVPR.2018.00286
-
[10]
Dai, Z.Y.: Design and implementation of convolutional neural network acceleration based on ZYNQ. Master's thesis, Inner Mongolia University (2021). https://doi.org/10.27224/d.cnki.gnmdu.2021.000713
-
[11]
Yu, H.Z.: SoC design of convolutional neural network based on RISC-V. Master's thesis, Shenyang University of Technology (2023). https://doi.org/10.27322/d...
discussion (0)