pith. machine review for the scientific record.

arxiv: 2605.03396 · v2 · submitted 2026-05-05 · 💻 cs.AR

Recognition: no theorem link

Design and Implementation of BNN-Based Object Detection on FPGA

Baochang Zhang, Gaolong Zhang, Haoyu Huang, Mengyuan Zhu, Xiaoyu Xu, Xuyu Zhao, Yanjing Li, Yunpeng Wu

Pith reviewed 2026-05-12 01:35 UTC · model grok-4.3

classification 💻 cs.AR
keywords binary neural network · object detection · FPGA · YOLOv3-tiny · Verilog RTL · quantization · hardware accelerator · VOC dataset

The pith

A YOLOv3-tiny-like object detector using 1-bit weights runs on FPGA in Verilog RTL with 0.999964 correlation to its ONNX software counterpart.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper converts a trained BNN model into a complete hardware design for an FPGA. It extracts weights and parameters from ONNX, packs them into memory, and codes padding, binary convolutions, quantization, pooling, and detection logic entirely in Verilog. Simulation confirms the hardware produces nearly identical raw outputs to the original model while delivering 39.6 percent mAP50 on the VOC dataset at 0.098 GFLOPs and 0.74 million parameters.
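The extract-and-pack step is conventional enough to sketch. Below is a minimal Python illustration of packing binarized weights into a Vivado COE file; the function name and file name are hypothetical, and real extraction would walk the ONNX graph (e.g. with the `onnx` package) rather than generating random weights.

```python
import numpy as np

def pack_binary_weights_to_coe(weights, path):
    """Pack {-1,+1} weights into 32-bit hex words in Vivado COE format."""
    bits = (weights.flatten() > 0).astype(np.uint8)        # map -1 -> 0, +1 -> 1
    pad = (-len(bits)) % 32                                # pad to a whole ROM word
    bits = np.concatenate([bits, np.zeros(pad, dtype=np.uint8)])
    words = bits.reshape(-1, 32)
    with open(path, "w") as f:
        f.write("memory_initialization_radix=16;\n")
        f.write("memory_initialization_vector=\n")
        lines = []
        for w in words:
            val = 0
            for b in w:                                    # MSB-first packing
                val = (val << 1) | int(b)
            lines.append(f"{val:08x}")
        f.write(",\n".join(lines) + ";\n")

# stand-in for a tensor extracted from the ONNX model (16 filters, 3x3x3)
w = np.random.choice([-1, 1], size=(16, 3, 3, 3)).astype(np.int8)
pack_binary_weights_to_coe(w, "conv2_w.coe")
```

The word width and bit order are design choices the paper does not publish; a real flow must match whatever the Verilog ROM reader expects.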

Core claim

The central claim is that a hybrid-precision BNN detector, with 1-bit weights and 8-bit activations in most layers plus fixed-point heads, can be implemented on FPGA such that the final detection outputs match the software ONNX node at a correlation of 0.999964 and mean absolute error of 0.020027, while achieving 39.6 percent mAP50 on VOC.

What carries the argument

The binary convolution processing element that fuses Mul_prev channel compensation directly into the accumulation step so per-channel scaling occurs without extra multipliers.

If this is right

  • Object detection becomes feasible on low-cost FPGAs with far lower memory footprint than full-precision networks.
  • The design supports real-time inference for embedded vision at 0.098 GFLOPs.
  • Hybrid 1-bit weight and 8-bit activation layers preserve enough accuracy for practical use on VOC.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same RTL structure could be reused for other BNN vision tasks by swapping the final head.
  • Power and latency measurements on physical FPGA boards would reveal whether simulation numbers translate to deployed performance.
  • Direct connection to camera sensors on the same FPGA fabric would eliminate external data movement overhead.

Load-bearing premise

The Verilog code correctly performs the binary convolutions, post-processing quantization, and fused channel compensation without numerical or logical errors beyond those captured by the reported simulation metrics.

What would settle it

Feeding the same test images into actual FPGA hardware and measuring the deviation in raw detection outputs from the ONNX reference; correlation dropping well below 0.999 or error rising above 0.02 would falsify the implementation.
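The fidelity test itself is a one-liner pair. A sketch of the check, with thresholds mirroring the falsification criterion above (the tensor shapes and names are placeholders):

```python
import numpy as np

def fidelity_check(hw_out, onnx_out, corr_min=0.999, mae_max=0.02):
    """Compare a hardware raw detection tensor against the ONNX reference."""
    hw = np.asarray(hw_out, dtype=np.float64).ravel()
    ref = np.asarray(onnx_out, dtype=np.float64).ravel()
    corr = np.corrcoef(hw, ref)[0, 1]          # Pearson correlation
    mae = np.mean(np.abs(hw - ref))            # mean absolute error
    return corr, mae, (corr >= corr_min and mae <= mae_max)
```

Note that correlation alone is insensitive to a constant offset in the hardware output, which is why the MAE bound is the more stringent of the two criteria.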

Figures

Figures reproduced from arXiv: 2605.03396 by Baochang Zhang, Gaolong Zhang, Haoyu Huang, Mengyuan Zhu, Xiaoyu Xu, Xuyu Zhao, Yanjing Li, Yunpeng Wu.

Figure 1. RTL data-flow diagram of the W1A8 YOLOv3-tiny-like detector on PYNQ-Z2.
Figure 2. Params-mAP50 and FLOPs-mAP50 comparison. The proposed model lies in the lower-left region of both planes, indicating very low storage and computation while preserving basic detection capability, making it suitable for memory- and compute-limited FPGA platforms.
Figure 3. Visualization of W1A8 quantized detection results from the FPGA RTL.
Original abstract

This paper implements a Binary Neural Network (BNN) based YOLOv3-tiny-like object detector on a low-cost FPGA. The network takes 320×320×3 RGB images as input. Its main convolution layers use 1-bit weights and 8-bit activations, while Conv1 and the final detection head use fixed-point standard convolutions. From the trained ONNX model, weights, biases, and quantization parameters are extracted, converted to fixed point, packed into COE files, and stored in Vivado BRAM ROMs. The hardware is written fully in Verilog RTL and includes padding, line buffering, binary convolution, quantization post-processing, max pooling, and detection-head computation. For layers where Mul_prev is indexed by input channel and Div_current by output channel, Mul_prev is fused into the BNN PE so that channel-wise compensation is applied during accumulation. On VOC, the model obtains 39.6% mAP50 with 0.098 GFLOPs and 0.74 M parameters. RTL simulation shows that the final raw detection output reaches a correlation coefficient of 0.999964 and a mean absolute error of 0.020027 against the corresponding ONNX node.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents the design and Verilog RTL implementation of a mixed-precision BNN variant of YOLOv3-tiny for object detection on FPGA. Input images are 320×320×3 RGB; main convolution layers use 1-bit weights and 8-bit activations while Conv1 and the detection head use fixed-point standard convolutions. Weights, biases and quantization parameters are extracted from a trained ONNX model, packed into COE files and stored in BRAM ROMs. The hardware includes padding, line buffers, binary-convolution PEs, quantization post-processing, max-pooling and detection-head computation, with Mul_prev channel compensation fused into the accumulation path. On VOC the model reports 39.6 % mAP50 at 0.098 GFLOPs and 0.74 M parameters. RTL simulation of the final raw detection tensor yields a correlation of 0.999964 and MAE of 0.020027 versus the corresponding ONNX node.

Significance. If the reported simulation fidelity holds, the work supplies a concrete, low-resource FPGA realization of a BNN object detector whose numerical output matches an independent reference model to high precision. The explicit verification of binary-convolution PEs, quantization post-processing and Mul_prev fusion against an ONNX oracle is a strength for an engineering implementation paper and could serve as a useful reference for edge-AI accelerator designs.

major comments (2)
  1. [Results / Implementation] The manuscript provides no post-synthesis resource utilization (LUTs, BRAMs, DSPs) or timing (Fmax, slack) figures for the target FPGA. These metrics are load-bearing for any claim of a practical low-cost FPGA implementation and should be reported together with the RTL simulation results.
  2. [Network Architecture / Experimental Setup] Exact layer counts, per-layer bit-width choices, and the training/quantization procedure used to obtain the ONNX model are not stated. Without these details the 39.6 % mAP50 figure cannot be reproduced or compared with other BNN detectors, weakening the experimental section.
minor comments (2)
  1. [Abstract] The abstract and results section should explicitly name the target FPGA device and the Vivado version used for synthesis and simulation.
  2. [Hardware Architecture] Notation for Mul_prev and Div_current should be defined once in a table or equation before being used in the hardware-description paragraphs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below.

Point-by-point responses
  1. Referee: [Results / Implementation] The manuscript provides no post-synthesis resource utilization (LUTs, BRAMs, DSPs) or timing (Fmax, slack) figures for the target FPGA. These metrics are load-bearing for any claim of a practical low-cost FPGA implementation and should be reported together with the RTL simulation results.

    Authors: We agree with the referee that post-synthesis metrics are essential to substantiate the practicality of the FPGA implementation. The current work primarily focuses on the Verilog RTL design and its functional verification through simulation against the ONNX model. In the revised manuscript, we will add the post-synthesis resource utilization figures (including LUTs, BRAMs, and DSPs) and timing reports (Fmax and slack) for the target FPGA. (revision: yes)

  2. Referee: [Network Architecture / Experimental Setup] Exact layer counts, per-layer bit-width choices, and the training/quantization procedure used to obtain the ONNX model are not stated. Without these details the 39.6 % mAP50 figure cannot be reproduced or compared with other BNN detectors, weakening the experimental section.

    Authors: We thank the referee for this observation. Although the abstract outlines the bit-width choices at a high level, we acknowledge the need for more precise details. In the revised version, we will include exact layer counts, a per-layer specification of bit-widths for weights and activations, and a description of the training and quantization procedure that led to the ONNX model. This will improve reproducibility and allow better comparison with related BNN object detectors. (revision: yes)

Circularity Check

0 steps flagged

No significant circularity; implementation validated externally

Full rationale

This is an engineering implementation paper describing Verilog RTL for a mixed-precision BNN YOLOv3-tiny variant on FPGA. The load-bearing claim is that the hardware produces raw detection outputs matching an independent ONNX reference model (correlation 0.999964, MAE 0.020027) and achieves 39.6% mAP50 on the standard VOC dataset. No mathematical derivations, equations, or predictions are presented that reduce to self-fitted parameters, self-citations, or ansatzes. Weights and quantization parameters are extracted from a separately trained ONNX model and used as fixed inputs to the RTL; the simulation directly compares against that external model rather than deriving the match by construction. No self-citation chains or uniqueness theorems appear in the provided text. The result is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The work rests on standard BNN quantization and FPGA RTL practices extracted from a pre-trained model. No new physical entities or ungrounded postulates are introduced.

free parameters (2)
  • Weight bit-width for main layers
    Chosen as 1 bit to enable efficient XNOR-popcount hardware while accepting accuracy trade-off.
  • Activation bit-width for main layers
    Chosen as 8 bits to balance precision and FPGA resource usage.
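These two choices can be illustrated in a few lines. A minimal W1A8 quantization of a weight/activation pair, assuming sign-based binarization and round-to-nearest with a calibrated scale (the paper does not spell out its rounding scheme, so this is a sketch, not the authors' procedure):

```python
import numpy as np

def quantize_w1a8(weights, activations, act_scale):
    """1-bit weights via sign(); 8-bit activations via scale-and-round."""
    w_bin = np.where(weights >= 0, 1, -1).astype(np.int8)   # {-1, +1}
    a_q = np.clip(np.round(activations / act_scale), -128, 127).astype(np.int8)
    return w_bin, a_q
```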
axioms (2)
  • standard math Binary convolution can be realized by XNOR and population count operations with subsequent scaling.
    Invoked when describing the BNN processing elements and fusion of Mul_prev.
  • domain assumption RTL simulation accurately predicts post-synthesis behavior for the described neural network operations.
    Underlying the claim that simulation correlation validates the hardware.
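The first axiom is the standard bit-level identity: for ±1 vectors packed as bits (bit 1 ↔ +1, bit 0 ↔ −1), the dot product equals 2·popcount(XNOR) − n. A quick check using plain Python integers as the bit vectors:

```python
def binary_dot_via_xnor(a_bits, b_bits, n):
    """Dot product of two ±1 vectors, each encoded as an n-bit integer,
    computed via the XNOR-popcount identity."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask        # 1 where the signs agree
    return 2 * bin(xnor).count("1") - n     # matches = +1, mismatches = -1

# cross-check against the naive ±1 dot product
import random
random.seed(0)
n = 64
a = random.getrandbits(n)
b = random.getrandbits(n)
naive = sum((1 if (a >> i) & 1 else -1) * (1 if (b >> i) & 1 else -1)
            for i in range(n))
assert binary_dot_via_xnor(a, b, n) == naive
```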

pith-pipeline@v0.9.0 · 5535 in / 1503 out tokens · 76659 ms · 2026-05-12T01:35:19.841205+00:00 · methodology


Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 1 internal anchor

  1. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In: Advances in Neural Information Processing Systems (2015)
  2. Liu, W., Anguelov, D., Erhan, D., et al.: SSD: Single Shot MultiBox Detector. In: European Conference on Computer Vision, pp. 21-37 (2016)
  3. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You Only Look Once: Unified, Real-Time Object Detection. In: CVPR, pp. 779-788 (2016)
  4. Redmon, J., Farhadi, A.: YOLO9000: Better, Faster, Stronger. In: CVPR, pp. 7263-7271 (2017)
  5. Redmon, J., Farhadi, A.: YOLOv3: An Incremental Improvement. arXiv:1804.02767 (2018)
  6. Courbariaux, M., Bengio, Y., David, J.P.: BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations. In: NeurIPS (2015)
  7. Hubara, I., Courbariaux, M., Soudry, D., et al.: Binarized Neural Networks. arXiv:1602.02830 (2016)
  8. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. In: ECCV, pp. 525-542 (2016)
  9. Zhou, S., Wu, Y., Ni, Z., et al.: DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. arXiv:1606.06160 (2016)
  10. Liu, Z., Shen, Z., Savvides, M., Cheng, K.T.: ReActNet: Towards Precise Binary Neural Network with Generalized Activation Functions. arXiv:2003.03488 (2020)
  11. Esser, S.K., McKinstry, J.L., Bablani, D., et al.: Learned Step Size Quantization. In: ICLR (2020)
  12. Jacob, B., Kligys, S., Chen, B., et al.: Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. In: CVPR, pp. 2704-2713 (2018)
  13. Sze, V., Chen, Y.H., Yang, T.J., Emer, J.: Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proceedings of the IEEE 105(12), 2295-2329 (2017)
  14. Umuroglu, Y., Fraser, N.J., Gambardella, G., et al.: FINN: A Framework for Fast, Scalable Binarized Neural Network Inference. In: FPGA, pp. 65-74 (2017)
  15. Blott, M., Preusser, T.B., Fraser, N.J., et al.: FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks. ACM TRETS 11(3), Article 16 (2018)
  16. Sharma, H., Park, J., Mahajan, D., et al.: From High-Level Deep Neural Models to FPGAs. In: MICRO (2016)
  17. Duarte, J., Han, S., Harris, P., et al.: Fast Inference of Deep Neural Networks in FPGAs for Particle Physics. Journal of Instrumentation 13(07), P07027 (2018)
  18. Zhao, R., Song, W., Zhang, W., et al.: Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs. In: FPGA, pp. 15-24 (2017)
  19. Su, Y., Seng, K.P., Ang, L.M., Smith, J.: Binary Neural Networks in FPGAs: Architectures, Tool Flows and Hardware Comparisons. Sensors 23(22), 9254 (2023)
  20. Ji, M., Al-Ars, Z., Chang, Y., Zhang, B.: Fully Pipelined FPGA Acceleration of Binary Convolutional Neural Networks with Neural Architecture Search. Journal of Circuits, Systems and Computers 33(10), 2450170 (2024)
  21. Qian, W., Zhu, Z., Zhu, C., Zhu, Y.: FPGA-Based Accelerator for YOLOv5 Object Detection with Optimized Computation and Data Access for Edge Deployment. Parallel Computing, 103138 (2025)
  22. Wen, C.J., Wang, L.T., Wang, Q., Jiang, S.: Design and Implementation of FPGA Acceleration for YOLOv3-tiny. Computer Applications and Software 42(9) (2025)
  23. Wang, R.J., Li, X., Ling, C.X.: Pelee: A Real-Time Object Detection System on Mobile Devices. In: Advances in Neural Information Processing Systems 31 (NeurIPS 2018) (2018)
  24. Zhang, X., Liu, C., Yang, D., Song, T., Ye, Y., Li, K., Song, Y.: RFAConv: Receptive-Field Attention Convolution for Improving Convolutional Neural Networks. arXiv:2304.03198 (2023)
  25. Jocher, G.: YOLOv5 by Ultralytics. Zenodo (2020). https://doi.org/10.5281/zenodo.3908559, last accessed 2026/04/28
  26. Su, C., Zhu, L., Dai, W., Zhou, J., Wang, J., Mao, Y., Sun, J.: Nav-YOLO: A Lightweight and Efficient Object Detection Model for Real-Time Indoor Navigation on Mobile Platforms. ISPRS International Journal of Geo-Information 14(9), 364 (2025)
  27. El Hamdouni, S., Hdioud, B., El Fkihi, S.: Enhanced Lightweight Object Detection Model in Complex Scenes: An Improved YOLOv8n Approach. Information 16(10), 871 (2025)