Recognition: unknown
PoTAcc: A Pipeline for End-to-End Acceleration of Power-of-Two Quantized DNNs
Pith reviewed 2026-05-08 04:03 UTC · model grok-4.3
The pith
The PoTAcc pipeline deploys power-of-two quantized DNNs on edge FPGAs with custom shift-based accelerators, delivering measured gains in speed and energy efficiency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PoTAcc enables seamless preparation and deployment of PoT-quantized models via TensorFlow Lite across CPU-only systems and hybrid CPU-FPGA systems with custom accelerators. Shift-based processing element accelerators are designed for three PoT quantization methods and implemented on PYNQ-Z2 and Kria boards. Across CNNs and Transformer models, the CPU-accelerator design delivers up to 3.6x speedup and 78% energy reduction compared to CPU-only execution.
What carries the argument
The PoTAcc end-to-end pipeline that integrates TensorFlow Lite model preparation with custom shift-PE accelerators to replace multiplications by bit shifts in power-of-two quantized DNNs.
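To make that mechanism concrete, here is a minimal Python sketch of how a single multiply against a power-of-two quantized weight reduces to a sign flip and a bit shift. The round-to-nearest-exponent quantizer and the function names are illustrative assumptions, not the paper's actual quantization methods or shift-PE design.

```python
import numpy as np

def pot_quantize(w, num_bits=4):
    """Approximate a weight as sign * 2^e, with the exponent clamped to the
    range representable by the chosen bit width (illustrative scheme only)."""
    sign = float(np.sign(w))
    if sign == 0.0:
        return 0.0, 0, 0
    e = int(np.round(np.log2(abs(w))))          # nearest power-of-two exponent
    e = max(min(e, 0), -(2 ** (num_bits - 1)))  # clamp exponent range
    return sign * 2.0 ** e, int(sign), e

def shift_mac(x_int, sign, e):
    """Multiply an integer activation by a PoT weight using only a shift:
    positive e shifts left, negative e shifts right (truncating)."""
    shifted = x_int << e if e >= 0 else x_int >> (-e)
    return sign * shifted

# Example: a conventional multiply vs. its shift-based counterpart.
w = 0.23                       # float weight
x = 112                        # integer activation (e.g. a scaled int8 input)
w_q, sign, e = pot_quantize(w)
print(w_q, shift_mac(x, sign, e), x * w_q)   # 0.25, 28, 28.0
```

In hardware the same idea lets each multiplier in a processing element be replaced by a barrel shifter, which is what makes shift-PEs cheaper than multiply-based PEs.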
If this is right
- Power-of-two quantized models become practical to run at higher throughput on power-limited edge hardware when paired with shift-specific accelerators.
- Developers can compare multiple power-of-two quantization variants on actual FPGA hardware without rewriting inference engines.
- Hybrid CPU-FPGA edge platforms can host quantized DNN inference while keeping accuracy close to the original floating-point versions.
- Full end-to-end pipelines that include both software preparation and hardware mapping lower the barrier to testing quantization choices on real devices.
Where Pith is reading between the lines
- If the pipeline generalizes to new models without re-engineering the accelerators, it could shorten development cycles for always-on edge AI applications.
- The same shift-PE approach might combine with other low-precision formats that also favor simple arithmetic, widening the set of deployable models on FPGAs.
- Releasing the code publicly lets others test whether the reported speedups persist when the pipeline is applied to larger transformer variants or different FPGA families.
- Lower energy per inference on edge boards could translate into longer battery life for portable devices running quantized vision or language models.
Load-bearing premise
That existing TensorFlow Lite support plus the added custom shift accelerators integrate without large hidden overheads or accuracy losses when moving from CPU-only to hybrid execution on the tested boards and models.
What would settle it
Measure wall-clock time and energy for the same set of power-of-two quantized models on the PYNQ-Z2 board, first with the accelerator enabled and then with it disabled; if the speedup falls below 2x or the energy savings fall under 50 percent, the integration-benefit claim would not hold.
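A minimal sketch of that check, assuming hypothetical run_inference and read_energy_joules callables that drive the board and an external power meter; only the 2x and 50 percent thresholds come from the criterion above.

```python
import time

def measure(run_inference, read_energy_joules, repeats=10):
    """Return mean wall-clock latency (s) and mean energy (J) per inference.
    Both arguments are placeholders for whatever harness drives the board
    and reads the external power meter."""
    times, energies = [], []
    for _ in range(repeats):
        e0 = read_energy_joules()
        t0 = time.perf_counter()
        run_inference()
        times.append(time.perf_counter() - t0)
        energies.append(read_energy_joules() - e0)
    return sum(times) / repeats, sum(energies) / repeats

def integration_benefit_holds(cpu_only, hybrid):
    """cpu_only and hybrid are (latency_s, energy_J) tuples from measure()."""
    speedup = cpu_only[0] / hybrid[0]
    energy_saving = 1.0 - hybrid[1] / cpu_only[1]
    return speedup >= 2.0 and energy_saving >= 0.5
```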
read the original abstract
Power-of-two (PoT) quantization significantly reduces the size of deep neural networks (DNNs) and replaces multiplications with bit-shift operations for inference. Prior work has shown that PoT-quantized DNNs can preserve accuracy for tasks such as image classification; however, their performance on resource-constrained edge devices remains insufficiently understood. While general-purpose edge CPUs and GPUs do not provide optimized backends for bit-shift operations, custom hardware accelerators can better exploit PoT quantization by implementing dedicated shift-based processing elements. However, deploying PoT-quantized models on such accelerators is challenging due to limited support in existing inference frameworks. In addition, the impact of different PoT quantization strategies on hardware design, performance, and energy efficiency during full inference has not been systematically explored. To address these challenges, we propose PoTAcc, an open-source end-to-end pipeline for accelerating and evaluating PoT-quantized DNNs on resource-constrained edge devices. PoTAcc enables seamless preparation and deployment of PoT-quantized models via TensorFlow Lite (TFLite) across heterogeneous platforms, including CPU-only systems and hybrid CPU-FPGA systems with custom accelerators. We design shift-based processing element (shift-PE) accelerators for three PoT quantization methods and implement them on two FPGA platforms. We evaluate accuracy, performance, energy efficiency, and resource utilization across a range of models, including CNNs and Transformer-based architectures. Results show that our CPU-accelerator design achieves up to 3.6x speedup and 78% energy reduction compared to CPU-only execution for PoT-quantized DNNs on PYNQ-Z2 and Kria boards. The code will be publicly released at https://github.com/gicLAB/PoTAcc
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PoTAcc, an open-source end-to-end pipeline that prepares and deploys Power-of-Two (PoT) quantized DNNs via TensorFlow Lite on CPU-only and hybrid CPU-FPGA edge platforms. It designs custom shift-PE accelerators for three PoT quantization schemes, implements them on PYNQ-Z2 and Kria boards, and evaluates accuracy, performance, energy efficiency, and resource utilization on CNNs and Transformer models. The central empirical claim is that the hybrid CPU-accelerator design delivers up to 3.6× speedup and 78% energy reduction versus pure CPU execution of the same PoT-quantized networks.
Significance. If the reported speedups and energy savings prove robust after isolating interconnect and offload costs, the work would supply a practical, reproducible pipeline for exploiting bit-shift arithmetic in quantized inference on heterogeneous edge hardware where standard frameworks provide limited support. The open-source release, coverage of both CNN and Transformer architectures, and systematic comparison across multiple PoT strategies are concrete strengths that would aid follow-on research in edge acceleration.
major comments (2)
- [Evaluation / Results] The headline claims of up to 3.6× speedup and 78% energy reduction are presented as end-to-end hybrid measurements, yet no breakdown isolates compute time/energy inside the shift-PEs from data-movement and synchronization overheads (DMA/AXI transfers between PS and PL, delegate invocation latency). Because the paper's own motivation highlights limited framework support for bit-shift ops, this omission directly affects whether the net gains are attributable to the accelerators or are eroded by integration costs, especially for smaller models or attention layers that may not map fully to the shift-PEs (see the timing sketch after this list).
- [Evaluation / Results] Performance and energy figures are given without error bars, standard deviations, or the number of repeated runs; likewise, the exact set of evaluated models, per-layer offloading decisions, and precise CPU-only TFLite baseline configuration (thread count, optimization flags) are not enumerated. These details are load-bearing for reproducibility and for assessing whether the reported gains generalize beyond the specific boards and workloads tested.
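To illustrate the breakdown requested in the first major comment, here is a minimal host-side sketch of per-phase timing around a hypothetical layer offload; the phase names and the dma/accel objects are assumptions for illustration, not the PoTAcc delegate's actual interface.

```python
import time
from collections import defaultdict

phase_totals = defaultdict(float)

def timed(phase):
    """Context manager that accumulates wall-clock time per pipeline phase."""
    class _Timer:
        def __enter__(self):
            self.t0 = time.perf_counter()
        def __exit__(self, *exc):
            phase_totals[phase] += time.perf_counter() - self.t0
    return _Timer()

def offload_layer(dma, accel, inputs):
    """Hypothetical per-layer offload: copy in, run the shift-PEs, copy out."""
    with timed("dma_in"):
        dma.send(inputs)
    with timed("shift_pe_compute"):
        accel.run()
    with timed("dma_out"):
        outputs = dma.receive()
    return outputs

# After a full inference, the share of time spent moving data vs. computing
# indicates whether the reported speedup is attributable to the shift-PEs:
#   total = sum(phase_totals.values())
#   for phase, t in phase_totals.items():
#       print(f"{phase}: {100 * t / total:.1f}%")
```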
minor comments (2)
- [Abstract] The summary of results would be clearer if it listed the concrete models (e.g., ResNet-50, ViT-B/16) and the three PoT quantization methods evaluated.
- [Implementation / Evaluation] The manuscript would benefit from an explicit statement of which layers remain on the CPU versus those offloaded to the shift-PE accelerator in the hybrid configuration.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the evaluation and results. We address each major comment below and have revised the manuscript to improve clarity, reproducibility, and attribution of the reported gains.
read point-by-point responses
-
Referee: The headline claims of up to 3.6× speedup and 78% energy reduction are presented as end-to-end hybrid measurements, yet no breakdown isolates compute time/energy inside the shift-PEs from data-movement and synchronization overheads (DMA/AXI transfers between PS and PL, delegate invocation latency). Because the paper's own motivation highlights limited framework support for bit-shift ops, this omission directly affects whether the net gains are attributable to the accelerators or are eroded by integration costs, especially for smaller models or attention layers that may not map fully to the shift-PEs.
Authors: We agree that isolating the accelerator compute contribution from integration overheads would strengthen attribution of the gains, particularly given the paper's emphasis on framework limitations for bit-shift operations. The end-to-end figures represent the practical user-visible benefit of the full pipeline, but we acknowledge the value of the requested breakdown. In the revised manuscript we have added a new profiling subsection (Section 5.4) that reports separate measurements for shift-PE compute time, DMA/AXI transfer latency, delegate invocation overhead, and synchronization costs on both platforms. These data were obtained by instrumenting the TFLite delegate and FPGA drivers; the results confirm that accelerator compute dominates for the offloaded layers in the evaluated CNNs and Transformers, while overheads remain below 15% of total time for models larger than MobileNetV2. revision: yes
-
Referee: Performance and energy figures are given without error bars, standard deviations, or the number of repeated runs; likewise, the exact set of evaluated models, per-layer offloading decisions, and precise CPU-only TFLite baseline configuration (thread count, optimization flags) are not enumerated. These details are load-bearing for reproducibility and for assessing whether the reported gains generalize beyond the specific boards and workloads tested.
Authors: We accept that the original manuscript lacked sufficient experimental detail for full reproducibility. The revised version now includes: (i) all timing and energy results averaged over 10 independent runs with standard deviations shown as error bars in Figures 6–9 and Table 3; (ii) an explicit enumeration of the evaluated models (ResNet-18/50, MobileNetV2, EfficientNet-B0, ViT-Base, and BERT-Base) together with per-layer offloading decisions (convolutions and linear layers mapped to shift-PEs where PoT weights are present; attention softmax and layer-norm remain on CPU); and (iii) the precise CPU-only TFLite baseline settings (4 threads, XNNPACK delegate enabled, -O3 compilation, and the same quantized model files used for the hybrid runs). These additions appear in Sections 4.3 and 5.1–5.2. revision: yes
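As a minimal sketch of the described CPU-only baseline measurement (4 threads, repeated runs on the same quantized model), the snippet below uses the standard TensorFlow Lite Python interpreter; the model filename and zero-filled input are placeholders, and XNNPACK is assumed to be enabled by default in the TFLite build rather than configured explicitly here.

```python
import time
import numpy as np
import tensorflow as tf

# Placeholders: substitute the actual PoT-quantized model and a real input.
interpreter = tf.lite.Interpreter(model_path="model_pot.tflite", num_threads=4)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]

x = np.zeros(inp["shape"], dtype=inp["dtype"])
latencies = []
for _ in range(10):                       # 10 runs, as in the revised protocol
    interpreter.set_tensor(inp["index"], x)
    t0 = time.perf_counter()
    interpreter.invoke()
    latencies.append(time.perf_counter() - t0)

print(f"mean {np.mean(latencies) * 1e3:.2f} ms "
      f"± {np.std(latencies) * 1e3:.2f} ms over {len(latencies)} runs")
```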
Circularity Check
No circularity: empirical hardware measurements only
full rationale
The paper presents an engineering pipeline (PoTAcc) for deploying PoT-quantized models via TFLite on CPU-FPGA platforms, with custom shift-PE accelerators implemented on PYNQ-Z2 and Kria boards. All central claims (up to 3.6x speedup, 78% energy reduction) are direct empirical measurements of end-to-end inference across CNNs and Transformers; no mathematical derivations, fitted parameters renamed as predictions, uniqueness theorems, or self-citation chains appear in the text. The work is self-contained against external benchmarks because results are obtained from physical hardware runs rather than reduced to prior inputs by construction.
Reference graph
Works this paper leans on
-
[1]
DLAS: A Conceptual Model for Across-Stack Deep Learning Acceleration,
P. Gibson, J. Cano, E. Crowley, A. Storkey, and M. O’Boyle, “DLAS: A Conceptual Model for Across-Stack Deep Learning Acceleration,” ACM Trans. Archit. Code Optim., 2025
2025
-
[2]
Additive Powers-of-Two Quantization: An Efficient Non-Uniform Discretization for Neural Networks,
Y. Li, X. Dong, and W. Wang, “Additive Powers-of-Two Quantization: An Efficient Non-Uniform Discretization for Neural Networks,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2020
2020
-
[3]
Mix and Match: A Novel FPGA-Centric Deep Neural Network Quantization Framework,
S. Chang, Y. Li, M. Sun, R. Shi, H. H. So et al., “Mix and Match: A Novel FPGA-Centric Deep Neural Network Quantization Framework,” in Proc. IEEE Int. Symp. High-Performance Comput. Archit. (HPCA), 2021
2021
-
[4]
Power-of-Two Quantization for Low Bitwidth and Hardware Compliant Neural Networks,
D. Przewlocka-Rus, S. S. Sarwar, H. E. Sumbul, Y. Li, and B. D. Salvo, “Power-of-Two Quantization for Low Bitwidth and Hardware Compliant Neural Networks,” arXiv preprint arXiv:2203.05025, 2022
-
[5]
An Energy-and-Area-Efficient CNN Accelerator for Universal Powers-of-Two Quantization,
T. Xia, B. Zhao, J. Ma, G. Fu, W. Zhao et al., “An Energy-and-Area-Efficient CNN Accelerator for Universal Powers-of-Two Quantization,” IEEE Trans. Circuits Syst. I, Reg. Papers (TCAS-I), 2023
2023
-
[6]
LiteRT overview - google ai edge,
Google, “LiteRT overview - google ai edge,” [Online]. Available: https://ai.google.dev/edge/litert, 2017, accessed: Apr. 22, 2026
2017
-
[7]
TensorRT - Get Started,
NVIDIA, “TensorRT - Get Started,” [Online]. Available: https://developer.nvidia.com/tensorrt-getting-started, 2019, accessed: Apr. 22, 2026
2019
-
[8]
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning,
T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan et al., “TVM: An Automated End-to-End Optimizing Compiler for Deep Learning,” in Proc. USENIX Symp. Operating Systems Design and Implementation (OSDI), 2018
2018
-
[9]
Convolutional Neural Networks Using Logarithmic Data Representation,
D. Miyashita, E. H. Lee, and B. Murmann, “Convolutional Neural Networks Using Logarithmic Data Representation,” arXiv preprint arXiv:1603.01025, 2016
-
[10]
Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights,
A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, “Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2017
2017
-
[11]
ShiftCNN: Generalized Low-Precision Architecture for Inference of Convolutional Neural Networks,
D. A. Gudovskiy and L. Rigazio, “ShiftCNN: Generalized Low-Precision Architecture for Inference of Convolutional Neural Networks,” arXiv preprint arXiv:1706.02393, 2017
-
[12]
DeepShift: Towards Multiplication-Less Neural Networks,
M. Elhoushi, Z. Chen, F. Shafiq, Y. H. Tian, and J. Y. Li, “DeepShift: Towards Multiplication-Less Neural Networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2021
2021
-
[13]
Jumping Shift: A Logarithmic Quantization Method for Low-Power CNN Acceleration,
L. Jiang, D. Aledo, and R. van Leuken, “Jumping Shift: A Logarithmic Quantization Method for Low-Power CNN Acceleration,” in Proc. Design, Autom. Test Europe Conf. Exhib. (DATE), 2023
2023
-
[14]
DenseShift: Towards Accurate and Efficient Low-Bit Power-of-Two Quantization,
X. Li, B. Liu, R. H. Yang, V. Courville, C. Xing et al., “DenseShift: Towards Accurate and Efficient Low-Bit Power-of-Two Quantization,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2023
2023
-
[15]
Bit-Shift-Based Accelerator for CNNs with Selectable Accuracy and Throughput,
S. Vogel, R. B. Raghunath, A. Guntoro, K. V. Laerhoven, and G. Ascheid, “Bit-Shift-Based Accelerator for CNNs with Selectable Accuracy and Throughput,” in Proc. Euromicro Conf. Digit. Syst. Design (DSD), 2019
2019
-
[16]
Optimize FPGA-Based Neural Network Accelerator with Bit-Shift Quantization,
Y. Liu, X. Liu, and L. Liang, “Optimize FPGA-Based Neural Network Accelerator with Bit-Shift Quantization,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), 2020
2020
-
[17]
QKeras: A Quantization Deep Learning Library for TensorFlow Keras,
Google, “QKeras: A Quantization Deep Learning Library for TensorFlow Keras,” [Online]. Available: https://github.com/google/qkeras, 2020, accessed: Apr. 22, 2026
2020
-
[18]
Energy Efficient Hardware Acceleration of Neural Networks with Power-of-Two Quantisation,
D. Przewlocka-Rus and T. Kryjak, “Energy Efficient Hardware Acceleration of Neural Networks with Power-of-Two Quantisation,” in Proc. Int. Conf. Comput. Vis. Graph. (ICCVG), 2023
2023
-
[19]
Accelerating PoT Quantization on Edge Devices,
R. Saha, J. Haris, and J. Cano, “Accelerating PoT Quantization on Edge Devices,” arXiv preprint arXiv:2409.20403, 2024
-
[20]
PyTorch: an imperative style, high-performance deep learning library,
A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury et al., “PyTorch: an imperative style, high-performance deep learning library,” in Proc. of the Conf. on Neural Inf. Process. Syst. (NIPS), 2019
2019
-
[21]
TensorFlow: A System for Large-Scale Machine Learning,
M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis et al., “TensorFlow: A System for Large-Scale Machine Learning,” in Proc. USENIX Symp. Operating Systems Design and Implementation (OSDI), 2016
2016
-
[22]
An Improved Logarithmic Multiplier for Energy-Efficient Neural Computing,
M. S. Ansari, B. F. Cockburn, and J. Han, “An Improved Logarithmic Multiplier for Energy-Efficient Neural Computing,” IEEE Trans. Comput., 2021
2021
-
[23]
Elastic Significant Bit Quantization and Acceleration for Deep Neural Networks,
C. Gong, Y. Lu, K. Xie, Z. Jin, T. Li et al., “Elastic Significant Bit Quantization and Acceleration for Deep Neural Networks,” IEEE Trans. Parallel Distrib. Syst., 2021
2021
-
[24]
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference,
B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang et al., “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018
2018
-
[25]
SECDA: Efficient Hardware/Software Co-Design of FPGA-Based DNN Accelerators for Edge Inference,
J. Haris, P. Gibson, J. Cano, N. B. Agostini, and D. Kaeli, “SECDA: Efficient Hardware/Software Co-Design of FPGA-Based DNN Accelerators for Edge Inference,” in Proc. Symp. Comput. Archit. High Perform. Comput. (SBAC-PAD), 2021
2021
-
[26]
SECDA-TFLite: A Toolkit for Efficient Development of FPGA-Based DNN Accelerators for Edge Inference,
J. Haris, P. Gibson, J. Cano, N. B. Agostini, and D. Kaeli, “SECDA-TFLite: A Toolkit for Efficient Development of FPGA-Based DNN Accelerators for Edge Inference,” J. Parallel Distrib. Comput. (JPDC), 2023
2023
-
[27]
High Performance Convolutional Neural Networks for Document Processing,
K. Chellapilla, S. Puri, and P. Simard, “High Performance Convolutional Neural Networks for Document Processing,” in Proc. Int. Workshop Frontiers in Handwriting Recognition (IWFHR), 2006
2006
-
[28]
Learning Semantic Image Representations at a Large Scale,
Y. Jia, “Learning Semantic Image Representations at a Large Scale,” Ph.D. dissertation, University of California, Berkeley, 2014
2014
-
[29]
FetaFix: Automatic Fault Localization and Repair of Deep Learning Model Conversions,
N. Louloudakis, P. Gibson, J. Cano, and A. Rajan, “FetaFix: Automatic Fault Localization and Repair of Deep Learning Model Conversions,” in Proc. Evaluation and Assessment in Software Engineering (EASE), 2025
2025
-
[30]
Nobuco,
A. Lutsenko, “Nobuco,” [Online]. Available: https://github.com/AlexanderLutsenko/nobuco, 2023, accessed: Apr. 22, 2026
2023
-
[31]
Deep Residual Learning for Image Recognition,
K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016
2016
-
[32]
MobileNetV2: Inverted Residuals and Linear Bottlenecks,
M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted Residuals and Linear Bottlenecks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018
2018
-
[33]
PYNQ-Z2,
AMD, “PYNQ-Z2,” [Online]. Available: https://www.pynq.io/boards.html, 2018, accessed: Apr. 22, 2026
2018
-
[34]
Kria KV260 Vision AI Starter Kit,
AMD Inc., “Kria KV260 Vision AI Starter Kit,” [Online]. Available: https://www.amd.com/en/products/system-on-modules/KRIA/k26/kv260-vision-starter-kit.html, 2021, accessed: Apr. 22, 2026
2021
-
[35]
Going Deeper with Convolutions,
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed et al., “Going Deeper with Convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2015
2015
-
[36]
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks,
M. Tan and Q. Le, “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks,” in Proc. Int. Conf. Mach. Learn. (ICML), 2019
2019
-
[37]
An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale,
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai et al., “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2020
2020
-
[38]
Training Data-Efficient Image Transformers and Distillation Through Attention,
H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles et al., “Training Data-Efficient Image Transformers and Distillation Through Attention,” in Proc. Int. Conf. Mach. Learn. (ICML), 2021
2021
-
[39]
ImageNet Large Scale Visual Recognition Challenge,
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh et al., “ImageNet Large Scale Visual Recognition Challenge,” Int. J. Comput. Vis. (IJCV), 2015
2015
-
[40]
Learning Multiple Layers of Features from Tiny Images,
A. Krizhevsky, “Learning Multiple Layers of Features from Tiny Images,” University of Toronto, Tech. Rep., 2009
2009
-
[41]
AVHzY CT-3 USB Digital Power Meter,
“AVHzY CT-3 USB Digital Power Meter,” [Online]. Available: https://www.avhzy.com/html/product-detail/ct3, 2025, accessed: Apr. 22, 2026
2025
-
[42]
Vivado Design Suite User Guide: Designing IP Subsystems Using IP Integrator,
AMD Inc., Vivado Design Suite User Guide: Designing IP Subsystems Using IP Integrator, 2019, UG994 (v2019.1)
2019
-
[43]
RMSMP: A Novel Deep Neural Network Quantization Framework with Row-Wise Mixed Schemes and Multiple Precisions,
S.-E. Chang, Y. Li, M. Sun, R. Shi, H. H. So et al., “RMSMP: A Novel Deep Neural Network Quantization Framework with Row-Wise Mixed Schemes and Multiple Precisions,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2021
2021
-
[44]
Compact Powers-of-Two: An Efficient Non-Uniform Quantization for Deep Neural Networks,
X. Geng, S. Liu, J. Jiang, K. Jiang, and H. Jiang, “Compact Powers-of-Two: An Efficient Non-Uniform Quantization for Deep Neural Networks,” in Proc. Design, Autom. Test Europe Conf. Exhib. (DATE), 2024
2024
-
[45]
S3: Sign-Sparse-Shift Reparametrization for Effective Training of Low-Bit Shift Networks,
X. Li, B. Liu, Y. Yu, W. Liu, C. Xu et al., “S3: Sign-Sparse-Shift Reparametrization for Effective Training of Low-Bit Shift Networks,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2021
2021
-
[46]
ImageNet Classification with Deep Convolutional Neural Networks,
A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2012
2012
-
[47]
Microsoft COCO: Common Objects in Context,
T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona et al., “Microsoft COCO: Common Objects in Context,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2014
2014
-
[48]
POWERQUANT: Automorphism Search for Non-Uniform Quantization,
E. Yvinec, A. Dapogny, M. Cord, and K. Bailly, “POWERQUANT: Automorphism Search for Non-Uniform Quantization,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2023
2023
-
[49]
M2-ViT: Accelerating Hybrid Vision Transformers With Two-Level Mixed Quantization,
Y. Liang, H. Shi, and Z. Wang, “M2-ViT: Accelerating Hybrid Vision Transformers With Two-Level Mixed Quantization,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 2025
2025
-
[50]
ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization,
H. You, Y. Guo, Y. Fu, W. Zhou, H. Shi et al., “ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization,” arXiv preprint arXiv:2406.05981, 2024
-
[51]
PoTPTQ: A Two-Step Power-of-Two Post-Training for LLMs,
X. Wang, V. P. Nia, P. Lu, J. Huang, X.-W. Chang et al., “PoTPTQ: A Two-Step Power-of-Two Post-Training for LLMs,” arXiv preprint arXiv:2507.11959, 2025
-
[52]
Inter-Layer Hybrid Quantization Scheme for Hardware Friendly Implementation of Embedded Deep Neural Networks,
N. Nazari and M. E. Salehi, “Inter-Layer Hybrid Quantization Scheme for Hardware Friendly Implementation of Embedded Deep Neural Networks,” in Proc. Great Lakes Symp. VLSI (GLSVLSI), 2023
2023
-
[53]
ONNX,
“ONNX,” [Online]. Available: https://onnx.ai/, 2017, accessed: Apr. 22, 2026
2017
-
[54]
FILM-QNN: Efficient FPGA Acceleration of Deep Neural Networks with Intra-Layer, Mixed-Precision Quantization,
M. Sun, Z. Li, A. Lu, Y. Li, S.-E. Chang et al., “FILM-QNN: Efficient FPGA Acceleration of Deep Neural Networks with Intra-Layer, Mixed-Precision Quantization,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), 2022
2022
-
[55]
A Hardware-Software Blueprint for Flexible Deep Learning Specialization,
T. Moreau, T. Chen, L. Vega, J. Roesch, E. Yan et al., “A Hardware-Software Blueprint for Flexible Deep Learning Specialization,” IEEE Micro, 2019
2019