pith. machine review for the scientific record.

arxiv: 2605.07321 · v1 · submitted 2026-05-08 · 💻 cs.AR · cs.DC · cs.NA · eess.IV · math.NA

Recognition: no theorem link

TREA: Low-precision Time-Multiplexed, Resource-Efficient Edge Accelerator for Object Detection and Classification

Mukul Lokhande, Omkar Kokane, Ratko Pilipovic, Santosh Kumar Vishvakarma, Vijay Pratap Sharma

Pith reviewed 2026-05-11 00:56 UTC · model grok-4.3

classification 💻 cs.AR · cs.DC · cs.NA · eess.IV · math.NA
keywords edge accelerator · object detection · low-precision · structured pruning · time-multiplexing · CORDIC · SIMD MAC

The pith

TREA uses dual-precision time-multiplexed MACs and structured pruning to cut kernel-level latency by up to 9x for edge object detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TREA as a hardware accelerator optimized for running object detection and classification on resource-constrained edge devices. It achieves efficiency through a dual 4/8-bit SIMD multiply-accumulate unit that reuses hardware via time-multiplexing, combined with a pruning technique that maintains full utilization at high sparsity levels. A reconfigurable activation unit further supports multiple functions with minimal extra hardware. If these techniques work as described, they would enable faster and more energy-efficient real-time vision processing on small chips compared to standard fixed-precision designs.
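The dual-precision idea can be made concrete with a behavioural sketch (ours, not the paper's RTL): one datapath "cycle" consumes either a single 8-bit operand pair or four packed 4-bit pairs, so peak MAC throughput quadruples in 4-bit mode without duplicating the unit.

```python
def dq_mac_cycle(acc, a, b, mode):
    """One cycle of a dual-precision MAC (behavioural stand-in, not RTL).

    mode 'fxp8': a, b are single 8-bit values    -> 1 MAC this cycle.
    mode 'fxp4': a, b are 4-tuples of 4-bit values -> 4 MACs this cycle.
    """
    if mode == "fxp8":
        return acc + a * b
    # four lanes retired in the same cycle
    return acc + sum(ai * bi for ai, bi in zip(a, b))

# 4x more multiply-accumulates per cycle in 4-bit mode:
assert dq_mac_cycle(0, 100, 3, "fxp8") == 300
assert dq_mac_cycle(0, (1, 2, 3, 4), (5, 6, 7, 8), "fxp4") == 70
```

This only models throughput; the paper's actual unit also eliminates the explicit multiplier via shift-and-add and narrows the accumulator.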

Core claim

TREA combines a DQ-MAC unit using most-significant-digit-first shift-and-add for dual-precision operations, SHARP pruning for near-50 percent structured sparsity, and an RQ-NAF CORDIC core in a reconfigurable 1D array architecture, allowing a 3x3 convolution in 1 cycle at 4-bit precision versus 9 cycles at 8-bit.

What carries the argument

The DQ-MAC SIMD unit with MSDF computation and run-time bit truncation that performs 4x 4-bit or 1x 8-bit operations per cycle while reducing multiplier and accumulator overhead.
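The shift-and-add basis of the DQ-MAC can be sketched in software. This is generic most-significant-bit-first multiplication (the textbook form of MSDF arithmetic), not the paper's exact datapath or its run-time truncation scheme: the partial product is built by scanning the multiplier from its top bit down, one shift-and-conditional-add per step, so no multiplier array is needed.

```python
def msdf_shift_add_mul(a: int, b: int, bits: int) -> int:
    """Multiply unsigned a*b by adding shifted copies of b, scanning a's
    bits most-significant-first (one bit per step, no multiplier array)."""
    acc = 0
    for i in range(bits - 1, -1, -1):   # MSB first
        acc <<= 1                       # shift the running partial product
        if (a >> i) & 1:
            acc += b                    # conditionally add the multiplicand
    return acc

assert msdf_shift_add_mul(13, 11, 4) == 143    # 4-bit operands
assert msdf_shift_add_mul(200, 37, 8) == 7400  # 8-bit operands
```

Because the most significant digits are produced first, a hardware version can truncate low-order work at run time, which is the overhead reduction the unit claims.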

If this is right

  • A 3x3 convolution kernel completes in 1 cycle in 4-bit mode instead of 9 cycles in 8-bit mode.
  • A 5x5 kernel takes 3 cycles versus 25, for up to 9x kernel-level latency reduction.
  • The design shows better latency, hardware utilization, and energy efficiency than fixed-precision non-reconfigurable accelerators.
  • Layer-wise hardware reuse in a 1D array of 100 units reduces overall area and control complexity.
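The cycle counts above are mutually consistent under one plausible mapping, which is our assumption rather than something the review spells out: SHARP keeps a lane-aligned subset of roughly half the kernel weights (4 of 9 for a 3x3, 12 of 25 for a 5x5), and FxP4 mode retires 4 MACs per cycle.

```python
import math

def cycles(k: int, lanes: int, prune: bool = False) -> int:
    """Cycles to compute a k x k kernel's MACs.

    lanes: MACs retired per cycle (1 in FxP8 mode, 4 in FxP4 mode).
    prune: keep only a lane-aligned subset of at most half the weights,
           mimicking what a utilization-preserving ~50% structured
           pruning would have to do (our assumption, not the paper's
           stated schedule).
    """
    macs = k * k
    if prune:
        macs = (macs // 2) // lanes * lanes  # lane-aligned, <= half
    return math.ceil(macs / lanes)

assert cycles(3, lanes=1) == 9                  # FxP8 dense: 9 cycles
assert cycles(5, lanes=1) == 25                 # FxP8 dense: 25 cycles
assert cycles(3, lanes=4, prune=True) == 1      # FxP4 + SHARP: 1 cycle
assert cycles(5, lanes=4, prune=True) == 3      # FxP4 + SHARP: 3 cycles
```

Under this reading the 9x figure comes jointly from 4x SIMD parallelism and roughly 2x pruning, not from either alone.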

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may generalize to other neural network layers or vision models if the pruning strategy adapts well.
  • Time-multiplexing could allow smaller chip areas for the same performance in multi-task edge devices.
  • Combining low precision with CORDIC activations might reduce power further in always-on vision sensors.

Load-bearing premise

The structured hardware-aware reductive pruning maintains acceptable accuracy at near 50 percent sparsity while preserving full MAC utilization in all layers.
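What such a utilization-preserving scheme could look like in miniature (a magnitude-based 1-of-2 stand-in of ours; SHARP's actual criterion is not detailed in the review): zeroing exactly one weight in every pair yields exactly 50 percent sparsity, and because the survivors are evenly spaced they pack into SIMD lanes with no idle slots.

```python
def prune_half(weights):
    """Zero the smaller-magnitude weight in each consecutive pair,
    giving exactly 50% structured sparsity with lane-friendly layout.
    Illustrative stand-in for SHARP, not the paper's algorithm."""
    out = list(weights)
    for i in range(0, len(out) - 1, 2):
        j = i if abs(out[i]) < abs(out[i + 1]) else i + 1
        out[j] = 0  # drop the less important weight of the pair
    return out

assert prune_half([3, -1, 0.5, 2]) == [3, 0, 0, 2]
```

Whether accuracy survives this regularity constraint on real detection models is exactly the open question the premise carries.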

What would settle it

Measuring the accuracy of object detection on standard benchmarks like COCO when using the SHARP-pruned 4-bit model on the TREA accelerator versus the unpruned 8-bit baseline.

Figures

Figures reproduced from arXiv: 2605.07321 by Mukul Lokhande, Omkar Kokane, Ratko Pilipovic, Santosh Kumar Vishvakarma, Vijay Pratap Sharma.

Figure 1: Example multiplication for the proposed algorithm with a signed 8- …
Figure 3: The error evaluation distribution across various iterations for Pareto …
Figure 4: Enhanced SIMD DQ-MAC with 4/8-bit quantization reduces precision …
Figure 5: Block level of architecture showing data flow for runtime-configurable …
Figure 6: Micro-architecture of Time-Multiplexed Resource-Efficient Edge-AI …
Figure 7: Detailed methodology/workflow illustrating Algorithm–Hardware co-design followed for the empirical evaluation, incorporating Dual-precision …
Figure 10: Implications of RQ-NAF on AI Workloads. RQ-NAF core supports runtime-selectable activation functions (ReLU, Sigmoid, and Tanh), assessed under three quantisation schemes: 4-bit uniform, 8-bit uniform, and the proposed dual-precision configuration.
Figure 9: Software Performance comparison for different generations of object …
Figure 12: ASIC resource-comparison with prior SIMD MAC units [4], [47]– …
Figure 11: ASIC resource-comparison with prior approximate MAC units, data …
Figure 13: Performance comparison of FPGA metrics for TREA accelerator …
Original abstract

This work presents TREA, a low-precision time-multiplexed and resource-efficient edge-AI accelerator for object detection and classification, targeting stringent area-power-latency constraints of edge vision platforms. The proposed architecture integrates a dual-precision (4/8-bit) SIMD multiply-accumulate (DQ-MAC) unit based on most-significant-digit-first (MSDF) shift-and-add computation with run-time bit truncation, eliminating conventional multiplier overhead and reducing accumulator bit-width. The DQ-MAC supports 4x FxP4 or 1x FxP8 operations per cycle, achieving up to 4x throughput improvement without hardware duplication. A structured hardware-aware reductive pruning (SHARP) strategy is co-designed with the SIMD datapath, enabling near 50% structured sparsity while maintaining full MAC utilization. This allows a 3x3 convolution kernel to be computed in 1 cycle in FxP4 mode compared to 9 cycles in FxP8, and a 5x5 kernel in 3 cycles versus 25 cycles, yielding up to 9x latency reduction at the kernel level. The accelerator further incorporates a reconfigurable CORDIC-based nonlinear activation function (RQ-NAF) core with a 9-stage pipeline, supporting Sigmoid, Tanh, and ReLU at one output per cycle after pipeline fill, while enabling (N-1) hardware reuse through time-multiplexing. The complete TREA architecture employs a 1D array of 100 SIMD DQ-MAC units with layer-wise hardware reuse, significantly reducing area and control complexity. Experimental results demonstrate substantial improvements in latency, hardware utilization, and energy efficiency compared to conventional fixed-precision and non-reconfigurable accelerators, validating TREA as an effective solution for real-time edge vision workloads.
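The CORDIC basis of the RQ-NAF core can be sketched in software. Hyperbolic-mode CORDIC drives the angle residual to zero with shift-add-style updates, producing sinh and cosh scaled by a common gain, so tanh falls out as a ratio and Sigmoid follows from tanh. This is a generic floating-point model of the textbook algorithm, not the paper's fixed-point 9-stage pipeline.

```python
import math

def cordic_tanh(z: float, n: int = 16) -> float:
    """Hyperbolic rotation-mode CORDIC. After n iterations, x and y hold
    cosh(z) and sinh(z) up to a shared gain, so tanh(z) = y / x.
    Iterations 4, 13, 40, ... must be repeated for convergence."""
    x, y = 1.0, 0.0
    i, k = 1, 4
    while i <= n:
        for _ in range(2 if i == k else 1):     # repeat i = 4, 13, ...
            d = 1.0 if z >= 0 else -1.0         # rotate toward z = 0
            x, y, z = (x + d * y * 2**-i,
                       y + d * x * 2**-i,
                       z - d * math.atanh(2**-i))
        if i == k:
            k = 3 * k + 1
        i += 1
    return y / x

assert abs(cordic_tanh(0.5) - math.tanh(0.5)) < 1e-3
# Sigmoid from the same core: sigmoid(t) = 0.5 * (1 + tanh(t / 2))
assert abs(0.5 * (1 + cordic_tanh(0.4)) - 1 / (1 + math.exp(-0.8))) < 1e-3
```

A pipelined hardware version emits one result per cycle after the pipeline fills, which is how one core can serve Sigmoid, Tanh, and ReLU through time-multiplexed reuse.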

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes TREA, a low-precision time-multiplexed edge-AI accelerator for object detection and classification. It integrates a dual-precision (4/8-bit) SIMD MAC unit (DQ-MAC) based on MSDF shift-and-add with run-time truncation, a structured hardware-aware reductive pruning (SHARP) strategy enabling ~50% structured sparsity while preserving full MAC utilization in the SIMD array, and a reconfigurable CORDIC-based nonlinear activation (RQ-NAF) core. The architecture uses a 1D array of 100 DQ-MAC units with layer-wise reuse and claims up to 9x kernel-level latency reduction (e.g., 3x3 conv in 1 cycle at FxP4 vs. 9 at FxP8) plus substantial gains in latency, utilization, and energy efficiency over fixed-precision and non-reconfigurable baselines, validated by experimental results.

Significance. If the co-design of SHARP with the DQ-MAC datapath successfully maintains detection accuracy at the reported sparsity levels while delivering the hardware gains, the work would offer a concrete example of hardware-aware structured pruning enabling efficient low-precision SIMD operation for edge vision. The explicit mapping of pruning to preserve full MAC utilization across kernel sizes and the time-multiplexed RQ-NAF reuse are strengths that could inform similar co-design efforts.

major comments (2)
  1. [Abstract and Experimental Results] The central claim that TREA is 'an effective solution for real-time edge vision workloads' rests on experimental results showing latency/utilization/energy gains while remaining effective for detection. However, no quantitative accuracy metrics (mAP, top-1 accuracy, or deltas vs. dense/unstructured baselines) are reported for SHARP at near-50% structured sparsity on any detection dataset. This is load-bearing because the performance claims are explicitly tied to SHARP enabling full MAC utilization without unacceptable accuracy loss (Abstract; Experimental Results section).
  2. [Abstract and Experimental Results] The reported kernel-level latency reductions (3x3 in 1 cycle at FxP4, 5x5 in 3 cycles) and overall 'substantial improvements' lack any baseline accelerator details, dataset sizes, error bars, or layer-wise utilization numbers. Without these, the magnitude and reproducibility of the gains cannot be assessed (Abstract; Experimental Results section).
minor comments (1)
  1. [Abstract] The abstract introduces multiple acronyms (DQ-MAC, SHARP, RQ-NAF) in a single paragraph; a brief parenthetical expansion or table of acronyms would improve readability for readers unfamiliar with the co-design terms.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments and the recommendation for major revision. We address each major comment below, agreeing that additional details are needed to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract and Experimental Results] The central claim that TREA is 'an effective solution for real-time edge vision workloads' rests on experimental results showing latency/utilization/energy gains while remaining effective for detection. However, no quantitative accuracy metrics (mAP, top-1 accuracy, or deltas vs. dense/unstructured baselines) are reported for SHARP at near-50% structured sparsity on any detection dataset. This is load-bearing because the performance claims are explicitly tied to SHARP enabling full MAC utilization without unacceptable accuracy loss (Abstract; Experimental Results section).

    Authors: We agree with the referee that the absence of quantitative accuracy metrics for SHARP pruning is a significant omission, as the claims rely on maintaining detection accuracy at high structured sparsity levels. The manuscript currently emphasizes hardware efficiency gains but does not report specific mAP values or comparisons to dense and unstructured pruning baselines on detection datasets. To address this, we will revise the Experimental Results section to include accuracy results on a detection dataset (e.g., PASCAL VOC or COCO), showing mAP for the baseline model, after SHARP pruning at approximately 50% sparsity, and any deltas. This will validate that the structured pruning preserves accuracy sufficiently for the hardware benefits to be meaningful. We will also discuss the hardware-aware nature of SHARP that enables this balance. revision: yes

  2. Referee: [Abstract and Experimental Results] The reported kernel-level latency reductions (3x3 in 1 cycle at FxP4, 5x5 in 3 cycles) and overall 'substantial improvements' lack any baseline accelerator details, dataset sizes, error bars, or layer-wise utilization numbers. Without these, the magnitude and reproducibility of the gains cannot be assessed (Abstract; Experimental Results section).

    Authors: We acknowledge that providing more context on the baselines and evaluation setup is necessary for assessing the reported gains. The kernel-level latency reductions are calculated based on the time-multiplexed operation of the DQ-MAC array with SHARP pruning allowing full utilization (e.g., mapping 9 MACs of a 3x3 kernel to 1 cycle in 4-bit mode via 4x parallelism and pruning). However, the manuscript does not detail the exact baseline accelerators, the datasets used for timing/energy measurements, error bars, or per-layer utilization statistics. In the revised version, we will add these details: descriptions of comparison baselines (such as fixed 8-bit SIMD without reconfiguration), the input sizes and datasets (e.g., ImageNet for classification, COCO for detection), any statistical measures, and layer-wise utilization figures for the 100-unit array. This will improve the reproducibility and clarity of the experimental claims. revision: yes

Circularity Check

0 steps flagged

No circularity in TREA architecture proposal

full rationale

The paper is an engineering hardware design contribution describing DQ-MAC units, SHARP co-designed pruning, RQ-NAF activation, and a 1D SIMD array, with all performance claims (latency, utilization, energy) resting on explicit hardware operation counts and experimental measurements rather than any derivation, equation, or prediction that reduces to fitted parameters or self-referential definitions. No self-citations, ansatzes, or uniqueness theorems are invoked in the provided text to support load-bearing steps; the central validation is external to the design description itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The design rests on standard digital hardware assumptions rather than new physical postulates; several engineering components are introduced but function as implementation choices rather than independent entities requiring external validation.

axioms (1)
  • standard math Standard assumptions of synchronous digital design, including no timing violations in the 9-stage RQ-NAF pipeline and full MAC utilization under structured sparsity.
    Implicit in all hardware accelerator proposals; invoked when claiming cycle counts and utilization.
invented entities (3)
  • DQ-MAC unit no independent evidence
    purpose: Dual-precision SIMD multiply-accumulate engine using MSDF shift-and-add with run-time truncation
    Newly proposed datapath component enabling 4x FxP4 or 1x FxP8 per cycle.
  • SHARP pruning strategy no independent evidence
    purpose: Hardware-aware reductive pruning co-designed to maintain full utilization at ~50% sparsity
    Co-design element claimed to enable 1-cycle 3x3 and 3-cycle 5x5 kernels in FxP4.
  • RQ-NAF core no independent evidence
    purpose: Reconfigurable CORDIC-based nonlinear activation supporting Sigmoid/Tanh/ReLU with time-multiplexed reuse
    Proposed activation unit with 9-stage pipeline and (N-1) hardware reuse.

pith-pipeline@v0.9.0 · 5661 in / 1627 out tokens · 60410 ms · 2026-05-11T00:56:15.491048+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

62 extracted references · 1 canonical work page · 1 internal anchor

  1. [1]

    Scaling vision transformers,

X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer, “Scaling vision transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12104–12113, 2022

  2. [2]

Learning both weights and connections for efficient neural network,

S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” Advances in Neural Information Processing Systems, vol. 28, 2015

  3. [3]

    Towards Energy-Efficiency by Navigating the Trilemma of Energy, Latency, and Accuracy,

B. Tian, Y. Pang, M. Huzaifa, S. Wang, and S. V. Adve, “Towards Energy-Efficiency by Navigating the Trilemma of Energy, Latency, and Accuracy,” IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp. 913–922, 2024

  4. [4]

    Flex-PE: Flexible and SIMD Multiprecision Processing Element for AI Workloads,

M. Lokhande, G. Raut, and S. K. Vishvakarma, “Flex-PE: Flexible and SIMD Multiprecision Processing Element for AI Workloads,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 33, no. 6, pp. 1610–1623, 2025

  5. [5]

Falcon: A Fused-Layer Accelerator With Layer-Wise Hybrid Inference Flow for Computational Imaging CNNs,

Y.-T. Chen, Y.-T. Chiu, H.-J. Tu, and C.-T. Huang, “Falcon: A Fused-Layer Accelerator With Layer-Wise Hybrid Inference Flow for Computational Imaging CNNs,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 33, no. 3, pp. 720–732, 2025

  6. [6]

    XR-NPE: High-Throughput Mixed-precision SIMD Neural Processing Engine for Extended Reality Perception Workloads,

T. Chaudhari, T. Dewangan, M. Lokhande, S. K. Vishvakarma, et al., “XR-NPE: High-Throughput Mixed-precision SIMD Neural Processing Engine for Extended Reality Perception Workloads,” 39th International Conference on VLSI Design and 25th International Conference on Embedded Systems, 2026

  7. [7]

    An Efficient Hardware Accelerator for Structured Sparse Convolutional Neural Networks on FPGAs,

C. Zhu, K. Huang, S. Yang, Z. Zhu, H. Zhang, and H. Shen, “An Efficient Hardware Accelerator for Structured Sparse Convolutional Neural Networks on FPGAs,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 28, pp. 1953–1965, Sept. 2020

  8. [8]

    Haq: Hardware-aware automated quantization with mixed precision,

K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han, “Haq: Hardware-aware automated quantization with mixed precision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8612–8620, 2019

  9. [9]

    Qserve: W4a8kv4 quantization and system co-design for efficient llm serving,

Y. Lin, H. Tang, S. Yang, Z. Zhang, G. Xiao, C. Gan, and S. Han, “Qserve: W4a8kv4 quantization and system co-design for efficient llm serving,” Proceedings of Machine Learning and Systems, vol. 7, 2025

  10. [10]

    Power-of-Two Quantization for Low Bitwidth and Hardware Compliant Neural Networks,

D. Przewlocka-Rus, S. S. Sarwar, H. E. Sumbul, Y. Li, and B. De Salvo, “Power-of-Two Quantization for Low Bitwidth and Hardware Compliant Neural Networks,” tinyML Research Symposium, Mar. 2022

  11. [11]

    Smoothquant: Accurate and efficient post-training quantization for large language models,

G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, “Smoothquant: Accurate and efficient post-training quantization for large language models,” in International Conference on Machine Learning, pp. 38087–38099, PMLR, 2023

  12. [12]

    SVDQuant: Absorbing Outliers by Low-Rank Component for 4-Bit Diffusion Models,

M. Li, Y. Lin, Z. Zhang, T. Cai, X. Li, J. Guo, E. Xie, C. Meng, J.-Y. Zhu, and S. Han, “SVDQuant: Absorbing Outliers by Low-Rank Component for 4-Bit Diffusion Models,” in The Thirteenth International Conference on Learning Representations, 2024

  13. [13]

    An Efficient Unstructured Sparse Convolutional Neural Network Accelerator for Wearable ECG Classification Device,

J. Lu, D. Liu, X. Cheng, L. Wei, A. Hu, and X. Zou, “An Efficient Unstructured Sparse Convolutional Neural Network Accelerator for Wearable ECG Classification Device,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 69, pp. 4572–4582, Nov. 2022

  14. [14]

Precision-aware On-device Learning and Adaptive Runtime-cONfigurable AI acceleration,

M. Lokhande, A. Jain, and S. K. Vishvakarma, “Precision-aware On-device Learning and Adaptive Runtime-cONfigurable AI acceleration,” IEEE International Symposium on VLSI Design and Test, Aug. 2025

  15. [15]

An efficient unstructured sparse convolutional neural network accelerator for wearable ECG classification device,

J. Lu, D. Liu, X. Cheng, L. Wei, A. Hu, and X. Zou, “An efficient unstructured sparse convolutional neural network accelerator for wearable ECG classification device,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 69, no. 11, pp. 4572–4582, 2022

  16. [16]

    An Efficient Hardware Accelerator for Block Sparse Convolutional Neural Networks on FPGA,

X. Yin, Z. Wu, D. Li, C. Shen, and Y. Liu, “An Efficient Hardware Accelerator for Block Sparse Convolutional Neural Networks on FPGA,” IEEE Embedded Systems Letters, vol. 16, no. 2, pp. 158–161, 2024

  17. [17]

    Systolic Tensor Array: An Efficient Structured-Sparse GEMM Accelerator for Mobile CNN Inference,

Z.-G. Liu, P. N. Whatmough, and M. Mattina, “Systolic Tensor Array: An Efficient Structured-Sparse GEMM Accelerator for Mobile CNN Inference,” IEEE Computer Architecture Letters, vol. 19, no. 1, pp. 34–37, 2020

  18. [18]

    A 3-D Multi-Precision Scalable Systolic FMA Architecture,

H. Liu, X. Lu, X. Yu, K. Li, K. Yang, et al., “A 3-D Multi-Precision Scalable Systolic FMA Architecture,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 72, no. 1, pp. 265–276, 2025

  19. [19]

    QuantMAC: Enhancing Hardware Performance in DNNs With Quantize Enabled Multiply-Accumulate Unit,

N. Ashar, G. Raut, V. Trivedi, S. K. Vishvakarma, and A. Kumar, “QuantMAC: Enhancing Hardware Performance in DNNs With Quantize Enabled Multiply-Accumulate Unit,” IEEE Access, vol. 12, pp. 43600–43614, 2024

  20. [20]

    ART-MAC: Approximate Rounding and Truncation based MAC Unit for Fault-Tolerant Applications,

V. Mishra, D. Pandey, S. Singh, S. Satapathy, K. Goswami, B. Jajodia, and D. S. Banerjee, “ART-MAC: Approximate Rounding and Truncation based MAC Unit for Fault-Tolerant Applications,” in IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1640–1644, 2022

  21. [21]

QForce-RL: Quantized FPGA-Optimized RL Compute Engine,

A. Jha, T. Dewangan, M. Lokhande, and S. K. Vishvakarma, “QForce-RL: Quantized FPGA-Optimized RL Compute Engine,” IEEE International Symposium on VLSI Design and Test (VDAT), Aug. 2025

  22. [22]

    An Improved Logarithmic Multiplier for Energy-Efficient Neural Computing,

M. S. Ansari, B. F. Cockburn, and J. Han, “An Improved Logarithmic Multiplier for Energy-Efficient Neural Computing,” IEEE Transactions on Computers, vol. 70, no. 4, pp. 614–625, 2021

  23. [23]

    Approximate Hybrid High Radix Encoding for Energy-Efficient Inexact Multipliers,

V. Leon, G. Zervakis, D. Soudris, and K. Pekmestzi, “Approximate Hybrid High Radix Encoding for Energy-Efficient Inexact Multipliers,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 3, pp. 421–430, 2018

  24. [24]

    Design of a Hardware-Efficient Approximate 4-2 Compressor for Multiplications in Image Processing,

S. Hwang, K.-W. Kwon, and Y. Kim, “Design of a Hardware-Efficient Approximate 4-2 Compressor for Multiplications in Image Processing,” IEEE Embedded Systems Letters, vol. 17, no. 4, pp. 226–229, 2025

  25. [25]

    Concept, Design, and Implementation of Reconfigurable CORDIC,

    S. Aggarwal, P. K. Meher, and K. Khare, “Concept, Design, and Implementation of Reconfigurable CORDIC,”IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 24, no. 4, pp. 1588–1592, 2016

  26. [26]

RECON: Resource-Efficient CORDIC-Based Neuron Architecture,

G. Raut, S. Rai, S. K. Vishvakarma, and A. Kumar, “RECON: Resource-Efficient CORDIC-Based Neuron Architecture,” IEEE Open Journal of Circuits and Systems, vol. 2, pp. 170–181, 2021

  27. [27]

    Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings,

N. Jouppi, G. Kurian, S. Li, P. Ma, R. Nagarajan, L. Nai, et al., “Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings,” in Proceedings of the 50th Annual International Symposium on Computer Architecture, ISCA ’23, Association for Computing Machinery, 2023

  28. [28]

    ReAFM: A Reconfigurable Nonlinear Activation Function Module for Neural Networks,

    X. Wu, S. Liang, M. Wang, and Z. Wang, “ReAFM: A Reconfigurable Nonlinear Activation Function Module for Neural Networks,”IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 70, pp. 2660–2664, July 2023

  29. [29]

Design Space Exploration of Neural Network Activation Function Circuits,

T. Yang, Y. Wei, Z. Tu, H. Zeng, M. A. Kinsy, N. Zheng, and P. Ren, “Design Space Exploration of Neural Network Activation Function Circuits,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 38, pp. 1974–1978, Oct. 2019

  30. [30]

Efficient CORDIC-Based Activation Functions for RNN Acceleration on FPGAs,

W. Shen, J. Jiang, M. Li, and S. Liu, “Efficient CORDIC-Based Activation Functions for RNN Acceleration on FPGAs,” IEEE Transactions on Artificial Intelligence, pp. 1–11, 2024

  31. [31]

A Unified Parallel CORDIC-Based Hardware Architecture for LSTM Network Acceleration,

N. A. Mohamed and J. R. Cavallaro, “A Unified Parallel CORDIC-Based Hardware Architecture for LSTM Network Acceleration,” IEEE Transactions on Computers, vol. 72, pp. 2752–2766, Oct. 2023

  32. [32]

Exploring Hardware Activation Function Design: CORDIC Architecture in Diverse Floating Formats,

M. Basavaraju, V. Rayapati, and M. Rao, “Exploring Hardware Activation Function Design: CORDIC Architecture in Diverse Floating Formats,” in 25th International Symposium on Quality Electronic Design (ISQED), pp. 1–8, 2024

  33. [33]

    A novel reconfigurable hardware architecture of neural network,

K. Khalil, O. Eldash, B. Dey, A. Kumar, and M. Bayoumi, “A novel reconfigurable hardware architecture of neural network,” in IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), no. 62, pp. 618–621, IEEE, 2023

  34. [34]

    Accurate Modulation Classification with Reusable-Feature Convolutional Neural Network,

T. Huynh-The, C.-H. Hua, V.-S. Doan, and D.-S. Kim, “Accurate Modulation Classification with Reusable-Feature Convolutional Neural Network,” in 2020 IEEE Eighth International Conference on Communications and Electronics (ICCE), pp. 12–17, 2021

  35. [35]

    Data multiplexed and hardware reused architecture for deep neural network accelerator,

G. Raut, A. Biasizzo, N. Dhakad, N. Gupta, et al., “Data multiplexed and hardware reused architecture for deep neural network accelerator,” Neurocomputing, vol. 486, pp. 147–159, 2022

  36. [36]

    HYDRA: Hybrid Data Multiplexing and Run-time Layer Configurable DNN Accelerator,

S. Kumar, K. Gupta, I. S. Dasanayake, M. Lokhande, and S. K. Vishvakarma, “HYDRA: Hybrid Data Multiplexing and Run-time Layer Configurable DNN Accelerator,” Proc. 19th Int. Conf. on Industrial and Information Systems (ICIIS), Sri Lanka, Jan. 2026

  37. [37]

    LPRE: Logarithmic Posit-enabled Reconfigurable edge-AI Engine,

    O. Kokane, M. Lokhande, G. Raut, A. Teman, and S. K. Vishvakarma, “LPRE: Logarithmic Posit-enabled Reconfigurable edge-AI Engine,” pp. 1–5, Dec. 2024

  38. [38]

AHCO-YOLO: An Algorithm–Hardware Co-Optimization Framework for Energy-Efficient and Real-Time Object Detection on Edge Devices,

J. Kim, J.-K. Kang, and Y. Kim, “AHCO-YOLO: An Algorithm–Hardware Co-Optimization Framework for Energy-Efficient and Real-Time Object Detection on Edge Devices,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., pp. 1–14, 2025

  39. [39]

    The pascal visual object classes (voc) challenge,

M. Everingham, L. Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International Journal of Computer Vision, vol. 88, pp. 303–338, 2009

  40. [40]

    Microsoft coco: Common objects in context,

T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Computer Vision – ECCV 2014, (Cham), pp. 740–755, Springer International Publishing, 2014

  41. [41]

    MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, et al., “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” CoRR, vol. abs/1704.04861, 2017

  42. [42]

Efficientnet: Rethinking model scaling for convolutional neural networks,

M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in International Conference on Machine Learning, pp. 6105–6114, PMLR, 2019

  43. [43]

    Going deeper with convolutions,

C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015

  44. [44]

    ImageNet: A large-scale hierarchical image database,

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, 2009

  45. [45]

    Xilinx LogiCORE IP MAC v12.0,

    A. M. D. Inc., “Xilinx LogiCORE IP MAC v12.0,” Feb. 2024

  46. [46]

    A Two-Stage Operand Trimming Approximate Logarithmic Multiplier,

R. Pilipović, P. Bulić, and U. Lotrič, “A Two-Stage Operand Trimming Approximate Logarithmic Multiplier,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 68, no. 6, pp. 2535–2545, 2021

  47. [47]

    Unified Posit/IEEE-754 Vector MAC Unit for Transprecision Computing,

L. Crespo, P. Tomás, and N. Roma, “Unified Posit/IEEE-754 Vector MAC Unit for Transprecision Computing,” IEEE Trans. on Circuits and Syst. II, vol. 69, pp. 2478–2482, May 2022

  48. [48]

    Maestro: A 302 GFLOPS/W and 19.8 GFLOPS RISC-V Vector-Tensor Architecture for Wearable Ultrasound Edge Computing,

M. Sinigaglia et al., “Maestro: A 302 GFLOPS/W and 19.8 GFLOPS RISC-V Vector-Tensor Architecture for Wearable Ultrasound Edge Computing,” IEEE Trans. on Circuits and Syst.-I, pp. 1–15, 2025

  49. [49]

    LPRE: Logarithmic Posit-enabled Reconfigurable edge-AI Engine,

    O. Kokane, M. Lokhande, G. Raut, A. Teman, and S. K. Vishvakarma, “LPRE: Logarithmic Posit-enabled Reconfigurable edge-AI Engine,” in 2025 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5, May 2025

  50. [50]

    A Low-Cost FP Dot-Product-Dual-Accumulate Architecture for HPC-Enabled AI,

H. Tan, L. Huang, et al., “A Low-Cost FP Dot-Product-Dual-Accumulate Architecture for HPC-Enabled AI,” IEEE Trans. Comp.-Aided Des. Integ. Cir. Syst., vol. 43, pp. 681–693, Feb. 2024

  51. [51]

A Reconfigurable Processing Element for Multiple-Precision Floating/Fixed-Point HPC,

B. Li, K. Li, J. Zhou, Y. Ren, W. Mao, H. Yu, and N. Wong, “A Reconfigurable Processing Element for Multiple-Precision Floating/Fixed-Point HPC,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 71, no. 3, pp. 1401–1405, 2024

  52. [52]

Multiple-Mode-Supporting Floating-Point FMA Unit for Deep Learning Processors,

H. Tan, G. Tong, L. Huang, L. Xiao, and N. Xiao, “Multiple-Mode-Supporting Floating-Point FMA Unit for Deep Learning Processors,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 31, no. 2, pp. 253–266, 2023

  53. [53]

    A Configurable Floating-Point Multiple-Precision Processing Element for HPC and AI Converged Computing,

W. Mao, K. Li, Q. Cheng, L. Dai, B. Li, X. Xie, H. Li, L. Lin, and H. Yu, “A Configurable Floating-Point Multiple-Precision Processing Element for HPC and AI Converged Computing,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 30, no. 2, pp. 213–226, 2022

  54. [54]

High-Performance Accurate and Approximate Multipliers for FPGA-Based Hardware Accelerators,

S. Ullah, S. Rehman, M. Shafique, and A. Kumar, “High-Performance Accurate and Approximate Multipliers for FPGA-Based Hardware Accelerators,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 41, no. 2, pp. 211–224, 2022

  55. [55]

    A Precision-Aware Neuron Engine for DNN Accelerators,

S. Vishwakarma, G. Raut, S. Jaiswal, S. K. Vishvakarma, and D. Ghai, “A Precision-Aware Neuron Engine for DNN Accelerators,” SN Computer Science, vol. 5, no. 5, p. 494, 2024

  56. [56]

Energy-Efficient Low-Latency Signed Multiplier for FPGA-Based Hardware Accelerators,

S. Ullah, T. D. A. Nguyen, and A. Kumar, “Energy-Efficient Low-Latency Signed Multiplier for FPGA-Based Hardware Accelerators,” IEEE Embedded Systems Letters, vol. 13, no. 2, pp. 41–44, 2021

  57. [57]

    RAMAN: A Reconfigurable and Sparse tinyML Accelerator for Inference on Edge,

A. Krishna, S. Rohit Nudurupati, D. G. Chandana, P. Dwivedi, A. van Schaik, M. Mehendale, and C. S. Thakur, “RAMAN: A Reconfigurable and Sparse tinyML Accelerator for Inference on Edge,” IEEE Internet of Things Journal, vol. 11, pp. 24831–24845, July 2024

  58. [58]

    Edge-Side Fine-Grained Sparse CNN Accelerator With Efficient Dynamic Pruning Scheme,

B. Wu, T. Yu, K. Chen, and W. Liu, “Edge-Side Fine-Grained Sparse CNN Accelerator With Efficient Dynamic Pruning Scheme,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 71, pp. 1285–1298, Mar. 2024

  59. [59]

    A High-Throughput Full-Dataflow MobileNetv2 Accelerator on Edge FPGA,

W. Jiang, H. Yu, and Y. Ha, “A High-Throughput Full-Dataflow MobileNetv2 Accelerator on Edge FPGA,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 42, no. 5, pp. 1532–1545, 2023

  60. [60]

    ShortcutFusion: From Tensorflow to FPGA-Based Accelerator With a Reuse-Aware Memory Allocation for Shortcut Data,

D. T. Nguyen, H. Je, T. N. Nguyen, S. Ryu, K. Lee, and H.-J. Lee, “ShortcutFusion: From Tensorflow to FPGA-Based Accelerator With a Reuse-Aware Memory Allocation for Shortcut Data,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 69, no. 6, pp. 2477–2489, 2022

  61. [61]

    A Real-Time Object Detection Processor With xnor-Based Variable-Precision Computing Unit,

W. Lee, K. Kim, W. Ahn, J. Kim, and D. Jeon, “A Real-Time Object Detection Processor With xnor-Based Variable-Precision Computing Unit,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 31, no. 6, pp. 749–761, 2023

  62. [62]

    Dedicated FPGA Implementation of the Gaussian TinyYOLOv3 Accelerator,

    S. Ki, J. Park, and H. Kim, “Dedicated FPGA Implementation of the Gaussian TinyYOLOv3 Accelerator,”IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 70, no. 10, pp. 3882–3886, 2023