arxiv: 2604.19293 · v1 · submitted 2026-04-21 · 💻 cs.AR

Recognition: unknown

Energy Efficient LSTM Accelerators for Embedded FPGAs through Parameterised Architecture Design

Chao Qian , Tianheng Ling , Gregor Schiele

Authors on Pith no claims yet

Pith reviewed 2026-05-10 01:48 UTC · model grok-4.3

classification 💻 cs.AR

keywords LSTM acceleratorFPGAenergy efficiencyembedded systemsparameterized designdeep learning hardwarereal-time inference

0 comments

The pith

Parameterized LSTM accelerator design reaches 11.89 GOP/s/W energy efficiency on embedded FPGAs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a hardware accelerator for Long Short-term Memory networks tailored to resource-scarce embedded FPGAs. It introduces configurable parameters for aspects such as DSP allocation and activation function implementation to raise execution speed and cut energy use. This would matter because it supports real-time processing of local sensor data streams on devices without external servers or high power draw. Evaluation results show the design sustaining 32873 samples per second at 11.89 GOP/s/W during inference.

Core claim

The authors claim that a parameterized architecture for LSTM accelerators on FPGAs, with tunable choices for DSP usage and activation functions, improves both speed and energy consumption over related designs while allowing adaptation to different hardware constraints and workloads, as shown by the measured efficiency of 11.89 GOP/s/W at 32873 samples per second in real-time operation.

What carries the argument

The parameterized architecture that lets designers select DSP usage and activation function realizations to match FPGA resources and performance goals.

If this is right

Real-time LSTM inference for time series sensor data becomes practical on power-limited embedded devices.
The same accelerator template can be retuned for different FPGA sizes or LSTM layer counts without a full redesign.
Energy use for on-device deep learning drops, allowing longer battery life or smaller power supplies in embedded systems.
Designers gain a single flexible block that covers multiple embedded FPGA targets instead of separate fixed accelerators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same parameterization idea could be applied to other recurrent or convolutional networks to gain similar efficiency on FPGAs.
Automatic tools that pick the best parameter settings for a given FPGA and model size would reduce manual tuning effort.
Wider adoption might shift more sensor analytics from cloud servers to local hardware in IoT deployments.

Load-bearing premise

The efficiency and adaptability improvements hold when the design is placed on actual target FPGAs with no extra overheads introduced by the parameterization itself.

What would settle it

Running the implemented accelerator on a concrete embedded FPGA board, measuring its power draw and throughput directly, and comparing those numbers to prior accelerators under the same workload and hardware conditions.

Figures

Figures reproduced from arXiv: 2604.19293 by Chao Qian, Gregor Schiele, Tianheng Ling.

**Figure 1.** Figure 1: Unfolding the LSTM Model Architecture in the Time Dimension To better describe the iteration process, we unfold it in the time dimension (see [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: Pipelined Loop with Five Stages and Eight Iterations As shown in [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 3.** Figure 3: LSTM Accelerator Architecture Overview Note that due to the limited number of DSPs available on the FPGA, the system prioritises allocating DSPs to ALUs on the critical path to increase the system clock frequency. This strategy is employed to make the most out of the available DSP resources. Furthermore, when selecting the weight resource type parameter, if weights such as Wf are assigned to BRAM-type reso… view at source ↗

**Figure 4.** Figure 4: Utilisation without DSPs 25 50 75 100 125 150 175 200 Hidden Size 0% 20% 40% 60% 80% 100% Utilisation (%) BRAM (in total 36kbits) DSP Slices (in total 20) LUT Slices (in total 8000) [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗

read the original abstract

Long Short-term Memory Networks (LSTMs) are a vital Deep Learning technique suitable for performing on-device time series analysis on local sensor data streams of embedded devices. In this paper, we propose a new hardware accelerator design for LSTMs specially optimised for resource-scarce embedded Field Programmable Gate Arrays (FPGAs). Our design improves the execution speed and reduces energy consumption compared to related work. Moreover, it can be adapted to different situations using a number of optimisation parameters, such as the usage of DSPs or the implementation of activation functions. We present our key design decisions and evaluate the performance. Our accelerator achieves an energy efficiency of 11.89 GOP/s/W during a real-time inference with 32873 samples/s.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes a parameterized hardware accelerator for LSTM networks on resource-scarce embedded FPGAs. The design supports customization via parameters such as DSP usage and activation function implementations, and claims improvements in execution speed and energy efficiency over related work. It reports a specific energy efficiency of 11.89 GOP/s/W at a throughput of 32873 samples/s during real-time inference.

Significance. If the performance claims can be substantiated with detailed hardware measurements and fair comparisons, the parameterized approach could offer a practical contribution to efficient on-device LSTM inference for embedded time-series applications on FPGAs.

major comments (1)

[Abstract] Abstract: The headline energy efficiency of 11.89 GOP/s/W and throughput of 32873 samples/s are presented without any supporting resource-utilization table, power measurement methodology (real hardware dynamic power vs. post-synthesis estimates), power breakdown, or side-by-side normalized comparison data against related accelerators using identical LSTM dimensions, precision, and target FPGA device.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the concern about the abstract's presentation of results below, noting that abstracts are space-constrained while the manuscript body contains the requested supporting details.

read point-by-point responses

Referee: [Abstract] Abstract: The headline energy efficiency of 11.89 GOP/s/W and throughput of 32873 samples/s are presented without any supporting resource-utilization table, power measurement methodology (real hardware dynamic power vs. post-synthesis estimates), power breakdown, or side-by-side normalized comparison data against related accelerators using identical LSTM dimensions, precision, and target FPGA device.

Authors: We agree that the abstract is concise and omits tables or detailed methodology, which is standard due to length limits. The full manuscript substantiates the claims as follows: resource utilization appears in Table II; Section IV-B details the real-hardware dynamic power measurement methodology on the target embedded FPGA (no post-synthesis estimates); Figure 6 provides the power breakdown; and Table IV presents side-by-side normalized comparisons using identical LSTM dimensions, precision, and the same FPGA device. If desired, we can revise the abstract to add a one-sentence reference to the evaluation setup in Section IV. revision: partial

Circularity Check

0 steps flagged

No circularity: performance claims rest on direct hardware measurements, not self-referential derivations or fitted predictions.

full rationale

The paper presents a parameterized FPGA accelerator architecture for LSTMs, with design choices (DSP usage, activation functions) implemented and evaluated through synthesis, place-and-route, and runtime measurements on target hardware. The reported 11.89 GOP/s/W and 32873 samples/s figures are empirical results from physical implementation and power measurement, not outputs of equations that loop back to the same fitted parameters or self-cited uniqueness theorems. No load-bearing step reduces by construction to its own inputs; the central claims are falsifiable via independent replication on the same FPGA platform.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The design rests on standard assumptions about FPGA resource mapping and synthesis tools; no new entities or heavily fitted parameters are introduced in the abstract.

free parameters (1)

optimisation parameters (DSP usage, activation implementation)
Abstract states these are used to adapt the design but does not specify how many or their exact fitting process.

axioms (1)

domain assumption FPGA synthesis tools correctly map the parameterized RTL to hardware resources without unexpected overhead.
Implicit in any FPGA accelerator claim; invoked when reporting resource usage and performance.

pith-pipeline@v0.9.0 · 5417 in / 1182 out tokens · 40948 ms · 2026-05-10T01:48:27.181419+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

18 extracted references · 1 canonical work pages

[1]

In: 2020 International Conference on Field- Programmable Technology (ICFPT)

Boutros, A., Nurvitadhi, E., Ma, R., Gribok, S., Zhao, Z., Hoe, J.C., Betz, V., Langhammer, M.: Beyond peak performance: Comparing the real performance of ai-optimized fpgas and gpus. In: 2020 International Conference on Field- Programmable Technology (ICFPT). pp. 10–19. IEEE (2020)

2020
[2]

In: Architecture of Computing Systems– ARCS 2020: 33rd International Conference, Aachen, Germany, May 25–28, 2020, Proceedings

Burger, A., Urban, P., Boubin, J., Schiele, G.: An architecture for solving the eigenvalue problem on embedded fpgas. In: Architecture of Computing Systems– ARCS 2020: 33rd International Conference, Aachen, Germany, May 25–28, 2020, Proceedings. pp. 32–43. Springer (2020) Energy Efficient LTSM Accelerators for Embedded FPGAs 15

2020
[3]

In: Proceed- ings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

Cao, S., Zhang, C., Yao, Z., Xiao, W., Nie, L., Zhan, D., Liu, Y., Wu, M., Zhang, L.: Efficient and effective sparse lstm on fpga with bank-balanced sparsity. In: Proceed- ings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. pp. 63–72 (2019)

2019
[4]

In: 2021 31st International Conference on Field-Programmable Logic and Applications (FPL)

Chen, J., Hong, S., He, W., Moon, J., Jun, S.W.: Eciton: Very Low-Power LSTM Neural Network Accelerator for Predictive Maintenance at the Edge. In: 2021 31st International Conference on Field-Programmable Logic and Applications (FPL). pp. 1–8. IEEE (2021)

2021
[5]

Proceedings of the IEEE107(8), 1655–1674 (2019)

Chen, J., Ran, X.: Deep learning with edge computing: A review. Proceedings of the IEEE107(8), 1655–1674 (2019)

2019
[6]

In: 2018 IEEE Custom Integrated Circuits Conference (CICC)

Conti, F., Cavigelli, L., Paulin, G., Susmelj, I., Benini, L.: Chipmunk: A systolically scalable 0.9 mm 2, 3.08 gop/s/mw@ 1.2 mw accelerator for near-sensor recurrent neural network inference. In: 2018 IEEE Custom Integrated Circuits Conference (CICC). pp. 1–4. IEEE (2018)

2018
[7]

In: 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC)

Fu, R., Zhang, Z., Li, L.: Using lstm and gru neural network methods for traf- fic flow prediction. In: 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC). pp. 324–328. IEEE (2016)

2016
[8]

Neural computation 9(8), 1735–1780 (1997)

Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8), 1735–1780 (1997)

1997
[9]

5) forecasting in smart cities

Huang, C.J., Kuo, P.H.: A deep cnn-lstm model for particulate matter (pm2. 5) forecasting in smart cities. Sensors18(7), 2220 (2018)

2018
[10]

Quantizing deep convolutional networks for efficient inference: A whitepaper

Krishnamoorthi, R.: Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342 (2018)

work page Pith review arXiv 2018
[11]

International Journal of Neural Systems31(03), 2130001 (2021)

Lara-Ben´ ıtez, P., Carranza-Garc´ ıa, M., Riquelme, J.C.: An experimental review on deep learning architectures for time series forecasting. International Journal of Neural Systems31(03), 2130001 (2021)

2021
[12]

Philo- sophical Transactions of the Royal Society A379(2194), 20200209 (2021)

Lim, B., Zohren, S.: Time-series forecasting with deep learning: a survey. Philo- sophical Transactions of the Royal Society A379(2194), 20200209 (2021)

2021
[13]

Procedia CIRP99, 650– 655 (2021)

Lindemann, B., M¨ uller, T., Vietz, H., Jazdi, N., Weyrich, M.: A survey on long short-term memory networks for time series prediction. Procedia CIRP99, 650– 655 (2021)

2021
[14]

In: 2020 21st International Symposium on Quality Electronic Design (ISQED)

Manjunath, N.K., Paneliya, H., Hosseini, M., Hairston, W.D., Mohsenin, T., et al.: A Low-Power LSTM Processor for Multi-Channel Brain EEG Artifact Detection. In: 2020 21st International Symposium on Quality Electronic Design (ISQED). pp. 105–110. IEEE (2020)

2020
[15]

Qian, C., Ling, T., Schiele, G.: Enhancing energy-efficiency by solving the through- put bottleneck of lstm cells for embedded fpgas. In: Machine Learning and Princi- ples and Practice of Knowledge Discovery in Databases: International Workshops of ECML PKDD 2022, Grenoble, France, September 19–23, 2022, Proceedings, Part I. pp. 594–605. Springer (2023)

2022
[16]

Expert Systems39(3), e12687 (2022)

Varadharajan, S.K., Nallasamy, V.: P-scada-a novel area and energy efficient fpga architectures for lstm prediction of heart arrthymias in biot applications. Expert Systems39(3), e12687 (2022)

2022
[17]

Neural Networks125, 70–82 (2020)

Yang, Y., Deng, L., Wu, S., Yan, T., Xie, Y., Li, G.: Training high-performance and large-scale deep neural networks with full 8-bit integers. Neural Networks125, 70–82 (2020)

2020
[18]

In: 2017 IEEE International Conference on Cluster Computing (CLUSTER)

Zhang, Y., Wang, C., Gong, L., Lu, Y., Sun, F., Xu, C., Li, X., Zhou, X.: A power- efficient accelerator based on fpgas for lstm network. In: 2017 IEEE International Conference on Cluster Computing (CLUSTER). pp. 629–630. IEEE (2017)

2017