pith. machine review for the scientific record.

arxiv: 2605.09570 · v1 · submitted 2026-05-10 · 💻 cs.LG

Recognition: no theorem link

End-to-End Keyword Spotting on FPGA Using Graph Neural Networks with a Neuromorphic Auditory Sensor

Angel Jiménez-Fernández, Antonio Rios-Navarro, Kamil Jeziorek, Piotr Wzorek, Tomás Muñoz, Tomasz Kryjak, Wiktor Matykiewicz

Pith reviewed 2026-05-12 02:45 UTC · model grok-4.3

classification 💻 cs.LG
keywords keyword spotting · FPGA · graph neural networks · neuromorphic auditory sensor · event-based processing · edge computing · low-power inference

The pith

A single FPGA chip can run real-time keyword spotting by feeding raw events from a neuromorphic auditory sensor straight into a graph neural network without conventional preprocessing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that an end-to-end keyword spotting pipeline can be built on one FPGA by wiring a neuromorphic auditory sensor directly to a graph neural network. The sensor produces sparse event streams that encode audio changes, and the network classifies keywords from those events alone. This removes the conventional stages of signal filtering and feature extraction, along with the need for a separate host processor. The resulting system reaches 87.43 percent accuracy after quantization on the Google Speech Commands v2 dataset, with latency under 35 microseconds and average power of 1.12 watts. The approach targets edge devices that need low-power, always-on audio intelligence.

Core claim

The authors present the first end-to-end FPGA implementation of a keyword spotting system that integrates a Neuromorphic Auditory Sensor and a graph neural network on a single device, enabling real-time processing of raw audio data through a compute-near-memory architecture that operates directly on event-based streams.

What carries the argument

The compute-near-memory network architecture that places GNN inference close to memory handling the sparse event data from the neuromorphic auditory sensor, allowing direct classification without intermediate feature steps.
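The construction is easiest to see in miniature. Below is a minimal sketch of how sparse (timestamp, channel) events from the sensor might be linked into a graph for GNN classification; the radii and node features are illustrative assumptions, not the authors' exact pipeline (the paper tunes a low and a high time radius, per Figure 3):

```python
import numpy as np

def events_to_graph(events, time_radius=1e-3, channel_radius=4):
    """Connect events that are close in both time and cochlear channel.

    events: array of shape (N, 2) with columns (timestamp_s, channel).
    The radii are illustrative knobs, not values from the paper.
    """
    t, ch = events[:, 0], events[:, 1]
    src, dst = [], []
    for i in range(len(events)):
        # Candidate neighbours inside the temporal and channel radii of event i.
        near = np.where(
            (np.abs(t - t[i]) <= time_radius)
            & (np.abs(ch - ch[i]) <= channel_radius)
        )[0]
        for j in near:
            if i != j:
                src.append(i)
                dst.append(j)
    edge_index = np.array([src, dst])
    # Simple node features: normalised time and channel position.
    x = np.stack([t / t.max(), ch / ch.max()], axis=1)
    return x, edge_index
```

A GNN layer would then aggregate each node's neighbours over `edge_index`; the appeal of the compute-near-memory design is that this aggregation touches only the sparse events actually emitted, not a dense spectrogram.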

If this is right

  • The system delivers 87.43 percent accuracy on the Google Speech Commands v2 dataset after quantization.
  • End-to-end latency stays below 35 microseconds while processing raw sensor events.
  • Average power consumption is 1.12 watts on the single FPGA device.
  • No conventional signal preprocessing steps are required between the sensor and the classifier.
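Taken at face value, these numbers also bound the energy cost of a single inference. A back-of-envelope check, assuming the 1.12 W average holds during the 35 µs window and ignoring sensor and I/O overhead:

```python
power_w = 1.12      # reported average power
latency_s = 35e-6   # reported worst-case end-to-end latency

# Upper bound on energy per inference, under the constant-power assumption.
energy_uj = power_w * latency_s * 1e6
print(f"{energy_uj:.1f} µJ per inference (upper bound)")  # 39.2 µJ
```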

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The single-chip design could reduce board complexity for always-on audio systems in small robots or wearables.
  • Event-based input may lower data movement costs compared with dense audio frames in other edge audio tasks.
  • The same sensor-plus-GNN pattern could be tested on different neuromorphic sensors to check whether the accuracy holds across modalities.

Load-bearing premise

The sparse event streams produced by the neuromorphic auditory sensor contain enough information for the graph neural network to reach usable keyword-spotting accuracy after quantization and without any conventional feature extraction.

What would settle it

Deploying the full quantized model on the FPGA and running it on the Google Speech Commands v2 dataset processed through the neuromorphic sensor; accuracy falling below roughly 80 percent or latency exceeding real-time bounds would show the approach does not deliver usable performance.
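That acceptance test reduces to two comparisons. A hedged sketch, assuming per-run accuracy counts and latency samples are available from such a deployment (the function and thresholds are illustrative, not from the paper):

```python
def meets_bar(correct, total, latencies_s,
              min_accuracy=0.80, max_latency_s=35e-6):
    """Accept only if accuracy clears ~80% and every inference is real-time.

    Thresholds mirror the criteria stated above; both are adjustable knobs.
    """
    accuracy = correct / total
    worst_latency = max(latencies_s)
    return accuracy >= min_accuracy and worst_latency <= max_latency_s

# The reported figures (87.43% accuracy, latency below 35 µs) would pass.
assert meets_bar(8743, 10000, [20e-6, 34e-6])
```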

Figures

Figures reproduced from arXiv: 2605.09570 by Angel Jiménez-Fernández, Antonio Rios-Navarro, Kamil Jeziorek, Piotr Wzorek, Tomás Muñoz, Tomasz Kryjak, Wiktor Matykiewicz.

Figure 1. Average events per channel with standard deviation for different configu… (figures/full_fig_p005_1.png)
Figure 2. The proposed architecture is illustrated with the sensor and filtering mod… (figures/full_fig_p008_2.png)
Figure 3. Influence of the low and high time radius on the keyword-spotting metrics. (figures/full_fig_p011_3.png)
Original abstract

With the rapid growth of mobile robotics and embedded intelligence, there is an increasing demand for efficient on-device data processing on edge platforms. A promising research direction is the use of neuromorphic sensors inspired by human sensory systems, which generate sparse, event-based data encoding changes in the environment. In this work, we present the first end-to-end FPGA implementation of a keyword spotting system that integrates a Neuromorphic Auditory Sensor (NAS) and a graph neural network (GNN) on a single FPGA device, enabling real-time processing of raw audio data. The proposed architecture eliminates conventional signal preprocessing and operates directly on event-based audio streams. Leveraging a compute-near-memory network architecture, the system achieves efficient inference with low latency and low power consumption. Experimental results demonstrate an accuracy of 87.43% after quantization on the Google Speech Commands v2 dataset processed through the neuromorphic sensor, with end-to-end latency below 35 us and average power consumption of 1.12 W. The processed datasets, software models, and hardware modules are available at https://github.com/vision-agh/NAS-GNN-KWS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper presents the first end-to-end FPGA implementation of a keyword spotting system integrating a Neuromorphic Auditory Sensor (NAS) and a graph neural network (GNN) on a single device. It processes raw audio directly via sparse event streams without conventional signal preprocessing, achieving 87.43% accuracy on Google Speech Commands v2 after quantization, with end-to-end latency below 35 μs and average power of 1.12 W. The architecture employs a compute-near-memory design, and the authors release processed datasets, models, and hardware modules on GitHub.

Significance. If the results hold, the work demonstrates a practical integration of neuromorphic sensing and GNNs for low-latency, low-power edge audio processing on FPGAs. The single-device implementation and open release of code and models are strengths that support reproducibility and extension in neuromorphic hardware for ML. The significance would be higher with explicit verification that the event streams preserve discriminative information without implicit feature extraction.

major comments (2)
  1. [Abstract] Abstract: The reported post-quantization accuracy of 87.43% is presented without training details, baseline comparisons to standard KWS pipelines (e.g., MFCC + DNN), or error analysis, which are required to substantiate that the NAS event streams retain sufficient information for usable accuracy.
  2. [Experimental results] Experimental results: No ablation is provided that isolates the NAS event encoding and graph construction from conventional feature extraction on the same dataset and model family. This directly affects the central claim that the system 'eliminates conventional signal preprocessing' while achieving competitive performance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline revisions to improve clarity and substantiation of our claims regarding the NAS-GNN integration and performance.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The reported post-quantization accuracy of 87.43% is presented without training details, baseline comparisons to standard KWS pipelines (e.g., MFCC + DNN), or error analysis, which are required to substantiate that the NAS event streams retain sufficient information for usable accuracy.

    Authors: We agree that the abstract's brevity omits key supporting details. In the revised manuscript, we will add a concise mention of training procedure (e.g., optimizer, epochs, quantization method) and error analysis summary in the abstract or prominently in Section 4. We will also include a table comparing our accuracy to representative MFCC+DNN and other KWS baselines from the literature on the same Google Speech Commands v2 dataset. This will better demonstrate that the event streams preserve discriminative information. Full re-training and FPGA porting of baselines is beyond the scope of demonstrating our end-to-end neuromorphic pipeline. revision: partial

  2. Referee: [Experimental results] Experimental results: No ablation is provided that isolates the NAS event encoding and graph construction from conventional feature extraction on the same dataset and model family. This directly affects the central claim that the system 'eliminates conventional signal preprocessing' while achieving competitive performance.

    Authors: The manuscript's core contribution is the first single-FPGA integration of NAS event streams with a GNN, which by design bypasses conventional preprocessing (e.g., no MFCC or spectrogram computation). An ablation isolating NAS encoding versus conventional features on an identical model family is not directly applicable, as our GNN operates on sparse event graphs rather than dense feature maps; a fair comparison would require redesigning the model and input pipeline. We will revise the experimental section to explicitly discuss this architectural distinction, cite prior event-based audio works showing competitive accuracy, and clarify that the 'eliminates preprocessing' claim refers to the absence of traditional signal processing steps in our deployed system rather than a performance superiority claim. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical hardware implementation with measured accuracy

full rationale

The manuscript reports an FPGA-based keyword spotting system using NAS event streams fed to a GNN, with accuracy, latency, and power measured on Google Speech Commands v2 after quantization. No equations, fitted parameters, or derivations are presented that reduce a claimed result to its own inputs by construction. The central result is an end-to-end hardware measurement rather than a prediction derived from self-referential definitions or self-citations. Self-citations, if present, are not load-bearing for the accuracy claim, which rests on direct experimental evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work rests on standard assumptions of neuromorphic sensor fidelity and FPGA synthesis tools; no new free parameters, axioms, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5533 in / 1179 out tokens · 40312 ms · 2026-05-12T02:45:44.435534+00:00 · methodology

