Recognition: no theorem link
End-to-End Keyword Spotting on FPGA Using Graph Neural Networks with a Neuromorphic Auditory Sensor
Pith reviewed 2026-05-12 02:45 UTC · model grok-4.3
The pith
A single FPGA chip can run real-time keyword spotting by feeding raw events from a neuromorphic auditory sensor straight into a graph neural network without conventional preprocessing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present the first end-to-end FPGA implementation of a keyword spotting system that integrates a Neuromorphic Auditory Sensor and a graph neural network on a single device, enabling real-time processing of raw audio data through a compute-near-memory architecture that operates directly on event-based streams.
What carries the argument
The compute-near-memory network architecture, which places GNN inference next to the memory that holds the sparse event data from the neuromorphic auditory sensor, allowing direct classification without intermediate feature-extraction steps.
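To make that data path concrete, below is a minimal software sketch of the general idea: sparse (channel, timestamp) auditory events become nodes of a spatio-temporal graph and a small message-passing network classifies the graph, with no spectrogram or MFCC stage in between. This is not the authors' architecture; the k-nearest-neighbour edge rule, the layer sizes, and the 35-class output are illustrative assumptions (PyTorch).

```python
import torch
import torch.nn as nn

def events_to_graph(channels, timestamps, k=8):
    """Build a k-NN graph over (channel, time) event coordinates.

    channels, timestamps: 1-D tensors of equal length (one entry per event).
    Returns node features [N, 2] and an edge index [2, N*k].
    """
    # Each event becomes a node with a 2-D spectro-temporal coordinate.
    x = torch.stack([channels.float(), timestamps.float()], dim=1)
    # Pairwise distances in (channel, time) space; connect each node to its k nearest.
    dist = torch.cdist(x, x)
    knn = dist.topk(k + 1, largest=False).indices[:, 1:]  # drop the self-match
    src = torch.arange(x.size(0)).repeat_interleave(k)
    dst = knn.reshape(-1)
    return x, torch.stack([src, dst])

class TinyEventGNN(nn.Module):
    """Two rounds of mean-aggregation message passing, then global pooling."""
    def __init__(self, hidden=32, num_classes=35):
        super().__init__()
        self.embed = nn.Linear(2, hidden)
        self.msg1 = nn.Linear(hidden, hidden)
        self.msg2 = nn.Linear(hidden, hidden)
        self.head = nn.Linear(hidden, num_classes)

    def propagate(self, h, edge_index, lin):
        src, dst = edge_index
        agg = torch.zeros_like(h).index_add_(0, dst, h[src])          # sum neighbour features
        deg = torch.zeros(h.size(0)).index_add_(0, dst, torch.ones(src.size(0))).clamp(min=1)
        return torch.relu(lin(agg / deg.unsqueeze(1)))                # mean aggregation + MLP

    def forward(self, channels, timestamps):
        x, edge_index = events_to_graph(channels, timestamps)
        h = torch.relu(self.embed(x))
        h = self.propagate(h, edge_index, self.msg1)
        h = self.propagate(h, edge_index, self.msg2)
        return self.head(h.mean(dim=0))                               # global mean pool -> logits

# Example: 200 random events over 64 cochlea-like channels within a 1-second window.
model = TinyEventGNN()
logits = model(torch.randint(0, 64, (200,)), torch.rand(200))
print(logits.shape)  # torch.Size([35])
```

On the FPGA, the compute-near-memory design would place the equivalent of the aggregation and update steps next to the memory buffering the event graph; the abstract attributes the low latency and low power consumption to this arrangement.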
If this is right
- The system delivers 87.43 percent accuracy on the Google Speech Commands v2 dataset after quantization.
- End-to-end latency stays below 35 microseconds while processing raw sensor events.
- Average power consumption is 1.12 watts on the single FPGA device (see the energy estimate after this list).
- No conventional signal preprocessing steps are required between the sensor and the classifier.
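Taken together, and assuming the 1.12 W average applies across the sub-35 µs processing window (an assumption; the paper reports the two figures separately), the implied upper bound on energy per classification is roughly E ≲ P_avg × t_latency = 1.12 W × 35 µs ≈ 39 µJ.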
Where Pith is reading between the lines
- The single-chip design could reduce board complexity for always-on audio systems in small robots or wearables.
- Event-based input may lower data movement costs compared with dense audio frames in other edge audio tasks.
- The same sensor-plus-GNN pattern could be tested on different neuromorphic sensors to check whether the accuracy holds across modalities.
Load-bearing premise
The sparse event streams produced by the neuromorphic auditory sensor contain enough information for the graph neural network to reach usable keyword-spotting accuracy after quantization and without any conventional feature extraction.
What would settle it
Deploying the full quantized model on the FPGA and running it on the Google Speech Commands v2 dataset processed through the neuromorphic sensor; accuracy falling below roughly 80 percent or latency exceeding real-time bounds would show the approach does not deliver usable performance.
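For the software half of that check, a minimal evaluation loop is sketched below. It assumes a PyTorch model that maps NAS event streams (channels, timestamps) to keyword logits, such as the event-GNN sketch above, and an iterable of pre-converted GSC v2 test clips; the loader name and the software timing are hypothetical, and the 35 µs figure in the paper is an on-hardware measurement that this loop does not reproduce.

```python
import time
import torch

ACCURACY_FLOOR = 0.80      # below this, the review's falsification bar is hit
LATENCY_BOUND_S = 35e-6    # the paper's reported end-to-end bound (35 microseconds)

def evaluate(model, test_set):
    """test_set yields (channels, timestamps, label) tuples for NAS-converted GSC v2 clips."""
    correct, total, worst_latency = 0, 0, 0.0
    model.eval()
    with torch.no_grad():
        for channels, timestamps, label in test_set:
            start = time.perf_counter()
            pred = model(channels, timestamps).argmax().item()
            worst_latency = max(worst_latency, time.perf_counter() - start)
            correct += int(pred == label)
            total += 1
    accuracy = correct / max(total, 1)
    return accuracy, worst_latency

# Hypothetical usage with an assumed dataset loader:
# acc, lat = evaluate(model, load_nas_gsc_v2_test())
# print(f"accuracy={acc:.4f} (floor {ACCURACY_FLOOR}), "
#       f"worst software latency={lat * 1e6:.1f} us (hardware bound {LATENCY_BOUND_S * 1e6:.0f} us)")
```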
Original abstract
With the rapid growth of mobile robotics and embedded intelligence, there is an increasing demand for efficient on-device data processing on edge platforms. A promising research direction is the use of neuromorphic sensors inspired by human sensory systems, which generate sparse, event-based data encoding changes in the environment. In this work, we present the first end-to-end FPGA implementation of a keyword spotting system that integrates a Neuromorphic Auditory Sensor (NAS) and a graph neural network (GNN) on a single FPGA device, enabling real-time processing of raw audio data. The proposed architecture eliminates conventional signal preprocessing and operates directly on event-based audio streams. Leveraging a compute-near-memory network architecture, the system achieves efficient inference with low latency and low power consumption. Experimental results demonstrate an accuracy of 87.43% after quantization on the Google Speech Commands v2 dataset processed through the neuromorphic sensor, with end-to-end latency below 35 us and average power consumption of 1.12 W. The processed datasets, software models, and hardware modules are available at https://github.com/vision-agh/NAS-GNN-KWS.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents the first end-to-end FPGA implementation of a keyword spotting system integrating a Neuromorphic Auditory Sensor (NAS) and a graph neural network (GNN) on a single device. It processes raw audio directly via sparse event streams without conventional signal preprocessing, achieving 87.43% accuracy on Google Speech Commands v2 after quantization, with end-to-end latency below 35 μs and average power of 1.12 W. The architecture employs a compute-near-memory design, and the authors release processed datasets, models, and hardware modules on GitHub.
Significance. If the results hold, the work demonstrates a practical integration of neuromorphic sensing and GNNs for low-latency, low-power edge audio processing on FPGAs. The single-device implementation and open release of code and models are strengths that support reproducibility and extension in neuromorphic hardware for ML. The significance would be higher with explicit verification that the event streams preserve discriminative information without implicit feature extraction.
Major comments (2)
- Abstract: The reported post-quantization accuracy of 87.43% is presented without training details, baseline comparisons to standard KWS pipelines (e.g., MFCC + DNN), or error analysis, all of which are needed to substantiate that the NAS event streams retain sufficient information for usable accuracy (a minimal sketch of such a baseline follows this list).
- Experimental results: No ablation is provided that isolates the NAS event encoding and graph construction from conventional feature extraction on the same dataset and model family. This directly affects the central claim that the system 'eliminates conventional signal preprocessing' while achieving competitive performance.
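As a reference point for the kind of baseline the first comment asks for, the sketch below shows the shape of a conventional MFCC front end plus a small dense classifier in PyTorch/torchaudio. All hyperparameters (16 kHz input, 40 coefficients, layer widths, 35 classes) are illustrative assumptions, not the paper's setup; a competitive baseline would also keep the temporal structure of the MFCC map (e.g., with a CNN) rather than averaging over frames.

```python
import torch
import torch.nn as nn
import torchaudio

# Conventional front end: MFCCs computed from raw 16 kHz waveforms.
mfcc = torchaudio.transforms.MFCC(
    sample_rate=16000,
    n_mfcc=40,
    melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 64},
)

# Small dense classifier on time-averaged MFCCs; 35 keyword classes in GSC v2.
baseline = nn.Sequential(
    nn.Linear(40, 256),
    nn.ReLU(),
    nn.Linear(256, 35),
)

waveform = torch.randn(1, 16000)         # stand-in for one 1-second GSC v2 clip
feats = mfcc(waveform)                   # [1, 40, T] dense feature map
logits = baseline(feats.mean(dim=-1))    # average over frames -> [1, 40] -> [1, 35]
print(logits.shape)                      # torch.Size([1, 35])
```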
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline revisions to improve clarity and substantiation of our claims regarding the NAS-GNN integration and performance.
Point-by-point responses
- Referee: Abstract: The reported post-quantization accuracy of 87.43% is presented without training details, baseline comparisons to standard KWS pipelines (e.g., MFCC + DNN), or error analysis, which are required to substantiate that the NAS event streams retain sufficient information for usable accuracy.
  Authors: We agree that the abstract's brevity omits key supporting details. In the revised manuscript, we will add a concise mention of the training procedure (e.g., optimizer, epochs, quantization method) and an error-analysis summary, either in the abstract or prominently in Section 4. We will also include a table comparing our accuracy to representative MFCC+DNN and other KWS baselines from the literature on the same Google Speech Commands v2 dataset. This will better demonstrate that the event streams preserve discriminative information. Full re-training and FPGA porting of baselines is beyond the scope of demonstrating our end-to-end neuromorphic pipeline. Revision: partial
- Referee: Experimental results: No ablation is provided that isolates the NAS event encoding and graph construction from conventional feature extraction on the same dataset and model family. This directly affects the central claim that the system 'eliminates conventional signal preprocessing' while achieving competitive performance.
  Authors: The manuscript's core contribution is the first single-FPGA integration of NAS event streams with a GNN, which by design bypasses conventional preprocessing (e.g., no MFCC or spectrogram computation). An ablation isolating NAS encoding versus conventional features on an identical model family is not directly applicable, as our GNN operates on sparse event graphs rather than dense feature maps; a fair comparison would require redesigning the model and input pipeline. We will revise the experimental section to explicitly discuss this architectural distinction, cite prior event-based audio works showing competitive accuracy, and clarify that the 'eliminates preprocessing' claim refers to the absence of traditional signal-processing steps in our deployed system rather than a claim of performance superiority. Revision: partial
Circularity Check
No circularity: empirical hardware implementation with measured accuracy
Full rationale
The manuscript reports an FPGA-based keyword spotting system using NAS event streams fed to a GNN, with accuracy, latency, and power measured on Google Speech Commands v2 after quantization. No equations, fitted parameters, or derivations are presented that reduce a claimed result to its own inputs by construction. The central result is an end-to-end hardware measurement rather than a prediction derived from self-referential definitions or self-citations. Self-citations, if present, are not load-bearing for the accuracy claim, which rests on direct experimental evaluation.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Abbott, L.F.: Lapicque's introduction of the integrate-and-fire model neuron (1907). Brain Research Bulletin 50(5-6), 303–304 (1999)
- [2] Ansel, J., Yang, E., He, H., Gimelshein, N., Jain, A., Voznesensky, M., Bao, B., Bell, P., Berard, D., Burovski, E., Chauhan, G., Chourdia, A., Constable, W., Desmaison, A., DeVito, Z., Ellison, E., Feng, W., Gong, J., Gschwind, M., Hirsh, B., Huang, S., Kalambarkar, K., Kirsch, L., Lazos, M., Lezcano, M., Liang, Y., Liang, J., Lu, Y., Luk, C., Maher, B...
- [3] Baronig, M., Ferrand, R., Sabathiel, S., Legenstein, R.: Advancing spatio-temporal processing through adaptation in spiking neural networks. Nature Communications 16(1), 5776 (Jul 2025). https://doi.org/10.1038/s41467-025-60878-z
- [4] Bittar, A., Garner, P.N.: A surrogate gradient spiking baseline for speech command recognition. Frontiers in Neuroscience 16 (2022). https://doi.org/10.3389/fnins.2022.865897
- [5] Carpegna, A., Savino, A., Carlo, S.D.: Spiker+: a framework for the generation of efficient Spiking Neural Networks FPGA accelerators for inference at the edge. IEEE Transactions on Emerging Topics in Computing (01), 1–15 (Dec 2024). https://doi.org/10.1109/TETC.2024.3511676
- [6] Chan, V., Liu, S.C., van Schaik, A.: AER EAR: A matched silicon cochlea pair with address event representation interface. IEEE Transactions on Circuits and Systems I: Regular Papers 54(1), 48–59 (2007). https://doi.org/10.1109/TCSI.2006.887979
- [7] Cramer, B., Stradmann, Y., Schemmel, J., Zenke, F.: The Heidelberg Spiking Data Sets for the Systematic Evaluation of Spiking Neural Networks. IEEE Transactions on Neural Networks and Learning Systems, pp. 1–14 (2020). https://doi.org/10.1109/TNNLS.2020.3044364
- [8] Cramer, B., Stradmann, Y., Schemmel, J., Zenke, F.: The Heidelberg spiking data sets for the systematic evaluation of spiking neural networks. IEEE Transactions on Neural Networks and Learning Systems 33(7), 2744–2757 (2022). https://doi.org/10.1109/TNNLS.2020.3044364, https://zenkelab.org/resources/spiking-heidelberg-datasets-shd/
- [9] Dampfhoffer, M., Mesquida, T., Valentian, A., Anghel, L.: Investigating current-based and gating approaches for accurate and energy-efficient spiking recurrent neural networks. In: Pimenidis, E., Angelov, P., Jayne, C., Papaleonidas, A., Aydin, M. (eds.) Artificial Neural Networks and Machine Learning – ICANN 2022. pp. 359–370. Springer Nature Switzerlan...
- [10] Gambin, I., Grech, I., Casha, O., Gatt, E., Micallef, J.: Digital cochlea model implementation using Xilinx XC3S500E Spartan-3E FPGA. In: 2010 17th IEEE International Conference on Electronics, Circuits and Systems. pp. 946–949 (2010). https://doi.org/10.1109/ICECS.2010.5724669
- [11]
- [12] He, K., Chen, D., Su, T.: A configurable accelerator for keyword spotting based on small-footprint temporal efficient neural network. Electronics 11(16), 2571 (2022)
- [13] Huber, T.E., Lecomte, J., Polovnikov, B., von Arnim, A.: Scaling up resonate-and-fire networks for fast deep learning. In: Del Bue, A., Canton, C., Pont-Tuset, J., Tommasi, T. (eds.) Computer Vision – ECCV 2024 Workshops. pp. 241–258. Springer Nature Switzerland, Cham (2025). https://doi.org/10.1007/978-3-031-92460-6_15
- [14] Jeziorek, K., Wzorek, P., Blachut, K., Nakano, H., Dampfhoffer, M., Mesquida, T., Nishi, H., Dalgaty, T., Kryjak, T.: Hardware-accelerated graph neural networks: an alternative approach for neuromorphic event-based audio classification and keyword spotting on SoC FPGA (2026), https://arxiv.org/abs/2602.16442
- [15] Jimenez-Fernandez, A., Linares-Barranco, A., Paz-Vicente, R., Jiménez, G., Civit, A.: Building blocks for spikes signals processing. In: The 2010 International Joint Conference on Neural Networks (IJCNN). pp. 1–8 (2010). https://doi.org/10.1109/IJCNN.2010.5596845
- [16] Jiménez-Fernández, A., Cerezuela-Escudero, E., Miró-Amarante, L., Domínguez-Morales, M.J., Gomez-Rodriguez, F., Linares-Barranco, A., Jiménez-Moreno, G.: A binaural neuromorphic auditory sensor for FPGA: A spike signal processing approach. IEEE Trans. Neural Netw. Learning Syst. 28(4), 804–818 (2017)
- [17] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
- [18] Liu, S.C., van Schaik, A., Minch, B.A., Delbruck, T.: Asynchronous binaural spatial audition sensor with 2×64×4 channel output. IEEE Transactions on Biomedical Circuits and Systems 8(4), 453–464 (2014). https://doi.org/10.1109/TBCAS.2013.2281834
- [19] Loshchilov, I., Hutter, F.: SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)
- [20] Lyon, R., Mead, C.: An analog electronic cochlea. IEEE Transactions on Acoustics, Speech, and Signal Processing 36(7), 1119–1134 (1988). https://doi.org/10.1109/29.1639
- [21]
- [22] Matinizadeh, S., Pacik-Nelson, N., Polykretis, I., Tishbi, K., Kumar, S., Varshika, M.L., Mohammadhassani, A., Mishra, A., Kandasamy, N., Shackleford, J., Gallo, E., Das, A.: A fully-configurable open-source software-defined digital quantized spiking neural core architecture (2024), https://arxiv.org/abs/2404.02248
- [23] Nakano, H., Blachut, K., Jeziorek, K., Wzorek, P., Dampfhoffer, M., Mesquida, T., Nishi, H., Kryjak, T., Dalgaty, T.: Hardware-accelerated event-graph neural networks for low-latency time-series classification on SoC FPGA. In: International Symposium on Applied Reconfigurable Computing. pp. 51–68. Springer (2025)
- [24] Perez-Nieves, N., Leung, V.C.H., Dragotti, P.L., Goodman, D.F.M.: Neural heterogeneity promotes robust learning. Nature Communications 12(1), 5791 (Oct 2021). https://doi.org/10.1038/s41467-021-26022-3
- [25] Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: Deep hierarchical feature learning on point sets in a metric space (2017), https://arxiv.org/abs/1706.02413
- [26] Rafeldt, L., Mesquida, T., Nakano, H., Dampfhoffer, M., Moro, F., Vivet, P., Payvand, M., Dalgaty, T.: Event-based audio prediction with spectro-temporal event-graphs. In: 2025 IEEE International Symposium on Circuits and Systems (ISCAS). pp. 1–5 (2025). https://doi.org/10.1109/ISCAS56072.2025.11043865
- [27] Rossbroich, J., Gygax, J., Zenke, F.: Fluctuation-driven initialization for spiking neural network training. Neuromorphic Computing and Engineering 2(4), 044016 (Dec 2022). https://doi.org/10.1088/2634-4386/ac97bb
- [28] Sadovsky, E., Jakubec, M., Jarina, R.: Speech command recognition based on convolutional spiking neural networks. In: 2023 33rd International Conference Radioelektronika (RADIOELEKTRONIKA). pp. 1–5 (2023). https://doi.org/10.1109/RADIOELEKTRONIKA57919.2023.10109082
- [29] Schöne, M., Sushma, N.M., Zhuge, J., Mayr, C., Subramoney, A., Kappel, D.: Scalable event-by-event processing of neuromorphic sensory signals with deep state-space models. In: 2024 International Conference on Neuromorphic Systems (ICONS). pp. 124–131 (2024). https://doi.org/10.1109/ICONS62911.2024.00026
- [30] Sun, P., Wu, J., Devos, P., Botteldooren, D.: Towards parameter-free attentional spiking neural networks. Neural Networks 185, 107154 (2025). https://doi.org/10.1016/j.neunet.2025.107154, https://www.sciencedirect.com/science/article/pii/S0893608025000334
- [31] Thakur, C.S., Hamilton, T.J., Tapson, J., van Schaik, A., Lyon, R.F.: FPGA implementation of the CAR model of the cochlea. In: 2014 IEEE International Symposium on Circuits and Systems (ISCAS). pp. 1853–1856 (2014). https://doi.org/10.1109/ISCAS.2014.6865519
- [32] Wang, S., Koickal, T.J., Enemali, G., Gouveia, L., Wang, L., Hamilton, A.: Design of a silicon cochlea system with biologically faithful response. In: 2015 International Joint Conference on Neural Networks (IJCNN). pp. 1–7 (2015). https://doi.org/10.1109/IJCNN.2015.7280828
- [34]
- [35] Yin, B., Corradi, F., Bohté, S.M.: Accurate and efficient time-domain classification with adaptive spiking recurrent neural networks. Nature Machine Intelligence 3(10), 905–913 (Oct 2021). https://doi.org/10.1038/s42256-021-00397-w
- [36] Zhang, A., Shi, J., Qian, H., Wang, J.: High precision speech keyword spotting based on binary deep neural network in FPGA. Entropy 27(11), 1143 (2025)