When Spike Sparsity Does Not Translate to Deployed Cost: VS-WNO on Jetson Orin Nano
Pith reviewed 2026-05-10 06:50 UTC · model grok-4.3
The pith
Spike sparsity in VS-WNO does not reduce deployed latency or energy on Jetson Orin Nano
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On the Jetson Orin Nano 8 GB, five pretrained VS-WNO checkpoints exhibit algorithmic spike rates that decline from 54.26 percent in the first spiking layer to 18.15 percent in the fourth, yet they incur 59.6 ms latency and 228.0 mJ dynamic energy per inference while the matched dense WNO checkpoints reach 53.2 ms and 180.7 mJ with marginally lower reference-path error. Nsight Systems traces show that the deployment request path stays launch-dominated, with cudaLaunchKernel consuming 81.6 percent of CUDA API time and dense convolution kernels consuming 53.8 percent of GPU kernel time for both sparse and dense variants.
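The energy and latency figures above also imply the average dynamic power each variant draws during inference; a quick arithmetic check using only the numbers quoted in the claim:

```python
# Implied average dynamic power from the reported per-inference figures.
# The measurements come from the paper; the division is the only step added here.
vswno_energy_mj, vswno_latency_ms = 228.0, 59.6
wno_energy_mj, wno_latency_ms = 180.7, 53.2

vswno_power_w = vswno_energy_mj / vswno_latency_ms  # mJ / ms == W
wno_power_w = wno_energy_mj / wno_latency_ms

# VS-WNO draws more dynamic power (about 3.83 W vs 3.40 W) on top of
# running longer, so its per-inference energy gap compounds both factors.
print(f"VS-WNO: {vswno_power_w:.2f} W, dense WNO: {wno_power_w:.2f} W")
```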
What carries the argument
The deployment-style request path on the Jetson CUDA stack, in which dense convolution kernels and repeated kernel launches continue to execute irrespective of measured spike activity.
If this is right
- VS-WNO reaches higher latency and higher dynamic energy than dense WNO despite lower spike rates.
- The request path remains dominated by cudaLaunchKernel calls and dense convolution kernels.
- Algorithmic sparsity is present but the runtime does not suppress dense work as spike activity falls.
- Dense WNO achieves slightly lower reference-path error than VS-WNO on the Darcy benchmark.
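The mechanism behind the third bullet can be made concrete with a minimal sketch (not the paper's code, and using a GEMM in place of the actual wavelet-layer convolutions): when spikes enter the next layer as an elementwise mask, the downstream dense kernel sees a tensor of the same shape regardless of how many entries the mask zeroed, so its launch count and FLOP count do not shrink with the spike rate.

```python
import numpy as np

# Hypothetical layer: spike masking via elementwise multiply, then a dense op.
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 256))      # activations (placeholder sizes)
w = rng.standard_normal((256, 256))    # dense weights

spikes = (x > 0.8).astype(x.dtype)     # binary spike mask
spike_rate = spikes.mean()             # fraction of active units

y_dense = x @ w                        # dense path
y_spiking = (x * spikes) @ w           # masked path: identical GEMM shape,
                                       # identical FLOP count, zeros included

assert y_spiking.shape == y_dense.shape  # same dense workload either way
print(f"spike rate {spike_rate:.1%}; both paths run the full 256x256 GEMM")
```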
Where Pith is reading between the lines
- Designers may need sparsity-aware kernels or different hardware stacks before spiking models can outperform dense ones on edge GPUs.
- The result may be specific to the current CUDA launch model and could change if future runtimes prune work based on spike masks.
- Because error rates remain comparable, the choice between dense and spiking versions on this platform is driven by runtime cost rather than accuracy.
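What a sparsity-aware runtime of the kind the first two bullets anticipate could do is also easy to sketch (again a hypothetical illustration, not something the tested CUDA stack performs): gather the indices of active units and run a reduced GEMM whose cost scales with the spike rate rather than the layer width.

```python
import numpy as np

# Hedged sketch of a sparsity-exploiting path: only rows of the weight
# matrix corresponding to spiking units participate in the product.
rng = np.random.default_rng(1)
x = rng.standard_normal(256)           # placeholder activations
w = rng.standard_normal((256, 128))    # placeholder weights

spikes = x > 0.8
active = np.flatnonzero(spikes)        # indices of spiking units

y_dense = (x * spikes) @ w             # dense path: full 256x128 GEMM
y_sparse = x[active] @ w[active]       # reduced GEMM over active rows only

assert np.allclose(y_dense, y_sparse)  # same result, fewer FLOPs
```

The reduced path needs a gather and a variable-shape GEMM per inference, which is exactly the kind of data-dependent work the current launch-dominated request path does not express.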
Load-bearing premise
The observed runtime behavior on the Jetson Orin Nano with the tested CUDA stack is representative of typical deployment paths for spiking neural operators.
What would settle it
A trace on the same hardware in which decreasing spike rates cause a proportional drop in dense convolution kernel time or in the number of cudaLaunchKernel calls would falsify the claim that sparsity fails to reduce deployed cost.
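Such a trace can be collected on the device with Nsight Systems; a sketch of the workflow (the script and output names are placeholders, and the stats report names vary across nsys versions, e.g. `cuda_api_sum`/`cuda_gpu_kern_sum` in recent releases versus `cudaapisum`/`gpukernsum` in older ones):

```shell
# Profile one inference run, then summarize CUDA API calls and GPU kernel
# time; compare these summaries across checkpoints with different spike rates.
nsys profile -t cuda,osrt -o vswno_run python infer.py --checkpoint vswno.pt
nsys stats --report cuda_api_sum --report cuda_gpu_kern_sum vswno_run.nsys-rep
```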
Original abstract
Spiking neural operators are appealing for neuromorphic edge computing because event-driven substrates can, in principle, translate sparse activity into lower latency and energy. Whether that advantage survives deployment on commodity edge-GPU software stacks, however, remains unclear. We study this question on a Jetson Orin Nano 8 GB using five pretrained variable-spiking wavelet neural operator (VS-WNO) checkpoints and five matched dense wavelet neural operator (WNO) checkpoints on the Darcy rectangular benchmark. On a reference-aligned path, VS-WNO exhibits substantial algorithmic sparsity, with mean spike rates decreasing from 54.26% at the first spiking layer to 18.15% at the fourth. On a deployment-style request path, however, this sparsity does not reduce deployed cost: VS-WNO reaches 59.6 ms latency and 228.0 mJ dynamic energy per inference, whereas dense WNO reaches 53.2 ms and 180.7 mJ, while also achieving slightly lower reference-path error (1.77% versus 1.81%). Nsight Systems indicates that the request path remains launch-dominated and dense rather than sparsity-aware: for VS-WNO, cudaLaunchKernel accounts for 81.6% of CUDA API time within the latency window, and dense convolution kernels account for 53.8% of GPU kernel time; dense WNO shows the same pattern. On this Jetson-class GPU stack, spike sparsity is measurable but does not reduce deployed cost because the runtime does not suppress dense work as spike activity decreases.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports an empirical evaluation of variable-spiking wavelet neural operators (VS-WNO) against dense wavelet neural operators (WNO) deployed on the Jetson Orin Nano. It demonstrates that although VS-WNO achieves algorithmic sparsity, with spike rates declining from 54.26% in the first spiking layer to 18.15% in the fourth, this does not translate into lower latency or energy consumption (59.6 ms and 228.0 mJ for VS-WNO versus 53.2 ms and 180.7 mJ for WNO), because the CUDA request path remains dominated by kernel launches (cudaLaunchKernel, 81.6% of CUDA API time) and dense convolution kernels (53.8% of GPU kernel time).
Significance. If the findings hold, they highlight a significant practical limitation in translating theoretical sparsity advantages of spiking models to real-world edge GPU deployments on commodity stacks. The use of direct hardware measurements and profiling tools like Nsight Systems provides strong, falsifiable evidence for the claim, which could inform the development of sparsity-exploiting runtimes or hardware.
Minor comments (2)
- [Abstract] The abstract states 'slightly lower reference-path error (1.77% versus 1.81%)' for dense WNO; clarifying whether this difference is statistically significant would help assess if the models are truly matched.
- [Profiling Analysis] The percentages for CUDA API time and GPU kernel time are provided, but a full breakdown table or figure would allow readers to better understand the contribution of each component to the overall latency.
Simulated Author's Rebuttal
We thank the referee for their accurate summary of the manuscript and for recognizing the significance of demonstrating that algorithmic spike sparsity does not reduce deployed latency or energy on the Jetson Orin Nano under commodity CUDA stacks. The recommendation for minor revision is noted.
Circularity Check
No significant circularity
Full rationale
The paper consists entirely of direct empirical measurements (spike rates from 54.26% to 18.15%, latencies of 59.6 ms vs 53.2 ms, energies of 228.0 mJ vs 180.7 mJ, and Nsight breakdowns showing 81.6% cudaLaunchKernel and 53.8% dense kernels) on fixed hardware and pretrained checkpoints. No derivation chain, equations, fitted parameters renamed as predictions, or load-bearing self-citations exist; all claims are scoped observations of measured behavior on the Jetson Orin Nano CUDA stack.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: The Darcy rectangular benchmark is a standard test case for neural operator performance.