When Spike Sparsity Does Not Translate to Deployed Cost: VS-WNO on Jetson Orin Nano
Pith reviewed 2026-05-10 06:50 UTC · model grok-4.3
The pith
Spike sparsity in VS-WNO does not reduce deployed latency or energy on Jetson Orin Nano
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On the Jetson Orin Nano 8 GB, five pretrained VS-WNO checkpoints exhibit algorithmic spike rates that decline from 54.26 percent in the first spiking layer to 18.15 percent in the fourth, yet they incur 59.6 ms latency and 228.0 mJ dynamic energy per inference while the matched dense WNO checkpoints reach 53.2 ms and 180.7 mJ with marginally lower reference-path error. Nsight Systems traces show that the deployment request path stays launch-dominated, with cudaLaunchKernel consuming 81.6 percent of CUDA API time and dense convolution kernels consuming 53.8 percent of GPU kernel time for both sparse and dense variants.
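The energy and latency figures above also imply the average dynamic power each variant draws during inference; a quick arithmetic check using only the numbers quoted in the claim:

```python
# Implied average dynamic power from the reported per-inference figures.
# The measurements come from the paper; the division is the only step added here.
vswno_energy_mj, vswno_latency_ms = 228.0, 59.6
wno_energy_mj, wno_latency_ms = 180.7, 53.2

vswno_power_w = vswno_energy_mj / vswno_latency_ms  # mJ / ms == W
wno_power_w = wno_energy_mj / wno_latency_ms

# VS-WNO draws more dynamic power (about 3.83 W vs 3.40 W) on top of
# running longer, so its per-inference energy gap compounds both factors.
print(f"VS-WNO: {vswno_power_w:.2f} W, dense WNO: {wno_power_w:.2f} W")
```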
What carries the argument
The deployment-style request path on the Jetson CUDA stack, in which dense convolution kernels and repeated kernel launches continue to execute irrespective of measured spike activity.
If this is right
- VS-WNO reaches higher latency and higher dynamic energy than dense WNO despite lower spike rates.
- The request path remains dominated by cudaLaunchKernel calls and dense convolution kernels.
- Algorithmic sparsity is present but the runtime does not suppress dense work as spike activity falls.
- Dense WNO achieves slightly lower reference-path error than VS-WNO on the Darcy benchmark.
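The mechanism behind the third bullet can be made concrete with a minimal sketch (not the paper's code, and using a GEMM in place of the actual wavelet-layer convolutions): when spikes enter the next layer as an elementwise mask, the downstream dense kernel sees a tensor of the same shape regardless of how many entries the mask zeroed, so its launch count and FLOP count do not shrink with the spike rate.

```python
import numpy as np

# Hypothetical layer: spike masking via elementwise multiply, then a dense op.
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 256))      # activations (placeholder sizes)
w = rng.standard_normal((256, 256))    # dense weights

spikes = (x > 0.8).astype(x.dtype)     # binary spike mask
spike_rate = spikes.mean()             # fraction of active units

y_dense = x @ w                        # dense path
y_spiking = (x * spikes) @ w           # masked path: identical GEMM shape,
                                       # identical FLOP count, zeros included

assert y_spiking.shape == y_dense.shape  # same dense workload either way
print(f"spike rate {spike_rate:.1%}; both paths run the full 256x256 GEMM")
```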
Where Pith is reading between the lines
- Designers may need sparsity-aware kernels or different hardware stacks before spiking models can outperform dense ones on edge GPUs.
- The result may be specific to the current CUDA launch model and could change if future runtimes prune work based on spike masks.
- Because error rates remain comparable, the choice between dense and spiking versions on this platform is driven by runtime cost rather than accuracy.
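What a sparsity-aware runtime of the kind the first two bullets anticipate could do is also easy to sketch (again a hypothetical illustration, not something the tested CUDA stack performs): gather the indices of active units and run a reduced GEMM whose cost scales with the spike rate rather than the layer width.

```python
import numpy as np

# Hedged sketch of a sparsity-exploiting path: only rows of the weight
# matrix corresponding to spiking units participate in the product.
rng = np.random.default_rng(1)
x = rng.standard_normal(256)           # placeholder activations
w = rng.standard_normal((256, 128))    # placeholder weights

spikes = x > 0.8
active = np.flatnonzero(spikes)        # indices of spiking units

y_dense = (x * spikes) @ w             # dense path: full 256x128 GEMM
y_sparse = x[active] @ w[active]       # reduced GEMM over active rows only

assert np.allclose(y_dense, y_sparse)  # same result, fewer FLOPs
```

The reduced path needs a gather and a variable-shape GEMM per inference, which is exactly the kind of data-dependent work the current launch-dominated request path does not express.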
Load-bearing premise
The observed runtime behavior on the Jetson Orin Nano with the tested CUDA stack is representative of typical deployment paths for spiking neural operators.
What would settle it
A trace on the same hardware in which decreasing spike rates cause a proportional drop in dense convolution kernel time or in the number of cudaLaunchKernel calls would falsify the claim that sparsity fails to reduce deployed cost.
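Such a trace can be collected on the device with Nsight Systems; a sketch of the workflow (the script and output names are placeholders, and the stats report names vary across nsys versions, e.g. `cuda_api_sum`/`cuda_gpu_kern_sum` in recent releases versus `cudaapisum`/`gpukernsum` in older ones):

```shell
# Profile one inference run, then summarize CUDA API calls and GPU kernel
# time; compare these summaries across checkpoints with different spike rates.
nsys profile -t cuda,osrt -o vswno_run python infer.py --checkpoint vswno.pt
nsys stats --report cuda_api_sum --report cuda_gpu_kern_sum vswno_run.nsys-rep
```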
Original abstract
Spiking neural operators are appealing for neuromorphic edge computing because event-driven substrates can, in principle, translate sparse activity into lower latency and energy. Whether that advantage survives deployment on commodity edge-GPU software stacks, however, remains unclear. We study this question on a Jetson Orin Nano 8 GB using five pretrained variable-spiking wavelet neural operator (VS-WNO) checkpoints and five matched dense wavelet neural operator (WNO) checkpoints on the Darcy rectangular benchmark. On a reference-aligned path, VS-WNO exhibits substantial algorithmic sparsity, with mean spike rates decreasing from 54.26% at the first spiking layer to 18.15% at the fourth. On a deployment-style request path, however, this sparsity does not reduce deployed cost: VS-WNO reaches 59.6 ms latency and 228.0 mJ dynamic energy per inference, whereas dense WNO reaches 53.2 ms and 180.7 mJ, while also achieving slightly lower reference-path error (1.77% versus 1.81%). Nsight Systems indicates that the request path remains launch-dominated and dense rather than sparsity-aware: for VS-WNO, cudaLaunchKernel accounts for 81.6% of CUDA API time within the latency window, and dense convolution kernels account for 53.8% of GPU kernel time; dense WNO shows the same pattern. On this Jetson-class GPU stack, spike sparsity is measurable but does not reduce deployed cost because the runtime does not suppress dense work as spike activity decreases.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports an empirical evaluation of variable-spiking wavelet neural operators (VS-WNO) against dense wavelet neural operators (WNO) deployed on the Jetson Orin Nano. It demonstrates that although VS-WNO achieves algorithmic sparsity, with spike rates declining from 54.26% in the first spiking layer to 18.15% in the fourth, this does not translate into lower latency or energy consumption (59.6 ms and 228.0 mJ for VS-WNO versus 53.2 ms and 180.7 mJ for WNO), because the CUDA request path remains dominated by kernel launches (cudaLaunchKernel, 81.6% of CUDA API time) and dense convolution kernels (53.8% of GPU kernel time).
Significance. If the findings hold, they highlight a significant practical limitation in translating theoretical sparsity advantages of spiking models to real-world edge GPU deployments on commodity stacks. The use of direct hardware measurements and profiling tools like Nsight Systems provides strong, falsifiable evidence for the claim, which could inform the development of sparsity-exploiting runtimes or hardware.
Minor comments (2)
- [Abstract] The abstract states 'slightly lower reference-path error (1.77% versus 1.81%)' for dense WNO; clarifying whether this difference is statistically significant would help assess if the models are truly matched.
- [Profiling Analysis] The percentages for CUDA API time and GPU kernel time are provided, but a full breakdown table or figure would allow readers to better understand the contribution of each component to the overall latency.
Simulated Author's Rebuttal
We thank the referee for their accurate summary of the manuscript and for recognizing the significance of demonstrating that algorithmic spike sparsity does not reduce deployed latency or energy on the Jetson Orin Nano under commodity CUDA stacks. The recommendation for minor revision is noted.
Circularity Check
No significant circularity
Full rationale
The paper consists entirely of direct empirical measurements (spike rates from 54.26% to 18.15%, latencies of 59.6 ms vs 53.2 ms, energies of 228.0 mJ vs 180.7 mJ, and Nsight breakdowns showing 81.6% cudaLaunchKernel and 53.8% dense kernels) on fixed hardware and pretrained checkpoints. No derivation chain, equations, fitted parameters renamed as predictions, or load-bearing self-citations exist; all claims are scoped observations of measured behavior on the Jetson Orin Nano CUDA stack.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: The Darcy rectangular benchmark is a standard test case for neural operator performance.