nnAudio 2: Overcoming Dynamic Compilation Barriers and Transform Inconsistencies

Abhinaba Roy; Dorien Herremans; Junyi Liang

arxiv: 2606.05394 · v2 · pith:WVUVBC6Dnew · submitted 2026-06-03 · 💻 cs.SD · eess.AS


Pith reviewed 2026-06-28 04:25 UTC · model grok-4.3


keywords: nnAudio, STFT, iSTFT, TorchScript, audio feature extraction, inverse transform, PyTorch, CQT


The pith

nnAudio 2 removes dynamic state changes from STFT and iSTFT to enable TorchScript compilation and restricts reliable inverse transforms to uniform frequency bins.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper updates nnAudio to work reliably in current PyTorch environments by addressing compilation failures and unclear inverse behavior. It achieves this through removal of dynamic state mutation and module construction from the scripted paths of the short-time Fourier transform functions, along with tighter argument handling. The authors also limit reliable inversion to the case where frequency scaling is set to uniform bins and add explicit errors for other settings to avoid degraded output. Additional updates restore compatibility with modern library versions and ensure expected behavior in related transforms. These steps support consistent use of audio feature extraction in scripted deep learning models.

Core claim

By removing dynamic state mutation and module construction from scripted code paths and tightening argument handling in inverse-related helpers, nnAudio 2 resolves TorchScript compilation failures in STFT and iSTFT. Reliable inversion is restricted to the uniform-bin setting (freq_scale='no'), and explicit runtime errors are raised for unsupported frequency scales to prevent silently degraded reconstructions. The updates also restore CFP compatibility with modern SciPy and ensure VQT reduces to CQT when gamma equals zero, with regression tests confirming the STFT and iSTFT behaviors.

What carries the argument

Removal of dynamic state mutation and module construction from STFT and iSTFT scripted paths, plus explicit runtime checks that restrict inverse-STFT to uniform-bin frequency scaling.
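To make the pattern concrete, here is a minimal sketch of a script-friendly transform module. It is not nnAudio's implementation: the class name `TinySTFT` and its parameters are hypothetical, and the only point is that everything the forward pass needs is built once in `__init__` and stored as buffers, so `forward` neither assigns to `self` nor constructs modules.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class TinySTFT(nn.Module):
    """Hypothetical conv1d-based STFT magnitude (illustration only)."""

    def __init__(self, n_fft: int = 64, hop_length: int = 16) -> None:
        super().__init__()
        self.hop_length = hop_length
        n = torch.arange(n_fft, dtype=torch.float32)
        k = torch.arange(n_fft // 2 + 1, dtype=torch.float32).unsqueeze(1)
        window = torch.hann_window(n_fft, periodic=True)
        phase = 2.0 * math.pi * k * n / n_fft
        # Kernels are fixed at construction time and stored as buffers.
        self.register_buffer("wcos", (torch.cos(phase) * window).unsqueeze(1))
        self.register_buffer("wsin", (torch.sin(phase) * window).unsqueeze(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, samples); no attribute of self is modified here.
        x = x.unsqueeze(1)
        real = F.conv1d(x, self.wcos, stride=self.hop_length)
        imag = F.conv1d(x, self.wsin, stride=self.hop_length)
        return torch.sqrt(real.pow(2) + imag.pow(2))


eager = TinySTFT()
scripted = torch.jit.script(eager)  # compiles because forward is static
x = torch.randn(2, 256)
max_diff = (eager(x) - scripted(x)).abs().max().item()
```

If the scripted and eager outputs agree to within floating-point tolerance, compilation has not changed the transform, which is the property the load-bearing premise depends on.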

If this is right

STFT and iSTFT modules compile successfully under TorchScript without dynamic code barriers.
Inverse-STFT produces reliable results only for freq_scale='no' and raises explicit errors otherwise.
CFP maintains compatibility with current SciPy versions.
VQT reduces correctly to CQT when gamma equals zero.
The full test suite passes in a modern Python environment.
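The VQT item in the list above can be illustrated with a little arithmetic. The sketch below assumes the common variable-Q formulation in which the bandwidth of bin k is alpha * f_k + gamma, with alpha = 2^(1/bins_per_octave) - 1. That formulation and the function names are assumptions made for illustration; nnAudio's own kernel code may be organised differently.

```python
import math


def vqt_lengths(sr, fmin, n_bins, bins_per_octave, gamma):
    """Fractional filter lengths (samples) for bandwidth alpha * f_k + gamma."""
    alpha = 2.0 ** (1.0 / bins_per_octave) - 1.0
    lengths = []
    for k in range(n_bins):
        f_k = fmin * 2.0 ** (k / bins_per_octave)
        lengths.append(sr / (alpha * f_k + gamma))
    return lengths


def cqt_lengths(sr, fmin, n_bins, bins_per_octave):
    """Constant-Q filter lengths: Q * sr / f_k with Q = 1 / alpha."""
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    return [q * sr / (fmin * 2.0 ** (k / bins_per_octave)) for k in range(n_bins)]


args = dict(sr=22050, fmin=32.70, n_bins=24, bins_per_octave=12)
cqt = cqt_lengths(**args)
vqt_zero = vqt_lengths(gamma=0.0, **args)
vqt_pos = vqt_lengths(gamma=20.0, **args)

# With gamma = 0 the two length profiles coincide.
same_at_zero = all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(cqt, vqt_zero))
# A positive gamma widens every band, so every filter gets shorter.
shorter_with_gamma = all(v < c for v, c in zip(vqt_pos, cqt))
```

A regression test for the reduction would make the same comparison on the actual transform outputs rather than on filter lengths.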

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The changes allow nnAudio to be used inside TorchScript-based production pipelines for audio models.
Similar dynamic code patterns may cause compilation issues in other PyTorch audio libraries.
The explicit error approach for unsupported inverse cases could apply to other transform implementations.
Regression coverage for these behaviors reduces the chance of undetected edge-case failures in downstream audio research.

Load-bearing premise

Removing dynamic state mutation and module construction from the scripted paths will fix the compilation failures while keeping the original transform behavior intact.

What would settle it

Running torch.jit.script on the updated STFT or iSTFT module and checking whether compilation succeeds without errors, or testing iSTFT reconstruction quality on a non-uniform frequency scale to verify that an error is raised rather than silent degradation.
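The second check could be written as a small regression test. The sketch below uses a hypothetical `inverse_guard` function in place of the library's inverse path. The scale names other than 'no' are assumed example values, not a confirmed list of nnAudio options.

```python
def inverse_guard(freq_scale: str) -> None:
    """Hypothetical guard: refuse inversion unless bins are uniform."""
    if freq_scale != "no":
        raise RuntimeError(
            f"inverse STFT requires freq_scale='no', got {freq_scale!r}"
        )


def raises_runtime_error(freq_scale: str) -> bool:
    """Return True if the guard rejects this frequency scale."""
    try:
        inverse_guard(freq_scale)
    except RuntimeError:
        return True
    return False


# Assumed example scales; only 'no' should pass the guard.
outcomes = {
    scale: raises_runtime_error(scale)
    for scale in ("no", "linear", "log", "log2")
}
```

The useful property is that the failure is loud. A caller who asks for an unsupported inversion gets an exception instead of a degraded waveform.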

Original abstract

nnAudio is an open-source audio feature extraction toolbox for deep learning, but its use in current environments is hindered by TorchScript incompatibilities, inverse-transform edge cases, and dependency drift. We present a targeted modernization for modern PyTorch and scientific Python. We resolve TorchScript compilation failures in STFT and iSTFT by removing dynamic state mutation and module construction from scripted code paths and tightening argument handling in inverse-related helpers. We clarify inverse-STFT behavior by restricting reliable inversion to the uniform-bin setting (freq_scale='no') and raising explicit runtime errors for unsupported frequency scales, preventing silently degraded reconstructions. We restore CFP compatibility with modern SciPy and ensure VQT reduces to CQT when gamma = 0. Regression tests cover the new STFT/iSTFT behaviors, and the updated codebase passes the full repository test suite in a modern Python environment. These improvements provide a more robust foundation for differentiable audio analysis in research and deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This is a maintenance changelog for nnAudio describing targeted code fixes for TorchScript and iSTFT, with no new methods or results.


The main thing here is that nnAudio 2 patches some real usability problems in the existing library. It removes dynamic state changes and module creation from the TorchScript paths for STFT and iSTFT, tightens argument handling, and adds explicit errors when iSTFT is asked to invert non-uniform frequency scales. It also updates CFP for current SciPy and makes VQT fall back to CQT at gamma=0. The test suite passes afterward.

What the paper does well is naming concrete pain points that affect people trying to script or deploy the transforms, then stating the exact changes made. That level of specificity is useful for users who hit the same errors.

The soft spots are the lack of any before-and-after code, timing numbers, or reconstruction error checks. The abstract just says the fixes work and tests pass; there is no evidence shown that the original functionality is preserved in all cases or that new edge cases were not created. The scope stays inside one toolbox, so nothing broader is demonstrated.

This is for people already using nnAudio who need to know the library now behaves differently on scripted paths and inverse transforms. A reader wanting new audio analysis techniques or generalizable insights will not get much. It does not rise to the level that needs referee time; the work is straightforward engineering updates rather than a research claim that requires external validation.

Referee Report

1 major / 2 minor

Summary. The manuscript describes nnAudio 2, an update to the nnAudio audio feature extraction toolbox. It resolves TorchScript compilation failures in STFT and iSTFT by removing dynamic state mutation and module construction from scripted code paths and tightening argument handling in inverse-related helpers. It clarifies inverse-STFT behavior by restricting reliable inversion to the uniform-bin setting (freq_scale='no') and raising explicit runtime errors for unsupported frequency scales. It restores CFP compatibility with modern SciPy and ensures VQT reduces to CQT when gamma=0. Regression tests cover the new STFT/iSTFT behaviors, and the updated codebase passes the full repository test suite in a modern Python environment.

Significance. If the described engineering changes achieve the stated outcomes, the work strengthens a practical library for differentiable audio analysis in deep learning by improving compatibility with current PyTorch and SciPy. The explicit error raising for non-uniform iSTFT cases is a useful safeguard against silent reconstruction degradation. The explicit statement that the full test suite passes after modifications provides direct evidence of maintained functionality, which is a strength for an engineering-focused update.

major comments (1)

[Abstract] The central claim that the listed code changes resolve TorchScript compilation failures and that tests pass is asserted without any implementation details, before-after code comparisons, or quantitative validation data confirming no side effects on valid use cases.

minor comments (2)

The manuscript would benefit from a short table or list explicitly contrasting the old and new behaviors for iSTFT under different freq_scale values.
Consider including a brief migration note or changelog section for existing nnAudio users.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment and recommendation of minor revision. We address the comment on the abstract below.


Referee: [Abstract] The central claim that the listed code changes resolve TorchScript compilation failures and that tests pass is asserted without any implementation details, before-after code comparisons, or quantitative validation data confirming no side effects on valid use cases.

Authors: The abstract is a concise summary of contributions. Implementation details of the TorchScript fixes (removal of dynamic state mutation and module construction from scripted paths, plus tightened argument handling) appear in the main text sections describing the STFT and iSTFT modifications. Before-and-after comparisons are documented via the repository commit history. Quantitative validation is supplied by the explicit statement that regression tests cover the new behaviors and the full test suite passes in a modern Python environment; this directly confirms maintained functionality on valid use cases with no side effects observed. To address the concern, we will revise the abstract to add a brief clause referencing the regression tests and test-suite passage.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; engineering changes with test validation


The manuscript describes targeted code modifications (removal of dynamic mutation and module construction from TorchScript paths, tightened argument handling, explicit errors for non-uniform freq_scale in iSTFT, SciPy compatibility restoration, and VQT-to-CQT reduction when gamma=0) plus confirmation that the updated test suite passes. No equations, predictions, fitted parameters, or derivation chains are present. The central claims are direct assertions about the effects of the edits and regression-test outcomes; they do not reduce to self-definition, fitted-input renaming, or self-citation load-bearing. This is a standard engineering modernization paper whose validity rests on observable code behavior and test passage rather than any circular logical step.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are involved; this is a software maintenance paper describing code changes without mathematical modeling or new theoretical constructs.

pith-pipeline@v0.9.1-grok · 5697 in / 1154 out tokens · 41466 ms · 2026-06-28T04:25:27.746972+00:00 · methodology


