pith. sign in

arxiv: 2606.02341 · v2 · pith:CDEFYFASnew · submitted 2026-06-01 · 💻 cs.SD · cs.LG

Parameter-efficient Dual-encoder Architecture with Differentiable Choquet Integral Fusion for Underwater Acoustic Classification

Pith reviewed 2026-06-28 12:34 UTC · model grok-4.3

classification 💻 cs.SD cs.LG
keywords underwater acousticsacoustic classificationdual-encoderChoquet integralparameter-efficient fine-tuningwaveform spectrogram fusionfuzzy measures
0
0 comments X

The pith

A dual-encoder architecture fused by differentiable Choquet integral improves underwater acoustic classification accuracy with fewer parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes a dual-encoder neural architecture that processes acoustic waveforms and spectrograms in parallel using pre-trained backbones adapted with parameter-efficient fine-tuning modules. A novel differentiable fuzzy aggregation based on the Choquet integral combines the branches to balance temporal and spectral representations while providing interpretability through learned fuzzy measures. The method yields higher classification accuracy on the DeepShip and ShipsEar datasets compared to single-encoder baselines. It restricts the trainable parameter space to mitigate overfitting risks on limited acoustic datasets and reduces computational costs of full model fine-tuning.

Core claim

The paper claims that simultaneously processing waveform and spectrogram representations in a dual-encoder setup, adapted via parameter-efficient fine-tuning and fused with a differentiable Choquet integral, produces classification improvements over independent single-encoder baselines on underwater acoustic datasets while limiting the trainable parameter count and revealing class-specific representation reliance through the learned fuzzy measures.

What carries the argument

Differentiable Choquet integral fusion mechanism that aggregates outputs from two parameter-efficient fine-tuned encoders processing waveform and spectrogram inputs.

Load-bearing premise

That pre-trained models from other domains adapt effectively to underwater acoustics through the parameter-efficient modules and that the fuzzy measures learned by the Choquet integral reliably indicate class-specific representation preferences.

What would settle it

Running the single-encoder baselines and the proposed dual-encoder on the same datasets with identical pre-trained backbones and showing no accuracy improvement, or that the fuzzy measures do not change consistently with introduced channel distortions.

Figures

Figures reproduced from arXiv: 2606.02341 by Alexandra Van Dine, Amirmohammad Mohammadi, Joshua Peeples.

Figure 1
Figure 1. Figure 1: The proposed dual-encoder architecture. The waveform and its corresponding spectrogram are processed by respective encoders. Parameter-efficient [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Saliency map demonstrating dynamic representation routing for a [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Macro-averaged ROC curves and corresponding low-FPR context [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Macro-averaged ROC curves and corresponding low-FPR context [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: Layer-wise Linear CKA similarity for the DeepShip dataset. The plot [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗
Figure 6
Figure 6. Figure 6: Layer-wise Linear CKA similarity for the ShipsEar dataset. The figure [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
read the original abstract

Underwater acoustic classification has a wide array of oceanic applications, but faces challenges due to an increasingly complex acoustic environment. Waveform and spectrogram representations have been primarily used as acoustic data features for classification tasks in this domain. Spectrograms model harmonic dependencies, but these reduced representations can filter out acoustic features relevant for discrimination. While phase information from the waveform allows full characterization of the signal, the original waveform can be noisy and complex, rendering this representation difficult for models to process directly. This paper proposes a dual-encoder neural architecture to simultaneously process acoustic waveforms and spectrograms, leveraging pre-trained backbones and parameter-efficient fine-tuning modules, enabling a domain adaptation. To combine these adapted branches, a novel differentiable fuzzy aggregation mechanism based on the Choquet integral is introduced to balance the temporal and spectral representations. This fusion strategy not only yields higher classification accuracy but also provides interpretability. Specifically, by analyzing the learned fuzzy measures, insights are revealed about class-specific shifts in the network's representation reliance. By dynamically shifting attention to the representation least corrupted by potential asymmetric channel distortions, the proposed gating mechanism mitigates the non-stationary challenges of the underwater environment. Evaluations on the DeepShip and ShipsEar datasets demonstrate that the proposed architecture achieves classification improvements over independent single-encoder baselines, while simultaneously restricting the trainable parameter space. This mitigates the risk of overfitting on limited acoustic datasets while alleviating the computational costs associated with fully fine-tuning foundation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a dual-encoder architecture for underwater acoustic classification that processes raw waveforms and spectrograms in parallel using pre-trained backbones adapted via parameter-efficient fine-tuning modules. These branches are fused with a novel differentiable Choquet integral that learns class-specific fuzzy measures, yielding both higher accuracy than single-encoder baselines and interpretability about representation reliance. Experiments on DeepShip and ShipsEar are reported to show accuracy gains while restricting the number of trainable parameters.

Significance. If the empirical claims hold after the requested controls, the work would demonstrate a practical route for domain adaptation of foundation models to data-scarce acoustic tasks and would supply an interpretable, non-linear fusion operator whose learned measures can be inspected per class. The parameter-efficiency aspect directly addresses overfitting risk on limited underwater datasets.

major comments (3)
  1. [§4] §4 (Experiments): No ablation is presented that replaces the pre-trained encoders with randomly initialized counterparts of identical architecture. Without this control, the reported gains over single-encoder baselines cannot be attributed to successful domain adaptation rather than simply to the added capacity or regularization of the dual-branch design.
  2. [§3.2 and §4.3] §3.2 (Choquet fusion) and §4.3 (Interpretability analysis): The claim that learned fuzzy measures reveal class-specific shifts in representation reliance is not externally validated against known acoustic channel effects (e.g., frequency-dependent absorption or multipath). The manuscript therefore provides no evidence that the per-class measure differences are trustworthy rather than artifacts of the optimization.
  3. [Table 2] Table 2 (main results): The comparison against single-encoder baselines does not include a domain-specific pre-training baseline or a fully fine-tuned (non-PEFT) dual-encoder control. Consequently it remains unclear whether the observed accuracy improvements and parameter savings are jointly attributable to the proposed fusion mechanism.
minor comments (2)
  1. [Abstract] Abstract: quantitative metrics, error bars, and the exact number of trainable parameters are omitted, making it impossible to assess the magnitude of the claimed improvements from the abstract alone.
  2. [§3.1] Notation in §3.1: the distinction between the two PEFT modules applied to waveform versus spectrogram branches is not made explicit in the equations, complicating reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the contributions and limitations of our work on the parameter-efficient dual-encoder architecture with differentiable Choquet integral fusion. We respond to each major comment below.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments): No ablation is presented that replaces the pre-trained encoders with randomly initialized counterparts of identical architecture. Without this control, the reported gains over single-encoder baselines cannot be attributed to successful domain adaptation rather than simply to the added capacity or regularization of the dual-branch design.

    Authors: We agree this control would strengthen attribution of gains specifically to domain adaptation. In the revised manuscript, we will add the requested ablation in §4, replacing pre-trained backbones with randomly initialized counterparts of identical architecture while retaining the dual-encoder structure and Choquet fusion. This will isolate the role of pre-training. revision: yes

  2. Referee: [§3.2 and §4.3] §3.2 (Choquet fusion) and §4.3 (Interpretability analysis): The claim that learned fuzzy measures reveal class-specific shifts in representation reliance is not externally validated against known acoustic channel effects (e.g., frequency-dependent absorption or multipath). The manuscript therefore provides no evidence that the per-class measure differences are trustworthy rather than artifacts of the optimization.

    Authors: The fuzzy measures offer model-internal interpretability of representation reliance per class, supported by accuracy gains and cross-dataset consistency. Direct external validation against specific acoustic phenomena would require new controlled experiments integrating physical channel models, which exceeds the current scope. We will revise §4.3 to explicitly note this as a limitation and direction for future work. revision: partial

  3. Referee: [Table 2] Table 2 (main results): The comparison against single-encoder baselines does not include a domain-specific pre-training baseline or a fully fine-tuned (non-PEFT) dual-encoder control. Consequently it remains unclear whether the observed accuracy improvements and parameter savings are jointly attributable to the proposed fusion mechanism.

    Authors: We will add a fully fine-tuned (non-PEFT) dual-encoder control to Table 2 to better demonstrate parameter-efficiency benefits of the PEFT approach. Domain-specific pre-training is not included as our focus is efficient adaptation of general pre-trained models; we will add a clarifying note on the absence of suitable large-scale unlabeled underwater datasets for such baselines. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

full rationale

The manuscript presents an empirical architecture combining dual encoders, PEFT modules, and a differentiable Choquet integral fusion layer. All performance claims rest on dataset evaluations rather than any derivation that reduces a result to its own fitted parameters or self-citations by construction. No equations are shown that define a quantity in terms of itself or rename a fitted input as a prediction; the fusion mechanism is introduced as an independent component whose interpretability is asserted from learned measures without tautological reduction to the single-encoder baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes suitability of external pre-trained backbones and differentiability of the Choquet integral without detailing supporting lemmas.

pith-pipeline@v0.9.1-grok · 5797 in / 1050 out tokens · 22329 ms · 2026-06-28T12:34:44.207576+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

46 extracted references · 36 canonical work pages

  1. [1]

    Detecting submerged objects using active acoustics and deep neural networks: A test case for pelagic fish,

    A. Testolin, D. Kipnis, and R. Diamant, “Detecting submerged objects using active acoustics and deep neural networks: A test case for pelagic fish,”IEEE Transactions on Mobile Computing, vol. 21, no. 8, pp. 2776– 2788, 2022, doi: 10.1109/TMC.2020.3044397

  2. [2]

    Multilabel classification of heterogeneous under- water soundscapes with bayesian deep learning,

    B. Beckleret al., “Multilabel classification of heterogeneous under- water soundscapes with bayesian deep learning,”IEEE Journal of Oceanic Engineering, vol. 47, no. 4, pp. 1143–1154, 2022, doi: 10.1109/JOE.2022.3177850

  3. [3]

    R. J. Urick,Principles of Underwater Sound, 3rd ed. McGraw-Hill, 1983

  4. [4]

    A survey of underwater acoustic target recognition methods based on machine learning,

    X. Luo, L. Chen, H. Zhou, and H. Cao, “A survey of underwater acoustic target recognition methods based on machine learning,”Journal of Marine Science and Engineering, vol. 11, no. 2, p. 384, 2023, doi: 10.3390/jmse11020384

  5. [5]

    DEMON feature extraction of acoustic vector signal based on 3/2-d spectrum,

    L. Sichun and Y . Desen, “DEMON feature extraction of acoustic vector signal based on 3/2-d spectrum,” in2007 2nd IEEE Conference on Industrial Electronics and Applications. IEEE, 2007, pp. 2239–2243, doi: 10.1109/ICIEA.2007.4318809

  6. [6]

    Novel DEMON spectra analysis tech- niques and empirical knowledge based reference criterion for acoustic signal classification,

    M. A. R. Hashmi and R. H. Raza, “Novel DEMON spectra analysis tech- niques and empirical knowledge based reference criterion for acoustic signal classification,”Journal of Electrical Engineering & Technology, vol. 18, no. 1, pp. 561–578, 2023, doi: 10.1007/s42835-022-01167-3

  7. [7]

    F. B. Jensen, W. A. Kuperman, M. B. Porter, and H. Schmidt,Compu- tational Ocean Acoustics, 2nd ed. New York, NY: Springer, 2011, doi: 10.1007/978-1-4419-8678-8

  8. [8]

    Deep Learning - whole book,

    Y . LeCun, Y . Bengio, and G. Hinton, “Deep learning,”Nature, vol. 521, no. 7553, pp. 436–444, 2015, doi: 10.1038/nature14539

  9. [9]

    Machine learning in acoustics,

    M. J. Bianco, P. Gerstoft, J. Traer, E. Ozanich, M. A. Roch, S. Gannot, and C.-A. Deledalle, “Machine learning in acoustics,”The Journal of the Acoustical Society of America, vol. 146, no. 5, pp. 3590–3631, 2019

  10. [10]

    Deep learning in underwater acoustics: A review,

    H. Niu, E. Reeves, and P. Gerstoft, “Deep learning in underwater acoustics: A review,”The Journal of the Acoustical Society of America, vol. 152, no. 1, pp. 751–793, 2022

  11. [11]

    Adversarial multi-task underwater acoustic target recognition: towards robustness against various influential factors,

    Y . Xie, J. Xu, J. Ren, and J. Li, “Adversarial multi-task underwater acoustic target recognition: towards robustness against various influential factors,”The Journal of the Acoustical Society of America, vol. 156, no. 1, pp. 299–312, 2024, doi: 10.1121/10.0026598

  12. [12]

    Deep learning for underwater acoustics: A review,

    M. van Kootenet al., “Deep learning for underwater acoustics: A review,”The Journal of the Acoustical Society of America, vol. 150, no. 3, pp. 1600–1615, 2021, doi: 10.1121/10.0006240

  13. [13]

    Deep convolution stack for waveform in underwater acoustic target recognition,

    Y . Songet al., “Deep convolution stack for waveform in underwater acoustic target recognition,”Scientific Reports, vol. 11, no. 1, p. 9614, 2021, doi: 10.1038/s41598-021-88799-z

  14. [14]

    Un- derwater radiated noise from modern commercial ships,

    M. F. McKenna, D. Ross, S. M. Wiggins, and J. A. Hildebrand, “Un- derwater radiated noise from modern commercial ships,”The Journal of the Acoustical Society of America, vol. 131, no. 1, pp. 92–103, 2012, doi: 10.1121/1.3664100

  15. [15]

    Self-supervised learning for underwater acoustic signal classification with mixup,

    Q. Xuet al., “Self-supervised learning for underwater acoustic signal classification with mixup,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 17, pp. 3458–3470, 2024, doi: 10.1109/JSTARS.2023.3325921

  16. [16]

    WaveNet: A generative model for raw audio,

    A. van den Oordet al., “WaveNet: A generative model for raw audio,”arXiv preprint arXiv:1609.03499, 2016. [Online]. Available: https://arxiv.org/abs/1609.03499

  17. [17]

    Deep learning for audio signal processing,

    H. Purwins, B. Li, T. Virtanen, J. Schl ¨uter, S.-Y . Chang, and T. Sainath, “Deep learning for audio signal processing,”IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 2, pp. 206–219, 2019, doi: 10.1109/JSTSP.2019.2908700

  18. [18]

    Underwater acoustic target recog- nition method based on feature fusion and residual cnn,

    Y . Yang, Q. Yao, and Y . Wang, “Underwater acoustic target recog- nition method based on feature fusion and residual cnn,”IEEE Sensors Journal, vol. 24, no. 22, pp. 37 342–37 357, 2024, doi: 10.1109/JSEN.2024.3464754

  19. [19]

    SSAST-Adapter: A parameter-efficient incremental learning algorithm for underwater acoustic target recognition,

    Q. Zhu, Q. Xu, B. Zhu, Z. Gao, L. Zeng, and K. Xu, “SSAST-Adapter: A parameter-efficient incremental learning algorithm for underwater acoustic target recognition,” inProc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5. [Online]. Available: https://ieeexplore.ieee.org/document/10887650/

  20. [20]

    LoRA: Low-rank adaptation of large language models,

    E. J. Huet al., “LoRA: Low-rank adaptation of large language models,” inProc. International Conference on Learning Representations (ICLR), 2022. [Online]. Available: https://openreview.net/forum?id= nZeVKeeFYf9

  21. [21]

    Histogram-based parameter-efficient tuning for passive and active sonar classification,

    A. Mohammadi, D. Carreiro, A. Van Dine, and J. Peeples, “Histogram-based parameter-efficient tuning for passive and active sonar classification,”arXiv preprint arXiv:2504.15214, 2025. [Online]. Available: https://arxiv.org/abs/2504.15214

  22. [22]

    Segmentation pseudolabel generation using the multiple instance learning choquet integral,

    C. H. McCurley and A. Zare, “Segmentation pseudolabel generation using the multiple instance learning choquet integral,”IEEE Trans- actions on Fuzzy Systems, vol. 32, no. 1, pp. 182–195, 2024, doi: 10.1109/TFUZZ.2023.3338049

  23. [23]

    The application of fuzzy integrals in multicriteria decision making,

    M. Grabisch, “The application of fuzzy integrals in multicriteria decision making,”European Journal of Operational Research, vol. 89, no. 3, pp. 445–456, 1996, doi: 10.1016/0377-2217(95)00176-X

  24. [24]

    Bootstrap Your Own Latent

    Y . Gong, Y .-A. Chung, and J. Glass, “AST: Audio spectro- gram transformer,” inProc. Interspeech, 2021, pp. 571–575, doi: 10.21437/Interspeech.2021-698

  25. [25]

    CNN architectures for large-scale audio clas- sification,

    S. Hersheyet al., “CNN architectures for large-scale audio clas- sification,” inProc. IEEE ICASSP, 2017, pp. 131–135, doi: 10.1109/ICASSP.2017.7952132

  26. [26]

    An auditory-based time-dilated convolution neural network for ship-radiated noise,

    G. Huet al., “An auditory-based time-dilated convolution neural network for ship-radiated noise,”The Journal of the Acoustical Society of Amer- ica, vol. 148, no. 3, pp. 1556–1566, 2020, doi: 10.1121/10.0001924

  27. [27]

    wav2vec 2.0: A framework for self-supervised learning of speech representations,

    A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12 449–12 460, 2020. [Online]. Available: https://proceedings.neurips. cc/paper/2020/hash/92d1e1eb1cd6f9fba3227870bb6d7f07-Abstract.html

  28. [28]

    ShipsEar: An underwater vessel noise database,

    D. Santos-Dom ´ınguezet al., “ShipsEar: An underwater vessel noise database,”Applied Acoustics, vol. 113, pp. 64–69, 2016, doi: 10.1016/j.apacoust.2016.06.008

  29. [29]

    DeepShip: An underwater acoustic benchmark dataset,

    M. Irfanet al., “DeepShip: An underwater acoustic benchmark dataset,” Expert Systems with Applications, vol. 183, p. 115270, 2021, doi: 10.1016/j.eswa.2021.115270

  30. [30]

    Underwater acoustic target recognition based on a joint neural network,

    X. Hanet al., “Underwater acoustic target recognition based on a joint neural network,”PLoS One, vol. 17, no. 2, p. e0264445, 2022, doi: 10.1371/journal.pone.0264445

  31. [31]

    Robust underwater target recognition with uncertainty- aware fusion,

    Y . Sunet al., “Robust underwater target recognition with uncertainty- aware fusion,”Information Fusion, vol. 98, p. 101846, 2023, doi: 10.1016/j.inffus.2023.101846

  32. [32]

    Attentional multidomain fea- ture fusion for underwater acoustic signal classification,

    Y . Miao, Y . Li, and Y . Zakharov, “Attentional multidomain fea- ture fusion for underwater acoustic signal classification,”IEEE Jour- nal of Oceanic Engineering, vol. 51, no. 1, pp. 11–22, 2025, doi: 10.1109/JOE.2025.3619304

  33. [33]

    Multilabel recognition method for ship-radiated noise signals based on multidomain information fusion with deep equilibrium models,

    Y . Duan, X. Shen, H. Wang, and Y . Yan, “Multilabel recognition method for ship-radiated noise signals based on multidomain information fusion with deep equilibrium models,”IEEE Journal of Oceanic Engineering, 2025, doi: 10.1109/JOE.2025.3545239

  34. [34]

    Multiple instance choquet integral classifier fusion and regression for remote sensing applications,

    X. Du and A. Zare, “Multiple instance choquet integral classifier fusion and regression for remote sensing applications,”IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 5, pp. 2741–2753, 2018, doi: 10.1109/tgrs.2018.2876687

  35. [35]

    Fuzzy integral in multicriteria decision making,

    M. Grabisch, “Fuzzy integral in multicriteria decision making,”Fuzzy Sets and Systems, vol. 69, no. 3, pp. 279–298, 1995, doi: 10.1016/0165- 0114(94)00174-R

  36. [36]

    Audio Set: An ontology and human-labeled dataset for audio events

    J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human- labeled dataset for audio events,” in2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 776–780, doi: 10.1109/ICASSP.2017.7952261

  37. [37]

    ImageNet: A Large-Scale Hierarchical Image Database,

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in2009 IEEE conference on computer vision and pattern recognition. Ieee, 2009, pp. 248–255, doi: 10.1109/CVPR.2009.5206848

  38. [38]

    The Uniqueness Problem of Physical Law Learning

    M. Hagiwara, “Aves: Animal vocalization encoder based on self- supervision,” inICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5, doi: 10.1109/ICASSP49357.2023.10095166

  39. [39]

    FSD50K: an open dataset of human-labeled sound events,

    E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra, “FSD50K: an open dataset of human-labeled sound events,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 829–852, 2021, doi: 10.1109/TASLP.2021.3133208

  40. [40]

    Specaugment on large scale datasets

    H. Chen, W. Xie, A. Vedaldi, and A. Zisserman, “VGGSound: A large- scale audio-visual dataset,” inICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 721–725, doi: 10.1109/ICASSP40776.2020.9053169

  41. [41]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778, doi: 10.1109/CVPR.2016.90

  42. [42]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recog- nition

    S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. S. Kweon, and S. Xie, “Convnext v2: Co-designing and scaling convnets with masked autoen- coders,” inProceedings of the IEEE/CVF Conference on Computer 9 Vision and Pattern Recognition (CVPR), 2023, pp. 16 133–16 142, doi: 10.1109/CVPR52729.2023.01548

  43. [43]

    PANNs: Large-scale pretrained audio neural networks for audio pattern recognition,

    Q. Kong, Y . Cao, T. Iqbal, Y . Wang, W. Wang, and M. D. Plumbley, “PANNs: Large-scale pretrained audio neural networks for audio pattern recognition,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2880–2894, 2020, doi: 10.1109/TASLP.2020.3030497

  44. [44]

    SSAST: Self- supervised audio spectrogram transformer,

    Y . Gong, C.-I. J. Lai, Y .-A. Chung, and J. Glass, “SSAST: Self- supervised audio spectrogram transformer,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 10, 2022, pp. 10 699– 10 709, doi: 10.1609/aaai.v36i10.21315

  45. [45]

    Towards better understanding of gradient-based attribution methods for deep neural networks,

    M. Ancona, E. Ceolini, C. ¨Oztireli, and M. Gross, “Towards better understanding of gradient-based attribution methods for deep neural networks,” inInternational Conference on Learning Representations (ICLR), 2018. [Online]. Available: https://openreview.net/forum?id= Sy21R9JAW

  46. [46]

    Similarity of neural network representations revisited,

    S. Kornblith, M. Norouzi, H. Lee, and G. Hinton, “Similarity of neural network representations revisited,” inInternational Conference on Machine Learning. PMLR, 2019, pp. 3519–3529. [Online]. Available: http://proceedings.mlr.press/v97/kornblith19a/kornblith19a.pdf Amirmohammad Mohammadireceived his M.Sc. in Electrical Engineering from Sharif University o...