Parameter-efficient Dual-encoder Architecture with Differentiable Choquet Integral Fusion for Underwater Acoustic Classification

Alexandra Van Dine; Amirmohammad Mohammadi; Joshua Peeples

arxiv: 2606.02341 · v2 · pith:CDEFYFASnew · submitted 2026-06-01 · 💻 cs.SD · cs.LG

Parameter-efficient Dual-encoder Architecture with Differentiable Choquet Integral Fusion for Underwater Acoustic Classification

Amirmohammad Mohammadi , Joshua Peeples , Alexandra Van Dine This is my paper

Pith reviewed 2026-06-28 12:34 UTC · model grok-4.3

classification 💻 cs.SD cs.LG

keywords underwater acousticsacoustic classificationdual-encoderChoquet integralparameter-efficient fine-tuningwaveform spectrogram fusionfuzzy measures

0 comments

The pith

A dual-encoder architecture fused by differentiable Choquet integral improves underwater acoustic classification accuracy with fewer parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes a dual-encoder neural architecture that processes acoustic waveforms and spectrograms in parallel using pre-trained backbones adapted with parameter-efficient fine-tuning modules. A novel differentiable fuzzy aggregation based on the Choquet integral combines the branches to balance temporal and spectral representations while providing interpretability through learned fuzzy measures. The method yields higher classification accuracy on the DeepShip and ShipsEar datasets compared to single-encoder baselines. It restricts the trainable parameter space to mitigate overfitting risks on limited acoustic datasets and reduces computational costs of full model fine-tuning.

Core claim

The paper claims that simultaneously processing waveform and spectrogram representations in a dual-encoder setup, adapted via parameter-efficient fine-tuning and fused with a differentiable Choquet integral, produces classification improvements over independent single-encoder baselines on underwater acoustic datasets while limiting the trainable parameter count and revealing class-specific representation reliance through the learned fuzzy measures.

What carries the argument

Differentiable Choquet integral fusion mechanism that aggregates outputs from two parameter-efficient fine-tuned encoders processing waveform and spectrogram inputs.

Load-bearing premise

That pre-trained models from other domains adapt effectively to underwater acoustics through the parameter-efficient modules and that the fuzzy measures learned by the Choquet integral reliably indicate class-specific representation preferences.

What would settle it

Running the single-encoder baselines and the proposed dual-encoder on the same datasets with identical pre-trained backbones and showing no accuracy improvement, or that the fuzzy measures do not change consistently with introduced channel distortions.

Figures

Figures reproduced from arXiv: 2606.02341 by Alexandra Van Dine, Amirmohammad Mohammadi, Joshua Peeples.

**Figure 1.** Figure 1: The proposed dual-encoder architecture. The waveform and its corresponding spectrogram are processed by respective encoders. Parameter-efficient [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: Saliency map demonstrating dynamic representation routing for a [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 4.** Figure 4: Macro-averaged ROC curves and corresponding low-FPR context [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Macro-averaged ROC curves and corresponding low-FPR context [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 7.** Figure 7: Layer-wise Linear CKA similarity for the DeepShip dataset. The plot [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗

**Figure 6.** Figure 6: Layer-wise Linear CKA similarity for the ShipsEar dataset. The figure [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

read the original abstract

Underwater acoustic classification has a wide array of oceanic applications, but faces challenges due to an increasingly complex acoustic environment. Waveform and spectrogram representations have been primarily used as acoustic data features for classification tasks in this domain. Spectrograms model harmonic dependencies, but these reduced representations can filter out acoustic features relevant for discrimination. While phase information from the waveform allows full characterization of the signal, the original waveform can be noisy and complex, rendering this representation difficult for models to process directly. This paper proposes a dual-encoder neural architecture to simultaneously process acoustic waveforms and spectrograms, leveraging pre-trained backbones and parameter-efficient fine-tuning modules, enabling a domain adaptation. To combine these adapted branches, a novel differentiable fuzzy aggregation mechanism based on the Choquet integral is introduced to balance the temporal and spectral representations. This fusion strategy not only yields higher classification accuracy but also provides interpretability. Specifically, by analyzing the learned fuzzy measures, insights are revealed about class-specific shifts in the network's representation reliance. By dynamically shifting attention to the representation least corrupted by potential asymmetric channel distortions, the proposed gating mechanism mitigates the non-stationary challenges of the underwater environment. Evaluations on the DeepShip and ShipsEar datasets demonstrate that the proposed architecture achieves classification improvements over independent single-encoder baselines, while simultaneously restricting the trainable parameter space. This mitigates the risk of overfitting on limited acoustic datasets while alleviating the computational costs associated with fully fine-tuning foundation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Dual-encoder plus differentiable Choquet fusion for underwater acoustics is a reasonable engineering step but the abstract supplies no numbers or controls to judge whether the claimed gains are real.

read the letter

The paper puts forward a dual-encoder setup that runs a waveform branch and a spectrogram branch in parallel, each starting from a pre-trained backbone with parameter-efficient adapters, then fuses them with a differentiable Choquet integral whose learned fuzzy measures are meant to give class-specific insight into which representation the model trusts more. That fusion step and the explicit goal of keeping the trainable parameter count low on small acoustic datasets are the concrete additions relative to standard single-representation pipelines.

The approach makes sense for the domain: underwater signals are non-stationary and one representation can be corrupted while the other is not, so letting the model shift attention via the Choquet weights is a plausible mechanism. Using PEFT on external backbones is also a practical way to avoid overfitting when labeled ship recordings are scarce.

The soft spots are exactly where the stress-test note flags them. The abstract states that the model beats single-encoder baselines on DeepShip and ShipsEar while using fewer parameters, yet it gives no accuracy figures, no standard deviations, no list of baselines, and no ablation that replaces the pre-trained encoders with random ones. Without those controls it is impossible to tell whether the reported edge comes from genuine domain adaptation or simply from extra capacity and regularization. The interpretability claim is likewise unsupported so far; there is no check against known acoustic distortions or expert labels to show that the per-class fuzzy-measure shifts are meaningful rather than artifacts.

This work is aimed at practitioners who already work on underwater acoustic classification and need parameter-efficient models that can run on modest hardware. A reader already familiar with Choquet integrals or fuzzy aggregation will find the technical description straightforward, but the paper will only be useful if the full manuscript supplies the missing quantitative results and ablations.

I would send it to peer review so that referees can check whether the experiments actually support the adaptation and interpretability claims; on the current evidence the central argument is plausible but not yet demonstrated.

Referee Report

3 major / 2 minor

Summary. The paper proposes a dual-encoder architecture for underwater acoustic classification that processes raw waveforms and spectrograms in parallel using pre-trained backbones adapted via parameter-efficient fine-tuning modules. These branches are fused with a novel differentiable Choquet integral that learns class-specific fuzzy measures, yielding both higher accuracy than single-encoder baselines and interpretability about representation reliance. Experiments on DeepShip and ShipsEar are reported to show accuracy gains while restricting the number of trainable parameters.

Significance. If the empirical claims hold after the requested controls, the work would demonstrate a practical route for domain adaptation of foundation models to data-scarce acoustic tasks and would supply an interpretable, non-linear fusion operator whose learned measures can be inspected per class. The parameter-efficiency aspect directly addresses overfitting risk on limited underwater datasets.

major comments (3)

[§4] §4 (Experiments): No ablation is presented that replaces the pre-trained encoders with randomly initialized counterparts of identical architecture. Without this control, the reported gains over single-encoder baselines cannot be attributed to successful domain adaptation rather than simply to the added capacity or regularization of the dual-branch design.
[§3.2 and §4.3] §3.2 (Choquet fusion) and §4.3 (Interpretability analysis): The claim that learned fuzzy measures reveal class-specific shifts in representation reliance is not externally validated against known acoustic channel effects (e.g., frequency-dependent absorption or multipath). The manuscript therefore provides no evidence that the per-class measure differences are trustworthy rather than artifacts of the optimization.
[Table 2] Table 2 (main results): The comparison against single-encoder baselines does not include a domain-specific pre-training baseline or a fully fine-tuned (non-PEFT) dual-encoder control. Consequently it remains unclear whether the observed accuracy improvements and parameter savings are jointly attributable to the proposed fusion mechanism.

minor comments (2)

[Abstract] Abstract: quantitative metrics, error bars, and the exact number of trainable parameters are omitted, making it impossible to assess the magnitude of the claimed improvements from the abstract alone.
[§3.1] Notation in §3.1: the distinction between the two PEFT modules applied to waveform versus spectrogram branches is not made explicit in the equations, complicating reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the contributions and limitations of our work on the parameter-efficient dual-encoder architecture with differentiable Choquet integral fusion. We respond to each major comment below.

read point-by-point responses

Referee: [§4] §4 (Experiments): No ablation is presented that replaces the pre-trained encoders with randomly initialized counterparts of identical architecture. Without this control, the reported gains over single-encoder baselines cannot be attributed to successful domain adaptation rather than simply to the added capacity or regularization of the dual-branch design.

Authors: We agree this control would strengthen attribution of gains specifically to domain adaptation. In the revised manuscript, we will add the requested ablation in §4, replacing pre-trained backbones with randomly initialized counterparts of identical architecture while retaining the dual-encoder structure and Choquet fusion. This will isolate the role of pre-training. revision: yes
Referee: [§3.2 and §4.3] §3.2 (Choquet fusion) and §4.3 (Interpretability analysis): The claim that learned fuzzy measures reveal class-specific shifts in representation reliance is not externally validated against known acoustic channel effects (e.g., frequency-dependent absorption or multipath). The manuscript therefore provides no evidence that the per-class measure differences are trustworthy rather than artifacts of the optimization.

Authors: The fuzzy measures offer model-internal interpretability of representation reliance per class, supported by accuracy gains and cross-dataset consistency. Direct external validation against specific acoustic phenomena would require new controlled experiments integrating physical channel models, which exceeds the current scope. We will revise §4.3 to explicitly note this as a limitation and direction for future work. revision: partial
Referee: [Table 2] Table 2 (main results): The comparison against single-encoder baselines does not include a domain-specific pre-training baseline or a fully fine-tuned (non-PEFT) dual-encoder control. Consequently it remains unclear whether the observed accuracy improvements and parameter savings are jointly attributable to the proposed fusion mechanism.

Authors: We will add a fully fine-tuned (non-PEFT) dual-encoder control to Table 2 to better demonstrate parameter-efficiency benefits of the PEFT approach. Domain-specific pre-training is not included as our focus is efficient adaptation of general pre-trained models; we will add a clarifying note on the absence of suitable large-scale unlabeled underwater datasets for such baselines. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

full rationale

The manuscript presents an empirical architecture combining dual encoders, PEFT modules, and a differentiable Choquet integral fusion layer. All performance claims rest on dataset evaluations rather than any derivation that reduces a result to its own fitted parameters or self-citations by construction. No equations are shown that define a quantity in terms of itself or rename a fitted input as a prediction; the fusion mechanism is introduced as an independent component whose interpretability is asserted from learned measures without tautological reduction to the single-encoder baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes suitability of external pre-trained backbones and differentiability of the Choquet integral without detailing supporting lemmas.

pith-pipeline@v0.9.1-grok · 5797 in / 1050 out tokens · 22329 ms · 2026-06-28T12:34:44.207576+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

46 extracted references · 36 canonical work pages

[1]

Detecting submerged objects using active acoustics and deep neural networks: A test case for pelagic fish,

A. Testolin, D. Kipnis, and R. Diamant, “Detecting submerged objects using active acoustics and deep neural networks: A test case for pelagic fish,”IEEE Transactions on Mobile Computing, vol. 21, no. 8, pp. 2776– 2788, 2022, doi: 10.1109/TMC.2020.3044397

work page doi:10.1109/tmc.2020.3044397 2022
[2]

Multilabel classification of heterogeneous under- water soundscapes with bayesian deep learning,

B. Beckleret al., “Multilabel classification of heterogeneous under- water soundscapes with bayesian deep learning,”IEEE Journal of Oceanic Engineering, vol. 47, no. 4, pp. 1143–1154, 2022, doi: 10.1109/JOE.2022.3177850

work page doi:10.1109/joe.2022.3177850 2022
[3]

R. J. Urick,Principles of Underwater Sound, 3rd ed. McGraw-Hill, 1983

1983
[4]

A survey of underwater acoustic target recognition methods based on machine learning,

X. Luo, L. Chen, H. Zhou, and H. Cao, “A survey of underwater acoustic target recognition methods based on machine learning,”Journal of Marine Science and Engineering, vol. 11, no. 2, p. 384, 2023, doi: 10.3390/jmse11020384

work page doi:10.3390/jmse11020384 2023
[5]

DEMON feature extraction of acoustic vector signal based on 3/2-d spectrum,

L. Sichun and Y . Desen, “DEMON feature extraction of acoustic vector signal based on 3/2-d spectrum,” in2007 2nd IEEE Conference on Industrial Electronics and Applications. IEEE, 2007, pp. 2239–2243, doi: 10.1109/ICIEA.2007.4318809

work page doi:10.1109/iciea.2007.4318809 2007
[6]

Novel DEMON spectra analysis tech- niques and empirical knowledge based reference criterion for acoustic signal classification,

M. A. R. Hashmi and R. H. Raza, “Novel DEMON spectra analysis tech- niques and empirical knowledge based reference criterion for acoustic signal classification,”Journal of Electrical Engineering & Technology, vol. 18, no. 1, pp. 561–578, 2023, doi: 10.1007/s42835-022-01167-3

work page doi:10.1007/s42835-022-01167-3 2023
[7]

F. B. Jensen, W. A. Kuperman, M. B. Porter, and H. Schmidt,Compu- tational Ocean Acoustics, 2nd ed. New York, NY: Springer, 2011, doi: 10.1007/978-1-4419-8678-8

work page doi:10.1007/978-1-4419-8678-8 2011
[8]

Deep Learning - whole book,

Y . LeCun, Y . Bengio, and G. Hinton, “Deep learning,”Nature, vol. 521, no. 7553, pp. 436–444, 2015, doi: 10.1038/nature14539

work page doi:10.1038/nature14539 2015
[9]

Machine learning in acoustics,

M. J. Bianco, P. Gerstoft, J. Traer, E. Ozanich, M. A. Roch, S. Gannot, and C.-A. Deledalle, “Machine learning in acoustics,”The Journal of the Acoustical Society of America, vol. 146, no. 5, pp. 3590–3631, 2019

2019
[10]

Deep learning in underwater acoustics: A review,

H. Niu, E. Reeves, and P. Gerstoft, “Deep learning in underwater acoustics: A review,”The Journal of the Acoustical Society of America, vol. 152, no. 1, pp. 751–793, 2022

2022
[11]

Adversarial multi-task underwater acoustic target recognition: towards robustness against various influential factors,

Y . Xie, J. Xu, J. Ren, and J. Li, “Adversarial multi-task underwater acoustic target recognition: towards robustness against various influential factors,”The Journal of the Acoustical Society of America, vol. 156, no. 1, pp. 299–312, 2024, doi: 10.1121/10.0026598

work page doi:10.1121/10.0026598 2024
[12]

Deep learning for underwater acoustics: A review,

M. van Kootenet al., “Deep learning for underwater acoustics: A review,”The Journal of the Acoustical Society of America, vol. 150, no. 3, pp. 1600–1615, 2021, doi: 10.1121/10.0006240

work page doi:10.1121/10.0006240 2021
[13]

Deep convolution stack for waveform in underwater acoustic target recognition,

Y . Songet al., “Deep convolution stack for waveform in underwater acoustic target recognition,”Scientific Reports, vol. 11, no. 1, p. 9614, 2021, doi: 10.1038/s41598-021-88799-z

work page doi:10.1038/s41598-021-88799-z 2021
[14]

Un- derwater radiated noise from modern commercial ships,

M. F. McKenna, D. Ross, S. M. Wiggins, and J. A. Hildebrand, “Un- derwater radiated noise from modern commercial ships,”The Journal of the Acoustical Society of America, vol. 131, no. 1, pp. 92–103, 2012, doi: 10.1121/1.3664100

work page doi:10.1121/1.3664100 2012
[15]

Self-supervised learning for underwater acoustic signal classification with mixup,

Q. Xuet al., “Self-supervised learning for underwater acoustic signal classification with mixup,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 17, pp. 3458–3470, 2024, doi: 10.1109/JSTARS.2023.3325921

work page doi:10.1109/jstars.2023.3325921 2024
[16]

WaveNet: A generative model for raw audio,

A. van den Oordet al., “WaveNet: A generative model for raw audio,”arXiv preprint arXiv:1609.03499, 2016. [Online]. Available: https://arxiv.org/abs/1609.03499

Pith/arXiv arXiv 2016
[17]

Deep learning for audio signal processing,

H. Purwins, B. Li, T. Virtanen, J. Schl ¨uter, S.-Y . Chang, and T. Sainath, “Deep learning for audio signal processing,”IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 2, pp. 206–219, 2019, doi: 10.1109/JSTSP.2019.2908700

work page doi:10.1109/jstsp.2019.2908700 2019
[18]

Underwater acoustic target recog- nition method based on feature fusion and residual cnn,

Y . Yang, Q. Yao, and Y . Wang, “Underwater acoustic target recog- nition method based on feature fusion and residual cnn,”IEEE Sensors Journal, vol. 24, no. 22, pp. 37 342–37 357, 2024, doi: 10.1109/JSEN.2024.3464754

work page doi:10.1109/jsen.2024.3464754 2024
[19]

SSAST-Adapter: A parameter-efficient incremental learning algorithm for underwater acoustic target recognition,

Q. Zhu, Q. Xu, B. Zhu, Z. Gao, L. Zeng, and K. Xu, “SSAST-Adapter: A parameter-efficient incremental learning algorithm for underwater acoustic target recognition,” inProc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5. [Online]. Available: https://ieeexplore.ieee.org/document/10887650/

arXiv 2025
[20]

LoRA: Low-rank adaptation of large language models,

E. J. Huet al., “LoRA: Low-rank adaptation of large language models,” inProc. International Conference on Learning Representations (ICLR), 2022. [Online]. Available: https://openreview.net/forum?id= nZeVKeeFYf9

2022
[21]

Histogram-based parameter-efficient tuning for passive and active sonar classification,

A. Mohammadi, D. Carreiro, A. Van Dine, and J. Peeples, “Histogram-based parameter-efficient tuning for passive and active sonar classification,”arXiv preprint arXiv:2504.15214, 2025. [Online]. Available: https://arxiv.org/abs/2504.15214

Pith/arXiv arXiv 2025
[22]

Segmentation pseudolabel generation using the multiple instance learning choquet integral,

C. H. McCurley and A. Zare, “Segmentation pseudolabel generation using the multiple instance learning choquet integral,”IEEE Trans- actions on Fuzzy Systems, vol. 32, no. 1, pp. 182–195, 2024, doi: 10.1109/TFUZZ.2023.3338049

work page doi:10.1109/tfuzz.2023.3338049 2024
[23]

The application of fuzzy integrals in multicriteria decision making,

M. Grabisch, “The application of fuzzy integrals in multicriteria decision making,”European Journal of Operational Research, vol. 89, no. 3, pp. 445–456, 1996, doi: 10.1016/0377-2217(95)00176-X

work page doi:10.1016/0377-2217(95)00176-x 1996
[24]

Bootstrap Your Own Latent

Y . Gong, Y .-A. Chung, and J. Glass, “AST: Audio spectro- gram transformer,” inProc. Interspeech, 2021, pp. 571–575, doi: 10.21437/Interspeech.2021-698

work page doi:10.21437/interspeech.2021-698 2021
[25]

CNN architectures for large-scale audio clas- sification,

S. Hersheyet al., “CNN architectures for large-scale audio clas- sification,” inProc. IEEE ICASSP, 2017, pp. 131–135, doi: 10.1109/ICASSP.2017.7952132

work page doi:10.1109/icassp.2017.7952132 2017
[26]

An auditory-based time-dilated convolution neural network for ship-radiated noise,

G. Huet al., “An auditory-based time-dilated convolution neural network for ship-radiated noise,”The Journal of the Acoustical Society of Amer- ica, vol. 148, no. 3, pp. 1556–1566, 2020, doi: 10.1121/10.0001924

work page doi:10.1121/10.0001924 2020
[27]

wav2vec 2.0: A framework for self-supervised learning of speech representations,

A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12 449–12 460, 2020. [Online]. Available: https://proceedings.neurips. cc/paper/2020/hash/92d1e1eb1cd6f9fba3227870bb6d7f07-Abstract.html

2020
[28]

ShipsEar: An underwater vessel noise database,

D. Santos-Dom ´ınguezet al., “ShipsEar: An underwater vessel noise database,”Applied Acoustics, vol. 113, pp. 64–69, 2016, doi: 10.1016/j.apacoust.2016.06.008

work page doi:10.1016/j.apacoust.2016.06.008 2016
[29]

DeepShip: An underwater acoustic benchmark dataset,

M. Irfanet al., “DeepShip: An underwater acoustic benchmark dataset,” Expert Systems with Applications, vol. 183, p. 115270, 2021, doi: 10.1016/j.eswa.2021.115270

work page doi:10.1016/j.eswa.2021.115270 2021
[30]

Underwater acoustic target recognition based on a joint neural network,

X. Hanet al., “Underwater acoustic target recognition based on a joint neural network,”PLoS One, vol. 17, no. 2, p. e0264445, 2022, doi: 10.1371/journal.pone.0264445

work page doi:10.1371/journal.pone.0264445 2022
[31]

Robust underwater target recognition with uncertainty- aware fusion,

Y . Sunet al., “Robust underwater target recognition with uncertainty- aware fusion,”Information Fusion, vol. 98, p. 101846, 2023, doi: 10.1016/j.inffus.2023.101846

work page doi:10.1016/j.inffus.2023.101846 2023
[32]

Attentional multidomain fea- ture fusion for underwater acoustic signal classification,

Y . Miao, Y . Li, and Y . Zakharov, “Attentional multidomain fea- ture fusion for underwater acoustic signal classification,”IEEE Jour- nal of Oceanic Engineering, vol. 51, no. 1, pp. 11–22, 2025, doi: 10.1109/JOE.2025.3619304

work page doi:10.1109/joe.2025.3619304 2025
[33]

Multilabel recognition method for ship-radiated noise signals based on multidomain information fusion with deep equilibrium models,

Y . Duan, X. Shen, H. Wang, and Y . Yan, “Multilabel recognition method for ship-radiated noise signals based on multidomain information fusion with deep equilibrium models,”IEEE Journal of Oceanic Engineering, 2025, doi: 10.1109/JOE.2025.3545239

work page doi:10.1109/joe.2025.3545239 2025
[34]

Multiple instance choquet integral classifier fusion and regression for remote sensing applications,

X. Du and A. Zare, “Multiple instance choquet integral classifier fusion and regression for remote sensing applications,”IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 5, pp. 2741–2753, 2018, doi: 10.1109/tgrs.2018.2876687

work page doi:10.1109/tgrs.2018.2876687 2018
[35]

Fuzzy integral in multicriteria decision making,

M. Grabisch, “Fuzzy integral in multicriteria decision making,”Fuzzy Sets and Systems, vol. 69, no. 3, pp. 279–298, 1995, doi: 10.1016/0165- 0114(94)00174-R

work page doi:10.1016/0165- 1995
[36]

Audio Set: An ontology and human-labeled dataset for audio events

J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human- labeled dataset for audio events,” in2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 776–780, doi: 10.1109/ICASSP.2017.7952261

work page doi:10.1109/icassp.2017.7952261 2017
[37]

ImageNet: A Large-Scale Hierarchical Image Database,

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in2009 IEEE conference on computer vision and pattern recognition. Ieee, 2009, pp. 248–255, doi: 10.1109/CVPR.2009.5206848

work page doi:10.1109/cvpr.2009.5206848 2009
[38]

The Uniqueness Problem of Physical Law Learning

M. Hagiwara, “Aves: Animal vocalization encoder based on self- supervision,” inICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5, doi: 10.1109/ICASSP49357.2023.10095166

work page doi:10.1109/icassp49357.2023.10095166 2023
[39]

FSD50K: an open dataset of human-labeled sound events,

E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra, “FSD50K: an open dataset of human-labeled sound events,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 829–852, 2021, doi: 10.1109/TASLP.2021.3133208

work page doi:10.1109/taslp.2021.3133208 2021
[40]

Specaugment on large scale datasets

H. Chen, W. Xie, A. Vedaldi, and A. Zisserman, “VGGSound: A large- scale audio-visual dataset,” inICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 721–725, doi: 10.1109/ICASSP40776.2020.9053169

work page doi:10.1109/icassp40776.2020.9053169 2020
[41]

In: IEEE/CVF Conference on Computer Vision and Pattern Recognition

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778, doi: 10.1109/CVPR.2016.90

work page doi:10.1109/cvpr.2016.90 2016
[42]

In: IEEE/CVF Conference on Computer Vision and Pattern Recog- nition

S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. S. Kweon, and S. Xie, “Convnext v2: Co-designing and scaling convnets with masked autoen- coders,” inProceedings of the IEEE/CVF Conference on Computer 9 Vision and Pattern Recognition (CVPR), 2023, pp. 16 133–16 142, doi: 10.1109/CVPR52729.2023.01548

work page doi:10.1109/cvpr52729.2023.01548 2023
[43]

PANNs: Large-scale pretrained audio neural networks for audio pattern recognition,

Q. Kong, Y . Cao, T. Iqbal, Y . Wang, W. Wang, and M. D. Plumbley, “PANNs: Large-scale pretrained audio neural networks for audio pattern recognition,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2880–2894, 2020, doi: 10.1109/TASLP.2020.3030497

work page doi:10.1109/taslp.2020.3030497 2020
[44]

SSAST: Self- supervised audio spectrogram transformer,

Y . Gong, C.-I. J. Lai, Y .-A. Chung, and J. Glass, “SSAST: Self- supervised audio spectrogram transformer,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 10, 2022, pp. 10 699– 10 709, doi: 10.1609/aaai.v36i10.21315

work page doi:10.1609/aaai.v36i10.21315 2022
[45]

Towards better understanding of gradient-based attribution methods for deep neural networks,

M. Ancona, E. Ceolini, C. ¨Oztireli, and M. Gross, “Towards better understanding of gradient-based attribution methods for deep neural networks,” inInternational Conference on Learning Representations (ICLR), 2018. [Online]. Available: https://openreview.net/forum?id= Sy21R9JAW

2018
[46]

Similarity of neural network representations revisited,

S. Kornblith, M. Norouzi, H. Lee, and G. Hinton, “Similarity of neural network representations revisited,” inInternational Conference on Machine Learning. PMLR, 2019, pp. 3519–3529. [Online]. Available: http://proceedings.mlr.press/v97/kornblith19a/kornblith19a.pdf Amirmohammad Mohammadireceived his M.Sc. in Electrical Engineering from Sharif University o...

2019

[1] [1]

Detecting submerged objects using active acoustics and deep neural networks: A test case for pelagic fish,

A. Testolin, D. Kipnis, and R. Diamant, “Detecting submerged objects using active acoustics and deep neural networks: A test case for pelagic fish,”IEEE Transactions on Mobile Computing, vol. 21, no. 8, pp. 2776– 2788, 2022, doi: 10.1109/TMC.2020.3044397

work page doi:10.1109/tmc.2020.3044397 2022

[2] [2]

Multilabel classification of heterogeneous under- water soundscapes with bayesian deep learning,

B. Beckleret al., “Multilabel classification of heterogeneous under- water soundscapes with bayesian deep learning,”IEEE Journal of Oceanic Engineering, vol. 47, no. 4, pp. 1143–1154, 2022, doi: 10.1109/JOE.2022.3177850

work page doi:10.1109/joe.2022.3177850 2022

[3] [3]

R. J. Urick,Principles of Underwater Sound, 3rd ed. McGraw-Hill, 1983

1983

[4] [4]

A survey of underwater acoustic target recognition methods based on machine learning,

X. Luo, L. Chen, H. Zhou, and H. Cao, “A survey of underwater acoustic target recognition methods based on machine learning,”Journal of Marine Science and Engineering, vol. 11, no. 2, p. 384, 2023, doi: 10.3390/jmse11020384

work page doi:10.3390/jmse11020384 2023

[5] [5]

DEMON feature extraction of acoustic vector signal based on 3/2-d spectrum,

L. Sichun and Y . Desen, “DEMON feature extraction of acoustic vector signal based on 3/2-d spectrum,” in2007 2nd IEEE Conference on Industrial Electronics and Applications. IEEE, 2007, pp. 2239–2243, doi: 10.1109/ICIEA.2007.4318809

work page doi:10.1109/iciea.2007.4318809 2007

[6] [6]

Novel DEMON spectra analysis tech- niques and empirical knowledge based reference criterion for acoustic signal classification,

M. A. R. Hashmi and R. H. Raza, “Novel DEMON spectra analysis tech- niques and empirical knowledge based reference criterion for acoustic signal classification,”Journal of Electrical Engineering & Technology, vol. 18, no. 1, pp. 561–578, 2023, doi: 10.1007/s42835-022-01167-3

work page doi:10.1007/s42835-022-01167-3 2023

[7] [7]

F. B. Jensen, W. A. Kuperman, M. B. Porter, and H. Schmidt,Compu- tational Ocean Acoustics, 2nd ed. New York, NY: Springer, 2011, doi: 10.1007/978-1-4419-8678-8

work page doi:10.1007/978-1-4419-8678-8 2011

[8] [8]

Deep Learning - whole book,

Y . LeCun, Y . Bengio, and G. Hinton, “Deep learning,”Nature, vol. 521, no. 7553, pp. 436–444, 2015, doi: 10.1038/nature14539

work page doi:10.1038/nature14539 2015

[9] [9]

Machine learning in acoustics,

M. J. Bianco, P. Gerstoft, J. Traer, E. Ozanich, M. A. Roch, S. Gannot, and C.-A. Deledalle, “Machine learning in acoustics,”The Journal of the Acoustical Society of America, vol. 146, no. 5, pp. 3590–3631, 2019

2019

[10] [10]

Deep learning in underwater acoustics: A review,

H. Niu, E. Reeves, and P. Gerstoft, “Deep learning in underwater acoustics: A review,”The Journal of the Acoustical Society of America, vol. 152, no. 1, pp. 751–793, 2022

2022

[11] [11]

Adversarial multi-task underwater acoustic target recognition: towards robustness against various influential factors,

Y . Xie, J. Xu, J. Ren, and J. Li, “Adversarial multi-task underwater acoustic target recognition: towards robustness against various influential factors,”The Journal of the Acoustical Society of America, vol. 156, no. 1, pp. 299–312, 2024, doi: 10.1121/10.0026598

work page doi:10.1121/10.0026598 2024

[12] [12]

Deep learning for underwater acoustics: A review,

M. van Kootenet al., “Deep learning for underwater acoustics: A review,”The Journal of the Acoustical Society of America, vol. 150, no. 3, pp. 1600–1615, 2021, doi: 10.1121/10.0006240

work page doi:10.1121/10.0006240 2021

[13] [13]

Deep convolution stack for waveform in underwater acoustic target recognition,

Y . Songet al., “Deep convolution stack for waveform in underwater acoustic target recognition,”Scientific Reports, vol. 11, no. 1, p. 9614, 2021, doi: 10.1038/s41598-021-88799-z

work page doi:10.1038/s41598-021-88799-z 2021

[14] [14]

Un- derwater radiated noise from modern commercial ships,

M. F. McKenna, D. Ross, S. M. Wiggins, and J. A. Hildebrand, “Un- derwater radiated noise from modern commercial ships,”The Journal of the Acoustical Society of America, vol. 131, no. 1, pp. 92–103, 2012, doi: 10.1121/1.3664100

work page doi:10.1121/1.3664100 2012

[15] [15]

Self-supervised learning for underwater acoustic signal classification with mixup,

Q. Xuet al., “Self-supervised learning for underwater acoustic signal classification with mixup,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 17, pp. 3458–3470, 2024, doi: 10.1109/JSTARS.2023.3325921

work page doi:10.1109/jstars.2023.3325921 2024

[16] [16]

WaveNet: A generative model for raw audio,

A. van den Oordet al., “WaveNet: A generative model for raw audio,”arXiv preprint arXiv:1609.03499, 2016. [Online]. Available: https://arxiv.org/abs/1609.03499

Pith/arXiv arXiv 2016

[17] [17]

Deep learning for audio signal processing,

H. Purwins, B. Li, T. Virtanen, J. Schl ¨uter, S.-Y . Chang, and T. Sainath, “Deep learning for audio signal processing,”IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 2, pp. 206–219, 2019, doi: 10.1109/JSTSP.2019.2908700

work page doi:10.1109/jstsp.2019.2908700 2019

[18] [18]

Underwater acoustic target recog- nition method based on feature fusion and residual cnn,

Y . Yang, Q. Yao, and Y . Wang, “Underwater acoustic target recog- nition method based on feature fusion and residual cnn,”IEEE Sensors Journal, vol. 24, no. 22, pp. 37 342–37 357, 2024, doi: 10.1109/JSEN.2024.3464754

work page doi:10.1109/jsen.2024.3464754 2024

[19] [19]

SSAST-Adapter: A parameter-efficient incremental learning algorithm for underwater acoustic target recognition,

Q. Zhu, Q. Xu, B. Zhu, Z. Gao, L. Zeng, and K. Xu, “SSAST-Adapter: A parameter-efficient incremental learning algorithm for underwater acoustic target recognition,” inProc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5. [Online]. Available: https://ieeexplore.ieee.org/document/10887650/

arXiv 2025

[20] [20]

LoRA: Low-rank adaptation of large language models,

E. J. Huet al., “LoRA: Low-rank adaptation of large language models,” inProc. International Conference on Learning Representations (ICLR), 2022. [Online]. Available: https://openreview.net/forum?id= nZeVKeeFYf9

2022

[21] [21]

Histogram-based parameter-efficient tuning for passive and active sonar classification,

A. Mohammadi, D. Carreiro, A. Van Dine, and J. Peeples, “Histogram-based parameter-efficient tuning for passive and active sonar classification,”arXiv preprint arXiv:2504.15214, 2025. [Online]. Available: https://arxiv.org/abs/2504.15214

Pith/arXiv arXiv 2025

[22] [22]

Segmentation pseudolabel generation using the multiple instance learning choquet integral,

C. H. McCurley and A. Zare, “Segmentation pseudolabel generation using the multiple instance learning choquet integral,”IEEE Trans- actions on Fuzzy Systems, vol. 32, no. 1, pp. 182–195, 2024, doi: 10.1109/TFUZZ.2023.3338049

work page doi:10.1109/tfuzz.2023.3338049 2024

[23] [23]

The application of fuzzy integrals in multicriteria decision making,

M. Grabisch, “The application of fuzzy integrals in multicriteria decision making,”European Journal of Operational Research, vol. 89, no. 3, pp. 445–456, 1996, doi: 10.1016/0377-2217(95)00176-X

work page doi:10.1016/0377-2217(95)00176-x 1996

[24] [24]

Bootstrap Your Own Latent

Y . Gong, Y .-A. Chung, and J. Glass, “AST: Audio spectro- gram transformer,” inProc. Interspeech, 2021, pp. 571–575, doi: 10.21437/Interspeech.2021-698

work page doi:10.21437/interspeech.2021-698 2021

[25] [25]

CNN architectures for large-scale audio clas- sification,

S. Hersheyet al., “CNN architectures for large-scale audio clas- sification,” inProc. IEEE ICASSP, 2017, pp. 131–135, doi: 10.1109/ICASSP.2017.7952132

work page doi:10.1109/icassp.2017.7952132 2017

[26] [26]

An auditory-based time-dilated convolution neural network for ship-radiated noise,

G. Huet al., “An auditory-based time-dilated convolution neural network for ship-radiated noise,”The Journal of the Acoustical Society of Amer- ica, vol. 148, no. 3, pp. 1556–1566, 2020, doi: 10.1121/10.0001924

work page doi:10.1121/10.0001924 2020

[27] [27]

wav2vec 2.0: A framework for self-supervised learning of speech representations,

A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12 449–12 460, 2020. [Online]. Available: https://proceedings.neurips. cc/paper/2020/hash/92d1e1eb1cd6f9fba3227870bb6d7f07-Abstract.html

2020

[28] [28]

ShipsEar: An underwater vessel noise database,

D. Santos-Dom ´ınguezet al., “ShipsEar: An underwater vessel noise database,”Applied Acoustics, vol. 113, pp. 64–69, 2016, doi: 10.1016/j.apacoust.2016.06.008

work page doi:10.1016/j.apacoust.2016.06.008 2016

[29] [29]

DeepShip: An underwater acoustic benchmark dataset,

M. Irfanet al., “DeepShip: An underwater acoustic benchmark dataset,” Expert Systems with Applications, vol. 183, p. 115270, 2021, doi: 10.1016/j.eswa.2021.115270

work page doi:10.1016/j.eswa.2021.115270 2021

[30] [30]

Underwater acoustic target recognition based on a joint neural network,

X. Hanet al., “Underwater acoustic target recognition based on a joint neural network,”PLoS One, vol. 17, no. 2, p. e0264445, 2022, doi: 10.1371/journal.pone.0264445

work page doi:10.1371/journal.pone.0264445 2022

[31] [31]

Robust underwater target recognition with uncertainty- aware fusion,

Y . Sunet al., “Robust underwater target recognition with uncertainty- aware fusion,”Information Fusion, vol. 98, p. 101846, 2023, doi: 10.1016/j.inffus.2023.101846

work page doi:10.1016/j.inffus.2023.101846 2023

[32] [32]

Attentional multidomain fea- ture fusion for underwater acoustic signal classification,

Y . Miao, Y . Li, and Y . Zakharov, “Attentional multidomain fea- ture fusion for underwater acoustic signal classification,”IEEE Jour- nal of Oceanic Engineering, vol. 51, no. 1, pp. 11–22, 2025, doi: 10.1109/JOE.2025.3619304

work page doi:10.1109/joe.2025.3619304 2025

[33] [33]

Multilabel recognition method for ship-radiated noise signals based on multidomain information fusion with deep equilibrium models,

Y . Duan, X. Shen, H. Wang, and Y . Yan, “Multilabel recognition method for ship-radiated noise signals based on multidomain information fusion with deep equilibrium models,”IEEE Journal of Oceanic Engineering, 2025, doi: 10.1109/JOE.2025.3545239

work page doi:10.1109/joe.2025.3545239 2025

[34] [34]

Multiple instance choquet integral classifier fusion and regression for remote sensing applications,

X. Du and A. Zare, “Multiple instance choquet integral classifier fusion and regression for remote sensing applications,”IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 5, pp. 2741–2753, 2018, doi: 10.1109/tgrs.2018.2876687

work page doi:10.1109/tgrs.2018.2876687 2018

[35] [35]

Fuzzy integral in multicriteria decision making,

M. Grabisch, “Fuzzy integral in multicriteria decision making,”Fuzzy Sets and Systems, vol. 69, no. 3, pp. 279–298, 1995, doi: 10.1016/0165- 0114(94)00174-R

work page doi:10.1016/0165- 1995

[36] [36]

Audio Set: An ontology and human-labeled dataset for audio events

J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human- labeled dataset for audio events,” in2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 776–780, doi: 10.1109/ICASSP.2017.7952261

work page doi:10.1109/icassp.2017.7952261 2017

[37] [37]

ImageNet: A Large-Scale Hierarchical Image Database,

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in2009 IEEE conference on computer vision and pattern recognition. Ieee, 2009, pp. 248–255, doi: 10.1109/CVPR.2009.5206848

work page doi:10.1109/cvpr.2009.5206848 2009

[38] [38]

The Uniqueness Problem of Physical Law Learning

M. Hagiwara, “Aves: Animal vocalization encoder based on self- supervision,” inICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5, doi: 10.1109/ICASSP49357.2023.10095166

work page doi:10.1109/icassp49357.2023.10095166 2023

[39] [39]

FSD50K: an open dataset of human-labeled sound events,

E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra, “FSD50K: an open dataset of human-labeled sound events,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 829–852, 2021, doi: 10.1109/TASLP.2021.3133208

work page doi:10.1109/taslp.2021.3133208 2021

[40] [40]

Specaugment on large scale datasets

H. Chen, W. Xie, A. Vedaldi, and A. Zisserman, “VGGSound: A large- scale audio-visual dataset,” inICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 721–725, doi: 10.1109/ICASSP40776.2020.9053169

work page doi:10.1109/icassp40776.2020.9053169 2020

[41] [41]

In: IEEE/CVF Conference on Computer Vision and Pattern Recognition

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778, doi: 10.1109/CVPR.2016.90

work page doi:10.1109/cvpr.2016.90 2016

[42] [42]

In: IEEE/CVF Conference on Computer Vision and Pattern Recog- nition

S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. S. Kweon, and S. Xie, “Convnext v2: Co-designing and scaling convnets with masked autoen- coders,” inProceedings of the IEEE/CVF Conference on Computer 9 Vision and Pattern Recognition (CVPR), 2023, pp. 16 133–16 142, doi: 10.1109/CVPR52729.2023.01548

work page doi:10.1109/cvpr52729.2023.01548 2023

[43] [43]

PANNs: Large-scale pretrained audio neural networks for audio pattern recognition,

Q. Kong, Y . Cao, T. Iqbal, Y . Wang, W. Wang, and M. D. Plumbley, “PANNs: Large-scale pretrained audio neural networks for audio pattern recognition,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2880–2894, 2020, doi: 10.1109/TASLP.2020.3030497

work page doi:10.1109/taslp.2020.3030497 2020

[44] [44]

SSAST: Self- supervised audio spectrogram transformer,

Y . Gong, C.-I. J. Lai, Y .-A. Chung, and J. Glass, “SSAST: Self- supervised audio spectrogram transformer,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 10, 2022, pp. 10 699– 10 709, doi: 10.1609/aaai.v36i10.21315

work page doi:10.1609/aaai.v36i10.21315 2022

[45] [45]

Towards better understanding of gradient-based attribution methods for deep neural networks,

M. Ancona, E. Ceolini, C. ¨Oztireli, and M. Gross, “Towards better understanding of gradient-based attribution methods for deep neural networks,” inInternational Conference on Learning Representations (ICLR), 2018. [Online]. Available: https://openreview.net/forum?id= Sy21R9JAW

2018

[46] [46]

Similarity of neural network representations revisited,

S. Kornblith, M. Norouzi, H. Lee, and G. Hinton, “Similarity of neural network representations revisited,” inInternational Conference on Machine Learning. PMLR, 2019, pp. 3519–3529. [Online]. Available: http://proceedings.mlr.press/v97/kornblith19a/kornblith19a.pdf Amirmohammad Mohammadireceived his M.Sc. in Electrical Engineering from Sharif University o...

2019