A Multi-task Learning Balanced Attention Convolutional Neural Network Model for Few-shot Underwater Acoustic Target Recognition

Hao Zhang; Junpeng Lu; Shumeng Sun; Wei Huang; Zhengyang Xiu; Zhenpeng Xu

arXiv: 2504.13102 · v1 · submitted 2025-04-17 · cs.SD · cs.AI · eess.AS

Pith reviewed 2026-05-22 18:56 UTC · model grok-4.3

classification cs.SD · cs.AI · eess.AS

keywords underwater acoustic target recognition · few-shot learning · multi-task learning · channel attention · convolutional neural network · marine bioacoustics · Watkins Marine Life Dataset · sonar signal processing

The pith

A multi-task balanced attention CNN reaches 97 percent accuracy on 27-class few-shot underwater sounds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that few-shot underwater acoustic target recognition becomes practical when a convolutional network is trained jointly on classification and feature reconstruction while a channel attention layer highlights useful patterns like harmonics and quiets background noise. The central idea is that sharing a feature extractor across these two tasks lets the model learn representations that remain stable even when training examples are scarce and ocean recordings contain heavy interference. Experiments on the Watkins Marine Life Dataset report that this MT-BCA-CNN reaches 97 percent accuracy and 95 percent F1-score across 27 classes, beating both plain CNNs and prior state-of-the-art UATR methods. Ablation results are presented to argue that the attention and multi-task components reinforce each other rather than merely adding independent gains. If the approach holds, it would give marine biologists and sonar operators a concrete way to identify ships or animals from very small sets of labeled recordings.

Core claim

The central claim is that a shared feature extractor inside a CNN, optimized simultaneously for target classification and signal reconstruction, combined with a channel attention mechanism that amplifies discriminative acoustic structures such as harmonics and suppresses noise, produces 97 percent classification accuracy and 95 percent F1-score in 27-class few-shot settings on the Watkins Marine Life Dataset and outperforms standard CNN, ACNN, and existing UATR baselines.

What carries the argument

A shared CNN feature extractor trained under multi-task learning with dynamic task weighting and a channel attention module that reweights feature maps to emphasize harmonic structures.
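
As a rough illustration of what "reweights feature maps" can mean, the sketch below implements a generic channel attention step in plain Python. It assumes a common two-branch layout: average-pooled and max-pooled channel descriptors pass through a shared two-layer MLP, the two outputs are summed, and a sigmoid turns the sum into per-channel weights. Whether MT-BCA-CNN uses exactly this layout is an assumption; the function names, weight shapes, and toy values are hypothetical and not taken from the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlp(vec, w1, w2):
    """Shared bottleneck MLP: ReLU(vec @ w1) @ w2, with w1 of shape [C][H] and w2 of shape [H][C]."""
    hidden = [max(0.0, sum(v * w1[i][j] for i, v in enumerate(vec)))
              for j in range(len(w1[0]))]
    return [sum(h * w2[j][k] for j, h in enumerate(hidden))
            for k in range(len(w2[0]))]

def channel_attention(fmap, w1, w2):
    """fmap is a list of C channels, each a 2-D list (frequency x time).
    Returns the reweighted feature map and the per-channel weights in (0, 1)."""
    flat = [[x for row in ch for x in row] for ch in fmap]
    avg_desc = [sum(f) / len(f) for f in flat]   # global average pooling
    max_desc = [max(f) for f in flat]            # global max pooling
    a = mlp(avg_desc, w1, w2)
    m = mlp(max_desc, w1, w2)
    weights = [sigmoid(x + y) for x, y in zip(a, m)]
    out = [[[x * w for x in row] for row in ch] for ch, w in zip(fmap, weights)]
    return out, weights

# Toy input: channel 0 carries energy, channel 1 is silent (hypothetical values).
fmap = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
w1 = [[1.0], [1.0]]    # C=2 -> hidden=1
w2 = [[1.0, -1.0]]     # hidden=1 -> C=2
out, weights = channel_attention(fmap, w1, w2)
```

With these hand-picked weights the energetic channel receives a weight close to 1 and the silent one a weight close to 0. In a trained network `w1` and `w2` would be learned, so which channels are emphasized depends entirely on the data.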

If this is right

Joint optimization of classification and reconstruction yields synergistic gains confirmed by ablation studies on the same dataset.
Dynamic weighting during training keeps the two tasks balanced so neither dominates the shared extractor.
The resulting model maintains high accuracy even when only a handful of examples per class are available.
Performance exceeds both conventional CNNs and prior published UATR methods under identical few-shot conditions.
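
The dynamic weighting mentioned in the second point can be pictured with a small sketch. The scheme the paper actually uses is not described on this page, so the code below substitutes one well-known heuristic in the spirit of Dynamic Weight Averaging: a task whose loss has been falling more slowly gets a larger weight in the next step. The function names, the temperature value, and the loss numbers are all hypothetical.

```python
import math

def dynamic_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Return one weight per task; the weights sum to the number of tasks.
    A higher loss ratio (slower progress) yields a larger weight."""
    ratios = [p / max(pp, 1e-12) for p, pp in zip(prev_losses, prev_prev_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    n = len(ratios)
    return [n * e / total for e in exps]

def joint_loss(losses, weights):
    """Weighted sum of the per-task losses."""
    return sum(w * l for w, l in zip(weights, losses))

# Hypothetical history: the classification loss stalled (1.0 -> 1.0),
# while the reconstruction loss halved (0.8 -> 0.4).
w = dynamic_weights([1.0, 0.4], [1.0, 0.8])
total = joint_loss([1.0, 0.4], w)
```

Here the stalled classification task ends up with the larger weight, which is one way to keep either task from dominating the shared extractor.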

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same joint-classification-plus-reconstruction pattern could be tested on other noisy few-shot audio tasks such as bird calls or industrial fault detection.
Evaluating the trained model on continuous ocean recordings rather than pre-segmented clips would show whether the reported accuracy survives real-time streaming conditions.
Pairing the architecture with simple spectrogram augmentations might push accuracy still higher in the lowest-data regimes without changing the core design.
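
For the augmentation idea in the last point, a minimal sketch of time and frequency masking on a spectrogram might look like the following. It is an illustration only: the paper is not stated to use this augmentation, and the function name `mask_spectrogram` and its parameters are hypothetical.

```python
import random

def mask_spectrogram(spec, max_freq_width=2, max_time_width=2, rng=None):
    """Zero out one random band of frequency rows and one band of time columns.
    spec is a list of rows (frequency bins), each a list of time frames.
    The input is left untouched; a masked copy is returned."""
    rng = rng or random.Random()
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]
    f_width = rng.randint(0, min(max_freq_width, n_freq))
    f_start = rng.randint(0, n_freq - f_width)
    t_width = rng.randint(0, min(max_time_width, n_time))
    t_start = rng.randint(0, n_time - t_width)
    for f in range(f_start, f_start + f_width):
        out[f] = [0.0] * n_time          # frequency mask
    for row in out:
        for t in range(t_start, t_start + t_width):
            row[t] = 0.0                 # time mask
    return out

spec = [[1.0] * 6 for _ in range(4)]     # toy 4 x 6 spectrogram
aug = mask_spectrogram(spec, rng=random.Random(0))
```

Passing a seeded `random.Random` makes the masks reproducible, which helps when comparing runs in low-data regimes.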

Load-bearing premise

That the channel attention mechanism can reliably pick out harmonic structures while suppressing noise, and that the classification and reconstruction tasks produce mutual benefits rather than conflicting gradients on noisy underwater recordings.

What would settle it

Remove the channel attention module, retrain on the identical 27-class few-shot split of the Watkins dataset, and observe whether accuracy remains above 90 percent or drops to the level of a plain CNN.

Figures

Figures reproduced from arXiv: 2504.13102 (Huang et al., preprint submitted to Elsevier) by Hao Zhang, Junpeng Lu, Shumeng Sun, Wei Huang, Zhengyang Xiu, Zhenpeng Xu.

**Figure 1.** Target Recognition Task Process.

**Figure 2.** MT-BCA-CNN Model Architecture.

**Figure 3.** Flowchart of multi-task learning implementation, where λ0 and λ1 denote the weights of the task-specific classifiers, L1 and L2 represent the task losses, and L_total is the joint loss function.

**Figure 4.** Examples of raw signal waveforms and their corresponding Mel-spectrogram. (a) Clymene Dolphin-wave. (b) Clymene Dolphin-Mel. (c) Common Dolphin-wave. (d) Common Dolphin-Mel. (e) Beluga White Whale-wave. (f) Beluga White Whale-Mel.

**Figure 5.** (a) CAM++ (Acc: 0.62). (b) ERes2Net (Acc: 0.63). (c) ResNetSE (Acc: 0.78). (d) MT-BCA-CNN (Acc: 0.97).

**Figure 6.** Comparison of Parameter Counts and Performance Among Three Classical Models, Baseline CNN, and Our Proposed MT-BCA-CNN on the Dataset.

**Figure 7.** Ablation study results. (a) Only Classify Acc (0.89). (b) CNN Acc (0.91). (c) MT-CNN Acc (0.95). (d) MT-BCA-CNN Acc (0.97).

Original abstract

Underwater acoustic target recognition (UATR) is of great significance for the protection of marine diversity and national defense security. The development of deep learning provides new opportunities for UATR, but faces challenges brought by the scarcity of reference samples and complex environmental interference. To address these issues, we proposes a multi-task balanced channel attention convolutional neural network (MT-BCA-CNN). The method integrates a channel attention mechanism with a multi-task learning strategy, constructing a shared feature extractor and multi-task classifiers to jointly optimize target classification and feature reconstruction tasks. The channel attention mechanism dynamically enhances discriminative acoustic features such as harmonic structures while suppressing noise. Experiments on the Watkins Marine Life Dataset demonstrate that MT-BCA-CNN achieves 97% classification accuracy and 95% F1-score in 27-class few-shot scenarios, significantly outperforming traditional CNN and ACNN models, as well as popular state-of-the-art UATR methods. Ablation studies confirm the synergistic benefits of multi-task learning and attention mechanisms, while a dynamic weighting adjustment strategy effectively balances task contributions. This work provides an efficient solution for few-shot underwater acoustic recognition, advancing research in marine bioacoustics and sonar signal processing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper combines channel attention and multi-task learning in a CNN for few-shot underwater acoustic recognition and reports 97% accuracy on the Watkins dataset, but the results hinge on unverified data splits that could allow leakage.

The main thing to know is that MT-BCA-CNN takes a standard CNN backbone, adds a channel attention block to emphasize harmonics and suppress noise, and layers on multi-task learning with one head for classification and another for feature reconstruction, using dynamic weighting to balance the losses. On the Watkins Marine Life Dataset it reaches 97% accuracy and 95% F1 in a 27-class few-shot setting and beats the listed baselines plus some prior UATR methods, with ablations showing each piece contributes something.

Referee Report

2 major / 2 minor

Summary. The paper proposes MT-BCA-CNN, a multi-task balanced channel attention CNN for few-shot underwater acoustic target recognition (UATR). It integrates channel attention for enhancing discriminative features (e.g., harmonics) while suppressing noise, with multi-task learning for joint classification and feature reconstruction using dynamic task weighting. On the Watkins Marine Life Dataset, it reports 97% classification accuracy and 95% F1-score in 27-class few-shot scenarios, outperforming traditional CNN, ACNN, and other SOTA UATR methods, with ablation studies supporting the contributions of attention and multi-task components.

Significance. If the central performance claims hold under proper generalization conditions, the work offers a practical approach to few-shot UATR by showing potential synergies between channel attention and multi-task learning on noisy marine data. The ablation studies and dynamic weighting strategy provide concrete evidence for the design choices, which could inform future bioacoustics and sonar applications if reproducibility is ensured.

major comments (2)

[Experimental section] Experimental section (likely §4 or §5): The description of the data splitting protocol on the Watkins Marine Life Dataset is insufficient. It does not specify whether splits are performed at the recording level (to ensure independence) or at the clip level. Given that the dataset consists of multiple short clips extracted from longer continuous recordings, clip-level random splits risk data leakage via shared background noise, hydrophone artifacts, or call patterns. This directly undermines the validity of the reported 97% accuracy and 95% F1-score in the 27-class few-shot setting and the outperformance claims relative to baselines.
[Results section] Results section (performance tables): No error bars, standard deviations, or statistical significance tests (e.g., paired t-tests or McNemar tests) are reported for the 97% accuracy and 95% F1-score across runs or folds. Without these, the numerical superiority over baselines cannot be assessed as robust rather than due to random variation, weakening support for the central claim.
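
The first major comment turns on the difference between clip-level and recording-level splits. A minimal sketch of a recording-level split is given below, assuming each clip carries an identifier for the continuous recording it was cut from. The tuple layout, the function name `split_by_recording`, and the toy data are hypothetical; nothing here is taken from the paper's pipeline.

```python
import random
from collections import defaultdict

def split_by_recording(clips, test_fraction=0.2, seed=0):
    """clips is a list of (clip_id, recording_id, label) tuples.
    Returns (train, test) such that no recording contributes to both."""
    by_recording = defaultdict(list)
    for clip in clips:
        by_recording[clip[1]].append(clip)
    recording_ids = sorted(by_recording)
    random.Random(seed).shuffle(recording_ids)
    n_test = max(1, round(len(recording_ids) * test_fraction))
    test_ids = set(recording_ids[:n_test])
    train = [c for r in recording_ids if r not in test_ids for c in by_recording[r]]
    test = [c for r in recording_ids if r in test_ids for c in by_recording[r]]
    return train, test

# Toy data: 5 recordings, 3 clips each (hypothetical identifiers).
clips = [(f"clip{r}_{i}", f"rec{r}", "dolphin") for r in range(5) for i in range(3)]
train, test = split_by_recording(clips)
```

This sketch shuffles whole recordings and ignores class balance. In a 27-class few-shot setting one would likely apply the same grouping per class so that every class appears in both partitions.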

minor comments (2)

[Abstract] Abstract: The sentence 'we proposes a multi-task...' contains a subject-verb agreement error and should be corrected for clarity.
[Method section] Notation: The dynamic weighting parameters in the multi-task loss are introduced but lack explicit equations or initialization details, making the 'balanced' aspect harder to reproduce.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. The comments highlight important aspects of experimental rigor that we will address to strengthen the paper. We provide point-by-point responses below and commit to revisions where appropriate.

Referee: [Experimental section] Experimental section (likely §4 or §5): The description of the data splitting protocol on the Watkins Marine Life Dataset is insufficient. It does not specify whether splits are performed at the recording level (to ensure independence) or at the clip level. Given that the dataset consists of multiple short clips extracted from longer continuous recordings, clip-level random splits risk data leakage via shared background noise, hydrophone artifacts, or call patterns. This directly undermines the validity of the reported 97% accuracy and 95% F1-score in the 27-class few-shot setting and the outperformance claims relative to baselines.

Authors: We agree that the current description of the data splitting protocol is insufficient and could raise concerns about potential data leakage. In the revised manuscript, we will expand the experimental section to explicitly detail that splits were performed at the recording level: all clips derived from the same original continuous recording are assigned to the same train, validation, or test partition. We will also add a brief justification for this choice and, space permitting, include pseudocode or a flowchart illustrating the procedure to ensure reproducibility and independence of samples. revision: yes
Referee: [Results section] Results section (performance tables): No error bars, standard deviations, or statistical significance tests (e.g., paired t-tests or McNemar tests) are reported for the 97% accuracy and 95% F1-score across runs or folds. Without these, the numerical superiority over baselines cannot be assessed as robust rather than due to random variation, weakening support for the central claim.

Authors: We acknowledge that the absence of variability measures and statistical tests limits the strength of the performance claims. In the revision, we will re-run the experiments with multiple random seeds (or k-fold cross-validation) and report mean accuracy and F1-score along with standard deviations in the tables. We will also add paired t-tests or McNemar tests comparing MT-BCA-CNN against the baselines, with p-values, to demonstrate that the improvements are statistically significant rather than attributable to random variation. revision: yes
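
The statistics promised in the second response can be sketched with the standard library alone. The block below computes a mean and sample standard deviation over per-seed accuracies and a two-sided exact McNemar p-value from the counts of test items on which two models disagree. All numbers are made up for illustration and are not results from the paper; the function name `mcnemar_exact` is hypothetical.

```python
import math
import statistics

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.
    b: items model A classified correctly and model B incorrectly.
    c: items model A classified incorrectly and model B correctly."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial tail under the null hypothesis that disagreements are 50/50.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

accs = [0.96, 0.97, 0.98]            # hypothetical per-seed accuracies
mean_acc = statistics.mean(accs)
std_acc = statistics.stdev(accs)     # sample standard deviation

p_value = mcnemar_exact(b=12, c=2)   # hypothetical disagreement counts
```

McNemar's test only uses the items where the two models disagree, so it needs both models evaluated on the same test set, which ties this check back to the splitting protocol raised in the first comment.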

Circularity Check

0 steps flagged

No circularity in derivation; empirical results on external dataset

The paper presents an empirical ML architecture (MT-BCA-CNN) with channel attention and multi-task learning, evaluated via accuracy/F1 on the public Watkins Marine Life Dataset against external baselines. No equations, predictions, or first-principles claims reduce by construction to fitted inputs, self-definitions, or self-citation chains. The central performance numbers are measured outcomes, not algebraically forced, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard deep-learning assumptions plus one domain-specific modeling choice about feature enhancement; no new physical entities are introduced.

free parameters (1)

dynamic task weighting parameters
Adjusts relative contribution of classification and reconstruction losses during joint training.

axioms (1)

domain assumption: Channel attention dynamically enhances discriminative acoustic features such as harmonic structures while suppressing noise.
Stated directly in the abstract as the intended behavior of the attention module on underwater signals.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: MT-BCA-CNN integrates a channel attention mechanism with a multi-task learning strategy, constructing a shared feature extractor and multi-task classifiers to jointly optimize target classification and feature reconstruction tasks.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: Experiments on the Watkins Marine Life Dataset demonstrate that MT-BCA-CNN achieves 97% classification accuracy...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


work page doi:10.3390/agriengineering2030029 2020

[10] [12]

ORCA-SPYenableskillerwhale sound source simulation, detection, classification and localization using an integrated deep learning-based segmentation

Hauer,C.,Nöth,E.,Barnhill,A.,Maier,A.,Guthunz,J.,Hofer,H.,Cheng,R.X.,Barth,V.,Bergler,C.,2023. ORCA-SPYenableskillerwhale sound source simulation, detection, classification and localization using an integrated deep learning-based segmentation. Scientific Reports

work page 2023

[11] [13]

cRIS-Team Scopus Importer:2023-07-21

doi:10.1038/s41598-023-38132-7. cRIS-Team Scopus Importer:2023-07-21

work page doi:10.1038/s41598-023-38132-7 2023

[12] [14]

Squeeze-and-excitation networks, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

Hu, J., Shen, L., Sun, G., 2018. Squeeze-and-excitation networks, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7132–7141. doi:10.1109/CVPR.2018.00745

work page doi:10.1109/cvpr.2018.00745 2018

[13] [15]

A multi-task learning framework for sound event detection using high-level acoustic characteristics of sounds

Khandelwal, T., Das, R.K., 2023. A multi-task learning framework for sound event detection using high-level acoustic characteristics of sounds. URL: https://arxiv.org/abs/2305.10729, arXiv:2305.10729

work page arXiv 2023

[14] [16]

3639–3648

Khattar,A.,Hegde,S.,Hebbalaguppe,R.,2021.Cross-domainmulti-tasklearningforobjectdetectionandsaliencyestimation,in:Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 3639–3648

work page 2021

[15] [17]

Cross-task learning for audio tagging, sound event detection and spatial localization: DCASE 2019 baseline systems

Kong,Q.,Cao,Y.,Iqbal,T.,Xu,Y.,Wang,W.,Plumbley,M.D.,2019. Cross-tasklearningforaudiotagging,soundeventdetectionandspatial localization: Dcase 2019 baseline systems. URL:https://arxiv.org/abs/1904.03476, arXiv:1904.03476

work page internal anchor Pith review Pith/arXiv arXiv 2019

[16] [18]

Noise robust voice conversion with the fusion of mel-spectrum enhancement and feature disentanglement

Lele, C., Xiongwei, Z., Meng, S., Xingyu, Z., 2023. Noise robust voice conversion with the fusion of mel-spectrum enhancement and feature disentanglement. ACTA ACUSTICA 48, 1070–1080. URL: https://www.jac.ac.cn/en/article/doi/10.12395/0371-0025. 2022093, doi:10.12395/0371-0025.2022093

work page doi:10.12395/0371-0025 2023

[17] [19]

Multitask learning for acoustic scene classification with topic-based soft labels and a mutual attention mechanism

Leng, Y., Zhuang, J., Pan, J., Sun, C., 2023. Multitask learning for acoustic scene classification with topic-based soft labels and a mutual attention mechanism. Knowledge-Based Systems 268, 110460. URL: https://www.sciencedirect.com/science/article/pii/ S0950705123002101, doi:https://doi.org/10.1016/j.knosys.2023.110460

work page doi:10.1016/j.knosys.2023.110460 2023

[18] [20]

Underwatertargetrecognitionusingconvolutionalrecurrentneuralnetworkswith3-dmel- spectrogram and data augmentation

Liu,F.,Shen,T.,Luo,Z.,Zhao,D.,Guo,S.,2021. Underwatertargetrecognitionusingconvolutionalrecurrentneuralnetworkswith3-dmel- spectrogram and data augmentation. Applied Acoustics 178, 107989. URL:https://www.sciencedirect.com/science/article/ pii/S0003682X21000827, doi:https://doi.org/10.1016/j.apacoust.2021.107989

work page doi:10.1016/j.apacoust.2021.107989 2021

[19] [21]

A survey of underwater acoustic target recognition methods based on machine learning

Luo, X., Chen, L., Zhou, H., Cao, H., 2023. A survey of underwater acoustic target recognition methods based on machine learning. Journal of Marine Science and Engineering 11. URL:https://www.mdpi.com/2077-1312/11/2/384, doi:10.3390/jmse11020384

work page doi:10.3390/jmse11020384 2023

[20] [22]

Underwateracousticsignalclassificationbasedonsparsetime-frequencyrepresentation and deep learning

Miao,Y.,Zakharov,Y.,Sun,H.,Li,J.,Wang,J.,2020. Underwateracousticsignalclassificationbasedonsparsetime-frequencyrepresentation and deep learning. IEEE Journal of Oceanic Engineering , 1–14URL:https://eprints.whiterose.ac.uk/id/eprint/167766/. in Press

work page 2020

[21] [23]

URL:https://onlinelibrary.wiley.com/doi/abs/10.1155/2018/6593037, doi:https://doi.org/ 10.1155/2018/6593037, arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1155/2018/6593037

Mohammed,S.K.,Hariharan,S.M.,Kamal,S.,2018.Agtcc-basedunderwaterhmmtargetclassifierwithfadingchannelcompensation.Journal of Sensors 2018, 6593037. URL:https://onlinelibrary.wiley.com/doi/abs/10.1155/2018/6593037, doi:https://doi.org/ 10.1155/2018/6593037, arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1155/2018/6593037

work page doi:10.1155/2018/6593037 2018

[22] [24]

Areviewontheattentionmechanismofdeeplearning

Niu,Z.,Zhong,G.,Yu,H.,2021. Areviewontheattentionmechanismofdeeplearning. Neurocomputing452,48–62. URL: https://www. sciencedirect.com/science/article/pii/S092523122100477X, doi:https://doi.org/10.1016/j.neucom.2021.03.091

work page doi:10.1016/j.neucom.2021.03.091 2021

[23] [26]

Comprehensiveunderwaterobjecttrackingbenchmarkdatasetandunderwaterimage enhancement with gan

Panetta,K.,Kezebou,L.,Oludare,V.,Agaian,S.,2022b. Comprehensiveunderwaterobjecttrackingbenchmarkdatasetandunderwaterimage enhancement with gan. IEEE Journal of Oceanic Engineering 47, 59–75. doi:10.1109/JOE.2021.3086907

work page doi:10.1109/joe.2021.3086907 2021

[24] [27]

Time–Frequency Processing: Methods and Tools

Pulkki, V., Delikaris-Manias, S., Politis, A., 2018. Time–Frequency Processing: Methods and Tools. pp. 1–24. doi: 10.1002/ 9781119252634.ch1

work page 2018

[25] [28]

The watkins marine mammal sound database: An online, freely accessible resource

Sayigh, L., Daher, M.A., Allen, J., Gordon, H., Joyce, K., Stuhlmann, C., Tyack, P., 2017. The watkins marine mammal sound database: An online, freely accessible resource. Proceedings of Meet- ings on Acoustics 27, 040013. URL: https://doi.org/10.1121/2.0000358, doi: 10.1121/2.0000358, arXiv:https://pubs.aip.org/asa/poma/article-pdf/doi/10.1121/2.0000358/...

work page doi:10.1121/2.0000358 2017

[26] [29]

Differential treatment for time and frequency dimensions in mel-spectrograms: An efficient 3d spectrogram network for underwater acoustic target classification

Tang, N., Zhou, F., Wang, Y., Zhang, H., Lyu, T., Wang, Z., Chang, L., 2023. Differential treatment for time and frequency dimensions in mel-spectrograms: An efficient 3d spectrogram network for underwater acoustic target classification. Ocean Engineering 287, 115863. URL: https://www.sciencedirect.com/science/article/pii/S0029801823022473, doi:https://do...

work page doi:10.1016/j.oceaneng 2023

[27] [30]

Multi-task Learning Of Deep Neural Networks For Audio Visual Automatic Speech Recognition

Thanda, A., Venkatesan, S.M., 2017. Multi-task learning of deep neural networks for audio visual automatic speech recognition. URL: https://arxiv.org/abs/1701.02477, arXiv:1701.02477

work page internal anchor Pith review Pith/arXiv arXiv 2017

[28] [31]

Anunderwateracoustictargetrecognitionmethodbasedonamnet

Wang,B.,Zhang,W.,Zhu,Y.,Wu,C.,Zhang,S.,2023a. Anunderwateracoustictargetrecognitionmethodbasedonamnet. IEEEGeoscience and Remote Sensing Letters 20, 1–5. doi:10.1109/LGRS.2023.3235659. Huang et al.: Preprint submitted to Elsevier Page 17 of 18 A MT-BCA-CNN Model for Few-shot UATR

work page doi:10.1109/lgrs.2023.3235659 2023

[29] [32]

CAM++: A fast and efficient network for speaker verification using context- aware masking

Wang, H., andYafeng Chen, S.Z., Cheng, L., Chen, Q., 2023b. CAM++: A fast and efficient network for speaker verification using context- aware masking. CoRR abs/2303.00332. URL:https://doi.org/10.48550/arXiv.2303.00332, doi:10.48550/ARXIV.2303.00332, arXiv:2303.00332

work page doi:10.48550/arxiv.2303.00332

[30] [33]

Underwater acoustic target recognition using attention-based deep neural network

Xiao, X., Wang, W., Ren, Q., Gerstoft, P., Ma, L., 2021. Underwater acoustic target recognition using attention-based deep neural network. JASA Express Letters 1, 106001. URL: https://doi.org/10.1121/10.0006299, doi:10.1121/10.0006299, arXiv:https://pubs.aip.org/asa/jel/article-pdf/doi/10.1121/10.0006299/14785347/106001_1_online.pdf

work page doi:10.1121/10.0006299 2021

[31] [34]

A novel deep-learning method with channel attention mechanism for underwater target recognition

Xue, L., Zeng, X., Jin, A., 2022. A novel deep-learning method with channel attention mechanism for underwater target recognition. Sensors 22, 5492. doi:10.3390/s22155492

work page doi:10.3390/s22155492 2022

[32] [35]

An adaptive algorithm for target recognition using gaussian mixture models

Xue, W., Jiang, T., 2018. An adaptive algorithm for target recognition using gaussian mixture models. Measurement 124, 233–

work page 2018

[33] [36]

Masset, R

URL: https://www.sciencedirect.com/science/article/pii/S0263224118302951, doi:https://doi.org/10.1016/j. measurement.2018.04.019

work page doi:10.1016/j 2018

[34] [37]

Cross-view scene image localization with triplet network integrating netvlad and fully connected layers

XUE, Z., ZHOU, Y., QIANG, Y., LIU, Y., LIN, H., 2021. Cross-view scene image localization with triplet network integrating netvlad and fully connected layers. National Remote Sensing Bulletin 25, 1095–1107. doi:10.11834/jrs.20210188

work page doi:10.11834/jrs.20210188 2021

[35] [38]

A deep convolutional neural network inspired by auditory perception for underwater acoustic target recognition

Yang, H., Li, J., Shen, S., Xu, G., 2019. A deep convolutional neural network inspired by auditory perception for underwater acoustic target recognition. Sensors 19. URL:https://www.mdpi.com/1424-8220/19/5/1104, doi:10.3390/s19051104

work page doi:10.3390/s19051104 2019

[36] [39]

Anoverviewonunderwateracousticpassivetargetrecognitionbasedondeep learning

ZHANG,Q.,DA,L.,WANG,C.,ZHANG,Y.,ZHUO,J.,2023. Anoverviewonunderwateracousticpassivetargetrecognitionbasedondeep learning. Journal of Electronics & Information Technology 45, 4190. doi:10.11999/JEIT221301". Huang et al.: Preprint submitted to Elsevier Page 18 of 18

work page doi:10.11999/jeit221301 2023