A Multi-task Learning Balanced Attention Convolutional Neural Network Model for Few-shot Underwater Acoustic Target Recognition
Pith reviewed 2026-05-22 18:56 UTC · model grok-4.3
The pith
A multi-task balanced attention CNN reaches 97 percent accuracy on 27-class few-shot underwater sounds.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim has two parts. First, a shared CNN feature extractor optimized simultaneously for target classification and feature reconstruction, combined with a channel attention mechanism that amplifies discriminative acoustic structures such as harmonics and suppresses noise, reaches 97 percent classification accuracy and a 95 percent F1-score in 27-class few-shot settings on the Watkins Marine Life Dataset. Second, that result outperforms standard CNN, ACNN, and existing UATR baselines.
What carries the argument
A shared CNN feature extractor trained under multi-task learning with dynamic task weighting and a channel attention module that reweights feature maps to emphasize harmonic structures.
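To make the channel attention idea concrete, here is a minimal sketch in plain Python. It assumes a squeeze-and-excitation-style module (global average pooling, a small bottleneck, sigmoid gates); the paper's exact "balanced" variant is not described in this summary, so the function name `channel_attention` and the bias-free weights `w1` and `w2` are hypothetical.

```python
import math


def channel_attention(feature_maps, w1, w2):
    """SE-style channel attention (an assumed form, not the paper's exact module).

    feature_maps: C channels, each an H x W nested list of floats.
    w1: hidden x C weight rows, w2: C x hidden weight rows (hypothetical, no biases).
    Returns (reweighted feature maps, per-channel gates in (0, 1)).
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]
    # Excitation: bottleneck layer with ReLU, then a sigmoid gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [
        [[g * v for v in row] for row in fmap]
        for g, fmap in zip(gates, feature_maps)
    ]
    return scaled, gates
```

A channel whose gate is close to 1 passes through almost unchanged, while a gate close to 0 suppresses it. Whether the learned gates actually track harmonic structure is the empirical question the review raises.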
If this is right
- Joint optimization of classification and reconstruction yields synergistic gains confirmed by ablation studies on the same dataset.
- Dynamic weighting during training keeps the two tasks balanced so neither dominates the shared extractor.
- The resulting model maintains high accuracy even when only a handful of examples per class are available.
- Performance exceeds both conventional CNNs and prior published UATR methods under identical few-shot conditions.
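The dynamic weighting mentioned above is not specified in this summary, so the following is only an illustrative sketch. It substitutes a Dynamic Weight Average style rule, in which the task whose loss is falling more slowly receives the larger weight; the names `dynamic_weights` and `total_loss` and the `temperature` parameter are assumptions, not the paper's notation.

```python
import math


def dynamic_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Dynamic Weight Average style task weights (a stand-in for the paper's rule).

    prev_losses / prev_prev_losses: per-task losses from the last two epochs,
    assumed strictly positive. The returned weights sum to the number of tasks.
    """
    # A ratio near 1 means the task is improving slowly; a small ratio means fast progress.
    ratios = [p / pp for p, pp in zip(prev_losses, prev_prev_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    return [len(ratios) * e / total for e in exps]


def total_loss(cls_loss, rec_loss, weights):
    """Weighted sum of the classification and reconstruction losses."""
    return weights[0] * cls_loss + weights[1] * rec_loss
```

For example, if the classification loss fell from 1.0 to 0.5 while the reconstruction loss fell only from 1.0 to 0.9, `dynamic_weights([0.5, 0.9], [1.0, 1.0])` gives the reconstruction task the larger weight for the next epoch, which is one way to keep a single task from dominating the shared extractor.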
Where Pith is reading between the lines
- The same joint-classification-plus-reconstruction pattern could be tested on other noisy few-shot audio tasks such as bird calls or industrial fault detection.
- Evaluating the trained model on continuous ocean recordings rather than pre-segmented clips would show whether the reported accuracy survives real-time streaming conditions.
- Pairing the architecture with simple spectrogram augmentations might push accuracy still higher in the lowest-data regimes without changing the core design.
Load-bearing premise
The channel attention mechanism must reliably pick out harmonic structures while suppressing noise, and the classification and reconstruction tasks must reinforce each other rather than produce conflicting gradients on noisy underwater recordings.
What would settle it
Remove the channel attention module, retrain on the identical 27-class few-shot split of the Watkins dataset, and observe whether accuracy remains above 90 percent or drops to the level of a plain CNN.
Original abstract
Underwater acoustic target recognition (UATR) is of great significance for the protection of marine diversity and national defense security. The development of deep learning provides new opportunities for UATR, but faces challenges brought by the scarcity of reference samples and complex environmental interference. To address these issues, we proposes a multi-task balanced channel attention convolutional neural network (MT-BCA-CNN). The method integrates a channel attention mechanism with a multi-task learning strategy, constructing a shared feature extractor and multi-task classifiers to jointly optimize target classification and feature reconstruction tasks. The channel attention mechanism dynamically enhances discriminative acoustic features such as harmonic structures while suppressing noise. Experiments on the Watkins Marine Life Dataset demonstrate that MT-BCA-CNN achieves 97% classification accuracy and 95% F1-score in 27-class few-shot scenarios, significantly outperforming traditional CNN and ACNN models, as well as popular state-of-the-art UATR methods. Ablation studies confirm the synergistic benefits of multi-task learning and attention mechanisms, while a dynamic weighting adjustment strategy effectively balances task contributions. This work provides an efficient solution for few-shot underwater acoustic recognition, advancing research in marine bioacoustics and sonar signal processing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MT-BCA-CNN, a multi-task balanced channel attention CNN for few-shot underwater acoustic target recognition (UATR). It integrates channel attention for enhancing discriminative features (e.g., harmonics) while suppressing noise, with multi-task learning for joint classification and feature reconstruction using dynamic task weighting. On the Watkins Marine Life Dataset, it reports 97% classification accuracy and 95% F1-score in 27-class few-shot scenarios, outperforming traditional CNN, ACNN, and other SOTA UATR methods, with ablation studies supporting the contributions of attention and multi-task components.
Significance. If the central performance claims hold under proper generalization conditions, the work offers a practical approach to few-shot UATR by showing potential synergies between channel attention and multi-task learning on noisy marine data. The ablation studies and dynamic weighting strategy provide concrete evidence for the design choices, which could inform future bioacoustics and sonar applications if reproducibility is ensured.
major comments (2)
- [Experimental section] Experimental section (likely §4 or §5): The description of the data splitting protocol on the Watkins Marine Life Dataset is insufficient. It does not specify whether splits are performed at the recording level (to ensure independence) or at the clip level. Given that the dataset consists of multiple short clips extracted from longer continuous recordings, clip-level random splits risk data leakage via shared background noise, hydrophone artifacts, or call patterns. This directly undermines the validity of the reported 97% accuracy and 95% F1-score in the 27-class few-shot setting and the outperformance claims relative to baselines.
- [Results section] Results section (performance tables): No error bars, standard deviations, or statistical significance tests (e.g., paired t-tests or McNemar tests) are reported for the 97% accuracy and 95% F1-score across runs or folds. Without these, the numerical superiority over baselines cannot be assessed as robust rather than due to random variation, weakening support for the central claim.
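The leakage concern in the first major comment comes down to how clips are assigned to partitions. The sketch below shows one way a recording-level split could look: every clip inherits the partition of its source recording. The tuple layout `(clip_id, recording_id, label)` and the function name `split_by_recording` are hypothetical, since this summary does not describe the dataset metadata, and the sketch does not attempt per-class few-shot stratification.

```python
import random
from collections import defaultdict


def split_by_recording(clips, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign whole recordings, not individual clips, to train/val/test.

    clips: list of (clip_id, recording_id, label) tuples (a hypothetical layout).
    fractions: train and val shares of recordings; the remainder goes to test.
    """
    # Group clips by the continuous recording they were cut from.
    by_recording = defaultdict(list)
    for clip in clips:
        by_recording[clip[1]].append(clip)

    # Shuffle recording ids reproducibly, then cut the list into partitions.
    recording_ids = sorted(by_recording)
    random.Random(seed).shuffle(recording_ids)
    n_train = round(fractions[0] * len(recording_ids))
    n_val = round(fractions[1] * len(recording_ids))
    partitions = {
        "train": recording_ids[:n_train],
        "val": recording_ids[n_train:n_train + n_val],
        "test": recording_ids[n_train + n_val:],
    }
    # Expand each partition's recordings back into their clips.
    return {
        name: [clip for rid in rids for clip in by_recording[rid]]
        for name, rids in partitions.items()
    }
```

Because the shuffle operates on recording ids, background noise and hydrophone artifacts shared within one recording cannot appear on both sides of the train/test boundary.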
minor comments (2)
- [Abstract] Abstract: The sentence 'we proposes a multi-task...' contains a subject-verb agreement error and should be corrected for clarity.
- [Method section] Notation: The dynamic weighting parameters in the multi-task loss are introduced but lack explicit equations or initialization details, making the 'balanced' aspect harder to reproduce.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. The comments highlight important aspects of experimental rigor that we will address to strengthen the paper. We provide point-by-point responses below and commit to revisions where appropriate.
-
Referee: [Experimental section] Experimental section (likely §4 or §5): The description of the data splitting protocol on the Watkins Marine Life Dataset is insufficient. It does not specify whether splits are performed at the recording level (to ensure independence) or at the clip level. Given that the dataset consists of multiple short clips extracted from longer continuous recordings, clip-level random splits risk data leakage via shared background noise, hydrophone artifacts, or call patterns. This directly undermines the validity of the reported 97% accuracy and 95% F1-score in the 27-class few-shot setting and the outperformance claims relative to baselines.
Authors: We agree that the current description of the data splitting protocol is insufficient and could raise concerns about potential data leakage. In the revised manuscript, we will expand the experimental section to explicitly detail that splits were performed at the recording level: all clips derived from the same original continuous recording are assigned to the same train, validation, or test partition. We will also add a brief justification for this choice and, space permitting, include pseudocode or a flowchart illustrating the procedure to ensure reproducibility and independence of samples.
revision: yes
-
Referee: [Results section] Results section (performance tables): No error bars, standard deviations, or statistical significance tests (e.g., paired t-tests or McNemar tests) are reported for the 97% accuracy and 95% F1-score across runs or folds. Without these, the numerical superiority over baselines cannot be assessed as robust rather than due to random variation, weakening support for the central claim.
Authors: We acknowledge that the absence of variability measures and statistical tests limits the strength of the performance claims. In the revision, we will re-run the experiments with multiple random seeds (or k-fold cross-validation) and report mean accuracy and F1-score along with standard deviations in the tables. We will also add paired t-tests or McNemar tests comparing MT-BCA-CNN against the baselines, with p-values, to demonstrate that the improvements are statistically significant rather than attributable to random variation.
revision: yes
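The statistics promised in this response can be computed with the standard library alone. The sketch below shows an exact two-sided McNemar test on the discordant predictions of two models, plus a mean and standard deviation over seeds. The function names are hypothetical, and any numbers passed in are placeholders rather than results from the paper.

```python
from math import comb
from statistics import mean, stdev


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: test items model A got right and model B got wrong.
    c: test items model A got wrong and model B got right.
    """
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence either way
    # Under the null hypothesis each discordant pair is a fair coin flip.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def summarize_runs(scores):
    """Mean and sample standard deviation of a metric across seeds or folds."""
    return mean(scores), stdev(scores)
```

With 10 discordant items favouring one model and 2 favouring the other, `mcnemar_exact(10, 2)` is about 0.039. With 5 and 5 it is 1.0, meaning the test set gives no evidence that either model is better.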
Circularity Check
No circularity in derivation; empirical results on external dataset
The paper presents an empirical ML architecture (MT-BCA-CNN) with channel attention and multi-task learning, evaluated via accuracy/F1 on the public Watkins Marine Life Dataset against external baselines. No equations, predictions, or first-principles claims reduce by construction to fitted inputs, self-definitions, or self-citation chains. The central performance numbers are measured outcomes, not algebraically forced, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- dynamic task weighting parameters
axioms (1)
- domain assumption: Channel attention dynamically enhances discriminative acoustic features such as harmonic structures while suppressing noise.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: MT-BCA-CNN integrates a channel attention mechanism with a multi-task learning strategy, constructing a shared feature extractor and multi-task classifiers to jointly optimize target classification and feature reconstruction tasks.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: Experiments on the Watkins Marine Life Dataset demonstrate that MT-BCA-CNN achieves 97% classification accuracy...
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.