From Birdsong to Rumbles: Classifying Elephant Calls with Out-of-Species Embeddings
Pith reviewed 2026-05-09 19:39 UTC · model grok-4.3
The pith
Pretrained acoustic embeddings classify elephant vocalizations nearly as well as fully supervised networks without any fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fixed pretrained embedding networks drawn from bioacoustic, speech, and general audio domains classify elephant calls effectively when paired with lightweight downstream classifiers. Perch 2.0 yields AUCs of 0.849 for African bush elephants and 0.936 for Asian elephants, coming within 2.2% of an end-to-end supervised baseline. A layerwise analysis of transformer encoders reveals that the second layer of wav2vec 2.0 and HuBERT suffices for good performance while using only about 10% of the parameters.
What carries the argument
Out-of-species pretrained acoustic embeddings used as fixed feature extractors paired with lightweight classifiers for elephant call classification.
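A minimal sketch of that protocol, with synthetic data and a random-projection `embed` function as a hypothetical stand-in for any frozen pretrained model (Perch, HuBERT, and so on); the real systems differ in architecture and pooling, but the division of labour is the same: embeddings are computed once and frozen, and only a small classifier is trained.

```python
# Sketch of the fixed-embedding protocol: the pretrained network serves only
# as a frozen feature extractor; the lightweight classifier is the sole
# component trained on elephant data. `embed` is a hypothetical stand-in.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def embed(waveforms: np.ndarray) -> np.ndarray:
    """Stand-in for a frozen embedding model mapping raw audio to vectors.
    In practice this would be a no-gradient forward pass through e.g.
    Perch or HuBERT; a fixed random projection substitutes here."""
    projection = rng.standard_normal((waveforms.shape[1], 768))
    return waveforms @ projection

# Synthetic stand-ins for labelled elephant calls (1 s at 16 kHz).
waveforms = rng.standard_normal((200, 16000))
labels = rng.integers(0, 2, size=200)        # e.g. rumble vs. other

X = embed(waveforms)                          # embeddings computed once, frozen
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # only this is trained
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```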
If this is right
- Classification performance stays high for both African and Asian elephant species even though the embedding networks were trained without any elephant data.
- Truncating transformer networks at intermediate layers preserves accuracy while reducing the parameter count to roughly 10% of the full model (see the sketch after this list).
- A broad range of embedding sources, including those trained with no bioacoustic data at all, supports effective downstream classification.
- Only small classifiers need training on elephant data, lowering the number of labeled examples required.
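A sketch of what layer truncation looks like with the Hugging Face transformers API; the choice of the wav2vec 2.0 base checkpoint, the random input waveform, and mean-pooling over time are assumptions for illustration, not the paper's exact pipeline.

```python
# Sketch of intermediate-layer extraction from wav2vec 2.0. Taking
# hidden_states[2] yields the same representation a network truncated after
# the second transformer block would produce; in deployment the remaining
# ten blocks of the base model would simply be dropped to save compute.
import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

waveform = torch.randn(1, 16000)   # stand-in for 1 s of 16 kHz elephant audio

with torch.no_grad():              # the embedding network stays frozen
    out = model(waveform, output_hidden_states=True)

# hidden_states[0] is the CNN/projection output; hidden_states[2] is the
# second transformer layer, which the paper finds sufficient.
layer2 = out.hidden_states[2]          # shape: (batch, frames, 768)
embedding = layer2.mean(dim=1)         # mean-pool over time (an assumption)
print(embedding.shape)                 # torch.Size([1, 768])
```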
Where Pith is reading between the lines
- The same fixed-embedding strategy could be applied to vocalizations of other species where annotated recordings are even harder to obtain.
- The compact early-layer embeddings open the door to on-device monitoring systems that run in the field with limited power and compute.
- Selecting specific layers might improve transfer in other acoustic domains beyond elephants, such as marine mammal sounds.
Load-bearing premise
The acoustic features learned from birds, speech, or general audio share enough structure with elephant calls to transfer useful distinctions without meaningful domain shift on the available datasets.
What would settle it
A larger and more diverse collection of elephant calls, collected independently, shows AUCs that fall well below those of fully supervised models when the same fixed embeddings are used.
original abstract
We show that pretrained acoustic embeddings classify elephant vocalisations at a level approaching that of end-to-end supervised neural networks, without any fine-tuning of the embedding model. This result is of practical importance because annotated bioacoustic data are scarce and costly to obtain, leaving conventional supervised approaches prone to overfitting and to poor generalisation under domain shift. A broad range of embedding models drawn from general audio, speech, and bioacoustic domains is evaluated, all of which are either out-of-domain (containing no bioacoustic data) or out-of-species (containing no elephant call data). The embedding networks themselves remain fixed; only the lightweight downstream classifiers, which include a linear model and several small neural networks, are trained. Among the models considered, Perch 2.0 achieves the best cross-validated classification performance, attaining AUCs of 0.849 on African bush elephant (Loxodonta africana) calls and 0.936 on Asian elephant (Elephas maximus) calls, with Perch 1.0 close behind. The best-performing system is within 2.2 % of an end-to-end supervised elephant call classification system. A layerwise analysis of pretrained transformer encoders, considered as embedding models, shows that intermediate representations outperform final-layer outputs. The second layer of both wav2vec2.0 and HuBERT encodes sufficient information for effective elephant call classification; truncation at this layer therefore preserves classification performance whilst retaining only approximately 10 % of the parameters of the full network. Such compact embedding networks are well suited to on-device processing where computational resources are limited.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates fixed pretrained acoustic embeddings (from birdsong, speech, and general audio models, none containing elephant data) for classifying elephant vocalizations. Lightweight downstream classifiers are trained on held-out elephant data while the embedding networks remain frozen. Perch 2.0 achieves the highest cross-validated AUCs (0.849 African bush elephant, 0.936 Asian elephant), within 2.2% of an end-to-end supervised baseline. Additional results include layer-wise analysis of transformer encoders and the observation that intermediate layers (e.g., layer 2 of wav2vec 2.0 and HuBERT) suffice for good performance while using only ~10% of the parameters.
Significance. If the evaluation protocol is robust, the result would be practically useful for data-scarce bioacoustics by showing that out-of-species embeddings can approach supervised performance without fine-tuning. The systematic comparison across embedding families, the layer-truncation findings, and the explicit supervised baseline provide concrete evidence that could guide deployment of compact models on resource-limited devices.
major comments (2)
- [Evaluation / cross-validation procedure] The manuscript does not state whether k-fold or other splits are group-aware (by individual elephant or by recording session). Because multiple calls from the same animal or microphone session share low-level acoustic signatures, leakage would allow both the embedding-based classifiers and the end-to-end supervised baseline to exploit these confounds, rendering the reported 2.2% gap uninterpretable as evidence of meaningful out-of-species transfer.
- [Dataset description] No numbers are given for the total number of calls, number of individuals, number of recording sessions, or class balance for either the African or Asian elephant datasets. Without these statistics it is impossible to judge the risk of overfitting or the statistical reliability of the AUC figures.
minor comments (2)
- [Abstract] The abstract lists Perch 2.0 as best but does not name the full set of embedding models evaluated; a summary table in §3 or §4 would improve readability.
- [Layer-wise analysis] The layer-wise truncation result is interesting, yet the text does not report whether the same early-layer advantage holds for all transformer models tested or only for wav2vec 2.0 and HuBERT.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight key areas for improving the transparency and robustness of our evaluation. We address each major point below and will revise the manuscript to incorporate clarifications and additional details.
point-by-point responses
- Referee: [Evaluation / cross-validation procedure] The manuscript does not state whether k-fold or other splits are group-aware (by individual elephant or by recording session). Because multiple calls from the same animal or microphone session share low-level acoustic signatures, leakage would allow both the embedding-based classifiers and the end-to-end supervised baseline to exploit these confounds, rendering the reported 2.2% gap uninterpretable as evidence of meaningful out-of-species transfer.
  Authors: We agree that explicitly describing the cross-validation procedure and ensuring it is group-aware is essential to support claims of out-of-species transfer. The manuscript currently refers only to 'cross-validated' performance without detailing whether splits were performed at the call level (stratified k-fold) or grouped by individual/session. This omission leaves open the possibility of leakage. We will revise the paper to fully specify the original procedure and, where feasible, add results from group-aware splits (e.g., leave-one-individual-out or session-based folds). This will allow direct assessment of whether the small performance gap to the supervised baseline holds under stricter generalization conditions. Revision: yes.
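For concreteness, a minimal sketch of the group-aware evaluation the referee asks for, using scikit-learn's GroupKFold; the embeddings, labels, and per-call individual IDs below are synthetic stand-ins, not the paper's data.

```python
# Group-aware cross-validation: folds never share an individual (or
# recording session) between train and test, so the classifier cannot
# exploit per-animal or per-microphone acoustic signatures.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 768))        # precomputed frozen embeddings
y = rng.integers(0, 2, size=200)           # call labels
groups = rng.integers(0, 20, size=200)     # individual/session ID per call

aucs = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], scores))
print("group-aware cross-validated AUC:", np.mean(aucs))
```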
- Referee: [Dataset description] No numbers are given for the total number of calls, number of individuals, number of recording sessions, or class balance for either the African or Asian elephant datasets. Without these statistics it is impossible to judge the risk of overfitting or the statistical reliability of the AUC figures.
  Authors: We acknowledge that the absence of these statistics limits readers' ability to evaluate the results. The revised manuscript will include a dedicated dataset section or table reporting, for each species: total calls, number of unique individuals, number of recording sessions, class distribution (e.g., rumble vs. other vocalization types), and any preprocessing or balancing steps. These details will be drawn directly from our data collection and will enable assessment of overfitting risk and statistical reliability. Revision: yes.
Circularity Check
No circularity: fixed embeddings + downstream classifiers on held-out data
full rationale
The paper evaluates fixed pretrained embeddings (Perch, wav2vec 2.0, HuBERT, etc.) by training only lightweight downstream classifiers on elephant call datasets. The reported AUCs (0.849 / 0.936) and the 2.2% gap to the end-to-end supervised baseline arise from standard cross-validation on held-out data; no equation, fitted parameter, or self-citation makes these metrics hold by construction. The layerwise analysis and truncation arguments are empirical observations on the same fixed models. This is a conventional transfer-learning protocol with no load-bearing self-definition or renaming of known results.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: acoustic features learned from birds, speech, or general audio transfer to elephant calls without fine-tuning.
Reference graph
Works this paper leans on
- [1] Allen, A. N., Harvey, M., Harrell, L., Jansen, A., Merkens, K. P., Wall, C. C., Cattiau, J., & Oleson, E. M. (2021). A Convolutional Neural Network for Automated Detection of Humpback Whale Song in a Diverse, Long-Term Passive Acoustic Dataset. Frontiers in Marine Science, 8. doi: 10.3389/fmars.2021.607321.
- [2] Kahl, S., Wood, C. M., Eibl, M., & Klinck, H. (2021). BirdNET: A Deep Learning Solution for Avian Diversity Monitoring. Ecological Informatics, 61, 101236. doi: 10.1016/j.ecoinf.2021.101236.
- [3] Keen, S. C., Shiu, Y., Wrege, P. H., & Rowland, E. D. (2017). Automated Detection of Low-frequency Rumbles of Forest Elephants: A critical tool for their conservation. The Journal of the Acoustical Society of America.
- [4] Stoeger, A. S., & Manger, P. (2014). Vocal learning in elephants: Neural bases and adaptive context. Current Opinion in Neurobiology, 28, 101–107. doi: 10.1016/j.conb.2014.07.001.
- [5] Stöger, A. S., Heilmann, G., Zeppelzauer, M., Ganswindt, A., Hensman, S., & Charlton, B. D. (2012). Visualizing sound emission of elephant vocalizations: Evidence for two rumble production types.
- [6] Geldenhuys & Niesler. AERD (the end-to-end supervised elephant call baseline).
- [7] Chen, Wu, et al. BEATs.
- [8] Conneau et al. XLS-R.
- [9] Hsu et al. HuBERT (base, large, and xlarge).
- [10] Hagiwara. AVES-bio and BirdAVES.