From Birdsong to Rumbles: Classifying Elephant Calls with Out-of-Species Embeddings
Pith reviewed 2026-05-09 19:39 UTC · model grok-4.3
The pith
Pretrained acoustic embeddings classify elephant vocalizations nearly as well as fully supervised networks without any fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fixed pretrained embedding networks drawn from bioacoustic, speech, and general audio domains classify elephant calls effectively when paired with lightweight downstream classifiers. Perch 2.0 yields AUCs of 0.849 for African bush elephants and 0.936 for Asian elephants, coming within 2.2% of an end-to-end supervised baseline. A layerwise analysis of transformer encoders reveals that the second layer of wav2vec 2.0 and HuBERT suffices for good performance while using only about 10% of the parameters.
What carries the argument
Out-of-species pretrained acoustic embeddings used as fixed feature extractors paired with lightweight classifiers for elephant call classification.
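A minimal sketch of that protocol, with synthetic data and a random-projection `embed` function as a hypothetical stand-in for any frozen pretrained model (Perch, HuBERT, and so on); the real systems differ in architecture and pooling, but the division of labour is the same: embeddings are computed once and frozen, and only a small classifier is trained.

```python
# Sketch of the fixed-embedding protocol: the pretrained network serves only
# as a frozen feature extractor; the lightweight classifier is the sole
# component trained on elephant data. `embed` is a hypothetical stand-in.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def embed(waveforms: np.ndarray) -> np.ndarray:
    """Stand-in for a frozen embedding model mapping raw audio to vectors.
    In practice this would be a no-gradient forward pass through e.g.
    Perch or HuBERT; a fixed random projection substitutes here."""
    projection = rng.standard_normal((waveforms.shape[1], 768))
    return waveforms @ projection

# Synthetic stand-ins for labelled elephant calls (1 s at 16 kHz).
waveforms = rng.standard_normal((200, 16000))
labels = rng.integers(0, 2, size=200)        # e.g. rumble vs. other

X = embed(waveforms)                          # embeddings computed once, frozen
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # only this is trained
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```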
If this is right
- Classification performance stays high for both African and Asian elephant species even though the embedding networks were trained without any elephant data.
- Truncating transformer networks at intermediate layers preserves accuracy while reducing the parameter count to roughly 10% of the full model (see the sketch after this list).
- A broad range of embedding sources, including those trained with no bioacoustic data at all, supports effective downstream classification.
- Only small classifiers need training on elephant data, lowering the number of labeled examples required.
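A sketch of what layer truncation looks like with the Hugging Face transformers API; the choice of the wav2vec 2.0 base checkpoint, the random input waveform, and mean-pooling over time are assumptions for illustration, not the paper's exact pipeline.

```python
# Sketch of intermediate-layer extraction from wav2vec 2.0. Taking
# hidden_states[2] yields the same representation a network truncated after
# the second transformer block would produce; in deployment the remaining
# ten blocks of the base model would simply be dropped to save compute.
import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

waveform = torch.randn(1, 16000)   # stand-in for 1 s of 16 kHz elephant audio

with torch.no_grad():              # the embedding network stays frozen
    out = model(waveform, output_hidden_states=True)

# hidden_states[0] is the CNN/projection output; hidden_states[2] is the
# second transformer layer, which the paper finds sufficient.
layer2 = out.hidden_states[2]          # shape: (batch, frames, 768)
embedding = layer2.mean(dim=1)         # mean-pool over time (an assumption)
print(embedding.shape)                 # torch.Size([1, 768])
```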
Where Pith is reading between the lines
- The same fixed-embedding strategy could be applied to vocalizations of other species where annotated recordings are even harder to obtain.
- The compact early-layer embeddings open the door to on-device monitoring systems that run in the field with limited power and compute.
- Selecting specific layers might improve transfer in other acoustic domains beyond elephants, such as marine mammal sounds.
Load-bearing premise
The acoustic features learned from birds, speech, or general audio share enough structure with elephant calls to transfer useful distinctions without meaningful domain shift on the available datasets.
What would settle it
A larger and more diverse collection of elephant calls, collected independently, shows AUCs that fall well below those of fully supervised models when the same fixed embeddings are used.
original abstract
We show that pretrained acoustic embeddings classify elephant vocalisations at a level approaching that of end-to-end supervised neural networks, without any fine-tuning of the embedding model. This result is of practical importance because annotated bioacoustic data are scarce and costly to obtain, leaving conventional supervised approaches prone to overfitting and to poor generalisation under domain shift. A broad range of embedding models drawn from general audio, speech, and bioacoustic domains is evaluated, all of which are either out-of-domain (containing no bioacoustic data) or out-of-species (containing no elephant call data). The embedding networks themselves remain fixed; only the lightweight downstream classifiers, which include a linear model and several small neural networks, are trained. Among the models considered, Perch 2.0 achieves the best cross-validated classification performance, attaining AUCs of 0.849 on African bush elephant (Loxodonta africana) calls and 0.936 on Asian elephant (Elephas maximus) calls, with Perch 1.0 close behind. The best-performing system is within 2.2 % of an end-to-end supervised elephant call classification system. A layerwise analysis of pretrained transformer encoders, considered as embedding models, shows that intermediate representations outperform final-layer outputs. The second layer of both wav2vec2.0 and HuBERT encodes sufficient information for effective elephant call classification; truncation at this layer therefore preserves classification performance whilst retaining only approximately 10 % of the parameters of the full network. Such compact embedding networks are well suited to on-device processing where computational resources are limited.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates fixed pretrained acoustic embeddings (from birdsong, speech, and general audio models, none containing elephant data) for classifying elephant vocalizations. Lightweight downstream classifiers are trained on held-out elephant data while the embedding networks remain frozen. Perch 2.0 achieves the highest cross-validated AUCs (0.849 African bush elephant, 0.936 Asian elephant), within 2.2% of an end-to-end supervised baseline. Additional results include layer-wise analysis of transformer encoders and the observation that intermediate layers (e.g., layer 2 of wav2vec 2.0 and HuBERT) suffice for good performance while using only ~10% of the parameters.
Significance. If the evaluation protocol is robust, the result would be practically useful for data-scarce bioacoustics by showing that out-of-species embeddings can approach supervised performance without fine-tuning. The systematic comparison across embedding families, the layer-truncation findings, and the explicit supervised baseline provide concrete evidence that could guide deployment of compact models on resource-limited devices.
major comments (2)
- [Evaluation / cross-validation procedure] The manuscript does not state whether k-fold or other splits are group-aware (by individual elephant or by recording session). Because multiple calls from the same animal or microphone session share low-level acoustic signatures, leakage would allow both the embedding-based classifiers and the end-to-end supervised baseline to exploit these confounds, rendering the reported 2.2% gap uninterpretable as evidence of meaningful out-of-species transfer.
- [Dataset description] No numbers are given for the total number of calls, number of individuals, number of recording sessions, or class balance for either the African or Asian elephant datasets. Without these statistics it is impossible to judge the risk of overfitting or the statistical reliability of the AUC figures.
minor comments (2)
- [Abstract] The abstract lists Perch 2.0 as best but does not name the full set of embedding models evaluated; a summary table in §3 or §4 would improve readability.
- [Layer-wise analysis] The layer-wise truncation result is interesting, yet the text does not report whether the same early-layer advantage holds for all transformer models tested or only for wav2vec 2.0 and HuBERT.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight key areas for improving the transparency and robustness of our evaluation. We address each major point below and will revise the manuscript to incorporate clarifications and additional details.
point-by-point responses
- Referee: [Evaluation / cross-validation procedure] The manuscript does not state whether k-fold or other splits are group-aware (by individual elephant or by recording session). Because multiple calls from the same animal or microphone session share low-level acoustic signatures, leakage would allow both the embedding-based classifiers and the end-to-end supervised baseline to exploit these confounds, rendering the reported 2.2% gap uninterpretable as evidence of meaningful out-of-species transfer.
  Authors: We agree that explicitly describing the cross-validation procedure and ensuring it is group-aware is essential to support claims of out-of-species transfer. The manuscript currently refers only to 'cross-validated' performance without detailing whether splits were performed at the call level (stratified k-fold) or grouped by individual/session. This omission leaves open the possibility of leakage. We will revise the paper to fully specify the original procedure and, where feasible, add results from group-aware splits (e.g., leave-one-individual-out or session-based folds). This will allow direct assessment of whether the small performance gap to the supervised baseline holds under stricter generalization conditions. Revision: yes.
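For concreteness, a minimal sketch of the group-aware evaluation the referee asks for, using scikit-learn's GroupKFold; the embeddings, labels, and per-call individual IDs below are synthetic stand-ins, not the paper's data.

```python
# Group-aware cross-validation: folds never share an individual (or
# recording session) between train and test, so the classifier cannot
# exploit per-animal or per-microphone acoustic signatures.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 768))        # precomputed frozen embeddings
y = rng.integers(0, 2, size=200)           # call labels
groups = rng.integers(0, 20, size=200)     # individual/session ID per call

aucs = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], scores))
print("group-aware cross-validated AUC:", np.mean(aucs))
```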
- Referee: [Dataset description] No numbers are given for the total number of calls, number of individuals, number of recording sessions, or class balance for either the African or Asian elephant datasets. Without these statistics it is impossible to judge the risk of overfitting or the statistical reliability of the AUC figures.
  Authors: We acknowledge that the absence of these statistics limits readers' ability to evaluate the results. The revised manuscript will include a dedicated dataset section or table reporting, for each species: total calls, number of unique individuals, number of recording sessions, class distribution (e.g., rumble vs. other vocalization types), and any preprocessing or balancing steps. These details will be drawn directly from our data collection and will enable assessment of overfitting risk and statistical reliability. Revision: yes.
Circularity Check
No circularity: fixed embeddings + downstream classifiers on held-out data
full rationale
The paper evaluates fixed pretrained embeddings (Perch, wav2vec 2.0, HuBERT, etc.) by training only lightweight downstream classifiers on elephant call datasets. The reported AUCs (0.849 / 0.936) and the 2.2% gap to the end-to-end supervised baseline arise from standard cross-validation on held-out data; no equation, fitted parameter, or self-citation makes these metrics hold by construction. The layerwise analysis and truncation arguments are empirical observations on the same fixed models. This is a conventional transfer-learning protocol with no load-bearing self-definition or renaming of known results.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: acoustic features learned from birds, speech, or general audio transfer to elephant calls without fine-tuning.
Reference graph
Works this paper leans on
- [1] Allen, A. N., Harvey, M., Harrell, L., Jansen, A., Merkens, K. P., Wall, C. C., Cattiau, J., & Oleson, E. M. (2021). A Convolutional Neural Network for Automated Detection of Humpback Whale Song in a Diverse, Long-Term Passive Acoustic Dataset. Frontiers in Marine Science, 8. doi: 10.3389/fmars.2021.607321.
- [2] Kahl, S., Wood, C. M., Eibl, M., & Klinck, H. (2021). BirdNET: A Deep Learning Solution for Avian Diversity Monitoring. Ecological Informatics, 61, 101236. doi: 10.1016/j.ecoinf.2021.101236.
- [3] Keen, S. C., Shiu, Y., Wrege, P. H., & Rowland, E. D. (2017). Automated Detection of Low-frequency Rumbles of Forest Elephants: A critical tool for their conservation. The Journal of the Acoustical Society of America.
- [4] Stoeger, A. S., & Manger, P. (2014). Vocal learning in elephants: Neural bases and adaptive context. Current Opinion in Neurobiology, 28, 101–107. doi: 10.1016/j.conb.2014.07.001.
- [5] Stöger, A. S., Heilmann, G., Zeppelzauer, M., Ganswindt, A., Hensman, S., & Charlton, B. D. (2012). Visualizing sound emission of elephant vocalizations: Evidence for two rumble production types.
- [6] Geldenhuys & Niesler. AERD (the end-to-end supervised elephant call baseline).
- [7] Chen, Wu, et al. BEATs.
- [8] Conneau et al. XLS-R.
- [9] Hsu et al. HuBERT (base, large, and xlarge).
- [10] Hagiwara. AVES-bio and BirdAVES.