pith. machine review for the scientific record.

arxiv: 2605.14031 · v1 · submitted 2026-05-13 · 💻 cs.SD · cs.CV · cs.LG

Recognition: 2 theorem links


Masked Autoencoders with Limited Data: Does It Work? A Fine-Grained Bioacoustics Case Study

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:41 UTC · model grok-4.3

classification 💻 cs.SD · cs.CV · cs.LG
keywords masked autoencoders · bioacoustics · species classification · self-supervised learning · pretraining scale · audio transfer · weakly labeled data · iNatSounds

The pith

For fine-grained bioacoustic classification with limited labels, pretraining on large general audio datasets beats additional domain-specific masked autoencoder training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether masked autoencoders help species recognition when only weakly labeled recordings are available. It compares models pretrained on massive general audio collections against versions that receive extra masked reconstruction training on bioacoustic data. Results show that general pretraining already delivers the strongest transfer, while extra domain-specific steps add little or can lower accuracy. This matters for practitioners who must decide whether to invest in custom pretraining or simply use existing large audio models. The work clarifies that pretraining data scale outweighs objective tailoring when the target domain offers only a moderate amount of data.

Core claim

Models pretrained on diverse general audio data achieve the best transfer performance on iNatSounds. Additional masked reconstruction pretraining on domain-specific data provides limited benefits and may even degrade performance relative to off-the-shelf models. Selective data filtering offers a negligible advantage when the overall data scale is limited. In moderate-sized fine-grained bioacoustic settings, pretraining scale dominates objective design.

What carries the argument

Masked autoencoder pretraining applied to audio spectrograms, with systematic variation of pretraining data scale and domain for downstream species classification on iNatSounds.
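
To make the machinery concrete, here is a minimal sketch of masked spectrogram autoencoding in PyTorch. The architecture sizes, patch size, and all names are illustrative stand-ins, not the paper's AudioMAE configuration; only the 0.8 masking ratio and the patch-wise normalized target follow the Figure 2 caption.

```python
import torch
import torch.nn as nn

PATCH = 16        # square patch size on the (mel, time) grid (illustrative)
MASK_RATIO = 0.8  # masking ratio reported in the Figure 2 caption
DIM = 192         # embedding width (illustrative)

def patchify(spec):
    """(B, 1, F, T) spectrogram -> (B, N, PATCH*PATCH) flattened patches."""
    b, _, f, t = spec.shape
    blocks = spec.unfold(2, PATCH, PATCH).unfold(3, PATCH, PATCH)
    return blocks.reshape(b, -1, PATCH * PATCH)

class TinyMAE(nn.Module):
    def __init__(self, n_patches):
        super().__init__()
        self.embed = nn.Linear(PATCH * PATCH, DIM)
        self.pos = nn.Embedding(n_patches, DIM)           # learned positions
        self.mask_token = nn.Parameter(torch.zeros(DIM))
        layer = lambda: nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.decoder = nn.TransformerEncoder(layer(), num_layers=1)
        self.head = nn.Linear(DIM, PATCH * PATCH)

    def forward(self, spec):
        patches = patchify(spec)                          # (B, N, P*P)
        b, n, _ = patches.shape
        # Patch-wise normalized reconstruction target (per the Figure 2 caption).
        mu = patches.mean(-1, keepdim=True)
        sd = patches.std(-1, keepdim=True) + 1e-6
        target = (patches - mu) / sd
        # Randomly keep a (1 - MASK_RATIO) fraction of patches visible.
        n_keep = max(1, int(n * (1 - MASK_RATIO)))
        order = torch.rand(b, n, device=spec.device).argsort(dim=-1)
        keep, drop = order[:, :n_keep], order[:, n_keep:]
        take = lambda x, i: torch.gather(x, 1, i.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        # Encode visible patches only, then decode with mask tokens inserted.
        z = self.encoder(self.embed(take(patches, keep)) + self.pos(keep))
        masked = self.mask_token.expand(b, drop.size(1), DIM) + self.pos(drop)
        decoded = self.decoder(torch.cat([z, masked], dim=1))
        pred = self.head(decoded[:, n_keep:])             # predictions at masked slots
        return ((pred - take(target, drop)) ** 2).mean()  # loss on masked patches only

model = TinyMAE(n_patches=(128 // PATCH) * (512 // PATCH))
loss = model(torch.randn(2, 1, 128, 512))                 # 128 mel bins x 512 frames
loss.backward()
```

After pretraining, the decoder is discarded and the encoder is fine-tuned for species classification, which is the transfer step the paper evaluates.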

If this is right

  • Off-the-shelf general-audio models should be the default starting point for bioacoustic tasks with moderate data (a fine-tuning sketch follows this list).
  • Additional masked autoencoder pretraining on limited domain data is not worth the compute when general pretraining already exists.
  • Data curation steps such as filtering add little value once overall pretraining volume is constrained.
  • Similar patterns are likely in other fine-grained audio domains that rely on weakly labeled recordings.
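
A minimal sketch of that default workflow, assuming a PyTorch setup: take a pretrained general-audio encoder and fine-tune it with a fresh classification head. `load_pretrained_audio_encoder` and the class count are hypothetical placeholders; substitute the loader and label space of your actual checkpoint and dataset.

```python
import torch
import torch.nn as nn

NUM_SPECIES = 5569  # hypothetical class count; use your dataset's label space

def load_pretrained_audio_encoder():
    # Stand-in for loading a public checkpoint (e.g. an AudioSet-pretrained
    # AudioMAE ViT-B): any module mapping (B, 1, F, T) spectrograms to
    # (B, 768) embeddings fits this slot.
    return nn.Sequential(nn.Flatten(), nn.Linear(128 * 512, 768))

encoder = load_pretrained_audio_encoder()
head = nn.Linear(768, NUM_SPECIES)      # fresh species-classification head
model = nn.Sequential(encoder, head)

# Full fine-tuning; for a linear probe, freeze the encoder instead:
# for p in encoder.parameters(): p.requires_grad = False
opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
loss_fn = nn.CrossEntropyLoss()

spec = torch.randn(8, 1, 128, 512)      # a batch of mel spectrograms
labels = torch.randint(0, NUM_SPECIES, (8,))
opt.zero_grad()
loss = loss_fn(model(spec), labels)
loss.backward()
opt.step()
```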

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The finding suggests that for other weakly labeled audio tasks, simply scaling general pretraining may be more reliable than designing new self-supervised objectives.
  • It raises the question of whether masked reconstruction loses fine species distinctions that supervised signals from broad audio corpora preserve.
  • Practitioners could test whether the same scale-over-objective pattern appears when the target domain has even fewer total hours of audio.

Load-bearing premise

Observed performance gaps arise mainly from differences in pretraining data volume rather than from variations in model capacity, optimizer settings, or label noise in the weakly annotated recordings.

What would settle it

A controlled experiment in which model size, optimizer, and training schedule are matched exactly yet domain-specific masked autoencoder pretraining on a larger bioacoustic corpus still underperforms general-audio pretraining.
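
A sketch of what that control could look like: one shared recipe, with only the pretraining corpus varying between regimes. All names and values are illustrative, not the paper's settings.

```python
# One shared recipe; only the pretraining corpus differs between regimes.
SHARED_RECIPE = dict(
    arch="ViT-B", optimizer="AdamW", lr=1e-4, schedule="cosine",
    pretrain_steps=200_000, batch_size=256,
)

REGIMES = {
    "general-audio":       {"pretrain_data": "large general audio corpus"},
    "domain-specific MAE": {"pretrain_data": "larger bioacoustic corpus"},
}

for name, regime in REGIMES.items():
    config = {**SHARED_RECIPE, **regime}
    # pretrain(config); acc = finetune_and_eval(config)   # hypothetical helpers
    print(name, "->", config)
```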

Figures

Figures reproduced from arXiv: 2605.14031 by Grant Van Horn, Mustafa Chasmai, Subhransu Maji, Wuao Liu.

Figure 1
Figure 1: Overview. We investigate MAE pretraining for fine-grained bioacoustic recognition. An audio encoder is first trained with a masked spectrogram reconstruction objective and then fine-tuned for species classification. We systematically evaluate its effectiveness under a modest data regime and find that pretraining scale plays a more critical role than continual pretraining.
Figure 2
Figure 2: Reconstruction performance. Qualitative examples of masked spectrogram reconstruction for three species in the iNatSounds validation set. We use a ViT-B encoder following AudioMAE's default configuration. Input spectrograms are masked with a ratio of 0.8. The model directly predicts masked pixel values for visualization. Following AudioMAE, we adopt a patch-wise normalized reconstruction target rather than…
Figure 3
Figure 3: Finetuning MAEs using different fractions of bioacoustic data. We show Top-1 validation accuracy on iNatSounds as a function of the available labeled samples.
Figure 5
Figure 5: Audio segments filtered by reconstruction loss. Left: distribution of confidence scores across audio segments. Middle: fraction of audio segments to keep under different confidence thresholds, along with the corresponding validation accuracy using a MobileNetV3 model trained on the filtered data. Right: Top-1 validation accuracy of ViT-B models fine-tuned on both the iNatSounds full dataset and the filtered subset…
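
A minimal sketch of the filtering step Figure 5 describes, under the assumption that segments are scored by masked-reconstruction loss and kept below a quantile threshold; the synthetic scores and names are placeholders, not the paper's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in per-segment masked-reconstruction losses (lower = more confident).
recon_loss = rng.gamma(shape=2.0, scale=1.0, size=10_000)

KEEP_FRACTION = 0.7                              # fraction of segments to retain
threshold = np.quantile(recon_loss, KEEP_FRACTION)
kept = recon_loss <= threshold

print(f"threshold={threshold:.3f}, kept {kept.mean():.1%} of segments")
# A classifier (e.g. the MobileNetV3 in Figure 5) would then be trained on
# the kept subset and compared against training on the full dataset.
```
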
read the original abstract

Bioacoustic recognition requires fine-grained acoustic understanding to distinguish similar-sounding species. However, many large-scale data repositories such as iNaturalist are weakly annotated, often with only a single positive species label per recording, making supervised learning particularly challenging. Inspired by advances in computer vision, recent approaches have shifted toward self-supervised learning to capture the underlying structure of audio without relying on exhaustive annotations. In particular, masked autoencoders (MAE) have shown strong transferability on massive audio corpora, yet their effectiveness in more modest bioacoustic settings remains underexplored. In this work, we conduct a systematic study of MAE pretraining for species classification on iNatSounds, analyzing the impacts of pretraining data scale, domain specificity, data curation, and transfer strategies. Consistent with prior work, we find that models pretrained on diverse general audio data achieve the best transfer performance on iNatSounds. Contrary to observations from large-scale audio benchmarks, we find that (1) additional masked reconstruction pretraining on domain-specific data provides limited benefits and may even degrade performance relative to off-the-shelf models, and (2) selective data filtering offers a negligible advantage when the overall data scale is limited. Our results indicate that, in moderate-sized fine-grained bioacoustic settings, pretraining scale dominates objective design. These findings further clarify when MAE-based pretraining is effective and provide practical guidance for model selection under limited supervision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts a systematic empirical study of masked autoencoder (MAE) pretraining for fine-grained species classification on the iNatSounds dataset. It analyzes the effects of pretraining data scale, domain specificity, data curation, and transfer strategies, reporting that off-the-shelf models pretrained on large-scale general audio achieve the best transfer performance. Additional domain-specific MAE pretraining provides limited benefits or can degrade results, and selective filtering yields negligible gains when overall scale is limited. The central claim is that, in moderate-sized fine-grained bioacoustic settings, pretraining scale dominates objective design.

Significance. If the central attribution to scale holds after addressing controls, the work supplies actionable guidance for self-supervised audio models under weak supervision and limited domain data. It helps delineate when general large-scale pretraining is preferable to further domain-specific MAE efforts, which is useful for bioacoustics practitioners facing weakly labeled repositories like iNaturalist.

major comments (2)
  1. [Abstract and §4] The comparison between off-the-shelf general-audio checkpoints and custom domain-specific MAE runs does not report matched model capacity, optimizer, learning-rate schedule, or total training budget. Without these controls, performance differences on iNatSounds cannot be cleanly attributed to pretraining data scale rather than architectural or optimization mismatches. This directly affects the load-bearing claim in the abstract that scale dominates objective design.
  2. [§4.2] No variance estimates, statistical significance tests, or multiple random seeds are described for the reported transfer accuracies across pretraining regimes. This weakens confidence that the observed ordering (general > domain-specific MAE) is robust rather than an artifact of single-run variability or dataset-specific biases in the single-label iNatSounds recordings.
minor comments (2)
  1. [§3.3] The transfer-strategy ablation would be clearer with an explicit diagram showing the frozen vs. fine-tuned stages and the exact linear-probe protocol.
  2. [Table 2] Table 2 caption should explicitly state the number of parameters and the exact checkpoint sources for each general-audio baseline to facilitate replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript to incorporate additional controls and statistical reporting where feasible.

read point-by-point responses
  1. Referee: [Abstract and §4] The comparison between off-the-shelf general-audio checkpoints and custom domain-specific MAE runs does not report matched model capacity, optimizer, learning-rate schedule, or total training budget. Without these controls, performance differences on iNatSounds cannot be cleanly attributed to pretraining data scale rather than architectural or optimization mismatches. This directly affects the load-bearing claim in the abstract that scale dominates objective design.

    Authors: We appreciate the referee highlighting this issue. The off-the-shelf general-audio models refer to publicly released checkpoints (e.g., AudioMAE variants pretrained on AudioSet) whose architectures, capacities, and training protocols are documented in the original publications. Our domain-specific MAE runs used the identical ViT-based architecture and followed the standard MAE optimization settings (AdamW, cosine schedule, etc.) as closely as possible given the available compute. To strengthen the attribution to scale, we will add a supplementary table in the revised manuscript that explicitly lists model capacity, optimizer, learning-rate schedule, and total training budget for every pretraining regime. This will allow readers to evaluate comparability directly. revision: yes

  2. Referee: [§4.2] No variance estimates, statistical significance tests, or multiple random seeds are described for the reported transfer accuracies across pretraining regimes. This weakens confidence that the observed ordering (general > domain-specific MAE) is robust rather than an artifact of single-run variability or dataset-specific biases in the single-label iNatSounds recordings.

    Authors: We agree that variance estimates and statistical tests would increase confidence in the reported ordering. Our experiments were run with single random seeds owing to the substantial compute required for MAE pretraining on large audio corpora. In the revision we will re-evaluate the key transfer results using at least three independent random seeds, report means and standard deviations, and include paired statistical significance tests (e.g., t-tests) between the general-audio and domain-specific regimes. This directly addresses concerns about single-run variability. revision: yes
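
For concreteness, a sketch of the promised reporting, assuming per-seed accuracies from matched runs; the numbers below are placeholders, not results from the paper.

```python
import numpy as np
from scipy import stats

# Placeholder per-seed Top-1 accuracies; not results from the paper.
general = np.array([0.612, 0.618, 0.609])
domain  = np.array([0.598, 0.605, 0.601])

print(f"general: {general.mean():.3f} +/- {general.std(ddof=1):.3f}")
print(f"domain:  {domain.mean():.3f} +/- {domain.std(ddof=1):.3f}")

t, p = stats.ttest_rel(general, domain)   # paired across matched seeds
print(f"paired t-test: t={t:.2f}, p={p:.3f}")
```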

Circularity Check

0 steps flagged

No significant circularity in this empirical study

full rationale

This is a purely empirical paper that reports experimental comparisons of MAE pretraining strategies on the iNatSounds dataset. There are no derivations, equations, fitted parameters, or mathematical claims that reduce to their own inputs by construction. Central claims rest on held-out transfer performance metrics rather than any self-referential definitions, uniqueness theorems, or ansatzes imported via self-citation. Results are grounded in direct experimentation against external benchmarks and off-the-shelf models, making the study self-contained with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim rests on standard transfer-learning assumptions that MAE representations learned on general audio remain useful after fine-tuning on weakly labeled bioacoustic data; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Masked autoencoder pretraining produces transferable representations across audio domains when data scale is sufficient.
    Invoked when interpreting why general-audio pretraining outperforms domain-specific continuation.

pith-pipeline@v0.9.0 · 5572 in / 1274 out tokens · 42849 ms · 2026-05-15T05:41:23.006797+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 2 internal anchors

  1. [1]

     Mae-ast: Masked autoencoding audio spectrogram transformer

     Alan Baade, Puyuan Peng, and David Harwath. Mae-ast: Masked autoencoding audio spectrogram transformer. In Proc. Interspeech 2022, pages 2438–2442, 2022.

  2. [2]

     Entropy-based analysis of influential factors for underwater acoustic target recognition in passive sonar data

     Junho Bae, Mingu Kang, and Youngmin Choo. Entropy-based analysis of influential factors for underwater acoustic target recognition in passive sonar data. Ocean Engineering, 342:122908, 2025.

  3. [3]

     wav2vec 2.0: A framework for self-supervised learning of speech representations

     Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460, 2020.

  4. [4]

     Global biodiversity: indicators of recent declines

     Stuart HM Butchart, Matt Walpole, Ben Collen, Arco Van Strien, Jörn PW Scharlemann, Rosamunde EA Almond, Jonathan EM Baillie, Bastian Bomhard, Claire Brown, John Bruno, et al. Global biodiversity: indicators of recent declines. Science, 328(5982):1164–1168, 2010.

  5. [5]

     The iNaturalist sounds dataset

     Mustafa Chasmai, Alexander Shepard, Subhransu Maji, and Grant Van Horn. The iNaturalist sounds dataset. Advances in Neural Information Processing Systems, 37:132524–132544, 2024.

  6. [6]

     Audio geolocation: A natural sounds benchmark

     Mustafa Chasmai, Wuao Liu, Subhransu Maji, and Grant Van Horn. Audio geolocation: A natural sounds benchmark. arXiv preprint arXiv:2505.18726, 2025.

  7. [7]

     Beats: audio pre-training with acoustic tokenizers

     Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, Wanxiang Che, Xiangzhan Yu, and Furu Wei. Beats: audio pre-training with acoustic tokenizers. In Proceedings of the 40th International Conference on Machine Learning, pages 5178–5193, 2023.

  8. [8]

     Eat: self-supervised pre-training with efficient audio transformer

     Wenxi Chen, Yuzhe Liang, Ziyang Ma, Zhisheng Zheng, and Xie Chen. Eat: self-supervised pre-training with efficient audio transformer. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pages 3807–3815, 2024.

  9. [9]

     Satmae: Pre-training transformers for temporal and multi-spectral satellite imagery

     Yezhen Cong, Samar Khanna, Chenlin Meng, Patrick Liu, Erik Rozi, Yutong He, Marshall Burke, David Lobell, and Stefano Ermon. Satmae: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems, 35:197–211, 2022.

  10. [10]

     Pervasive human-driven decline of life on earth points to the need for transformative change

     Sandra Díaz, Josef Settele, Eduardo S Brondízio, Hien T Ngo, John Agard, Almut Arneth, Patricia Balvanera, Kate A Brauman, Stuart HM Butchart, Kai MA Chan, et al. Pervasive human-driven decline of life on earth points to the need for transformative change. Science, 366(6471):eaax3100, 2019.

  11. [11]

     Clap: learning audio concepts from natural language supervision

     Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. Clap: learning audio concepts from natural language supervision. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023.

  12. [12]

     Insectset459: an open dataset of insect sounds for bioacoustic machine learning

     Marius Faiß, Burooj Ghani, and Dan Stowell. Insectset459: an open dataset of insect sounds for bioacoustic machine learning. arXiv preprint arXiv:2503.15074, 2025.

  13. [13]

     Masked autoencoders as spatiotemporal learners

     Christoph Feichtenhofer, Yanghao Li, Kaiming He, et al. Masked autoencoders as spatiotemporal learners. Advances in Neural Information Processing Systems, 35:35946–35958, 2022.

  14. [14]

     Fsd50k: an open dataset of human-labeled sound events

     Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. Fsd50k: an open dataset of human-labeled sound events. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:829–852, 2021.

  15. [15]

     Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition

     Jianlong Fu, Heliang Zheng, and Tao Mei. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4438–4446, 2017.

  16. [16]

     Audio set: An ontology and human-labeled dataset for audio events

     Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017, New Orleans, LA, 2017.

  17. [17]

     Classification of Anuran Species Using High Efficiency CNNs

     Stefan Vasilev Genev. Classification of Anuran Species Using High Efficiency CNNs. PhD thesis, Tilburg University, 2024.

  18. [18]

     Emerging opportunities and challenges for passive acoustics in ecological assessment and monitoring

     Rory Gibb, Ella Browning, Paul Glover-Kapfer, and Kate E Jones. Emerging opportunities and challenges for passive acoustics in ecological assessment and monitoring. Methods in Ecology and Evolution, 10(2):169–185, 2019.

  19. [19]

     Audioclip: Extending clip to image, text and audio

     Andrey Guzhov, Federico Raue, Jörn Hees, and Andreas Dengel. Audioclip: Extending clip to image, text and audio. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 976–980. IEEE, 2022.

  20. [20]

     Aves: Animal vocalization encoder based on self-supervision

     Masato Hagiwara. Aves: Animal vocalization encoder based on self-supervision. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023.

  21. [21]

     Masked autoencoders are scalable vision learners

     Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.

  22. [22]

     Searching for mobilenetv3

     Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1314–1324, 2019.

  23. [23]

     Hubert: Self-supervised speech representation learning by masked prediction of hidden units

     Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021.

  24. [24]

     Masked autoencoders that listen

     Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, and Christoph Feichtenhofer. Masked autoencoders that listen. Advances in Neural Information Processing Systems, 35:28708–28720, 2022.

  25. [25]

     iNaturalist

     iNaturalist. https://www.inaturalist.org, 2026. Accessed: 2026-02-23.

  26. [26]

     Birdnet: A deep learning solution for avian diversity monitoring

     Stefan Kahl, Connor M Wood, Maximilian Eibl, and Holger Klinck. Birdnet: A deep learning solution for avian diversity monitoring. Ecological Informatics, 61:101236, 2021.

  27. [27]

     3d object representations for fine-grained categorization

     Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.

  28. [28]

     Mavis: A multimodal conversational assistant for avian species

     Yevheniia Kryklyvets, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jinxing Zhou, Fahad Shahbaz Khan, Rao Muhammad Anwer, Salman Khan, and Hisham Cholakkal. Mavis: A multimodal conversational assistant for avian species. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 28601–28627, 2025.

  29. [29]

     Bilinear cnn models for fine-grained visual recognition

     Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear cnn models for fine-grained visual recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 1449–1457, 2015.

  30. [30]

     Bev-mae: Bird’s eye view masked autoencoders for point cloud pre-training in autonomous driving scenarios

     Zhiwei Lin, Yongtao Wang, Shengxiang Qi, Nan Dong, and Ming-Hsuan Yang. Bev-mae: Bird’s eye view masked autoencoders for point cloud pre-training in autonomous driving scenarios. Proceedings of the AAAI Conference on Artificial Intelligence, 38(4):3531–3539, 2024.

  31. [31]

     Fine-Grained Visual Classification of Aircraft

     Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.

  32. [32]

     AVEX: What Matters for Animal Vocalization Encoding

     Marius Miron, David Robinson, Milad Alizadeh, Ellen Gilsenan-McMahon, Gagan Narula, Emmanuel Chemla, Maddie Cusimano, Felix Effenberger, Masato Hagiwara, Benjamin Hoffman, et al. What matters for bioacoustic encoding. arXiv preprint arXiv:2508.11845, 2025.

  33. [33]

     Mixture of mixups for multi-label classification of rare anuran sounds

     Ilyass Moummad, Nicolas Farrugia, Romain Serizel, Jeremy Froidevaux, and Vincent Lostanlen. Mixture of mixups for multi-label classification of rare anuran sounds. In 2024 32nd European Signal Processing Conference (EUSIPCO), pages 1282–1286. IEEE, 2024.

  34. [34]

     Lv-mae: Learning long video representations through masked-embedding autoencoders

     Ilan Naiman, Emanuel Ben-Baruch, Oron Anschel, Alon Shoshan, Igor Kviatkovsky, Manoj Aggarwal, and Gerard Medioni. Lv-mae: Learning long video representations through masked-embedding autoencoders. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21398–21407, 2025.

  35. [35]

     The mel scale

     Paul Pedersen. The mel scale. Journal of Music Theory, 9(2):295–308, 1965.

  36. [36]

     Learning transferable visual models from natural language supervision

     Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  37. [37]

     Can masked autoencoders also listen to birds?

     Lukas Rauch, René Heinrich, Ilyass Moummad, Alexis Joly, Bernhard Sick, and Christoph Scholz. Can masked autoencoders also listen to birds? Transactions on Machine Learning Research Journal, 2025.

  38. [38]

     Birdset: A large-scale dataset for audio classification in avian bioacoustics

     Lukas Rauch, Raphael Schwinger, Moritz Wirth, René Heinrich, Denis Huseljic, Marek Herde, Jonas Lange, Stefan Kahl, Bernhard Sick, Sven Tomforde, et al. Birdset: A large-scale dataset for audio classification in avian bioacoustics. In International Conference on Learning Representations, pages 29482–29520, 2025.

  39. [39]

     Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning

     Colorado J Reed, Ritwik Gupta, Shufan Li, Sarah Brockman, Christopher Funk, Brian Clipp, Kurt Keutzer, Salvatore Candido, Matt Uyttendaele, and Trevor Darrell. Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4088–4099, 2023.

  40. [40]

     Transferable models for bioacoustics with human language supervision

     David Robinson, Adelaide Robinson, and Lily Akrapongpisak. Transferable models for bioacoustics with human language supervision. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1316–1320. IEEE, 2024.

  41. [41]

     NatureLM-audio: an audio-language foundation model for bioacoustics

     David Robinson, Marius Miron, Masato Hagiwara, and Olivier Pietquin. NatureLM-audio: an audio-language foundation model for bioacoustics. In The Thirteenth International Conference on Learning Representations, 2025.

  42. [42]

     Passive acoustic monitoring provides a fresh perspective on fundamental ecological questions

     Samuel RP-J Ross, Darren P O’Connell, Jessica L Deichmann, Camille Desjonquères, Amandine Gasc, Jennifer N Phillips, Sarab S Sethi, Connor M Wood, and Zuzana Burivalova. Passive acoustic monitoring provides a fresh perspective on fundamental ecological questions. Functional Ecology, 37(4):959–975, 2023.

  43. [43]

     Towards a global terrestrial species monitoring program

     Dirk S Schmeller, Romain Julliard, Peter J Bellingham, Monika Böhm, Neil Brummitt, Alessandro Chiarucci, Denis Couvet, Sarah Elmendorf, David M Forsyth, Jaime García Moreno, et al. Towards a global terrestrial species monitoring program. Journal for Nature Conservation, 25:51–57, 2015.

  44. [44]

     wav2vec: Unsupervised pre-training for speech recognition

     Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. wav2vec: Unsupervised pre-training for speech recognition. In Proc. Interspeech 2019, pages 3465–3469, 2019.

  45. [45]

     Foundation models for bioacoustics–a comparative review

     Raphael Schwinger, Paria Vali Zadeh, Lukas Rauch, Mats Kurz, Tom Hauschild, Sam Lapp, and Sven Tomforde. Foundation models for bioacoustics–a comparative review. Ecological Informatics, page 103765, 2026.

  46. [46]

     Sam audio: Segment anything in audio

     Bowen Shi, Andros Tjandra, John Hoffman, Helin Wang, Yi-Chiao Wu, Luya Gao, Julius Richter, Matt Le, Apoorv Vyas, Sanyuan Chen, et al. Sam audio: Segment anything in audio. arXiv preprint arXiv:2512.18099, 2025.

  47. [47]

     Terrestrial passive acoustic monitoring: review and perspectives

     Larissa Sayuri Moreira Sugai, Thiago Sanna Freire Silva, José Wagner Ribeiro Jr, and Diego Llusia. Terrestrial passive acoustic monitoring: review and perspectives. BioScience, 69(1):15–25, 2019.

  48. [48]

     Merlin l48 spectrogram dataset

     Aaron Sun, Subhransu Maji, and Grant Van Horn. Merlin l48 spectrogram dataset. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.

  49. [49]

     Cross-scale mae: A tale of multiscale exploitation in remote sensing

     Maofeng Tang, Andrei Cozma, Konstantinos Georgiou, and Hairong Qi. Cross-scale mae: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems, 36:20054–20066, 2023.

  50. [50]

     Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training

     Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in Neural Information Processing Systems, 35:10078–10093, 2022.

  51. [51]

     The inaturalist species classification and detection dataset

     Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8769–8778, 2018.

  52. [52]

     Exploring fine-grained audiovisual categorization with the SSW60 dataset

     Grant Van Horn, Rui Qian, Kimberly Wilber, Hartwig Adam, Oisin Mac Aodha, and Serge Belongie. Exploring fine-grained audiovisual categorization with the SSW60 dataset. In European Conference on Computer Vision, pages 271–289. Springer, 2022.

  53. [53]

     Perch 2.0: The bittern lesson for bioacoustics

     Bart van Merriënboer, Vincent Dumoulin, Jenny Hamer, Lauren Harrell, Andrea Burns, and Tom Denton. Perch 2.0: The bittern lesson for bioacoustics. arXiv preprint arXiv:2508.04665, 2025.

  54. [54]

     Revisiting mae pre-training for 3d medical image segmentation

     Tassilo Wald, Constantin Ulrich, Stanislav Lukyanenko, Andrei Goncharov, Alberto Paderno, Maximilian Miller, Leander Maerkisch, Paul Jaeger, and Klaus Maier-Hein. Revisiting mae pre-training for 3d medical image segmentation. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 5186–5196, 2025.

  55. [55]

     Videomae v2: Scaling video masked autoencoders with dual masking

     Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14549–14560, 2023.

  56. [56]

     Diffusion models as masked autoencoders

     Chen Wei, Karttikeya Mangalam, Po-Yao Huang, Yanghao Li, Haoqi Fan, Hu Xu, Huiyu Wang, Cihang Xie, Alan Yuille, and Christoph Feichtenhofer. Diffusion models as masked autoencoders. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16284–16294, 2023.

  57. [57]

     Xeno-canto: Sharing bird sounds from around the world

     Xeno-Canto Foundation. Xeno-canto: Sharing bird sounds from around the world. https://xeno-canto.org. Accessed: 2026-02-23.