pith. machine review for the scientific record.

arxiv: 2605.14031 · v1 · submitted 2026-05-13 · 💻 cs.SD · cs.CV · cs.LG

Recognition: 2 theorem links


Masked Autoencoders with Limited Data: Does It Work? A Fine-Grained Bioacoustics Case Study

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:41 UTC · model grok-4.3

classification 💻 cs.SD · cs.CV · cs.LG
keywords masked autoencoders · bioacoustics · species classification · self-supervised learning · pretraining scale · audio transfer · weakly labeled data · iNatSounds

The pith

For fine-grained bioacoustic classification with limited labels, pretraining on large general audio datasets beats additional domain-specific masked autoencoder training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether masked autoencoders help species recognition when only weakly labeled recordings are available. It compares models pretrained on massive general audio collections against versions that receive extra masked reconstruction training on bioacoustic data. Results show that general pretraining already delivers the strongest transfer, while extra domain-specific steps add little or can lower accuracy. This matters for practitioners who must decide whether to invest in custom pretraining or simply use existing large audio models. The work clarifies that pretraining data scale outweighs objective tailoring when the target domain offers only a moderate amount of data.

Core claim

Models pretrained on diverse general audio data achieve the best transfer performance on iNatSounds. Additional masked reconstruction pretraining on domain-specific data provides limited benefits and may even degrade performance relative to off-the-shelf models. Selective data filtering offers a negligible advantage when the overall data scale is limited. In moderate-sized fine-grained bioacoustic settings, pretraining scale dominates objective design.

What carries the argument

Masked autoencoder pretraining applied to audio spectrograms, with systematic variation of pretraining data scale and domain for downstream species classification on iNatSounds.
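
To make the machinery concrete, here is a minimal sketch of masked spectrogram autoencoding in PyTorch. The architecture sizes, patch size, and all names are illustrative stand-ins, not the paper's AudioMAE configuration; only the 0.8 masking ratio and the patch-wise normalized target follow the Figure 2 caption.

```python
import torch
import torch.nn as nn

PATCH = 16        # square patch size on the (mel, time) grid (illustrative)
MASK_RATIO = 0.8  # masking ratio reported in the Figure 2 caption
DIM = 192         # embedding width (illustrative)

def patchify(spec):
    """(B, 1, F, T) spectrogram -> (B, N, PATCH*PATCH) flattened patches."""
    b, _, f, t = spec.shape
    blocks = spec.unfold(2, PATCH, PATCH).unfold(3, PATCH, PATCH)
    return blocks.reshape(b, -1, PATCH * PATCH)

class TinyMAE(nn.Module):
    def __init__(self, n_patches):
        super().__init__()
        self.embed = nn.Linear(PATCH * PATCH, DIM)
        self.pos = nn.Embedding(n_patches, DIM)           # learned positions
        self.mask_token = nn.Parameter(torch.zeros(DIM))
        layer = lambda: nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.decoder = nn.TransformerEncoder(layer(), num_layers=1)
        self.head = nn.Linear(DIM, PATCH * PATCH)

    def forward(self, spec):
        patches = patchify(spec)                          # (B, N, P*P)
        b, n, _ = patches.shape
        # Patch-wise normalized reconstruction target (per the Figure 2 caption).
        mu = patches.mean(-1, keepdim=True)
        sd = patches.std(-1, keepdim=True) + 1e-6
        target = (patches - mu) / sd
        # Randomly keep a (1 - MASK_RATIO) fraction of patches visible.
        n_keep = max(1, int(n * (1 - MASK_RATIO)))
        order = torch.rand(b, n, device=spec.device).argsort(dim=-1)
        keep, drop = order[:, :n_keep], order[:, n_keep:]
        take = lambda x, i: torch.gather(x, 1, i.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        # Encode visible patches only, then decode with mask tokens inserted.
        z = self.encoder(self.embed(take(patches, keep)) + self.pos(keep))
        masked = self.mask_token.expand(b, drop.size(1), DIM) + self.pos(drop)
        decoded = self.decoder(torch.cat([z, masked], dim=1))
        pred = self.head(decoded[:, n_keep:])             # predictions at masked slots
        return ((pred - take(target, drop)) ** 2).mean()  # loss on masked patches only

model = TinyMAE(n_patches=(128 // PATCH) * (512 // PATCH))
loss = model(torch.randn(2, 1, 128, 512))                 # 128 mel bins x 512 frames
loss.backward()
```

After pretraining, the decoder is discarded and the encoder is fine-tuned for species classification, which is the transfer step the paper evaluates.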

If this is right

  • Off-the-shelf general-audio models should be the default starting point for bioacoustic tasks with moderate data (a fine-tuning sketch follows this list).
  • Additional masked autoencoder pretraining on limited domain data is not worth the compute when general pretraining already exists.
  • Data curation steps such as filtering add little value once overall pretraining volume is constrained.
  • Similar patterns are likely in other fine-grained audio domains that rely on weakly labeled recordings.
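
A minimal sketch of that default workflow, assuming a PyTorch setup: take a pretrained general-audio encoder and fine-tune it with a fresh classification head. `load_pretrained_audio_encoder` and the class count are hypothetical placeholders; substitute the loader and label space of your actual checkpoint and dataset.

```python
import torch
import torch.nn as nn

NUM_SPECIES = 5569  # hypothetical class count; use your dataset's label space

def load_pretrained_audio_encoder():
    # Stand-in for loading a public checkpoint (e.g. an AudioSet-pretrained
    # AudioMAE ViT-B): any module mapping (B, 1, F, T) spectrograms to
    # (B, 768) embeddings fits this slot.
    return nn.Sequential(nn.Flatten(), nn.Linear(128 * 512, 768))

encoder = load_pretrained_audio_encoder()
head = nn.Linear(768, NUM_SPECIES)      # fresh species-classification head
model = nn.Sequential(encoder, head)

# Full fine-tuning; for a linear probe, freeze the encoder instead:
# for p in encoder.parameters(): p.requires_grad = False
opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
loss_fn = nn.CrossEntropyLoss()

spec = torch.randn(8, 1, 128, 512)      # a batch of mel spectrograms
labels = torch.randint(0, NUM_SPECIES, (8,))
opt.zero_grad()
loss = loss_fn(model(spec), labels)
loss.backward()
opt.step()
```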

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The finding suggests that for other weakly labeled audio tasks, simply scaling general pretraining may be more reliable than designing new self-supervised objectives.
  • It raises the question of whether masked reconstruction loses fine species distinctions that supervised signals from broad audio corpora preserve.
  • Practitioners could test whether the same scale-over-objective pattern appears when the target domain has even fewer total hours of audio.

Load-bearing premise

Observed performance gaps arise mainly from differences in pretraining data volume rather than from variations in model capacity, optimizer settings, or label noise in the weakly annotated recordings.

What would settle it

A controlled experiment in which model size, optimizer, and training schedule are matched exactly yet domain-specific masked autoencoder pretraining on a larger bioacoustic corpus still underperforms general-audio pretraining.
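
A sketch of what that control could look like: one shared recipe, with only the pretraining corpus varying between regimes. All names and values are illustrative, not the paper's settings.

```python
# One shared recipe; only the pretraining corpus differs between regimes.
SHARED_RECIPE = dict(
    arch="ViT-B", optimizer="AdamW", lr=1e-4, schedule="cosine",
    pretrain_steps=200_000, batch_size=256,
)

REGIMES = {
    "general-audio":       {"pretrain_data": "large general audio corpus"},
    "domain-specific MAE": {"pretrain_data": "larger bioacoustic corpus"},
}

for name, regime in REGIMES.items():
    config = {**SHARED_RECIPE, **regime}
    # pretrain(config); acc = finetune_and_eval(config)   # hypothetical helpers
    print(name, "->", config)
```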

Figures

Figures reproduced from arXiv: 2605.14031 by Grant Van Horn, Mustafa Chasmai, Subhransu Maji, Wuao Liu.

Figure 1
Figure 1: Overview. We investigate MAE pretraining for fine-grained bioacoustic recognition. An audio encoder is first trained with a masked spectrogram reconstruction objective and then fine-tuned for species classification. We systematically evaluate its effectiveness under a modest data regime and find that pretraining scale plays a more critical role than continual pretraining.
Figure 2
Figure 2: Reconstruction performance. Qualitative examples of masked spectrogram reconstruction for three species in the iNatSounds validation set. We use a ViT-B encoder following AudioMAE's default configuration. Input spectrograms are masked with a ratio of 0.8. The model directly predicts masked pixel values for visualization. Following AudioMAE, we adopt a patch-wise normalized reconstruction target rather than…
Figure 3
Figure 3: Finetuning MAEs using different fractions of bioacoustic data. We show Top-1 validation accuracy on iNatSounds as a function of the available labeled samples.
Figure 5
Figure 5: Audio segments filtered by reconstruction loss. Left: distribution of confidence scores across audio segments. Middle: fraction of audio segments to keep under different confidence thresholds, along with the corresponding validation accuracy using a MobileNetV3 model trained on the filtered data. Right: Top-1 validation accuracy of ViT-B models fine-tuned on both the iNatSounds full dataset and the filtered subset…
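
A minimal sketch of the filtering step Figure 5 describes, under the assumption that segments are scored by masked-reconstruction loss and kept below a quantile threshold; the synthetic scores and names are placeholders, not the paper's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in per-segment masked-reconstruction losses (lower = more confident).
recon_loss = rng.gamma(shape=2.0, scale=1.0, size=10_000)

KEEP_FRACTION = 0.7                              # fraction of segments to retain
threshold = np.quantile(recon_loss, KEEP_FRACTION)
kept = recon_loss <= threshold

print(f"threshold={threshold:.3f}, kept {kept.mean():.1%} of segments")
# A classifier (e.g. the MobileNetV3 in Figure 5) would then be trained on
# the kept subset and compared against training on the full dataset.
```
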
read the original abstract

Bioacoustic recognition requires fine-grained acoustic understanding to distinguish similar-sounding species. However, many large-scale data repositories such as iNaturalist are weakly annotated, often with only a single positive species label per recording, making supervised learning particularly challenging. Inspired by advances in computer vision, recent approaches have shifted toward self-supervised learning to capture the underlying structure of audio without relying on exhaustive annotations. In particular, masked autoencoders (MAE) have shown strong transferability on massive audio corpora, yet their effectiveness in more modest bioacoustic settings remains underexplored. In this work, we conduct a systematic study of MAE pretraining for species classification on iNatSounds, analyzing the impacts of pretraining data scale, domain specificity, data curation, and transfer strategies. Consistent with prior work, we find that models pretrained on diverse general audio data achieve the best transfer performance on iNatSounds. Contrary to observations from large-scale audio benchmarks, we find that (1) additional masked reconstruction pretraining on domain-specific data provides limited benefits and may even degrade performance relative to off-the-shelf models, and (2) selective data filtering offers a negligible advantage when the overall data scale is limited. Our results indicate that, in moderate-sized fine-grained bioacoustic settings, pretraining scale dominates objective design. These findings further clarify when MAE-based pretraining is effective and provide practical guidance for model selection under limited supervision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts a systematic empirical study of masked autoencoder (MAE) pretraining for fine-grained species classification on the iNatSounds dataset. It analyzes the effects of pretraining data scale, domain specificity, data curation, and transfer strategies, reporting that off-the-shelf models pretrained on large-scale general audio achieve the best transfer performance. Additional domain-specific MAE pretraining provides limited benefits or can degrade results, and selective filtering yields negligible gains when overall scale is limited. The central claim is that, in moderate-sized fine-grained bioacoustic settings, pretraining scale dominates objective design.

Significance. If the central attribution to scale holds after addressing controls, the work supplies actionable guidance for self-supervised audio models under weak supervision and limited domain data. It helps delineate when general large-scale pretraining is preferable to further domain-specific MAE efforts, which is useful for bioacoustics practitioners facing weakly labeled repositories like iNaturalist.

major comments (2)
  1. [Abstract and §4] The comparison between off-the-shelf general-audio checkpoints and custom domain-specific MAE runs does not report matched model capacity, optimizer, learning-rate schedule, or total training budget. Without these controls, performance differences on iNatSounds cannot be cleanly attributed to pretraining data scale rather than architectural or optimization mismatches. This directly affects the load-bearing claim in the abstract that scale dominates objective design.
  2. [§4.2] No variance estimates, statistical significance tests, or multiple random seeds are described for the reported transfer accuracies across pretraining regimes. This weakens confidence that the observed ordering (general > domain-specific MAE) is robust rather than an artifact of single-run variability or dataset-specific biases in the single-label iNatSounds recordings.
minor comments (2)
  1. [§3.3] The transfer-strategy ablation would be clearer with an explicit diagram showing the frozen vs. fine-tuned stages and the exact linear-probe protocol.
  2. [Table 2] Table 2 caption should explicitly state the number of parameters and the exact checkpoint sources for each general-audio baseline to facilitate replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript to incorporate additional controls and statistical reporting where feasible.

read point-by-point responses
  1. Referee: [Abstract and §4] The comparison between off-the-shelf general-audio checkpoints and custom domain-specific MAE runs does not report matched model capacity, optimizer, learning-rate schedule, or total training budget. Without these controls, performance differences on iNatSounds cannot be cleanly attributed to pretraining data scale rather than architectural or optimization mismatches. This directly affects the load-bearing claim in the abstract that scale dominates objective design.

    Authors: We appreciate the referee highlighting this issue. The off-the-shelf general-audio models refer to publicly released checkpoints (e.g., AudioMAE variants pretrained on AudioSet) whose architectures, capacities, and training protocols are documented in the original publications. Our domain-specific MAE runs used the identical ViT-based architecture and followed the standard MAE optimization settings (AdamW, cosine schedule, etc.) as closely as possible given the available compute. To strengthen the attribution to scale, we will add a supplementary table in the revised manuscript that explicitly lists model capacity, optimizer, learning-rate schedule, and total training budget for every pretraining regime. This will allow readers to evaluate comparability directly. revision: yes

  2. Referee: [§4.2] No variance estimates, statistical significance tests, or multiple random seeds are described for the reported transfer accuracies across pretraining regimes. This weakens confidence that the observed ordering (general > domain-specific MAE) is robust rather than an artifact of single-run variability or dataset-specific biases in the single-label iNatSounds recordings.

    Authors: We agree that variance estimates and statistical tests would increase confidence in the reported ordering. Our experiments were run with single random seeds owing to the substantial compute required for MAE pretraining on large audio corpora. In the revision we will re-evaluate the key transfer results using at least three independent random seeds, report means and standard deviations, and include paired statistical significance tests (e.g., t-tests) between the general-audio and domain-specific regimes. This directly addresses concerns about single-run variability. revision: yes
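
For concreteness, a sketch of the promised reporting, assuming per-seed accuracies from matched runs; the numbers below are placeholders, not results from the paper.

```python
import numpy as np
from scipy import stats

# Placeholder per-seed Top-1 accuracies; not results from the paper.
general = np.array([0.612, 0.618, 0.609])
domain  = np.array([0.598, 0.605, 0.601])

print(f"general: {general.mean():.3f} +/- {general.std(ddof=1):.3f}")
print(f"domain:  {domain.mean():.3f} +/- {domain.std(ddof=1):.3f}")

t, p = stats.ttest_rel(general, domain)   # paired across matched seeds
print(f"paired t-test: t={t:.2f}, p={p:.3f}")
```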

Circularity Check

0 steps flagged

No significant circularity in this empirical study

full rationale

This is a purely empirical paper that reports experimental comparisons of MAE pretraining strategies on the iNatSounds dataset. There are no derivations, equations, fitted parameters, or mathematical claims that reduce to their own inputs by construction. Central claims rest on held-out transfer performance metrics rather than any self-referential definitions, uniqueness theorems, or ansatzes imported via self-citation. Results are grounded in direct experimentation against external benchmarks and off-the-shelf models, making the study self-contained with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim rests on standard transfer-learning assumptions that MAE representations learned on general audio remain useful after fine-tuning on weakly labeled bioacoustic data; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Masked autoencoder pretraining produces transferable representations across audio domains when data scale is sufficient.
    Invoked when interpreting why general-audio pretraining outperforms domain-specific continuation.

pith-pipeline@v0.9.0 · 5572 in / 1274 out tokens · 42849 ms · 2026-05-15T05:41:23.006797+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 2 internal anchors

  1. [1]

     Mae-ast: Masked autoencoding audio spectrogram transformer

     Alan Baade, Puyuan Peng, and David Harwath. Mae-ast: Masked autoencoding audio spectrogram transformer. In Proc. Interspeech 2022, pages 2438–2442, 2022.

  2. [2]

     Entropy-based analysis of influential factors for underwater acoustic target recognition in passive sonar data

     Junho Bae, Mingu Kang, and Youngmin Choo. Entropy-based analysis of influential factors for underwater acoustic target recognition in passive sonar data. Ocean Engineering, 342:122908, 2025.

  3. [3]

     wav2vec 2.0: A framework for self-supervised learning of speech representations

     Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460, 2020.

  4. [4]

     Global biodiversity: indicators of recent declines

     Stuart HM Butchart, Matt Walpole, Ben Collen, Arco Van Strien, Jörn PW Scharlemann, Rosamunde EA Almond, Jonathan EM Baillie, Bastian Bomhard, Claire Brown, John Bruno, et al. Global biodiversity: indicators of recent declines. Science, 328(5982):1164–1168, 2010.

  5. [5]

     The iNaturalist sounds dataset

     Mustafa Chasmai, Alexander Shepard, Subhransu Maji, and Grant Van Horn. The iNaturalist sounds dataset. Advances in Neural Information Processing Systems, 37:132524–132544, 2024.

  6. [6]

     Audio geolocation: A natural sounds benchmark

     Mustafa Chasmai, Wuao Liu, Subhransu Maji, and Grant Van Horn. Audio geolocation: A natural sounds benchmark. arXiv preprint arXiv:2505.18726, 2025.

  7. [7]

     Beats: audio pre-training with acoustic tokenizers

     Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, Wanxiang Che, Xiangzhan Yu, and Furu Wei. Beats: audio pre-training with acoustic tokenizers. In Proceedings of the 40th International Conference on Machine Learning, pages 5178–5193, 2023.

  8. [8]

     Eat: self-supervised pre-training with efficient audio transformer

     Wenxi Chen, Yuzhe Liang, Ziyang Ma, Zhisheng Zheng, and Xie Chen. Eat: self-supervised pre-training with efficient audio transformer. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pages 3807–3815, 2024.

  9. [9]

     Satmae: Pre-training transformers for temporal and multi-spectral satellite imagery

     Yezhen Cong, Samar Khanna, Chenlin Meng, Patrick Liu, Erik Rozi, Yutong He, Marshall Burke, David Lobell, and Stefano Ermon. Satmae: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems, 35:197–211, 2022.

  10. [10]

     Pervasive human-driven decline of life on earth points to the need for transformative change

     Sandra Díaz, Josef Settele, Eduardo S Brondízio, Hien T Ngo, John Agard, Almut Arneth, Patricia Balvanera, Kate A Brauman, Stuart HM Butchart, Kai MA Chan, et al. Pervasive human-driven decline of life on earth points to the need for transformative change. Science, 366(6471):eaax3100, 2019.

  11. [11]

     Clap: learning audio concepts from natural language supervision

     Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. Clap: learning audio concepts from natural language supervision. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023.

  12. [12]

     Insectset459: an open dataset of insect sounds for bioacoustic machine learning

     Marius Faiß, Burooj Ghani, and Dan Stowell. Insectset459: an open dataset of insect sounds for bioacoustic machine learning. arXiv preprint arXiv:2503.15074, 2025.

  13. [13]

     Masked autoencoders as spatiotemporal learners

     Christoph Feichtenhofer, Yanghao Li, Kaiming He, et al. Masked autoencoders as spatiotemporal learners. Advances in Neural Information Processing Systems, 35:35946–35958, 2022.

  14. [14]

     Fsd50k: an open dataset of human-labeled sound events

     Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. Fsd50k: an open dataset of human-labeled sound events. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:829–852, 2021.

  15. [15]

     Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition

     Jianlong Fu, Heliang Zheng, and Tao Mei. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4438–4446, 2017.

  16. [16]

     Audio set: An ontology and human-labeled dataset for audio events

     Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017, New Orleans, LA, 2017.

  17. [17]

     Classification of Anuran Species Using High Efficiency CNNs

     Stefan Vasilev Genev. Classification of Anuran Species Using High Efficiency CNNs. PhD thesis, Tilburg University, 2024.

  18. [18]

     Emerging opportunities and challenges for passive acoustics in ecological assessment and monitoring

     Rory Gibb, Ella Browning, Paul Glover-Kapfer, and Kate E Jones. Emerging opportunities and challenges for passive acoustics in ecological assessment and monitoring. Methods in Ecology and Evolution, 10(2):169–185, 2019.

  19. [19]

     Audioclip: Extending clip to image, text and audio

     Andrey Guzhov, Federico Raue, Jörn Hees, and Andreas Dengel. Audioclip: Extending clip to image, text and audio. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 976–980. IEEE, 2022.

  20. [20]

     Aves: Animal vocalization encoder based on self-supervision

     Masato Hagiwara. Aves: Animal vocalization encoder based on self-supervision. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023.

  21. [21]

     Masked autoencoders are scalable vision learners

     Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.

  22. [22]

     Searching for mobilenetv3

     Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1314–1324, 2019.

  23. [23]

     Hubert: Self-supervised speech representation learning by masked prediction of hidden units

     Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021.

  24. [24]

     Masked autoencoders that listen

     Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, and Christoph Feichtenhofer. Masked autoencoders that listen. Advances in Neural Information Processing Systems, 35:28708–28720, 2022.

  25. [25]

     iNaturalist

     iNaturalist. https://www.inaturalist.org, 2026. Accessed: 2026-02-23.

  26. [26]

     Birdnet: A deep learning solution for avian diversity monitoring

     Stefan Kahl, Connor M Wood, Maximilian Eibl, and Holger Klinck. Birdnet: A deep learning solution for avian diversity monitoring. Ecological Informatics, 61:101236, 2021.

  27. [27]

     3d object representations for fine-grained categorization

     Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.

  28. [28]

     Mavis: A multimodal conversational assistant for avian species

     Yevheniia Kryklyvets, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jinxing Zhou, Fahad Shahbaz Khan, Rao Muhammad Anwer, Salman Khan, and Hisham Cholakkal. Mavis: A multimodal conversational assistant for avian species. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 28601–28627, 2025.

  29. [29]

     Bilinear cnn models for fine-grained visual recognition

     Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear cnn models for fine-grained visual recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 1449–1457, 2015.

  30. [30]

     Bev-mae: Bird’s eye view masked autoencoders for point cloud pre-training in autonomous driving scenarios

     Zhiwei Lin, Yongtao Wang, Shengxiang Qi, Nan Dong, and Ming-Hsuan Yang. Bev-mae: Bird’s eye view masked autoencoders for point cloud pre-training in autonomous driving scenarios. Proceedings of the AAAI Conference on Artificial Intelligence, 38(4):3531–3539, 2024.

  31. [31]

     Fine-Grained Visual Classification of Aircraft

     Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.

  32. [32]

     AVEX: What Matters for Animal Vocalization Encoding

     Marius Miron, David Robinson, Milad Alizadeh, Ellen Gilsenan-McMahon, Gagan Narula, Emmanuel Chemla, Maddie Cusimano, Felix Effenberger, Masato Hagiwara, Benjamin Hoffman, et al. What matters for bioacoustic encoding. arXiv preprint arXiv:2508.11845, 2025.

  33. [33]

     Mixture of mixups for multi-label classification of rare anuran sounds

     Ilyass Moummad, Nicolas Farrugia, Romain Serizel, Jeremy Froidevaux, and Vincent Lostanlen. Mixture of mixups for multi-label classification of rare anuran sounds. In 2024 32nd European Signal Processing Conference (EUSIPCO), pages 1282–1286. IEEE, 2024.

  34. [34]

     Lv-mae: Learning long video representations through masked-embedding autoencoders

     Ilan Naiman, Emanuel Ben-Baruch, Oron Anschel, Alon Shoshan, Igor Kviatkovsky, Manoj Aggarwal, and Gerard Medioni. Lv-mae: Learning long video representations through masked-embedding autoencoders. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21398–21407, 2025.

  35. [35]

     The mel scale

     Paul Pedersen. The mel scale. Journal of Music Theory, 9(2):295–308, 1965.

  36. [36]

     Learning transferable visual models from natural language supervision

     Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  37. [37]

     Can masked autoencoders also listen to birds?

     Lukas Rauch, René Heinrich, Ilyass Moummad, Alexis Joly, Bernhard Sick, and Christoph Scholz. Can masked autoencoders also listen to birds? Transactions on Machine Learning Research Journal, 2025.

  38. [38]

     Birdset: A large-scale dataset for audio classification in avian bioacoustics

     Lukas Rauch, Raphael Schwinger, Moritz Wirth, René Heinrich, Denis Huseljic, Marek Herde, Jonas Lange, Stefan Kahl, Bernhard Sick, Sven Tomforde, et al. Birdset: A large-scale dataset for audio classification in avian bioacoustics. In International Conference on Learning Representations, pages 29482–29520, 2025.

  39. [39]

     Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning

     Colorado J Reed, Ritwik Gupta, Shufan Li, Sarah Brockman, Christopher Funk, Brian Clipp, Kurt Keutzer, Salvatore Candido, Matt Uyttendaele, and Trevor Darrell. Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4088–4099, 2023.

  40. [40]

     Transferable models for bioacoustics with human language supervision

     David Robinson, Adelaide Robinson, and Lily Akrapongpisak. Transferable models for bioacoustics with human language supervision. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1316–1320. IEEE, 2024.

  41. [41]

     NatureLM-audio: an audio-language foundation model for bioacoustics

     David Robinson, Marius Miron, Masato Hagiwara, and Olivier Pietquin. NatureLM-audio: an audio-language foundation model for bioacoustics. In The Thirteenth International Conference on Learning Representations, 2025.

  42. [42]

     Passive acoustic monitoring provides a fresh perspective on fundamental ecological questions

     Samuel RP-J Ross, Darren P O’Connell, Jessica L Deichmann, Camille Desjonquères, Amandine Gasc, Jennifer N Phillips, Sarab S Sethi, Connor M Wood, and Zuzana Burivalova. Passive acoustic monitoring provides a fresh perspective on fundamental ecological questions. Functional Ecology, 37(4):959–975, 2023.

  43. [43]

     Towards a global terrestrial species monitoring program

     Dirk S Schmeller, Romain Julliard, Peter J Bellingham, Monika Böhm, Neil Brummitt, Alessandro Chiarucci, Denis Couvet, Sarah Elmendorf, David M Forsyth, Jaime García Moreno, et al. Towards a global terrestrial species monitoring program. Journal for Nature Conservation, 25:51–57, 2015.

  44. [44]

     wav2vec: Unsupervised pre-training for speech recognition

     Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. wav2vec: Unsupervised pre-training for speech recognition. In Proc. Interspeech 2019, pages 3465–3469, 2019.

  45. [45]

     Foundation models for bioacoustics–a comparative review

     Raphael Schwinger, Paria Vali Zadeh, Lukas Rauch, Mats Kurz, Tom Hauschild, Sam Lapp, and Sven Tomforde. Foundation models for bioacoustics–a comparative review. Ecological Informatics, page 103765, 2026.

  46. [46]

     Sam audio: Segment anything in audio

     Bowen Shi, Andros Tjandra, John Hoffman, Helin Wang, Yi-Chiao Wu, Luya Gao, Julius Richter, Matt Le, Apoorv Vyas, Sanyuan Chen, et al. Sam audio: Segment anything in audio. arXiv preprint arXiv:2512.18099, 2025.

  47. [47]

     Terrestrial passive acoustic monitoring: review and perspectives

     Larissa Sayuri Moreira Sugai, Thiago Sanna Freire Silva, José Wagner Ribeiro Jr, and Diego Llusia. Terrestrial passive acoustic monitoring: review and perspectives. BioScience, 69(1):15–25, 2019.

  48. [48]

     Merlin l48 spectrogram dataset

     Aaron Sun, Subhransu Maji, and Grant Van Horn. Merlin l48 spectrogram dataset. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.

  49. [49]

     Cross-scale mae: A tale of multiscale exploitation in remote sensing

     Maofeng Tang, Andrei Cozma, Konstantinos Georgiou, and Hairong Qi. Cross-scale mae: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems, 36:20054–20066, 2023.

  50. [50]

     Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training

     Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in Neural Information Processing Systems, 35:10078–10093, 2022.

  51. [51]

     The inaturalist species classification and detection dataset

     Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8769–8778, 2018.

  52. [52]

     Exploring fine-grained audiovisual categorization with the SSW60 dataset

     Grant Van Horn, Rui Qian, Kimberly Wilber, Hartwig Adam, Oisin Mac Aodha, and Serge Belongie. Exploring fine-grained audiovisual categorization with the SSW60 dataset. In European Conference on Computer Vision, pages 271–289. Springer, 2022.

  53. [53]

     Perch 2.0: The bittern lesson for bioacoustics

     Bart van Merriënboer, Vincent Dumoulin, Jenny Hamer, Lauren Harrell, Andrea Burns, and Tom Denton. Perch 2.0: The bittern lesson for bioacoustics. arXiv preprint arXiv:2508.04665, 2025.

  54. [54]

     Revisiting mae pre-training for 3d medical image segmentation

     Tassilo Wald, Constantin Ulrich, Stanislav Lukyanenko, Andrei Goncharov, Alberto Paderno, Maximilian Miller, Leander Maerkisch, Paul Jaeger, and Klaus Maier-Hein. Revisiting mae pre-training for 3d medical image segmentation. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 5186–5196, 2025.

  55. [55]

     Videomae v2: Scaling video masked autoencoders with dual masking

     Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14549–14560, 2023.

  56. [56]

     Diffusion models as masked autoencoders

     Chen Wei, Karttikeya Mangalam, Po-Yao Huang, Yanghao Li, Haoqi Fan, Hu Xu, Huiyu Wang, Cihang Xie, Alan Yuille, and Christoph Feichtenhofer. Diffusion models as masked autoencoders. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16284–16294, 2023.

  57. [57]

     Xeno-canto: Sharing bird sounds from around the world

     Xeno-Canto Foundation. Xeno-canto: Sharing bird sounds from around the world. https://xeno-canto.org. Accessed: 2026-02-23.