State Space Models for Bioacoustics: A Comparative Evaluation with Transformers

Chengyu Tang; Sanjeev Baskiyar

arxiv: 2512.03563 · v2 · submitted 2025-12-03 · 💻 cs.SD · cs.AI

State Space Models for Bioacoustics: A Comparative Evaluation with Transformers

Chengyu Tang , Sanjeev Baskiyar This is my paper

Pith reviewed 2026-05-17 02:38 UTC · model grok-4.3

classification 💻 cs.SD cs.AI

keywords bioacousticsmambastate space modelstransformersaudio representationself-supervised learningBEANS benchmarkenvironmental monitoring

0 comments

The pith

BioMamba matches Transformer accuracy on wildlife sound tasks while using far less VRAM.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BioMamba, a Mamba-based model for representing wildlife sounds, and tests whether this architecture can match the performance of leading Transformer models in bioacoustics. It pre-trains the model with self-supervised learning on a large audio collection and evaluates it on the BEANS benchmark for classification and detection tasks. The central result is that BioMamba delivers accuracy comparable to the Transformer model AVES while consuming substantially less VRAM. This matters because bioacoustic monitoring in the field often occurs on hardware with tight memory limits, where lower consumption allows longer operation or more sensors. The work therefore frames Mamba as a practical alternative for scaling environmental sound analysis.

Core claim

BioMamba is a Mamba-based audio representation model pre-trained via self-supervised learning on a large corpus of wildlife sounds. When tested on the BEANS benchmark, it reaches performance levels comparable to the state-of-the-art Transformer model AVES across diverse classification and detection tasks while significantly lowering VRAM consumption. The results indicate that the Mamba architecture offers a computationally efficient route for real-world environmental monitoring applications.

What carries the argument

BioMamba, a Mamba state-space architecture adapted for self-supervised audio representation of wildlife sounds.

If this is right

Bioacoustic classification and detection tasks become feasible on devices with lower memory capacity than required by Transformers.
Continuous environmental monitoring can run for longer periods or with more sensors under the same hardware constraints.
Self-supervised pre-training on large audio corpora succeeds with Mamba for wildlife sound data.
Mamba provides a direct alternative to Transformer models for scaling bioacoustic analysis in practice.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Mamba's linear scaling with sequence length may particularly suit the long recordings typical in bioacoustics.
Efficiency gains could extend to other audio domains where memory limits currently constrain Transformer use.
The approach invites direct comparisons on additional benchmarks to test whether the memory advantage holds across varied audio tasks.

Load-bearing premise

The reported VRAM reduction and accuracy parity result from the Mamba architecture itself rather than from differences in training details, model scale, or hardware setup.

What would settle it

A side-by-side retraining of both BioMamba and AVES from scratch on identical data splits, hyperparameters, and hardware while measuring exact VRAM usage and task accuracy.

Figures

Figures reproduced from arXiv: 2512.03563 by Chengyu Tang, Sanjeev Baskiyar.

read the original abstract

In this study, we evaluate the efficacy of the Mamba architecture bioacoustics by introducing BioMamba, a Mamba-based audio representation model for wildlife sounds. We pre-train a BioMamba using self-supervised learning on a large audio corpus and evaluate it on the BEANS benchmark across diverse classification and detection tasks. Compared to the state-of-the-art Transformer-based model (AVES), BioMamba achieves comparable performance while significantly reducing VRAM consumption. Our results demonstrate Mamba's potential as a computationally efficient alternative for real-world environmental monitoring.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BioMamba applies Mamba to bioacoustics and claims efficiency gains over AVES, but the comparison lacks the controls needed to isolate the architecture.

read the letter

This paper's main point is that their BioMamba model matches the AVES Transformer on BEANS tasks while using significantly less VRAM. They pre-train a Mamba-based audio model with self-supervised learning on a large corpus and evaluate it across classification and detection in wildlife sounds. That is the core result they want readers to take away. What is new is the concrete application of Mamba to this domain. Prior Mamba work existed for audio, so this is an extension rather than a fresh framework, but bringing it to bioacoustics and running it on the BEANS benchmark is a reasonable next step. The paper does well by keeping the focus on practical efficiency, which matters for environmental monitoring setups that often run on limited hardware. The efficiency angle is the part that could actually matter to people deploying these models in the field. The soft spots sit in the comparison. The abstract states comparable performance and reduced VRAM but gives no numbers, error bars, or training details. The stress-test concern is fair: without evidence that AVES was retrained from scratch under identical conditions on optimizer, batch size, preprocessing, and hardware, the VRAM savings and accuracy parity could trace to implementation choices instead of the state-space architecture itself. If the full paper supplies those controls, the claim strengthens; right now the evidence looks incomplete. This work is for researchers in bioacoustics or anyone testing efficient audio models for real-world use. A reader who needs lower-memory options for monitoring would get a useful data point. It deserves a serious referee because the empirical question is well-posed and the domain is applied enough that a referee can push for tighter methods without the paper being fundamentally broken. I would recommend sending it to peer review, with the main request being clearer documentation of the baseline setup and quantitative results.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces BioMamba, a Mamba-based state-space model for bioacoustic audio representations. It is pre-trained via self-supervised learning on a large audio corpus and evaluated on the BEANS benchmark for diverse classification and detection tasks involving wildlife sounds. The central empirical claim is that BioMamba achieves performance comparable to the Transformer-based AVES model while consuming significantly less VRAM, positioning Mamba as a computationally efficient alternative for environmental monitoring applications.

Significance. If the comparison is shown to be controlled and the VRAM reduction is reproducible, the work would establish state-space models as viable, lower-memory substitutes for Transformers in bioacoustics, with direct implications for deploying representation models on edge devices for real-time wildlife monitoring. The introduction of a new pre-trained model (BioMamba) and its systematic evaluation on the established BEANS benchmark would add a useful data point to the growing literature on efficient sequence models for audio.

major comments (2)

[Abstract, §4] Abstract and §4 (Results): the headline claim that BioMamba 'achieves comparable performance while significantly reducing VRAM consumption' is stated without any numerical values, tables, error bars, or statistical comparisons to AVES. This absence leaves the central empirical result without visible quantitative support in the provided text.
[§3] §3 (Methods): the manuscript does not state whether the AVES baseline was re-trained from scratch using the identical self-supervised recipe, optimizer schedule, batch size, audio preprocessing pipeline, and hardware stack employed for BioMamba. Without this control, both the accuracy parity and the reported VRAM reduction cannot be unambiguously attributed to the Mamba architecture rather than implementation or training differences.

minor comments (2)

[Abstract] The abstract refers to 'a large audio corpus' without naming the dataset or its size; this detail should be supplied in §2 or §3 for reproducibility.
[§2] Notation for model dimensions, state-space parameters, and audio feature extraction is introduced without a dedicated table or equation block; adding a summary table of hyperparameters would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which helps clarify the presentation of our empirical results and experimental controls. We address each major comment below and outline the revisions planned for the next version of the manuscript.

read point-by-point responses

Referee: [Abstract, §4] Abstract and §4 (Results): the headline claim that BioMamba 'achieves comparable performance while significantly reducing VRAM consumption' is stated without any numerical values, tables, error bars, or statistical comparisons to AVES. This absence leaves the central empirical result without visible quantitative support in the provided text.

Authors: We agree that the abstract and results would benefit from explicit quantitative support. In the revised manuscript we will update the abstract to report key metrics (e.g., mean accuracy across BEANS tasks and relative VRAM reduction) and will expand §4 to include the full comparison table with per-task scores, standard deviations, and direct VRAM measurements for both models. Statistical comparisons will be added where sample sizes permit. revision: yes
Referee: [§3] §3 (Methods): the manuscript does not state whether the AVES baseline was re-trained from scratch using the identical self-supervised recipe, optimizer schedule, batch size, audio preprocessing pipeline, and hardware stack employed for BioMamba. Without this control, both the accuracy parity and the reported VRAM reduction cannot be unambiguously attributed to the Mamba architecture rather than implementation or training differences.

Authors: We acknowledge the need for explicit methodological transparency. The AVES performance numbers are taken from the original AVES publication on the same BEANS tasks; re-training the full Transformer from scratch under identical conditions was not feasible within our compute budget. VRAM measurements, however, were obtained by running both models on the identical hardware and preprocessing pipeline. We will revise §3 to state these facts clearly, add a limitations paragraph discussing the implications, and include inference-time VRAM figures obtained under controlled conditions. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical comparison without derivation chain

full rationale

The paper performs an empirical evaluation of BioMamba (Mamba-based) against AVES (Transformer) on the BEANS benchmark after self-supervised pre-training. No equations, derivations, or mathematical claims are present that could reduce predictions to fitted inputs or self-citations by construction. The central result (comparable accuracy at lower VRAM) rests on reported experimental measurements rather than any self-definitional or load-bearing self-citation structure. This is a standard self-contained empirical study; external benchmarks and direct comparisons provide independent grounding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on standard machine-learning assumptions about self-supervised pre-training producing transferable audio representations and on the fairness of the benchmark comparison; no free parameters or invented physical entities are described in the abstract.

axioms (1)

domain assumption Self-supervised learning on a large unlabeled audio corpus produces representations useful for downstream bioacoustics tasks
Invoked by the pre-training step described in the abstract

invented entities (1)

BioMamba no independent evidence
purpose: Mamba-based audio representation model for wildlife sounds
New model name and architecture adaptation introduced in the abstract

pith-pipeline@v0.9.0 · 5382 in / 1197 out tokens · 97979 ms · 2026-05-17T02:38:46.867198+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 5 internal anchors

[1]

https://kaggle.com/birdsong-recognition

Cornell birdcall identification. https://kaggle.com/birdsong-recognition

work page
[2]

wav2vec 2.0: A frame- work for self-supervised learning of speech representations.Advances in Neural Information Processing Systems (NeurIPS), 33:12449–12460, 2020

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A frame- work for self-supervised learning of speech representations.Advances in Neural Information Processing Systems (NeurIPS), 33:12449–12460, 2020

work page 2020
[3]

Hawaiian islands cetacean and ecosystem assessment survey (hiceas) towed array data

NOAA Pacific Islands Fisheries Science Center. Hawaiian islands cetacean and ecosystem assessment survey (hiceas) towed array data. edited and annotated for the 9th international workshop on detection, classification, localization, and density estimation of marine mammals using passive acoustics (dclde 2022), 2022

work page 2022
[4]

Vggsound: A large-scale audio-visual dataset

Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. Vggsound: A large-scale audio-visual dataset. In2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 721–725, 2020

work page 2020
[5]

Chronister, Tessa A

Lauren M. Chronister, Tessa A. Rhinehart, Aidan Place, and Justin Kitzes. An annotated set of audio recordings of eastern north american birds containing frequency, time, and species information.Ecology, 102(6):e03329, 2021

work page 2021
[6]

FlashAttention: Fast and memory-efficient exact attention with IO-awareness.Advances in neural information processing systems, 35:16344–16359, 2022

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness.Advances in neural information processing systems, 35:16344–16359, 2022

work page 2022
[7]

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality.arXiv preprint arXiv:2405.21060, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Bert: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019

work page 2019
[9]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations, 2021

work page 2021
[10]

Hansford, Amanda Hoepfner, Heidi Ma, Jessica V

Emmanuel Dufourq, Ian Durbach, James P. Hansford, Amanda Hoepfner, Heidi Ma, Jessica V . Bryant, Christina S. Stender, Wenyong Li, Zhiwei Liu, Qing Chen, Zhaoli Zhou, and Samuel T. Turvey. Automated detection of hainan gibbon calls for passive acoustic monitoring.Remote Sensing in Ecology and Conservation, 7(3):475–487, 2021

work page 2021
[11]

SSAST: Self-supervised audio spectrogram transformer

Yuan Gong, Cheng-I Lai, Yu-An Chung, and James Glass. SSAST: Self-supervised audio spectrogram transformer. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 10699–10709, 2022. 6

work page 2022
[12]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

Aves: Animal vocalization encoder based on self-supervision

Masato Hagiwara. Aves: Animal vocalization encoder based on self-supervision. In2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, June 2023

work page 2023
[14]

Beans: The benchmark of animal sounds

Masato Hagiwara, Benjamin Hoffman, Jen-Yu Liu, Maddie Cusimano, Felix Effenberger, and Katie Zacarian. Beans: The benchmark of animal sounds. InICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, June 2023

work page 2023
[15]

Deep residual learning for im- age recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for im- age recognition. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016

work page 2016
[16]

Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Chan- ning Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron J. Weiss, and Kevin Wilson. Cnn architectures for large-scale audio classification. In2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), ...

work page 2017
[17]

Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM Trans

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM Trans. Audio, Speech and Lang. Proc., 29:3451–3460, October 2021

work page 2021
[18]

Speech slytherin: Examining the performance and efficiency of mamba for speech separation, recognition, and synthesis

Xilin Jiang, Yinghao Aaron Li, Adrian Nicolas Florea, Cong Han, and Nima Mesgarani. Speech slytherin: Examining the performance and efficiency of mamba for speech separation, recognition, and synthesis. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025

work page 2025
[19]

Humbugdb: A large-scale acoustic mosquito dataset

Ivan Kiskin, Marianne Sinka, Adam Cobb, Waqas Rafique, Lawrence Wang, Davide Zilli, Benjamin Gutteridge, Rinita Dam, Theodoros Marinos, Yunpeng Li, Dickson Msaky, Emmanuel Kaindoa, Gerard Killeen, Eva Herreros-Moya, Kathy Willis, and Stephen J Roberts. Humbugdb: A large-scale acoustic mosquito dataset. InProceedings of the Neural Information Processing Sy...

work page 2021
[20]

Reformer: The Efficient Transformer

Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[21]

Velev, Rahul Dodhia, Juan Lav- ista Ferres, and T

Jack LeBien, Ming Zhong, Marconi Campos-Cerqueira, Julian P. Velev, Rahul Dodhia, Juan Lav- ista Ferres, and T. Mitchell Aide. A pipeline for identification of bird and frog species in tropical soundscape recordings using a convolutional neural network.Ecological Informatics, 59:101113, 2020

work page 2020
[22]

Jamba: A Hybrid Transformer-Mamba Language Model

Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid transformer-mamba language model.arXiv preprint arXiv:2403.19887, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Few-shot bioacoustic event detection: A new task at the dcase 2021 challenge

Veronica Morfi, Inês Nolasco, Vincent Lostanlen, Shubhr Singh, Ariana Strandburg-Peshkin, Lisa F Gill, Hanna Pamula, David Benvent, and Dan Stowell. Few-shot bioacoustic event detection: A new task at the dcase 2021 challenge. InDCASE, pages 145–149, 2021

work page 2021
[24]

Seld-mamba: Selective state-space model for sound event localization and detection with source distance estimation.arXiv preprint arXiv:2408.05057, 2024

Da Mu, Zhicheng Zhang, Haobo Yue, Zehao Wang, Jin Tang, and Jianqin Yin. Seld-mamba: Selective state-space model for sound event localization and detection with source distance estimation.arXiv preprint arXiv:2408.05057, 2024

work page arXiv 2024
[25]

An annotated dataset of egyptian fruit bat vocalizations across varying contexts and during vocal ontogeny.Scientific data, 4(1):1–7, 2017

Yosef Prat, Mor Taub, Ester Pratt, and Yossi Yovel. An annotated dataset of egyptian fruit bat vocalizations across varying contexts and during vocal ontogeny.Scientific data, 4(1):1–7, 2017. 7

work page 2017
[26]

The watkins marine mammal sound database: an online, freely accessible resource

Laela Sayigh, Mary Ann Daher, Julie Allen, Helen Gordon, Katherine Joyce, Claire Stuhlmann, and Peter Tyack. The watkins marine mammal sound database: an online, freely accessible resource. InProceedings of Meetings on Acoustics, volume 27, page 040013. Acoustical Society of America, 2016

work page 2016
[27]

SSAMBA: Self- supervised audio representation learning with mamba state space model.arXiv preprint arXiv:2405.11831, 2024

Siavash Shams, Sukru Samet Dindar, Xilin Jiang, and Nima Mesgarani. SSAMBA: Self- supervised audio representation learning with mamba state space model.arXiv preprint arXiv:2405.11831, 2024

work page arXiv 2024
[28]

Linformer: Self-Attention with Linear Complexity

Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity.arXiv preprint arXiv:2006.04768, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2006
[29]

Barking in domestic dogs: context specificity and individual identification.Animal behaviour, 68(2):343–355, 2004

Sophia Yin and Brenda McCowan. Barking in domestic dogs: context specificity and individual identification.Animal behaviour, 68(2):343–355, 2004

work page 2004
[30]

Mamba in Speech: Towards an Alternative to Self-Attention,

Xiangyu Zhang, Jianbo Ma, Mostafa Shahin, Beena Ahmed, and Julien Epps. Mamba in speech: Towards an alternative to self-attention.arXiv preprint arXiv:2405.12609, 2024. 8

work page arXiv 2024

[1] [1]

https://kaggle.com/birdsong-recognition

Cornell birdcall identification. https://kaggle.com/birdsong-recognition

work page

[2] [2]

wav2vec 2.0: A frame- work for self-supervised learning of speech representations.Advances in Neural Information Processing Systems (NeurIPS), 33:12449–12460, 2020

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A frame- work for self-supervised learning of speech representations.Advances in Neural Information Processing Systems (NeurIPS), 33:12449–12460, 2020

work page 2020

[3] [3]

Hawaiian islands cetacean and ecosystem assessment survey (hiceas) towed array data

NOAA Pacific Islands Fisheries Science Center. Hawaiian islands cetacean and ecosystem assessment survey (hiceas) towed array data. edited and annotated for the 9th international workshop on detection, classification, localization, and density estimation of marine mammals using passive acoustics (dclde 2022), 2022

work page 2022

[4] [4]

Vggsound: A large-scale audio-visual dataset

Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. Vggsound: A large-scale audio-visual dataset. In2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 721–725, 2020

work page 2020

[5] [5]

Chronister, Tessa A

Lauren M. Chronister, Tessa A. Rhinehart, Aidan Place, and Justin Kitzes. An annotated set of audio recordings of eastern north american birds containing frequency, time, and species information.Ecology, 102(6):e03329, 2021

work page 2021

[6] [6]

FlashAttention: Fast and memory-efficient exact attention with IO-awareness.Advances in neural information processing systems, 35:16344–16359, 2022

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness.Advances in neural information processing systems, 35:16344–16359, 2022

work page 2022

[7] [7]

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality.arXiv preprint arXiv:2405.21060, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

Bert: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019

work page 2019

[9] [9]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations, 2021

work page 2021

[10] [10]

Hansford, Amanda Hoepfner, Heidi Ma, Jessica V

Emmanuel Dufourq, Ian Durbach, James P. Hansford, Amanda Hoepfner, Heidi Ma, Jessica V . Bryant, Christina S. Stender, Wenyong Li, Zhiwei Liu, Qing Chen, Zhaoli Zhou, and Samuel T. Turvey. Automated detection of hainan gibbon calls for passive acoustic monitoring.Remote Sensing in Ecology and Conservation, 7(3):475–487, 2021

work page 2021

[11] [11]

SSAST: Self-supervised audio spectrogram transformer

Yuan Gong, Cheng-I Lai, Yu-An Chung, and James Glass. SSAST: Self-supervised audio spectrogram transformer. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 10699–10709, 2022. 6

work page 2022

[12] [12]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[13] [13]

Aves: Animal vocalization encoder based on self-supervision

Masato Hagiwara. Aves: Animal vocalization encoder based on self-supervision. In2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, June 2023

work page 2023

[14] [14]

Beans: The benchmark of animal sounds

Masato Hagiwara, Benjamin Hoffman, Jen-Yu Liu, Maddie Cusimano, Felix Effenberger, and Katie Zacarian. Beans: The benchmark of animal sounds. InICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, June 2023

work page 2023

[15] [15]

Deep residual learning for im- age recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for im- age recognition. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016

work page 2016

[16] [16]

Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Chan- ning Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron J. Weiss, and Kevin Wilson. Cnn architectures for large-scale audio classification. In2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), ...

work page 2017

[17] [17]

Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM Trans

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM Trans. Audio, Speech and Lang. Proc., 29:3451–3460, October 2021

work page 2021

[18] [18]

Speech slytherin: Examining the performance and efficiency of mamba for speech separation, recognition, and synthesis

Xilin Jiang, Yinghao Aaron Li, Adrian Nicolas Florea, Cong Han, and Nima Mesgarani. Speech slytherin: Examining the performance and efficiency of mamba for speech separation, recognition, and synthesis. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025

work page 2025

[19] [19]

Humbugdb: A large-scale acoustic mosquito dataset

Ivan Kiskin, Marianne Sinka, Adam Cobb, Waqas Rafique, Lawrence Wang, Davide Zilli, Benjamin Gutteridge, Rinita Dam, Theodoros Marinos, Yunpeng Li, Dickson Msaky, Emmanuel Kaindoa, Gerard Killeen, Eva Herreros-Moya, Kathy Willis, and Stephen J Roberts. Humbugdb: A large-scale acoustic mosquito dataset. InProceedings of the Neural Information Processing Sy...

work page 2021

[20] [20]

Reformer: The Efficient Transformer

Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001

[21] [21]

Velev, Rahul Dodhia, Juan Lav- ista Ferres, and T

Jack LeBien, Ming Zhong, Marconi Campos-Cerqueira, Julian P. Velev, Rahul Dodhia, Juan Lav- ista Ferres, and T. Mitchell Aide. A pipeline for identification of bird and frog species in tropical soundscape recordings using a convolutional neural network.Ecological Informatics, 59:101113, 2020

work page 2020

[22] [22]

Jamba: A Hybrid Transformer-Mamba Language Model

Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid transformer-mamba language model.arXiv preprint arXiv:2403.19887, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[23] [23]

Few-shot bioacoustic event detection: A new task at the dcase 2021 challenge

Veronica Morfi, Inês Nolasco, Vincent Lostanlen, Shubhr Singh, Ariana Strandburg-Peshkin, Lisa F Gill, Hanna Pamula, David Benvent, and Dan Stowell. Few-shot bioacoustic event detection: A new task at the dcase 2021 challenge. InDCASE, pages 145–149, 2021

work page 2021

[24] [24]

Seld-mamba: Selective state-space model for sound event localization and detection with source distance estimation.arXiv preprint arXiv:2408.05057, 2024

Da Mu, Zhicheng Zhang, Haobo Yue, Zehao Wang, Jin Tang, and Jianqin Yin. Seld-mamba: Selective state-space model for sound event localization and detection with source distance estimation.arXiv preprint arXiv:2408.05057, 2024

work page arXiv 2024

[25] [25]

An annotated dataset of egyptian fruit bat vocalizations across varying contexts and during vocal ontogeny.Scientific data, 4(1):1–7, 2017

Yosef Prat, Mor Taub, Ester Pratt, and Yossi Yovel. An annotated dataset of egyptian fruit bat vocalizations across varying contexts and during vocal ontogeny.Scientific data, 4(1):1–7, 2017. 7

work page 2017

[26] [26]

The watkins marine mammal sound database: an online, freely accessible resource

Laela Sayigh, Mary Ann Daher, Julie Allen, Helen Gordon, Katherine Joyce, Claire Stuhlmann, and Peter Tyack. The watkins marine mammal sound database: an online, freely accessible resource. InProceedings of Meetings on Acoustics, volume 27, page 040013. Acoustical Society of America, 2016

work page 2016

[27] [27]

SSAMBA: Self- supervised audio representation learning with mamba state space model.arXiv preprint arXiv:2405.11831, 2024

Siavash Shams, Sukru Samet Dindar, Xilin Jiang, and Nima Mesgarani. SSAMBA: Self- supervised audio representation learning with mamba state space model.arXiv preprint arXiv:2405.11831, 2024

work page arXiv 2024

[28] [28]

Linformer: Self-Attention with Linear Complexity

Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity.arXiv preprint arXiv:2006.04768, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2006

[29] [29]

Barking in domestic dogs: context specificity and individual identification.Animal behaviour, 68(2):343–355, 2004

Sophia Yin and Brenda McCowan. Barking in domestic dogs: context specificity and individual identification.Animal behaviour, 68(2):343–355, 2004

work page 2004

[30] [30]

Mamba in Speech: Towards an Alternative to Self-Attention,

Xiangyu Zhang, Jianbo Ma, Mostafa Shahin, Beena Ahmed, and Julien Epps. Mamba in speech: Towards an alternative to self-attention.arXiv preprint arXiv:2405.12609, 2024. 8

work page arXiv 2024