pith. machine review for the scientific record.

arxiv: 2605.07903 · v1 · submitted 2026-05-08 · 💻 cs.SD · cs.AI

Recognition: 3 theorem links


BeeVe: Unsupervised Acoustic State Discovery in Honey Bee Buzzing

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 03:23 UTC · model grok-4.3

classification 💻 cs.SD cs.AI
keywords unsupervised learning · acoustic tokens · honey bee buzzing · vector-quantized autoencoder · bioacoustics · colony state discovery · queenless detection · hive monitoring

The pith

Unsupervised learning discovers repeatable acoustic states in honey bee buzzing without labels or predefined units.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how raw hive audio can be processed into a set of discrete acoustic tokens in a fully unsupervised way. Features are first extracted with a pre-trained audio transformer, and a vector-quantized autoencoder is then trained on those features alone, without labels or auxiliary objectives. The resulting tokens exhibit consistent patterns that track whether a queen is present in the colony, measured by how much token usage diverges between conditions. The queenless condition further splits into three stable subgroups that reappear across different training setups. This shows that structured information about collective non-vocal sound can be extracted automatically.

Core claim

The central discovery is that training a vector-quantized variational autoencoder on embeddings from a frozen self-supervised spectrogram transformer yields a discrete codebook whose tokens capture repeatable acoustic structure in unlabelled honey bee colony sounds. These tokens separate queenright and queenless conditions according to Jensen-Shannon divergence values of 0.609 to 0.688, decompose the queenless state into three coherent sub-states that are stable to changes in codebook size and random initialization, exhibit non-random sequential transitions, and generalize to new recordings with high token overlap.
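The Jensen-Shannon divergence the claim rests on can be computed directly from token-usage histograms. A minimal sketch, numpy only; the codebook size, counts, and base-2 logarithm are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(np.where(a > 0, a * np.log2((a + eps) / (b + eps)), 0.0))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical token-usage counts over a 32-entry codebook for two conditions.
rng = np.random.default_rng(0)
queenright = rng.multinomial(5000, np.ones(32) / 32)
queenless = rng.multinomial(5000, rng.dirichlet(np.ones(32)))

d = jsd(queenright, queenless)
assert 0.0 <= d <= 1.0  # base-2 JSD is bounded by 1
```

Identical usage distributions give 0 and disjoint ones give 1, so the reported 0.609–0.688 would sit well toward the separated end of the scale under this convention.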

What carries the argument

The vector-quantized variational autoencoder applied to frozen spectrogram transformer embeddings to learn a finite set of discrete acoustic tokens directly from unlabeled data.
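The quantization step this machinery relies on can be sketched minimally. The 1295-d input and 128-d latent follow the paper's figure captions; the stand-in linear encoder, codebook size, and weights are illustrative assumptions, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dimensions: 1295-d frozen PaSST embedding, 128-d latent,
# 32-entry codebook (the paper varies codebook size across experiments).
D_IN, D_LAT, K = 1295, 128, 32

W_enc = rng.normal(scale=0.02, size=(D_IN, D_LAT))  # stand-in linear encoder
codebook = rng.normal(size=(K, D_LAT))

def quantize(x):
    """Encode an embedding and snap it to the nearest codebook entry."""
    z = x @ W_enc                                 # continuous latent
    dists = np.linalg.norm(codebook - z, axis=1)  # distance to each code
    token = int(np.argmin(dists))                 # discrete acoustic token
    return token, z, codebook[token]

x = rng.normal(size=D_IN)
token, z, z_q = quantize(x)

# Commitment-style term of the VQ loss, sketched without the stop-gradient
# machinery a full implementation needs.
commit_loss = np.sum((z - z_q) ** 2)
assert 0 <= token < K
```

Every audio frame thus maps to one integer token, and the review's token-usage and transition analyses operate on those integer sequences.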

If this is right

  • The tokens distinguish between queenright and queenless hives with measurable divergence in their frequency of use.
  • Queenless conditions consistently divide into three sub-states that remain coherent regardless of codebook size or training seed.
  • Sequences of tokens display non-random structure across all tested configurations.
  • Token assignments generalize to previously unseen recordings while maintaining overlap and overall arrangement.
  • This framework supports the development of label-free acoustic systems for monitoring colony conditions.
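The non-random sequential structure in the third bullet can be probed with a simple shuffle test on the token sequence. This sketch uses synthetic tokens with deliberate persistence, not the paper's data or its exact statistic:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical token sequence with deliberate self-transitions (state persistence).
tokens = np.repeat(rng.integers(0, 8, size=200), 10)  # runs of identical tokens

def diag_mass(seq, k=8):
    """Fraction of transitions that stay on the same token."""
    counts = np.zeros((k, k))
    for a, b in zip(seq[:-1], seq[1:]):
        counts[a, b] += 1
    return np.trace(counts) / np.sum(counts)

observed = diag_mass(tokens)

# Permutation baseline: shuffling the sequence destroys sequential structure.
null = [diag_mass(rng.permutation(tokens)) for _ in range(200)]
p_value = np.mean([n >= observed for n in null])
assert p_value < 0.05  # persistent runs are far more likely than chance
```

A sequence with genuine state structure concentrates mass on (or near) the diagonal of the transition matrix, which shuffling erases; the paper's p << 0.001 claim is the same logic with its own test statistic.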

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applying the same tokenization process to audio from other social insects or animal groups could reveal analogous hidden states in their collective behaviors.
  • Correlating the discovered tokens with additional colony variables such as brood presence or foraging activity could strengthen the case for their biological meaning.
  • The approach might enable continuous, non-invasive tracking of hive health changes over time using only microphone data.
  • Testing the tokens on recordings from different hive setups or bee species would check how general the discovered structure is.

Load-bearing premise

The learned tokens reflect actual biological differences in colony state rather than differences in how the audio was recorded or artifacts introduced by the feature extraction model.

What would settle it

Repeating the training on a new set of recordings from hives with independently verified queen status but obtaining no separation in token distributions between queenright and queenless groups would falsify the claim that meaningful states have been discovered.

Figures

Figures reproduced from arXiv: 2605.07903 by Hamze Hammami, Nidhal Abdulaziz.

Figure 1: Method summary. Acoustic data equivalent to approximately five hours of hive recordings was sampled from the UrBAN dataset [2], which contains well over 1000 hours of honey-bee hive audio collected under real-world conditions. For the purposes of a controlled and interpretable experiment, the amount of data used is limited while preserving data diversity through annotation…
Figure 2: VQ-VAE architecture. The quantization objective (Equation 5) is L_vq = L_codebook + β L_commit + γ L_diversity, with β = 0.25 and γ = 0.1. The reconstruction loss (Equation 6) is the mean squared error between the original PaSST embedding and the decoder output, L_recon = ‖x − x̂‖², where x ∈ ℝ^1295 is the original input and x̂ ∈ ℝ^1295 is the reconstructed output. The codebook loss pulls codebook entries toward…
Figure 3: Loss and reconstruction curves for the baseline experiment.
Figure 4: Reconstruction quality on the test recording. The codebook represents the colony-level acoustic state rather than preserving individual frame variation. The remaining error reflects the deviation of each frame from the average pattern of its assigned token, which is the expected cost of discrete compression.
Figure 5: Codebook perplexity and active token count across training epochs.
Figure 6: Similarity matrix and usage distribution. To determine whether the concentration of error in the high-activation dimensions reflects meaningful learning, and whether the encoder captures genuine state variation, the learned representations were evaluated against known hive conditions. Token usage patterns were analyzed across different states, and embeddings were projected to 2D…
Figure 7: Baseline queen-status token usage heatmap. The silhouette score operates in the 128-dimensional latent space rather than on token distributions, measuring whether individual frame embeddings cluster more tightly with frames of their own condition than with frames of the opposite condition. Positive values indicate that points are closer to their own group than to the other; the moderate scores observed here…
Figure 8: 2D latent projections coloured by queen status. An interesting observation from the state validation projections is that UMAP and t-SNE form distinct clusters that are not fully connected. In t-SNE, two clusters are clearly dominant while a third is either barely formed or sparsely populated. To investigate this, QNL embeddings were isolated in the 128-dimensional latent space and K-means…
Figure 9: PCA projection of queenless embeddings coloured by sub-state.
Figure 10: Token composition of queenless sub-states for the baseline experiment.
Figure 11: UMAP projection of QNL sub-states. Sub-state A is the most consistent finding across all experiments: its size remains fixed at approximately 57% of all queenless frames regardless of codebook size or random seed, and its purity is consistently above 97%, meaning that in every model trained…
Figure 12: Deviation of each sub-state's mean PaSST feature profile from the overall queenless mean. Sub-state A consistently activates above the queenless mean; Sub-state C is broadly suppressed below it in the same region, suggesting lower overall energy; Sub-state B shows a mixed pattern, consistent with its interpretation as heterogeneous behaviour rather than a single well-defined state…
Figure 13: Token transition probability matrix. Of the 19 active training tokens, 18 are also active in the test file, giving a Jaccard overlap of 0.947 and a JSD of 0.2065. The test file uses the same…
Figure 14: UMAP manifold projections for training, test, and overlay. The central question of this work is whether collective honey bee buzzing contains structured, repeatable acoustic states that can be recovered without supervision. The results across three independent experiments consistently support that it does, with token separation, stable sub-state structure, and preserved manifold…
Original abstract

Discovering structure in biological signals without supervision is a fundamental problem in computational intelligence, yet existing bioacoustic methods assume vocal production models or predefined semantic units, leaving non-vocal species poorly served. This work introduces BeeVe, an unsupervised framework for acoustic state discovery in collective honey bee buzzing. BeeVe uses the self-supervised Patchout Spectrogram Transformer (PaSST) as a frozen feature extractor, then trains a Vector-Quantized Variational Autoencoder (VQ-VAE) without labels on those embeddings, learning a finite discrete codebook of acoustic tokens directly from unlabelled hive audio. No labels, pretext tasks, or contrastive objectives are used at any stage. Post-hoc evaluation against known queen status reveals that the learned tokens separate queenright and queenless conditions with Jensen-Shannon Divergence values between 0.609 and 0.688, and that the queenless condition further decomposes into three internally coherent sub-states stable across experiments with different codebook sizes and random seeds. Token transition analysis confirms non-random sequential structure (p << 0.001) across all experiments. Generalisation to unseen recordings preserves both token overlap (Jaccard = 0.947) and global manifold topology. These results demonstrate that unsupervised discrete codebook learning can recover repeatable acoustic structure from a non-vocal biological signal without annotation, opening a path toward non-invasive acoustic hive health monitoring.
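The abstract's generalisation figure (18 of 19 active training tokens reused, Jaccard = 0.947) reduces to a set-overlap computation over active codebook entries. A small sketch with hypothetical token streams chosen to mirror those counts:

```python
import numpy as np

def active_tokens(token_ids):
    """Set of codebook entries used at least once in a recording."""
    return set(int(t) for t in np.unique(token_ids))

def jaccard(a, b):
    """Intersection-over-union of two token sets."""
    return len(a & b) / len(a | b)

# Hypothetical streams: 19 distinct tokens appear in training,
# 18 of them (and nothing new) appear in the test recording.
train_ids = np.repeat(np.arange(19), 5)
test_ids = np.repeat(np.arange(18), 5)

train = active_tokens(train_ids)
test = active_tokens(test_ids)
print(round(jaccard(train, test), 3))  # → 0.947
```

Note that Jaccard overlap only checks which tokens appear, not how often; the accompanying JSD of 0.2065 between train and test usage is the complementary frequency-level check.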

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces BeeVe, an unsupervised framework that extracts features from unlabeled honey bee buzzing audio using a frozen PaSST model and then trains a VQ-VAE to learn a discrete codebook of acoustic tokens. Post-hoc evaluation against queen status shows separation between queenright and queenless conditions (JSD 0.609–0.688), decomposition of queenless into three stable sub-states across codebook sizes and seeds, non-random token transitions (p << 0.001), and strong generalization to unseen recordings (Jaccard 0.947). The central claim is that this approach recovers repeatable acoustic structure from a non-vocal biological signal without any labels or supervision.

Significance. If the tokens correspond to biologically meaningful colony states, the work would advance unsupervised bioacoustics for collective non-vocal signals and support practical non-invasive hive health monitoring. The combination of frozen self-supervised audio features with VQ-VAE for discrete state discovery, plus reported internal consistency metrics (stability, transitions, generalization), provides a reproducible template worth testing in other bioacoustic domains.

major comments (3)
  1. [Methods] Methods (data collection and experimental design): No details are provided on the number of hives, recording hardware, microphone placement, time-of-day controls, or environmental covariates. Without these, the post-hoc JSD separation by queen status cannot be distinguished from hive-specific recording artifacts or systematic biases in the frozen PaSST extractor.
  2. [Results] Results (JSD and sub-state analysis): The reported JSD range (0.609–0.688) and three queenless sub-states are presented as evidence of biologically meaningful discovery, but no ablation holds recording conditions fixed while varying only queen status, nor are permutation tests or baseline comparisons against PaSST biases described. This leaves the central claim vulnerable to the alternative that tokens reflect activity or hardware patterns rather than colony state.
  3. [Evaluation] Evaluation (generalization and transitions): While Jaccard 0.947 on unseen recordings and p << 0.001 on transitions demonstrate internal consistency, the unseen recordings appear drawn from the same hives/conditions; this does not test external validity against the confounds raised in the skeptic note.
minor comments (2)
  1. [Abstract] Abstract and results: The exact VQ-VAE training procedure (learning rate, epochs, codebook initialization) and statistical controls for multiple comparisons are not summarized, making it hard to assess reproducibility from the reported metrics alone.
  2. [Methods] Notation: The distinction between 'queenless sub-states' and the global codebook tokens could be clarified with a diagram or explicit definition in the methods.
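The permutation test requested in major comment 2 could be sketched by shuffling condition labels across frames and recomputing the token-histogram JSD; the tokens, codebook size, and usage profiles below are synthetic stand-ins, not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(3)
K = 16  # hypothetical codebook size

# Synthetic frame-level tokens for two conditions with shifted usage profiles.
qr = rng.choice(K, size=2000, p=np.r_[np.full(8, 0.10), np.full(8, 0.025)])
ql = rng.choice(K, size=2000, p=np.r_[np.full(8, 0.025), np.full(8, 0.10)])

def jsd_from_tokens(a, b, k=K, eps=1e-12):
    """Base-2 JSD between the token-usage histograms of two frame sets."""
    p = np.bincount(a, minlength=k) / len(a)
    q = np.bincount(b, minlength=k) / len(b)
    m = 0.5 * (p + q)
    kl = lambda u, v: np.sum(np.where(u > 0, u * np.log2((u + eps) / (v + eps)), 0))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

observed = jsd_from_tokens(qr, ql)

# Null distribution: shuffle condition labels, resplit, recompute.
pooled = np.concatenate([qr, ql])
null = []
for _ in range(200):
    perm = rng.permutation(pooled)
    null.append(jsd_from_tokens(perm[:2000], perm[2000:]))
p_value = np.mean([n >= observed for n in null])
assert p_value < 0.05
```

If the paper's observed JSD of 0.609–0.688 sat inside such a label-shuffled null, the separation claim would collapse to sampling noise; the referee's point is that this comparison is currently absent.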

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive review and for highlighting key areas where additional rigor would strengthen the manuscript. We address each major comment below, indicating where revisions will be made and where limitations of the current dataset prevent full resolution.

Point-by-point responses
  1. Referee: [Methods] Methods (data collection and experimental design): No details are provided on the number of hives, recording hardware, microphone placement, time-of-day controls, or environmental covariates. Without these, the post-hoc JSD separation by queen status cannot be distinguished from hive-specific recording artifacts or systematic biases in the frozen PaSST extractor.

    Authors: We agree that these methodological details are necessary for proper interpretation and to help readers assess potential confounds. In the revised manuscript we will add a dedicated data collection subsection specifying the number of hives, recording hardware model, microphone placement, time-of-day sampling protocol, and any recorded environmental covariates. This will allow direct evaluation of whether the observed token distributions could arise from recording artifacts. revision: yes

  2. Referee: [Results] Results (JSD and sub-state analysis): The reported JSD range (0.609–0.688) and three queenless sub-states are presented as evidence of biologically meaningful discovery, but no ablation holds recording conditions fixed while varying only queen status, nor are permutation tests or baseline comparisons against PaSST biases described. This leaves the central claim vulnerable to the alternative that tokens reflect activity or hardware patterns rather than colony state.

    Authors: We will add permutation tests comparing the observed JSD values against distributions obtained from randomly reassigned tokens, and we will include baseline token statistics computed directly on the frozen PaSST embeddings (without VQ-VAE) to isolate the contribution of the discrete codebook. However, an ablation that holds all recording conditions fixed while independently varying only queen status is not feasible with the existing dataset and would require new controlled recordings. revision: partial

  3. Referee: [Evaluation] Evaluation (generalization and transitions): While Jaccard 0.947 on unseen recordings and p << 0.001 on transitions demonstrate internal consistency, the unseen recordings appear drawn from the same hives/conditions; this does not test external validity against the confounds raised in the skeptic note.

    Authors: We will revise the Evaluation section to explicitly state that the held-out recordings come from the same hives and recording conditions, thereby clarifying that the Jaccard and transition metrics demonstrate internal reproducibility rather than cross-hardware or cross-site generalization. We will also add a limitations paragraph discussing the scope of these results and the need for future multi-site validation. revision: yes

standing simulated objections not resolved
  • An ablation holding recording conditions fixed while varying only queen status cannot be performed without new data collection outside the current study.

Circularity Check

0 steps flagged

No circularity: unsupervised training and post-hoc evaluation are independent

Full rationale

The paper trains a VQ-VAE codebook on frozen PaSST embeddings using only unlabelled hive audio with no target variables, labels, or queen-status information at any stage of learning. Token distributions, transitions, and manifold topology are derived directly from the unsupervised objective. Post-hoc JSD separation (0.609–0.688) and stability checks against queen status are computed after training and do not enter the loss, codebook construction, or any reported equation. No self-citation chain, fitted-input renaming, or ansatz smuggling is present in the derivation; the core claim that discrete tokens recover repeatable structure therefore stands on independent content rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The approach rests on the assumption that pre-trained general audio embeddings remain informative for hive buzzing and that VQ-VAE discretization yields biologically relevant tokens; no new entities are postulated and only standard training choices appear.

free parameters (1)
  • VQ-VAE codebook size
    Varies across experiments; chosen to produce stable sub-states but not derived from data or theory.
axioms (2)
  • domain assumption PaSST embeddings capture relevant structure in non-vocal bioacoustic signals
    Used as frozen extractor without additional justification or ablation in the abstract.
  • domain assumption Discrete tokens from VQ-VAE correspond to repeatable acoustic states
    Core modeling choice; evaluated only post-hoc.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

  1. [1] Mahsa Abdollahi, Pierre Giovenazzo, and Tiago H Falk, Automated beehive acoustics monitoring: A comprehensive review of the literature and recommendations for future work, Applied Sciences 12 (2022), no. 8, 3920.

  2. [2] Mahsa Abdollahi, Yi Zhu, Heitor R Guimarães, Nico Coallier, Ségolène Maucourt, Pierre Giovenazzo, and Tiago H Falk, UrBAN: Urban beehive acoustics and phenotyping dataset, Scientific Data 12 (2025), no. 1, 536.

  3. [3] Cleiton M Carvalho Jr, Ícaro de Lima Rodrigues, and Danielo G Gomes, Unsupervised acoustic detection of queenless hives in honeybees (Apis mellifera ligustica), Brazilian e-Science Workshop (BreSci), SBC, 2025, pp. 49–56.

  4. [4] Christine Erbe and Jeanette A Thomas, Exploring animal behavior through sound: Volume 1: Methods, Springer Nature, 2022.

  5. [5] Sara Ferrari, Mitchell Silva, Marcella Guarino, and Daniel Berckmans, Monitoring of swarming sounds in bee hives for early detection of the swarming period, Computers and Electronics in Agriculture 64 (2008), no. 1, 72–77.

  6. [6] W Tecumseh Fitch, The evolution of speech: a comparative review, Trends in Cognitive Sciences 4 (2000), no. 7, 258–267.

  7. [7] Karl von Frisch, The dance language and orientation of bees, Harvard University Press, 1993.

  8. [8] Masato Hagiwara, AVES: Animal vocalization encoder based on self-supervision, ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2023, pp. 1–5.

  9. [9] Hamze Hammami and Nidhal Abdulaziz, BeeBetter: A multi-modal beehive system for honeybee health monitoring and hazard detection, 2024 7th International Conference on Signal Processing and Information Security (ICSPIS), IEEE, 2024, pp. 1–5.

  10. [10] Logan S James, Benjamin Hoffman, Jen-Yu Liu, Marius Miron, Milad Alizadeh, Emmanuel Fernandez, Matthieu Geist, Diane Kim, Aza Raskin, Jon T Sakata, et al., Zebra finch females flexibly communicate with each other and with AI-driven acoustic interaction models, bioRxiv (2026), 2026–02.

  11. [11] Dimitrios Kanelis, Vasilios Liolios, Fotini Papadopoulou, Maria-Anna Rodopoulou, Dimitrios Kampelopoulos, Kostas Siozios, and Chrysoula Tananaki, Decoding the behavior of a queenless colony using sound signals, Biology 12 (2023), no. 11, 1392.

  12. [12] WH Kirchner, Acoustical communication in honeybees, Apidologie 24 (1993), no. 3, 297–307.

  13. [13] Khaled Koutini, Jan Schlüter, Hamid Eghbal-Zadeh, and Gerhard Widmer, Efficient training of audio transformers with patchout, arXiv preprint arXiv:2110.05069 (2021).

  14. [14] Axel Michelsen, Wolfgang H Kirchner, and Martin Lindauer, Sound and vibrational signals in the dance language of the honeybee, Apis mellifera, Behavioral Ecology and Sociobiology 18 (1986), no. 3, 207–212.

  15. [15] Orr Paradise, Pranav Muralikrishnan, Liangyuan Chen, Hugo Flores García, Bryan Pardo, Roee Diamant, David F Gruber, Shane Gero, and Shafi Goldwasser, WhAM: Towards a translative model of sperm whale vocalization, arXiv preprint arXiv:2512.02206 (2025).

  16. [16] Michael Ramsey, M Bencsik, and MI Newton, Extensive vibrational characterisation and long-term monitoring of honeybee dorso-ventral abdominal vibration signals, Scientific Reports 8 (2018), no. 1, 14571.

  17. [17] David Robinson, Marius Miron, Masato Hagiwara, Benno Weck, Sara Keen, Milad Alizadeh, Gagan Narula, Matthieu Geist, and Olivier Pietquin, NatureLM-audio: An audio-language foundation model for bioacoustics, arXiv preprint arXiv:2411.07186 (2024).

  18. [18] Eklavya Sarkar and Mathew Magimai Doss, Comparing self-supervised learning models pre-trained on human speech and animal vocalizations for bioacoustics processing, ICASSP 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2025, pp. 1–5.

  19. [19] Eklavya Sarkar and Mathew Magimai Doss, Towards leveraging sequential structure in animal vocalizations, arXiv preprint arXiv:2511.10190 (2025).

  20. [20] Pratyusha Sharma, Shane Gero, Daniela Rus, Antonio Torralba, and Jacob Andreas, WhaleLM: Finding structure and information in sperm whale vocalizations and behavior with machine learning, bioRxiv (2024), 2024–10.