Ecologically-Constrained Task Arithmetic for Multi-Taxa Bioacoustic Classifiers Without Shared Data
Pith reviewed 2026-05-07 12:36 UTC · model grok-4.3
The pith
Independently fine-tuned BEATs encoders can be arithmetically composed into a 661-species bioacoustic classifier without sharing data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Independently fine-tuned BEATs encoders can be composed into a unified 661-species classifier via task vector arithmetic without sharing data. Bioacoustic task vectors are near-orthogonal (cosine 0.01-0.09) and their separation aligns with spectral distribution distance. Simple averaging proves optimal for composition, while sign-conflict methods lower accuracy by one to six points. The resulting model exhibits linear mode connectivity across taxonomic pairs, supports zero-shot transfer to new regions, and fails under domain negation.
What carries the argument
Task vector arithmetic on fine-tuned BEATs encoders, where weight deltas from the base model are added or averaged across taxa to form a multi-species classifier.
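As a rough illustration of the arithmetic described here, the sketch below computes each specialist's weight delta from the base model and adds the mean delta back onto the base. It is a minimal toy, not the paper's implementation: the `Weights` type, the function names, and the `birds`/`frogs` checkpoints are hypothetical, and real encoders would hold tensors rather than Python lists.

```python
from typing import Dict, List

# Hypothetical stand-in for a checkpoint: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def task_vector(base: Weights, finetuned: Weights) -> Weights:
    """Weight delta of one specialist relative to the shared base encoder."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def compose_by_averaging(base: Weights, specialists: List[Weights]) -> Weights:
    """Average the specialists' task vectors and add the mean to the base."""
    deltas = [task_vector(base, s) for s in specialists]
    n = len(deltas)
    merged: Weights = {}
    for k in base:
        mean_delta = [sum(d[k][i] for d in deltas) / n for i in range(len(base[k]))]
        merged[k] = [b + m for b, m in zip(base[k], mean_delta)]
    return merged


# Toy example with two made-up specialists that changed different parameters.
base = {"layer.weight": [0.0, 1.0, 2.0]}
birds = {"layer.weight": [0.5, 1.0, 2.0]}
frogs = {"layer.weight": [0.0, 1.0, 3.0]}
merged = compose_by_averaging(base, [birds, frogs])
print(merged)
```

Only the deltas need to leave each institution in this scheme; the base weights are assumed to be shared already.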
If this is right
- Institutions can share only task vectors to assemble multi-taxa classifiers while keeping raw audio private.
- Averaging task vectors outperforms sign-conflict methods and maintains linear mode connectivity across all taxonomic pairs.
- The composition produces an asymmetric accuracy shift that benefits underrepresented taxa relative to species-rich groups.
- The unified model enables zero-shot transfer to new geographic regions without additional fine-tuning.
Where Pith is reading between the lines
- The approach could extend to other privacy-sensitive domains with scattered training data, such as medical audio or environmental sensors.
- If near-orthogonality scales to thousands of species, global collaborative classifiers assembled from local contributions become practical.
- The observed alignment with spectral distances suggests similar vector geometry may exist in other ecological or niche-based classification tasks.
- Non-linear composition operators could be tested to close the remaining gap to fully joint training.
Load-bearing premise
Bioacoustic task vectors from different taxa stay near-orthogonal and linear mode connectivity holds sufficiently for arithmetic composition to preserve accuracy without joint training on shared data.
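The near-orthogonality half of this premise is straightforward to check once task vectors are flattened. The sketch below computes all pairwise cosine similarities and flags any pair above a threshold; the taxon names and toy vectors are invented for illustration, and the 0.1 cutoff simply mirrors the figure this page uses as a falsification threshold.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two flattened task vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_cosines(vectors: Dict[str, List[float]]) -> Dict[Tuple[str, str], float]:
    """Cosine similarity for every unordered pair of named task vectors."""
    return {(a, b): cosine(vectors[a], vectors[b]) for a, b in combinations(vectors, 2)}


# Hypothetical task vectors: each taxon mostly changes its own coordinate.
task_vectors = {
    "birds": [1.0, 0.0, 0.0, 0.1],
    "frogs": [0.0, 1.0, 0.0, 0.1],
    "whales": [0.0, 0.0, 1.0, 0.1],
}
sims = pairwise_cosines(task_vectors)
flagged = [pair for pair, c in sims.items() if abs(c) > 0.1]  # assumed cutoff
print(sims, flagged)
```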
What would settle it
The composed 661-species model achieving accuracy more than five percentage points below a jointly trained baseline on a held-out multi-taxa test set, or multiple taxonomic pairs showing cosine similarity above 0.1.
Original abstract
Training data for bioacoustics is scattered across taxa, regions, and institutions. Centralizing it all is often infeasible. We show that independently fine-tuned BEATs encoders can be composed into a unified 661-species classifier via task vector arithmetic without sharing data. We find that bioacoustic task vectors are near-orthogonal (cosine 0.01-0.09). Their separation aligns closely with spectral distribution distance, a gradient consistent with the acoustic niche hypothesis. This geometry makes simple averaging optimal while sign-conflict methods reduce accuracy by one to six percentage points. Composition also creates an asymmetric gap: species-rich groups lose accuracy relative to joint training while underrepresented taxa gain, a redistribution useful for equitable biodiversity monitoring. We verify linear mode connectivity across all taxonomic pairs, demonstrate zero-shot transfer to new regions, and identify domain negation as a boundary condition where composition fails. These results enable a collaborative paradigm for bioacoustics where institutions share only task vectors to assemble multi-taxa classifiers, preserving data privacy.
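For readers unfamiliar with the sign-conflict methods the abstract contrasts with averaging, the sketch below follows the three TIES-style steps as they are summarized in the appendix excerpt further down this page: trim, elect sign, disjoint merge. It is an assumption-laden toy: the sign is elected by a simple majority count as that excerpt describes (the original TIES-merging paper, as far as we recall, weighs signs by total magnitude instead), ties are broken toward positive, and the input vectors are made up.

```python
from typing import List


def trim(tau: List[float], k: float) -> List[float]:
    """Zero out the bottom-k fraction of entries by magnitude."""
    n_drop = int(len(tau) * k)
    order = sorted(range(len(tau)), key=lambda i: abs(tau[i]))
    dropped = set(order[:n_drop])
    return [0.0 if i in dropped else x for i, x in enumerate(tau)]


def sign(x: float) -> int:
    """Return -1, 0, or 1."""
    return (x > 0) - (x < 0)


def ties_style_merge(taus: List[List[float]], k: float) -> List[float]:
    """Trim each vector, elect a sign per parameter, average agreeing values."""
    trimmed = [trim(t, k) for t in taus]
    merged = []
    for values in zip(*trimmed):
        vote = sum(sign(v) for v in values)   # majority vote over signs
        elected = 1 if vote >= 0 else -1      # assumption: ties go to positive
        agreeing = [v for v in values if sign(v) == elected]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged


# Made-up task vectors with a sign conflict on the first and third parameters.
taus = [
    [0.4, -0.1, 0.3, 0.8],
    [0.2, 0.6, -0.5, 0.1],
    [-0.9, 0.5, -0.7, 0.05],
]
resolved = ties_style_merge(taus, k=0.25)
plain_mean = [sum(col) / len(col) for col in zip(*taus)]
print(resolved, plain_mean)
```

When task vectors are near-orthogonal, few parameters carry genuinely conflicting signs, so the trimming and discarding steps mostly remove useful signal; that is one plausible reading of why the abstract reports plain averaging doing better.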
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that independently fine-tuned BEATs encoders for bioacoustic tasks across taxa can be composed into a single 661-species classifier using task vector arithmetic (primarily simple averaging) without any data sharing. It reports that the resulting task vectors are near-orthogonal (cosine similarities 0.01-0.09), with their geometry aligning to spectral distribution distances consistent with the acoustic niche hypothesis; simple averaging outperforms sign-conflict methods, linear mode connectivity holds for taxonomic pairs, zero-shot transfer to new regions is possible, and composition produces an asymmetric accuracy redistribution (species-rich taxa lose relative to joint training while underrepresented taxa gain). Domain negation is identified as a failure mode.
Significance. If the empirical results on multi-vector composition hold under rigorous controls, the work would enable a practical collaborative paradigm for bioacoustics in which institutions share only task vectors rather than raw recordings, directly addressing data privacy and centralization barriers. The reported alignment between model geometry and ecological principles offers a bridge between machine learning and bioacoustics theory, and the asymmetric accuracy effect could support more equitable monitoring of rare taxa. The paper provides concrete empirical measurements (cosine similarities, pairwise connectivity, accuracy deltas) that are falsifiable and could be reproduced by others sharing task vectors.
major comments (2)
- [§4] §4 (multi-taxa composition experiments): Linear mode connectivity is verified only across taxonomic pairs, yet the central claim requires that simultaneous averaging of dozens of task vectors (for the 661-species unified classifier) preserves performance via connectivity. Higher-order interference or accumulated misalignment is not directly tested; the observed asymmetric accuracy gap is consistent with such effects but does not confirm the aggregate case.
- [§4.1] §4.1 and accuracy tables: The soundness of accuracy comparisons, zero-shot transfer, and optimality of averaging rests on empirical verifications, but the manuscript lacks sufficient detail on experimental controls, baseline selection criteria, dataset partitioning, and whether post-hoc choices were made; this prevents full evaluation of whether the reported deltas (1-6 percentage points) are robust.
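The pairwise connectivity check at issue in the first major comment can be sketched as follows: interpolate linearly between two checkpoints and record the worst excess loss over the straight line joining the endpoint losses. This is one common definition of a loss barrier, not necessarily the paper's; the `loss` callables here are toy functions standing in for an evaluation on held-out data.

```python
from typing import Callable, List


def interpolate(a: List[float], b: List[float], alpha: float) -> List[float]:
    """Point on the straight line between two weight vectors."""
    return [(1 - alpha) * x + alpha * y for x, y in zip(a, b)]


def loss_barrier(theta_a: List[float], theta_b: List[float],
                 loss: Callable[[List[float]], float], steps: int = 11) -> float:
    """Largest excess of interpolated loss over the linear endpoint baseline."""
    loss_a, loss_b = loss(theta_a), loss(theta_b)
    worst = 0.0
    for i in range(steps):
        alpha = i / (steps - 1)
        baseline = (1 - alpha) * loss_a + alpha * loss_b
        worst = max(worst, loss(interpolate(theta_a, theta_b, alpha)) - baseline)
    return worst


def bowl(theta: List[float]) -> float:
    """Toy convex loss: a single basin."""
    return sum(x * x for x in theta)


def double_well(theta: List[float]) -> float:
    """Toy loss with two separate minima at -1 and +1."""
    return (theta[0] ** 2 - 1.0) ** 2


connected = loss_barrier([-1.0], [1.0], bowl)          # same basin, no barrier
separated = loss_barrier([-1.0], [1.0], double_well)   # ridge between minima
print(connected, separated)
```

A pairwise barrier near zero says nothing directly about the centroid of many checkpoints, which is the referee's point; evaluating the loss at the N-way average itself would be the analogous aggregate check.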
minor comments (3)
- [§3] The notation for task vectors and the precise definition of 'task vector arithmetic' in the methods section would benefit from an explicit equation or pseudocode to distinguish it from prior task arithmetic literature.
- [Figures] Figure captions for the cosine similarity and connectivity plots should include error bars or confidence intervals and state the exact number of taxonomic pairs evaluated.
- [§2] A citation to foundational work on the acoustic niche hypothesis should be added when the alignment with spectral distances is discussed.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the work's significance and for the detailed, constructive comments on experimental rigor. We address each major point below and have revised the manuscript to strengthen the presentation of results and controls.
Point-by-point responses
- Referee: [§4] §4 (multi-taxa composition experiments): Linear mode connectivity is verified only across taxonomic pairs, yet the central claim requires that simultaneous averaging of dozens of task vectors (for the 661-species unified classifier) preserves performance via connectivity. Higher-order interference or accumulated misalignment is not directly tested; the observed asymmetric accuracy gap is consistent with such effects but does not confirm the aggregate case.
  Authors: We agree that pairwise verification alone does not fully establish the multi-vector case. The near-orthogonality (cosine similarities 0.01–0.09) provides a theoretical basis for expecting limited higher-order interference, and the reported 661-species model was itself obtained by simultaneous averaging, with the observed asymmetric accuracy redistribution serving as an empirical outcome of that aggregate composition. To strengthen the claim, we have added a new analysis in the revised §4 that measures accuracy as a function of the number of simultaneously averaged vectors (subsets of 2, 5, 10, and 20 taxa), showing consistent scaling without abrupt degradation attributable to accumulated misalignment. While a complete connectivity proof for arbitrary numbers of vectors remains beyond the current scope, these results support the practical validity of the full composition.
  Revision: partial
- Referee: [§4.1] §4.1 and accuracy tables: The soundness of accuracy comparisons, zero-shot transfer, and optimality of averaging rests on empirical verifications, but the manuscript lacks sufficient detail on experimental controls, baseline selection criteria, dataset partitioning, and whether post-hoc choices were made; this prevents full evaluation of whether the reported deltas (1-6 percentage points) are robust.
  Authors: We acknowledge the need for greater transparency. The revised manuscript expands §4.1 and the experimental appendix with explicit descriptions of: baseline selection (joint training on pooled data where feasible for direct comparison, otherwise independent single-task models); dataset partitioning (stratified random splits by taxon and region to ensure no cross-contamination); hyperparameter consistency and random-seed controls; and confirmation that all tabulated results derive from the primary experimental protocol without post-hoc model selection. We have also added variance estimates and per-species accuracy breakdowns in supplementary material to allow direct assessment of the reported 1–6 percentage point deltas.
  Revision: yes
Circularity Check
No significant circularity; claims rest on direct empirical measurements
The paper's derivation chain consists of applying standard task-vector arithmetic (from prior literature) to independently fine-tuned BEATs encoders, followed by explicit experimental verification of vector angles, pairwise linear mode connectivity, and composed-model accuracy on held-out test sets. No quantity is defined in terms of itself, no fitted parameter is relabeled as a 'prediction,' and no central premise reduces to a self-citation whose content is unverified. The reported cosine similarities (0.01-0.09), connectivity checks, and accuracy deltas are measured quantities, not tautological outputs of the paper's own equations. The multi-vector composition claim is supported by direct evaluation rather than by construction from pairwise results alone.
Reference graph
Works this paper leans on
-
[1]
Ecologically-Constrained Task Arithmetic for Multi-Taxa Bioacoustic Classifiers Without Shared Data
Introduction. Passive bioacoustic monitoring generates large volumes of recordings across thousands of sites, capturing sounds from birds, marine mammals, amphibians, and insects [1]. However, the training data needed to build automated species classifiers remain fragmented. Ornithological surveys, cetacean programs, and herpetological fieldwork each...
arXiv 2026
-
[2]
Problem setting. We consider building a multi-taxa classifier from $N$ independently trained specialists without access to their training data
Method. 2.1. Problem setting. We consider building a multi-taxa classifier from $N$ independently trained specialists without access to their training data. Given a shared pretrained encoder $\theta_0 \in \mathbb{R}^d$ and $N$ specialist checkpoints $\{\theta_1, \ldots, \theta_N\}$, each fine-tuned on a private dataset $D_i$ covering a disjoint species set $S_i$ ($S_i \cap S_j = \emptyset$ for $i \neq j$), we define the task vec...
2023
-
[3]
Experiment 1: Linear Mode Connectivity. Task vector addition relies on the assumption that all fine-tuned models lie in the same loss basin with $\theta_0$
Experiments and Results. 3.1. Experiment 1: Linear Mode Connectivity. Task vector addition relies on the assumption that all fine-tuned models lie in the same loss basin with $\theta_0$. If this does not hold, merging could produce a model with high loss. We test this prerequisite for all ten pairwise combinations of group encoders by interpolating between each p...
-
[4]
Task arithmetic offers an alternative
Discussion. Toward collaborative model building in bioacoustics. Current bioacoustic monitoring relies on either monolithic classifiers (BirdNET, Perch) that require centralized retraining for updates, or isolated specialists that cannot share knowledge. Task arithmetic offers an alternative. Each research group trains on its own data, contributes a t...
-
[5]
Acoustic niche partitioning extends to weight space: taxa in distinct spectral bands produce modular, composable parameter updates
Conclusion. Our results affirmatively answer both questions posed in §1, as summarized in findings F1–F3. Acoustic niche partitioning extends to weight space: taxa in distinct spectral bands produce modular, composable parameter updates. This geometric regime distinguishes bioacoustics from vision and explains why conflict-resolution methods are counterp...
-
[6]
Computational bioacoustics with deep learning: a review and roadmap,
D. Stowell, “Computational bioacoustics with deep learning: a review and roadmap,” PeerJ, vol. 10, p. e13152, 2022
2022
-
[7]
Birdnet: A deep learning solution for avian diversity monitoring,
S. Kahl, C. M. Wood, M. Eibl, and H. Klinck, “Birdnet: A deep learning solution for avian diversity monitoring,” Ecological Informatics, vol. 61, p. 101236, 2021
2021
-
[8]
The watkins marine mammal sound database: an online, freely accessible resource,
L. Sayigh, M. A. Daher, J. Allen, H. Gordon, K. Joyce, C. Stuhlmann, and P. Tyack, “The watkins marine mammal sound database: an online, freely accessible resource,” in Proceedings of Meetings on Acoustics, vol. 27, no. 1. Acoustical Society of America, 2016, p. 040013
2016
-
[9]
A dataset for benchmarking neotropical anuran calls identification in passive acoustic monitoring,
J. S. Cañas, M. P. Toro-Gómez, L. S. M. Sugai, H. D. Benítez Restrepo, J. Rudas, B. Posso Bautista, L. F. Toledo, S. Dena, A. H. R. Domingos, F. L. de Souza et al., “A dataset for benchmarking neotropical anuran calls identification in passive acoustic monitoring,” Scientific Data, vol. 10, no. 1, p. 771, 2023
2023
-
[10]
Perch 2.0: The bittern lesson for bioacoustics. arXiv preprint arXiv:2508.04665, 2025
B. van Merriënboer, V. Dumoulin, J. Hamer, L. Harrell, A. Burns, and T. Denton, “Perch 2.0: The bittern lesson for bioacoustics,” arXiv preprint arXiv:2508.04665, 2025
-
[11]
The search for squawk: Agile modeling in bioacoustics,
V. Dumoulin, O. Stretcu, J. Hamer, L. Harrell, R. Laber, H. Larochelle, B. van Merriënboer, A. Navine, P. Hart, B. Williams et al., “The search for squawk: Agile modeling in bioacoustics,” arXiv preprint arXiv:2505.03071, 2025
-
[12]
Catastrophic forgetting in connectionist networks,
R. M. French, “Catastrophic forgetting in connectionist networks,” Trends in cognitive sciences, vol. 3, no. 4, pp. 128–135, 1999
1999
-
[13]
Editing models with task arithmetic,
G. Ilharco, M. T. Ribeiro, M. Wortsman, S. Gururangan, L. Schmidt, H. Hajishirzi, and A. Farhadi, “Editing models with task arithmetic,” in International Conference on Learning Representations (ICLR), 2023
2023
-
[14]
Ties-merging: Resolving interference when merging models,
P. Yadav, D. Tam, L. Choshen, C. A. Raffel, and M. Bansal, “Ties-merging: Resolving interference when merging models,” Advances in neural information processing systems, vol. 36, pp. 7093–7115, 2023
2023
-
[15]
Distilling a speech and music encoder with task arithmetic,
F. Ritter-Gutierrez, Y.-C. Lin, J.-C. Wei, J. H. Wong, E. S. Chng, N. F. Chen, and H.-y. Lee, “Distilling a speech and music encoder with task arithmetic,” in Proc. Interspeech 2025, 2025, pp. 3858–3862
2025
-
[16]
The niche hypothesis: a virtual symphony of animal sounds, the origins of musical expression and the health of habitats,
B. L. Krause et al., “The niche hypothesis: a virtual symphony of animal sounds, the origins of musical expression and the health of habitats,” The Soundscape Newsletter, vol. 6, no. 5, 1993
1993
-
[17]
Model merging improves zero-shot generalization in bioacoustic foundation models,
D. Marincione, D. Crisostomi, R. Dessi, E. Rodolà, and E. Rossi, “Model merging improves zero-shot generalization in bioacoustic foundation models,” in The Thirty-Ninth Annual Conference on Neural Information Processing Systems workshop: AI for non-human animal communication, 2025
2025
-
[18]
R. A. Nihal, B. Yen, R. Shi, and K. Nakadai, “Weakly supervised multiple instance learning for whale call detection and temporal localization in long-duration passive acoustic monitoring,” arXiv preprint arXiv:2502.20838, 2025
-
[19]
Language models are super mario: Absorbing abilities from homologous models as a free lunch,
L. Yu, B. Yu, H. Yu, F. Huang, and Y. Li, “Language models are super mario: Absorbing abilities from homologous models as a free lunch,” in Forty-first International Conference on Machine Learning, 2024
2024
-
[20]
Della-merging: Reducing interference in model merging through magnitude-based sampling,
P. T. Deep, R. Bhardwaj, and S. Poria, “Della-merging: Reducing interference in model merging through magnitude-based sampling,” arXiv preprint arXiv:2406.11617, 2024
-
[21]
Beats: audio pre-training with acoustic tokenizers,
S. Chen, Y. Wu, C. Wang, S. Liu, D. Tompkins, Z. Chen, W. Che, X. Yu, and F. Wei, “Beats: audio pre-training with acoustic tokenizers,” in Proceedings of the 40th International Conference on Machine Learning, ser. ICML’23. JMLR.org, 2023
2023
-
[22]
Linear mode connectivity and the lottery ticket hypothesis,
J. Frankle, G. K. Dziugaite, D. Roy, and M. Carbin, “Linear mode connectivity and the lottery ticket hypothesis,” in International conference on machine learning. PMLR, 2020, pp. 3259–3269
2020
-
[23]
Task arithmetic through the lens of one-shot federated learning,
Z. Tao, I. Mason, S. Kulkarni, and X. Boix, “Task arithmetic through the lens of one-shot federated learning,” Transactions on Machine Learning Research, 2025
2025
-
[24]
Scaffold: Stochastic controlled averaging for federated learning,
S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh, “Scaffold: Stochastic controlled averaging for federated learning,” in International conference on machine learning. PMLR, 2020, pp. 5132–5143
2020
-
[25]
Birdset: A large-scale dataset for audio classification in avian bioacoustics,
L. Rauch, R. Schwinger, M. Wirth, R. Heinrich, D. Huseljic, M. Herde, J. Lange, S. Kahl, B. Sick, S. Tomforde et al., “Birdset: A large-scale dataset for audio classification in avian bioacoustics,” arXiv preprint arXiv:2403.10380, 2024.
Generative AI Use Disclosure. Generative AI tools were used for editing prose and debugging code analysis. All scientifi...
-
[26]
Trim: Zero out the bottom $k$ fraction of parameters by magnitude in each task vector, retaining only the largest changes
-
[27]
Elect sign: For each parameter, take a majority vote across task vectors to determine the dominant sign
-
[28]
spectral fingerprint
Disjoint merge: Average only the values that agree with the elected sign; discard conflicting values.
B.3. DARE. DARE [14] builds on the observation that fine-tuning deltas are highly redundant: dropping most individual parameter changes often leaves the merged model’s behavior unchanged. DARE randomly zeros each parameter with probability $p$ and rescales the...
2023
-
[29]
The composition gap is measured using the unified probe
Per-group probes assess how well the merged encoder preserves discriminability within each group, while the unified probe evaluates global multi-taxa classification performance. The composition gap is measured using the unified probe. We also evaluate using $k$-nearest neighbors with $k = 1$. This approach requires no training: each test sample receives the la- ...
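The training-free evaluation mentioned in the last excerpt can be sketched as a 1-nearest-neighbor lookup over encoder embeddings. Because the excerpt is truncated, two things here are assumptions rather than the paper's stated protocol: that each test sample takes the label of its single nearest training embedding, and that distance is Euclidean. The embeddings and labels are invented.

```python
import math
from typing import List, Sequence


def one_nn_predict(query: Sequence[float],
                   train_x: List[Sequence[float]],
                   train_y: List[str]) -> str:
    """Label of the closest training embedding (Euclidean distance, assumed)."""
    nearest = min(range(len(train_x)), key=lambda i: math.dist(query, train_x[i]))
    return train_y[nearest]


def one_nn_accuracy(test_x: List[Sequence[float]], test_y: List[str],
                    train_x: List[Sequence[float]], train_y: List[str]) -> float:
    """Fraction of test embeddings whose nearest neighbor has the right label."""
    hits = sum(one_nn_predict(x, train_x, train_y) == y for x, y in zip(test_x, test_y))
    return hits / len(test_y)


# Hypothetical 2-D embeddings produced by a merged encoder.
train_x = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
train_y = ["frog", "frog", "whale"]
test_x = [[0.2, 0.1], [4.0, 4.5]]
test_y = ["frog", "whale"]
accuracy = one_nn_accuracy(test_x, test_y, train_x, train_y)
print(accuracy)
```

Because no probe is trained, a score like this reflects the geometry of the merged encoder's embedding space alone.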