arxiv: 2605.03914 · v1 · submitted 2026-05-05 · 💻 cs.SD · cs.LG

Ecologically-Constrained Task Arithmetic for Multi-Taxa Bioacoustic Classifiers Without Shared Data

Ragib Amin Nihal, Benjamin Yen, Runwu Shi, Takeshi Ashizawa, Kazuhiro Nakadai

Pith reviewed 2026-05-07 12:36 UTC · model grok-4.3

keywords: task arithmetic, bioacoustics, multi-taxa classification, BEATs encoder, task vectors, data privacy, linear mode connectivity, acoustic niche hypothesis

The pith

Independently fine-tuned BEATs encoders can be arithmetically composed into a 661-species bioacoustic classifier without sharing data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that task vectors extracted from separately trained audio encoders on distinct animal taxa can be added or averaged to produce one unified classifier spanning 661 species. This succeeds because the vectors remain nearly orthogonal, with cosine similarities between 0.01 and 0.09, and their geometry tracks spectral differences across groups. A reader would care because bioacoustic datasets are fragmented across institutions and regions, often making joint training impossible due to privacy or logistics. The method redistributes performance toward underrepresented taxa while enabling zero-shot generalization to new areas.
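The near-orthogonality claim can be made concrete with a small sketch. It assumes each encoder has been flattened into a plain list of floats; the helper names `task_vector` and `cosine` and the toy numbers are illustrative and do not come from the paper.

```python
import math

def task_vector(finetuned, base):
    """Weight delta tau = theta_finetuned - theta_base (flattened parameters)."""
    return [f - b for f, b in zip(finetuned, base)]

def cosine(u, v):
    """Cosine similarity between two flattened task vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy values (hypothetical, not from the paper): two specialists that
# mostly modify different coordinates of a shared base encoder.
base = [0.0, 0.0, 0.0, 0.0]
birds = [1.0, 0.5, 0.0, 0.0]
frogs = [0.0, 0.1, 1.0, 0.5]

tau_birds = task_vector(birds, base)
tau_frogs = task_vector(frogs, base)
similarity = cosine(tau_birds, tau_frogs)
print(f"cosine similarity: {similarity:.3f}")
```

Because the two toy specialists overlap on only one coordinate, the similarity lands close to zero, which is the regime the paper reports for real taxa.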

Core claim

Independently fine-tuned BEATs encoders can be composed into a unified 661-species classifier via task vector arithmetic without sharing data. Bioacoustic task vectors are near-orthogonal (cosine 0.01-0.09) and their separation aligns with spectral distribution distance. Simple averaging proves optimal for composition, while sign-conflict methods lower accuracy by one to six percentage points. Linear mode connectivity holds across all taxonomic pairs, the composed model supports zero-shot transfer to new regions, and composition fails under domain negation.

What carries the argument

Task vector arithmetic on fine-tuned BEATs encoders, where weight deltas from the base model are added or averaged across taxa to form a multi-species classifier.
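A minimal sketch of that mechanism, assuming flattened parameters stored as Python lists. The function name `compose` and its `coefficient` argument are hypothetical; the paper's own implementation is not shown on this page.

```python
def compose(base, specialists, coefficient=None):
    """Merge specialists into one encoder via task vector arithmetic.

    coefficient=None averages the task vectors (scale 1/N);
    coefficient=1.0 adds them.
    """
    if coefficient is None:
        coefficient = 1.0 / len(specialists)
    merged = list(base)
    for theta in specialists:
        for i, (w, w0) in enumerate(zip(theta, base)):
            # Accumulate the scaled weight delta of this specialist.
            merged[i] += coefficient * (w - w0)
    return merged

# Toy checkpoints (hypothetical): each specialist moves a different weight.
base = [1.0, 1.0, 1.0]
specialist_a = [3.0, 1.0, 1.0]
specialist_b = [1.0, 1.0, 5.0]

averaged = compose(base, [specialist_a, specialist_b])
summed = compose(base, [specialist_a, specialist_b], coefficient=1.0)
print(averaged, summed)
```

Only the deltas relative to the shared base model enter the merge, so a contributor needs to share a checkpoint (or its delta) but no recordings.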

If this is right

Institutions can share only task vectors to assemble multi-taxa classifiers while keeping raw audio private.
Averaging task vectors outperforms sign-conflict methods by one to six percentage points, and linear mode connectivity holds across all taxonomic pairs.
The composition produces an asymmetric accuracy shift that benefits underrepresented taxa relative to species-rich groups.
The unified model enables zero-shot transfer to new geographic regions without additional fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach could extend to other privacy-sensitive domains with scattered training data, such as medical audio or environmental sensors.
If near-orthogonality scales to thousands of species, global collaborative classifiers assembled from local contributions become practical.
The observed alignment with spectral distances suggests similar vector geometry may exist in other ecological or niche-based classification tasks.
Non-linear composition operators could be tested to close the remaining gap to fully joint training.

Load-bearing premise

Bioacoustic task vectors from different taxa stay near-orthogonal and linear mode connectivity holds sufficiently for arithmetic composition to preserve accuracy without joint training on shared data.

What would settle it

The composed 661-species model achieving accuracy more than five percentage points below a jointly trained baseline on a held-out multi-taxa test set, or multiple taxonomic pairs showing cosine similarity above 0.1.
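The two criteria above can be written as a small decision rule. This is a sketch under stated assumptions: accuracies are percentages measured on the same held-out test set, "multiple" is read as at least two pairs, and the function name and example numbers are hypothetical.

```python
def claim_is_challenged(composed_acc, joint_acc, pairwise_cosines,
                        max_gap_points=5.0, cosine_limit=0.1):
    """Return True if either falsification criterion is met.

    composed_acc, joint_acc: accuracies in percent on one held-out test set.
    pairwise_cosines: cosine similarities for the taxonomic pairs.
    """
    accuracy_gap = joint_acc - composed_acc
    pairs_over_limit = sum(1 for c in pairwise_cosines if c > cosine_limit)
    # Assumption: "multiple" pairs means two or more.
    return accuracy_gap > max_gap_points or pairs_over_limit >= 2

# Hypothetical numbers, not results from the paper.
challenged = claim_is_challenged(70.0, 73.0, [0.02, 0.09, 0.05])
print(challenged)
```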

Figures

Figures reproduced from arXiv: 2605.03914 by Benjamin Yen, Kazuhiro Nakadai, Ragib Amin Nihal, Runwu Shi, Takeshi Ashizawa.

**Figure 1.** Ecologically-constrained task arithmetic for bioacoustic model composition. (a) Acoustic niche partitioning: taxonomic groups concentrate vocal energy in non-overlapping frequency bands (schematic). (b) Top: task vectors are near-orthogonal in weight space, with magnitude proportional to dataset size. Bottom: each group modifies a sparse, largely disjoint subset of encoder parameters. (c) Composition pipel…

**Figure 2.** Linear mode connectivity for some specialist pairs. Every curve is monotonic: no loss barrier exceeds endpoint.

[Heatmap residue: groups G1–G5 on both axes; panels "Cosine Similarity" and "Sign Agreement"; diagonal cells give task vector norms ‖τ‖ = 13.6, 9.5, 6.9, 1.4, 4.8 for G1 to G5.]

**Figure 3.** Pairwise cosine similarity heatmap.

Task vector L2 norms span from 13.58 for G1 to 1.45 for G4, correlating with training set size.

**Figure 4.** (a) Spectral distribution distance (JSD) vs. task vector cosine similarity. (b) Per-group composition gap

**Figure 5.** Domain negation: accuracy vs. subtraction strength β for focal negation (solid) and random-vector control (dashed).

F3: Composition works for taxonomic and geographic tasks with zero-shot transfer, but fails for domain negation because recordings may entangle with species identity. Additional experiments are in the Supplementary Material.

Original abstract

Training data for bioacoustics is scattered across taxa, regions, and institutions. Centralizing it all is often infeasible. We show that independently fine-tuned BEATs encoders can be composed into a unified 661-species classifier via task vector arithmetic without sharing data. We find that bioacoustic task vectors are near-orthogonal (cosine 0.01-0.09). Their separation aligns closely with spectral distribution distance, a gradient consistent with the acoustic niche hypothesis. This geometry makes simple averaging optimal while sign-conflict methods reduce accuracy by one to six percentage points. Composition also creates an asymmetric gap: species-rich groups lose accuracy relative to joint training while underrepresented taxa gain, a redistribution useful for equitable biodiversity monitoring. We verify linear mode connectivity across all taxonomic pairs, demonstrate zero-shot transfer to new regions, and identify domain negation as a boundary condition where composition fails. These results enable a collaborative paradigm for bioacoustics where institutions share only task vectors to assemble multi-taxa classifiers, preserving data privacy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Task arithmetic merges independently fine-tuned bioacoustic models across taxa without data sharing because the vectors are near-orthogonal, though the full 661-species average rests on an untested extrapolation from pairwise checks.

This paper shows that independently fine-tuned BEATs encoders for different taxa can be combined into a single 661-species classifier by averaging their task vectors, with no need to share raw recordings. The vectors turn out to be nearly orthogonal, with reported cosine similarities between 0.01 and 0.09, and that separation tracks spectral distribution distances in a way that matches the acoustic niche hypothesis. Simple averaging outperforms sign-conflict approaches by one to six percentage points, pairwise linear mode connectivity holds, and the authors demonstrate zero-shot transfer to new regions. The asymmetric accuracy shift is also worth noting: species-rich groups lose relative to joint training while rarer taxa gain, which could matter for balancing monitoring effort.

Referee Report

2 major / 3 minor

Summary. The paper claims that independently fine-tuned BEATs encoders for bioacoustic tasks across taxa can be composed into a single 661-species classifier using task vector arithmetic (primarily simple averaging) without any data sharing. It reports that the resulting task vectors are near-orthogonal (cosine similarities 0.01-0.09), with their geometry aligning to spectral distribution distances consistent with the acoustic niche hypothesis; simple averaging outperforms sign-conflict methods, linear mode connectivity holds for taxonomic pairs, zero-shot transfer to new regions is possible, and composition produces an asymmetric accuracy redistribution (species-rich taxa lose relative to joint training while underrepresented taxa gain). Domain negation is identified as a failure mode.

Significance. If the empirical results on multi-vector composition hold under rigorous controls, the work would enable a practical collaborative paradigm for bioacoustics in which institutions share only task vectors rather than raw recordings, directly addressing data privacy and centralization barriers. The reported alignment between model geometry and ecological principles offers a bridge between machine learning and bioacoustics theory, and the asymmetric accuracy effect could support more equitable monitoring of rare taxa. The paper provides concrete empirical measurements (cosine similarities, pairwise connectivity, accuracy deltas) that are falsifiable and could be reproduced by others sharing task vectors.

major comments (2)

[§4] §4 (multi-taxa composition experiments): Linear mode connectivity is verified only across taxonomic pairs, yet the central claim requires that simultaneous averaging of dozens of task vectors (for the 661-species unified classifier) preserves performance via connectivity. Higher-order interference or accumulated misalignment is not directly tested; the observed asymmetric accuracy gap is consistent with such effects but does not confirm the aggregate case.
[§4.1] §4.1 and accuracy tables: The soundness of accuracy comparisons, zero-shot transfer, and optimality of averaging rests on empirical verifications, but the manuscript lacks sufficient detail on experimental controls, baseline selection criteria, dataset partitioning, and whether post-hoc choices were made; this prevents full evaluation of whether the reported deltas (1-6 percentage points) are robust.

minor comments (3)

[§3] The notation for task vectors and the precise definition of 'task vector arithmetic' in the methods section would benefit from an explicit equation or pseudocode to distinguish it from prior task arithmetic literature.
[Figures] Figure captions for the cosine similarity and connectivity plots should include error bars or confidence intervals and state the exact number of taxonomic pairs evaluated.
[§2] A citation to foundational work on the acoustic niche hypothesis should be added when the alignment with spectral distances is discussed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of the work's significance and for the detailed, constructive comments on experimental rigor. We address each major point below and have revised the manuscript to strengthen the presentation of results and controls.

Referee: [§4] §4 (multi-taxa composition experiments): Linear mode connectivity is verified only across taxonomic pairs, yet the central claim requires that simultaneous averaging of dozens of task vectors (for the 661-species unified classifier) preserves performance via connectivity. Higher-order interference or accumulated misalignment is not directly tested; the observed asymmetric accuracy gap is consistent with such effects but does not confirm the aggregate case.

Authors: We agree that pairwise verification alone does not fully establish the multi-vector case. The near-orthogonality (cosine similarities 0.01–0.09) provides a theoretical basis for expecting limited higher-order interference, and the reported 661-species model was itself obtained by simultaneous averaging, with the observed asymmetric accuracy redistribution serving as an empirical outcome of that aggregate composition. To strengthen the claim, we have added a new analysis in the revised §4 that measures accuracy as a function of the number of simultaneously averaged vectors (subsets of 2, 5, 10, and 20 taxa), showing consistent scaling without abrupt degradation attributable to accumulated misalignment. While a complete connectivity proof for arbitrary numbers of vectors remains beyond the current scope, these results support the practical validity of the full composition.

revision: partial

Referee: [§4.1] §4.1 and accuracy tables: The soundness of accuracy comparisons, zero-shot transfer, and optimality of averaging rests on empirical verifications, but the manuscript lacks sufficient detail on experimental controls, baseline selection criteria, dataset partitioning, and whether post-hoc choices were made; this prevents full evaluation of whether the reported deltas (1-6 percentage points) are robust.

Authors: We acknowledge the need for greater transparency. The revised manuscript expands §4.1 and the experimental appendix with explicit descriptions of: baseline selection (joint training on pooled data where feasible for direct comparison, otherwise independent single-task models); dataset partitioning (stratified random splits by taxon and region to ensure no cross-contamination); hyperparameter consistency and random-seed controls; and confirmation that all tabulated results derive from the primary experimental protocol without post-hoc model selection. We have also added variance estimates and per-species accuracy breakdowns in supplementary material to allow direct assessment of the reported 1–6 percentage point deltas.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on direct empirical measurements

The paper's derivation chain consists of applying standard task-vector arithmetic (from prior literature) to independently fine-tuned BEATs encoders, followed by explicit experimental verification of vector angles, pairwise linear mode connectivity, and composed-model accuracy on held-out test sets. No quantity is defined in terms of itself, no fitted parameter is relabeled as a 'prediction,' and no central premise reduces to a self-citation whose content is unverified. The reported cosine similarities (0.01-0.09), connectivity checks, and accuracy deltas are measured quantities, not tautological outputs of the paper's own equations. The multi-vector composition claim is supported by direct evaluation rather than by construction from pairwise results alone.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; method builds on standard task arithmetic applied to BEATs encoders with empirical observations of vector geometry.

Reference graph

Works this paper leans on

29 extracted references · 6 canonical work pages · 1 internal anchor

[1]

Ecologically-Constrained Task Arithmetic for Multi-Taxa Bioacoustic Classifiers Without Shared Data

Introduction. Passive bioacoustic monitoring generates large volumes of recordings across thousands of sites, capturing sounds from birds, marine mammals, amphibians, and insects [1]. However, the training data needed to build automated species classifiers remain fragmented. Ornithological surveys, cetacean programs, and herpetological fieldwork each...

2026
[2]

Problem setting: We consider building a multi-taxa classifier from $N$ independently trained specialists without access to their training data

Method 2.1. Problem setting. We consider building a multi-taxa classifier from $N$ independently trained specialists without access to their training data. Given a shared pretrained encoder $\theta_0 \in \mathbb{R}^d$ and $N$ specialist checkpoints $\{\theta_1, \dots, \theta_N\}$, each fine-tuned on a private dataset $D_i$ covering a disjoint species set $S_i$ ($S_i \cap S_j = \emptyset$ for $i \neq j$), we define the task vec...

2023
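The excerpt above is cut off before the definition it introduces. For orientation, the standard task-arithmetic definitions from the prior literature (Ilharco et al.) are sketched below in the same symbols. The label $\theta_{\mathrm{avg}}$ is introduced here for illustration, and whether the paper applies any additional scaling coefficient is not visible in this excerpt.

```latex
\begin{align}
  \tau_i &= \theta_i - \theta_0, \qquad i = 1, \dots, N \\
  \theta_{\mathrm{avg}} &= \theta_0 + \frac{1}{N} \sum_{i=1}^{N} \tau_i
\end{align}
```

The first line is the weight delta of specialist $i$ relative to the shared pretrained encoder; the second is simple averaging, the composition rule this page reports as optimal.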
[3]

Experiment 1: Linear Mode Connectivity. Task vector addition relies on the assumption that all fine-tuned models lie in the same loss basin with $\theta_0$

Experiments and Results 3.1. Experiment 1: Linear Mode Connectivity. Task vector addition relies on the assumption that all fine-tuned models lie in the same loss basin with $\theta_0$. If this does not hold, merging could produce a model with high loss. We test this prerequisite for all ten pairwise combinations of group encoders by interpolating between each p...
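The pairwise interpolation test described above can be sketched as follows. This is an illustration under assumptions: the barrier is measured against the worse of the two endpoints (one common convention; the paper's exact definition is not visible in this excerpt), `toy_loss` is a hypothetical stand-in for a validation loss, and all names are illustrative.

```python
def interpolate(theta_a, theta_b, alpha):
    """Point on the straight line between two checkpoints in weight space."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]

def loss_barrier(loss_fn, theta_a, theta_b, steps=11):
    """Amount by which the highest path loss exceeds the worse endpoint.

    A value of 0.0 means no interpolated point is worse than the
    worse of the two endpoints.
    """
    alphas = [i / (steps - 1) for i in range(steps)]
    path_losses = [loss_fn(interpolate(theta_a, theta_b, a)) for a in alphas]
    worse_endpoint = max(path_losses[0], path_losses[-1])
    return max(path_losses) - worse_endpoint

def toy_loss(theta):
    # Hypothetical convex stand-in for a validation loss.
    return sum((w - 1.0) ** 2 for w in theta)

theta_a = [0.0, 1.0]
theta_b = [2.0, 1.0]
barrier = loss_barrier(toy_loss, theta_a, theta_b)
print(barrier)
```

In practice `loss_fn` would load the interpolated weights into the encoder and evaluate on held-out audio; a barrier near zero for every pair is the prerequisite the experiment checks.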
[4]

Task arithmetic offers an alternative

Discussion. Toward collaborative model building in bioacoustics. Current bioacoustic monitoring relies on either monolithic classifiers (BirdNET, Perch) that require centralized retraining for updates, or isolated specialists that cannot share knowledge. Task arithmetic offers an alternative. Each research group trains on its own data, contributes a t...
[5]

Acoustic niche partitioning extends to weight space: taxa in distinct spectral bands produce modular, composable parameter updates

Conclusion. Our results affirmatively answer both questions posed in §1, as summarized in findings F1–F3. Acoustic niche partitioning extends to weight space: taxa in distinct spectral bands produce modular, composable parameter updates. This geometric regime distinguishes bioacoustics from vision and explains why conflict-resolution methods are counterp...
[6]

Computational bioacoustics with deep learning: a review and roadmap,

D. Stowell, “Computational bioacoustics with deep learning: a review and roadmap,” PeerJ, vol. 10, p. e13152, 2022

2022
[7]

Birdnet: A deep learning solution for avian diversity monitoring,

S. Kahl, C. M. Wood, M. Eibl, and H. Klinck, “Birdnet: A deep learning solution for avian diversity monitoring,” Ecological Informatics, vol. 61, p. 101236, 2021

2021
[8]

The watkins marine mammal sound database: an online, freely accessible resource,

L. Sayigh, M. A. Daher, J. Allen, H. Gordon, K. Joyce, C. Stuhlmann, and P. Tyack, “The watkins marine mammal sound database: an online, freely accessible resource,” in Proceedings of Meetings on Acoustics, vol. 27, no. 1. Acoustical Society of America, 2016, p. 040013

2016
[9]

A dataset for benchmarking neotropical anuran calls identification in passive acoustic monitoring,

J. S. Cañas, M. P. Toro-Gómez, L. S. M. Sugai, H. D. Benítez Restrepo, J. Rudas, B. Posso Bautista, L. F. Toledo, S. Dena, A. H. R. Domingos, F. L. de Souza et al., “A dataset for benchmarking neotropical anuran calls identification in passive acoustic monitoring,” Scientific Data, vol. 10, no. 1, p. 771, 2023

2023
[10]

Perch 2.0: The bittern lesson for bioacoustics,

B. van Merriënboer, V. Dumoulin, J. Hamer, L. Harrell, A. Burns, and T. Denton, “Perch 2.0: The bittern lesson for bioacoustics,” arXiv preprint arXiv:2508.04665, 2025

2025
[11]

The search for squawk: Agile modeling in bioacoustics,

V. Dumoulin, O. Stretcu, J. Hamer, L. Harrell, R. Laber, H. Larochelle, B. van Merriënboer, A. Navine, P. Hart, B. Williams et al., “The search for squawk: Agile modeling in bioacoustics,” arXiv preprint arXiv:2505.03071, 2025

2025
[12]

Catastrophic forgetting in connectionist networks,

R. M. French, “Catastrophic forgetting in connectionist networks,” Trends in cognitive sciences, vol. 3, no. 4, pp. 128–135, 1999

1999
[13]

Editing models with task arithmetic,

G. Ilharco, M. T. Ribeiro, M. Wortsman, S. Gururangan, L. Schmidt, H. Hajishirzi, and A. Farhadi, “Editing models with task arithmetic,” in International Conference on Learning Representations (ICLR), 2023

2023
[14]

Ties-merging: Resolving interference when merging models,

P. Yadav, D. Tam, L. Choshen, C. A. Raffel, and M. Bansal, “Ties-merging: Resolving interference when merging models,” Advances in neural information processing systems, vol. 36, pp. 7093–7115, 2023

2023
[15]

Distilling a speech and music encoder with task arithmetic,

F. Ritter-Gutierrez, Y.-C. Lin, J.-C. Wei, J. H. Wong, E. S. Chng, N. F. Chen, and H.-y. Lee, “Distilling a speech and music encoder with task arithmetic,” in Proc. Interspeech 2025, 2025, pp. 3858–3862

2025
[16]

The niche hypothesis: a virtual symphony of animal sounds, the origins of musical expression and the health of habitats,

B. L. Krause et al., “The niche hypothesis: a virtual symphony of animal sounds, the origins of musical expression and the health of habitats,” The Soundscape Newsletter, vol. 6, no. 5, 1993

1993
[17]

Model merging improves zero-shot generalization in bioacoustic foundation models,

D. Marincione, D. Crisostomi, R. Dessi, E. Rodolà, and E. Rossi, “Model merging improves zero-shot generalization in bioacoustic foundation models,” in The Thirty-Ninth Annual Conference on Neural Information Processing Systems workshop: AI for non-human animal communication, 2025

2025
[18]

Weakly supervised multiple instance learning for whale call detection and temporal localization in long-duration passive acoustic monitoring,

R. A. Nihal, B. Yen, R. Shi, and K. Nakadai, “Weakly supervised multiple instance learning for whale call detection and temporal localization in long-duration passive acoustic monitoring,” arXiv preprint arXiv:2502.20838, 2025

2025
[19]

Language models are super mario: Absorbing abilities from homologous models as a free lunch,

L. Yu, B. Yu, H. Yu, F. Huang, and Y. Li, “Language models are super mario: Absorbing abilities from homologous models as a free lunch,” in Forty-first International Conference on Machine Learning, 2024

2024
[20]

Della-merging: Reducing interference in model merging through magnitude-based sampling,

P. T. Deep, R. Bhardwaj, and S. Poria, “Della-merging: Reducing interference in model merging through magnitude-based sampling,” arXiv preprint arXiv:2406.11617, 2024

2024
[21]

Beats: audio pre-training with acoustic tokenizers,

S. Chen, Y. Wu, C. Wang, S. Liu, D. Tompkins, Z. Chen, W. Che, X. Yu, and F. Wei, “Beats: audio pre-training with acoustic tokenizers,” in Proceedings of the 40th International Conference on Machine Learning, ser. ICML’23. JMLR.org, 2023

2023
[22]

Linear mode connectivity and the lottery ticket hypothesis,

J. Frankle, G. K. Dziugaite, D. Roy, and M. Carbin, “Linear mode connectivity and the lottery ticket hypothesis,” in International conference on machine learning. PMLR, 2020, pp. 3259–3269

2020
[23]

Task arithmetic through the lens of one-shot federated learning,

Z. Tao, I. Mason, S. Kulkarni, and X. Boix, “Task arithmetic through the lens of one-shot federated learning,” Transactions on Machine Learning Research, 2025

2025
[24]

Scaffold: Stochastic controlled averaging for federated learning,

S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh, “Scaffold: Stochastic controlled averaging for federated learning,” in International conference on machine learning. PMLR, 2020, pp. 5132–5143

2020
[25]

Birdset: A large-scale dataset for audio classification in avian bioacoustics,

L. Rauch, R. Schwinger, M. Wirth, R. Heinrich, D. Huseljic, M. Herde, J. Lange, S. Kahl, B. Sick, S. Tomforde et al., “Birdset: A large-scale dataset for audio classification in avian bioacoustics,” arXiv preprint arXiv:2403.10380, 2024.

Generative AI Use Disclosure. Generative AI tools were used for editing prose and debugging code analysis. All scientifi...

2024
[26]

Trim: Zero out the bottom $k$ fraction of parameters by magnitude in each task vector, retaining only the largest changes
[27]

Elect sign: For each parameter, take a majority vote across task vectors to determine the dominant sign
[28]

spectral fingerprint

Disjoint merge: Average only the values that agree with the elected sign; discard conflicting values.

B.3. DARE. DARE [14] builds on the observation that fine-tuning deltas are highly redundant: dropping most individual parameter changes often leaves the merged model’s behavior unchanged. DARE randomly zeros each parameter with probability $p$ and rescales the...

2023
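The three sign-conflict steps listed above (trim, elect sign, disjoint merge) can be sketched for flattened task vectors stored as Python lists. This is an illustration, not the paper's code: the function names and toy numbers are hypothetical, and the sign election uses the sign of the summed trimmed values (a magnitude-weighted vote, as in the TIES-merging paper) rather than a pure count.

```python
def trim(tau, drop_fraction):
    """Zero the smallest-magnitude entries; ties at the cutoff are kept."""
    n_keep = max(1, len(tau) - int(len(tau) * drop_fraction))
    cutoff = sorted((abs(x) for x in tau), reverse=True)[n_keep - 1]
    return [x if abs(x) >= cutoff else 0.0 for x in tau]

def elect_sign(trimmed):
    """Per-parameter dominant sign: sign of the sum across task vectors."""
    totals = [sum(column) for column in zip(*trimmed)]
    return [1.0 if t > 0 else -1.0 if t < 0 else 0.0 for t in totals]

def disjoint_merge(trimmed, signs):
    """Average only the values whose sign matches the elected sign."""
    merged = []
    for column, sign in zip(zip(*trimmed), signs):
        agreeing = [x for x in column if x * sign > 0]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged

def ties_merge(task_vectors, drop_fraction=0.8):
    trimmed = [trim(tau, drop_fraction) for tau in task_vectors]
    return disjoint_merge(trimmed, elect_sign(trimmed))

# Toy task vectors (hypothetical): they disagree in sign on index 1.
tau_a = [4.0, -3.0, 0.1, 0.2]
tau_b = [2.0, 5.0, -0.1, 0.3]
merged_tau = ties_merge([tau_a, tau_b], drop_fraction=0.5)
print(merged_tau)
```

On index 1 the procedure discards the conflicting value from `tau_a` entirely, whereas a plain average would keep both. When task vectors are near-orthogonal there is little genuine conflict to resolve, which fits the page's report that these methods cost accuracy here.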
[29]

The composition gap is measured using the unified probe

Per-group probes assess how well the merged encoder preserves discriminability within each group, while the unified probe evaluates global multi-taxa classification performance. The composition gap is measured using the unified probe. We also evaluate using $k$-nearest neighbors with $k = 1$. This approach requires no training: each test sample receives the la- ...
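A training-free 1-nearest-neighbor evaluation of this kind can be sketched as below: each test embedding takes the label of its closest training embedding. Euclidean distance is an assumption (the excerpt does not state the metric), and the function names and toy embeddings are hypothetical.

```python
import math

def predict_1nn(train_embeddings, train_labels, query):
    """Label of the training embedding closest to the query (Euclidean)."""
    distances = [math.dist(e, query) for e in train_embeddings]
    return train_labels[distances.index(min(distances))]

def accuracy_1nn(train_embeddings, train_labels, test_embeddings, test_labels):
    """Fraction of test samples whose nearest neighbor has the right label."""
    hits = sum(
        predict_1nn(train_embeddings, train_labels, x) == y
        for x, y in zip(test_embeddings, test_labels)
    )
    return hits / len(test_labels)

# Toy 2-D embeddings (hypothetical); real ones would come from the encoder.
train_x = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
train_y = ["frog", "frog", "whale"]
test_x = [[0.2, 0.1], [4.0, 4.5]]
test_y = ["frog", "whale"]
acc = accuracy_1nn(train_x, train_y, test_x, test_y)
print(acc)
```

Because no probe is trained, the score depends only on the geometry of the merged encoder's embedding space.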