Unlocking In-Context Learning in Audio-Language Models from Decentralized Medical Audio

Aaqib Saeed; Hareld Kemps; Linda Moonen; Martijn den Dekker; Ran Piao; Tsai-Ning Wang; Yuan Lu

arxiv: 2606.23243 · v1 · pith:T7ZBH3MDnew · submitted 2026-06-22 · 💻 cs.LG · cs.SD

Ran Piao, Tsai-Ning Wang, Martijn den Dekker, Linda Moonen, Hareld Kemps, Yuan Lu, Aaqib Saeed

Pith reviewed 2026-06-26 09:14 UTC · model grok-4.3

classification 💻 cs.LG cs.SD

keywords in-context learning, audio-language models, federated learning, medical audio diagnosis, pseudo-labeling, unsupervised clustering, respiratory audio, cardiac audio

The pith

Federated Self-Contextualization lets audio-language models do in-context clinical diagnosis from pseudo-label episodes built by unsupervised clustering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that audio-language models can acquire in-context diagnostic reasoning for clinical audio when real labels are scarce by first aligning embeddings through caption pretraining and then using federated optimization on episodes whose labels come from clustering the audio representations themselves. A sympathetic reader would care because many medical audio tasks lack large annotated corpora yet still require models that generalize from only a handful of examples at test time. The approach operates across decentralized hospital clients and reaches 71.6 percent accuracy in 2-way 2-shot evaluation on held-out respiratory and cardiac conditions. It does so by treating the clustered pseudo-labels as the support set for multimodal reasoning over a query audio sample.

Core claim

FSC constructs pseudo-label episodes via unsupervised clustering of audio representations, bypassing scarce real diagnostic labels, and enables contextual reasoning from support-query pairs. Our progressive three-stage pipeline first aligns audio embeddings with the language model via caption-based pretraining, then adapts it for episodic in-context inference through federated optimization. At test time, given a small labeled support set, the model diagnoses an unseen query through multimodal reasoning. On held-out respiratory and cardiac conditions, FSC achieves 71.6% accuracy in 2-way 2-shot evaluation, outperforming audio-language baselines by over 9%.

What carries the argument

Federated Self-Contextualization (FSC), the three-stage pipeline that creates pseudo-label episodes from unsupervised clustering of audio representations and then performs federated adaptation for multimodal in-context inference.
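The episode-construction step can be illustrated with a minimal Python sketch. It assumes cluster assignments are already available for each audio clip (the abstract does not say which clustering algorithm is used), and the function name `build_episode` and its parameters are hypothetical, not taken from the paper. Each cluster index serves as a pseudo-label; an episode draws K support clips from each of N clusters plus one query clip from one of them.

```python
import random
from collections import defaultdict


def build_episode(cluster_ids, n_way=2, k_shot=2, rng=None):
    """Sample one N-way K-shot episode from cluster pseudo-labels.

    cluster_ids[i] is the cluster index assigned to clip i (assumed given).
    Returns (support, query): support is a list of (clip_index, pseudo_label)
    pairs and query is a single (clip_index, pseudo_label) pair.
    """
    rng = rng or random.Random()
    by_cluster = defaultdict(list)
    for idx, c in enumerate(cluster_ids):
        by_cluster[c].append(idx)

    # Conservative choice: require K+1 clips per cluster so that any chosen
    # cluster could supply K supports plus the held-out query.
    eligible = [c for c, m in by_cluster.items() if len(m) >= k_shot + 1]
    if len(eligible) < n_way:
        raise ValueError("not enough sufficiently large clusters")

    chosen = rng.sample(sorted(eligible), n_way)
    query_cluster = rng.choice(chosen)

    support, query = [], None
    for c in chosen:
        need = k_shot + 1 if c == query_cluster else k_shot
        picked = rng.sample(by_cluster[c], need)
        support.extend((i, c) for i in picked[:k_shot])
        if c == query_cluster:
            query = (picked[k_shot], c)  # never overlaps with the support clips
    return support, query
```

In a federated setting such as the one described, each client would presumably run this locally on its own cluster assignments, so no clip indices or pseudo-labels leave the institution.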

Load-bearing premise

That unsupervised clustering of audio representations produces pseudo-label episodes sufficiently reliable for the model to learn effective multimodal in-context reasoning without real diagnostic labels.

What would settle it

A test in which the unsupervised clusters show no better alignment with actual diagnostic categories than random grouping, resulting in accuracy no higher than the audio-language baselines.
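One way to run such a test, sketched in Python under stated assumptions: compute normalized mutual information (NMI) between cluster assignments and diagnostic labels, then compare it with the NMI obtained after randomly shuffling the cluster assignments. The function names, the arithmetic-mean normalization, and the permutation count are illustrative choices, not details from the paper.

```python
import math
import random
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(labels_a, labels_b):
    """NMI with arithmetic-mean normalization; 0.0 if both labelings are constant."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = sum(
        (nab / n) * math.log(nab * n / (ca[a] * cb[b]))
        for (a, b), nab in joint.items()
    )
    denom = entropy(labels_a) + entropy(labels_b)
    return 0.0 if denom == 0 else 2.0 * mi / denom


def permutation_baseline(clusters, diagnoses, n_perm=1000, seed=0):
    """Return (observed NMI, mean NMI under shuffling, one-sided p-value)."""
    rng = random.Random(seed)
    observed = nmi(clusters, diagnoses)
    shuffled = list(clusters)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null.append(nmi(shuffled, diagnoses))
    # Add-one smoothing keeps the p-value strictly positive.
    p_value = (1 + sum(s >= observed for s in null)) / (n_perm + 1)
    return observed, sum(null) / n_perm, p_value
```

If the observed NMI sits inside the shuffled distribution (a large p-value), the clusters carry no more diagnostic information than random grouping, which is the failure case described above.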

Figures

Figures reproduced from arXiv: 2606.23243 by Aaqib Saeed, Hareld Kemps, Linda Moonen, Martijn den Dekker, Ran Piao, Tsai-Ning Wang, Yuan Lu.

**Figure 1.** Overview of Federated Self-Contextualization (FSC). (A) Federated setting with client-side clustering for local pseudo-label generation. (B) Progressive local ICL training pipeline. (C) In-context clinical inference using real-world few-shot support examples.

Original abstract

Clinical audio diagnosis in low-resource settings requires models that identify conditions from minimal examples without large annotated corpora. We propose Federated Self-Contextualization (FSC), a multimodal language model framework for in-context clinical audio diagnosis across federated hospital clients. FSC constructs pseudo-label episodes via unsupervised clustering of audio representations, bypassing scarce real diagnostic labels, and enables contextual reasoning from support-query pairs. Our progressive three-stage pipeline first aligns audio embeddings with the language model via caption-based pretraining, then adapts it for episodic in-context inference through federated optimization. At test time, given a small labeled support set, the model diagnoses an unseen query through multimodal reasoning. On held-out respiratory and cardiac conditions, FSC achieves 71.6% accuracy in 2-way 2-shot evaluation, outperforming audio-language baselines by over 9%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FSC claims 71.6% 2-shot accuracy on medical audio via federated pseudo-label clustering, but the abstract supplies no validation that the clusters track clinical categories.

The main takeaway is that this paper outlines a three-stage pipeline called Federated Self-Contextualization for in-context learning on decentralized medical audio. It uses caption pretraining to align audio with language, then federated optimization over episodes built from unsupervised clustering of audio embeddings, and finally in-context inference on small support sets. The headline number is 71.6% accuracy in 2-way 2-shot on held-out respiratory and cardiac conditions, beating baselines by more than 9%.

What is new is the specific combination of federated training with unsupervised pseudo-label episodes for an audio-language model aimed at clinical diagnosis. The setup directly tackles label scarcity and data privacy across hospital clients, which is a practical constraint in this domain.
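For readers unfamiliar with the federated part, the server-side step can be sketched as sample-weighted parameter averaging. This is an assumption for illustration only: the abstract does not state which aggregation rule FSC uses, and the FedAvg-style rule below (after McMahan et al.) is swapped in as the most common default. The function name and the dictionary layout are hypothetical.

```python
def fedavg(client_weights, client_sizes):
    """Sample-weighted average of client parameters (FedAvg-style, assumed).

    client_weights: list of dicts mapping parameter name -> list of floats
    client_sizes:   number of local training examples per client
    """
    total = sum(client_sizes)
    aggregated = {}
    for name in client_weights[0]:
        dim = len(client_weights[0][name])
        aggregated[name] = [
            sum(w[name][j] * n for w, n in zip(client_weights, client_sizes)) / total
            for j in range(dim)
        ]
    return aggregated
```

Only parameters (or parameter updates) are exchanged in this scheme; the audio and the locally generated pseudo-labels stay on each client, which is what makes the setup attractive under hospital privacy constraints.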

The framing of the problem and the staged pipeline are clear enough on paper. The approach of generating episodes without real diagnostic labels is a reasonable attempt to scale in-context methods where annotations are expensive.

The weakest part is the unsupervised clustering step that supplies the pseudo-labels. The abstract gives no cluster purity numbers, no normalized mutual information against any available labels, and no qualitative checks on what the clusters actually contain. Without that, it is impossible to tell whether the reported gains come from the episodic mechanism or from the caption pretraining alone. Dataset sizes, error bars, and statistical tests are also absent, so the central performance claim cannot be evaluated.
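Two of the missing checks are cheap to compute. The Python sketch below shows cluster purity against reference labels and a normal-approximation 95% interval for episodic accuracy. Both function names are hypothetical, and the number of evaluation episodes is not given in the abstract, so any values passed in would have to come from the full paper.

```python
import math
from collections import Counter, defaultdict


def cluster_purity(clusters, labels):
    """Fraction of samples whose label is the majority label of their cluster."""
    members = defaultdict(list)
    for c, y in zip(clusters, labels):
        members[c].append(y)
    majority = sum(Counter(ys).most_common(1)[0][1] for ys in members.values())
    return majority / len(labels)


def accuracy_ci(n_correct, n_total, z=1.96):
    """Accuracy with a normal-approximation confidence interval, clipped to [0, 1]."""
    p = n_correct / n_total
    half = z * math.sqrt(p * (1.0 - p) / n_total)
    return p, max(0.0, p - half), min(1.0, p + half)
```

Purity alone can be inflated by using many small clusters, so it is best read alongside a chance-corrected measure such as NMI.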

This is aimed at researchers working on federated multimodal models or in-context learning for audio in healthcare. A reader focused on privacy-preserving medical AI could get some value from the framework description, but only if the full paper adds the missing checks on the clustering quality.

I would send it to peer review. The idea is coherent on its own terms and the problem is real, but the current evidence is too thin to judge whether the pipeline actually works as described.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes Federated Self-Contextualization (FSC), a multimodal language model framework for in-context clinical audio diagnosis in federated hospital settings. It constructs pseudo-label episodes via unsupervised clustering of audio representations to bypass scarce diagnostic labels, then uses a three-stage pipeline of caption-based pretraining followed by federated optimization over support-query episodes. At inference, the model performs multimodal reasoning on a small labeled support set to diagnose an unseen query. The central empirical claim is that FSC achieves 71.6% accuracy in 2-way 2-shot evaluation on held-out respiratory and cardiac conditions, outperforming audio-language baselines by over 9%.

Significance. If the result holds after proper validation, the work would be significant for enabling in-context learning in medical audio analysis under privacy and annotation constraints typical of decentralized clinical environments. The combination of unsupervised pseudo-labeling with federated episodic training addresses a practical gap in low-resource multimodal medical AI. No machine-checked proofs or parameter-free derivations are present, but the empirical framing on held-out conditions offers a falsifiable prediction that could be tested with additional controls.

major comments (1)

[Abstract] The central performance claim (71.6% 2-way 2-shot accuracy and >9% improvement) rests on the unsupervised clustering step producing pseudo-label episodes that align with clinically relevant categories. No quantitative validation of this step (cluster purity, NMI against any labels, or alignment metrics) is supplied, so it is impossible to determine whether the reported gains arise from the episodic pseudo-label mechanism rather than caption pretraining alone or evaluation artifacts.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and for highlighting the need to validate the unsupervised clustering component. We address this concern directly below and will incorporate the requested analysis in the revision.

Referee: [Abstract] The central performance claim (71.6% 2-way 2-shot accuracy and >9% improvement) rests on the unsupervised clustering step producing pseudo-label episodes that align with clinically relevant categories. No quantitative validation of this step (cluster purity, NMI against any labels, or alignment metrics) is supplied, so it is impossible to determine whether the reported gains arise from the episodic pseudo-label mechanism rather than caption pretraining alone or evaluation artifacts.

Authors: We agree that the manuscript would benefit from explicit quantitative validation of the clustering step. In the revised version we will add a dedicated analysis section reporting silhouette scores on the learned audio representations together with normalized mutual information (NMI) computed against a small held-out subset of diagnostic labels reserved solely for this validation (not used in training or evaluation). We will also include an ablation that isolates the contribution of the episodic pseudo-label stage by comparing the full FSC pipeline against a caption-pretraining-only baseline. These additions will allow readers to assess whether the reported gains are attributable to the pseudo-label episodes.

Revision: yes
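The silhouette analysis promised in the response needs no labels at all, which is why it suits the label-free training setting. A minimal pure-Python sketch follows, assuming Euclidean distance over embedding vectors; the distance metric and the function name are assumptions, since the simulated rebuttal does not specify them.

```python
import math


def silhouette_score(points, clusters):
    """Mean silhouette coefficient over all points (Euclidean distance).

    Points in singleton clusters contribute 0, following the usual convention.
    """
    if len(set(clusters)) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, p in enumerate(points):
        sums, counts = {}, {}
        for j, q in enumerate(points):
            if i == j:
                continue
            c = clusters[j]
            sums[c] = sums.get(c, 0.0) + math.dist(p, q)
            counts[c] = counts.get(c, 0) + 1
        own = clusters[i]
        if counts.get(own, 0) == 0:
            continue  # singleton cluster
        a = sums[own] / counts[own]  # mean distance within own cluster
        b = min(sums[c] / counts[c] for c in sums if c != own)  # nearest other cluster
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)
```

Scores near 1 indicate compact, well-separated clusters and scores near or below 0 indicate overlap. A high silhouette score still says nothing about clinical meaning, so it complements rather than replaces the NMI check against held-out labels.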

Circularity Check

0 steps flagged

No circularity: empirical accuracy from held-out evaluation

The paper presents a three-stage pipeline (caption pretraining, federated episodic adaptation via unsupervised clustering for pseudo-labels, then test-time in-context inference) whose central claim is an empirical accuracy number (71.6% 2-way 2-shot on held-out conditions) measured against external baselines. No equations, derivations, or fitted parameters are redefined as predictions; the unsupervised clustering step is an input mechanism whose quality is not asserted by construction but left as an empirical assumption. No self-citation chains or uniqueness theorems are invoked to force the result. The reported performance is therefore an independent measurement rather than a quantity that reduces to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities; full manuscript required for audit.

pith-pipeline@v0.9.1-grok · 5691 in / 1009 out tokens · 29270 ms · 2026-06-26T09:14:27.297170+00:00 · methodology

Reference graph

Works this paper leans on

43 extracted references · 13 canonical work pages · 9 internal anchors

[1] Unlocking In-Context Learning in Audio-Language Models from Decentralized Medical Audio. Introduction: Clinical audio diagnosis is fundamentally an act of grounded reasoning: a clinician interprets what they hear (a wheeze, a murmur, an abnormal cough) in light of what they know about the clinical conditions those sounds signify [1, 2]. This process is open-ended, contextual, and knowledge-intensive. Yet the dominant machine learning paradigm for...
[2] Methodology, 2.1 Problem Formulation: We consider clinical audio diagnosis where recordings are distributed across $M$ institutions without centralized access and no expert labels are available during training. Each diagnostic episode $E = \{S, x_q\}$ consists of a support set $S = \{(x_i, y_i)\}_{i=1}^{NK}$ with $K$ audio-label pairs for each of $N$ diagnostic concepts, and a quer...
[3] Experiments, 3.1 Datasets: We evaluate on seven medical audio datasets spanning respiratory and cardiac domains (Table 2). In the federated setup, each dataset is assigned to a separate client, yielding $M = 7$ institutions with naturally heterogeneous recording conditions, label spaces, and sample sizes. All audio is resampled to 16 kHz and segmented into f...
[4] Results and Analysis: We evaluate FSC under the label-unseen episodic protocol described in Section 3.2, where no diagnostic labels are seen during training and classification relies entirely on in-context conditioning at inference. To mitigate linguistic bias from disease phrasing, each condition is represented by three physician-validated textua...
[5] Conclusions: This paper introduced FSC, a federated framework for few-shot clinical audio diagnosis that operates without centralized data sharing or real diagnostic labels. FSC pairs pseudo-label episode construction with a three-stage training pipeline to equip a multimodal language model with in-context diagnostic reasoning from limited support example...
[6] Generative AI Use Disclosure: Generative AI tools were used solely for language editing and polishing to improve clarity and readability of the manuscript. All technical content, experimental design, analysis, and conclusions were created by the authors. The authors take full responsibility for the content of this paper.
[7] Acknowledgments: This work was supported by the NWO AiNed Fellowship Grant of A.S., and in part by Google.org and the Google Cloud Research Credits program through the Gemini Academic Program. We also acknowledge the use of the Dutch National Supercomputer Snellius for essential computational tasks.
[8] Á. Troncoso, J. A. Ortega, R. Seepold, and N. M. Madrid, “Non-invasive devices for respiratory sound monitoring,” Procedia computer science, vol. 192, pp. 3040–3048, 2021.
[9] C. Potes, S. Parvaneh, A. Rahman, and B. Conroy, “Ensemble of feature-based and deep learning-based classifiers for detection of abnormal heart sounds,” in 2016 computing in cardiology conference (CinC). IEEE, 2016, pp. 621–624.
[10] A. Imran, I. Posokhova, H. N. Qureshi, U. Masood, M. S. Riaz, K. Ali, C. N. John, M. I. Hussain, and M. Nabeel, “Ai4covid-19: Ai enabled preliminary diagnosis for covid-19 from cough samples via an app,” Informatics in medicine unlocked, vol. 20, p. 100378, 2020.
[11] W. Chen, Q. Sun, X. Chen, G. Xie, H. Wu, and C. Xu, “Deep learning methods for heart sounds classification: A systematic review,” Entropy, vol. 23, no. 6, p. 667, 2021.
[12] S. Baur, Z. Nabulsi, W.-H. Weng, J. Garrison, L. Blankemeier, S. Fishman, C. Chen, S. Kakarmath, M. Maimbolwa, N. Sanjase et al., “Hear–health acoustic representations,” arXiv preprint arXiv:2403.02522, 2024.
[13] J. Xu, B. S. Glicksberg, C. Su, P. Walker, J. Bian, and F. Wang, “Federated learning for healthcare informatics,” Journal of healthcare informatics research, vol. 5, no. 1, pp. 1–19, 2021.
[14] N. Rieke, J. Hancox, W. Li, F. Milletari, H. R. Roth, S. Albarqouni, S. Bakas, M. N. Galtier, B. A. Landman, K. Maier-Hein et al., “The future of digital health with federated learning,” NPJ digital medicine, vol. 3, no. 1, p. 119, 2020.
[15] A. Sellergren, S. Kazemzadeh, T. Jaroensri, A. Kiraly, M. Traverse, T. Kohlberger, S. Xu, F. Jamil, C. Hughes, C. Lau et al., “Medgemma technical report,” arXiv preprint arXiv:2507.05201, 2025.
[16] Q. Dong, L. Li, D. Dai, C. Zheng, J. Ma, R. Li, H. Xia, J. Xu, Z. Wu, B. Chang et al., “A survey on in-context learning,” in Proceedings of the 2024 conference on empirical methods in natural language processing, 2024, pp. 1107–1128.
[17] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, “Generalizing from a few examples: A survey on few-shot learning,” ACM computing surveys (csur), vol. 53, no. 3, pp. 1–34, 2020.
[18] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[19] B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, “Clap learning audio concepts from natural language supervision,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[20] T.-N. Wang, L.-L. Chen, N. Zeghidour, and A. Saeed, “Careaqa: A cardiac and respiratory audio question answering model for open-ended diagnostic reasoning,” arXiv preprint arXiv:2505.01199, 2025.
[21] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” Proceedings of Machine learning and systems, vol. 2, pp. 429–450, 2020.
[22] R. S. Antunes, C. André da Costa, A. Küderle, I. A. Yari, and B. Eskofier, “Federated learning for healthcare: Systematic review and architecture proposal,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 13, no. 4, pp. 1–23, 2022.
[23] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., “Lora: Low-rank adaptation of large language models,” ICLR, vol. 1, no. 2, p. 3, 2022.
[24] D. J. Beutel, T. Topal, A. Mathur, X. Qiu, J. Fernandez-Marques, Y. Gao, L. Sani, K. H. Li, T. Parcollet, P. P. B. de Gusmão et al., “Flower: A friendly federated learning research framework,” arXiv preprint arXiv:2007.14390, 2020.
[25] B. M. Rocha, D. Filos, L. Mendes, G. Serbes, S. Ulukaya, Y. P. Kahya, N. Jakovljevic, T. L. Turukalo, I. M. Vogiatzis, E. Perantoni et al., “An open access database for the evaluation of respiratory sound classification algorithms,” Physiological measurement, vol. 40, no. 3, p. 035001, 2019.
[26] J. Oliveira, F. Renna, P. Costa, M. Nogueira, A. C. Oliveira, A. Elola, C. Ferreira, A. Jorge, A. Bahrami Rad, M. Reyna, R. Sameni, G. Clifford, and M. Coimbra, “The CirCor DigiScope Phonocardiogram Dataset,” PhysioNet, May 2022, version 1.0.3. [Online]. Available: https://doi.org/10.13026/tshs-mw03
[27] L. Orlandic, T. Teijeiro, and D. Atienza, “The coughvid crowdsourcing dataset, a corpus for the study of large-scale cough analysis algorithms,” Scientific Data, vol. 8, no. 1, p. 156, 2021.
[28] F.-S. Hsu, S.-R. Huang, C.-W. Huang, C.-J. Huang, Y.-R. Cheng, C.-C. Chen, J. Hsiao, C.-W. Chen, L.-C. Chen, Y.-C. Lai et al., “Benchmarking of eight recurrent neural network variants for breath phase and adventitious sound detection on a self-developed open-access lung sound database—hf lung v1,” PLoS One, vol. 16, no. 7, p. e0254134, 2021.
[29] Q. Zhang, J. Zhang, J. Yuan, H. Huang, Y. Zhang, B. Zhang, G. Lv, S. Lin, N. Wang, X. Liu et al., “Sprsound: Open-source sjtu paediatric respiratory sound database,” IEEE Transactions on Biomedical Circuits and Systems, vol. 16, no. 5, pp. 867–881, 2022.
[30] T. Xia, D. Spathis, J. Ch, A. Grammenos, J. Han, A. Hasthanasombat, E. Bondareva, T. Dang, A. Floto, P. Cicuta et al., “Covid-19 sounds: a large-scale audio dataset for digital respiratory screening,” in Thirty-fifth conference on neural information processing systems datasets and benchmarks track (round 2), 2021.
[31] W. Jia, Y. Wang, R. Chen, J. Ye, D. Li, F. Yin, J. Yu, J. Chen, Q. Shu, and W. Xu, “Zchsound: Open-source zju paediatric heart sound database with congenital heart disease,” IEEE Transactions on Biomedical Engineering, vol. 71, no. 8, pp. 2278–2286, 2024.
[32] S. Deshmukh, B. Elizalde, R. Singh, and H. Wang, “Pengi: An audio language model for audio tasks,” Advances in Neural Information Processing Systems, vol. 36, pp. 18 090–18 108, 2023.
[33] S. Ghosh, S. Kumar, A. Seth, C. K. R. Evuru, U. Tyagi, S. Sakshi, O. Nieto, R. Duraiswami, and D. Manocha, “Gama: A large audio-language model with advanced audio understanding and complex reasoning abilities,” arXiv preprint arXiv:2406.11768, 2024.
[34] G. Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, M. Rivière et al., “Gemma 3 technical report,” arXiv preprint arXiv:2503.19786, 2025.
[35] J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang, B. Zhang, X. Wang, Y. Chu, and J. Lin, “Qwen2.5-omni technical report,” arXiv preprint arXiv:2503.20215, 2025.
[36] A. Goel, S. Ghosh, J. Kim, S. Kumar, Z. Kong, S.-g. Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle et al., “Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,” arXiv preprint arXiv:2507.08128, 2025.
[37] C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” in Text summarization branches out, 2004, pp. 74–81.
[38] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “Bertscore: Evaluating text generation with bert,” arXiv preprint arXiv:1904.09675, 2019.
[39] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
[40] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial intelligence and statistics. PMLR, 2017, pp. 1273–1282.
[41] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.
[42] Q. Team, “Qwen2.5: A party of foundation models,” September. [Online]. Available: https://qwenlm.github.io/blog/qwen2.5/

[1] [1]

Unlocking In-Context Learning in Audio-Language Models from Decentralized Medical Audio

Introduction Clinical audio diagnosis is fundamentally an act of grounded reasoning: a clinician interprets what theyhear—a wheeze, a murmur, an abnormal cough—in light of what theyknowabout the clinical conditions those sounds signify [1, 2]. This pro- cess is open-ended, contextual, and knowledge-intensive. Yet the dominant machine learning paradigm for...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

Mountain Breeze,

Methodology 2.1. Problem Formulation We consider clinical audio diagnosis where recordings are dis- tributed acrossMinstitutions without centralized access and no expert labels are available during training. Each diag- nostic episodeE={S, x q}consists of a support setS= {(xi, yi)}N K i=1 withKaudio-label pairs for each ofNdiagnos- tic concepts, and a quer...

[3] [3]

Datasets We evaluate on seven medical audio datasets spanning respira- tory and cardiac domains (Table 2)

Experiments 3.1. Datasets We evaluate on seven medical audio datasets spanning respira- tory and cardiac domains (Table 2). In the federated setup, each dataset is assigned to a separate client, yieldingM=7institu- tions with naturally heterogeneous recording conditions, label spaces, and sample sizes. All audio is resampled to 16 kHz and segmented into f...

1906

[4] [4]

Results and Analysis We evaluate FSC under the label-unseen episodic protocol de- scribed in Section 3.2, where no diagnostic labels are seen dur- ing training and classification relies entirely on in-context con- ditioning at inference. To mitigate linguistic bias from dis- ease phrasing, each condition is represented by three physician- validated textua...

[5] [5]

Conclusions This paper introduced FSC, a federated framework for few- shot clinical audio diagnosis that operates without centralized data sharing or real diagnostic labels. FSC pairs pseudo-label episode construction with a three-stage training pipeline to equip a multimodal language model with in-context diagnostic reasoning from limited support example...

[6] [6]

All technical content, experimental design, analysis, and con- clusions were created by the authors

Generative AI Use Disclosure Generative AI tools were used solely for language editing and polishing to improve clarity and readability of the manuscript. All technical content, experimental design, analysis, and con- clusions were created by the authors. The authors take full re- sponsibility for the content of this paper

[7] [7]

We also acknowledge the use of the Dutch National Supercom- puter Snellius for essential computational tasks

Acknowledgments This work was supported by the NWO AiNed Fellowship Grant of A.S., and in part by Google.org and the Google Cloud Re- search Credits program through the Gemini Academic Program. We also acknowledge the use of the Dutch National Supercom- puter Snellius for essential computational tasks

[8] [8]

Non- invasive devices for respiratory sound monitoring,

´A. Troncoso, J. A. Ortega, R. Seepold, and N. M. Madrid, “Non- invasive devices for respiratory sound monitoring,”Procedia com- puter science, vol. 192, pp. 3040–3048, 2021

2021

[9] [9]

Ensemble of feature-based and deep learning-based classifiers for detection of abnormal heart sounds,

C. Potes, S. Parvaneh, A. Rahman, and B. Conroy, “Ensemble of feature-based and deep learning-based classifiers for detection of abnormal heart sounds,” in2016 computing in cardiology confer- ence (CinC). IEEE, 2016, pp. 621–624

2016

[10] [10]

Ai4covid- 19: Ai enabled preliminary diagnosis for covid-19 from cough samples via an app,

A. Imran, I. Posokhova, H. N. Qureshi, U. Masood, M. S. Riaz, K. Ali, C. N. John, M. I. Hussain, and M. Nabeel, “Ai4covid- 19: Ai enabled preliminary diagnosis for covid-19 from cough samples via an app,”Informatics in medicine unlocked, vol. 20, p. 100378, 2020

2020

[11] [11]

Deep learning methods for heart sounds classification: A systematic re- view,

W. Chen, Q. Sun, X. Chen, G. Xie, H. Wu, and C. Xu, “Deep learning methods for heart sounds classification: A systematic re- view,”Entropy, vol. 23, no. 6, p. 667, 2021

2021

[12] [12]

Hear–health acoustic representations,

S. Baur, Z. Nabulsi, W.-H. Weng, J. Garrison, L. Blankemeier, S. Fishman, C. Chen, S. Kakarmath, M. Maimbolwa, N. San- jaseet al., “Hear–health acoustic representations,”arXiv preprint arXiv:2403.02522, 2024

work page arXiv 2024

[13] [13]

Federated learning for healthcare informatics,

J. Xu, B. S. Glicksberg, C. Su, P. Walker, J. Bian, and F. Wang, “Federated learning for healthcare informatics,”Journal of health- care informatics research, vol. 5, no. 1, pp. 1–19, 2021

2021

[14] [14]

The future of digital health with federated learning,

N. Rieke, J. Hancox, W. Li, F. Milletari, H. R. Roth, S. Albar- qouni, S. Bakas, M. N. Galtier, B. A. Landman, K. Maier-Hein et al., “The future of digital health with federated learning,”NPJ digital medicine, vol. 3, no. 1, p. 119, 2020

2020

[15] [15]

MedGemma Technical Report

A. Sellergren, S. Kazemzadeh, T. Jaroensri, A. Kiraly, M. Tra- verse, T. Kohlberger, S. Xu, F. Jamil, C. Hughes, C. Lauet al., “Medgemma technical report,”arXiv preprint arXiv:2507.05201, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[16] [16]

A survey on in-context learning,

Q. Dong, L. Li, D. Dai, C. Zheng, J. Ma, R. Li, H. Xia, J. Xu, Z. Wu, B. Changet al., “A survey on in-context learning,” inPro- ceedings of the 2024 conference on empirical methods in natural language processing, 2024, pp. 1107–1128

2024

[17] [17]

Generalizing from a few examples: A survey on few-shot learning,

Y . Wang, Q. Yao, J. T. Kwok, and L. M. Ni, “Generalizing from a few examples: A survey on few-shot learning,”ACM computing surveys (csur), vol. 53, no. 3, pp. 1–34, 2020

2020

[18]

Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

[19]

B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, “Clap learning audio concepts from natural language supervision,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

[20]

T.-N. Wang, L.-L. Chen, N. Zeghidour, and A. Saeed, “Careaqa: A cardiac and respiratory audio question answering model for open-ended diagnostic reasoning,” arXiv preprint arXiv:2505.01199, 2025

[21]

T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” Proceedings of Machine learning and systems, vol. 2, pp. 429–450, 2020

[22]

R. S. Antunes, C. André da Costa, A. Küderle, I. A. Yari, and B. Eskofier, “Federated learning for healthcare: Systematic review and architecture proposal,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 13, no. 4, pp. 1–23, 2022

[23]

E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., “Lora: Low-rank adaptation of large language models.” ICLR, vol. 1, no. 2, p. 3, 2022

[24]

D. J. Beutel, T. Topal, A. Mathur, X. Qiu, J. Fernandez-Marques, Y. Gao, L. Sani, K. H. Li, T. Parcollet, P. P. B. de Gusmão et al., “Flower: A friendly federated learning research framework,” arXiv preprint arXiv:2007.14390, 2020

[25]

B. M. Rocha, D. Filos, L. Mendes, G. Serbes, S. Ulukaya, Y. P. Kahya, N. Jakovljevic, T. L. Turukalo, I. M. Vogiatzis, E. Perantoni et al., “An open access database for the evaluation of respiratory sound classification algorithms,” Physiological measurement, vol. 40, no. 3, p. 035001, 2019

[26]

J. Oliveira, F. Renna, P. Costa, M. Nogueira, A. C. Oliveira, A. Elola, C. Ferreira, A. Jorge, A. Bahrami Rad, M. Reyna, R. Sameni, G. Clifford, and M. Coimbra, “The CirCor DigiScope Phonocardiogram Dataset,” PhysioNet, May 2022, version 1.0.3. [Online]. Available: https://doi.org/10.13026/tshs-mw03

[27]

L. Orlandic, T. Teijeiro, and D. Atienza, “The coughvid crowdsourcing dataset, a corpus for the study of large-scale cough analysis algorithms,” Scientific Data, vol. 8, no. 1, p. 156, 2021

[28]

F.-S. Hsu, S.-R. Huang, C.-W. Huang, C.-J. Huang, Y.-R. Cheng, C.-C. Chen, J. Hsiao, C.-W. Chen, L.-C. Chen, Y.-C. Lai et al., “Benchmarking of eight recurrent neural network variants for breath phase and adventitious sound detection on a self-developed open-access lung sound database—hf lung v1,” PLoS One, vol. 16, no. 7, p. e0254134, 2021

[29]

Q. Zhang, J. Zhang, J. Yuan, H. Huang, Y. Zhang, B. Zhang, G. Lv, S. Lin, N. Wang, X. Liu et al., “Sprsound: Open-source sjtu paediatric respiratory sound database,” IEEE Transactions on Biomedical Circuits and Systems, vol. 16, no. 5, pp. 867–881, 2022

[30]

T. Xia, D. Spathis, J. Ch, A. Grammenos, J. Han, A. Hasthanasombat, E. Bondareva, T. Dang, A. Floto, P. Cicuta et al., “Covid-19 sounds: a large-scale audio dataset for digital respiratory screening,” in Thirty-fifth conference on neural information processing systems datasets and benchmarks track (round 2), 2021

[31]

W. Jia, Y. Wang, R. Chen, J. Ye, D. Li, F. Yin, J. Yu, J. Chen, Q. Shu, and W. Xu, “Zchsound: Open-source zju paediatric heart sound database with congenital heart disease,” IEEE Transactions on Biomedical Engineering, vol. 71, no. 8, pp. 2278–2286, 2024

[32]

S. Deshmukh, B. Elizalde, R. Singh, and H. Wang, “Pengi: An audio language model for audio tasks,” Advances in Neural Information Processing Systems, vol. 36, pp. 18 090–18 108, 2023

[33]

S. Ghosh, S. Kumar, A. Seth, C. K. R. Evuru, U. Tyagi, S. Sakshi, O. Nieto, R. Duraiswami, and D. Manocha, “Gama: A large audio-language model with advanced audio understanding and complex reasoning abilities,” arXiv preprint arXiv:2406.11768, 2024

[34]

G. Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, M. Rivière et al., “Gemma 3 technical report,” arXiv preprint arXiv:2503.19786, 2025

[35]

J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang, B. Zhang, X. Wang, Y. Chu, and J. Lin, “Qwen2.5-omni technical report,” arXiv preprint arXiv:2503.20215, 2025

[36]

A. Goel, S. Ghosh, J. Kim, S. Kumar, Z. Kong, S.-g. Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle et al., “Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,” arXiv preprint arXiv:2507.08128, 2025

[37]

C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” in Text summarization branches out, 2004, pp. 74–81

[38]

T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “Bertscore: Evaluating text generation with bert,” arXiv preprint arXiv:1904.09675, 2019

[39]

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017

[40]

B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial intelligence and statistics. PMLR, 2017, pp. 1273–1282

[41]

A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024

[42]

Q. Team, “Qwen2.5: A party of foundation models,” September 2024. [Online]. Available: https://qwenlm.github.io/blog/qwen2.5/