Unlocking In-Context Learning in Audio-Language Models from Decentralized Medical Audio
Pith reviewed 2026-06-26 09:14 UTC · model grok-4.3
The pith
Federated Self-Contextualization lets audio-language models do in-context clinical diagnosis from pseudo-label episodes built by unsupervised clustering.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FSC constructs pseudo-label episodes via unsupervised clustering of audio representations, bypassing scarce real diagnostic labels, and enables contextual reasoning from support-query pairs. Our progressive three-stage pipeline first aligns audio embeddings with the language model via caption-based pretraining, then adapts it for episodic in-context inference through federated optimization. At test time, given a small labeled support set, the model diagnoses an unseen query through multimodal reasoning. On held-out respiratory and cardiac conditions, FSC achieves 71.6% accuracy in 2-way 2-shot evaluation, outperforming audio-language baselines by over 9%.
What carries the argument
Federated Self-Contextualization (FSC), the three-stage pipeline that creates pseudo-label episodes from unsupervised clustering of audio representations and then performs federated adaptation for multimodal in-context inference.
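The pseudo-label episode idea can be illustrated with a minimal sketch. It assumes plain k-means over toy 2-D embeddings and a simple N-way K-shot sampler; the excerpt does not say which clustering algorithm or sampling rule FSC uses, so `kmeans`, `sample_episode`, and the toy data are hypothetical stand-ins rather than the authors' implementation.

```python
import math
import random
from collections import defaultdict

def kmeans(points, k, iters=20):
    """Lloyd's k-means with farthest-first init; returns one cluster id per point."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def sample_episode(labels, n_way, k_shot, rng):
    """Draw an N-way K-shot support set plus one query, using pseudo-labels only."""
    by_cluster = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_cluster[lab].append(idx)
    # A cluster needs K support items plus room for one query.
    eligible = [c for c, idxs in by_cluster.items() if len(idxs) >= k_shot + 1]
    classes = rng.sample(eligible, n_way)
    query_class = rng.choice(classes)
    support, query = [], None
    for c in classes:
        need = k_shot + (1 if c == query_class else 0)
        picked = rng.sample(by_cluster[c], need)
        support += [(i, c) for i in picked[:k_shot]]
        if c == query_class:
            query = (picked[k_shot], c)
    return support, query

# Toy embeddings (hypothetical): two well-separated groups on a line.
embeddings = [(0.1 * i, 0.0) for i in range(6)] + [(10 + 0.1 * i, 0.0) for i in range(6)]
pseudo_labels = kmeans(embeddings, k=2)
support, query = sample_episode(pseudo_labels, n_way=2, k_shot=2, rng=random.Random(0))
print(pseudo_labels, support, query)
```

Each support item is an (index, pseudo-label) pair, so no diagnostic label is consulted at any point. Whether such episodes teach useful in-context behavior depends on how well the clusters track clinical categories.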
Load-bearing premise
That unsupervised clustering of audio representations produces pseudo-label episodes sufficiently reliable for the model to learn effective multimodal in-context reasoning without real diagnostic labels.
What would settle it
A test in which the unsupervised clusters show no better alignment with actual diagnostic categories than random grouping, resulting in accuracy no higher than the audio-language baselines.
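One way such a test could be run is a permutation check on normalized mutual information (NMI). The sketch below is an assumption about how to operationalize "no better than random grouping", not a procedure taken from the paper; the toy labels (`wheeze`, `murmur`) and the helper names are hypothetical.

```python
import math
import random
from collections import Counter

def nmi(a, b):
    """Normalized mutual information (arithmetic-mean normalization) of two labelings."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum((c / n) * math.log((c * n) / (pa[x] * pb[y])) for (x, y), c in pab.items())
    ha = -sum((c / n) * math.log(c / n) for c in pa.values())
    hb = -sum((c / n) * math.log(c / n) for c in pb.values())
    denom = (ha + hb) / 2
    return mi / denom if denom > 0 else 0.0

def permutation_null(clusters, labels, trials=200, seed=0):
    """NMI scores after shuffling cluster ids, i.e. under random grouping."""
    rng = random.Random(seed)
    shuffled = list(clusters)
    scores = []
    for _ in range(trials):
        rng.shuffle(shuffled)
        scores.append(nmi(shuffled, labels))
    return scores

# Hypothetical toy data: 20 recordings, one assigned to the wrong cluster.
labels = ["wheeze"] * 10 + ["murmur"] * 10
clusters = [0] * 9 + [1] * 11
observed = nmi(clusters, labels)
null = permutation_null(clusters, labels)
# Share of random groupings that align at least as well as the real clusters.
p_value = (1 + sum(s >= observed for s in null)) / (1 + len(null))
print(round(observed, 3), p_value)
```

If the observed NMI sat inside the shuffled distribution, the clusters would carry no more diagnostic signal than random grouping, which is the outcome described above.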
Original abstract
Clinical audio diagnosis in low-resource settings requires models that identify conditions from minimal examples without large annotated corpora. We propose Federated Self-Contextualization (FSC), a multimodal language model framework for in-context clinical audio diagnosis across federated hospital clients. FSC constructs pseudo-label episodes via unsupervised clustering of audio representations, bypassing scarce real diagnostic labels, and enables contextual reasoning from support-query pairs. Our progressive three-stage pipeline first aligns audio embeddings with the language model via caption-based pretraining, then adapts it for episodic in-context inference through federated optimization. At test time, given a small labeled support set, the model diagnoses an unseen query through multimodal reasoning. On held-out respiratory and cardiac conditions, FSC achieves 71.6% accuracy in 2-way 2-shot evaluation, outperforming audio-language baselines by over 9%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Federated Self-Contextualization (FSC), a multimodal language model framework for in-context clinical audio diagnosis in federated hospital settings. It constructs pseudo-label episodes via unsupervised clustering of audio representations to bypass scarce diagnostic labels, then uses a three-stage pipeline of caption-based pretraining followed by federated optimization over support-query episodes. At inference, the model performs multimodal reasoning on a small labeled support set to diagnose an unseen query. The central empirical claim is that FSC achieves 71.6% accuracy in 2-way 2-shot evaluation on held-out respiratory and cardiac conditions, outperforming audio-language baselines by over 9%.
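For readers unfamiliar with the federated step, a minimal sketch of one aggregation round follows. It assumes sample-size-weighted averaging in the style of FedAvg; the summary does not state which aggregation rule FSC applies, and `fedavg`, the `adapter` parameter name, and the client sizes are hypothetical.

```python
def fedavg(client_updates):
    """Average parameters across clients, weighted by local sample count.

    client_updates: list of (num_samples, {param_name: [floats]}).
    """
    total = sum(n for n, _ in client_updates)
    first_params = client_updates[0][1]
    return {
        name: [
            sum(n * params[name][i] for n, params in client_updates) / total
            for i in range(len(values))
        ]
        for name, values in first_params.items()
    }

# Two hypothetical hospital clients holding 100 and 300 recordings.
clients = [
    (100, {"adapter": [1.0, 0.0]}),
    (300, {"adapter": [0.0, 1.0]}),
]
global_params = fedavg(clients)
print(global_params)  # {'adapter': [0.25, 0.75]}
```

Only parameter values leave each client in this scheme; the audio itself stays local, which is the privacy property the manuscript relies on.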
Significance. If the result holds after proper validation, the work would be significant for enabling in-context learning in medical audio analysis under privacy and annotation constraints typical of decentralized clinical environments. The combination of unsupervised pseudo-labeling with federated episodic training addresses a practical gap in low-resource multimodal medical AI. No machine-checked proofs or parameter-free derivations are present, but the empirical framing on held-out conditions offers a falsifiable prediction that could be tested with additional controls.
Major comments (1)
- [Abstract] The central performance claim (71.6% 2-way 2-shot accuracy and >9% improvement) rests on the unsupervised clustering step producing pseudo-label episodes that align with clinically relevant categories. No quantitative validation of this step (cluster purity, NMI against any labels, or alignment metrics) is supplied, so it is impossible to determine whether the reported gains arise from the episodic pseudo-label mechanism rather than from caption pretraining alone or from evaluation artifacts.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and for highlighting the need to validate the unsupervised clustering component. We address this concern directly below and will incorporate the requested analysis in the revision.
- Referee: [Abstract] The central performance claim (71.6% 2-way 2-shot accuracy and >9% improvement) rests on the unsupervised clustering step producing pseudo-label episodes that align with clinically relevant categories. No quantitative validation of this step (cluster purity, NMI against any labels, or alignment metrics) is supplied, so it is impossible to determine whether the reported gains arise from the episodic pseudo-label mechanism rather than from caption pretraining alone or from evaluation artifacts.
Authors: We agree that the manuscript would benefit from explicit quantitative validation of the clustering step. In the revised version we will add a dedicated analysis section reporting silhouette scores on the learned audio representations together with normalized mutual information (NMI) computed against a small held-out subset of diagnostic labels reserved solely for this validation (not used in training or evaluation). We will also include an ablation that isolates the contribution of the episodic pseudo-label stage by comparing the full FSC pipeline against a caption-pretraining-only baseline. These additions will allow readers to assess whether the reported gains are attributable to the pseudo-label episodes.
Revision: yes
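The silhouette analysis the authors promise could look roughly like the sketch below. It computes the mean silhouette coefficient with Euclidean distance on toy points; the distance metric, the embedding dimensionality, and the function name are assumptions, since the rebuttal gives no implementation details.

```python
import math

def mean_dist(p, others):
    """Average Euclidean distance from p to a non-empty list of points."""
    return sum(math.dist(p, q) for q in others) / len(others)

def silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two distinct clusters."""
    clusters = set(labels)
    scores = []
    for i, (p, li) in enumerate(zip(points, labels)):
        same = [q for j, (q, lj) in enumerate(zip(points, labels)) if lj == li and j != i]
        if not same:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = mean_dist(p, same)  # cohesion: distance to own cluster
        b = min(                # separation: distance to nearest other cluster
            mean_dist(p, [q for q, lj in zip(points, labels) if lj == other])
            for other in clusters - {li}
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy embeddings: two tight groups far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette(points, [0, 0, 1, 1])  # clusters match the geometry
bad = silhouette(points, [0, 1, 0, 1])   # clusters cut across the geometry
print(round(good, 3), round(bad, 3))
```

A high silhouette score only shows that the clusters are geometrically compact. It says nothing about clinical meaning, which is why the NMI check against held-out labels is the more informative of the two promised analyses.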
Circularity Check
No circularity: empirical accuracy from held-out evaluation
The paper presents a three-stage pipeline (caption pretraining, federated episodic adaptation via unsupervised clustering for pseudo-labels, then test-time in-context inference) whose central claim is an empirical accuracy number (71.6% 2-way 2-shot on held-out conditions) measured against external baselines. No equations, derivations, or fitted parameters are redefined as predictions; the unsupervised clustering step is an input mechanism whose quality is not asserted by construction but left as an empirical assumption. No self-citation chains or uniqueness theorems are invoked to force the result. The reported performance is therefore an independent measurement rather than a quantity that reduces to its own inputs.
Reference graph
Works this paper leans on
- [1] Unlocking In-Context Learning in Audio-Language Models from Decentralized Medical Audio, arXiv, 2026. Introduction. Clinical audio diagnosis is fundamentally an act of grounded reasoning: a clinician interprets what they hear (a wheeze, a murmur, an abnormal cough) in light of what they know about the clinical conditions those sounds signify [1, 2]. This process is open-ended, contextual, and knowledge-intensive. Yet the dominant machine learning paradigm for...
- [2] Mountain Breeze, Methodology. 2.1. Problem Formulation. We consider clinical audio diagnosis where recordings are distributed across $M$ institutions without centralized access and no expert labels are available during training. Each diagnostic episode $E = \{S, x_q\}$ consists of a support set $S = \{(x_i, y_i)\}_{i=1}^{NK}$ with $K$ audio-label pairs for each of $N$ diagnostic concepts, and a quer...
- [3] Experiments. 3.1. Datasets. We evaluate on seven medical audio datasets spanning respiratory and cardiac domains (Table 2). In the federated setup, each dataset is assigned to a separate client, yielding $M = 7$ institutions with naturally heterogeneous recording conditions, label spaces, and sample sizes. All audio is resampled to 16 kHz and segmented into f...
- [4] Results and Analysis. We evaluate FSC under the label-unseen episodic protocol described in Section 3.2, where no diagnostic labels are seen during training and classification relies entirely on in-context conditioning at inference. To mitigate linguistic bias from disease phrasing, each condition is represented by three physician-validated textua...
- [5] Conclusions. This paper introduced FSC, a federated framework for few-shot clinical audio diagnosis that operates without centralized data sharing or real diagnostic labels. FSC pairs pseudo-label episode construction with a three-stage training pipeline to equip a multimodal language model with in-context diagnostic reasoning from limited support example...
- [6] Generative AI Use Disclosure. Generative AI tools were used solely for language editing and polishing to improve clarity and readability of the manuscript. All technical content, experimental design, analysis, and conclusions were created by the authors. The authors take full responsibility for the content of this paper.
- [7] Acknowledgments. This work was supported by the NWO AiNed Fellowship Grant of A.S., and in part by Google.org and the Google Cloud Research Credits program through the Gemini Academic Program. We also acknowledge the use of the Dutch National Supercomputer Snellius for essential computational tasks.
- [8] Á. Troncoso, J. A. Ortega, R. Seepold, and N. M. Madrid, “Non-invasive devices for respiratory sound monitoring,” Procedia computer science, vol. 192, pp. 3040–3048, 2021.
- [9] C. Potes, S. Parvaneh, A. Rahman, and B. Conroy, “Ensemble of feature-based and deep learning-based classifiers for detection of abnormal heart sounds,” in 2016 computing in cardiology conference (CinC). IEEE, 2016, pp. 621–624.
- [10] A. Imran, I. Posokhova, H. N. Qureshi, U. Masood, M. S. Riaz, K. Ali, C. N. John, M. I. Hussain, and M. Nabeel, “Ai4covid-19: Ai enabled preliminary diagnosis for covid-19 from cough samples via an app,” Informatics in medicine unlocked, vol. 20, p. 100378, 2020.
- [11] W. Chen, Q. Sun, X. Chen, G. Xie, H. Wu, and C. Xu, “Deep learning methods for heart sounds classification: A systematic review,” Entropy, vol. 23, no. 6, p. 667, 2021.
- [12] S. Baur, Z. Nabulsi, W.-H. Weng, J. Garrison, L. Blankemeier, S. Fishman, C. Chen, S. Kakarmath, M. Maimbolwa, N. Sanjase et al., “Hear–health acoustic representations,” arXiv preprint arXiv:2403.02522, 2024.
- [13] J. Xu, B. S. Glicksberg, C. Su, P. Walker, J. Bian, and F. Wang, “Federated learning for healthcare informatics,” Journal of healthcare informatics research, vol. 5, no. 1, pp. 1–19, 2021.
- [14] N. Rieke, J. Hancox, W. Li, F. Milletari, H. R. Roth, S. Albarqouni, S. Bakas, M. N. Galtier, B. A. Landman, K. Maier-Hein et al., “The future of digital health with federated learning,” NPJ digital medicine, vol. 3, no. 1, p. 119, 2020.
- [15] A. Sellergren, S. Kazemzadeh, T. Jaroensri, A. Kiraly, M. Traverse, T. Kohlberger, S. Xu, F. Jamil, C. Hughes, C. Lau et al., “Medgemma technical report,” arXiv preprint arXiv:2507.05201, 2025.
- [16] Q. Dong, L. Li, D. Dai, C. Zheng, J. Ma, R. Li, H. Xia, J. Xu, Z. Wu, B. Chang et al., “A survey on in-context learning,” in Proceedings of the 2024 conference on empirical methods in natural language processing, 2024, pp. 1107–1128.
- [17] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, “Generalizing from a few examples: A survey on few-shot learning,” ACM computing surveys (csur), vol. 53, no. 3, pp. 1–34, 2020.
- [18] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- [19] B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, “Clap learning audio concepts from natural language supervision,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- [20] T.-N. Wang, L.-L. Chen, N. Zeghidour, and A. Saeed, “Careaqa: A cardiac and respiratory audio question answering model for open-ended diagnostic reasoning,” arXiv preprint arXiv:2505.01199, 2025.
- [21] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” Proceedings of Machine learning and systems, vol. 2, pp. 429–450, 2020.
- [22] R. S. Antunes, C. André da Costa, A. Küderle, I. A. Yari, and B. Eskofier, “Federated learning for healthcare: Systematic review and architecture proposal,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 13, no. 4, pp. 1–23, 2022.
- [23] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., “Lora: Low-rank adaptation of large language models.” ICLR, vol. 1, no. 2, p. 3, 2022.
- [24] D. J. Beutel, T. Topal, A. Mathur, X. Qiu, J. Fernandez-Marques, Y. Gao, L. Sani, K. H. Li, T. Parcollet, P. P. B. de Gusmão et al., “Flower: A friendly federated learning research framework,” arXiv preprint arXiv:2007.14390, 2020.
- [25] B. M. Rocha, D. Filos, L. Mendes, G. Serbes, S. Ulukaya, Y. P. Kahya, N. Jakovljevic, T. L. Turukalo, I. M. Vogiatzis, E. Perantoni et al., “An open access database for the evaluation of respiratory sound classification algorithms,” Physiological measurement, vol. 40, no. 3, p. 035001, 2019.
- [26] J. Oliveira, F. Renna, P. Costa, M. Nogueira, A. C. Oliveira, A. Elola, C. Ferreira, A. Jorge, A. Bahrami Rad, M. Reyna, R. Sameni, G. Clifford, and M. Coimbra, “The CirCor DigiScope Phonocardiogram Dataset,” PhysioNet, May 2022, version 1.0.3. [Online]. Available: https://doi.org/10.13026/tshs-mw03
- [27] L. Orlandic, T. Teijeiro, and D. Atienza, “The coughvid crowdsourcing dataset, a corpus for the study of large-scale cough analysis algorithms,” Scientific Data, vol. 8, no. 1, p. 156, 2021.
- [28] F.-S. Hsu, S.-R. Huang, C.-W. Huang, C.-J. Huang, Y.-R. Cheng, C.-C. Chen, J. Hsiao, C.-W. Chen, L.-C. Chen, Y.-C. Lai et al., “Benchmarking of eight recurrent neural network variants for breath phase and adventitious sound detection on a self-developed open-access lung sound database: hf lung v1,” PLoS One, vol. 16, no. 7, p. e0254134, 2021.
- [29] Q. Zhang, J. Zhang, J. Yuan, H. Huang, Y. Zhang, B. Zhang, G. Lv, S. Lin, N. Wang, X. Liu et al., “Sprsound: Open-source sjtu paediatric respiratory sound database,” IEEE Transactions on Biomedical Circuits and Systems, vol. 16, no. 5, pp. 867–881, 2022.
- [30] T. Xia, D. Spathis, J. Ch, A. Grammenos, J. Han, A. Hasthanasombat, E. Bondareva, T. Dang, A. Floto, P. Cicuta et al., “Covid-19 sounds: a large-scale audio dataset for digital respiratory screening,” in Thirty-fifth conference on neural information processing systems datasets and benchmarks track (round 2), 2021.
- [31] W. Jia, Y. Wang, R. Chen, J. Ye, D. Li, F. Yin, J. Yu, J. Chen, Q. Shu, and W. Xu, “Zchsound: Open-source zju paediatric heart sound database with congenital heart disease,” IEEE Transactions on Biomedical Engineering, vol. 71, no. 8, pp. 2278–2286, 2024.
- [32] S. Deshmukh, B. Elizalde, R. Singh, and H. Wang, “Pengi: An audio language model for audio tasks,” Advances in Neural Information Processing Systems, vol. 36, pp. 18 090–18 108, 2023.
- [33] S. Ghosh, S. Kumar, A. Seth, C. K. R. Evuru, U. Tyagi, S. Sakshi, O. Nieto, R. Duraiswami, and D. Manocha, “Gama: A large audio-language model with advanced audio understanding and complex reasoning abilities,” arXiv preprint arXiv:2406.11768, 2024.
- [34] G. Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, M. Rivière et al., “Gemma 3 technical report,” arXiv preprint arXiv:2503.19786, 2025.
- [35] J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang, B. Zhang, X. Wang, Y. Chu, and J. Lin, “Qwen2.5-omni technical report,” arXiv preprint arXiv:2503.20215, 2025.
- [36] A. Goel, S. Ghosh, J. Kim, S. Kumar, Z. Kong, S.-g. Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle et al., “Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,” arXiv preprint arXiv:2507.08128, 2025.
- [37] C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” in Text summarization branches out, 2004, pp. 74–81.
- [38] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “Bertscore: Evaluating text generation with bert,” arXiv preprint arXiv:1904.09675, 2019.
- [39] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
- [40] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial intelligence and statistics. PMLR, 2017, pp. 1273–1282.
- [41] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.
- [42] Q. Team, “Qwen2.5: A party of foundation models,” September. [Online]. Available: https://qwenlm.github.io/blog/qwen2.5/