
arxiv: 2606.23243 · v1 · pith:T7ZBH3MDnew · submitted 2026-06-22 · 💻 cs.LG · cs.SD

Unlocking In-Context Learning in Audio-Language Models from Decentralized Medical Audio

Pith reviewed 2026-06-26 09:14 UTC · model grok-4.3

classification 💻 cs.LG cs.SD
keywords in-context learning · audio-language models · federated learning · medical audio diagnosis · pseudo-labeling · unsupervised clustering · respiratory audio · cardiac audio

The pith

Federated Self-Contextualization lets audio-language models do in-context clinical diagnosis from pseudo-label episodes built by unsupervised clustering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that audio-language models can acquire in-context diagnostic reasoning for clinical audio when real labels are scarce by first aligning embeddings through caption pretraining and then using federated optimization on episodes whose labels come from clustering the audio representations themselves. A sympathetic reader would care because many medical audio tasks lack large annotated corpora yet still require models that generalize from only a handful of examples at test time. The approach operates across decentralized hospital clients and reaches 71.6 percent accuracy in 2-way 2-shot evaluation on held-out respiratory and cardiac conditions. It does so by treating the clustered pseudo-labels as the support set for multimodal reasoning over a query audio sample.

Core claim

FSC constructs pseudo-label episodes via unsupervised clustering of audio representations, bypassing scarce real diagnostic labels, and enables contextual reasoning from support-query pairs. Our progressive three-stage pipeline first aligns audio embeddings with the language model via caption-based pretraining, then adapts it for episodic in-context inference through federated optimization. At test time, given a small labeled support set, the model diagnoses an unseen query through multimodal reasoning. On held-out respiratory and cardiac conditions, FSC achieves 71.6% accuracy in 2-way 2-shot evaluation, outperforming audio-language baselines by over 9%.
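To make the episode construction concrete, the following is a minimal sketch of how one N-way K-shot episode could be drawn once every audio clip has a cluster assignment. It is not the authors' implementation: the function name `sample_episode`, the flat list of cluster ids, and the use of episode-local integer labels are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_episode(cluster_ids, n_way=2, k_shot=2, rng=None):
    """Sample one N-way K-shot episode from cluster pseudo-labels.

    cluster_ids[i] is the (hypothetical) cluster assigned to clip i.
    Returns (support, query): support is a list of (clip_index, label)
    pairs and query is a single (clip_index, label) pair. Labels are
    episode-local integers 0..n_way-1, not diagnostic categories.
    """
    rng = rng or random.Random()
    by_cluster = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        by_cluster[cid].append(idx)

    # Keep only clusters that can supply K support clips plus one query.
    eligible = sorted(c for c, m in by_cluster.items() if len(m) >= k_shot + 1)
    if len(eligible) < n_way:
        raise ValueError("not enough sufficiently large clusters")

    chosen = rng.sample(eligible, n_way)
    query_cluster = rng.choice(chosen)
    support, query = [], None
    for label, cid in enumerate(chosen):
        # The query cluster contributes one extra clip, held out as the query.
        need = k_shot + (1 if cid == query_cluster else 0)
        picked = rng.sample(by_cluster[cid], need)
        support.extend((i, label) for i in picked[:k_shot])
        if cid == query_cluster:
            query = (picked[k_shot], label)
    rng.shuffle(support)
    return support, query
```

With `n_way=2, k_shot=2` this yields four support clips and one query, matching the shape of the 2-way 2-shot setting reported above; how the clusters themselves are obtained is left outside this sketch.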

What carries the argument

Federated Self-Contextualization (FSC), the three-stage pipeline that creates pseudo-label episodes from unsupervised clustering of audio representations and then performs federated adaptation for multimodal in-context inference.
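The text above does not say which aggregation rule the federated adaptation stage uses. As an illustration only, the sketch below shows a sample-weighted parameter average in the style of federated averaging; the function name `weighted_average` and the dictionary-of-lists parameter format are assumptions, and the actual FSC update may differ.

```python
def weighted_average(client_updates):
    """Aggregate client parameters, weighting each client by its sample count.

    client_updates: list of (num_samples, params), where params maps a
    parameter name to a flat list of floats. All clients are assumed to
    share the same parameter names and lengths.
    """
    total = sum(n for n, _ in client_updates)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    _, first = client_updates[0]
    merged = {name: [0.0] * len(values) for name, values in first.items()}
    for n, params in client_updates:
        weight = n / total  # clients with more data contribute more
        for name, values in params.items():
            for j, v in enumerate(values):
                merged[name][j] += weight * v
    return merged
```

Only parameters are exchanged in this scheme; the audio clips and their pseudo-labels stay on each client, which is the property the decentralized setting relies on.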

Load-bearing premise

That unsupervised clustering of audio representations produces pseudo-label episodes sufficiently reliable for the model to learn effective multimodal in-context reasoning without real diagnostic labels.

What would settle it

A test in which the unsupervised clusters show no better alignment with actual diagnostic categories than random grouping, resulting in accuracy no higher than the audio-language baselines.
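One way such a test could be run, assuming a small set of clips with real diagnostic labels is available for auditing, is to compare cluster purity against groupings with the same cluster sizes but shuffled membership. The sketch below is hypothetical: `purity`, `purity_vs_random`, and the shuffle count are illustrative choices, not part of the paper.

```python
import random
from collections import Counter, defaultdict


def purity(cluster_ids, true_labels):
    """Fraction of clips whose label equals the majority label of their cluster."""
    members = defaultdict(list)
    for cid, lab in zip(cluster_ids, true_labels):
        members[cid].append(lab)
    hits = sum(Counter(labs).most_common(1)[0][1] for labs in members.values())
    return hits / len(true_labels)


def purity_vs_random(cluster_ids, true_labels, n_shuffles=1000, seed=0):
    """Return (observed purity, share of shuffled groupings at least as pure)."""
    rng = random.Random(seed)
    observed = purity(cluster_ids, true_labels)
    shuffled = list(cluster_ids)
    at_least = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # keeps cluster sizes, breaks any link to labels
        if purity(shuffled, true_labels) >= observed:
            at_least += 1
    return observed, at_least / n_shuffles
```

If the second returned value is large, the clusters align with diagnoses no better than random grouping, which is the outcome that would undermine the load-bearing premise.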

Figures

Figures reproduced from arXiv: 2606.23243 by Aaqib Saeed, Hareld Kemps, Linda Moonen, Martijn den Dekker, Ran Piao, Tsai-Ning Wang, Yuan Lu.

Figure 1: Overview of Federated Self-Contextualization (FSC). (A) Federated setting with client-side clustering for local pseudo-label generation. (B) Progressive Local ICL Training Pipeline. (C) In-context clinical inference using real-world few-shot support examples.
Original abstract

Clinical audio diagnosis in low-resource settings requires models that identify conditions from minimal examples without large annotated corpora. We propose Federated Self-Contextualization (FSC), a multimodal language model framework for in-context clinical audio diagnosis across federated hospital clients. FSC constructs pseudo-label episodes via unsupervised clustering of audio representations, bypassing scarce real diagnostic labels, and enables contextual reasoning from support-query pairs. Our progressive three-stage pipeline first aligns audio embeddings with the language model via caption-based pretraining, then adapts it for episodic in-context inference through federated optimization. At test time, given a small labeled support set, the model diagnoses an unseen query through multimodal reasoning. On held-out respiratory and cardiac conditions, FSC achieves 71.6% accuracy in 2-way 2-shot evaluation, outperforming audio-language baselines by over 9%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes Federated Self-Contextualization (FSC), a multimodal language model framework for in-context clinical audio diagnosis in federated hospital settings. It constructs pseudo-label episodes via unsupervised clustering of audio representations to bypass scarce diagnostic labels, then uses a three-stage pipeline of caption-based pretraining followed by federated optimization over support-query episodes. At inference, the model performs multimodal reasoning on a small labeled support set to diagnose an unseen query. The central empirical claim is that FSC achieves 71.6% accuracy in 2-way 2-shot evaluation on held-out respiratory and cardiac conditions, outperforming audio-language baselines by over 9%.

Significance. If the result holds after proper validation, the work would be significant for enabling in-context learning in medical audio analysis under privacy and annotation constraints typical of decentralized clinical environments. The combination of unsupervised pseudo-labeling with federated episodic training addresses a practical gap in low-resource multimodal medical AI. No machine-checked proofs or parameter-free derivations are present, but the empirical framing on held-out conditions offers a falsifiable prediction that could be tested with additional controls.

major comments (1)
  1. [Abstract] The central performance claim (71.6% 2-way 2-shot accuracy and >9% improvement) rests on the unsupervised clustering step producing pseudo-label episodes that align with clinically relevant categories. No quantitative validation of this step (cluster purity, NMI against any labels, or alignment metrics) is supplied, so it is impossible to determine whether the reported gains arise from the episodic pseudo-label mechanism rather than caption pretraining alone or evaluation artifacts.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and for highlighting the need to validate the unsupervised clustering component. We address this concern directly below and will incorporate the requested analysis in the revision.

Point-by-point responses
  1. Referee: [Abstract] The central performance claim (71.6% 2-way 2-shot accuracy and >9% improvement) rests on the unsupervised clustering step producing pseudo-label episodes that align with clinically relevant categories. No quantitative validation of this step (cluster purity, NMI against any labels, or alignment metrics) is supplied, so it is impossible to determine whether the reported gains arise from the episodic pseudo-label mechanism rather than caption pretraining alone or evaluation artifacts.

    Authors: We agree that the manuscript would benefit from explicit quantitative validation of the clustering step. In the revised version we will add a dedicated analysis section reporting silhouette scores on the learned audio representations together with normalized mutual information (NMI) computed against a small held-out subset of diagnostic labels reserved solely for this validation (not used in training or evaluation). We will also include an ablation that isolates the contribution of the episodic pseudo-label stage by comparing the full FSC pipeline against a caption-pretraining-only baseline. These additions will allow readers to assess whether the reported gains are attributable to the pseudo-label episodes.

    Revision: yes
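For readers unfamiliar with the NMI figure promised in the rebuttal, here is a minimal standard-library sketch of how it could be computed between cluster assignments and a held-out label subset. The normalization by the arithmetic mean of the two entropies is an assumption (other conventions divide by the geometric mean, minimum, or maximum), and the function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_info(a, b):
    """NMI between two labelings, normalized by the mean of their entropies."""
    if len(a) != len(b) or not a:
        raise ValueError("labelings must be non-empty and of equal length")
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    # Mutual information: sum over observed (cluster, label) pairs.
    mi = sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )
    denom = (entropy(a) + entropy(b)) / 2
    # Convention chosen here: two constant labelings count as identical.
    return mi / denom if denom > 0 else 1.0
```

A value near 1 means the clusters recover the diagnostic categories up to renaming; a value near 0 means the two labelings are close to independent.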

Circularity Check

0 steps flagged

No circularity: empirical accuracy from held-out evaluation

full rationale

The paper presents a three-stage pipeline (caption pretraining, federated episodic adaptation via unsupervised clustering for pseudo-labels, then test-time in-context inference) whose central claim is an empirical accuracy number (71.6% 2-way 2-shot on held-out conditions) measured against external baselines. No equations, derivations, or fitted parameters are redefined as predictions; the unsupervised clustering step is an input mechanism whose quality is not asserted by construction but left as an empirical assumption. No self-citation chains or uniqueness theorems are invoked to force the result. The reported performance is therefore an independent measurement rather than a quantity that reduces to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities; full manuscript required for audit.



Reference graph

Works this paper leans on

43 extracted references · 13 canonical work pages · 9 internal anchors

  1. [1]

    Unlocking In-Context Learning in Audio-Language Models from Decentralized Medical Audio

    Introduction Clinical audio diagnosis is fundamentally an act of grounded reasoning: a clinician interprets what they hear (a wheeze, a murmur, an abnormal cough) in light of what they know about the clinical conditions those sounds signify [1, 2]. This process is open-ended, contextual, and knowledge-intensive. Yet the dominant machine learning paradigm for...

  2. [2]

    Mountain Breeze,

    Methodology 2.1. Problem Formulation We consider clinical audio diagnosis where recordings are distributed across $M$ institutions without centralized access and no expert labels are available during training. Each diagnostic episode $E = \{S, x_q\}$ consists of a support set $S = \{(x_i, y_i)\}_{i=1}^{NK}$ with $K$ audio-label pairs for each of $N$ diagnostic concepts, and a quer...

  3. [3]

    Datasets We evaluate on seven medical audio datasets spanning respiratory and cardiac domains (Table 2)

    Experiments 3.1. Datasets We evaluate on seven medical audio datasets spanning respiratory and cardiac domains (Table 2). In the federated setup, each dataset is assigned to a separate client, yielding $M = 7$ institutions with naturally heterogeneous recording conditions, label spaces, and sample sizes. All audio is resampled to 16 kHz and segmented into f...

  4. [4]

    Results and Analysis We evaluate FSC under the label-unseen episodic protocol described in Section 3.2, where no diagnostic labels are seen during training and classification relies entirely on in-context conditioning at inference. To mitigate linguistic bias from disease phrasing, each condition is represented by three physician-validated textua...

  5. [5]

    Conclusions This paper introduced FSC, a federated framework for few-shot clinical audio diagnosis that operates without centralized data sharing or real diagnostic labels. FSC pairs pseudo-label episode construction with a three-stage training pipeline to equip a multimodal language model with in-context diagnostic reasoning from limited support example...

  6. [6]

    All technical content, experimental design, analysis, and conclusions were created by the authors

    Generative AI Use Disclosure Generative AI tools were used solely for language editing and polishing to improve clarity and readability of the manuscript. All technical content, experimental design, analysis, and conclusions were created by the authors. The authors take full responsibility for the content of this paper

  7. [7]

    We also acknowledge the use of the Dutch National Supercomputer Snellius for essential computational tasks

    Acknowledgments This work was supported by the NWO AiNed Fellowship Grant of A.S., and in part by Google.org and the Google Cloud Research Credits program through the Gemini Academic Program. We also acknowledge the use of the Dutch National Supercomputer Snellius for essential computational tasks

  8. [8]

    Non-invasive devices for respiratory sound monitoring,

    Á. Troncoso, J. A. Ortega, R. Seepold, and N. M. Madrid, “Non-invasive devices for respiratory sound monitoring,” Procedia computer science, vol. 192, pp. 3040–3048, 2021

  9. [9]

    Ensemble of feature-based and deep learning-based classifiers for detection of abnormal heart sounds,

    C. Potes, S. Parvaneh, A. Rahman, and B. Conroy, “Ensemble of feature-based and deep learning-based classifiers for detection of abnormal heart sounds,” in 2016 computing in cardiology conference (CinC). IEEE, 2016, pp. 621–624

  10. [10]

    Ai4covid-19: Ai enabled preliminary diagnosis for covid-19 from cough samples via an app,

    A. Imran, I. Posokhova, H. N. Qureshi, U. Masood, M. S. Riaz, K. Ali, C. N. John, M. I. Hussain, and M. Nabeel, “Ai4covid-19: Ai enabled preliminary diagnosis for covid-19 from cough samples via an app,” Informatics in medicine unlocked, vol. 20, p. 100378, 2020

  11. [11]

    Deep learning methods for heart sounds classification: A systematic review,

    W. Chen, Q. Sun, X. Chen, G. Xie, H. Wu, and C. Xu, “Deep learning methods for heart sounds classification: A systematic review,” Entropy, vol. 23, no. 6, p. 667, 2021

  12. [12]

    Hear–health acoustic representations,

    S. Baur, Z. Nabulsi, W.-H. Weng, J. Garrison, L. Blankemeier, S. Fishman, C. Chen, S. Kakarmath, M. Maimbolwa, N. Sanjase et al., “Hear–health acoustic representations,” arXiv preprint arXiv:2403.02522, 2024

  13. [13]

    Federated learning for healthcare informatics,

    J. Xu, B. S. Glicksberg, C. Su, P. Walker, J. Bian, and F. Wang, “Federated learning for healthcare informatics,” Journal of healthcare informatics research, vol. 5, no. 1, pp. 1–19, 2021

  14. [14]

    The future of digital health with federated learning,

    N. Rieke, J. Hancox, W. Li, F. Milletari, H. R. Roth, S. Albarqouni, S. Bakas, M. N. Galtier, B. A. Landman, K. Maier-Hein et al., “The future of digital health with federated learning,” NPJ digital medicine, vol. 3, no. 1, p. 119, 2020

  15. [15]

    MedGemma Technical Report

    A. Sellergren, S. Kazemzadeh, T. Jaroensri, A. Kiraly, M. Traverse, T. Kohlberger, S. Xu, F. Jamil, C. Hughes, C. Lau et al., “Medgemma technical report,” arXiv preprint arXiv:2507.05201, 2025

  16. [16]

    A survey on in-context learning,

    Q. Dong, L. Li, D. Dai, C. Zheng, J. Ma, R. Li, H. Xia, J. Xu, Z. Wu, B. Chang et al., “A survey on in-context learning,” in Proceedings of the 2024 conference on empirical methods in natural language processing, 2024, pp. 1107–1128

  17. [17]

    Generalizing from a few examples: A survey on few-shot learning,

    Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, “Generalizing from a few examples: A survey on few-shot learning,” ACM computing surveys (csur), vol. 53, no. 3, pp. 1–34, 2020

  18. [18]

    Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,

    Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

  19. [19]

    Clap learning audio concepts from natural language supervision,

    B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, “Clap learning audio concepts from natural language supervision,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

  20. [20]

    Careaqa: A cardiac and respiratory audio question answering model for open-ended diagnostic reasoning,

    T.-N. Wang, L.-L. Chen, N. Zeghidour, and A. Saeed, “Careaqa: A cardiac and respiratory audio question answering model for open-ended diagnostic reasoning,” arXiv preprint arXiv:2505.01199, 2025

  21. [21]

    Federated optimization in heterogeneous networks,

    T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” Proceedings of Machine learning and systems, vol. 2, pp. 429–450, 2020

  22. [22]

    Federated learning for healthcare: Systematic review and architecture proposal,

    R. S. Antunes, C. André da Costa, A. Küderle, I. A. Yari, and B. Eskofier, “Federated learning for healthcare: Systematic review and architecture proposal,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 13, no. 4, pp. 1–23, 2022

  23. [23]

    Lora: Low-rank adaptation of large language models

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., “Lora: Low-rank adaptation of large language models.” ICLR, vol. 1, no. 2, p. 3, 2022

  24. [24]

    Flower: A Friendly Federated Learning Research Framework

    D. J. Beutel, T. Topal, A. Mathur, X. Qiu, J. Fernandez-Marques, Y. Gao, L. Sani, K. H. Li, T. Parcollet, P. P. B. de Gusmão et al., “Flower: A friendly federated learning research framework,” arXiv preprint arXiv:2007.14390, 2020

  25. [25]

    An open access database for the evaluation of respiratory sound classification algorithms,

    B. M. Rocha, D. Filos, L. Mendes, G. Serbes, S. Ulukaya, Y. P. Kahya, N. Jakovljevic, T. L. Turukalo, I. M. Vogiatzis, E. Perantoni et al., “An open access database for the evaluation of respiratory sound classification algorithms,” Physiological measurement, vol. 40, no. 3, p. 035001, 2019

  26. [26]

    The CirCor DigiScope Phonocardiogram Dataset,

    J. Oliveira, F. Renna, P. Costa, M. Nogueira, A. C. Oliveira, A. Elola, C. Ferreira, A. Jorge, A. Bahrami Rad, M. Reyna, R. Sameni, G. Clifford, and M. Coimbra, “The CirCor DigiScope Phonocardiogram Dataset,” PhysioNet, May 2022, version 1.0.3. [Online]. Available: https://doi.org/10.13026/tshs-mw03

  27. [27]

    The coughvid crowdsourcing dataset, a corpus for the study of large-scale cough analysis algorithms,

    L. Orlandic, T. Teijeiro, and D. Atienza, “The coughvid crowdsourcing dataset, a corpus for the study of large-scale cough analysis algorithms,” Scientific Data, vol. 8, no. 1, p. 156, 2021

  28. [28]

    Benchmarking of eight recurrent neural network variants for breath phase and adventitious sound detection on a self-developed open-access lung sound database—hf lung v1,

    F.-S. Hsu, S.-R. Huang, C.-W. Huang, C.-J. Huang, Y.-R. Cheng, C.-C. Chen, J. Hsiao, C.-W. Chen, L.-C. Chen, Y.-C. Lai et al., “Benchmarking of eight recurrent neural network variants for breath phase and adventitious sound detection on a self-developed open-access lung sound database—hf lung v1,” PLoS One, vol. 16, no. 7, p. e0254134, 2021

  29. [29]

    Sprsound: Open-source sjtu paediatric respiratory sound database,

    Q. Zhang, J. Zhang, J. Yuan, H. Huang, Y. Zhang, B. Zhang, G. Lv, S. Lin, N. Wang, X. Liu et al., “Sprsound: Open-source sjtu paediatric respiratory sound database,” IEEE Transactions on Biomedical Circuits and Systems, vol. 16, no. 5, pp. 867–881, 2022

  30. [30]

    Covid-19 sounds: a large-scale audio dataset for digital respiratory screening,

    T. Xia, D. Spathis, J. Ch, A. Grammenos, J. Han, A. Hasthanasombat, E. Bondareva, T. Dang, A. Floto, P. Cicuta et al., “Covid-19 sounds: a large-scale audio dataset for digital respiratory screening,” in Thirty-fifth conference on neural information processing systems datasets and benchmarks track (round 2), 2021

  31. [31]

    Zchsound: Open-source zju paediatric heart sound database with congenital heart disease,

    W. Jia, Y. Wang, R. Chen, J. Ye, D. Li, F. Yin, J. Yu, J. Chen, Q. Shu, and W. Xu, “Zchsound: Open-source zju paediatric heart sound database with congenital heart disease,” IEEE Transactions on Biomedical Engineering, vol. 71, no. 8, pp. 2278–2286, 2024

  32. [32]

    Pengi: An audio language model for audio tasks,

    S. Deshmukh, B. Elizalde, R. Singh, and H. Wang, “Pengi: An audio language model for audio tasks,” Advances in Neural Information Processing Systems, vol. 36, pp. 18090–18108, 2023

  33. [33]

    Gama: A large audio-language model with advanced audio understanding and complex reasoning abilities,

    S. Ghosh, S. Kumar, A. Seth, C. K. R. Evuru, U. Tyagi, S. Sakshi, O. Nieto, R. Duraiswami, and D. Manocha, “Gama: A large audio-language model with advanced audio understanding and complex reasoning abilities,” arXiv preprint arXiv:2406.11768, 2024

  34. [34]

    Gemma 3 Technical Report

    G. Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, M. Rivière et al., “Gemma 3 technical report,” arXiv preprint arXiv:2503.19786, 2025

  35. [35]

    Qwen2.5-Omni Technical Report

    J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang, B. Zhang, X. Wang, Y. Chu, and J. Lin, “Qwen2.5-omni technical report,” arXiv preprint arXiv:2503.20215, 2025

  36. [36]

    Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

    A. Goel, S. Ghosh, J. Kim, S. Kumar, Z. Kong, S.-g. Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle et al., “Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,” arXiv preprint arXiv:2507.08128, 2025

  37. [37]

    Rouge: A package for automatic evaluation of summaries,

    C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” in Text summarization branches out, 2004, pp. 74–81

  38. [38]

    BERTScore: Evaluating Text Generation with BERT

    T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “Bertscore: Evaluating text generation with bert,” arXiv preprint arXiv:1904.09675, 2019

  39. [39]

    Decoupled Weight Decay Regularization

    I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017

  40. [40]

    Communication-efficient learning of deep networks from decentralized data,

    B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial intelligence and statistics. PMLR, 2017, pp. 1273–1282

  41. [41]

    The Llama 3 Herd of Models

    A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024

  42. [42]

    Qwen2.5: A party of foundation models,

    Q. Team, “Qwen2.5: A party of foundation models,” September. [Online]. Available: https://qwenlm.github.io/blog/qwen2.5/