pith. machine review for the scientific record. sign in

arxiv: 2605.13672 · v1 · submitted 2026-05-13 · 💻 cs.CV

Recognition: unknown

SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification

Authors on Pith no claims yet

Pith reviewed 2026-05-14 20:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords few-shot learningshortcut learningaudio classificationbenchmarkspurious correlationscontextual shiftspretrained modelsgeneralization
0
0 comments X

The pith

A new benchmark shows few-shot audio classifiers suffer sharp drops when background correlations are broken, even in large pretrained models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates SpurAudio to isolate how few-shot audio methods depend on spurious links between target sounds and their surrounding environments. Standard tests keep those links intact, so models look strong, but the benchmark swaps backgrounds across support and query sets at multiple levels of disruption. Results show that many leading approaches lose substantial accuracy once the shortcuts disappear, and this pattern holds for large foundation models. The work also finds that methods which score similarly on ordinary benchmarks can differ sharply in how their feature representations couple with the final classifier when context changes.

Core claim

SpurAudio uses the natural separability of foreground events and background environments in audio to construct controlled, multi-level contextual shifts. When these shifts are introduced, state-of-the-art few-shot methods exhibit severe performance degradation even though they achieve comparable accuracy under conventional evaluation protocols. The same vulnerability appears in large pretrained audio foundation models, showing that the problem is not explained by limited backbone capacity. Different methods display distinct sensitivities that trace to the interaction between learned representations and inference-time classifier heads.

What carries the argument

The SpurAudio benchmark, which exploits natural separability of foreground events and background environments to create controlled multi-level contextual shifts between support and query sets.

If this is right

  • Methods that appear equivalent on standard benchmarks can reveal large differences in robustness once background correlations are removed.
  • Sensitivity to spurious context is determined by how feature representations interact with the classifier head at inference time.
  • Large-scale pretraining alone does not remove dependence on background cues in few-shot audio settings.
  • Evaluation protocols must include explicit context-disruption tests to measure generalization beyond shortcut exploitation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar controlled-separation benchmarks could be developed for image and video few-shot tasks where context is harder to isolate.
  • Methods that explicitly decouple foreground from background signals may show greater robustness across domains.
  • Real-world audio systems should be stress-tested on context-shift benchmarks before deployment in variable environments.

Load-bearing premise

Foreground audio events and background environments can be cleanly separated to produce realistic and controlled contextual variations.

What would settle it

Demonstrating that every tested few-shot method maintains its original accuracy when backgrounds are swapped in a dataset where foreground and background signals are verifiably independent of each other.

Figures

Figures reproduced from arXiv: 2605.13672 by Giries Abu Ayoub, Loay Mualem, Morad Tukan.

Figure 1
Figure 1. Figure 1: Visualization of SpurAudio’s episodic structure. (a) A 1-shot 2-way episode: two fore [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: IID–OOD gap versus spurious correla￾tion strength. We plot ∆(α) for 1/5 shot. The gap grows with α, showing strong foreground – back￾ground correlations increase OOD degradation. In the standard OOD evaluation, support and query backgrounds are sampled such that some background patterns seen in the support set do not appear in the query set of other classes. To more directly probe the effect of spurious co… view at source ↗
Figure 3
Figure 3. Figure 3: Impact of Spurious Background Correlations on Query Evaluation. We compare few-shot [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: t-SNE visualization of embedding space under IID vs. OOD episodes. Across methods, embeddings form cleaner, more separable clusters in IID settings, while OOD background shifts induce query-support misalignment and increased inter-class overlap. following figures, circles denote support points while crosses denote query points. This highlights one factor behind higher/lower average IID accuracy across seed… view at source ↗
Figure 5
Figure 5. Figure 5: t-SNE visualization of embedding space under IID vs. OOD episodes. Across methods, embeddings form cleaner, more separable clusters in IID settings, while OOD background shifts induce query-support misalignment and increased inter-class overlap. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: t-SNE visualization of embedding space under IID vs. OOD episodes. Across methods, embeddings form cleaner, more separable clusters in IID settings, while OOD background shifts induce query-support misalignment and increased inter-class overlap. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: IID vs OOD vs Hard OOD accuracy for few-shot tasks. Each candle represents the classification accuracy of a method under a given evaluation setting. IID corresponds to standard in-distribution support-query sampling, OOD represents typical out-of-distribution support-query pairs, and Hard OOD enforces maximal background overlap across classes. Accuracy decreases progressively from IID to OOD to Hard OOD, f… view at source ↗
Figure 8
Figure 8. Figure 8: IID vs OOD accuracy for few-shot families. Each candle shows the classification accuracy of an algorithm across multiple seeds. IID accuracy remains higher, while OOD accuracy exhibits a lower mean, indicating that few-shot methods partially rely on background cues. This highlights that performance degrades when support and query distributions differ, even without enforcing extreme correlations. F FSL with… view at source ↗
Figure 9
Figure 9. Figure 9: IID configuration, multiple Conv64F backbones on multiple algorithms  Key Takeaway: • Cosine-based classifiers are the most robust head family in Conv64, because they suppress the magnitude-encoded background shortcut that magnitude-sensitive heads (Euclidean ProtoNet, dot-product Baseline) exploit. • Swapping a cosine head onto a magnitude-sensitive backbone (Meta-Baseline, Base￾line) yields the largest … view at source ↗
Figure 10
Figure 10. Figure 10: OOD configuration, multiple Conv64 backbones on multiple algorithms 34 [PITH_FULL_IMAGE:figures/full_fig_p034_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: IID configuration, multiple Resnet12 backbones on multiple algorithms 35 [PITH_FULL_IMAGE:figures/full_fig_p035_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: OOD configuration, multiple Resnet12 backbones on multiple algorithms 36 [PITH_FULL_IMAGE:figures/full_fig_p036_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Geometric Decomposition of Feature Embeddings across 15 Backbones. On the y-axis, [PITH_FULL_IMAGE:figures/full_fig_p039_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Distributional Analysis Metrics per Class. Comparison of [PITH_FULL_IMAGE:figures/full_fig_p042_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Impact of support set size (K) on IID versus OOD performance. The widening gap illustrates a “Simple Bias,” where the model increasingly relies on spurious background correlations as K grows. Notably, this generalization gap eventually plateaus, indicating that the model’s convergence on the spurious feature saturates at higher shot counts. 44 [PITH_FULL_IMAGE:figures/full_fig_p044_15.png] view at source ↗
read the original abstract

Few-shot classification (FSC) is widely used for learning from limited labeled data, yet most evaluations implicitly assume that target concepts are independent of contextual cues. In real-world settings, however, examples often appear within rich contexts, allowing models to exploit spurious correlations between foreground content and background signals. While such effects have been studied in few-shot image classification, their role in few-shot audio classification remains largely unexplored, and existing audio benchmarks offer limited control over contextual structure. We introduce SpurAudio, a benchmark that leverages the natural separability of foreground events and background environments in audio to enable controlled, multi-level evaluation of contextual shifts across support and query sets. Using this benchmark, we show that many state-of-the-art few-shot methods suffer severe performance degradation when background correlations are disrupted, despite achieving similar accuracy under standard evaluation protocols. Crucially, this vulnerability persists even in large pretrained audio foundation models, ruling out limited backbone capacity as an explanation. Moreover, methods that appear comparable under conventional benchmarks can exhibit markedly different sensitivity to spurious correlations, revealing systematic algorithmic strengths and vulnerabilities tied to how feature representations interact with classifier heads at inference time. These findings provide new insight into the behavior of few-shot methods in audio and highlight the need for benchmarks that explicitly probe context dependence when evaluating FSC models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SpurAudio, a benchmark for few-shot audio classification that exploits the separability of foreground events and background environments to create controlled multi-level contextual shifts between support and query sets. It evaluates state-of-the-art few-shot methods and large pretrained audio foundation models, showing severe performance degradation when background correlations are disrupted despite comparable accuracy under standard protocols, and attributes differences to how feature representations interact with classifier heads.

Significance. If the benchmark construction successfully isolates shortcut learning without confounding acoustic domain shifts, the results would demonstrate that context dependence is a systematic vulnerability in audio FSC methods, including foundation models, and motivate more robust evaluation protocols. The work extends image-based shortcut studies to audio and provides a controlled testbed for algorithmic differences that standard benchmarks obscure.

major comments (3)
  1. [§3] §3 (Benchmark Construction): The central claim attributes performance drops solely to disruption of background correlations, but the manuscript provides no quantitative verification that acoustic properties (SNR distributions, reverberation times, spectral tilt, or event masking) are matched between standard and disrupted splits after overlaying events onto new backgrounds. Without such controls or statistics, degradation could reflect increased task difficulty or domain mismatch rather than shortcut vulnerability, which is load-bearing for the interpretation of results on both conventional methods and pretrained models.
  2. [§4.2–4.3] §4.2–4.3 (Experimental Results): The reported degradations for few-shot methods and foundation models lack statistical significance tests, confidence intervals, or multiple random seeds with variance measures. This makes it difficult to determine whether observed differences between methods (e.g., sensitivity tied to classifier-head interaction) are reliable or could arise from split variability.
  3. [§3.1] §3.1 (Multi-level Shifts): The description of how foreground/background separation is performed and how multi-level contextual shifts are generated does not include explicit details on the mixing procedure, event duration alignment, or verification that foreground events remain perceptually unchanged. This detail is required to confirm that the benchmark isolates the intended spurious correlations.
minor comments (2)
  1. [Tables/Figures] Table 1 and Figure 2 captions could more explicitly state the exact number of classes, shots, and background conditions per split to aid reproducibility.
  2. [§2] The related-work section omits several recent audio domain-adaptation papers that also study environmental context, which would strengthen positioning.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the suggested additions will improve the rigor and clarity of the manuscript and plan to incorporate them in the revision.

read point-by-point responses
  1. Referee: [§3] §3 (Benchmark Construction): The central claim attributes performance drops solely to disruption of background correlations, but the manuscript provides no quantitative verification that acoustic properties (SNR distributions, reverberation times, spectral tilt, or event masking) are matched between standard and disrupted splits after overlaying events onto new backgrounds. Without such controls or statistics, degradation could reflect increased task difficulty or domain mismatch rather than shortcut vulnerability, which is load-bearing for the interpretation of results on both conventional methods and pretrained models.

    Authors: We agree that quantitative verification of acoustic property matching is essential to isolate shortcut effects from potential confounds. In the revised manuscript we will add a dedicated analysis (new table and/or appendix) reporting SNR distributions, reverberation times, spectral tilt, and event-masking statistics for both the standard and disrupted splits. These statistics will confirm that the splits are matched on these acoustic dimensions, thereby supporting the interpretation that observed degradations stem from the disruption of background correlations. revision: yes

  2. Referee: [§4.2–4.3] §4.2–4.3 (Experimental Results): The reported degradations for few-shot methods and foundation models lack statistical significance tests, confidence intervals, or multiple random seeds with variance measures. This makes it difficult to determine whether observed differences between methods (e.g., sensitivity tied to classifier-head interaction) are reliable or could arise from split variability.

    Authors: We acknowledge the need for statistical rigor. In the revision we will rerun all experiments across at least five independent random seeds, report mean performance with standard deviation, include 95% confidence intervals, and add paired statistical significance tests (e.g., t-tests) between methods and between standard versus disrupted conditions in §§4.2–4.3. revision: yes

  3. Referee: [§3.1] §3.1 (Multi-level Shifts): The description of how foreground/background separation is performed and how multi-level contextual shifts are generated does not include explicit details on the mixing procedure, event duration alignment, or verification that foreground events remain perceptually unchanged. This detail is required to confirm that the benchmark isolates the intended spurious correlations.

    Authors: We will expand §3.1 with a precise description of the mixing procedure, including the linear mixing formula with controlled SNR, the event-duration alignment strategy (zero-padding or truncation to match background length), and verification steps (objective SNR preservation metrics plus a small-scale perceptual listening test confirming that foreground events remain perceptually unaltered after mixing). revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with independent evaluations

full rationale

The paper introduces SpurAudio as an empirical benchmark for shortcut learning in few-shot audio classification and evaluates existing methods on it. No derivations, predictions, or first-principles results are claimed; performance claims rest on direct experimental comparisons across standard and disrupted splits. The construction of the benchmark relies on external audio datasets with natural foreground/background separability, not on any self-referential fitting or self-citation chain that reduces the central claim to its own inputs. This is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on standard domain assumptions in few-shot learning and audio signal processing; no free parameters, invented entities, or ad-hoc axioms are introduced beyond those implicit in the benchmark design.

axioms (1)
  • domain assumption Target concepts in audio examples are independent of contextual background cues under standard evaluation protocols
    Explicitly stated in the abstract as the implicit assumption most evaluations make.

pith-pipeline@v0.9.0 · 5533 in / 1120 out tokens · 34958 ms · 2026-05-14T20:28:37.593180+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 4 internal anchors

  1. [1]

    Classifying sounds in polyphonic urban sound scenes.AES E-Library

    Jakob Abeßer. Classifying sounds in polyphonic urban sound scenes.AES E-Library. Online resource, 2022

  2. [2]

    How robust are audio embeddings for polyphonic sound event tagging?IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:2658–2667, 2023

    Jakob Abeßer, Sascha Grollmisch, and Meinard Müller. How robust are audio embeddings for polyphonic sound event tagging?IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:2658–2667, 2023

  3. [3]

    wav2vec 2.0: A framework for self-supervised learning of speech representations.Advances in neural information processing systems, 33:12449–12460, 2020

    Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations.Advances in neural information processing systems, 33:12449–12460, 2020

  4. [4]

    Meta-learning with task-adaptive loss function for few-shot learning

    Sungyong Baik, Janghoon Choi, Heewon Kim, Dohee Cho, Jaesik Min, and Kyoung Mu Lee. Meta-learning with task-adaptive loss function for few-shot learning. InProceedings of the IEEE/CVF international conference on computer vision, pages 9465–9474, 2021

  5. [5]

    Meta-learning with differentiable closed-form solvers

    Luca Bertinetto, Joao F Henriques, Philip HS Torr, and Andrea Vedaldi. Meta-learning with differentiable closed-form solvers.arXiv preprint arXiv:1805.08136, 2018

  6. [6]

    Semi-supervised learning (chapelle, o

    Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. Semi-supervised learning (chapelle, o. et al., eds.; 2006)[book reviews].IEEE Transactions on Neural Networks, 20(3):542–542, 2009

  7. [7]

    Beats: Audio pre-training with acoustic tokenizers.arXiv preprint arXiv:2212.09058, 2022

    Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, and Furu Wei. Beats: Audio pre-training with acoustic tokenizers.arXiv preprint arXiv:2212.09058, 2022

  8. [8]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InInternational conference on machine learning, pages 1597–1607. PmLR, 2020

  9. [9]

    A closer look at few-shot classification.arXiv preprint arXiv:1904.04232, 2019

    Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification.arXiv preprint arXiv:1904.04232, 2019

  10. [10]

    Meta-baseline: Exploring simple meta-learning for few-shot learning

    Yinbo Chen, Zhuang Liu, Huijuan Xu, Trevor Darrell, and Xiaolong Wang. Meta-baseline: Exploring simple meta-learning for few-shot learning. InProceedings of the IEEE/CVF international conference on computer vision, pages 9062–9071, 2021. 10

  11. [11]

    Qwen2-Audio Technical Report

    Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen2-audio technical report. arXiv preprint arXiv:2407.10759, 2024

  12. [12]

    Cough classification using few-shot learning.arXiv e-prints, pages arXiv–2509, 2025

    Yoga Disha Sendhil Kumar, Manas V Shetty, and Sudip Vhaduri. Cough classification using few-shot learning.arXiv e-prints, pages arXiv–2509, 2025

  13. [13]

    Learning task-aware local rep- resentations for few-shot learning

    Chuanqi Dong, Wenbin Li, Jing Huo, Zheng Gu, and Yang Gao. Learning task-aware local rep- resentations for few-shot learning. InProceedings of the twenty-ninth international conference on international joint conferences on artificial intelligence, pages 716–722, 2021

  14. [14]

    Clap learning audio concepts from natural language supervision

    Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. Clap learning audio concepts from natural language supervision. InIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023

  15. [16]

    Model-agnostic meta-learning for fast adap- tation of deep networks

    Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adap- tation of deep networks. InInternational conference on machine learning, pages 1126–1135. PMLR, 2017

  16. [17]

    Exploring finetuned audio-llm on heart murmur features.Smart Health, page 100557, 2025

    Adrian Florea, Xilin Jiang, Nima Mesgarani, and Xiaofan Jiang. Exploring finetuned audio-llm on heart murmur features.Smart Health, page 100557, 2025

  17. [18]

    Fsd50k: an open dataset of human-labeled sound events.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:829–852, 2021

    Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. Fsd50k: an open dataset of human-labeled sound events.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:829–852, 2021

  18. [19]

    Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, 2020

    Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, 2020

  19. [20]

    Ghani, T

    B. Ghani, T. Denton, and S. Kahl. Deep learning for bioacoustics: A survey of recent advances in few-shot recognition.Journal of Applied Ecology (Example Citation), 12(4):112–125, 2024

  20. [21]

    Ast: Audio spectrogram transformer,

    Yuan Gong, Yu-An Chung, and James Glass. Ast: Audio spectrogram transformer.arXiv preprint arXiv:2104.01778, 2021

  21. [22]

    V ocalsound: A dataset for improving human vocal sounds recognition

    Yuan Gong, Jin Yu, and James Glass. V ocalsound: A dataset for improving human vocal sounds recognition. InICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 151–155, 2022

  22. [23]

    Entropy calibrated prototype embedding for transductive few-shot learning.Pattern Recognition Letters, 2026

    Mengfei Guo, Jiahui Wang, Qin Xu, Bo Jiang, and Bin Luo. Entropy calibrated prototype embedding for transductive few-shot learning.Pattern Recognition Letters, 2026

  23. [24]

    Metaaudio: A few-shot audio classification benchmark

    Calum Heggan, Sam Budgett, Timothy Hospedales, and Mehrdad Yaghoobi. Metaaudio: A few-shot audio classification benchmark. InInternational Conference on Artificial Neural Networks, pages 219–230. Springer, 2022

  24. [25]

    Masked autoencoders that listen.Advances in neural information processing systems, 35:28708–28720, 2022

    Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, and Christoph Feichtenhofer. Masked autoencoders that listen.Advances in neural information processing systems, 35:28708–28720, 2022

  25. [26]

    Reshaping bioacoustics event detection: Leverag- ing few-shot learning (fsl) with transductive inference and data augmentation.Bioengineering, 11(7):685, 2024

    Nouman Ijaz, Farhad Banoori, and Insoo Koo. Reshaping bioacoustics event detection: Leverag- ing few-shot learning (fsl) with transductive inference and data augmentation.Bioengineering, 11(7):685, 2024

  26. [27]

    An automated pipeline for few-shot bird call classification: A case study with the tooth-billed pigeon.arXiv preprint arXiv:2504.16276, 2025

    Abhishek Jana, Moeumu Uili, James Atherton, Mark O’Brien, Joe Wood, and Leandra Brickson. An automated pipeline for few-shot bird call classification: A case study with the tooth-billed pigeon.arXiv preprint arXiv:2504.16276, 2025

  27. [28]

    Efficient training of audio transformers with patchout.arXiv preprint arXiv:2110.05069, 2021

    Khaled Koutini, Jan Schlüter, Hamid Eghbal-Zadeh, and Gerhard Widmer. Efficient training of audio transformers with patchout.arXiv preprint arXiv:2110.05069, 2021. 11

  28. [29]

    Hela- vfa: A hellinger distance-attention-based feature aggregation network for few-shot classification

    Gao Yu Lee, Tanmoy Dam, Daniel Puiu Poenar, Vu N Duong, and Md Meftahul Ferdaus. Hela- vfa: A hellinger distance-attention-based feature aggregation network for few-shot classification. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2173–2183, 2024

  29. [30]

    Are fewer labels possible for few-shot learning?arXiv preprint arXiv:2012.05899, 2020

    Suichan Li, Dongdong Chen, Yinpeng Chen, Lu Yuan, Lei Zhang, Qi Chu, and Nenghai Yu. Are fewer labels possible for few-shot learning?arXiv preprint arXiv:2012.05899, 2020

  30. [31]

    Asymmetric distribution measure for few-shot learning.arXiv preprint arXiv:2002.00153, 2020

    Wenbin Li, Lei Wang, Jing Huo, Yinghuan Shi, Yang Gao, and Jiebo Luo. Asymmetric distribution measure for few-shot learning.arXiv preprint arXiv:2002.00153, 2020

  31. [32]

    Revisiting local descriptor based image-to-class measure for few-shot learning

    Wenbin Li, Lei Wang, Jinglin Xu, Jing Huo, Yang Gao, and Jiebo Luo. Revisiting local descriptor based image-to-class measure for few-shot learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7260–7268, 2019

  32. [33]

    Libfewshot: A comprehensive library for few-shot learning.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):14938– 14955, 2023

    Wenbin Li, Ziyi Wang, Xuesong Yang, Chuanqi Dong, Pinzhuo Tian, Tiexin Qin, Jing Huo, Yinghuan Shi, Lei Wang, Yang Gao, et al. Libfewshot: A comprehensive library for few-shot learning.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):14938– 14955, 2023

  33. [34]

    Federated few-shot learning-based machinery fault diagnosis in the industrial internet of things.Applied Sciences, 13(18):10458, 2023

    Yingying Liang, Peng Zhao, and Yimeng Wang. Federated few-shot learning-based machinery fault diagnosis in the industrial internet of things.Applied Sciences, 13(18):10458, 2023

  34. [35]

    Prototype rectification for few-shot learning

    Jinlu Liu, Liang Song, and Yongqiang Qin. Prototype rectification for few-shot learning. In European conference on computer vision, pages 741–756. Springer, 2020

  35. [36]

    Few-shot bioacoustic event detection at the dcase 2024 challenge.Recall, 1:F1_PB, 2024

    Wei Liu, Hy Liu, Fl Lin, Hs Liu, Tian Gao, Xin Fang, Jh Liu, Xuyao Deng, Yanjie Sun, Kele Xu, et al. Few-shot bioacoustic event detection at the dcase 2024 challenge.Recall, 1:F1_PB, 2024

  36. [37]

    Learning to affiliate: Mutual centralized learning for few-shot classification

    Yang Liu, Weifeng Zhang, Chao Xiang, Tu Zheng, Deng Cai, and Xiaofei He. Learning to affiliate: Mutual centralized learning for few-shot classification. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14411–14420, 2022

  37. [38]

    Clever hans effect found in automatic detection of alzheimer’s disease through speech.arXiv preprint arXiv:2406.07410, 2024

    Yin-Long Liu, Rui Feng, Jia-Hong Yuan, and Zhen-Hua Ling. Clever hans effect found in automatic detection of alzheimer’s disease through speech.arXiv preprint arXiv:2406.07410, 2024

  38. [39]

    Whamr!: Noisy and reverberant single-channel speech separation

    Matthew Maciejewski, Gordon Wichern, Emmett McQuinn, and Jonathan Le Roux. Whamr!: Noisy and reverberant single-channel speech separation. InICASSP 2020-2020 IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 696–700. IEEE, 2020

  39. [40]

    Towards practical few-shot query sets: Transductive minimum description length infer- ence.Advances in Neural Information Processing Systems, 35:34677–34688, 2022

    Ségolène Martin, Malik Boudiaf, Emilie Chouzenoux, Jean-Christophe Pesquet, and Ismail Ayed. Towards practical few-shot query sets: Transductive minimum description length infer- ence.Advances in Neural Information Processing Systems, 35:34677–34688, 2022

  40. [41]

    Active few-shot learning for rare bioacoustic feature annotation.Ecological Informatics, 82:102734, 2024

    Ben McEwen, Kaspar Soltero, Stefanie Gutschmidt, Andrew Bainbridge-Smith, James Atlas, and Richard Green. Active few-shot learning for rare bioacoustic feature annotation.Ecological Informatics, 82:102734, 2024

  41. [42]

    Pretraining representations for bioacoustic few-shot detection using supervised contrastive learning.arXiv preprint arXiv:2309.00878, 2023

    Ilyass Moummad, Romain Serizel, and Nicolas Farrugia. Pretraining representations for bioacoustic few-shot detection using supervised contrastive learning.arXiv preprint arXiv:2309.00878, 2023

  42. [43]

    Learning to detect an animal sound from five examples.Ecological informatics, 77:102258, 2023

    Ines Nolasco, Shubhr Singh, Veronica Morfi, Vincent Lostanlen, Ariana Strandburg-Peshkin, Ester Vidaña-Vila, Lisa Gill, Hanna Pamuła, Helen Whitehead, Ivan Kiskin, et al. Learning to detect an animal sound from five examples.Ecological informatics, 77:102258, 2023

  43. [44]

    Boil: Towards representation change for few-shot learning.arXiv preprint arXiv:2008.08882, 2020

    Jaehoon Oh, Hyungjun Yoo, ChangHwan Kim, and Se-Young Yun. Boil: Towards representation change for few-shot learning.arXiv preprint arXiv:2008.08882, 2020

  44. [45]

    Karol J. Piczak. ESC: Dataset for Environmental Sound Classification. InProceedings of the 23rd Annual ACM Conference on Multimedia, pages 1015–1018. ACM Press. 12

  45. [46]

    Rapid learning or feature reuse? towards understanding the effectiveness of maml,

    Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid learning or feature reuse? towards understanding the effectiveness of maml.arXiv preprint arXiv:1909.09157, 2019

  46. [47]

    Meta-Learning with Latent Embedding Optimization

    Andrei A Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization.arXiv preprint arXiv:1807.05960, 2018

  47. [48]

    Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization

    Shiori Sagawa and Pang Wei Koh. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization

  48. [49]

    Salamon, C

    J. Salamon, C. Jacoby, and J. P. Bello. A dataset and taxonomy for urban sound research. In 22nd ACM International Conference on Multimedia (ACM-MM’14), pages 1041–1044, Orlando, FL, USA, Nov. 2014

  49. [50]

    Deep convolutional neural networks and data augmen- tation for environmental sound classification.IEEE Signal processing letters, 24(3):279–283, 2017

    Justin Salamon and Juan Pablo Bello. Deep convolutional neural networks and data augmen- tation for environmental sound classification.IEEE Signal processing letters, 24(3):279–283, 2017

  50. [51]

    An optimized few-shot learning framework for fault diagnosis in milling machines.Machines, 13(11):1010, 2025

    Faisal Saleem, Muhammad Umar, and Jong-Myon Kim. An optimized few-shot learning framework for fault diagnosis in milling machines.Machines, 13(11):1010, 2025

  51. [52]

    Wav2vec: Unsupervised pre-training for speech recognition,

    Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. wav2vec: Unsupervised pre-training for speech recognition.arXiv preprint arXiv:1904.05862, 2019

  52. [53]

    Ramaneswaran Selvakumar, Sonal Kumar, Hemant Kumar Giri, Nishit Anand, Ashish Seth, Sreyan Ghosh, and Dinesh Manocha. Do audio-language models understand linguistic varia- tions? InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)...

  53. [54]

    Prototypical contrastive learning for improved few shot audio classification

    Christos Sgouropoulos, Christos Nikou, Stefanos Vlachos, Vasileios Theiou, Christos Foukanelis, and Theodoros Giannakopoulos. Prototypical contrastive learning for improved few shot audio classification. In2025 IEEE 35th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6. IEEE, 2025

  54. [55]

    University of Haifa (Israel), 2024

    Daniel Shalam.The balanced-pairwise-affinities feature transform. University of Haifa (Israel), 2024

  55. [56]

    A few-shot learning based fault diagnosis model using sensors data from industrial machineries.Vibration, 6(4):1004–1029, 2023

    Farhan Md Siraj, Syed Tasnimul Karim Ayon, and Jia Uddin. A few-shot learning based fault diagnosis model using sensors data from industrial machineries.Vibration, 6(4):1004–1029, 2023

  56. [57]

    Prototypical networks for few-shot learning

    Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in neural information processing systems, 30, 2017

  57. [58]

    Learning to compare: Relation network for few-shot learning

    Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1199–1208, 2018

  58. [59]

    Sound event detection and separation: a benchmark on desed synthetic soundscapes

    Nicolas Turpault, Romain Serizel, Scott Wisdom, Hakan Erdogan, John R Hershey, Eduardo Fonseca, Prem Seetharaman, and Justin Salamon. Sound event detection and separation: a benchmark on desed synthetic soundscapes. InICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 840–844. IEEE, 2021

  59. [60]

    Matching networks for one shot learning.Advances in neural information processing systems, 29, 2016

    Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning.Advances in neural information processing systems, 29, 2016

  60. [61]

    Generalizing from a few examples: A survey on few-shot learning.ACM computing surveys (CSUR), 53(3):1–34, 2020

    Yaqing Wang, Quanming Yao, James T Kwok, and Lionel M Ni. Generalizing from a few examples: A survey on few-shot learning.ACM computing surveys (CSUR), 53(3):1–34, 2020

  61. [62]

    Few-shot classification with feature map reconstruction networks

    Davis Wertheimer, Luming Tang, and Bharath Hariharan. Few-shot classification with feature map reconstruction networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8012–8021, 2021. 13

  62. [63]

    WHAM!: Extending Speech Separation to Noisy Environments

    Gordon Wichern, Joe Antognini, Michael Flynn, Licheng Richard Zhu, Emmett McQuinn, Dwight Crow, Ethan Manilow, and Jonathan Le Roux. Wham!: Extending speech separation to noisy environments.arXiv preprint arXiv:1907.01160, 2019

  63. [64]

    Characteristics of real-world signal to noise ratios and speech listening situations of older adults with mild to moderate hearing loss.Ear and hearing, 39(2):293–304, 2018

    Yu-Hsiang Wu, Elizabeth Stangl, Octav Chipara, Syed Shabih Hasan, Anne Welhaven, and Jacob Oleson. Characteristics of real-world signal to noise ratios and speech listening situations of older adults with mild to moderate hearing loss.Ear and hearing, 39(2):293–304, 2018

  64. [65]

    Large-scale contrastive language-audio pretraining with feature fusion and keyword- to-caption augmentation

    Yusong Wu*, Ke Chen*, Tianyu Zhang*, Yuchen Hui*, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword- to-caption augmentation. InIEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2023

  65. [66]

    Wilddesed: an llm-powered dataset for wild domestic environment sound event detection system.arXiv preprint arXiv:2407.03656, 2024

    Yang Xiao and Rohan Kumar Das. Wilddesed: an llm-powered dataset for wild domestic environment sound event detection system.arXiv preprint arXiv:2407.03656, 2024

  66. [67]

    Joint distribution matters: Deep brownian distance covariance for few-shot classification

    Jiangtao Xie, Fei Long, Jiaming Lv, Qilong Wang, and Peihua Li. Joint distribution matters: Deep brownian distance covariance for few-shot classification. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7972–7981, 2022

  67. [68]

    Transformer- based bioacoustic sound event detection on few-shot learning tasks

    Liwen You, Erika Pelaez Coyotl, Suren Gunturu, and Maarten Van Segbroeck. Transformer- based bioacoustic sound event detection on few-shot learning tasks. InICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023

  68. [69]

    Few-shot learning-based machine fault diagnosis using emd-gammatone spectrogram with limited labeled audio dataset

    Mahe Zabin, Syed Tasnimul Karim Ayon, Farhan Md Siraj, Mehedi Hasan Shuvo, Ho-Jin Choi, and Jia Uddin. Few-shot learning-based machine fault diagnosis using emd-gammatone spectrogram with limited labeled audio dataset. In2025 IEEE International Conference on Big Data and Smart Computing (BigComp), pages 183–190. IEEE, 2025

  69. [70]

    Zhang and et al

    Y . Zhang and et al. Metacoco: A benchmark for spurious correlation in few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),

  70. [71]

    Based on user prompt description

  71. [72]

    Diffkendall: A novel approach for few- shot learning with differentiable kendall’s rank correlation.Advances in Neural Information Processing Systems, 36:49403–49415, 2023

    Kaipeng Zheng, Huishuai Zhang, and Weiran Huang. Diffkendall: A novel approach for few- shot learning with differentiable kendall’s rank correlation.Advances in Neural Information Processing Systems, 36:49403–49415, 2023

  72. [73]

    Transductive few-shot learning with prototype-based label propa- gation by iterative graph refinement

    Hao Zhu and Piotr Koniusz. Transductive few-shot learning with prototype-based label propa- gation by iterative graph refinement. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 23996–24006, 2023

  73. [74]

    Laplacian regularized few-shot learning

    Imtiaz Ziko, Jose Dolz, Eric Granger, and Ismail Ben Ayed. Laplacian regularized few-shot learning. InInternational conference on machine learning, pages 11660–11670. PMLR, 2020

  74. [75]

    Magnitude Contraction

    Imtiaz Masud Ziko, Malik Boudiaf, Jose Dolz, Eric Granger, and Ismail Ben Ayed. Transductive few-shot learning: Clustering is all you need?arXiv preprint arXiv:2106.09516, 2021. 14 Supplementary Material Contents Appendix Contents AExperimental Setup. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ....

  75. [76]

    signal energy

    Magnitude (Radial Component): r=∥z∥ 2, representing the activation intensity or "signal energy."

  76. [77]

    Magnitude Contraction

    Direction (Angular Component): ˆz= z ∥z∥2 , representing the semantic identity on the hypersphereS d−1. To quantify the semantic alignment, we compute theCosine Similaritybetween a sample’s direction ˆz and its corresponding clean (without backgrounds) class prototype pc. The prototype is defined as the mean direction of the clean, foreground-only samples...

  77. [78]

    Magnitude Contraction

    Failure of Euclidean Metrics (e.g., ProtoNet):Euclidean distance is sensitive to magnitude differences. For a queryqand prototypep: ∥q−p∥ 2 =∥q∥ 2 +∥p∥ 2 | {z } Magnitude Term −2∥q∥∥p∥cosθ| {z } Interaction Term (4) The "Magnitude Contraction" observed in mixed samples (∥q∥ ↓ ) reduces the interaction term and alters the magnitude term, creating a large E...

  78. [79]

    in-the-wild

    Robustness of Cosine Metrics (e.g., Baseline++, Meta-Baseline):Cosine-based heads explicitly normalize feature vectors during inference: Score(q, p) = qT p ∥q∥∥p∥ = cosθ(5) By projecting all embeddings onto the unit hypersphere, these algorithms mathematically nullify the magnitude axis. Since the background information is sequestered primarily in the mag...