Drift-Augmented Scoring: Text-Derived Noise Robustness for Zero-Shot Audio-Language Classification

Chan Y. Park; Sheir Zaheer; Tu Vo

arxiv: 2606.04844 · v1 · pith:5UTZFYQ2new · submitted 2026-06-03 · 💻 cs.SD · cs.CV

Drift-Augmented Scoring: Text-Derived Noise Robustness for Zero-Shot Audio-Language Classification

Tu Vo , Sheir Zaheer , Chan Y. Park This is my paper

Pith reviewed 2026-06-28 05:07 UTC · model grok-4.3

classification 💻 cs.SD cs.CV

keywords zero-shot audio classificationnoise robustnesscontrastive audio-language modelsCLAPdrift-augmented scoringacoustic noisetext-derived bonus

0 comments

The pith

Drift-Augmented Scoring adds a text-derived per-class bonus to restore zero-shot audio classification accuracy under acoustic noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Contrastive audio-language models lose 12-30 points of accuracy when audio is mixed with noise because embeddings drift away from clean text prompts. The paper proposes Drift-Augmented Scoring, which computes a small additive bonus for each class from the change in its noise-conditioned text embeddings and adds that bonus to the cosine score. The bonus is derived entirely from text, cached once, and requires only one extra inner product per class at inference with no gradients or test batches. On LAION CLAP, the method raises accuracy 2.60-5.75 points on UrbanSound8K and mAP 1.50-1.74 points on FSD50K when clips are mixed with urban scene noise across a range of SNRs.

Core claim

Drift-Augmented Scoring improves matching by rewarding each class when the noisy audio embedding moves in the direction predicted by the difference between its clean and noise-conditioned text prompt embeddings.

What carries the argument

Drift-Augmented Scoring (DAS), a cached per-class bonus computed from the change in noise-conditioned text prompt embeddings and added to the cosine similarity score.

If this is right

DAS raises accuracy on every SNR condition tested on UrbanSound8K.
DAS raises mAP on every SNR condition tested on the full FSD50K evaluation set.
The method adds only a single inner product per class at inference after one-time caching.
The improvement holds when the audio is mixed with real urban acoustic scene recordings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same text-derived drift correction could be applied to other contrastive models that match embeddings across modalities.
Because the bonus is precomputed from text alone, the technique can be added to existing deployed zero-shot systems with negligible extra cost.
If the prediction of audio drift from text generalizes, it may reduce the need for noise-specific audio training data.

Load-bearing premise

The direction in which noise moves an audio embedding can be predicted from how the same noise changes the text prompt embeddings for that class.

What would settle it

A test set in which the DAS bonus for each class shows no positive correlation with the actual improvement in classification accuracy for that class under the same noise.

Figures

Figures reproduced from arXiv: 2606.04844 by Chan Y. Park, Sheir Zaheer, Tu Vo.

**Figure 1.** Figure 1: Drift-Augmented Scoring (DAS). (a) Offline, text-only: for each class c the CLAP text encoder produces a clean prototype Cc from "sound of c" and a unit-norm drift direction ˆδc that is the (normalised) mean of text drifts CLAP(”c with p”)−Cc over a small set of noise phrases. (b) Test-time scoring in three steps: (1) encode the audio clip into z; (2) for each class c compute two cosines—a clean cosine a1(… view at source ↗

read the original abstract

Contrastive audio-language models such as CLAP enable zero-shot audio classification: a sound is labelled by matching its embedding to text prompt embeddings, with no labelled audio. This matching breaks down under acoustic noise, where accuracy and mAP fall by 12-30 percentage points at 0 dB SNR on standard benchmarks. We propose Drift Augmented Scoring (DAS), a small per-class bonus added to the cosine score. The bonus rewards a class when the noisy audio embedding drifts in the direction that the class's noise-conditioned text prompts predict. It is derived from text alone, computed once and cached, and adds a single inner product per class at inference, with no gradients and no test-time batch. On a LAION CLAP backbone, we compare DAS against the four variants of Acevedo et al.'s concurrent method on UrbanSound8K and the full FSD50K eval set, mixing each clip with urban acoustic scene noise across a range of SNRs. DAS improves the metric on every test condition: by +2.60 to +5.75 accuracy points on UrbanSound8K and +1.50 to +1.74 mAP points on FSD50K.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DAS gives small consistent gains on noisy zero-shot audio tasks by adding a cached text-derived bonus, but the paper never checks whether the predicted drift direction actually matches observed audio drift.

read the letter

The main thing here is a simple scoring tweak called Drift Augmented Scoring. It adds a per-class bonus to the cosine similarity in CLAP-style zero-shot audio classification. The bonus comes from the inner product between the difference in noise-conditioned versus clean text embeddings and the difference between noisy and clean audio embeddings. The claim is that this can be precomputed from text alone and improves accuracy or mAP by a few points when urban noise is mixed in at various SNRs.

What the paper actually shows is that on LAION CLAP, DAS beats the four Acevedo variants on every reported condition: +2.6 to +5.75 accuracy on UrbanSound8K and +1.5 to +1.74 mAP on full FSD50K. The method is cheap at inference and avoids any test-time optimization or gradients. That is the concrete contribution.

The soft spot is the missing check on the central assumption. The method only helps if the text-derived drift vector points in the same direction as the actual audio embedding drift on average. The abstract and experiments report only the downstream metric lifts; there is no direct measurement of alignment between the two vectors, no ablation that keeps magnitude but removes direction, and no case where the predicted direction is anti-aligned. Without that, the gains could come from any fixed per-class offset rather than from correctly anticipating drift. The derivation details and exact dataset splits are also thin in the provided material.

This is a narrow but practical robustness patch for people already using CLAP-style models on noisy audio. It is worth a serious referee if the full methods section supplies the missing alignment check and reproducible code. Otherwise it stays incremental.

Referee Report

2 major / 2 minor

Summary. The paper proposes Drift-Augmented Scoring (DAS), a lightweight per-class additive bonus to cosine similarity scores in zero-shot audio-language models such as CLAP. The bonus is derived solely from text prompts (noise-conditioned vs. clean) to reward classes whose predicted embedding drift direction aligns with observed drift in noisy audio embeddings. It is computed once and cached, requiring only one extra inner product per class at inference. On a LAION CLAP backbone, DAS is compared to four variants of a concurrent method and yields consistent gains: +2.60 to +5.75 accuracy points on UrbanSound8K and +1.50 to +1.74 mAP points on FSD50K when clips are mixed with urban acoustic scene noise across SNRs.

Significance. If the core premise is verified, DAS would provide a training-free, text-only robustness technique that adds negligible inference cost and avoids any audio data in the bonus computation. The reported gains across every tested SNR and dataset are non-trivial for zero-shot settings, but their attribution to the drift-prediction mechanism (rather than a generic per-class offset) remains untested in the provided abstract and experimental summary.

major comments (2)

[Abstract] Abstract: the central claim that the bonus 'rewards a class when the noisy audio embedding drifts in the direction that the class's noise-conditioned text prompts predict' is load-bearing, yet the manuscript reports only downstream accuracy/mAP deltas. No direct measurement of alignment (e.g., mean cosine between (noisy audio emb − clean audio emb) and (noise-text emb − clean-text emb)) or ablation that isolates the direction term while preserving magnitude is described. Without these checks the gains could arise from any additive per-class constant.
[Abstract] Abstract (methods description): the claim that the bonus is 'derived from text alone' and 'parameter-free' cannot be evaluated because no equations, derivation, or explicit definition of the bonus term (including any scaling factor or normalization) are supplied. This prevents verification that the method avoids fitting to audio data or reducing to a learned offset.

minor comments (2)

[Abstract] Abstract: error bars, number of runs, exact dataset splits, and exclusion criteria for the UrbanSound8K and FSD50K experiments are not stated, making it impossible to assess statistical reliability of the reported deltas.
[Abstract] Abstract: the comparison is limited to 'the four variants of Acevedo et al.'s concurrent method'; the manuscript should clarify whether these are the only relevant baselines and whether any standard noise-robustness techniques (e.g., SpecAugment-style augmentation at inference or prompt ensembling) were also evaluated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. The comments correctly identify that the abstract would benefit from additional supporting analyses and a more explicit description of the bonus term. We will revise the manuscript accordingly, as outlined in the point-by-point responses below.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that the bonus 'rewards a class when the noisy audio embedding drifts in the direction that the class's noise-conditioned text prompts predict' is load-bearing, yet the manuscript reports only downstream accuracy/mAP deltas. No direct measurement of alignment (e.g., mean cosine between (noisy audio emb − clean audio emb) and (noise-text emb − clean-text emb)) or ablation that isolates the direction term while preserving magnitude is described. Without these checks the gains could arise from any additive per-class constant.

Authors: We agree that direct verification of the alignment mechanism would strengthen the attribution of gains. In the revised manuscript we will add (1) a quantitative measurement of mean cosine similarity between the observed audio drift vector (noisy embedding minus clean embedding) and the text-predicted drift vector (noise-conditioned text embedding minus clean text embedding) across classes, datasets, and SNRs, and (2) an ablation that replaces the direction-specific bonus with a constant per-class offset of matched magnitude. These additions will appear in a new subsection of the results and will demonstrate that the reported improvements are not reproducible with a generic additive offset. revision: yes
Referee: [Abstract] Abstract (methods description): the claim that the bonus is 'derived from text alone' and 'parameter-free' cannot be evaluated because no equations, derivation, or explicit definition of the bonus term (including any scaling factor or normalization) are supplied. This prevents verification that the method avoids fitting to audio data or reducing to a learned offset.

Authors: The abstract's length constraint prevented inclusion of the full derivation. The complete manuscript (Section 3) defines the bonus explicitly as a scaled inner product between the audio embedding and a fixed, pre-computed text drift vector obtained solely from the difference between noise-conditioned and clean text embeddings; no audio data or learned parameters enter the computation. To address the referee's concern at the abstract level, we will revise the abstract to include a concise one-sentence definition of the bonus together with a pointer to the methods section for the full equations and normalization details. This change will make the text-only, parameter-free character immediately verifiable from the abstract. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper defines the DAS bonus explicitly as a quantity computed solely from differences between noise-conditioned and clean text prompt embeddings on the CLAP backbone; this quantity is cached and added via a single inner product at inference time. No equations or steps in the provided description reduce the bonus to a fit on audio embeddings, a self-referential definition, or a self-citation chain that bears the central claim. The reported gains are downstream empirical deltas on UrbanSound8K and FSD50K; the alignment assumption between text-derived and audio drift vectors is presented as a hypothesis validated by those deltas rather than derived by construction from the target metric. This satisfies the criteria for a self-contained, non-circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate free parameters or invented entities; the central idea rests on one domain assumption about text predicting audio drift.

axioms (1)

domain assumption The direction in which noisy audio embeddings drift can be predicted from changes in noise-conditioned text prompt embeddings for each class
This premise is required for the per-class bonus to be derivable from text alone as described.

pith-pipeline@v0.9.1-grok · 5749 in / 1292 out tokens · 32157 ms · 2026-06-28T05:07:59.733219+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

25 extracted references · 8 canonical work pages · 2 internal anchors

[1]

Salamon, Justin and MacConnell, Duncan and Cartwright, Mark and Li, Peter and Bello, Juan Pablo , booktitle =
[2]

Detection and Classification of Acoustic Scenes and Events Workshop (DCASE) , year =

A multi-device dataset for urban acoustic scene classification , author =. Detection and Classification of Acoustic Scenes and Events Workshop (DCASE) , year =
[3]

The Diverse Environments Multi-Channel Acoustic Noise Database (

Thiemann, Joachim and Ito, Nobutaka and Vincent, Emmanuel , booktitle =. The Diverse Environments Multi-Channel Acoustic Noise Database (. 2013 , publisher =

2013
[4]

MUSAN: A Music, Speech, and Noise Corpus

Snyder, David and Chen, Guoguo and Povey, Daniel , year =. 1510.08484 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Test-time prompt tuning for zero-shot generalization in vision-language models , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =
[6]

International Conference on Learning Representations (ICLR) , year =

Tent: Fully test-time adaptation by entropy minimization , author =. International Conference on Learning Representations (ICLR) , year =
[7]

2025 , eprint =

Seth, Ashish and Selvakumar, Ramaneswaran and Kumar, Sonal and Ghosh, Sreyan and Manocha, Dinesh , booktitle =. 2025 , eprint =

2025
[8]

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , year =

Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation , author =. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , year =
[9]

Elizalde, Benjamin and Deshmukh, Soham and Al Ismail, Mahmoud and Wang, Huaming , booktitle =
[10]

2025 , eprint =

Ghosh, Sreyan and Kumar, Sonal and Evuru, Chandra Kiran Reddy and Nieto, Oriol and Duraiswami, Ramani and Manocha, Dinesh , booktitle =. 2025 , eprint =

2025
[11]

2025 , eprint =

Anand, Nishit and Seth, Ashish and Duraiswami, Ramani and Manocha, Dinesh , booktitle =. 2025 , eprint =

2025
[12]

2025 , eprint =

Hakim, Gustavo Adolfo Vargas and Osowiechi, David and Noori, Mehrdad and Cheraghalikhani, Milad and Bahri, Ali and Yazdanpanah, Moslem and Ben Ayed, Ismail and Desrosiers, Christian , booktitle =. 2025 , eprint =

2025
[13]

Debiasing vision- language models via biased prompts.arXiv preprint arXiv:2302.00070, 2023

Debiasing vision-language models via biased prompts , author =. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =. 2302.00070 , archivePrefix =

work page arXiv
[14]

PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection

Bi, Jinhe and Wang, Yifan and Yan, Danqi and Ma, Xiaowen and Hecker, Artur and Tresp, Volker and Ma, Yunpu , year =. 2502.12119 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Selective vision-language subspace projection for few-shot

Zhu, Xingyu and Zhu, Beier and Tan, Yi and Wang, Shuo and Hao, Yanbin and Zhang, Hanwang , booktitle =. Selective vision-language subspace projection for few-shot. 2024 , eprint =

2024
[16]

Alicia and Gao, Ye , year =

Shi, Jiacheng and Du, Hongfei and Hong, Y. Alicia and Gao, Ye , year =. 2509.25495 , archivePrefix =

work page arXiv
[17]

IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) , year =

Just shift it: Test-time prototype shifting for zero-shot generalization with vision-language models , author =. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) , year =. 2403.12952 , archivePrefix =

work page arXiv
[18]

Proceedings of Interspeech , year =

Domain adaptation method and modality gap impact in audio-text models for prototypical sound classification , author =. Proceedings of Interspeech , year =. 2506.04376 , archivePrefix =

work page arXiv
[19]

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , year =

Leveraging prediction entropy for automatic prompt weighting in zero-shot audio-language classification , author =. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , year =. 2601.05011 , archivePrefix =

work page arXiv
[20]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages =

Advancing test-time adaptation in wild acoustic test settings , author =. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages =. 2024 , eprint =

2024
[21]

2024 , eprint =

Multiple consistency-guided test-time adaptation for contrastive audio--language models with unlabeled audio , author =. 2024 , eprint =

2024
[22]

2025 , eprint =

Sun, Jingchen and Han, Shaobo and Kohno, Wataru and Chen, Changyou , booktitle =. 2025 , eprint =

2025
[23]

2025 , eprint =

Niizumi, Daisuke and Takeuchi, Daiki and Yasuda, Masahiro and Nguyen, Binh Thien and Ohishi, Yasunori and Harada, Noboru , journal =. 2025 , eprint =

2025
[24]

2025 , eprint =

Audio Flamingo 2: An audio--language model with long-audio understanding and expert reasoning abilities , author =. 2025 , eprint =

2025
[25]

2506.11350 , archivePrefix =

Dinkel, Heinrich and Yan, Zhiyong and Wang, Tianzi and Wang, Yongqing and Sun, Xingwei and Niu, Yadong and Liu, Jizhong and Li, Gang and Zhang, Junbo and Wang, Yujun , year =. 2506.11350 , archivePrefix =

work page arXiv

[1] [1]

Salamon, Justin and MacConnell, Duncan and Cartwright, Mark and Li, Peter and Bello, Juan Pablo , booktitle =

[2] [2]

Detection and Classification of Acoustic Scenes and Events Workshop (DCASE) , year =

A multi-device dataset for urban acoustic scene classification , author =. Detection and Classification of Acoustic Scenes and Events Workshop (DCASE) , year =

[3] [3]

The Diverse Environments Multi-Channel Acoustic Noise Database (

Thiemann, Joachim and Ito, Nobutaka and Vincent, Emmanuel , booktitle =. The Diverse Environments Multi-Channel Acoustic Noise Database (. 2013 , publisher =

2013

[4] [4]

MUSAN: A Music, Speech, and Noise Corpus

Snyder, David and Chen, Guoguo and Povey, Daniel , year =. 1510.08484 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Test-time prompt tuning for zero-shot generalization in vision-language models , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[6] [6]

International Conference on Learning Representations (ICLR) , year =

Tent: Fully test-time adaptation by entropy minimization , author =. International Conference on Learning Representations (ICLR) , year =

[7] [7]

2025 , eprint =

Seth, Ashish and Selvakumar, Ramaneswaran and Kumar, Sonal and Ghosh, Sreyan and Manocha, Dinesh , booktitle =. 2025 , eprint =

2025

[8] [8]

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , year =

Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation , author =. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , year =

[9] [9]

Elizalde, Benjamin and Deshmukh, Soham and Al Ismail, Mahmoud and Wang, Huaming , booktitle =

[10] [10]

2025 , eprint =

Ghosh, Sreyan and Kumar, Sonal and Evuru, Chandra Kiran Reddy and Nieto, Oriol and Duraiswami, Ramani and Manocha, Dinesh , booktitle =. 2025 , eprint =

2025

[11] [11]

2025 , eprint =

Anand, Nishit and Seth, Ashish and Duraiswami, Ramani and Manocha, Dinesh , booktitle =. 2025 , eprint =

2025

[12] [12]

2025 , eprint =

Hakim, Gustavo Adolfo Vargas and Osowiechi, David and Noori, Mehrdad and Cheraghalikhani, Milad and Bahri, Ali and Yazdanpanah, Moslem and Ben Ayed, Ismail and Desrosiers, Christian , booktitle =. 2025 , eprint =

2025

[13] [13]

Debiasing vision- language models via biased prompts.arXiv preprint arXiv:2302.00070, 2023

Debiasing vision-language models via biased prompts , author =. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =. 2302.00070 , archivePrefix =

work page arXiv

[14] [14]

PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection

Bi, Jinhe and Wang, Yifan and Yan, Danqi and Ma, Xiaowen and Hecker, Artur and Tresp, Volker and Ma, Yunpu , year =. 2502.12119 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

Selective vision-language subspace projection for few-shot

Zhu, Xingyu and Zhu, Beier and Tan, Yi and Wang, Shuo and Hao, Yanbin and Zhang, Hanwang , booktitle =. Selective vision-language subspace projection for few-shot. 2024 , eprint =

2024

[16] [16]

Alicia and Gao, Ye , year =

Shi, Jiacheng and Du, Hongfei and Hong, Y. Alicia and Gao, Ye , year =. 2509.25495 , archivePrefix =

work page arXiv

[17] [17]

IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) , year =

Just shift it: Test-time prototype shifting for zero-shot generalization with vision-language models , author =. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) , year =. 2403.12952 , archivePrefix =

work page arXiv

[18] [18]

Proceedings of Interspeech , year =

Domain adaptation method and modality gap impact in audio-text models for prototypical sound classification , author =. Proceedings of Interspeech , year =. 2506.04376 , archivePrefix =

work page arXiv

[19] [19]

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , year =

Leveraging prediction entropy for automatic prompt weighting in zero-shot audio-language classification , author =. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , year =. 2601.05011 , archivePrefix =

work page arXiv

[20] [20]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages =

Advancing test-time adaptation in wild acoustic test settings , author =. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages =. 2024 , eprint =

2024

[21] [21]

2024 , eprint =

Multiple consistency-guided test-time adaptation for contrastive audio--language models with unlabeled audio , author =. 2024 , eprint =

2024

[22] [22]

2025 , eprint =

Sun, Jingchen and Han, Shaobo and Kohno, Wataru and Chen, Changyou , booktitle =. 2025 , eprint =

2025

[23] [23]

2025 , eprint =

Niizumi, Daisuke and Takeuchi, Daiki and Yasuda, Masahiro and Nguyen, Binh Thien and Ohishi, Yasunori and Harada, Noboru , journal =. 2025 , eprint =

2025

[24] [24]

2025 , eprint =

Audio Flamingo 2: An audio--language model with long-audio understanding and expert reasoning abilities , author =. 2025 , eprint =

2025

[25] [25]

2506.11350 , archivePrefix =

Dinkel, Heinrich and Yan, Zhiyong and Wang, Tianzi and Wang, Yongqing and Sun, Xingwei and Niu, Yadong and Liu, Jizhong and Li, Gang and Zhang, Junbo and Wang, Yujun , year =. 2506.11350 , archivePrefix =

work page arXiv