FoeGlass: Simple In-Context Learning Is Enough for Red Teaming Audio Deepfake Detectors

Gaurav Bharaj; Jacob H Seidman; Sepehr Dehdashtian; Vishnu N Boddeti

arxiv: 2606.05101 · v1 · pith:Z5PEXJ66new · submitted 2026-06-03 · 💻 cs.SD · cs.LG

Pith reviewed 2026-06-28 04:58 UTC · model grok-4.3

keywords audio deepfake detection · red teaming · in-context learning · text-to-speech · black-box attack · adversarial examples · mode collapse · automated dataset generation

The pith

FoeGlass shows that LLM in-context learning alone can generate TTS inputs to expose audio deepfake detector failures in a black-box setting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

FoeGlass is an automated method that prompts an LLM to explore text inputs for a text-to-speech model, producing audio samples that fool a target audio deepfake detector. The approach relies only on black-box access and adds a context of diversity measurements to avoid generating repetitive samples. The resulting data raises false negative rates by up to 94 percent compared with unconditional sampling or existing spoof datasets. The generated attacks transfer to other detectors, and fine-tuning detectors on these samples improves their robustness by up to 41 percent. No manual data collection or supervision is required.
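
The loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `propose_text`, `synthesize`, and `detector_fake_prob` are hypothetical callables standing in for the LLM, the TTS model, and the target detector, and the 0.5 decision threshold is an assumption. The toy stand-ins exist only so the sketch runs without any model.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]


def red_team_loop(
    propose_text: Callable[[History], str],      # hypothetical LLM wrapper
    synthesize: Callable[[str], bytes],          # hypothetical black-box TTS call
    detector_fake_prob: Callable[[bytes], float],  # hypothetical detector score
    n_iters: int = 10,
    threshold: float = 0.5,                      # assumed decision threshold
) -> History:
    """Return (text, fake_probability) pairs for attempts the detector missed."""
    history: History = []
    successes: History = []
    for _ in range(n_iters):
        text = propose_text(history)        # the LLM sees only past attempts and scores
        audio = synthesize(text)            # only the output audio is observed
        p_fake = detector_fake_prob(audio)  # only the output score is observed
        history.append((text, p_fake))
        if p_fake < threshold:              # spoofed audio accepted as real
            successes.append((text, p_fake))
    return successes


# Toy stand-ins so the sketch runs end to end.
def toy_propose(history: History) -> str:
    return "sample " + "um " * len(history)


def toy_tts(text: str) -> bytes:
    return text.encode("utf-8")


def toy_detector(audio: bytes) -> float:
    # Pretend that longer clips look more real to the detector.
    return max(0.0, 1.0 - len(audio) / 40.0)


hits = red_team_loop(toy_propose, toy_tts, toy_detector, n_iters=8)
```

Nothing in the loop touches model weights or gradients, which is what the black-box framing amounts to in practice.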

Core claim

FoeGlass demonstrates that in-context learning in an LLM, when supplied with a context built from diversity measurements of prior generations, is sufficient to discover new failure modes of audio deepfake detectors by producing TTS inputs whose outputs evade the detector, yielding up to 94 percent higher false negative rates than baselines while remaining fully automated and black-box.

What carries the argument

LLM in-context learning guided by a prompt context that tracks diversity measurements across generated samples to explore the TTS input space.

If this is right

Data from FoeGlass raises false negative rates up to 94 percent over unconditional sampling and recent spoof datasets.
Attacks generated by FoeGlass transfer across different target ADD models.
Fine-tuning an ADD model on FoeGlass samples improves its robustness by up to 41 percent.
The entire process requires no manual supervision or labeled adversarial data.
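
For readers who want the metric pinned down, here is a small sketch of a false negative rate computed on spoofed clips only. The scores are made up, the 0.5 threshold is an assumption, and this page does not say whether the reported "up to 94 percent" is an absolute or a relative change, so the sketch reports the absolute gap only.

```python
from typing import Sequence


def false_negative_rate(fake_probs: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of spoofed clips scored below the threshold (accepted as real)."""
    if not fake_probs:
        raise ValueError("need at least one spoofed sample")
    misses = sum(1 for p in fake_probs if p < threshold)
    return misses / len(fake_probs)


# Made-up detector scores for spoofed clips from two sources.
baseline_scores = [0.9, 0.8, 0.4, 0.95, 0.7]   # e.g. unconditional sampling
attack_scores = [0.2, 0.6, 0.1, 0.3, 0.45]     # e.g. guided search

fnr_base = false_negative_rate(baseline_scores)
fnr_attack = false_negative_rate(attack_scores)
absolute_gain = fnr_attack - fnr_base
```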

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prompting strategy could be tested on detectors for other generated media such as video or images.
Iterative application of FoeGlass might allow detectors to be hardened in a closed loop without human curation.
If diversity measurements prove general, the method could extend red-teaming to other black-box generative systems beyond audio.

Load-bearing premise

A context built from diversity measurements is enough to stop the LLM from repeatedly producing similar text inputs instead of genuinely new failure modes.

What would settle it

Running FoeGlass on a held-out ADD model and checking whether the new samples raise its false negative rate by less than the improvement shown over unconditional sampling on the same model.
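
One way to run such a check with some allowance for sampling noise is a percentile bootstrap on the gap in false negative rates. The bootstrap is a generic technique chosen here for illustration, not something the paper is reported to use; the per-clip outcomes below are invented, and `bootstrap_fnr_gap` is a hypothetical helper name.

```python
import random
from typing import List, Tuple


def bootstrap_fnr_gap(
    attack_missed: List[bool],
    baseline_missed: List[bool],
    n_boot: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (point estimate, lower, upper) for the FNR gap with a 95% percentile interval."""
    rng = random.Random(seed)

    def rate(flags: List[bool]) -> float:
        return sum(flags) / len(flags)

    point = rate(attack_missed) - rate(baseline_missed)
    gaps = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping group sizes fixed.
        a = rng.choices(attack_missed, k=len(attack_missed))
        b = rng.choices(baseline_missed, k=len(baseline_missed))
        gaps.append(rate(a) - rate(b))
    gaps.sort()
    lower = gaps[int(0.025 * n_boot)]
    upper = gaps[int(0.975 * n_boot) - 1]
    return point, lower, upper


# Invented outcomes on a held-out detector: True means a spoofed clip was accepted as real.
attack_flags = [True] * 40 + [False] * 10
baseline_flags = [True] * 5 + [False] * 45
point, lower, upper = bootstrap_fnr_gap(attack_flags, baseline_flags)
```

If the interval on the held-out detector sits well below the gap seen on the original target, that would count against the transfer claim; an interval that excludes zero only shows that some gap survives.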

Figures

Figures reproduced from arXiv: 2606.05101 by Gaurav Bharaj, Jacob H Seidman, Sepehr Dehdashtian, Vishnu N Boddeti.

**Figure 1.** Red Teaming ADDs by FoeGlass: (a) FoeGlass searches the input space of a TTS model to find the false-negative samples of a target ADD. (b) FoeGlass samples from blind spots that are not explored by the ASVspoof5 (Wang et al., 2024) dataset and baseline unconditioned sampling. (c) Using FoeGlass results in a significantly higher attack success rate than the unconditioned sampling baseline.

**Figure 2.** Overview of FoeGlass: In each iteration, FoeGlass calculates two feedback signals based on the realness and diversity of the newly generated sample and embeds them into the structure of the context. The context consists of 1) an instruction prompt, 2) ℓ/2 successful samples with their corresponding scores and CoT that led to these attempts, and 3) ℓ/2 samples from the latest attempts.

**Figure 3.** Attack Transferability of FoeGlass: Evaluation of 8 ADDs (target models) on the attack samples designed for other ADDs (source models) using three T2I models.

**Figure 4.** PCA Visualization of (a) FoeGlass attacks using xTTSv2 for the VIT MFCC model trained on ASVSpoof5, and (b) explored regions by three sampling methods using xTTS-v2 as the generative model.

**Figure 5.** The effect of diversity feedback in exploring various failure modes.

**Figure 6.** Joint PCA plot of successful attacks generated by FoeGlass and no-diversity-feedback model with xTTS for each model. Top row and bottom row correspond to models trained on ASVSpoof5 and VoxCelebSpoof, respectively. Columns from left to right correspond to the AST, VIT-ConstantQ, VIT-MelSpectrogram, and VIT-MFCC architectures.

**Figure 7.** Attack Transferability of FoeGlass in warm start setting: Evaluation of 8 ADDs (target models) on the attack samples designed for other ADDs (source models) using three T2I models.

**Figure 8.** KDE plot of the generated attack scores.
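
The context layout in the Figure 2 caption (an instruction prompt, ℓ/2 successful samples with scores and CoT, and ℓ/2 of the latest attempts) can be sketched as below. The field names, the text layout, and the rule for ranking successes (realness plus diversity) are assumptions made for illustration; the caption does not specify them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Attempt:
    text: str
    realness: float   # assumed: higher means the detector found the clip more real
    diversity: float  # assumed: higher means further from earlier samples
    reasoning: str    # chain-of-thought that led to this attempt
    success: bool


def build_context(instruction: str, attempts: List[Attempt], ell: int) -> str:
    """Assemble instruction + ell/2 successes + ell/2 latest attempts into one prompt."""
    half = ell // 2
    # Assumed ranking rule: prefer successes with high realness and high diversity.
    successes = sorted(
        (a for a in attempts if a.success),
        key=lambda a: a.realness + a.diversity,
        reverse=True,
    )[:half]
    latest = attempts[-half:] if half else []
    lines = [instruction, "", "Successful samples:"]
    for a in successes:
        lines.append(
            f"- text: {a.text} | realness={a.realness:.2f} "
            f"| diversity={a.diversity:.2f} | reasoning: {a.reasoning}"
        )
    lines += ["", "Latest attempts:"]
    for a in latest:
        status = "Success" if a.success else "Failure"
        lines.append(f"- text: {a.text} | {status}")
    return "\n".join(lines)


history = [
    Attempt("a", 0.9, 0.1, "r1", True),
    Attempt("b", 0.2, 0.5, "r2", False),
    Attempt("c", 0.7, 0.8, "r3", True),
    Attempt("d", 0.6, 0.95, "r4", True),
]
prompt = build_context("You are a red-teaming assistant.", history, ell=4)
```

Keeping the latest attempts separate from the best successes lets the LLM see both what worked and what it has just tried, which is the apparent purpose of the two halves.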

read the original abstract

Audio deepfake detection (ADD) models are critical for countering the malicious use of text-to-speech (TTS) models. Evaluating and strengthening ADD models requires developing datasets that span the space of generated audio and highlight high-error regions. Existing dataset development strategies face two challenges: (i) manual collection, and (ii) inefficient discovery of blind spots in the ADD models. To address these challenges, we propose FoeGlass, the first black-box automated red-teaming method for ADDs, which effectively discovers ADD failure modes in the space of generated audio underexplored by state-of-the-art deepfake benchmarks. FoeGlass uses the in-context learning capabilities of an LLM to explore the input space of a TTS model, generating audio samples that fool the target ADD using only black-box access to all components. By using a carefully designed context based on diversity measurements, FoeGlass mitigates the common problem of mode collapse in automated red-teaming systems. Empirical evaluations on several open-source ADD and TTS models demonstrate that data generated from FoeGlass substantially improves the false negative rates over unconditional sampling baselines and recent spoofing datasets by up to 94%, while requiring no manual supervision. Furthermore, we show that the attacks generated by FoeGlass are transferable across different target ADDs, demonstrating its broad applicability and ease of use for the automated red teaming of ADD systems. Finally, fine-tuning ADD models on FoeGlass-generated samples notably enhances the robustness of the detectors (up 41%).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FoeGlass shows a workable LLM-based route to surface blind spots in audio deepfake detectors with reported large gains, but the diversity mechanism and experimental controls need verification.

The main point for you is that this paper introduces FoeGlass, an automated black-box method that uses an LLM's in-context learning plus diversity measurements to generate TTS inputs that expose failure modes in audio deepfake detectors. It reports false negative rates up to 94% higher than those obtained with unconditional sampling or existing spoofing datasets, plus robustness gains of up to 41% after fine-tuning, and shows transfer across detectors.

What stands out as new is the combination of LLM in-context learning with a diversity-based prompt to explore the TTS input space without manual supervision or white-box access. The work demonstrates this on several open-source ADD and TTS models and includes transferability results, which is useful for the subfield.

The empirical numbers are the strongest part if they hold. The approach avoids manual dataset collection, which is a real pain point, and the transfer and fine-tuning outcomes suggest practical value.

The soft spots are in the details that are missing from the abstract and not addressed in the stress-test note. There is no mention of ablations removing the diversity term, no quantitative checks like pairwise distances or embedding variance to confirm the samples are actually more diverse than baselines, and no information on statistical significance, exact model counts, or controls for post-hoc tuning. If the diversity prompt does not reliably prevent mode collapse, the gains could come from any guided search rather than the claimed advance. The paper would be stronger with those checks.
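
The two quantitative checks named above, pairwise distances and embedding variance, are simple to compute once each generated sample has an embedding. The sketch below uses Euclidean distance and made-up 2-D vectors; which embedding model and which distance the authors would use is not stated on this page.

```python
import math
from itertools import combinations
from typing import Sequence

Vector = Sequence[float]


def mean_pairwise_distance(embs: Sequence[Vector]) -> float:
    """Average Euclidean distance over all unordered pairs of embeddings."""
    pairs = list(combinations(embs, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def total_variance(embs: Sequence[Vector]) -> float:
    """Sum over dimensions of the population variance (trace of the covariance)."""
    n = len(embs)
    dims = len(embs[0])
    means = [sum(e[d] for e in embs) / n for d in range(dims)]
    return sum((e[d] - means[d]) ** 2 for e in embs for d in range(dims)) / n


# Made-up embeddings: one collapsed set and one spread-out set.
collapsed = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
spread = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]

collapsed_mpd = mean_pairwise_distance(collapsed)
spread_mpd = mean_pairwise_distance(spread)
spread_var = total_variance(spread)
```

A mode-collapsed generator shows up as values near zero on both measures, so reporting them for FoeGlass and for the no-diversity ablation would address the concern directly.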

This is for researchers working on audio forensics, adversarial ML, or LLM applications to security. A reader focused on practical red-teaming tools would get value from the method and numbers, even if they want to replicate the diversity part themselves.

It deserves a serious referee. The core idea is coherent and the results are presented as new empirical improvements rather than circular claims. Send it to review so the experimental gaps can be filled or clarified.

Referee Report

3 major / 2 minor

Summary. The paper introduces FoeGlass, a black-box automated red-teaming method for audio deepfake detectors (ADDs) that uses LLM in-context learning guided by diversity measurements to explore TTS input spaces and generate adversarial audio samples. It claims these samples raise false negative rates by up to 94% over unconditional sampling and recent spoofing datasets, are transferable across ADDs, and yield up to 41% robustness gains when used for fine-tuning, all without manual supervision.

Significance. If the diversity-guided generation reliably discovers novel failure modes rather than collapsing to similar samples, the approach would offer a practical, low-supervision advance for systematically strengthening ADD robustness and expanding evaluation datasets beyond current benchmarks.

major comments (3)

[§3] §3 (FoeGlass method): The central claim that a diversity-based in-context prompt mitigates mode collapse and enables discovery of genuinely new failure modes lacks any quantitative check such as embedding variance, pairwise acoustic distances, or an ablation removing the diversity term; without this, the 94% FN gains could be explained by targeted search rather than the claimed automated advance.
[§4] §4 (Experiments): The reported FN improvements (up to 94%) and robustness gains (up to 41%) are presented without details on the number of ADD/TTS models evaluated, number of independent runs, statistical significance tests, or confirmation that the diversity metric was not tuned post-hoc on the test set.
[§4.3] §4.3 (Transferability): The transferability results across target ADDs are load-bearing for the broad-applicability claim, yet no controls are described for whether the generated samples were optimized against one model and then evaluated on others or whether the diversity context was held fixed.
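
The control asked for in the transferability comment can be made concrete with a source-by-target matrix: samples are generated against one detector and then scored, unchanged, by every detector. The sketch below is an assumed protocol with toy detectors and clip identifiers, not the paper's evaluation code.

```python
from typing import Callable, Dict, List

Detector = Callable[[str], float]  # hypothetical: maps a clip id to a fake-probability


def transfer_matrix(
    attack_sets: Dict[str, List[str]],
    detectors: Dict[str, Detector],
    threshold: float = 0.5,  # assumed decision threshold
) -> Dict[str, Dict[str, float]]:
    """matrix[source][target] = fraction of source-optimized clips that target misses."""
    matrix: Dict[str, Dict[str, float]] = {}
    for source, clips in attack_sets.items():
        row: Dict[str, float] = {}
        for target, score in detectors.items():
            missed = sum(1 for clip in clips if score(clip) < threshold)
            row[target] = missed / len(clips)
        matrix[source] = row
    return matrix


# Toy detectors: A is fooled by ids starting with "a", B by ids ending with "x".
detectors = {
    "A": lambda clip: 0.1 if clip.startswith("a") else 0.9,
    "B": lambda clip: 0.1 if clip.endswith("x") else 0.9,
}
# Toy attack sets, each found against the detector named by its key.
attack_sets = {
    "A": ["a1x", "a2", "a3x", "a4"],
    "B": ["b1x", "a5x"],
}
matrix = transfer_matrix(attack_sets, detectors)
```

The diagonal entries measure success against the model the samples were optimized for; only the off-diagonal entries say anything about transfer, and they are meaningful only if the clips are not regenerated per target.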

minor comments (2)

[Abstract] Abstract: 'up 41%' should read 'up to 41%'.
[§3] Notation for diversity measurement is introduced without an explicit equation or pseudocode, making reproduction harder.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested clarifications and additional analyses.

Referee: [§3] §3 (FoeGlass method): The central claim that a diversity-based in-context prompt mitigates mode collapse and enables discovery of genuinely new failure modes lacks any quantitative check such as embedding variance, pairwise acoustic distances, or an ablation removing the diversity term; without this, the 94% FN gains could be explained by targeted search rather than the claimed automated advance.

Authors: We agree that the manuscript would benefit from explicit quantitative validation of the diversity term's contribution. In the revision we will add an ablation that removes the diversity component, together with measurements of embedding variance and pairwise acoustic distances among generated samples, to demonstrate that the observed gains arise from increased exploration rather than targeted search alone.

revision: yes

Referee: [§4] §4 (Experiments): The reported FN improvements (up to 94%) and robustness gains (up to 41%) are presented without details on the number of ADD/TTS models evaluated, number of independent runs, statistical significance tests, or confirmation that the diversity metric was not tuned post-hoc on the test set.

Authors: We will expand §4 to report the precise number of ADD and TTS models, the number of independent runs, the results of statistical significance tests, and an explicit statement that the diversity metric was fixed in advance and not tuned on the test set.

revision: yes

Referee: [§4.3] §4.3 (Transferability): The transferability results across target ADDs are load-bearing for the broad-applicability claim, yet no controls are described for whether the generated samples were optimized against one model and then evaluated on others or whether the diversity context was held fixed.

Authors: The diversity context was held fixed across all evaluations and samples were generated without model-specific optimization beyond black-box access. We will add a clear description of these controls in the revised §4.3.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical gains measured on held-out models

The paper presents a black-box method that uses LLM in-context learning plus a diversity prompt to generate TTS inputs, then reports measured false-negative improvements (up to 94%) and transfer/fine-tuning gains on separate ADD and TTS models. These performance numbers are external to any internal parameters or definitions inside FoeGlass; no equations, fitted quantities renamed as predictions, or self-citation chains appear in the derivation. The central claim therefore remains independent of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The method rests on the untested premise that LLM in-context learning with a diversity signal can systematically explore TTS input space without external supervision or model internals; no free parameters, axioms, or invented entities are described in the abstract.

pith-pipeline@v0.9.1-grok · 5826 in / 1231 out tokens · 24772 ms · 2026-06-28T04:58:00.668468+00:00 · methodology
