pith. sign in

arxiv: 2606.10246 · v1 · pith:Y644RMFMnew · submitted 2026-06-08 · 💻 cs.SD · cs.AI· cs.LG

Linguistically Augmented Audio Speech Data (LinguAS)

Pith reviewed 2026-06-27 14:41 UTC · model grok-4.3

classification 💻 cs.SD cs.AIcs.LG
keywords deepfake detectionaudio spoofinglinguistic featuresdataset augmentationspeech synthesisASVspoofmodel training
0
0 comments X

The pith

Augmenting audio deepfake datasets with five linguistic features improves detection model performance beyond ASVspoof 2021 baselines and models like HuBert and XLSR.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LinguAS, a dataset of over 800 genuine and deepfaked speech samples annotated with five expert-defined linguistic features that mark natural spoken English. Models trained on this augmented data achieve significantly higher performance than standard deep learning baselines from ASVspoof 2021 and self-supervised models such as HuBert and XLSR. A sympathetic reader would care because existing detectors rely mainly on short frame-level audio features and miss larger-scale linguistic patterns that distinguish authentic human speech from fakes. The dataset supplies a balanced mix of four spoof attack types plus metadata on speaker gender and audio generators to support finer training.

Core claim

LinguAS augments genuine and spoofed audio samples with five Expert-Defined Linguistic Features (EDLFs) that occur frequently in spoken English and characterize natural human speech. When detection models are trained on data that includes these EDLF annotations, performance rises substantially above the ASVspoof 2021 deep learning baselines and SSL models such as HuBert and XLSR. The dataset maintains balance across four spoofed attack types and genuine samples while adding metadata on speaker gender and generator sources.

What carries the argument

The LinguAS dataset, which annotates each audio sample with five EDLFs to supply larger-timescale linguistic cues for training deepfake detectors.

If this is right

  • Detection models gain accuracy by using linguistic cues at longer timescales in addition to frame-level audio features.
  • The balanced collection of four attack types with gender and generator metadata supports more detailed evaluation of model behavior.
  • Emphasizing traits of real human language improves the ability of models to flag malicious deepfakes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Automating annotation of the five features could remove the need for manual labeling and allow much larger datasets.
  • The same linguistic augmentation strategy might transfer to deepfake detection in video or text domains.
  • Hybrid systems that combine EDLFs with existing self-supervised audio models could yield further robustness gains.

Load-bearing premise

The five linguistic features are reliably annotatable by experts and are the direct cause of the observed performance gains rather than other dataset properties such as sample balance or metadata.

What would settle it

Retrain the same detection models on the identical audio samples but without the EDLF annotations and measure whether accuracy falls back to baseline levels.

Figures

Figures reproduced from arXiv: 2606.10246 by Ashley R. Keaton, Christine Mallinson, Vandana P. Janeja, Zahra Khanjani.

Figure 3
Figure 3. Figure 3: EDLF Ensembles [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: SSL models compared to ELDF-LR also improve performance of SOTA deep learning models ensem￾bled with the EDLF LR. We chose the deep learning model VGGish and two self-supervised learning models, Hubert and XSLR. We found that EDLF-LR had greater accuracy and specificity than each of the models we tested [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: ROC-AUC comparison for SSL based models, [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Dropout Analysis and augmented the SOTA models, including pre-trained Hu￾BERT and XLSR representations input to MLP, as well as a CNN-based pre-trained model (VGGish-MLP). Even con￾sidering the small size of the training data, LinguAS still improved all of the models. • LinguAS training data have flexibility from “deep data” rather than “large data.” The narrow training data led to SOTA’s lower performance… view at source ↗
read the original abstract

Maliciously-created fake speech, including deepfaked and spoofed audio, is proliferating at an alarming rate, and detection models are racing to stay ahead of the curve. Yet, most detection models are trained to make inference on frame-level audio features alone without leveraging valuable linguistic cues at larger timescales. To address this gap, we present Linguistically Augmented Audio Speech Data (LinguAS), a dataset of genuine and deepfaked audio samples annotated with five strategically-chosen, Expert-Defined Linguistic Features (EDLFs) that occur frequently in spoken English and are characteristic of natural human speech. LinguAS contains over 800 audio samples, each of which are annotated with EDLFs. The dataset has a balanced number of four spoofed audio attack types and a proportionate number of genuine speech samples. We also include metadata on speaker gender and the generator/source for each spoofed audio sample, offering more granularity for model training. We found that models trained on data augmented with EDLFs had improved model performance significantly beyond the ASVspoof 2021 deep learning baselines and SSL models like HuBert and XLSR. LinguAS's augmented linguistic, gender, and generator metadata provide audio deepfake researchers with a dataset that emphasizes real human language traits to improve model inference of faked speech. Data and code are publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces the LinguAS dataset of over 800 genuine and deepfaked audio samples annotated with five Expert-Defined Linguistic Features (EDLFs) characteristic of natural spoken English. The dataset is balanced across four spoof attack types and supplies metadata on speaker gender and spoof generator. The central claim is that models trained on data augmented with these EDLFs achieve significantly improved performance over ASVspoof 2021 deep-learning baselines and SSL models such as HuBERT and XLSR.

Significance. If the performance gains can be rigorously demonstrated and attributed to the linguistic annotations, the dataset would supply a useful resource for incorporating longer-timescale linguistic cues into audio deepfake detection. The balanced attack distribution and auxiliary metadata are constructive design choices. The current absence of experimental evidence, however, prevents any assessment of whether these features deliver the claimed benefit.

major comments (2)
  1. [Abstract] Abstract: the assertion that 'models trained on data augmented with EDLFs had improved model performance significantly beyond the ASVspoof 2021 deep learning baselines and SSL models like HuBert and XLSR' is unsupported by any description of experimental protocol, model architectures, evaluation metrics, statistical tests, or ablation results.
  2. No ablation or control experiment is described that holds the balanced attack types and gender/generator metadata fixed while removing or randomizing only the EDLF annotations; therefore the performance delta cannot be attributed specifically to the linguistic features rather than the balancing or auxiliary metadata.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the current version of the manuscript does not provide sufficient experimental details to support the performance claims, and we will revise accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion that 'models trained on data augmented with EDLFs had improved model performance significantly beyond the ASVspoof 2021 deep learning baselines and SSL models like HuBert and XLSR' is unsupported by any description of experimental protocol, model architectures, evaluation metrics, statistical tests, or ablation results.

    Authors: We acknowledge that the abstract asserts performance improvements without accompanying experimental details in the manuscript. In the revision we will add a dedicated Experiments section that specifies the model architectures, training and evaluation protocols, metrics (such as EER), statistical significance tests, and complete results tables comparing LinguAS-augmented models against the cited ASVspoof 2021 and SSL baselines. revision: yes

  2. Referee: [—] No ablation or control experiment is described that holds the balanced attack types and gender/generator metadata fixed while removing or randomizing only the EDLF annotations; therefore the performance delta cannot be attributed specifically to the linguistic features rather than the balancing or auxiliary metadata.

    Authors: We agree that an ablation isolating the contribution of the EDLFs is required to attribute gains specifically to the linguistic annotations rather than to dataset balancing or other metadata. The revised manuscript will include such controlled experiments, keeping attack-type balance and gender/generator metadata fixed while comparing models trained with versus without (or with randomized) EDLF annotations. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset contribution with external baselines

full rationale

The paper introduces LinguAS as an annotated dataset and reports empirical performance gains on external benchmarks (ASVspoof 2021, HuBERT, XLSR). No equations, fitted parameters, derivations, or first-principles results are present. No self-citations are used to justify load-bearing uniqueness claims or ansatzes. The central performance claim rests on comparisons to independent external models rather than any reduction to the paper's own inputs or annotations by construction. This is a standard empirical dataset paper with no detectable circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the unverified premise that the chosen linguistic features are characteristic of natural speech and that their addition drives the performance lift; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption EDLFs occur frequently in spoken English and are characteristic of natural human speech
    Abstract states this as the justification for selecting and annotating the five features.
invented entities (1)
  • Expert-Defined Linguistic Features (EDLFs) no independent evidence
    purpose: Annotations added to audio samples to capture natural language traits for detection
    Five specific features are introduced as new annotation targets in this dataset.

pith-pipeline@v0.9.1-grok · 5779 in / 1148 out tokens · 22964 ms · 2026-06-27T14:41:28.232879+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

42 extracted references · 13 canonical work pages · 1 internal anchor

  1. [1]

    Akhtar, T

    Z. Akhtar, T. L. Pendyala, and V. S. Athmakuri. 2024. Video and Audio Deepfake Datasets and Open Issues in Deepfake Technology: Being Ahead of the Curve. Forensic Sciences4, 3 (2024), 289–377. doi:10.3390/forensicsci4030021

  2. [2]

    Almutairi and H

    Z. Almutairi and H. Elgibreen. 2022. A review of modern audio deepfake detection methods: Challenges and future directions.Algorithms15, 5 (2022), 155

  3. [3]

    L. Blue, K. Warren, H. Abdullah, C. Gibson, L. Vargas, J. O’Dell, K. Butler, and P. Traynor. 2022. Who are you (i really wanna know)? detecting audio DeepFakes through vocal tract reconstruction. In31st USENIX Security Symposium (USENIX Security 22). 2691–2708

  4. [4]

    Chodroff and C

    E. Chodroff and C. Wilson. 2014. Burst spectrum as a cue for the stop voicing contrast in American English.The Journal of the Acoustical Society of America 136, 5 (2014), 2762–2772. doi:10.1121/1.4896470

  5. [5]

    T. P. Doan, L. Nguyen-Vu, S. Jung, and K. Hong. 2023. Bts-e: Audio deepfake detection using breathing-talking-silence encoder. InICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 1–5

  6. [6]

    S. Duanmu. 2007.The phonology of standard Chinese. Oxford University Press

  7. [7]

    J. Hong, Y. Chung, S. Oh, J. Kim, J. Lee, S. Kim, and H. Cho. 2025. TwinShift: Benchmarking Audio Deepfake Detection across Synthesizer and Speaker Shifts. arXiv preprint arXiv:2510.23096(2025). arXiv:2510.23096 [cs.SD]

  8. [8]

    Huang and C.-M

    L. Huang and C.-M. Pun. 2019. Audio replay spoof attack detection using segment- based hybrid feature and DenseNetLSTM network. InICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2567–2571

  9. [9]

    Iskarous and A

    K. Iskarous and A. Vietti. 2025. Phonetic information in the vowel spectrum: the meaning of mel-Frequency Cepstral Coefficients.Journal of Phonetics112 (2025), 101434

  10. [10]

    Ito and L

    K. Ito and L. Johnson. 2017. The LJ speech dataset. (2017)

  11. [11]

    S. A. Jun. 2005.Prosodic typology: the phonology of intonation and phrasing. Oxford University Press

  12. [12]

    J. E. Kallay, U. Mayr, and M. A. Redford. 2019. Characterizing the coordination of speech production and breathing. InProceedings of the... International Congress of Phonetic Sciences, Vol. 2019. 1412

  13. [13]

    Keaton, Zahra Khanjani, Christine Mallinson, and Vandana P

    Ashley R. Keaton, Zahra Khanjani, Christine Mallinson, and Vandana P. Janeja. 2024. Linguistically Augmented Audio Speech Data (LinguAS) Dataset. https://figshare.com/projects/_b_Linguistically_Augmented_Audio_ Speech_Data_LinguAS_b_/206566. (2024). Data available at Figshare

  14. [14]

    Keaton, Zahra Khanjani, Christine Mallinson, and Vandana P

    Ashley R. Keaton, Zahra Khanjani, Christine Mallinson, and Vandana P. Janeja

  15. [15]

    https://doi.org/10.6084/m9.figshare.25909297

    Linguistically Augmented Audio Speech Data (LinguAS), Direct link for Data frame and description. https://doi.org/10.6084/m9.figshare.25909297

  16. [16]

    Khanjani, T

    Z. Khanjani, T. Ale, J. Wang, L. Davis, C. Mallinson, and V. Janeja. 2024. In- vestigating Causal Cues: Strengthening Spoofed Audio Detection with Human- Discernible Linguistic Features. (2024). arXiv:2409.06033 doi:arXiv:2409.06033

  17. [17]

    Khanjani, L

    Z. Khanjani, L. Davis, A. Tuz, K. Nwosu, C. Mallinson, and V. P. Janeja. 2023. Learning to Listen and Listening to Learn: Spoofed Audio Detection Through Lin- guistic Data Augmentation. In2023 IEEE International Conference on Intelligence and Security Informatics (ISI). Charlotte, NC, USA, 01–06

  18. [18]

    Keaton, Christine Mallinson, and Vandana P

    Zahra Khanjani, Ashley R. Keaton, Christine Mallinson, and Vandana P. Janeja

  19. [19]

    https://github.com/MultiDataLab/LinguAS

    LinguAS Code Repository. https://github.com/MultiDataLab/LinguAS. (2024). Source code available at GitHub

  20. [20]

    Khanjani, C

    Z. Khanjani, C. Mallinson, J. Foulds, and V. P. Janeja. 2024. ALDAS: Audio- Linguistic Data Augmentation for Spoofed Audio Detection.arXiv preprint arXiv:2410.15577(2024). arXiv:2410.15577 [cs.SD]

  21. [21]

    Khochare, C

    J. Khochare, C. Joshi, B. Yenarkar, S. Suratkar, and F. Kazi. 2021. A deep learn- ing framework for audio deepfake detection.Arabian Journal for Science and Engineering(2021), 1–12

  22. [22]

    Kim, S.-W

    K.-W. Kim, S.-W. Park, J. Lee, and M.-C. Joe. 2022. Assem-vc: Realistic voice conversion by assembling modern speech synthesis techniques. InICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 6997–7001

  23. [23]

    Kinnunen, M

    T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, and K. A. Lee. 2017. The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection. InInterspeech 2017. ISCA, 26

  24. [24]

    Koenecke, A

    A. Koenecke, A. Nam, E. Lake, J. Nudell, M. Quartey, Z. Mengesha, others, and S. Goel. 2020. Racial disparities in automated speech recognition.Proceedings of the national academy of sciences117, 14 (2020), 7684–7689

  25. [25]

    Kumar, R

    K. Kumar, R. Kumar, T. de Boissière, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. C. Courville. 2019. Melgan: Generative adversarial networks for conditional waveform synthesis. InAdvances in neural information processing systems 32

  26. [26]

    C.-I. Lai, A. Abad, K. Richmond, J. Yamagishi, N. Dehak, and S. King. 2019. Atten- tive filtering networks for audio replay attack detection. InICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 6316–6320

  27. [27]

    J. M. Martín-Doñas, A. Álvarez, E. Rosello, A. M. Gomez, and A. M. Peinado. 2024. Exploring Self-supervised Embeddings and Synthetic Data Augmentation for Robust Audio Deepfake Detection. InInterspeech 2024. 2085–2089. doi:10.21437/ Interspeech.2024-942

  28. [28]

    Mostaani and M.M

    Z. Mostaani and M.M. Doss. 2022. On Breathing Pattern Information in Synthetic Speech. InProc. Interspeech 2022. 2768–2772. doi:10.21437/Interspeech.2022-10271

  29. [29]

    Nicolas M Müller, Pavel Czempin, Franziska Dieckmann, Adam Froghyar, and Konstantin Böttinger. 2022. Does audio deepfake detection generalize?arXiv preprint arXiv:2203.16263(2022)

  30. [30]

    S. W. Park, D. Y. Kim, and M. C. Joe. 2020. Cotatron: Transcription-guided speech encoder for any-to-many voice conversion without parallel data.arXiv preprint arXiv:2005.03295(2020). arXiv:2005.03295 [eess.AS]

  31. [31]

    Pindrop. 2025. Pindrop’s 2025 Voice Intelligence and Security Report Re- veals 1,300% Surge in Deepfake Fraud.PR Newswire(12 June 2025). https: //www.prnewswire.com/news-releases/pindrops-2025-voice-intelligence-- security-report-reveals-1-300-surge-in-deepfake-fraud-302479482.html

  32. [32]

    Reimao and V

    R. Reimao and V. Tzerpos. 2019. FoR: A dataset for synthetic speech detection. In2019 International Conference on Speech Technology and Human-Computer Dialogue (SpeD). 1–10

  33. [33]

    Rochet-Capellan and S

    A. Rochet-Capellan and S. Fuchs. 2014. Take a breath and take the turn: how breathing meets turns in spontaneous dialogue.Philosophical Transactions of the Royal Society B: Biological Sciences369, 1658 (2014). Pages: 20130399

  34. [34]

    Sahidullah, T

    M. Sahidullah, T. Kinnunen, and C. HanilÇi. [n. d.]. A comparison of features for synthetic speech detection. Year and conference details are missing from the reference text

  35. [35]

    Eliza Strickland. 2022. Andrew Ng, AI Minimalist: The Machine-Learning Pioneer Says Small Is the New Big.IEEE Spectrum59, 4 (2022), 22–50

  36. [36]

    Valle, J

    R. Valle, J. Li, R. Prenger, and B. Catanzaro. 2020. Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens. InICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6189–6193

  37. [37]

    WaveNet: A Generative Model for Raw Audio

    A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, others, and K. Kavukcuoglu. 2016. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.0349912, 1 (2016). arXiv:1609.03499 [cs.SD]

  38. [38]

    Wang and J

    X. Wang and J. Yamagishi. 2021. A comparative study on recent neural spoofing countermeasures for synthetic speech detection.arXiv preprint arXiv:2103.11326 (2021). arXiv:2103.11326 [cs.SD]

  39. [39]

    J. R. Westbury and P. A. Keating. 1986. On the Naturalness of Stop Consonant Voic- ing.Journal of Linguistics22, 1 (1986), 145–166. doi:10.1017/s0022226700010598

  40. [40]

    Z. Wu, J. Yamagishi, T. Kinnunen, C. HanilÇi, M. Sahidullah, A. Sizov, N. Evans, M. Todisco, and H. Delgado. 2017. Asvspoof: the automatic speaker verification spoofing and countermeasures challenge.IEEE Journal of Selected Topics in Signal Processing11, 4 (2017), 588–604. Linguistically Augmented Audio Speech Data (LinguAS) Conference’17, July 2017, Wash...

  41. [41]

    ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection,

    J. Yamagishi, X. Wang, M. Todisco, M. Sahidullah, J. Patino, A. Nautsch, X. Liu, K. A. Lee, T. Kinnunen, N. Evans, et al. 2021. Asvspoof 2021: accelerating progress in spoofed and deepfake speech detection.arXiv preprint arXiv:2109.00537(2021). arXiv:2109.00537 [cs.SD]

  42. [42]

    Zhang, Z

    K. Zhang, Z. Hua, R. Lan, Y. Zhang, and Y. Guo. 2025. Phoneme-Level Feature Discrepancies: A Key to Detecting Sophisticated Speech Deepfakes.Proceedings of the AAAI Conference on Artificial Intelligence39, 1 (2025), 1066–1074. doi:10. 1609/aaai.v39i1.32093 Conference’17, July 2017, Washington, DC, USA Keaton, Khanjani et al. Appendix A.1 Dataset Metadata ...