Layer-wise Probing of wav2vec 2.0 and Whisper for Consonant Cluster Reduction in African American English

Hamid Mojarad; Kevin Tang

arxiv: 2606.23948 · v1 · pith:3ZBX7OSTnew · submitted 2026-06-22 · 💻 cs.CL

Layer-wise Probing of wav2vec 2.0 and Whisper for Consonant Cluster Reduction in African American English

Hamid Mojarad , Kevin Tang This is my paper

Pith reviewed 2026-06-26 07:57 UTC · model grok-4.3

classification 💻 cs.CL

keywords consonant cluster reductionAfrican American Englishwav2vec2Whisperlayer-wise probingphonological variationspeech recognitionAAE

0 comments

The pith

Both wav2vec2 and Whisper encode AAE consonant cluster reduction as structured gradient variation with retained cues to underlying stops rather than deletion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how self-supervised and supervised speech models represent consonant cluster reduction, a phonological process in African American English that contributes to ASR disparities. Through layer-wise probing on wav2vec2-base and Whisper-small, the authors test the models' ability to detect segmental reduction and restore underlying cluster identities. The results show high accuracy in distinguishing reduced and canonical forms. Importantly, the models retain information about the underlying stops in reduced segments. This suggests that CCR is encoded as gradient variation rather than outright deletion, demonstrating that modern speech models capture structured phonological patterns in AAE.

Core claim

Both models distinguish reduced and canonical forms with high accuracy. Crucially, reduced segments retain cues to their underlying stops, indicating that CCR is encoded as structured gradient phonological variation rather than simple segmental deletion. These results demonstrate structured phonological encoding of AAE CCR patterns in modern speech models.

What carries the argument

Layer-wise probing classifiers on wav2vec2-base and Whisper-small representations for segmental reduction detection and segmental restoration of underlying cluster identity.

If this is right

Both self-supervised and supervised models capture CCR patterns in AAE with high accuracy.
Reduced segments are represented with partial cues to the full underlying cluster rather than complete deletion.
Phonological variation is encoded in a structured, gradient manner across model layers.
This structured encoding holds for both wav2vec2-base and Whisper-small.
The findings apply directly to understanding sources of ASR disparity for AAE speakers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

ASR systems could improve by treating reductions as recoverable gradient forms instead of errors to be corrected.
The same probing approach could test whether other phonological processes in AAE or other dialects are encoded similarly.
Retained cues suggest the models might support reconstruction of canonical forms from reduced input for better transcription.
Findings indicate that internal representations preserve enough dialect-specific structure to support linguistic analysis of variation beyond surface acoustics.

Load-bearing premise

The probing classifiers accurately reflect the linguistic information encoded in the model's internal representations rather than artifacts of the probe training or the specific dataset of AAE speech samples used.

What would settle it

Re-running the same layer-wise probing tasks on a new speaker-independent AAE dataset with different probe architectures that yields no evidence of retained stop cues in reduced segments would falsify the encoding claim.

Figures

Figures reproduced from arXiv: 2606.23948 by Hamid Mojarad, Kevin Tang.

**Figure 1.** Figure 1: Segmental reduction detection - imbalanced vs. balanced for all cluster types (4-fold CV) 1 2 3 4 5 6 7 8 9 10 11 12 Transformer Layers 25% 50% 75% 100% C1 Portions 64.9 ±1.0 66.9 ±0.4 67.0 ±1.1 68.3 ±0.9 70.7 ±0.8 70.2 ±0.4 70.3 ±1.7 70.8 ±0.7 69.0 ±1.0 67.9 ±1.2 63.5 ±1.2 62.3 ±0.4 69.5 ±0.6 71.1 ±1.2 70.7 ±0.7 71.9 ±1.0 74.8 ±0.9 73.2 ±0.9 74.1 ±1.1 74.4 ±1.1 73.1 ±0.7 71.8 ±0.6 67.5 ±0.7 65.5 ±1.9 72.3… view at source ↗

**Figure 2.** Figure 2: Coarticulatory encoding - C1 portion accuracy per layer across both models (4-fold CV) 10. Our results generally follow this pattern, although in our plot the accuracy peaks more strongly at layer 5 than at layer 9, whereas they report a higher peak at layer 9 than at layer 5. Additionally, the decline after layer 10 in our data is more moderate than what they observed. In contrast, prior analyses of Whisp… view at source ↗

**Figure 3.** Figure 3: Per-cluster analysis: test accuracy per cluster type (mean (%) ± SD, 4-fold CV) and /nt/) wav2vec 2.0 again exhibits multiple rises in classification accuracy: an early increase in layers 3-5, followed by a slight decline and plateau, a renewed rise around layers 8- 10, and a subsequent decline, mirroring the pattern observed in [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Segmental restoration probe (/nt/ vs /nd/) - all variants (4-fold CV) to our approach, Gessinger et al. [14] demonstrate phonemic restoration in wav2vec 2.0 and Whisper using controlled deletions (noise/silent gaps). Like their artificial “gaps”, our natural CCR deletions (/nt/ → /n/) create analogous disruptions that models successfully restore. Our segmental restoration probe mirrors this pattern: redu… view at source ↗

read the original abstract

Self-supervised and supervised speech models are increasingly used to investigate which linguistic information their internal representations encode, and at what level of abstraction they encode it. One underexplored phenomenon is consonant cluster reduction (CCR) in African American English (AAE), a widespread phonological process and a source of automatic speech recognition (ASR) disparity. To examine how CCR is represented, we conduct speaker-independent layer-wise probing of wav2vec2-base and Whisper-small using two tasks: segmental reduction detection and segmental restoration of underlying cluster identity. Both models distinguish reduced and canonical forms with high accuracy. Crucially, reduced segments retain cues to their underlying stops, indicating that CCR is encoded as structured gradient phonological variation rather than simple segmental deletion. These results demonstrate structured phonological encoding of AAE CCR patterns in modern speech models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper applies layer-wise probing to CCR in AAE and claims models keep gradient cues to underlying stops rather than treating it as deletion, but the restoration task likely leaks context and methods details are absent.

read the letter

The core finding is that wav2vec2-base and Whisper-small distinguish reduced and canonical consonant clusters in AAE with high accuracy, and that reduced segments still carry information about the missing stops. This points to structured encoding instead of outright deletion.

The work applies standard probing to a specific, under-studied phonological process tied to real ASR gaps. The speaker-independent split and the two-task design (detection plus restoration) are reasonable choices and give a clearer picture than single-task probing would.

The main weaknesses are in the execution. The abstract states high accuracy but reports nothing on dataset size, probe architecture, training details, or statistical tests. More importantly, the restoration task can succeed by using word identity or surrounding phonetic context rather than any acoustic cues inside the reduced segment. The stress-test concern about lexical leakage is plausible on the given description, and nothing rules it out. Without controls for that, the gradient-variation claim rests on shaky ground.

No equations or derivations appear, so there is no circularity issue. The citation pattern looks standard for probing papers.

This is for researchers already working on speech-model interpretability or dialectal ASR fairness. A reader running similar probes would get a useful data point on AAE CCR, but only after the methods and leakage controls are filled in.

Send it to peer review. The topic and task design are worth referee time; the current version needs concrete fixes on data scale and context controls before the central claim can be trusted.

Referee Report

2 major / 1 minor

Summary. The manuscript conducts speaker-independent layer-wise probing of wav2vec2-base and Whisper-small on consonant cluster reduction (CCR) in African American English. Using segmental reduction detection and segmental restoration tasks, it reports that both models distinguish reduced and canonical forms with high accuracy and that reduced segments retain cues to underlying stops, concluding that CCR is encoded as structured gradient phonological variation rather than simple deletion.

Significance. If the central empirical claims hold after addressing methodological gaps, the work would provide useful evidence on how modern speech models represent sociophonetic variation in AAE, with potential implications for ASR fairness and phonological modeling of gradient processes. The layer-wise design and dual-task approach are appropriate for locating where such information is encoded.

major comments (2)

[Methods] Methods: The description provides no dataset size, token counts, speaker numbers, or explicit validation of speaker independence (e.g., train/test speaker splits or cross-speaker controls). These omissions make it impossible to evaluate whether the reported high accuracies reflect genuine model encoding or dataset-specific artifacts.
[Segmental restoration task] Segmental restoration task (results section): The probe input specification does not state whether features are restricted to the reduced segment's internal representation or include word identity, lexical context, or surrounding phonetic frames. Without such isolation, restoration accuracy on reduced forms could arise from higher-level leakage rather than retention of segment-internal cues to underlying stops, directly undermining the gradient-variation claim versus deletion.

minor comments (1)

[Abstract] Abstract: Accuracy figures are stated qualitatively ('high accuracy') without numerical values, confidence intervals, or baseline comparisons; adding these would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and completeness.

read point-by-point responses

Referee: [Methods] Methods: The description provides no dataset size, token counts, speaker numbers, or explicit validation of speaker independence (e.g., train/test speaker splits or cross-speaker controls). These omissions make it impossible to evaluate whether the reported high accuracies reflect genuine model encoding or dataset-specific artifacts.

Authors: We agree that these quantitative details and validation steps were omitted from the submitted manuscript. The revised Methods section will include the total number of tokens and utterances analyzed, the number of speakers, and an explicit description of the speaker-independent partitioning (with no speaker overlap between training and test sets) used for all probing experiments. These additions will allow direct evaluation of whether the accuracies reflect model encoding rather than artifacts. revision: yes
Referee: [Segmental restoration task] Segmental restoration task (results section): The probe input specification does not state whether features are restricted to the reduced segment's internal representation or include word identity, lexical context, or surrounding phonetic frames. Without such isolation, restoration accuracy on reduced forms could arise from higher-level leakage rather than retention of segment-internal cues to underlying stops, directly undermining the gradient-variation claim versus deletion.

Authors: The referee correctly notes that the original text did not explicitly isolate the probe input. The segmental restoration probe uses only the hidden-state activations from the time steps aligned to the reduced segment itself; no word identity, lexical context, or neighboring phonetic frames are provided as input. We will revise the Methods section to state this restriction unambiguously, thereby confirming that any restoration accuracy derives from segment-internal cues rather than higher-level leakage. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical probing study with no derivations or self-referential reductions

full rationale

The paper performs layer-wise probing experiments on wav2vec2 and Whisper using classification tasks for CCR detection and restoration. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text or abstract. Claims rest on experimental accuracy metrics rather than any definitional or constructional equivalence to inputs. This is a standard empirical setup with independent content from the data and models; no load-bearing step reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that probing accuracy directly measures encoded phonological structure. No free parameters, axioms, or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption Probing classifier performance on held-out data reflects the information present in the model's hidden representations.
This is the core premise of all probing studies and is invoked to interpret the high accuracy as evidence of structured encoding.

pith-pipeline@v0.9.1-grok · 5666 in / 1275 out tokens · 16533 ms · 2026-06-26T07:57:35.948337+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

58 extracted references · 30 canonical work pages

[1]

Introduction Modern ASR systems exhibit significant performance dispar- ities across demographic groups, with particularly high error rates for speakers of AAE, a rule-governed variety of English shaped by historical, social, and cultural factors [1]. Multiple studies have documented racial bias in commercial ASR tech- nologies, showing word error rates (...

Pith/arXiv arXiv 2026
[2]

This probe directly evaluates the phonetic sensitivity of each model to CCR

on frozen encoder representations: •Segmental reduction detection: We test whether encoder representations distinguish reduced and canonical cluster pronunciations, thereby assessing whether the presence or absence of a final stop is explicitly encoded across model lay- ers. This probe directly evaluates the phonetic sensitivity of each model to CCR. •Seg...
[3]

For self-supervised models such as wav2vec 2.0, Pasad et al

Related Work Probing speech encoders has emerged as the standard method for dissecting the linguistic hierarchy of speech understanding [10]. For self-supervised models such as wav2vec 2.0, Pasad et al. [7] perform a detailed layer-wise analysis and show that early transformer layers are dominated by low-level acoustic cues, mid-layers maximize phonetic a...
[4]

[23] find that layers 13-15 of Whisper- medium yield peak performance for dysarthric speech detection and severity assessment

report that different types of stuttered disfluencies are best discriminated from fluent speech using mid-to-late Whisper lay- ers, while Yue et al. [23] find that layers 13-15 of Whisper- medium yield peak performance for dysarthric speech detection and severity assessment. Closer to our research, Gessinger et al. [14] investigate phonemic restoration in...

work page doi:10.17605/osf.io/fe2d7
[5]

Data Preparation 3.1.1

Methodology 3.1. Data Preparation 3.1.1. Corpus The Corpus of Regional African American Language (CORAAL) [24] serves as the foundational dataset for this study. The corpus provides rich linguistic resources, including audio recordings with time-aligned orthographic transcriptions in TextGrid format, featuring speaker-specific tiers at both utterance and ...

1927
[6]

Experiment 1 4.1.1

Results 4.1. Experiment 1 4.1.1. Imbalanced vs. balanced Figure 1 illustrates the layer-wise distribution of information in both wav2vec 2.0 and Whisper representations for the imbal- anced and balanced datasets described in Section 3.3.1. Both wav2vec2-base and Whisper-small clearly distinguish reduced from canonical tokens, with accuracies consistently ...
[7]

Our results generally follow this pattern, although in our plot the accuracy peaks more strongly at layer 5 than at layer 9, whereas they report a higher peak at layer 9 than at layer
[8]

Additionally, the decline after layer 10 in our data is more moderate than what they observed. In contrast, prior analyses of Whisper’s encoder (particularly for larger models) report a rising-then-plateau pattern for phonetic tasks, with performance increasing toward mid-layers and then remaining relatively sta- ble without a sharp decline toward higher ...
[9]

The standard deviations are broadly comparable across models, with slightly higher variability for wav2vec 2.0 in the early layers

Whisper follows a similar trajectory, but with a steeper rise toward the peak at layer 9, followed by a more moderate decline and a subsequent plateau toward the final layers. The standard deviations are broadly comparable across models, with slightly higher variability for wav2vec 2.0 in the early layers. This probe is theoretically more challenging than...
[10]

In each case, reduced/canonical labels were randomly shuffled while preserving original train/test splits, speaker independence, and token-cluster distributions

and the reduced-only train probe (Experiment 2). In each case, reduced/canonical labels were randomly shuffled while preserving original train/test splits, speaker independence, and token-cluster distributions. MLP probes trained on these shuf- fled labels yielded a baseline performance floor of 46-53%, con- firming that the observed accuracies exceed cha...
[11]

preceding segment

Discussion In this paper, we examined how layer-wise probing results can inform our understanding of phonological structure, focusing in particular on consonant cluster reduction, coronal stop deletion, and broader patterns in consonant cluster typology. While the Results section emphasized the computational behavior of the models, tracing how representat...
[12]

This suggests that Whisper’s representations may more closely mirror human perceptual patterns in this case

followed by a steady decline to 72.6%, Whisper aligns more closely with Thomas and Bailey [19]’s expectations, exhibit- ing the lowest accuracy among all clusters analyzed so far for this non-homorganic type with low susceptibility to reduction. This suggests that Whisper’s representations may more closely mirror human perceptual patterns in this case. Mo...
[13]

Layer-wise probing on 6,760 CORAAL tokens revealed that both wav2vec 2.0-base and Whisper-small encode CCR in a phonologically structured manner

Conclusion This study set out to determine whether self-supervised and su- pervised models treat CCR in AAE as a low-level segmental deletion phenomenon, or whether they encode it as structured, gradient phonological variation with preserved cues to under- lying forms. Layer-wise probing on 6,760 CORAAL tokens revealed that both wav2vec 2.0-base and Whisp...
[14]

As a result, it remains unclear to what extent our findings generalize to other speech repre- sentation models

Limitations and future work One limitation of this study is that we evaluated only two mod- els, wav2vec 2.0 and Whisper. As a result, it remains unclear to what extent our findings generalize to other speech repre- sentation models. In addition, certain CCR cluster types (e.g., /mp/, /pt/) were underrepresented in the data due to their lower frequency in...
[15]

The authors would like to thank Erfan Amirzadeh Shams for his helpful contribution in refining the research concept and for his technical guidance on the implementation

Acknowledgments This work is part of HM’s PhD research. The authors would like to thank Erfan Amirzadeh Shams for his helpful contribution in refining the research concept and for his technical guidance on the implementation. The authors also thank the anonymous reviewers for their insightful comments and constructive feed- back, which helped improve the ...
[16]

Perplexity.ai and Grok were used for English grammar checking, improving sentence readability, suggesting relevant literature, and code refining or debugging

Generative AI Use Disclosure No Generative AI tools were used to produce the content or re- sults of this paper. Perplexity.ai and Grok were used for English grammar checking, improving sentence readability, suggesting relevant literature, and code refining or debugging
[17]

I don’t think these devices are very culturally sensitive.-impact of automated speech recognition errors on African Americans,

Z. Mengesha, C. Heldreth, M. Lahav, J. Sublewski, and E. Tuennerman, “I don’t think these devices are very culturally sensitive.-impact of automated speech recognition errors on African Americans,”Frontiers in Artificial Intelligence, vol. V olume 4 - 2021, 2021. [Online]. Available: https://doi.org/10. 3389/frai.2021.725911

arXiv 2021
[18]

Proceedings of the National Academy of Sciences , author =

A. Koenecke, A. Nam, E. Lake, J. Nudell, M. Quartey, Z. Mengesha, C. Toups, J. R. Rickford, D. Jurafsky, and S. Goel, “Racial disparities in automated speech recognition,” Proceedings of the National Academy of Sciences, vol. 117, no. 14, pp. 7684–7689, 2020. [Online]. Available: https://doi.org/10.1073/pnas.1915768117

work page doi:10.1073/pnas.1915768117 2020
[19]

Understanding Racial Disparities in Automatic Speech Recognition: The Case of Habitual “be

J. L. Martin and K. Tang, “Understanding Racial Disparities in Automatic Speech Recognition: The Case of Habitual “be”,” inInterspeech 2020, 2020, pp. 626–630. [Online]. Available: https://doi.org/10.21437/Interspeech.2020-2893

work page doi:10.21437/interspeech.2020-2893 2020
[20]

Bias in automatic speech recognition: The case of African American Language,

J. L. Martin and K. E. Wright, “Bias in automatic speech recognition: The case of African American Language,”Applied Linguistics, vol. 44, no. 4, pp. 613–630, 12 2022. [Online]. Available: https://doi.org/10.1093/applin/amac066

work page doi:10.1093/applin/amac066 2022
[21]

Speech Communication , author =

A. B. Wassink, C. Gansen, and I. Bartholomew, “Uneven success: automatic speech recognition and ethnicity-related dialects,” Speech Communication, vol. 140, pp. 50–70, 2022. [Online]. Available: https://doi.org/10.1016/j.specom.2022.03.009

work page doi:10.1016/j.specom.2022.03.009 2022
[22]

Interpretable sparse features for probing self-supervised speech models,

I. Parra, “Interpretable sparse features for probing self-supervised speech models,” inThe 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia- Pacific Chapter of the Association for Computational Linguistics, S. T.y.s.s, S. Shimizu, and Y . Gong, Eds. Mumbai, India: Association for Computational Linguisti...

2025
[23]

w2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training

A. Pasad, J.-C. Chou, and K. Livescu, “Layer-wise analysis of a self-supervised speech representation model,” in2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021, pp. 914–921. [Online]. Available: https: //doi.org/10.1109/ASRU51503.2021.9688093

work page doi:10.1109/asru51503.2021.9688093 2021
[24]

wav2vec 2.0: a framework for self-supervised learning of speech representations,

A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: a framework for self-supervised learning of speech representations,” inProceedings of the 34th International Conference on Neural Information Processing Systems, ser. NIPS ’20. Red Hook, NY , USA: Curran Associates Inc., 2020. [Online]. Available: https://dl.acm.org/doi/abs/10.5555/3495724. 3496768

work page doi:10.5555/3495724 2020
[25]

Robust speech recognition via large-scale w eak supervision,

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” inProceedings of the 40th International Conference on Machine Learning, ser. ICML’23. JMLR.org, 2023. [Online]. Available: https://dl.acm.org/doi/10.5555/3618408.3619590

work page doi:10.5555/3618408.3619590 2023
[26]

Ayadi, F

A. Pasad, B. Shi, and K. Livescu, “Comparative layer-wise analysis of self-supervised speech models,” inICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5. [Online]. Available: https://doi.org/10.1109/ICASSP49357.2023.10096149

work page doi:10.1109/icassp49357.2023.10096149 2023
[27]

Domain-informed probing of wav2vec 2.0 embeddings for phonetic features,

P. Cormac English, J. D. Kelleher, and J. Carson-Berndsen, “Domain-informed probing of wav2vec 2.0 embeddings for phonetic features,” inProceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, G. Nicolai and E. Chodroff, Eds. Seattle, Washington: Association for Computational Linguistics, Jul. 2022, pp...

2022
[28]

What Can an Accent Identifier Learn? Probing Phonetic and Prosodic Information in a Wav2vec2-based Accent Identification Model,

M. Yang, R. C. M. C. Shekar, O. Kang, and J. H. L. Hansen, “What Can an Accent Identifier Learn? Probing Phonetic and Prosodic Information in a Wav2vec2-based Accent Identification Model,” inInterspeech 2023, 2023, pp. 1923–1927. [Online]. Available: https://doi.org/10.21437/Interspeech.2023-2254

work page doi:10.21437/interspeech.2023-2254 2023
[29]

Uncovering syllable constituents in the self-attention-based speech represen- tations of whisper,

E. A Shams, I. Gessinger, and J. Carson-Berndsen, “Uncovering syllable constituents in the self-attention-based speech represen- tations of whisper,” inProceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, Y . Belinkov, N. Kim, J. Jumelet, H. Mohebbi, A. Mueller, and H. Chen, Eds. Miami, Florida, US: Association ...

work page doi:10.18653/v1/2024.blackboxnlp-1.16 2024
[30]

2012.02.005

I. Gessinger, E. A. Shams, and J. Carson-Berndsen, “Under the hood: phonemic restoration in transformer-based automatic speech recognition,”Computer Speech & Language, vol. 96, p. 101893, 2026. [Online]. Available: https://doi.org/10.1016/j.csl. 2025.101893

work page doi:10.1016/j.csl 2026
[31]

Labov,Sociolinguistic Patterns

W. Labov,Sociolinguistic Patterns. Philadelphia: University of Pennsylvania Press, 1972

1972
[32]

Schreier,Consonant Change in English Worldwide: Synchrony Meets Diachrony

D. Schreier,Consonant Change in English Worldwide: Synchrony Meets Diachrony. New York: Palgrave Macmillan, 2005

2005
[33]

Wolfram,Dialect in Society

W. Wolfram,Dialect in Society. John Wiley & Sons, Ltd, 2017, ch. 7, pp. 107–126. [Online]. Available: https: //doi.org/10.1002/9781405166256.ch7

work page doi:10.1002/9781405166256.ch7 2017
[34]

Explanation in variable phonology: An exponential model of morphological constraints,

G. R. Guy, “Explanation in variable phonology: An exponential model of morphological constraints,”Language Variation and Change, vol. 3, no. 1, pp. 1–22, 1991. [Online]. Available: https://doi.org/10.1017/S0954394500000429

work page doi:10.1017/s0954394500000429 1991
[35]

Segmental phonology of African American English,

E. R. Thomas and G. Bailey, “Segmental phonology of African American English,” inThe Oxford Handbook of African American Language. Oxford University Press, 07
[36]

Available: https://doi.org/10.1093/oxfordhb/ 9780199795390.013.13

[Online]. Available: https://doi.org/10.1093/oxfordhb/ 9780199795390.013.13

work page doi:10.1093/oxfordhb/
[37]

Automatic Speech Recognition of African American English: Lexical and Contextual Effects,

H. Mojarad and K. Tang, “Automatic Speech Recognition of African American English: Lexical and Contextual Effects,” in Interspeech 2025, 2025, pp. 3883–3887. [Online]. Available: https://doi.org/10.21437/Interspeech.2025-1511

work page doi:10.21437/interspeech.2025-1511 2025
[38]

Using wav2vec 2.0 for phonetic classification tasks: methodological aspects,

L. Kim and C. Gendrot, “Using wav2vec 2.0 for phonetic classification tasks: methodological aspects,” inInterspeech 2024, 2024, pp. 1530–1534. [Online]. Available: https: //doi.org/10.21437/Interspeech.2024-1155

work page doi:10.21437/interspeech.2024-1155 2024
[39]

Exploring Whisper embeddings for stutter detection: A layer-wise study,

A. Batra, B. Kar, and P. K. Das, “Exploring Whisper embeddings for stutter detection: A layer-wise study,” in2025 33rd European Signal Processing Conference (EUSIPCO), 2025, pp. 61–65. [Online]. Available: https://doi.org/10.23919/EUSIPCO63237. 2025.11226086

work page doi:10.23919/eusipco63237 2025
[40]

Probing Whisper for dysarthric speech in detection and assessment,

Z. Yue, D. Kayande, Z. Cvetkovic, and E. Loweimi, “Probing Whisper for dysarthric speech in detection and assessment,” 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2510.04219

work page doi:10.48550/arxiv.2510.04219 2025
[41]

The Corpus of Regional African American Language,

T. Kendall and C. Farrington, “The Corpus of Regional African American Language,” 2023, publisher: The Online Resources for African American Language Project. [Online]. Available: https://oraal.uoregon.edu/coraal

2023
[42]

Considering performance in the automated and manual coding of sociolinguistic variables: Lessons from variable (ING),

T. Kendall, C. Vaughn, C. Farrington, K. Gunter, J. McLean, C. Tacata, and S. Arnson, “Considering performance in the automated and manual coding of sociolinguistic variables: Lessons from variable (ING),”Frontiers in Artificial Intelligence, vol. 4, 2021. [Online]. Available: https://doi.org/10.3389/frai. 2021.648543

work page doi:10.3389/frai 2021
[43]

Montreal Forced Aligner: Trainable Text- Speech Alignment Using Kaldi,

M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, “Montreal Forced Aligner: Trainable Text- Speech Alignment Using Kaldi,” inProc. Interspeech 2017, 2017, pp. 498–502. [Online]. Available: https://doi.org/10.21437/ Interspeech.2017-1386

2017
[44]

Homophone disambiguation reveals patterns of con- text mixing in speech transformers,

H. Mohebbi, G. Chrupała, W. Zuidema, and A. Al- ishahi, “Homophone disambiguation reveals patterns of con- text mixing in speech transformers,” inProceedings of the 2023 Conference on Empirical Methods in Natural Lan- guage Processing, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Lin- guistics, Dec. 2023, pp. 8249–8260. ...

work page doi:10.18653/v1/2023.emnlp-main.513 2023
[45]

Scikit-learn: Machine learning in python,

F. Pedregosa, G. Varoquaux, A. Gramfort, V . Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V . Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and ´Edouard Duchesnay, “Scikit-learn: Machine learning in python,”Journal of Machine Learning Research, vol. 12, no. 85, pp. 2825–2830, 2011. [Online]. Available: h...

2011
[46]

Bayley and D

R. Bayley and D. Villarreal,Coronal Stop Deletion in a Rural South Texas Community, ser. Studies in English Language. Cambridge University Press, 2019, p. 198–214. [Online]. Available: https://doi.org/10.1017/9781316162316.008

work page doi:10.1017/9781316162316.008 2019
[47]

Spoken word recognition processes and the gating paradigm,

F. Grosjean, “Spoken word recognition processes and the gating paradigm,”Perception & Psychophysics, vol. 28, no. 4, pp. 267–283, 1980. [Online]. Available: https://doi.org/10.3758/ BF03204386

1980
[48]

How does OpenAI’s Whisper interpret dysarthric speech?

O. Agaoglu, Z. Yue, and Y . Zhang, “How does OpenAI’s Whisper interpret dysarthric speech?” TU Delft, Tech. Rep., 2024, student thesis/project report. [Online]. Available: https://resolver.tudelft. nl/uuid:47837feb-1b2e-4bba-9fad-f92d84024abb

2024
[49]

A layer-wise analysis of Mandarin and English suprasegmentals in SSL speech models,

A. de la Fuente and D. Jurafsky, “A layer-wise analysis of Mandarin and English suprasegmentals in SSL speech models,” inInterspeech 2024, 2024, pp. 1290–1294. [Online]. Available: https://doi.org/10.21437/Interspeech.2024-2341

work page doi:10.21437/interspeech.2024-2341 2024
[50]

Phonological abstraction without phonemes in speech perception,

H. Mitterer, O. Scharenborg, and J. M. McQueen, “Phonological abstraction without phonemes in speech perception,”Cognition, vol. 129, no. 2, pp. 356–361, 2013. [Online]. Available: https://doi.org/10.1016/j.cognition.2013.07.011

work page doi:10.1016/j.cognition.2013.07.011 2013
[51]

Wolfram and E

W. Wolfram and E. R. Thomas,The development of African American English. Blackwell Publishers, 2002. [Online]. Available: https://doi.org/10.1002/9780470690178

work page doi:10.1002/9780470690178 2002
[52]

The child as linguistic historian,

W. Labov, “The child as linguistic historian,”Language Variation and Change, vol. 1, no. 1, p. 85–97, 1989. [Online]. Available: https://doi.org/10.1017/S0954394500000120

work page doi:10.1017/s0954394500000120 1989
[53]

Reexamining the development of African Amer- ican English: Evidence from isolated communities,

W. Wolfram, “Reexamining the development of African Amer- ican English: Evidence from isolated communities,”Lan- guage, vol. 79, pp. 282–316, 06 2003. [Online]. Available: https://doi.org/10.1353/lan.2003.0144

work page doi:10.1353/lan.2003.0144 2003
[54]

Phonological and phonetic characteristics of African American Vernacular English,

E. R. Thomas, “Phonological and phonetic characteristics of African American Vernacular English,”Language and Linguistics Compass, vol. 1, no. 5, pp. 450–475, 2007. [Online]. Available: https://doi.org/10.1111/j.1749-818X.2007.00029.x

work page doi:10.1111/j.1749-818x.2007.00029.x 2007
[55]

L. J. Green,African American English: A Linguistic Introduction. Cambridge University Press, 2002

2002
[56]

IEEE/ACM Trans

W.-N. Hsu, B. Bolte, Y .-H. H. Tsai, K. Lakhotia, R. Salakhut- dinov, and A. Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021. [Online]. Available: https://doi.org/10.1109/TASLP.2021.3122291

work page doi:10.1109/taslp.2021.3122291 2021
[57]

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing , volume=

S. Chen, C. Wang, Z. Chen, Y . Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y . Qian, Y . Qian, J. Wu, M. Zeng, X. Yu, and F. Wei, “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,”IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022. [Online]. Ava...

work page doi:10.1109/jstsp.2022.3188113 2022
[58]

Scaling speech technology to 1,000+ languages,

V . Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi, A. Baevski, Y . Adi, X. Zhang, W.-N. Hsu, A. Conneau, and M. Auli, “Scaling speech technology to 1,000+ languages,”Journal of Machine Learning Research, vol. 25, no. 97, pp. 1–52, 2024. [Online]. Available: http://jmlr.org/papers/v25/23-1318.html

2024

[1] [1]

Introduction Modern ASR systems exhibit significant performance dispar- ities across demographic groups, with particularly high error rates for speakers of AAE, a rule-governed variety of English shaped by historical, social, and cultural factors [1]. Multiple studies have documented racial bias in commercial ASR tech- nologies, showing word error rates (...

Pith/arXiv arXiv 2026

[2] [2]

This probe directly evaluates the phonetic sensitivity of each model to CCR

on frozen encoder representations: •Segmental reduction detection: We test whether encoder representations distinguish reduced and canonical cluster pronunciations, thereby assessing whether the presence or absence of a final stop is explicitly encoded across model lay- ers. This probe directly evaluates the phonetic sensitivity of each model to CCR. •Seg...

[3] [3]

For self-supervised models such as wav2vec 2.0, Pasad et al

Related Work Probing speech encoders has emerged as the standard method for dissecting the linguistic hierarchy of speech understanding [10]. For self-supervised models such as wav2vec 2.0, Pasad et al. [7] perform a detailed layer-wise analysis and show that early transformer layers are dominated by low-level acoustic cues, mid-layers maximize phonetic a...

[4] [4]

[23] find that layers 13-15 of Whisper- medium yield peak performance for dysarthric speech detection and severity assessment

report that different types of stuttered disfluencies are best discriminated from fluent speech using mid-to-late Whisper lay- ers, while Yue et al. [23] find that layers 13-15 of Whisper- medium yield peak performance for dysarthric speech detection and severity assessment. Closer to our research, Gessinger et al. [14] investigate phonemic restoration in...

work page doi:10.17605/osf.io/fe2d7

[5] [5]

Data Preparation 3.1.1

Methodology 3.1. Data Preparation 3.1.1. Corpus The Corpus of Regional African American Language (CORAAL) [24] serves as the foundational dataset for this study. The corpus provides rich linguistic resources, including audio recordings with time-aligned orthographic transcriptions in TextGrid format, featuring speaker-specific tiers at both utterance and ...

1927

[6] [6]

Experiment 1 4.1.1

Results 4.1. Experiment 1 4.1.1. Imbalanced vs. balanced Figure 1 illustrates the layer-wise distribution of information in both wav2vec 2.0 and Whisper representations for the imbal- anced and balanced datasets described in Section 3.3.1. Both wav2vec2-base and Whisper-small clearly distinguish reduced from canonical tokens, with accuracies consistently ...

[7] [7]

Our results generally follow this pattern, although in our plot the accuracy peaks more strongly at layer 5 than at layer 9, whereas they report a higher peak at layer 9 than at layer

[8] [8]

Additionally, the decline after layer 10 in our data is more moderate than what they observed. In contrast, prior analyses of Whisper’s encoder (particularly for larger models) report a rising-then-plateau pattern for phonetic tasks, with performance increasing toward mid-layers and then remaining relatively sta- ble without a sharp decline toward higher ...

[9] [9]

The standard deviations are broadly comparable across models, with slightly higher variability for wav2vec 2.0 in the early layers

Whisper follows a similar trajectory, but with a steeper rise toward the peak at layer 9, followed by a more moderate decline and a subsequent plateau toward the final layers. The standard deviations are broadly comparable across models, with slightly higher variability for wav2vec 2.0 in the early layers. This probe is theoretically more challenging than...

[10] [10]

In each case, reduced/canonical labels were randomly shuffled while preserving original train/test splits, speaker independence, and token-cluster distributions

and the reduced-only train probe (Experiment 2). In each case, reduced/canonical labels were randomly shuffled while preserving original train/test splits, speaker independence, and token-cluster distributions. MLP probes trained on these shuf- fled labels yielded a baseline performance floor of 46-53%, con- firming that the observed accuracies exceed cha...

[11] [11]

preceding segment

Discussion In this paper, we examined how layer-wise probing results can inform our understanding of phonological structure, focusing in particular on consonant cluster reduction, coronal stop deletion, and broader patterns in consonant cluster typology. While the Results section emphasized the computational behavior of the models, tracing how representat...

[12] [12]

This suggests that Whisper’s representations may more closely mirror human perceptual patterns in this case

followed by a steady decline to 72.6%, Whisper aligns more closely with Thomas and Bailey [19]’s expectations, exhibit- ing the lowest accuracy among all clusters analyzed so far for this non-homorganic type with low susceptibility to reduction. This suggests that Whisper’s representations may more closely mirror human perceptual patterns in this case. Mo...

[13] [13]

Layer-wise probing on 6,760 CORAAL tokens revealed that both wav2vec 2.0-base and Whisper-small encode CCR in a phonologically structured manner

Conclusion This study set out to determine whether self-supervised and su- pervised models treat CCR in AAE as a low-level segmental deletion phenomenon, or whether they encode it as structured, gradient phonological variation with preserved cues to under- lying forms. Layer-wise probing on 6,760 CORAAL tokens revealed that both wav2vec 2.0-base and Whisp...

[14] [14]

As a result, it remains unclear to what extent our findings generalize to other speech repre- sentation models

Limitations and future work One limitation of this study is that we evaluated only two mod- els, wav2vec 2.0 and Whisper. As a result, it remains unclear to what extent our findings generalize to other speech repre- sentation models. In addition, certain CCR cluster types (e.g., /mp/, /pt/) were underrepresented in the data due to their lower frequency in...

[15] [15]

The authors would like to thank Erfan Amirzadeh Shams for his helpful contribution in refining the research concept and for his technical guidance on the implementation

Acknowledgments This work is part of HM’s PhD research. The authors would like to thank Erfan Amirzadeh Shams for his helpful contribution in refining the research concept and for his technical guidance on the implementation. The authors also thank the anonymous reviewers for their insightful comments and constructive feed- back, which helped improve the ...

[16] [16]

Perplexity.ai and Grok were used for English grammar checking, improving sentence readability, suggesting relevant literature, and code refining or debugging

Generative AI Use Disclosure No Generative AI tools were used to produce the content or re- sults of this paper. Perplexity.ai and Grok were used for English grammar checking, improving sentence readability, suggesting relevant literature, and code refining or debugging

[17] [17]

I don’t think these devices are very culturally sensitive.-impact of automated speech recognition errors on African Americans,

Z. Mengesha, C. Heldreth, M. Lahav, J. Sublewski, and E. Tuennerman, “I don’t think these devices are very culturally sensitive.-impact of automated speech recognition errors on African Americans,”Frontiers in Artificial Intelligence, vol. V olume 4 - 2021, 2021. [Online]. Available: https://doi.org/10. 3389/frai.2021.725911

arXiv 2021

[18] [18]

Proceedings of the National Academy of Sciences , author =

A. Koenecke, A. Nam, E. Lake, J. Nudell, M. Quartey, Z. Mengesha, C. Toups, J. R. Rickford, D. Jurafsky, and S. Goel, “Racial disparities in automated speech recognition,” Proceedings of the National Academy of Sciences, vol. 117, no. 14, pp. 7684–7689, 2020. [Online]. Available: https://doi.org/10.1073/pnas.1915768117

work page doi:10.1073/pnas.1915768117 2020

[19] [19]

Understanding Racial Disparities in Automatic Speech Recognition: The Case of Habitual “be

J. L. Martin and K. Tang, “Understanding Racial Disparities in Automatic Speech Recognition: The Case of Habitual “be”,” inInterspeech 2020, 2020, pp. 626–630. [Online]. Available: https://doi.org/10.21437/Interspeech.2020-2893

work page doi:10.21437/interspeech.2020-2893 2020

[20] [20]

Bias in automatic speech recognition: The case of African American Language,

J. L. Martin and K. E. Wright, “Bias in automatic speech recognition: The case of African American Language,”Applied Linguistics, vol. 44, no. 4, pp. 613–630, 12 2022. [Online]. Available: https://doi.org/10.1093/applin/amac066

work page doi:10.1093/applin/amac066 2022

[21] [21]

Speech Communication , author =

A. B. Wassink, C. Gansen, and I. Bartholomew, “Uneven success: automatic speech recognition and ethnicity-related dialects,” Speech Communication, vol. 140, pp. 50–70, 2022. [Online]. Available: https://doi.org/10.1016/j.specom.2022.03.009

work page doi:10.1016/j.specom.2022.03.009 2022

[22] [22]

Interpretable sparse features for probing self-supervised speech models,

I. Parra, “Interpretable sparse features for probing self-supervised speech models,” inThe 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia- Pacific Chapter of the Association for Computational Linguistics, S. T.y.s.s, S. Shimizu, and Y . Gong, Eds. Mumbai, India: Association for Computational Linguisti...

2025

[23] [23]

w2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training

A. Pasad, J.-C. Chou, and K. Livescu, “Layer-wise analysis of a self-supervised speech representation model,” in2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021, pp. 914–921. [Online]. Available: https: //doi.org/10.1109/ASRU51503.2021.9688093

work page doi:10.1109/asru51503.2021.9688093 2021

[24] [24]

wav2vec 2.0: a framework for self-supervised learning of speech representations,

A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: a framework for self-supervised learning of speech representations,” inProceedings of the 34th International Conference on Neural Information Processing Systems, ser. NIPS ’20. Red Hook, NY , USA: Curran Associates Inc., 2020. [Online]. Available: https://dl.acm.org/doi/abs/10.5555/3495724. 3496768

work page doi:10.5555/3495724 2020

[25] [25]

Robust speech recognition via large-scale w eak supervision,

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” inProceedings of the 40th International Conference on Machine Learning, ser. ICML’23. JMLR.org, 2023. [Online]. Available: https://dl.acm.org/doi/10.5555/3618408.3619590

work page doi:10.5555/3618408.3619590 2023

[26] [26]

Ayadi, F

A. Pasad, B. Shi, and K. Livescu, “Comparative layer-wise analysis of self-supervised speech models,” inICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5. [Online]. Available: https://doi.org/10.1109/ICASSP49357.2023.10096149

work page doi:10.1109/icassp49357.2023.10096149 2023

[27] [27]

Domain-informed probing of wav2vec 2.0 embeddings for phonetic features,

P. Cormac English, J. D. Kelleher, and J. Carson-Berndsen, “Domain-informed probing of wav2vec 2.0 embeddings for phonetic features,” inProceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, G. Nicolai and E. Chodroff, Eds. Seattle, Washington: Association for Computational Linguistics, Jul. 2022, pp...

2022

[28] [28]

What Can an Accent Identifier Learn? Probing Phonetic and Prosodic Information in a Wav2vec2-based Accent Identification Model,

M. Yang, R. C. M. C. Shekar, O. Kang, and J. H. L. Hansen, “What Can an Accent Identifier Learn? Probing Phonetic and Prosodic Information in a Wav2vec2-based Accent Identification Model,” inInterspeech 2023, 2023, pp. 1923–1927. [Online]. Available: https://doi.org/10.21437/Interspeech.2023-2254

work page doi:10.21437/interspeech.2023-2254 2023

[29] [29]

Uncovering syllable constituents in the self-attention-based speech represen- tations of whisper,

E. A Shams, I. Gessinger, and J. Carson-Berndsen, “Uncovering syllable constituents in the self-attention-based speech represen- tations of whisper,” inProceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, Y . Belinkov, N. Kim, J. Jumelet, H. Mohebbi, A. Mueller, and H. Chen, Eds. Miami, Florida, US: Association ...

work page doi:10.18653/v1/2024.blackboxnlp-1.16 2024

[30] [30]

2012.02.005

I. Gessinger, E. A. Shams, and J. Carson-Berndsen, “Under the hood: phonemic restoration in transformer-based automatic speech recognition,”Computer Speech & Language, vol. 96, p. 101893, 2026. [Online]. Available: https://doi.org/10.1016/j.csl. 2025.101893

work page doi:10.1016/j.csl 2026

[31] [31]

Labov,Sociolinguistic Patterns

W. Labov,Sociolinguistic Patterns. Philadelphia: University of Pennsylvania Press, 1972

1972

[32] [32]

Schreier,Consonant Change in English Worldwide: Synchrony Meets Diachrony

D. Schreier,Consonant Change in English Worldwide: Synchrony Meets Diachrony. New York: Palgrave Macmillan, 2005

2005

[33] [33]

Wolfram,Dialect in Society

W. Wolfram,Dialect in Society. John Wiley & Sons, Ltd, 2017, ch. 7, pp. 107–126. [Online]. Available: https: //doi.org/10.1002/9781405166256.ch7

work page doi:10.1002/9781405166256.ch7 2017

[34] [34]

Explanation in variable phonology: An exponential model of morphological constraints,

G. R. Guy, “Explanation in variable phonology: An exponential model of morphological constraints,”Language Variation and Change, vol. 3, no. 1, pp. 1–22, 1991. [Online]. Available: https://doi.org/10.1017/S0954394500000429

work page doi:10.1017/s0954394500000429 1991

[35] [35]

Segmental phonology of African American English,

E. R. Thomas and G. Bailey, “Segmental phonology of African American English,” inThe Oxford Handbook of African American Language. Oxford University Press, 07

[36] [36]

Available: https://doi.org/10.1093/oxfordhb/ 9780199795390.013.13

[Online]. Available: https://doi.org/10.1093/oxfordhb/ 9780199795390.013.13

work page doi:10.1093/oxfordhb/

[37] [37]

Automatic Speech Recognition of African American English: Lexical and Contextual Effects,

H. Mojarad and K. Tang, “Automatic Speech Recognition of African American English: Lexical and Contextual Effects,” in Interspeech 2025, 2025, pp. 3883–3887. [Online]. Available: https://doi.org/10.21437/Interspeech.2025-1511

work page doi:10.21437/interspeech.2025-1511 2025

[38] [38]

Using wav2vec 2.0 for phonetic classification tasks: methodological aspects,

L. Kim and C. Gendrot, “Using wav2vec 2.0 for phonetic classification tasks: methodological aspects,” inInterspeech 2024, 2024, pp. 1530–1534. [Online]. Available: https: //doi.org/10.21437/Interspeech.2024-1155

work page doi:10.21437/interspeech.2024-1155 2024

[39] [39]

Exploring Whisper embeddings for stutter detection: A layer-wise study,

A. Batra, B. Kar, and P. K. Das, “Exploring Whisper embeddings for stutter detection: A layer-wise study,” in2025 33rd European Signal Processing Conference (EUSIPCO), 2025, pp. 61–65. [Online]. Available: https://doi.org/10.23919/EUSIPCO63237. 2025.11226086

work page doi:10.23919/eusipco63237 2025

[40] [40]

Probing Whisper for dysarthric speech in detection and assessment,

Z. Yue, D. Kayande, Z. Cvetkovic, and E. Loweimi, “Probing Whisper for dysarthric speech in detection and assessment,” 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2510.04219

work page doi:10.48550/arxiv.2510.04219 2025

[41] [41]

The Corpus of Regional African American Language,

T. Kendall and C. Farrington, “The Corpus of Regional African American Language,” 2023, publisher: The Online Resources for African American Language Project. [Online]. Available: https://oraal.uoregon.edu/coraal

2023

[42] [42]

Considering performance in the automated and manual coding of sociolinguistic variables: Lessons from variable (ING),

T. Kendall, C. Vaughn, C. Farrington, K. Gunter, J. McLean, C. Tacata, and S. Arnson, “Considering performance in the automated and manual coding of sociolinguistic variables: Lessons from variable (ING),”Frontiers in Artificial Intelligence, vol. 4, 2021. [Online]. Available: https://doi.org/10.3389/frai. 2021.648543

work page doi:10.3389/frai 2021

[43] [43]

Montreal Forced Aligner: Trainable Text- Speech Alignment Using Kaldi,

M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, “Montreal Forced Aligner: Trainable Text- Speech Alignment Using Kaldi,” inProc. Interspeech 2017, 2017, pp. 498–502. [Online]. Available: https://doi.org/10.21437/ Interspeech.2017-1386

2017

[44] [44]

Homophone disambiguation reveals patterns of con- text mixing in speech transformers,

H. Mohebbi, G. Chrupała, W. Zuidema, and A. Al- ishahi, “Homophone disambiguation reveals patterns of con- text mixing in speech transformers,” inProceedings of the 2023 Conference on Empirical Methods in Natural Lan- guage Processing, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Lin- guistics, Dec. 2023, pp. 8249–8260. ...

work page doi:10.18653/v1/2023.emnlp-main.513 2023

[45] [45]

Scikit-learn: Machine learning in python,

F. Pedregosa, G. Varoquaux, A. Gramfort, V . Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V . Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and ´Edouard Duchesnay, “Scikit-learn: Machine learning in python,”Journal of Machine Learning Research, vol. 12, no. 85, pp. 2825–2830, 2011. [Online]. Available: h...

2011

[46] [46]

Bayley and D

R. Bayley and D. Villarreal,Coronal Stop Deletion in a Rural South Texas Community, ser. Studies in English Language. Cambridge University Press, 2019, p. 198–214. [Online]. Available: https://doi.org/10.1017/9781316162316.008

work page doi:10.1017/9781316162316.008 2019

[47] [47]

Spoken word recognition processes and the gating paradigm,

F. Grosjean, “Spoken word recognition processes and the gating paradigm,”Perception & Psychophysics, vol. 28, no. 4, pp. 267–283, 1980. [Online]. Available: https://doi.org/10.3758/ BF03204386

1980

[48] [48]

How does OpenAI’s Whisper interpret dysarthric speech?

O. Agaoglu, Z. Yue, and Y . Zhang, “How does OpenAI’s Whisper interpret dysarthric speech?” TU Delft, Tech. Rep., 2024, student thesis/project report. [Online]. Available: https://resolver.tudelft. nl/uuid:47837feb-1b2e-4bba-9fad-f92d84024abb

2024

[49] [49]

A layer-wise analysis of Mandarin and English suprasegmentals in SSL speech models,

A. de la Fuente and D. Jurafsky, “A layer-wise analysis of Mandarin and English suprasegmentals in SSL speech models,” inInterspeech 2024, 2024, pp. 1290–1294. [Online]. Available: https://doi.org/10.21437/Interspeech.2024-2341

work page doi:10.21437/interspeech.2024-2341 2024

[50] [50]

Phonological abstraction without phonemes in speech perception,

H. Mitterer, O. Scharenborg, and J. M. McQueen, “Phonological abstraction without phonemes in speech perception,”Cognition, vol. 129, no. 2, pp. 356–361, 2013. [Online]. Available: https://doi.org/10.1016/j.cognition.2013.07.011

work page doi:10.1016/j.cognition.2013.07.011 2013

[51] [51]

Wolfram and E

W. Wolfram and E. R. Thomas,The development of African American English. Blackwell Publishers, 2002. [Online]. Available: https://doi.org/10.1002/9780470690178

work page doi:10.1002/9780470690178 2002

[52] [52]

The child as linguistic historian,

W. Labov, “The child as linguistic historian,”Language Variation and Change, vol. 1, no. 1, p. 85–97, 1989. [Online]. Available: https://doi.org/10.1017/S0954394500000120

work page doi:10.1017/s0954394500000120 1989

[53] [53]

Reexamining the development of African Amer- ican English: Evidence from isolated communities,

W. Wolfram, “Reexamining the development of African Amer- ican English: Evidence from isolated communities,”Lan- guage, vol. 79, pp. 282–316, 06 2003. [Online]. Available: https://doi.org/10.1353/lan.2003.0144

work page doi:10.1353/lan.2003.0144 2003

[54] [54]

Phonological and phonetic characteristics of African American Vernacular English,

E. R. Thomas, “Phonological and phonetic characteristics of African American Vernacular English,”Language and Linguistics Compass, vol. 1, no. 5, pp. 450–475, 2007. [Online]. Available: https://doi.org/10.1111/j.1749-818X.2007.00029.x

work page doi:10.1111/j.1749-818x.2007.00029.x 2007

[55] [55]

L. J. Green,African American English: A Linguistic Introduction. Cambridge University Press, 2002

2002

[56] [56]

IEEE/ACM Trans

W.-N. Hsu, B. Bolte, Y .-H. H. Tsai, K. Lakhotia, R. Salakhut- dinov, and A. Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021. [Online]. Available: https://doi.org/10.1109/TASLP.2021.3122291

work page doi:10.1109/taslp.2021.3122291 2021

[57] [57]

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing , volume=

S. Chen, C. Wang, Z. Chen, Y . Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y . Qian, Y . Qian, J. Wu, M. Zeng, X. Yu, and F. Wei, “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,”IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022. [Online]. Ava...

work page doi:10.1109/jstsp.2022.3188113 2022

[58] [58]

Scaling speech technology to 1,000+ languages,

V . Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi, A. Baevski, Y . Adi, X. Zhang, W.-N. Hsu, A. Conneau, and M. Auli, “Scaling speech technology to 1,000+ languages,”Journal of Machine Learning Research, vol. 25, no. 97, pp. 1–52, 2024. [Online]. Available: http://jmlr.org/papers/v25/23-1318.html

2024