PSP: An Interpretable Per-Dimension Accent Benchmark for Indic Text-to-Speech
Pith reviewed 2026-05-07 14:19 UTC · model grok-4.3
The pith
The PSP benchmark shows that TTS systems leading on word error rate do not lead uniformly on retroflex or prosodic fidelity for Indic languages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PSP decomposes accent into retroflex collapse rate, aspiration fidelity, vowel-length fidelity, Tamil-zha fidelity, Fréchet Audio Distance, and prosodic signature divergence; when four systems are scored on Hindi, Telugu, and Tamil, retroflex collapse increases monotonically with phonological difficulty, PSP orderings diverge from WER orderings, and no system is Pareto-optimal across the six dimensions.
What carries the argument
PSP, the Phoneme Substitution Profile, which decomposes accent into six complementary dimensions measured via forced alignment plus native-speaker-centroid probes on Wav2Vec2-XLS-R layer-9 embeddings together with corpus-level distributional distances.
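The centroid-probe machinery can be pictured concretely. Below is a minimal sketch, assuming frame-level layer-9 embeddings have already been extracted and segmented by the forced aligner; the function names and the dental/retroflex decision rule are illustrative assumptions, not the paper's released scoring code:

```python
import numpy as np

def centroid(embs: np.ndarray) -> np.ndarray:
    """Mean of L2-normalised frame embeddings (one row per frame),
    e.g. over the 500 native reference clips for one phone class."""
    e = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    return e.mean(axis=0)

def cosine_dist(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance between two pooled embeddings."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retroflex_collapsed(seg_emb: np.ndarray,
                        retroflex_centroid: np.ndarray,
                        dental_centroid: np.ndarray) -> bool:
    """Flag a retroflex token as 'collapsed' if its pooled segment
    embedding sits closer to the dental centroid than to the
    retroflex one. The collapse rate is the fraction of retroflex
    tokens flagged this way."""
    pooled = seg_emb.mean(axis=0)
    return cosine_dist(pooled, dental_centroid) < cosine_dist(pooled, retroflex_centroid)
```

The same closest-centroid comparison, with different phone-class pairs, would cover aspiration fidelity, vowel-length fidelity, and zha fidelity.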
If this is right
- Retroflex collapse rate grows monotonically from Hindi to Telugu to Tamil.
- PSP rankings of the tested systems differ from their WER rankings.
- No single system achieves the lowest error on every one of the six dimensions at once.
- The released native reference centroids, embeddings, and scoring code enable direct comparison of future systems on the same dimensions.
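The third bullet, the paper's Pareto claim, is mechanically checkable once per-system scores exist. A minimal sketch with hypothetical scores (lower is better on every dimension); this illustrates the claim's logic only, not the paper's released code:

```python
def best_on_all(scores):
    """Return the name of a system that attains the lowest score on
    every dimension simultaneously (lower = better), or None if no
    such system exists, which is the situation the paper reports."""
    dims = range(len(next(iter(scores.values()))))
    for name, vals in scores.items():
        if all(vals[d] == min(s[d] for s in scores.values()) for d in dims):
            return name
    return None

# Hypothetical two-dimension example: each system wins one dimension,
# so no system is best on all of them at once.
split = {"system_a": [1.0, 2.0], "system_b": [2.0, 1.0]}
```

With real data the dictionary would hold six-element lists (RR, AF, LF, ZF, FAD, PSD) per system.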
Where Pith is reading between the lines
- Model developers could run PSP on new checkpoints to locate which specific phonological features still need work rather than chasing a single aggregate score.
- The same six-dimension decomposition could be applied to other language families that share retroflex or aspiration contrasts.
- Pairing PSP scores with WER in a joint dashboard would give a more complete picture than either metric alone.
- Extending the benchmark to additional Indic languages would test whether the monotonic difficulty pattern holds beyond the pilot set.
Load-bearing premise
Forced alignment combined with native-speaker-centroid acoustic probes on Wav2Vec2-XLS-R layer-9 embeddings accurately captures phonological features such as retroflex collapse and aspiration fidelity.
What would settle it
A controlled listening test in which native speakers rate accent naturalness on the same utterances and the ratings fail to correlate with the four phonological PSP scores.
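Scoring such a study reduces to a rank correlation between per-utterance PSP values and mean listener ratings. A dependency-free sketch (both input lists are hypothetical):

```python
def spearman_rho(x, y):
    """Spearman rank correlation between two equal-length score lists,
    e.g. per-utterance PSP values vs. mean native-listener ratings."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            # group ties and assign them their average rank
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) *
           sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den
```

A strong positive rho between listener ratings of nativeness and the phonological fidelity scores would support the probes; a near-zero rho would be the refutation described above.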
read the original abstract
Standard text-to-speech (TTS) evaluation measures intelligibility (WER, CER) and overall naturalness (MOS, UTMOS) but does not quantify accent. A synthesiser may score well on all four yet sound non-native on features that are phonemic in the target language. For Indic languages, these features include retroflex articulation, aspiration, vowel length, and the Tamil retroflex approximant (letter zha). We present PSP, the Phoneme Substitution Profile, an interpretable, per-phonological-dimension accent benchmark for Indic TTS. PSP decomposes accent into six complementary dimensions: retroflex collapse rate (RR), aspiration fidelity (AF), vowel-length fidelity (LF), Tamil-zha fidelity (ZF), Frechet Audio Distance (FAD), and prosodic signature divergence (PSD). The first four are measured via forced alignment plus native-speaker-centroid acoustic probes over Wav2Vec2-XLS-R layer-9 embeddings; the latter two are corpus-level distributional distances. In this v1 we benchmark four commercial and open-source systems (ElevenLabs v3, Cartesia Sonic-3, Sarvam Bulbul, Indic Parler-TTS) on Hindi, Telugu, and Tamil pilot sets, with a fifth system (Praxy Voice) included on all three languages, plus an R5->R6 case study on Telugu. Three findings: (i) retroflex collapse grows monotonically with phonological difficulty Hindi < Telugu < Tamil (~1%, ~40%, ~68%); (ii) PSP ordering diverges from WER ordering -- commercial WER-leaders do not uniformly lead on retroflex or prosodic fidelity; (iii) no single system is Pareto-optimal across all six dimensions. We release native reference centroids (500 clips per language), 1000-clip embeddings for FAD, 500-clip prosodic feature matrices for PSD, 300-utterance golden sets per language, scoring code under MIT, and centroids under CC-BY. Formal MOS-correlation is deferred to v2; v1 reports five internal-consistency signals plus a native-audio sanity check.
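The FAD term in the abstract is the standard Fréchet distance between Gaussians fit to two embedding sets (Kilgour et al., 2019). A sketch assuming the released 1000-clip embedding matrices are already loaded as NumPy arrays, one row per clip:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_ref: np.ndarray, emb_test: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two embedding sets:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2))."""
    mu1, mu2 = emb_ref.mean(axis=0), emb_test.mean(axis=0)
    s1 = np.cov(emb_ref, rowvar=False)
    s2 = np.cov(emb_test, rowvar=False)
    covmean = sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):
        # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2.0 * covmean))
```

Identical distributions give a distance near zero; a mean shift or covariance mismatch between synthetic and native embeddings increases it.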
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PSP, an interpretable per-dimension accent benchmark for Indic TTS. It decomposes accent into six dimensions: retroflex collapse rate (RR), aspiration fidelity (AF), vowel-length fidelity (LF), and Tamil-zha fidelity (ZF), the first four measured via forced alignment plus native-centroid probes on Wav2Vec2-XLS-R layer-9 embeddings, plus corpus-level Fréchet Audio Distance (FAD) and prosodic signature divergence (PSD). The work benchmarks five systems (ElevenLabs v3, Cartesia Sonic-3, Sarvam Bulbul, Indic Parler-TTS, Praxy Voice) on Hindi/Telugu/Tamil pilot sets, reports that retroflex collapse increases monotonically with phonological difficulty (Hindi ~1%, Telugu ~40%, Tamil ~68%), that PSP orderings diverge from WER orderings, and that no system is Pareto-optimal across the six dimensions. It releases native reference centroids, embeddings, prosodic matrices, golden sets, and scoring code.
Significance. If the probes hold, PSP fills a clear gap by supplying phonologically grounded, per-feature accent metrics that standard WER/MOS miss for Indic languages. The explicit release of 500-clip native centroids per language, 1000-clip FAD embeddings, 500-clip PSD matrices, 300-utterance golden sets, and MIT-licensed code is a concrete strength that supports reproducibility and follow-on work. The reported divergence between PSP and WER rankings, together with the absence of any Pareto-optimal system, would usefully guide targeted accent improvements if externally validated.
major comments (1)
- [Phonological probe construction and evaluation] The headline claims (PSP ordering diverges from WER; no system is Pareto-optimal) are load-bearing on the four phonological dimensions (RR, AF, LF, ZF) being faithful proxies for the intended features. These are defined as distances to native-speaker centroids in Wav2Vec2-XLS-R layer-9 embeddings after forced alignment. The manuscript reports internal-consistency signals and a native-audio sanity check but defers formal correlation with human phonological judgments (retroflex collapse, aspiration fidelity, etc.) to v2. Given uneven Indic coverage in XLS-R pretraining and known risks of alignment errors on Tamil/Telugu, the observed divergences could reflect embedding geometry rather than true accent differences. A concrete human-rating correlation study on the targeted contrasts is required before the ordering claims can be treated as robust.
minor comments (2)
- [Abstract] The abstract states that v1 reports 'five internal-consistency signals' but does not enumerate them; adding a short table or explicit list in the main text would improve transparency and allow readers to assess their strength.
- [Results] The pilot sets are small (300 utterances per language); reporting per-system standard deviations or bootstrap intervals on the six PSP dimensions would help readers gauge the stability of the reported orderings.
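The bootstrap intervals suggested in the second minor comment are cheap to add. A dependency-free sketch of a percentile bootstrap over per-utterance scores for one PSP dimension (the data layout is an assumption, not taken from the paper):

```python
import random

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-utterance
    scores, e.g. one PSP dimension over a 300-utterance golden set.
    Resamples the utterances with replacement n_boot times."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot)]
    return lo, hi
```

Two systems whose intervals overlap on a dimension should not be ranked against each other on that dimension.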
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the single major comment below, clarifying the current evidence for the probes while acknowledging the need for further validation.
read point-by-point responses
Referee: [Phonological probe construction and evaluation] The headline claims (PSP ordering diverges from WER; no system is Pareto-optimal) are load-bearing on the four phonological dimensions (RR, AF, LF, ZF) being faithful proxies for the intended features. These are defined as distances to native-speaker centroids in Wav2Vec2-XLS-R layer-9 embeddings after forced alignment. The manuscript reports internal-consistency signals and a native-audio sanity check but defers formal correlation with human phonological judgments (retroflex collapse, aspiration fidelity, etc.) to v2. Given uneven Indic coverage in XLS-R pretraining and known risks of alignment errors on Tamil/Telugu, the observed divergences could reflect embedding geometry rather than true accent differences. A concrete human-rating correlation study on the targeted contrasts is required before the ordering claims can be treated as robust.
Authors: We agree that a formal human-rating correlation study on the targeted contrasts is the appropriate next step for establishing robustness and is planned for v2. In v1 we present the probes as an initial, interpretable decomposition supported by five internal-consistency signals and a native-audio sanity check. The observed monotonic increase in retroflex collapse rate with phonological difficulty (Hindi ~1%, Telugu ~40%, Tamil ~68%) is consistent with independent linguistic descriptions of these languages and would be unlikely if the metric were driven purely by embedding artifacts. The native-audio sanity check further shows that the same probes assign higher fidelity to held-out native recordings than to any of the evaluated systems. We acknowledge the risks arising from uneven XLS-R pretraining coverage and forced-alignment errors on Tamil/Telugu; the revised manuscript will expand the description of all five consistency checks, add explicit discussion of these potential confounds, and qualify the ordering claims accordingly. We do not treat the current PSP rankings as definitive without external validation.
Revision: partial
Circularity Check
No significant circularity; benchmark definitions and empirical claims are independent
full rationale
The PSP benchmark defines its six dimensions using established external tools (forced alignment, pre-trained Wav2Vec2-XLS-R embeddings, native-speaker centroids, and corpus-level distributional distances) without any self-referential fitting, renaming, or reduction of the target quantities to the inputs by construction. The reported findings—monotonic growth in retroflex collapse, divergence of PSP ordering from WER, and absence of Pareto-optimal systems—are direct empirical observations obtained by applying these fixed measures to the evaluated TTS outputs. No load-bearing self-citations, ansatzes smuggled via prior work, or uniqueness theorems appear in the derivation chain. The paper explicitly defers external human correlation to v2 while providing internal consistency checks, but this does not create circularity in the present claims.
Forward citations
Cited by 1 Pith paper
- LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation
  LASE eliminates the script-induced drop in speaker similarity (from 0.08-0.1 down to near zero) by training a language-adversarial projection head on top of frozen WavLM using synthesized cross-script pairs.
Reference graph
Works this paper leans on
- [1] A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino et al., "XLS-R: Self-supervised cross-lingual speech representation learning at scale," Interspeech, 2022.
- [2] T. Lertpetchpun, Y. Lee, T. Trachu, J. Lee, T. Feng, D. Byrd, and S. Narayanan, "Quantifying speaker embedding phonological rule interactions in accented speech synthesis," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026, arXiv:2601.14417.
- [3] V. P. T. Menta, "Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost," companion paper, arXiv preprint, 2026, https://github.com/praxelhq/praxy.
- [4] T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, "UTMOS: UTokyo-SaruLab system for VoiceMOS challenge 2022," Interspeech, 2022.
- [5] W.-C. Huang, E. Cooper, Y. Tsao, H.-M. Wang, T. Toda, and J. Yamagishi, "The VoiceMOS challenge 2022," Interspeech, 2022.
- [6] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, "Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms," Interspeech, 2019.
- [7] J.-W. Kim, D. Agarwal, and F. Cerina, "Understanding Fréchet speech distance for TTS evaluation," arXiv:2601.21386, 2026.
- [8] E. Grabe and E. L. Low, "Durational variability in speech and the rhythm class hypothesis," Papers in Laboratory Phonology 7, pp. 515-546, 2002.
- [9] T. Lertpetchpun, Y. Lee, J. Lee, T. Feng, D. Byrd, and S. Narayanan, "Learning-free L2-accented speech generation using phonological rules," arXiv:2603.07550, 2026.
- [10] J. Zhong, S. Liu, D. Wells, and K. Richmond, "Pairwise evaluation of accent similarity in speech synthesis," Interspeech, 2025, arXiv:2505.14410.
- [11] A. Sankar et al., "Rasmalai: A large-scale Indic speech dataset with accent and intonation descriptions," Interspeech, 2025.
- [12] AI4Bharat, "IndicVoices-R: A speaker-generalization TTS benchmark for Indic languages," NeurIPS Datasets and Benchmarks, 2024.
- [13] A. Conneau, M. Ma, S. Khanuja, Y. Zhang, V. Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, "FLEURS: Few-shot learning evaluation of universal representations of speech," IEEE Spoken Language Technology Workshop (SLT), 2022.
- [14] T. Javed, S. Doddapaneni, A. Raman et al., "Towards building ASR systems for the next billion users," AAAI, 2022.
- [15] PyTorch Team, "torchaudio: An audio library for PyTorch," https://pytorch.org/audio/stable/, 2024.
- [16] Y. Lacombe et al., "Parler-TTS: Open-source text-to-speech," https://github.com/huggingface/parler-tts, 2024.
- [17] Resemble AI, "Chatterbox Multilingual TTS," https://github.com/resemble-ai/chatterbox, 2025.
- [18] G. Kumar et al., "Towards building text-to-speech systems for the next billion users," ICASSP, 2023.