PSP: An Interpretable Per-Dimension Accent Benchmark for Indic Text-to-Speech
Pith reviewed 2026-05-07 14:19 UTC · model grok-4.3
The pith
The PSP benchmark shows that TTS systems leading on word error rate do not lead uniformly on retroflex or prosodic fidelity for Indic languages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PSP decomposes accent into retroflex collapse rate, aspiration fidelity, vowel-length fidelity, Tamil-zha fidelity, Fréchet Audio Distance, and prosodic signature divergence; when four systems are scored on Hindi, Telugu, and Tamil, retroflex collapse increases monotonically with phonological difficulty, PSP orderings diverge from WER orderings, and no system is Pareto-optimal across the six dimensions.
What carries the argument
PSP, the Phoneme Substitution Profile, which decomposes accent into six complementary dimensions measured via forced alignment plus native-speaker-centroid probes on Wav2Vec2-XLS-R layer-9 embeddings together with corpus-level distributional distances.
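The centroid-probe machinery can be pictured concretely. Below is a minimal sketch, assuming frame-level layer-9 embeddings have already been extracted and segmented by the forced aligner; the function names and the dental/retroflex decision rule are illustrative assumptions, not the paper's released scoring code:

```python
import numpy as np

def centroid(embs: np.ndarray) -> np.ndarray:
    """Mean of L2-normalised frame embeddings (one row per frame),
    e.g. over the 500 native reference clips for one phone class."""
    e = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    return e.mean(axis=0)

def cosine_dist(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance between two pooled embeddings."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retroflex_collapsed(seg_emb: np.ndarray,
                        retroflex_centroid: np.ndarray,
                        dental_centroid: np.ndarray) -> bool:
    """Flag a retroflex token as 'collapsed' if its pooled segment
    embedding sits closer to the dental centroid than to the
    retroflex one. The collapse rate is the fraction of retroflex
    tokens flagged this way."""
    pooled = seg_emb.mean(axis=0)
    return cosine_dist(pooled, dental_centroid) < cosine_dist(pooled, retroflex_centroid)
```

The same closest-centroid comparison, with different phone-class pairs, would cover aspiration fidelity, vowel-length fidelity, and zha fidelity.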
If this is right
- Retroflex collapse rate grows monotonically from Hindi to Telugu to Tamil.
- PSP rankings of the tested systems differ from their WER rankings.
- No single system achieves the lowest error on every one of the six dimensions at once.
- The released native reference centroids, embeddings, and scoring code enable direct comparison of future systems on the same dimensions.
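The third bullet, the paper's Pareto claim, is mechanically checkable once per-system scores exist. A minimal sketch with hypothetical scores (lower is better on every dimension); this illustrates the claim's logic only, not the paper's released code:

```python
def best_on_all(scores):
    """Return the name of a system that attains the lowest score on
    every dimension simultaneously (lower = better), or None if no
    such system exists, which is the situation the paper reports."""
    dims = range(len(next(iter(scores.values()))))
    for name, vals in scores.items():
        if all(vals[d] == min(s[d] for s in scores.values()) for d in dims):
            return name
    return None

# Hypothetical two-dimension example: each system wins one dimension,
# so no system is best on all of them at once.
split = {"system_a": [1.0, 2.0], "system_b": [2.0, 1.0]}
```

With real data the dictionary would hold six-element lists (RR, AF, LF, ZF, FAD, PSD) per system.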
Where Pith is reading between the lines
- Model developers could run PSP on new checkpoints to locate which specific phonological features still need work rather than chasing a single aggregate score.
- The same six-dimension decomposition could be applied to other language families that share retroflex or aspiration contrasts.
- Pairing PSP scores with WER in a joint dashboard would give a more complete picture than either metric alone.
- Extending the benchmark to additional Indic languages would test whether the monotonic difficulty pattern holds beyond the pilot set.
Load-bearing premise
Forced alignment combined with native-speaker-centroid acoustic probes on Wav2Vec2-XLS-R layer-9 embeddings accurately captures phonological features such as retroflex collapse and aspiration fidelity.
What would settle it
A controlled listening test in which native speakers rate accent naturalness on the same utterances and the ratings fail to correlate with the four phonological PSP scores.
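Scoring such a study reduces to a rank correlation between per-utterance PSP values and mean listener ratings. A dependency-free sketch (both input lists are hypothetical):

```python
def spearman_rho(x, y):
    """Spearman rank correlation between two equal-length score lists,
    e.g. per-utterance PSP values vs. mean native-listener ratings."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            # group ties and assign them their average rank
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) *
           sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den
```

A strong positive rho between listener ratings of nativeness and the phonological fidelity scores would support the probes; a near-zero rho would be the refutation described above.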
read the original abstract
Standard text-to-speech (TTS) evaluation measures intelligibility (WER, CER) and overall naturalness (MOS, UTMOS) but does not quantify accent. A synthesiser may score well on all four yet sound non-native on features that are phonemic in the target language. For Indic languages, these features include retroflex articulation, aspiration, vowel length, and the Tamil retroflex approximant (letter zha). We present PSP, the Phoneme Substitution Profile, an interpretable, per-phonological-dimension accent benchmark for Indic TTS. PSP decomposes accent into six complementary dimensions: retroflex collapse rate (RR), aspiration fidelity (AF), vowel-length fidelity (LF), Tamil-zha fidelity (ZF), Frechet Audio Distance (FAD), and prosodic signature divergence (PSD). The first four are measured via forced alignment plus native-speaker-centroid acoustic probes over Wav2Vec2-XLS-R layer-9 embeddings; the latter two are corpus-level distributional distances. In this v1 we benchmark four commercial and open-source systems (ElevenLabs v3, Cartesia Sonic-3, Sarvam Bulbul, Indic Parler-TTS) on Hindi, Telugu, and Tamil pilot sets, with a fifth system (Praxy Voice) included on all three languages, plus an R5->R6 case study on Telugu. Three findings: (i) retroflex collapse grows monotonically with phonological difficulty Hindi < Telugu < Tamil (~1%, ~40%, ~68%); (ii) PSP ordering diverges from WER ordering -- commercial WER-leaders do not uniformly lead on retroflex or prosodic fidelity; (iii) no single system is Pareto-optimal across all six dimensions. We release native reference centroids (500 clips per language), 1000-clip embeddings for FAD, 500-clip prosodic feature matrices for PSD, 300-utterance golden sets per language, scoring code under MIT, and centroids under CC-BY. Formal MOS-correlation is deferred to v2; v1 reports five internal-consistency signals plus a native-audio sanity check.
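The FAD term in the abstract is the standard Fréchet distance between Gaussians fit to two embedding sets (Kilgour et al., 2019). A sketch assuming the released 1000-clip embedding matrices are already loaded as NumPy arrays, one row per clip:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_ref: np.ndarray, emb_test: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two embedding sets:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2))."""
    mu1, mu2 = emb_ref.mean(axis=0), emb_test.mean(axis=0)
    s1 = np.cov(emb_ref, rowvar=False)
    s2 = np.cov(emb_test, rowvar=False)
    covmean = sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):
        # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2.0 * covmean))
```

Identical distributions give a distance near zero; a mean shift or covariance mismatch between synthetic and native embeddings increases it.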
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PSP, an interpretable per-dimension accent benchmark for Indic TTS. It decomposes accent into six dimensions: retroflex collapse rate (RR), aspiration fidelity (AF), vowel-length fidelity (LF), and Tamil-zha fidelity (ZF), the first four measured via forced alignment plus native-centroid probes on Wav2Vec2-XLS-R layer-9 embeddings, plus corpus-level Fréchet Audio Distance (FAD) and prosodic signature divergence (PSD). The work benchmarks five systems (ElevenLabs v3, Cartesia Sonic-3, Sarvam Bulbul, Indic Parler-TTS, Praxy Voice) on Hindi/Telugu/Tamil pilot sets, reports that retroflex collapse increases monotonically with phonological difficulty (Hindi ~1%, Telugu ~40%, Tamil ~68%), that PSP orderings diverge from WER orderings, and that no system is Pareto-optimal across the six dimensions. It releases native reference centroids, embeddings, prosodic matrices, golden sets, and scoring code.
Significance. If the probes hold, PSP fills a clear gap by supplying phonologically grounded, per-feature accent metrics that standard WER/MOS miss for Indic languages. The explicit release of 500-clip native centroids per language, 1000-clip FAD embeddings, 500-clip PSD matrices, 300-utterance golden sets, and MIT-licensed code is a concrete strength that supports reproducibility and follow-on work. The reported divergence between PSP and WER rankings, together with the absence of any Pareto-optimal system, would usefully guide targeted accent improvements if externally validated.
major comments (1)
- [Phonological probe construction and evaluation] The headline claims (PSP ordering diverges from WER; no system is Pareto-optimal) are load-bearing on the four phonological dimensions (RR, AF, LF, ZF) being faithful proxies for the intended features. These are defined as distances to native-speaker centroids in Wav2Vec2-XLS-R layer-9 embeddings after forced alignment. The manuscript reports internal-consistency signals and a native-audio sanity check but defers formal correlation with human phonological judgments (retroflex collapse, aspiration fidelity, etc.) to v2. Given uneven Indic coverage in XLS-R pretraining and known risks of alignment errors on Tamil/Telugu, the observed divergences could reflect embedding geometry rather than true accent differences. A concrete human-rating correlation study on the targeted contrasts is required before the ordering claims can be treated as robust.
minor comments (2)
- [Abstract] The abstract states that v1 reports 'five internal-consistency signals' but does not enumerate them; adding a short table or explicit list in the main text would improve transparency and allow readers to assess their strength.
- [Results] The pilot sets are small (300 utterances per language); reporting per-system standard deviations or bootstrap intervals on the six PSP dimensions would help readers gauge the stability of the reported orderings.
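The bootstrap intervals suggested in the second minor comment are cheap to add. A dependency-free sketch of a percentile bootstrap over per-utterance scores for one PSP dimension (the data layout is an assumption, not taken from the paper):

```python
import random

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-utterance
    scores, e.g. one PSP dimension over a 300-utterance golden set.
    Resamples the utterances with replacement n_boot times."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot)]
    return lo, hi
```

Two systems whose intervals overlap on a dimension should not be ranked against each other on that dimension.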
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the single major comment below, clarifying the current evidence for the probes while acknowledging the need for further validation.
read point-by-point responses
Referee: [Phonological probe construction and evaluation] The headline claims (PSP ordering diverges from WER; no system is Pareto-optimal) are load-bearing on the four phonological dimensions (RR, AF, LF, ZF) being faithful proxies for the intended features. These are defined as distances to native-speaker centroids in Wav2Vec2-XLS-R layer-9 embeddings after forced alignment. The manuscript reports internal-consistency signals and a native-audio sanity check but defers formal correlation with human phonological judgments (retroflex collapse, aspiration fidelity, etc.) to v2. Given uneven Indic coverage in XLS-R pretraining and known risks of alignment errors on Tamil/Telugu, the observed divergences could reflect embedding geometry rather than true accent differences. A concrete human-rating correlation study on the targeted contrasts is required before the ordering claims can be treated as robust.
Authors: We agree that a formal human-rating correlation study on the targeted contrasts is the appropriate next step for establishing robustness and is planned for v2. In v1 we present the probes as an initial, interpretable decomposition supported by five internal-consistency signals and a native-audio sanity check. The observed monotonic increase in retroflex collapse rate with phonological difficulty (Hindi ~1%, Telugu ~40%, Tamil ~68%) is consistent with independent linguistic descriptions of these languages and would be unlikely if the metric were driven purely by embedding artifacts. The native-audio sanity check further shows that the same probes assign higher fidelity to held-out native recordings than to any of the evaluated systems. We acknowledge the risks arising from uneven XLS-R pretraining coverage and forced-alignment errors on Tamil/Telugu; the revised manuscript will expand the description of all five consistency checks, add explicit discussion of these potential confounds, and qualify the ordering claims accordingly. We do not treat the current PSP rankings as definitive without external validation.
Revision: partial
Circularity Check
No significant circularity; benchmark definitions and empirical claims are independent
full rationale
The PSP benchmark defines its six dimensions using established external tools (forced alignment, pre-trained Wav2Vec2-XLS-R embeddings, native-speaker centroids, and corpus-level distributional distances) without any self-referential fitting, renaming, or reduction of the target quantities to the inputs by construction. The reported findings—monotonic growth in retroflex collapse, divergence of PSP ordering from WER, and absence of Pareto-optimal systems—are direct empirical observations obtained by applying these fixed measures to the evaluated TTS outputs. No load-bearing self-citations, ansatzes smuggled via prior work, or uniqueness theorems appear in the derivation chain. The paper explicitly defers external human correlation to v2 while providing internal consistency checks, but this does not create circularity in the present claims.
Forward citations
Cited by 1 Pith paper
- LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation
  LASE eliminates the script-induced drop in speaker similarity (from 0.08-0.1 down to near zero) by training a language-adversarial projection head on top of frozen WavLM using synthesized cross-script pairs.
Reference graph
Works this paper leans on
- [1] A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino et al., "XLS-R: Self-supervised cross-lingual speech representation learning at scale," Interspeech, 2022.
- [2] T. Lertpetchpun, Y. Lee, T. Trachu, J. Lee, T. Feng, D. Byrd, and S. Narayanan, "Quantifying speaker embedding phonological rule interactions in accented speech synthesis," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026, arXiv:2601.14417.
- [3] V. P. T. Menta, "Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost," companion paper, arXiv preprint, 2026, https://github.com/praxelhq/praxy.
- [4] T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, "UTMOS: UTokyo-SaruLab system for VoiceMOS challenge 2022," Interspeech, 2022.
- [5] W.-C. Huang, E. Cooper, Y. Tsao, H.-M. Wang, T. Toda, and J. Yamagishi, "The VoiceMOS challenge 2022," Interspeech, 2022.
- [6] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, "Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms," Interspeech, 2019.
- [7] J.-W. Kim, D. Agarwal, and F. Cerina, "Understanding Fréchet speech distance for TTS evaluation," arXiv:2601.21386, 2026.
- [8] E. Grabe and E. L. Low, "Durational variability in speech and the rhythm class hypothesis," Papers in Laboratory Phonology 7, pp. 515-546, 2002.
- [9] T. Lertpetchpun, Y. Lee, J. Lee, T. Feng, D. Byrd, and S. Narayanan, "Learning-free L2-accented speech generation using phonological rules," arXiv:2603.07550, 2026.
- [10] J. Zhong, S. Liu, D. Wells, and K. Richmond, "Pairwise evaluation of accent similarity in speech synthesis," Interspeech, 2025, arXiv:2505.14410.
- [11] A. Sankar et al., "Rasmalai: A large-scale Indic speech dataset with accent and intonation descriptions," Interspeech, 2025.
- [12] AI4Bharat, "IndicVoices-R: A speaker-generalization TTS benchmark for Indic languages," NeurIPS Datasets and Benchmarks, 2024.
- [13] A. Conneau, M. Ma, S. Khanuja, Y. Zhang, V. Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, "FLEURS: Few-shot learning evaluation of universal representations of speech," IEEE Spoken Language Technology Workshop (SLT), 2022.
- [14] T. Javed, S. Doddapaneni, A. Raman et al., "Towards building ASR systems for the next billion users," AAAI, 2022.
- [15] PyTorch Team, "torchaudio: An audio library for PyTorch," https://pytorch.org/audio/stable/, 2024.
- [16] Y. Lacombe et al., "Parler-TTS: Open-source text-to-speech," https://github.com/huggingface/parler-tts, 2024.
- [17] Resemble AI, "Chatterbox Multilingual TTS," https://github.com/resemble-ai/chatterbox, 2025.
- [18] G. Kumar et al., "Towards building text-to-speech systems for the next billion users," ICASSP, 2023.