Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages
Pith reviewed 2026-05-09 22:04 UTC · model grok-4.3
The pith
A controlled pairwise evaluation framework allows reliable ranking of TTS systems for ten Indian languages by collecting multi-dimensional judgments from native speakers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present a controlled multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Applying it to 5,000+ native and code-mixed sentences across 10 Indic languages and 7 TTS systems yields over 120,000 pairwise comparisons from more than 1,900 native raters, enabling a Bradley-Terry leaderboard, SHAP-based preference interpretation, and analysis of model strengths across perceptual dimensions.
What carries the argument
The controlled multidimensional pairwise evaluation framework, which pairs sentences with linguistic controls and collects annotations on six perceptual dimensions plus overall preference.
If this is right
- Bradley-Terry modeling can construct a stable multilingual leaderboard from the pairwise data.
- SHAP analysis can reveal which perceptual dimensions drive human preferences for each model.
- Models show distinct strengths and trade-offs, such as high intelligibility but lower expressiveness in some systems.
- Large-scale native rater data supports reliable comparison despite perceptual variance when linguistic controls are applied.
- The framework identifies specific areas for TTS improvement in Indic languages.
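To make the leaderboard step concrete, here is a minimal sketch of fitting Bradley-Terry strengths from (winner, loser) pairs with the classic minorization-maximization updates. It is an illustration only: the function name `fit_bradley_terry`, the toy data, and the system labels are assumptions, and this summary does not say which solver the authors used or how ties were handled.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, n_iter=200, tol=1e-9):
    """Fit Bradley-Terry strengths with MM (Zermelo-style) updates.

    comparisons: iterable of (winner, loser) pairs; ties are not modelled.
    Returns a dict mapping system name -> strength, normalised to sum to 1.
    """
    wins = defaultdict(float)         # total wins per system
    pair_counts = defaultdict(float)  # comparisons per unordered pair
    systems = set()
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        systems.update((winner, loser))
    systems = sorted(systems)

    strength = {s: 1.0 / len(systems) for s in systems}
    for _ in range(n_iter):
        updated = {}
        for i in systems:
            denom = 0.0
            for j in systems:
                if i == j:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0.0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            # MM update: wins_i divided by the expected-comparison term.
            updated[i] = wins[i] / denom if denom else strength[i]
        total = sum(updated.values())
        updated = {s: v / total for s, v in updated.items()}
        delta = max(abs(updated[s] - strength[s]) for s in systems)
        strength = updated
        if delta < tol:
            break
    return strength


if __name__ == "__main__":
    # Hypothetical toy data: three systems, ten comparisons per pair.
    toy = ([("A", "B")] * 8 + [("B", "A")] * 2
           + [("B", "C")] * 7 + [("C", "B")] * 3
           + [("A", "C")] * 9 + [("C", "A")] * 1)
    scores = fit_bradley_terry(toy)
    for name in sorted(scores, key=scores.get, reverse=True):
        print(name, round(scores[name], 3))
```

Sorting systems by fitted strength gives the leaderboard order. A system with zero wins collapses to strength 0 under this plain update, so a real pipeline would likely add regularisation or a prior.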
Where Pith is reading between the lines
- Adapting this framework to other under-resourced languages could standardize TTS evaluation globally.
- The chosen perceptual dimensions may need testing against downstream tasks like user satisfaction in voice assistants.
- Future work might explore how these preferences correlate with actual usage patterns in daily communication.
- Combining this human data with automated metrics could create hybrid evaluation systems that better predict real-world performance.
Load-bearing premise
That the collected pairwise comparisons, even with high variance in speech perception, produce reliable and consistent signals for leaderboard construction and preference interpretation once linguistic factors are controlled.
What would settle it
A replication study with independent raters or new sentence samples that produced a substantially reordered leaderboard or contradictory SHAP feature importances would undermine the claimed reliability of the signals.
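One way to quantify "substantially reordered" is a rank correlation between the original and the replicated leaderboard. The sketch below computes Kendall's tau for two tie-free rankings of the same systems. The function name and the system labels are hypothetical, and the threshold at which a reordering counts as substantial is a judgment call that this summary does not specify.

```python
from itertools import combinations


def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau between two tie-free rankings of the same items.

    Each ranking is a list ordered from best to worst.
    Returns a value in [-1, 1]; 1 means identical order, -1 means reversed.
    """
    if set(ranking_a) != set(ranking_b) or len(ranking_a) != len(set(ranking_a)):
        raise ValueError("rankings must contain the same unique items")
    if len(ranking_a) < 2:
        raise ValueError("need at least two items to compare orderings")

    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}

    concordant = discordant = 0
    for x, y in combinations(ranking_a, 2):
        # Positive product: both rankings order x and y the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


if __name__ == "__main__":
    original = ["sys1", "sys2", "sys3", "sys4"]     # hypothetical leaderboard
    replication = ["sys2", "sys1", "sys3", "sys4"]  # one adjacent swap
    print(kendall_tau(original, replication))
```

A tau near 1 would indicate that the replication preserves the ordering, while values near 0 or below would point to the kind of reordering described above.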
Original abstract
Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to Text-to-Speech (TTS) introduces high variance due to linguistic diversity and the multidimensional nature of speech perception. We present a controlled multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Using 5K+ native and code-mixed sentences across 10 Indic languages, we evaluate 7 state-of-the-art TTS systems and collect over 120K pairwise comparisons from over 1,900 native raters. In addition to overall preference, raters provide judgments across 6 perceptual dimensions: intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations. Using Bradley-Terry modeling, we construct a multilingual leaderboard, interpret human preference using SHAP analysis, and analyze leaderboard reliability alongside model strengths and trade-offs across perceptual dimensions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a controlled multidimensional pairwise evaluation framework for multilingual TTS in 10 Indic languages. It evaluates 7 state-of-the-art TTS systems on 5K+ native and code-mixed sentences, collecting over 120K pairwise comparisons from over 1,900 native raters. Judgments cover overall preference plus six perceptual dimensions (intelligibility, expressiveness, voice quality, liveliness, noise, hallucinations). Bradley-Terry modeling is used to build a multilingual leaderboard, SHAP analysis interprets preference drivers, and the work examines leaderboard reliability along with model strengths and trade-offs across dimensions.
Significance. If the judgment signals prove reliable after controls, this work provides the first large-scale, linguistically grounded preference dataset for TTS in underrepresented Indic languages. It could inform voice-first application design in India and offer a reusable framework for multidimensional evaluation in other multilingual settings, particularly where perceptual variance is high.
major comments (2)
- [Abstract] Abstract and evaluation framework description: no quantitative checks on signal-to-noise (e.g., intra-class correlation on repeated pairs, BT log-likelihood on held-out data, or rank stability across subsamples) are reported despite explicit acknowledgment of high variance from linguistic diversity and multidimensional perception. This leaves the stability of the Bradley-Terry leaderboard and the validity of subsequent SHAP attributions under-supported.
- [Evaluation Framework] Data collection pipeline: the manuscript provides insufficient detail on the precise linguistic controls and rater-bias mitigation steps (e.g., rater screening, balancing of code-mixed vs. native sentences, or filtering of low-consistency raters) applied to the 120K comparisons. Without these, it is unclear whether the collected signals are sufficiently consistent to support the central claims about reliable leaderboard construction and interpretable preference drivers.
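The rank-stability check named in the first major comment can be sketched as follows: refit the ranking on random subsamples of the comparisons and report how often the full-data order is reproduced. Everything here is an assumption for illustration. The default ranker uses raw win rates as a simple stand-in (a Bradley-Terry fit can be passed in through `rank_fn`), and the subsample fraction and count are arbitrary.

```python
import random
from collections import Counter


def rank_by_win_rate(comparisons):
    """Rank systems by raw win rate (a simple stand-in for a BT fit)."""
    wins, games = Counter(), Counter()
    for winner, loser in comparisons:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    # Sort by descending win rate, breaking ties alphabetically.
    return sorted(games, key=lambda s: (-wins[s] / games[s], s))


def rank_stability(comparisons, rank_fn=rank_by_win_rate,
                   n_subsamples=100, fraction=0.5, seed=0):
    """Share of random subsamples whose ranking equals the full-data ranking.

    Returns (reference_ranking, stability) with stability in [0, 1].
    """
    comparisons = list(comparisons)
    rng = random.Random(seed)
    reference = rank_fn(comparisons)
    k = max(1, int(len(comparisons) * fraction))
    matches = 0
    for _ in range(n_subsamples):
        subsample = rng.sample(comparisons, k)  # sampling without replacement
        if rank_fn(subsample) == reference:
            matches += 1
    return reference, matches / n_subsamples


if __name__ == "__main__":
    # Hypothetical, well-separated toy data for three systems.
    toy = ([("A", "B")] * 90 + [("B", "A")] * 10
           + [("B", "C")] * 90 + [("C", "B")] * 10
           + [("A", "C")] * 95 + [("C", "A")] * 5)
    print(rank_stability(toy))
```

Exact-match stability is a strict criterion. With seven systems and close scores, a pairwise measure such as a rank correlation per subsample may be more informative.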
minor comments (2)
- [Abstract] The abstract would benefit from a brief statement of the exact sentence distribution per language and the number of dimensions rated per pair to improve immediate clarity.
- [Methods] Notation for the six perceptual dimensions and their mapping to the overall preference judgment could be made more explicit in the methods to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the quantitative support for signal reliability and to provide greater transparency on the data collection pipeline.
- Referee: [Abstract] Abstract and evaluation framework description: no quantitative checks on signal-to-noise (e.g., intra-class correlation on repeated pairs, BT log-likelihood on held-out data, or rank stability across subsamples) are reported despite explicit acknowledgment of high variance from linguistic diversity and multidimensional perception. This leaves the stability of the Bradley-Terry leaderboard and the validity of subsequent SHAP attributions under-supported.
  Authors: We agree that explicit quantitative checks on signal-to-noise would better support the claims given the acknowledged variance. The initial submission did not include these metrics. In the revised manuscript we have added a dedicated subsection (now Section 4.3) reporting intra-class correlation on repeated pairs, Bradley-Terry log-likelihood on held-out comparisons, and rank stability across multiple data subsamples. These results are summarized in the updated abstract and demonstrate that the leaderboard remains stable and that the SHAP attributions rest on reliable preference signals.
  Revision: yes
- Referee: [Evaluation Framework] Data collection pipeline: the manuscript provides insufficient detail on the precise linguistic controls and rater-bias mitigation steps (e.g., rater screening, balancing of code-mixed vs. native sentences, or filtering of low-consistency raters) applied to the 120K comparisons. Without these, it is unclear whether the collected signals are sufficiently consistent to support the central claims about reliable leaderboard construction and interpretable preference drivers.
  Authors: We thank the referee for noting the need for additional detail. The original manuscript described the pipeline at a summary level. In the revision we have expanded Section 3 to specify the rater screening process (native-speaker qualification via proficiency checks), the balancing protocol (equal proportions of native and code-mixed sentences per language), and the post-collection filtering of low-consistency raters (those failing repeated-pair agreement thresholds). A new table and accompanying text now document the final rater pool and consistency statistics, clarifying how these steps support the reliability of the collected signals.
  Revision: yes
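The repeated-pair filtering mentioned in the second response could look roughly like the sketch below. It scores each rater by the share of repeated pairs on which all of their choices agree, then drops raters under a threshold. The record layout `(rater_id, pair_id, choice)`, the 0.7 threshold, and the decision to keep raters who never saw a repeated pair are all assumptions; the summary does not give the authors' actual criteria.

```python
from collections import defaultdict


def rater_consistency(judgments):
    """Per-rater agreement rate on pairs the rater judged more than once.

    judgments: iterable of (rater_id, pair_id, choice) records.
    Raters with no repeated pairs are absent from the result.
    """
    choices_by_key = defaultdict(list)
    for rater_id, pair_id, choice in judgments:
        choices_by_key[(rater_id, pair_id)].append(choice)

    repeated = defaultdict(int)    # repeated pairs seen per rater
    consistent = defaultdict(int)  # repeated pairs answered identically
    for (rater_id, _pair_id), choices in choices_by_key.items():
        if len(choices) < 2:
            continue
        repeated[rater_id] += 1
        if len(set(choices)) == 1:
            consistent[rater_id] += 1
    return {r: consistent[r] / repeated[r] for r in repeated}


def filter_low_consistency(judgments, threshold=0.7):
    """Drop all judgments from raters whose agreement rate is below threshold.

    Assumption: raters without any repeated pair cannot be scored and are kept.
    Returns (kept_judgments, dropped_rater_ids).
    """
    judgments = list(judgments)
    scores = rater_consistency(judgments)
    dropped = {r for r, score in scores.items() if score < threshold}
    kept = [j for j in judgments if j[0] not in dropped]
    return kept, dropped


if __name__ == "__main__":
    # Hypothetical records: r2 contradicts itself on pair p1.
    data = [
        ("r1", "p1", "A"), ("r1", "p1", "A"),
        ("r2", "p1", "A"), ("r2", "p1", "B"),
        ("r3", "p2", "B"),
    ]
    print(filter_low_consistency(data))
```

The threshold trades data volume against noise, so it would normally be reported alongside the size of the final rater pool.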
Circularity Check
No circularity: purely empirical study with new human preference data
The paper collects 120K+ new pairwise judgments from over 1,900 native raters on 5K+ sentences across 10 Indic languages, then applies standard Bradley-Terry modeling to build a leaderboard and SHAP to interpret dimension-specific preferences. No equations, parameters, or derivations reduce the reported leaderboard or SHAP attributions to fitted values or definitions taken from the paper's own inputs. The framework relies on fresh crowdsourced data rather than self-referential fitting, self-citation chains, or renaming of prior results. All load-bearing steps (data collection, BT fitting, SHAP) are externally grounded in the new annotations and remain falsifiable against those annotations.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Bradley-Terry model assumptions hold for the collected pairwise TTS preferences
- domain assumption: Native rater judgments on the six dimensions provide perceptually grounded signals after linguistic controls
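For reference, the first assumption can be stated in the textbook form of the Bradley-Terry model. The notation below (scores \theta_i, counts w_{ij}) is chosen here for illustration; the summary does not give the paper's exact parameterization or its treatment of ties.

```latex
% Probability that system i is preferred over system j,
% given one latent score \theta per system:
P(i \succ j) = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}}
             = \sigma(\theta_i - \theta_j)

% Log-likelihood of the observed comparisons, where w_{ij} is the
% number of times i was preferred over j:
\ell(\theta) = \sum_{i \neq j} w_{ij} \, \log \sigma(\theta_i - \theta_j)
```

The model assumes that comparisons are independent and that a single scalar per system explains all pairwise outcomes. Strong rater-specific or sentence-specific effects would violate this.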