Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

Aaditya Pareek; Adish Pandya; Ashwin Sankar; Deepon Halder; Gaurav Yadav; Ishvinder Sethi; Kartik Rajput; Mitesh M Khapra; Mohammed Safi Ur Rahman Khan; Nikhil Narasimhan

arxiv: 2604.21481 · v2 · pith:LT3QH4IWnew · submitted 2026-04-23 · 💻 cs.CL


Srija Anand, Ashwin Sankar, Ishvinder Sethi, Aaditya Pareek, Kartik Rajput, Gaurav Yadav, Nikhil Narasimhan, Adish Pandya, Deepon Halder, Mohammed Safi Ur Rahman Khan, Praveen S V, Shobhit Banga, Mitesh M Khapra


Pith reviewed 2026-05-09 22:04 UTC · model grok-4.3

classification 💻 cs.CL

keywords TTS · pairwise evaluation · Indian languages · preference analysis · multilingual · speech quality · Bradley-Terry · SHAP


The pith

A controlled pairwise evaluation framework allows reliable ranking of TTS systems for ten Indian languages by collecting multi-dimensional judgments from native speakers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method for evaluating text-to-speech systems in Indian languages that accounts for linguistic diversity by using controlled sentence sets and asking raters to state preferences on specific perceptual qualities. Over 120,000 pairwise comparisons from more than 1,900 native listeners, covering ten languages and seven systems, supply the data for the ranking models. This matters because it offers a scalable way to understand what people actually prefer in voice output for languages where automated metrics fall short. The approach combines overall preference with judgments on intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations to build a leaderboard and analyze trade-offs.

Core claim

The authors present a controlled multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Applying it to 5,000+ sentences in 10 Indic languages and 7 TTS systems yields over 120,000 comparisons from more than 1,900 native raters, enabling a Bradley-Terry leaderboard, SHAP-based preference interpretation, and analysis of model strengths across perceptual dimensions.

What carries the argument

The controlled multidimensional pairwise evaluation framework, which pairs sentences with linguistic controls and collects annotations on six perceptual dimensions plus overall preference.
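A minimal sketch of how one such annotation might be stored, and how per-axis win rates could be derived from it. The record layout, the field names, the "a"/"b"/"tie" coding, and the choice to exclude ties from win rates are all assumptions made for illustration; the paper's actual schema is not shown here.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# The six perceptual dimensions named in the abstract.
AXES = ("intelligibility", "expressiveness", "voice_quality",
        "liveliness", "noise", "hallucinations")

@dataclass
class Comparison:
    """One rater judgment on one sentence rendered by two systems (hypothetical schema)."""
    language: str
    system_a: str
    system_b: str
    overall: str                                # "a", "b", or "tie"
    axes: dict = field(default_factory=dict)    # axis name -> "a", "b", or "tie"

def axis_win_rates(records):
    """Return {(system, axis): wins / non-tie comparisons} for every system and axis seen."""
    wins = defaultdict(float)
    totals = defaultdict(float)
    for r in records:
        for axis, verdict in r.axes.items():
            if verdict == "tie":
                continue  # assumption: ties carry no win-rate information
            winner = r.system_a if verdict == "a" else r.system_b
            for system in (r.system_a, r.system_b):
                totals[(system, axis)] += 1
            wins[(winner, axis)] += 1
    return {key: wins[key] / totals[key] for key in totals}
```

Keeping the overall verdict separate from the axis verdicts makes it possible to rank systems on either one, which is what lets trade-offs between dimensions become visible.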

If this is right

Bradley-Terry modeling can construct a stable multilingual leaderboard from the pairwise data.
SHAP analysis can reveal which perceptual dimensions drive human preferences for each model.
Models show distinct strengths and trade-offs, such as high intelligibility but lower expressiveness in some systems.
Large-scale native rater data supports reliable comparison despite perceptual variance when linguistic controls are applied.
The framework identifies specific areas for TTS improvement in Indic languages.
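To make the leaderboard step concrete, here is a minimal sketch of fitting Bradley-Terry strengths from (winner, loser) pairs with the classic minorize-maximize (Zermelo) update. It is a generic textbook procedure, not the authors' implementation; the function name, the dropping of ties, and the normalization to sum to one are assumptions of this sketch.

```python
from collections import defaultdict

def fit_bradley_terry(comparisons, n_iter=200, tol=1e-9):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Assumes ties were removed beforehand and that no system is compared
    with itself. Returns {system: strength}, normalized to sum to 1.
    """
    wins = defaultdict(float)          # total wins per system
    pair_counts = defaultdict(float)   # number of comparisons per unordered pair
    systems = set()
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        systems.update((winner, loser))

    strength = {s: 1.0 / len(systems) for s in systems}
    for _ in range(n_iter):
        updated = {}
        for s in systems:
            # MM update: wins_s / sum over opponents of n_sj / (p_s + p_j)
            denom = 0.0
            for pair, n in pair_counts.items():
                if s in pair:
                    (other,) = pair - {s}
                    denom += n / (strength[s] + strength[other])
            updated[s] = wins[s] / denom if denom > 0 else strength[s]
        total = sum(updated.values())
        updated = {s: v / total for s, v in updated.items()}
        delta = max(abs(updated[s] - strength[s]) for s in systems)
        strength = updated
        if delta < tol:
            break
    return strength
```

Sorting the returned dictionary by value gives a leaderboard. A system that never wins receives strength zero under this update, so a real pipeline would likely need regularization or a prior, which this sketch leaves out.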

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Adapting this framework to other under-resourced languages could standardize TTS evaluation globally.
The chosen perceptual dimensions may need testing against downstream tasks like user satisfaction in voice assistants.
Future work might explore how these preferences correlate with actual usage patterns in daily communication.
Combining this human data with automated metrics could create hybrid evaluation systems that better predict real-world performance.

Load-bearing premise

That the collected pairwise comparisons, even with high variance in speech perception, produce reliable and consistent signals for leaderboard construction and preference interpretation once linguistic factors are controlled.

What would settle it

A replication study with independent raters or new sentence samples that produces a substantially reordered leaderboard or contradictory SHAP feature importances would undermine the claim that the signals are reliable.
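A rough way to approximate such a check on existing data is a split-half test: divide the comparisons at random, rank the systems on each half, and measure how well the two rankings agree. The sketch below uses a plain win-rate ranking as a stand-in for a Bradley-Terry fit and splits by comparison rather than by rater; both are simplifying assumptions, and splitting by rater ID would be closer to the replication described above.

```python
import random
from itertools import combinations

def win_rate_ranking(comparisons):
    """Rank systems by win rate (stand-in for a fitted leaderboard); name breaks ties."""
    wins, games = {}, {}
    for winner, loser in comparisons:
        wins[winner] = wins.get(winner, 0) + 1
        wins.setdefault(loser, 0)
        games[winner] = games.get(winner, 0) + 1
        games[loser] = games.get(loser, 0) + 1
    return sorted(wins, key=lambda s: (-wins[s] / games[s], s))

def kendall_tau(rank_a, rank_b):
    """Kendall tau between two orderings of the same items (no ties)."""
    pos_a = {s: i for i, s in enumerate(rank_a)}
    pos_b = {s: i for i, s in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        agreement = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    n_pairs = concordant + discordant
    return (concordant - discordant) / n_pairs if n_pairs else 1.0

def split_half_stability(comparisons, n_splits=100, seed=0):
    """Mean Kendall tau between rankings from random halves of the data."""
    rng = random.Random(seed)
    data = list(comparisons)
    taus = []
    for _ in range(n_splits):
        rng.shuffle(data)
        half = len(data) // 2
        rank_1 = win_rate_ranking(data[:half])
        rank_2 = win_rate_ranking(data[half:])
        if set(rank_1) == set(rank_2):   # skip splits where a system is missing
            taus.append(kendall_tau(rank_1, rank_2))
    return sum(taus) / len(taus) if taus else float("nan")
```

A mean tau close to 1 indicates that the ordering survives resampling; values well below 1 would point to the kind of reordering that the paragraph above treats as evidence against reliability.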

Figures

Figures reproduced from arXiv: 2604.21481 by Aaditya Pareek, Adish Pandya, Ashwin Sankar, Deepon Halder, Gaurav Yadav, Ishvinder Sethi, Kartik Rajput, Mitesh M Khapra, Mohammed Safi Ur Rahman Khan, Nikhil Narasimhan, Praveen S V, Shobhit Banga, Srija Anand.

**Figure 2.** System ranks shift across benchmark domains. How do rankings change with input type? The paper examines leaderboard stability across the three subsets discussed in §3.1: Normalized, Symbolic, and Code-mixed.

**Figure 1.** Per-language rankings. GEMINI 2.5 PRO TTS ranks first in 9 of 10 languages, with near parity with ELEVEN LABS V3 in the case of Marathi. Rankings among ELEVEN LABS V3, SONIC 3, and BULBUL V3 BETA vary across languages with relatively small differences, while INDIC F5 consistently ranks at or near the bottom. (Plot: languages bn, gu, hi, kn, ml, mr, or, ta, te, ur on one axis; score ticks from 700 to 1200 on the other; legend includes Gemini 2.5 Pro TTS, Eleven Labs v3, …)

**Figure 3.** Multi-dimensional perceptual performance of TTS systems measured by average win rates across six axes. Can granular judgments predict overall preference? Overall preference provides a reliable ranking, but it does not reveal how raters combine multiple perceptual cues into a single judgment. The authors therefore test whether overall preference can be reconstructed from granular axis-level evaluations. For each co…

**Figure 4.** Mean absolute SHAP values showing the relative contribution of each perceptual axis to overall preference. Which axes drive preference? To understand which perceptual attributes most strongly influence listener preference, the authors perform SHAP [34] (SHapley Additive exPlanations) analysis (…
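The quantity behind Figure 4 can be illustrated with a brute-force computation of exact Shapley values over the six axes. This is a sketch of the underlying idea, not the paper's SHAP pipeline: it does not call the `shap` library, and the linear scorer, its weights, the +1/0/-1 coding of axis verdicts, and the all-zero baseline used in the usage note are hypothetical.

```python
from itertools import combinations
from math import factorial

AXES = ["intelligibility", "expressiveness", "voice_quality",
        "liveliness", "noise", "hallucinations"]

def shapley_values(model, x, baseline):
    """Exact Shapley attribution of model(x) - model(baseline) to each feature.

    Features outside a coalition are replaced by their baseline value.
    Cost grows as 2^n, which is fine for the six axes considered here.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight |S|! (n - |S| - 1)! / n! for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def mean_abs_attribution(model, rows, baseline):
    """Average |Shapley value| per axis over many comparisons, as in a SHAP bar plot."""
    totals = [0.0] * len(baseline)
    for row in rows:
        for i, v in enumerate(shapley_values(model, row, baseline)):
            totals[i] += abs(v)
    return [t / len(rows) for t in totals]
```

For a linear scorer such as `lambda v: sum(w * a for w, a in zip(weights, v))`, each attribution reduces to `weights[i] * (x[i] - baseline[i])`, which gives a quick sanity check before applying the same idea to a non-linear preference model.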

Original abstract

Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to Text-to-Speech (TTS) introduces high variance due to linguistic diversity and the multidimensional nature of speech perception. We present a controlled multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Using 5K+ native and code-mixed sentences across 10 Indic languages, we evaluate 7 state-of-the-art TTS systems and collect over 120K pairwise comparisons from over 1900 native raters. In addition to overall preference, raters provide judgments across 6 perceptual dimensions: intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations. Using Bradley-Terry modeling, we construct a multilingual leaderboard, interpret human preference using SHAP analysis, and analyze leaderboard reliability alongside model strengths and trade-offs across perceptual dimensions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper supplies a sizable new dataset of pairwise TTS preferences across ten Indic languages with multidimensional breakdowns and SHAP analysis, though the stability of the derived rankings remains an open question given speech perception variance.


The main point is that this work collects over 120,000 pairwise comparisons from more than 1,900 native raters on seven TTS systems in ten Indic languages, then breaks the results down by six perceptual dimensions and uses SHAP to show what drives overall preference. That scale and structure are larger than most prior TTS evaluations for these languages.

The controlled sentence set mixes native and code-mixed text, which helps ground the judgments in realistic usage. Bradley-Terry modeling turns the pairs into a multilingual leaderboard, and the dimension-specific ratings surface trade-offs such as voice quality versus expressiveness. Native raters and the focus on issues such as hallucinations add practical value. The setup shows clear thinking about linguistic diversity and avoids the usual English-centric shortcuts.

The soft spot is signal quality. Speech perception carries high variance, which the abstract itself attributes to linguistic diversity and the multidimensional nature of speech perception. Without reported checks on inter-rater agreement, repeated-pair consistency, or Bradley-Terry model fit stability, the rankings and SHAP attributions could reflect sampling noise rather than true system differences. If the full paper includes those metrics and shows they hold, the claims strengthen considerably.

This is for researchers building or benchmarking TTS for Indic or other high-diversity languages. Readers who need concrete preference data and dimension-level insights will find usable material here. It deserves peer review because the data collection is substantial and the evaluation questions are relevant, even if the analysis would benefit from tighter validation on reliability.

Referee Report

2 major / 2 minor

Summary. The paper presents a controlled multidimensional pairwise evaluation framework for multilingual TTS in 10 Indic languages. It evaluates 7 state-of-the-art TTS systems on 5K+ native and code-mixed sentences, collecting over 120K pairwise comparisons from over 1,900 native raters. Judgments cover overall preference plus six perceptual dimensions (intelligibility, expressiveness, voice quality, liveliness, noise, hallucinations). Bradley-Terry modeling is used to build a multilingual leaderboard, SHAP analysis interprets preference drivers, and the work examines leaderboard reliability along with model strengths and trade-offs across dimensions.

Significance. If the judgment signals prove reliable after controls, this work provides the first large-scale, linguistically grounded preference dataset for TTS in underrepresented Indic languages. It could inform voice-first application design in India and offer a reusable framework for multidimensional evaluation in other multilingual settings, particularly where perceptual variance is high.

major comments (2)

[Abstract] Abstract and evaluation framework description: no quantitative checks on signal-to-noise (e.g., intra-class correlation on repeated pairs, BT log-likelihood on held-out data, or rank stability across subsamples) are reported despite explicit acknowledgment of high variance from linguistic diversity and multidimensional perception. This leaves the stability of the Bradley-Terry leaderboard and the validity of subsequent SHAP attributions under-supported.
[Evaluation Framework] Data collection pipeline: the manuscript provides insufficient detail on the precise linguistic controls and rater-bias mitigation steps (e.g., rater screening, balancing of code-mixed vs. native sentences, or filtering of low-consistency raters) applied to the 120K comparisons. Without these, it is unclear whether the collected signals are sufficiently consistent to support the central claims about reliable leaderboard construction and interpretable preference drivers.
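One of the checks named in the first major comment, Bradley-Terry log-likelihood on held-out data, can be sketched as follows. The function name and the comparison against a coin-flip baseline are illustrative choices, and the strengths are assumed to come from a model fitted on a separate training split.

```python
import math

def mean_log_likelihood(strength, comparisons):
    """Average log-probability that fitted strengths assign to held-out outcomes.

    strength: {system: positive strength} fitted on training data.
    comparisons: held-out (winner, loser) pairs, ties excluded (assumption).
    """
    total = 0.0
    count = 0
    for winner, loser in comparisons:
        p_win = strength[winner] / (strength[winner] + strength[loser])
        total += math.log(p_win)
        count += 1
    return total / count

# A model that predicts 50/50 for every pair scores log(0.5) per comparison.
CHANCE_LEVEL = math.log(0.5)
```

A held-out score clearly above `CHANCE_LEVEL`, and close to the score on the training split, would suggest that the fitted ranking captures real preference signal instead of noise specific to one sample.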

minor comments (2)

[Abstract] The abstract would benefit from a brief statement of the exact sentence distribution per language and the number of dimensions rated per pair to improve immediate clarity.
[Methods] Notation for the six perceptual dimensions and their mapping to the overall preference judgment could be made more explicit in the methods to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the quantitative support for signal reliability and to provide greater transparency on the data collection pipeline.


Referee: [Abstract] Abstract and evaluation framework description: no quantitative checks on signal-to-noise (e.g., intra-class correlation on repeated pairs, BT log-likelihood on held-out data, or rank stability across subsamples) are reported despite explicit acknowledgment of high variance from linguistic diversity and multidimensional perception. This leaves the stability of the Bradley-Terry leaderboard and the validity of subsequent SHAP attributions under-supported.

Authors: We agree that explicit quantitative checks on signal-to-noise would better support the claims given the acknowledged variance. The initial submission did not include these metrics. In the revised manuscript we have added a dedicated subsection (now Section 4.3) reporting intra-class correlation on repeated pairs, Bradley-Terry log-likelihood on held-out comparisons, and rank stability across multiple data subsamples. These results are summarized in the updated abstract and demonstrate that the leaderboard remains stable and that the SHAP attributions rest on reliable preference signals.
Revision: yes
Referee: [Evaluation Framework] Data collection pipeline: the manuscript provides insufficient detail on the precise linguistic controls and rater-bias mitigation steps (e.g., rater screening, balancing of code-mixed vs. native sentences, or filtering of low-consistency raters) applied to the 120K comparisons. Without these, it is unclear whether the collected signals are sufficiently consistent to support the central claims about reliable leaderboard construction and interpretable preference drivers.

Authors: We thank the referee for noting the need for additional detail. The original manuscript described the pipeline at a summary level. In the revision we have expanded Section 3 to specify the rater screening process (native-speaker qualification via proficiency checks), the balancing protocol (equal proportions of native and code-mixed sentences per language), and the post-collection filtering of low-consistency raters (those failing repeated-pair agreement thresholds). A new table and accompanying text now document the final rater pool and consistency statistics, clarifying how these steps support the reliability of the collected signals.
Revision: yes
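The repeated-pair filtering mentioned in this response could look roughly like the sketch below. The tuple layout, the 0.7 threshold, and the decision to keep raters who never saw a repeated pair are assumptions for illustration; the rebuttal does not state the actual thresholds.

```python
from collections import defaultdict

def rater_consistency(judgments):
    """Share of repeated pairs on which each rater gave the same answer every time.

    judgments: iterable of (rater_id, pair_id, choice), where choice is recorded
    in a canonical order so that presentation order does not matter.
    Raters with no repeated pairs are absent from the result.
    """
    by_rater_pair = defaultdict(list)
    for rater, pair, choice in judgments:
        by_rater_pair[(rater, pair)].append(choice)

    agree = defaultdict(int)
    repeats = defaultdict(int)
    for (rater, _pair), choices in by_rater_pair.items():
        if len(choices) < 2:
            continue                      # pair seen once: no consistency signal
        repeats[rater] += 1
        if len(set(choices)) == 1:
            agree[rater] += 1
    return {rater: agree[rater] / repeats[rater] for rater in repeats}

def filter_raters(judgments, min_agreement=0.7):
    """Drop all judgments from raters whose repeated-pair agreement is below threshold."""
    judgments = list(judgments)
    scores = rater_consistency(judgments)
    consistent = {r for r, s in scores.items() if s >= min_agreement}
    unscored = {r for r, _, _ in judgments} - set(scores)   # kept by assumption
    allowed = consistent | unscored
    return [j for j in judgments if j[0] in allowed]
```

Reporting how many raters and comparisons survive such a filter, alongside the threshold used, is the kind of consistency statistic the new table is said to document.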

Circularity Check

0 steps flagged

No circularity: purely empirical study with new human preference data


The paper collects 120K+ new pairwise judgments from over 1,900 native raters on 5K+ sentences across 10 Indic languages, then applies standard Bradley-Terry modeling to build a leaderboard and SHAP to interpret dimension-specific preferences. No equations, parameters, or derivations reduce the reported leaderboard or SHAP attributions to fitted values or definitions taken from the paper's own inputs. The framework relies on fresh crowdsourced data rather than self-referential fitting, self-citation chains, or renaming of prior results. All load-bearing steps (data collection, BT fitting, SHAP) are externally grounded in the new annotations and remain falsifiable against those annotations.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard assumptions of crowdsourced preference collection and the Bradley-Terry model; no new free parameters, invented entities, or ad-hoc axioms are introduced beyond domain-standard statistical tools.

axioms (2)

domain assumption: Bradley-Terry model assumptions hold for the collected pairwise TTS preferences.
Invoked to convert comparisons into a leaderboard ranking.
domain assumption: Native rater judgments on the six dimensions provide perceptually grounded signals after linguistic controls.
Underpins the claim that the framework reduces high variance in speech perception.
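For reference, the first axiom can be stated through the standard textbook form of the Bradley-Terry model. The symbols below (strengths \pi_i, log-strengths \theta_i, win counts w_{ij}) are notation chosen for this sketch; the paper's own parameterization is not reproduced here.

```latex
% Probability that system i is preferred over system j,
% with latent strength \pi_i > 0 and \theta_i = \log \pi_i.
P(i \succ j) = \frac{\pi_i}{\pi_i + \pi_j}
             = \frac{1}{1 + e^{-(\theta_i - \theta_j)}}

% Log-likelihood of the data, where w_{ij} counts how often i was preferred over j.
\ell(\pi) = \sum_{i \neq j} w_{ij} \, \bigl[ \log \pi_i - \log(\pi_i + \pi_j) \bigr]
```

The model assumes that comparisons are independent and that a single latent scale explains all pairwise outcomes. Strengths are identified only up to a common scale factor, so a leaderboard built this way conveys relative, not absolute, quality.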

pith-pipeline@v0.9.0 · 5513 in / 1387 out tokens · 33021 ms · 2026-05-09T22:04:56.068757+00:00 · methodology
