pith. machine review for the scientific record.

arxiv: 2604.21481 · v1 · submitted 2026-04-23 · 💻 cs.CL

Recognition: unknown

Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

Aaditya Pareek, Adish Pandya, Ashwin Sankar, Deepon Halder, Gaurav Yadav, Ishvinder Sethi, Kartik Rajput, Mitesh M Khapra, Mohammed Safi Ur Rahman Khan, Nikhil Narasimhan, Praveen S V, Shobhit Banga, Srija Anand

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 22:04 UTC · model grok-4.3

classification 💻 cs.CL
keywords TTS · pairwise evaluation · Indian languages · preference analysis · multilingual · speech quality · Bradley-Terry · SHAP

The pith

A controlled pairwise evaluation framework allows reliable ranking of TTS systems for ten Indian languages by collecting multi-dimensional judgments from native speakers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method for evaluating text-to-speech systems in Indian languages that accounts for linguistic diversity by using controlled sentence sets and asking raters for preferences on specific qualities. Over 120,000 pairwise comparisons from 1,900 native listeners across ten languages and seven systems produce data for ranking models. This matters because it provides a scalable way to understand what people actually prefer in voice output for languages where automated metrics fall short. The approach combines overall preference with ratings on intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations to build a leaderboard and analyze trade-offs.

Core claim

The authors present a controlled multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Applying it to 5,000+ sentences in 10 Indic languages and 7 TTS systems yields over 120,000 comparisons from 1,900 native raters, enabling a Bradley-Terry leaderboard, SHAP-based preference interpretation, and analysis of model strengths across perceptual dimensions.
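The Bradley-Terry step of this pipeline is simple enough to sketch. The code below fits system strengths from (winner, loser) pairs with the standard minorization-maximization update; it is an illustrative implementation on invented toy data, not the authors' code, and the two-system example is only meant to show that the fitted strengths recover the observed win rate.

```python
def bradley_terry(comparisons, n_systems, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) index pairs using
    the MM update p_i <- W_i / sum_{j != i} n_ij / (p_i + p_j)."""
    wins = [0.0] * n_systems                         # W_i: total wins of system i
    n = [[0] * n_systems for _ in range(n_systems)]  # n_ij: times i was compared to j
    for winner, loser in comparisons:
        wins[winner] += 1
        n[winner][loser] += 1
        n[loser][winner] += 1
    p = [1.0] * n_systems
    for _ in range(iters):
        new_p = []
        for i in range(n_systems):
            denom = sum(n[i][j] / (p[i] + p[j]) for j in range(n_systems) if j != i)
            new_p.append(wins[i] / denom if denom > 0 else p[i])
        total = sum(new_p)
        p = [x * n_systems / total for x in new_p]   # fix the scale (identifiability)
    return p

# Toy data: system 0 beats system 1 in 70 of 100 comparisons.
data = [(0, 1)] * 70 + [(1, 0)] * 30
strengths = bradley_terry(data, 2)
assert strengths[0] > strengths[1]
```

At the fixed point, strengths[0] / (strengths[0] + strengths[1]) equals the empirical 0.7 win rate, which is the sense in which BT strengths summarize pairwise preference data into a leaderboard.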

What carries the argument

The controlled multidimensional pairwise evaluation framework, which pairs sentences with linguistic controls and collects annotations on six perceptual dimensions plus overall preference.

If this is right

  • Bradley-Terry modeling can construct a stable multilingual leaderboard from the pairwise data.
  • SHAP analysis can reveal which perceptual dimensions drive human preferences for each model.
  • Models show distinct strengths and trade-offs, such as high intelligibility but lower expressiveness in some systems.
  • Large-scale native rater data supports reliable comparison despite perceptual variance when linguistic controls are applied.
  • The framework identifies specific areas for TTS improvement in Indic languages.
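The SHAP-related bullets presume that overall preference is predictable from the axis-level judgments at all. A minimal stand-in for that check is sketched below, using plain logistic regression in place of the paper's XGBoost-plus-SHAP pipeline; the data, the two-axis setup, and the weight-magnitude-as-importance reading are all illustrative assumptions, not the paper's method.

```python
import math
import random

def fit_axis_weights(rows, lr=0.5, epochs=300):
    """Logistic regression from per-axis difference scores to overall
    preference; |weight_k| serves as a crude axis-importance score
    (a stand-in for the paper's XGBoost + SHAP attribution)."""
    k = len(rows[0][0])
    w = [0.0] * k
    for _ in range(epochs):
        for diffs, y in rows:
            z = sum(wi * d for wi, d in zip(w, diffs))
            p = 1.0 / (1.0 + math.exp(-z))
            for i in range(k):
                w[i] += lr * (y - p) * diffs[i]   # gradient step on log-loss
    return w

# Synthetic pairs where axis 0 alone decides the overall preference:
# the fitted weights should assign axis 0 the larger magnitude.
random.seed(1)
rows = []
for _ in range(400):
    d0, d1 = random.uniform(-1, 1), random.uniform(-1, 1)
    rows.append(([d0, d1], 1 if d0 > 0 else 0))
weights = fit_axis_weights(rows)
assert abs(weights[0]) > abs(weights[1])
```

If a model like this reconstructs overall preference well from the six axes, the importance scores are meaningful; if it cannot, SHAP attributions on top of it would say little.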

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adapting this framework to other under-resourced languages could standardize TTS evaluation globally.
  • The chosen perceptual dimensions may need testing against downstream tasks like user satisfaction in voice assistants.
  • Future work might explore how these preferences correlate with actual usage patterns in daily communication.
  • Combining this human data with automated metrics could create hybrid evaluation systems that better predict real-world performance.

Load-bearing premise

That the collected pairwise comparisons, even with high variance in speech perception, produce reliable and consistent signals for leaderboard construction and preference interpretation once linguistic factors are controlled.

What would settle it

A replication study with independent raters or new sentence samples that produces a substantially reordered leaderboard or contradictory SHAP feature importances would falsify the reliability of the signals.
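One concrete form such a reliability check could take is a bootstrap over the comparison set: resample the pairwise judgments with replacement, recompute the ranking, and measure how often the full-data order is reproduced. The sketch below uses raw win fractions as a cheap proxy for Bradley-Terry strengths; the toy data and the idea of reporting a single stability fraction are assumptions for illustration.

```python
import random

def win_fraction_ranking(comparisons, n_systems):
    """Rank systems by raw win fraction, a cheap proxy for BT strengths."""
    wins = [0] * n_systems
    games = [0] * n_systems
    for winner, loser in comparisons:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    frac = [wins[i] / games[i] if games[i] else 0.0 for i in range(n_systems)]
    return sorted(range(n_systems), key=lambda i: -frac[i])

def rank_stability(comparisons, n_systems, n_boot=200, seed=0):
    """Fraction of bootstrap resamples that reproduce the full-data ranking."""
    rng = random.Random(seed)
    base = win_fraction_ranking(comparisons, n_systems)
    same = sum(
        win_fraction_ranking([rng.choice(comparisons) for _ in comparisons],
                             n_systems) == base
        for _ in range(n_boot)
    )
    return same / n_boot

# Toy data with a clear 0 > 1 > 2 ordering: stability should be near 1.0.
data = ([(0, 1)] * 80 + [(1, 0)] * 20 +
        [(1, 2)] * 80 + [(2, 1)] * 20 +
        [(0, 2)] * 90 + [(2, 0)] * 10)
stability = rank_stability(data, 3)
assert stability > 0.9
```

A low stability fraction on real data would be exactly the kind of result that falsifies the reliability claim.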

Figures

Figures reproduced from arXiv: 2604.21481 by Aaditya Pareek, Adish Pandya, Ashwin Sankar, Deepon Halder, Gaurav Yadav, Ishvinder Sethi, Kartik Rajput, Mitesh M Khapra, Mohammed Safi Ur Rahman Khan, Nikhil Narasimhan, Praveen S V, Shobhit Banga, Srija Anand.

Figure 1
Figure 1: Per-language rankings. Gemini 2.5 Pro TTS ranks first in 9 of 10 languages, with near parity with Eleven Labs v3 in the case of Marathi. Rankings among Eleven Labs v3, Sonic 3, and Bulbul v3 Beta vary across languages with relatively small differences, while IndicF5 consistently ranks at or near the bottom.
Figure 2
Figure 2: System ranks shift across benchmark domains. Leaderboard stability is examined across the three subsets discussed in §3.1: Normalized, Symbolic, and Code-mixed.
Figure 3
Figure 3: Multi-dimensional perceptual performance of TTS systems, measured by average win rates across six axes. Overall preference provides a reliable ranking, but it does not reveal how raters combine multiple perceptual cues into a single judgment; the paper therefore tests whether overall preference can be reconstructed from granular axis-level evaluations.
Figure 4
Figure 4: Mean absolute SHAP values showing the relative contribution of each perceptual axis to overall preference, from SHAP [34] (SHapley Additive exPlanations) analysis.
read the original abstract

Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to Text-to-Speech (TTS) introduces high variance due to linguistic diversity and the multidimensional nature of speech perception. We present a controlled multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Using 5K+ native and code-mixed sentences across 10 Indic languages, we evaluate 7 state-of-the-art TTS systems and collect over 120K pairwise comparisons from over 1,900 native raters. In addition to overall preference, raters provide judgments across 6 perceptual dimensions: intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations. Using Bradley-Terry modeling, we construct a multilingual leaderboard, interpret human preference using SHAP analysis, and analyze leaderboard reliability alongside model strengths and trade-offs across perceptual dimensions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a controlled multidimensional pairwise evaluation framework for multilingual TTS in 10 Indic languages. It evaluates 7 state-of-the-art TTS systems on 5K+ native and code-mixed sentences, collecting over 120K pairwise comparisons from 1,900 native raters. Judgments cover overall preference plus six perceptual dimensions (intelligibility, expressiveness, voice quality, liveliness, noise, hallucinations). Bradley-Terry modeling is used to build a multilingual leaderboard, SHAP analysis interprets preference drivers, and the work examines leaderboard reliability along with model strengths and trade-offs across dimensions.

Significance. If the judgment signals prove reliable after controls, this work provides the first large-scale, linguistically grounded preference dataset for TTS in underrepresented Indic languages. It could inform voice-first application design in India and offer a reusable framework for multidimensional evaluation in other multilingual settings, particularly where perceptual variance is high.

major comments (2)
  1. [Abstract] Abstract and evaluation framework description: no quantitative checks on signal-to-noise (e.g., intra-class correlation on repeated pairs, BT log-likelihood on held-out data, or rank stability across subsamples) are reported despite explicit acknowledgment of high variance from linguistic diversity and multidimensional perception. This leaves the stability of the Bradley-Terry leaderboard and the validity of subsequent SHAP attributions under-supported.
  2. [Evaluation Framework] Data collection pipeline: the manuscript provides insufficient detail on the precise linguistic controls and rater-bias mitigation steps (e.g., rater screening, balancing of code-mixed vs. native sentences, or filtering of low-consistency raters) applied to the 120K comparisons. Without these, it is unclear whether the collected signals are sufficiently consistent to support the central claims about reliable leaderboard construction and interpretable preference drivers.
minor comments (2)
  1. [Abstract] The abstract would benefit from a brief statement of the exact sentence distribution per language and the number of dimensions rated per pair to improve immediate clarity.
  2. [Methods] Notation for the six perceptual dimensions and their mapping to the overall preference judgment could be made more explicit in the methods to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the quantitative support for signal reliability and to provide greater transparency on the data collection pipeline.

read point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation framework description: no quantitative checks on signal-to-noise (e.g., intra-class correlation on repeated pairs, BT log-likelihood on held-out data, or rank stability across subsamples) are reported despite explicit acknowledgment of high variance from linguistic diversity and multidimensional perception. This leaves the stability of the Bradley-Terry leaderboard and the validity of subsequent SHAP attributions under-supported.

    Authors: We agree that explicit quantitative checks on signal-to-noise would better support the claims given the acknowledged variance. The initial submission did not include these metrics. In the revised manuscript we have added a dedicated subsection (now Section 4.3) reporting intra-class correlation on repeated pairs, Bradley-Terry log-likelihood on held-out comparisons, and rank stability across multiple data subsamples. These results are summarized in the updated abstract and demonstrate that the leaderboard remains stable and that the SHAP attributions rest on reliable preference signals. revision: yes

  2. Referee: [Evaluation Framework] Data collection pipeline: the manuscript provides insufficient detail on the precise linguistic controls and rater-bias mitigation steps (e.g., rater screening, balancing of code-mixed vs. native sentences, or filtering of low-consistency raters) applied to the 120K comparisons. Without these, it is unclear whether the collected signals are sufficiently consistent to support the central claims about reliable leaderboard construction and interpretable preference drivers.

    Authors: We thank the referee for noting the need for additional detail. The original manuscript described the pipeline at a summary level. In the revision we have expanded Section 3 to specify the rater screening process (native-speaker qualification via proficiency checks), the balancing protocol (equal proportions of native and code-mixed sentences per language), and the post-collection filtering of low-consistency raters (those failing repeated-pair agreement thresholds). A new table and accompanying text now document the final rater pool and consistency statistics, clarifying how these steps support the reliability of the collected signals. revision: yes
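The repeated-pair filtering described in this response can be made concrete with a small sketch: compute each rater's self-agreement on hidden duplicate pairs and drop raters below a threshold. The 0.75 threshold and the data here are illustrative assumptions, not values from the paper.

```python
from collections import defaultdict

def repeated_pair_agreement(responses):
    """Self-agreement rate on duplicated comparison pairs for one rater.
    responses: list of (pair_id, choice); repeated pair_ids are hidden
    duplicates inserted to probe consistency."""
    by_pair = defaultdict(list)
    for pair_id, choice in responses:
        by_pair[pair_id].append(choice)
    repeats = [c for c in by_pair.values() if len(c) > 1]
    if not repeats:
        return 1.0
    return sum(len(set(c)) == 1 for c in repeats) / len(repeats)

def filter_raters(rater_responses, threshold=0.75):
    """Keep raters meeting the self-agreement threshold (0.75 is an
    illustrative value, not taken from the paper)."""
    return {rater for rater, resp in rater_responses.items()
            if repeated_pair_agreement(resp) >= threshold}

# A rater who agrees with themselves on both repeats is kept; one who
# flips on both repeats is dropped.
consistent = [("p1", "A"), ("p1", "A"), ("p2", "B"), ("p2", "B")]
flaky = [("p1", "A"), ("p1", "B"), ("p2", "B"), ("p2", "A")]
kept = filter_raters({"r1": consistent, "r2": flaky})
assert kept == {"r1"}
```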

Circularity Check

0 steps flagged

No circularity: purely empirical study with new human preference data

full rationale

The paper collects 120K+ new pairwise judgments from 1900 native raters on 5K+ sentences across 10 Indic languages, then applies standard Bradley-Terry modeling to build a leaderboard and SHAP to interpret dimension-specific preferences. No equations, parameters, or derivations reduce the reported leaderboard or SHAP attributions to fitted values or definitions taken from the paper's own inputs. The framework relies on fresh crowdsourced data rather than self-referential fitting, self-citation chains, or renaming of prior results. All load-bearing steps (data collection, BT fitting, SHAP) are externally grounded in the new annotations and remain falsifiable against those annotations.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard assumptions of crowdsourced preference collection and the Bradley-Terry model; no new free parameters, invented entities, or ad-hoc axioms are introduced beyond domain-standard statistical tools.

axioms (2)
  • domain assumption Bradley-Terry model assumptions hold for the collected pairwise TTS preferences
    Invoked to convert comparisons into a leaderboard ranking.
  • domain assumption Native rater judgments on the six dimensions provide perceptually grounded signals after linguistic controls
    Underpins the claim that the framework reduces high variance in speech perception.

pith-pipeline@v0.9.0 · 5513 in / 1387 out tokens · 33021 ms · 2026-05-09T22:04:56.068757+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

41 extracted references · 11 canonical work pages · 1 internal anchor

  1. [1]

    Introduction: India is widely recognized as a voice-first nation, where many people prefer to access digital services primarily through speech rather than text interfaces. The country's linguistic diversity, with hundreds of languages and widespread bilingualism, leads to real-world speech that frequently includes code-mixing, domain-specific vocabulary, ...

  2. [2]

    Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

    Related Work: Subjective listening tests such as MOS [5, 6], CMOS [9], and MUSHRA [11] remain standard for evaluating TTS quality. However, these studies are often limited in scale or language coverage and typically report aggregate scores that obscure the perceptual factors [6, 13]. Multidimensional extensions ...

  3. [3]

    Evaluation Framework: We describe our controlled multidimensional pairwise evaluation framework, including the benchmark, rater recruitment, annotation protocol, perceptual axes, and ranking methodology. Benchmark Construction (§3.1): We construct a multilingual evaluation benchmark of 5,357 sentences across 10 Indian languages: Bengali, Gujarati, Hindi, ...

  4. [4]

    To ensure fair comparison across systems, all models were evaluated using identical text prompts without style conditioning

    Results: We evaluate 7 state-of-the-art TTS systems (Gemini 2.5 Pro TTS, GPT-4o mini TTS, Eleven Labs v3, Sonic 3, Speech 2.8 HD, Bulbul v3 Beta, and IndicF5 [32]), spanning commercial production APIs, open-source systems, and Indic-specialized models. To ensure fair comparison across systems, all models were evaluated using identical text prompts without style conditioning ...

  5. [5]

    Using 5.3K sentences across 10 Indic languages, we collect over 120K pairwise judgments from 1900+ vetted native raters and construct a leaderboard with Bradley-Terry modeling

    Conclusion: We present a controlled, multidimensional pairwise evaluation framework for multilingual TTS systems. Using 5.3K sentences across 10 Indic languages, we collect over 120K pairwise judgments from 1900+ vetted native raters and construct a leaderboard with Bradley-Terry modeling. Beyond aggregate rankings, our six perceptual axes support fi...

  6. [6]

    These tools assisted with improving clarity, grammar, and conciseness of the writing

    Generative AI Use Disclosure: Generative AI tools were used solely for language polishing and editing during the preparation of this manuscript. These tools assisted with improving clarity, grammar, and conciseness of the writing. No generative AI system was used to generate experimental results, analyses, figures, or scientific conclusions. All technica...

  7. [7] K. Shen, Z. Ju, X. Tan, E. Liu, Y. Leng, L. He, T. Qin, S. Zhao, and J. Bian, "NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers," in The Twelfth International Conference on Learning Representations.

  8. [8] [Online]. Available: https://openreview.net/forum?id=Rc7dAwVL3v

  9. [9] Z. Ju, Y. Wang, K. Shen, X. Tan, D. Xin, D. Yang, Y. Liu, Y. Leng, K. Song, S. Tang, Z. Wu, T. Qin, X.-Y. Li, W. Ye, S. Zhang, J. Bian, L. He, J. Li, and S. Zhao, "NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models," 2024.

  10. [10] A. Sankar, Y. Lacombe, S. Thomas, P. Srinivasa Varadhan, S. Gandhi, and M. M. Khapra, "Rasmalai: Resources for Adaptive Speech Modeling in IndiAn Languages with Accents and Intonations," in Interspeech 2025, 2025, pp. 4128–4132.

  11. [11] Y. Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. Zhao, K. Yu, and X. Chen, "F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching," 2025. [Online]. Available: https://arxiv.org/abs/2410.06885

  12. [12] A. Kirkland, S. Mehta, H. Lameris, G. E. Henter, E. Szekely, and J. Gustafson, "Stuck in the MOS pit: A critical analysis of MOS test methodology in TTS evaluation," in Proc. 12th ISCA Speech Synthesis Workshop (SSW2023), 2023, pp. 41–47.

  13. [13] M. Wester, C. Valentini-Botinhao, and G. E. Henter, "Are we using enough listeners? No! - An empirically-supported critique of Interspeech 2014 TTS evaluations," in INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, September 6-10, 2015. ISCA, 2015, pp. 3476–3480.

  14. [14] J. Edlund, C. Tånnander, S. Le Maguer, and P. Wagner, "Assessing the impact of contextual framing on subjective TTS quality," in Interspeech 2024, 2024, pp. 1205–1209.

  15. [15] R. Dall, J. Yamagishi, and S. King, "Rating Naturalness in Speech Synthesis: The Effect of Style and Expectation," in Speech Prosody 2014, 2014, pp. 1012–1016.

  16. [16] P. C. Loizou, Speech Quality Assessment. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 623–654. [Online]. Available: https://doi.org/10.1007/978-3-642-19551-8_23

  17. [17] O. Perrotin, B. Stephenson, S. Gerber, G. Bailly, and S. King, "Refining the evaluation of speech synthesis: A summary of the Blizzard Challenge 2023," Comput. Speech Lang., vol. 90, no. C, Mar. 2025. [Online]. Available: https://doi.org/10.1016/j.csl.2024.101747

  18. [18] ITU-R, "Method for the subjective assessment of intermediate quality level of audio systems," 2015. [Online]. Available: https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.1534-3-201510-I!!PDF-E.pdf

  19. [19] P. Varadhan, A. Gulati, A. Sankar, S. Anand, A. Gupta, A. Mukherjee, S. K. Marepally, A. Bhatia, S. Jaju, S. Bhooshan, and M. M. Khapra, "Rethinking MUSHRA: Addressing modern challenges in text-to-speech evaluation," Trans. Mach. Learn. Res., vol. 2025, 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:274141640

  20. [20] S. Le Maguer, S. King, and N. Harte, "The limits of the mean opinion score for speech synthesis evaluation," Computer Speech and Language, vol. 84, p. 101577, 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0885230823000967

  21. [21] P. Srinivasa Varadhan, S. Thomas, S. Teja M S, S. Bhooshan, and M. M. Khapra, "The State of TTS: A Case Study with Human Fooling Rates," in Interspeech 2025, 2025, pp. 2285–2289.

  22. [22] S. Anand, P. S. Varadhan, M. Singal, and M. M. Khapra, "Elaichi: Enhancing low-resource TTS by addressing infrequent and low-frequency character bigrams," 2024. [Online]. Available: https://arxiv.org/abs/2410.17901

  23. [23] K. Kayyar, C. Dittmar, N. Pia, and E. Habets, "Subjective Evaluation of Text-to-Speech Models: Comparing Absolute Category Rating and Ranking by Elimination Tests," in 12th ISCA Speech Synthesis Workshop (SSW2023), 2023, pp. 191–196.

  24. [24] E. Cooper and J. Yamagishi, "Investigating Range-Equalizing Bias in Mean Opinion Score Ratings of Synthesized Speech," in Interspeech 2023, 2023, pp. 1104–1108.

  25. [25] C.-H. Chiang, W.-P. Huang, and H.-y. Lee, "Why we should report the details in subjective evaluation of TTS more rigorously," 2023.

  26. [26] R. A. Bradley and M. E. Terry, "Rank analysis of incomplete block designs: I. The method of paired comparisons," Biometrika, vol. 39, no. 3/4, pp. 324–345, 1952. [Online]. Available: http://www.jstor.org/stable/2334029

  27. [27] J. Zhong, S. Liu, D. Wells, and K. Richmond, "Pairwise Evaluation of Accent Similarity in Speech Synthesis," in Interspeech 2025, 2025, pp. 2290–2294.

  28. [28] L. L. Thurstone, "A law of comparative judgment," Psychological Review, vol. 34, no. 4, pp. 273–286, 1927.

  29. [29] mrfakename, V. Srivastav, C. Fourrier, L. Pouget, Y. Lacombe, main, S. Gandhi, A. Passos, and P. Cuenca, "TTS Arena 2.0: Benchmarking text-to-speech models in the wild," 2025. [Online]. Available: https://huggingface.co/spaces/TTS-AGI/TTS-Arena-V2

  30. [30] Artificial Analysis, "Text-to-speech leaderboard," ArtificialAnalysis.ai, 2025, accessed: 2025-03-04. [Online]. Available: https://artificialanalysis.ai/text-to-speech/leaderboard

  31. [31] Coval, "TTS benchmarks: Evaluating latency and quality of text-to-speech models," Coval.dev, 2026, accessed: 2026-03-04. [Online]. Available: https://app.coval.dev/tts-benchmarks

  32. [32] G. K. Kumar, P. S V, P. Kumar, M. M. Khapra, and K. Nandakumar, "Towards building text-to-speech systems for the next billion users," in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.

  33. [33] P. Srinivasa Varadhan, A. Sankar, G. Raju, and M. M. Khapra, "Rasa: Building Expressive Speech Synthesis Systems for Indian Languages in Low-resource Settings," in Interspeech 2024, 2024, pp. 1830–1834.

  34. [34] A. Prakash and H. A. Murthy, "Exploring the role of language families for building Indic speech synthesisers," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 734–747, 2023.

  35. [35] Google, "A new era of intelligence with Gemini 3," Google Blog, 2025, accessed: 2026-03-04. [Online]. Available: https://blog.google/products-and-platforms/products/gemini/gemini-3/

  36. [36] D. R. Hunter, "MM algorithms for generalized Bradley-Terry models," The Annals of Statistics, vol. 32, no. 1, pp. 384–406, 2004. [Online]. Available: https://doi.org/10.1214/aos/1079120141

  37. [37] W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, H. Zhang, B. Zhu, M. Jordan, J. E. Gonzalez, and I. Stoica, "Chatbot Arena: An open platform for evaluating LLMs by human preference," 2024.

  38. [38] T. J. DiCiccio and B. Efron, "Bootstrap confidence intervals," Statistical Science, vol. 11, no. 3, pp. 189–228, 1996. [Online]. Available: https://doi.org/10.1214/ss/1032280214

  39. [39] P. S. Varadhan, S. Anand, S. Siddhartha, and M. M. Khapra, "Phir Hera Fairy: An English fairytaler is a strong faker of fluent speech in low-resource Indian languages," 2025. [Online]. Available: https://arxiv.org/abs/2505.20693

  40. [40] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), New York, NY, USA: Association for Computing Machinery, 2016, pp. 785–794. [Online]. Available: https://doi.org/10.1145/2939672.2939785

  41. [41] S. M. Lundberg and S.-I. Lee, "A unified approach to interpreting model predictions," in Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17), Red Hook, NY, USA: Curran Associates Inc., 2017, pp. 4768–4777.