Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India

Aaditya Pareek; Amritansh Walecha; Bhaskar Singh; Hanuman Sidh; Kaushal Bhogale; Mahima Manik; Manas Dhir; Manmeet Kaur; Mitesh M. Khapra; Sagar Jain

arxiv: 2604.19151 · v3 · pith:D7PUEU5Nnew · submitted 2026-04-21 · 💻 cs.CL · cs.SD · eess.AS

Kaushal Bhogale, Manas Dhir, Amritansh Walecha, Manmeet Kaur, Vanshika Chhabra, Aaditya Pareek, Hanuman Sidh, Mahima Manik, Sagar Jain, Bhaskar Singh, Utkarsh Singh, Tahir Javed, Shobhit Banga, Mitesh M. Khapra

Pith reviewed 2026-05-10 02:32 UTC · model grok-4.3

classification 💻 cs.CL · cs.SD · eess.AS

keywords speech recognition · Indian languages · ASR benchmark · unscripted speech · telephonic conversations · multilingual ASR · regional disparities

The pith

A benchmark of unscripted phone conversations reveals gaps in current speech recognition for Indian languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing benchmarks for speech recognition in India rely on scripted, clean recordings and leaderboard-driven evaluation, which encourages overfitting to specific test sets rather than real-world performance. Strict single-reference scoring also penalizes natural spelling variation in transcripts. The paper introduces Voice of India, a new closed-source dataset collected from actual telephone calls across 15 languages and 139 regional clusters, with 536 hours of speech and transcripts that accept spelling differences. The dataset supports evaluation under realistic conditions, including variation in audio quality and speaker demographics. By analyzing results at the district level and across factors such as gender and device, the work identifies where models struggle most in everyday use.

Core claim

The central discovery is that a large-scale benchmark built from unscripted telephonic conversations in 15 major Indian languages provides a more representative test for automatic speech recognition systems than existing scripted datasets. With 306230 utterances from 36691 speakers totaling 536 hours and transcripts that account for spelling variations, it exposes geographic disparities in performance and highlights challenges related to audio quality, speaking rate, gender, and device type.

What carries the argument

The Voice of India benchmark dataset, derived from real telephonic conversations with manually transcribed utterances that permit spelling variants to reflect natural language use.
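To make the idea of variant-tolerant transcripts concrete, here is a minimal sketch of multi-reference WER. It assumes each utterance carries a list of complete alternative transcripts and that the hypothesis is scored against the closest one. The source does not describe the paper's actual scoring procedure, so the function names and the romanized example sentences are hypothetical illustrations only.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Standard single-reference word error rate."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def multi_reference_wer(references: List[str], hypothesis: str) -> float:
    """Assumed scheme: score against every valid transcript, keep the lowest WER."""
    return min(wer(ref, hypothesis) for ref in references)


# Invented example: two accepted spellings of an English-origin word.
references = ["mera mobile recharge kar do", "mera mobail recharge kar do"]
hypothesis = "mera mobail recharge kar do"

print(wer(references[0], hypothesis))               # 0.2 (one substitution in five words)
print(multi_reference_wer(references, hypothesis))  # 0.0 (matches the second variant)
```

Under strict single-reference scoring the hypothesis is charged one error for a spelling the annotators also consider valid. Taking the minimum over references removes that penalty without changing how genuine recognition errors are counted.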

If this is right

ASR systems exhibit varying performance across different Indian districts, indicating regional biases in current models.
Performance degrades under conditions of poor audio quality, fast speaking rates, or certain device types.
Accounting for spelling variations in evaluation leads to fairer assessment of code-mixed speech.
Insights from the dataset can guide targeted improvements in real-world Indic ASR applications.
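The district-level comparison in the list above can be sketched as a simple aggregation. The snippet below assumes per-utterance error counts are already available and pools them per district (total word errors divided by total reference words). Whether the paper pools counts or averages per-utterance WER is not stated in the source, and the district names and counts here are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def district_wer(rows: Iterable[Tuple[str, int, int]]) -> Dict[str, float]:
    """Pooled WER per district.

    rows: (district, word_errors, reference_words), one tuple per utterance.
    """
    errors: Dict[str, int] = defaultdict(int)
    words: Dict[str, int] = defaultdict(int)
    for district, n_err, n_ref in rows:
        errors[district] += n_err
        words[district] += n_ref
    # Skip districts with no reference words to avoid division by zero.
    return {d: errors[d] / words[d] for d in words if words[d] > 0}


# Hypothetical per-utterance counts for two placeholder districts.
rows = [
    ("district_a", 2, 10),
    ("district_a", 1, 10),
    ("district_b", 6, 12),
]
per_district = district_wer(rows)
gap = max(per_district.values()) - min(per_district.values())
print(per_district)
print(gap)
```

Pooling weights long utterances more heavily than short ones, which keeps a handful of one-word utterances from dominating a district's score. The gap between the best and worst district is one simple summary of regional disparity.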

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers of speech systems for multilingual countries could use similar unscripted collection methods to create more practical benchmarks.
The geographic analysis suggests that ASR performance may correlate with socioeconomic factors in different regions, warranting further study.
Extending this approach to other languages with high dialectal variation could improve global ASR equity.

Load-bearing premise

That collecting data from unscripted telephonic conversations and creating transcripts with spelling variants produces a benchmark that is less biased and more reflective of real-world speech than scripted alternatives.

What would settle it

Showing that state-of-the-art ASR models achieve word error rates on this unscripted benchmark comparable to those on existing scripted ones, without specific adaptations for spelling variations or regional accents, would undermine the claimed superiority.

Figures

Figures reproduced from arXiv: 2604.19151 by Aaditya Pareek, Amritansh Walecha, Bhaskar Singh, Hanuman Sidh, Kaushal Bhogale, Mahima Manik, Manas Dhir, Manmeet Kaur, Mitesh M. Khapra, Sagar Jain, Shobhit Banga, Tahir Javed, Utkarsh Singh, Vanshika Chhabra.

**Figure 1.** The WER map of India: Average Word Error Rate (WER) for ASR models for districts of India.

… rigid string matching. To avoid penalizing legitimate orthographic variation, the dataset includes multiple valid transcripts that capture natural spelling differences and alternative renderings commonly found in spontaneous and code-mixed speech. A central goal of the benchmark is to expose geographic disparities …

**Figure 4.** Deviations from ideal acoustic conditions consistently increase error rates across DNSMOS [29] quality quartiles, speaking-rate quartiles, and utterance duration bins (<2s, 2–5s, >5s). Audio degradation raises WER monotonically; ElevenLabs Scribe rises from 15.31% to 25.20% and Gemini-3-Pro from 13.42% to 23.44% between the highest and lowest quality quartiles. Speaking rate exhibits a U-shaped…
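The quartile breakdown described for Figure 4 can be sketched as follows. This assumes each utterance has a scalar quality score (for example a DNSMOS-style rating) along with error and word counts. The binning rule, function name, and data are illustrative assumptions, not the paper's procedure.

```python
from bisect import bisect_right
from statistics import quantiles
from typing import List, Sequence, Tuple


def wer_by_quartile(rows: Sequence[Tuple[float, int, int]]) -> List[float]:
    """Pooled WER for score quartiles Q1 (lowest scores) to Q4 (highest).

    rows: (quality_score, word_errors, reference_words), one tuple per utterance.
    """
    cuts = quantiles([score for score, _, _ in rows], n=4)  # three cut points
    errors = [0, 0, 0, 0]
    words = [0, 0, 0, 0]
    for score, n_err, n_ref in rows:
        q = bisect_right(cuts, score)  # bin index 0..3
        errors[q] += n_err
        words[q] += n_ref
    return [e / w if w else float("nan") for e, w in zip(errors, words)]


# Made-up utterances: lower quality scores are paired with more errors.
rows = [
    (1.5, 5, 10), (2.0, 5, 10),
    (2.5, 4, 10), (3.0, 4, 10),
    (3.5, 3, 10), (4.0, 3, 10),
    (4.5, 2, 10), (5.0, 2, 10),
]
quartile_wer = wer_by_quartile(rows)
print(quartile_wer)
```

The same function applies unchanged to speaking rate: pass words per second as the score and check whether the resulting four values dip in the middle quartiles, which is the U-shape the caption mentions.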

Original abstract

Existing Indic ASR benchmarks often use scripted, clean speech and leaderboard-driven evaluation that encourages dataset-specific overfitting. In addition, strict single-reference WER penalizes natural spelling variation in Indian languages, including non-standardized spellings of code-mixed English-origin words. To address these limitations, we introduce Voice of India, a closed-source benchmark built from unscripted telephonic conversations covering 15 major Indian languages across 139 regional clusters. The dataset contains 306230 utterances, totaling 536 hours of speech from 36691 speakers with transcripts accounting for spelling variations. We also analyze performance geographically at the district level, revealing disparities. Finally, we provide detailed analysis across factors such as audio quality, speaking rate, gender, and device type, highlighting where current ASR systems struggle and offering insights for improving real-world Indic ASR systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Voice of India, a closed-source benchmark of 306230 utterances (536 hours) drawn from unscripted telephonic conversations in 15 major Indian languages across 139 regional clusters, involving 36691 speakers. Transcripts accommodate spelling variations. The authors report geographic performance disparities at the district level and factor analyses on audio quality, speaking rate, gender, and device type to identify challenges for existing ASR systems.

Significance. If the dataset construction, transcript quality, and analyses prove sound upon verification, the work could usefully document real-world Indic ASR difficulties beyond scripted benchmarks. The scale and multi-factor breakdown offer potential guidance for system improvements in under-resourced settings. The closed-source status, however, sharply curtails community adoption and independent testing of the claimed advantages.

major comments (3)

[Abstract] Abstract: the central claims that the dataset 'reveals disparities' and 'highlight[s] where current ASR systems struggle' are unsupported by any quantitative WER numbers, baseline comparisons, or error rates. Without these, the asserted superiority over scripted benchmarks cannot be evaluated.
[Abstract and §3] Dataset release statement (Abstract and §3): declaring the benchmark closed-source prevents reproduction of the district-level disparity results and the audio-quality/speaking-rate/gender/device breakdowns. This directly undermines the paper's stated purpose of supplying a usable real-world benchmark for the Indic ASR community.
[§4–5] Analysis sections (§4–5): the claim that unscripted telephonic data plus spelling-variant transcripts yield a meaningfully less biased representation requires explicit validation, such as side-by-side WER evaluation of the same models on Voice of India versus existing scripted Indic corpora.

minor comments (2)

[Abstract] Abstract: format the utterance count as 306,230 for standard readability.
[Introduction] Introduction: specify the exact criteria used to define the 139 regional clusters and how speaker demographics were sampled to ensure geographic coverage.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the detailed review and constructive comments on our manuscript. We address each major comment below, clarifying our approach and indicating planned revisions where appropriate.

Referee: [Abstract] Abstract: the central claims that the dataset 'reveals disparities' and 'highlight[s] where current ASR systems struggle' are unsupported by any quantitative WER numbers, baseline comparisons, or error rates. Without these, the asserted superiority over scripted benchmarks cannot be evaluated.

Authors: The abstract condenses findings from the district-level geographic analysis and the multi-factor breakdowns in §§4–5. These sections quantify performance variations across audio quality, speaking rate, gender, device type, and regional clusters, which directly support the claims of disparities and system struggles. While the abstract itself avoids lengthy numerical tables for brevity, the underlying analyses contain the supporting quantitative breakdowns. In revision we will add a concise sentence to the abstract referencing key aggregate statistics (e.g., WER ranges and factor-specific deltas) drawn from §§4–5.

revision: partial

Referee: [Abstract and §3] Dataset release statement (Abstract and §3): declaring the benchmark closed-source prevents reproduction of the district-level disparity results and the audio-quality/speaking-rate/gender/device breakdowns. This directly undermines the paper's stated purpose of supplying a usable real-world benchmark for the Indic ASR community.

Authors: We recognize that closed-source status precludes independent reproduction. The decision stems from privacy and consent constraints inherent to real, unscripted telephonic conversations involving 36,691 speakers. The manuscript supplies the full collection protocol, transcription guidelines (including spelling-variant handling), sampling strategy across 139 clusters, and all statistical results from the factor analyses. These elements allow the community to understand the observed challenges and to design mitigation strategies even without direct data access. We therefore retain the closed-source designation.

revision: no

Referee: [§4–5] Analysis sections (§4–5): the claim that unscripted telephonic data plus spelling-variant transcripts yield a meaningfully less biased representation requires explicit validation, such as side-by-side WER evaluation of the same models on Voice of India versus existing scripted Indic corpora.

Authors: Sections 4 and 5 demonstrate that the combination of unscripted speech and variant-aware transcripts surfaces realistic error patterns (e.g., higher WER on code-mixed terms and dialectal variants) that scripted benchmarks typically mask. While we do not include head-to-head WER tables against every existing corpus, the factor analyses isolate the contribution of each variable and show elevated difficulty relative to the clean, scripted conditions described in prior work. In revision we will expand the discussion to cite quantitative comparisons reported in the literature for the same model families and to articulate why direct re-evaluation on Voice of India is not feasible under the current release policy.

revision: partial

standing simulated objections not resolved

Independent reproduction of the district-level disparity results and factor analyses is not possible because the benchmark remains closed-source for privacy reasons.

Circularity Check

0 steps flagged

No circularity: empirical benchmark paper with no derivations or self-referential reductions

The paper introduces Voice of India as a closed-source dataset of unscripted telephonic speech across 15 Indic languages, with accompanying geographic and factor analyses. No mathematical derivations, model predictions, fitted parameters, or uniqueness theorems are claimed. The central contribution is data collection and empirical reporting; all performance observations are presented as direct measurements on the collected utterances rather than outputs derived from prior self-citations or internal definitions. No load-bearing steps reduce by construction to the paper's own inputs, satisfying the default expectation of no significant circularity for a purely empirical benchmark effort.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a data collection and benchmarking paper with no mathematical derivations, free parameters, axioms, or invented entities. All claims rest on the described collection process and empirical observations.

