
arxiv: 2606.21408 · v1 · pith:G2HIZZXLnew · submitted 2026-06-19 · 📡 eess.AS

Vaani Benchmark V1.0: An Inclusive Multimodal Benchmark Dataset for Hindi

Pith reviewed 2026-06-26 13:05 UTC · model grok-4.3

classification 📡 eess.AS
keywords Hindi ASR benchmark · multimodal dataset · spontaneous speech · inclusive evaluation · India districts · multi-reference transcription · real-world audio

The pith

Vaani Benchmark collects spontaneous Hindi speech from 104 districts with three transcriptions each to enable more representative ASR testing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces Vaani Benchmark V1.0 as a multimodal dataset for evaluating Hindi automatic speech recognition systems. Spontaneous speech is elicited through image prompts and recorded in everyday acoustic conditions from speakers across 104 districts representing varied demographics. Each audio segment receives three independent transcriptions to handle natural differences in wording and spelling. The authors then test multiple open-source and proprietary ASR models on the data to compare results. This setup targets gaps in geographic coverage and transcription consistency found in earlier Hindi benchmarks.

Core claim

The central claim is that an inclusive multimodal Hindi ASR benchmark dataset collected from 104 districts across India, featuring spontaneous speech elicited by image prompts and recorded in real-world conditions across diverse demographic groups, with each segment annotated by three independent transcriptions, enables more robust, inclusive, and realistic evaluation of ASR systems than prior resources.

What carries the argument

The Vaani Benchmark V1.0 dataset of spontaneous speech recordings elicited via image prompts, captured in real-world conditions, and supplied with three independent transcriptions per segment to support multi-reference scoring.
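
A minimal sketch of how three references per segment could feed a multi-reference score, assuming the score is the lowest word error rate (WER) against any one reference. That min-over-references rule is an assumption made for illustration; this summary does not state the paper's exact scoring procedure, and the function names are hypothetical.

```python
from typing import List, Sequence


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance over word tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of a hypothesis against a single reference."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else float("inf")
    return word_edit_distance(ref, hyp) / len(ref)


def multi_reference_wer(references: List[str], hypothesis: str) -> float:
    """Assumed rule: score against the closest of the available references."""
    return min(wer(r, hypothesis) for r in references)


if __name__ == "__main__":
    # Three transcriptions that differ only in accepted spellings.
    refs = ["मुझे हिंदी पसंद है", "मुझे हिन्दी पसंद है", "मुझे हिंदी पसन्द है"]
    hyp = refs[1]  # a system output that happens to use the second spelling
    print(wer(refs[0], hyp), multi_reference_wer(refs, hyp))
```

With a single reference, the spelling variant counts as a substitution; with all three, the same output is scored as correct. In practice the text would likely need Unicode normalization before comparison, which this sketch omits.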

If this is right

  • ASR models can be evaluated for performance across wider geographic and demographic ranges within India.
  • Multi-reference transcriptions allow scoring that tolerates natural orthographic and lexical variations.
  • Real-world acoustic conditions test how systems handle everyday noise and environments.
  • Direct comparisons between open-source and proprietary models become possible on the same inclusive data.
  • Benchmark results can guide improvements in handling spontaneous rather than read speech.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Model developers may shift toward collecting training data from similarly broad regional samples to close performance gaps.
  • The image-prompt method could transfer to speech data collection for other languages where scripted prompts limit naturalness.
  • Public releases of such benchmarks might accelerate community efforts to build region-aware Hindi ASR tools.

Load-bearing premise

The collection of spontaneous speech via image prompts from 104 districts across diverse demographic groups in real-world conditions will produce data that meaningfully improves upon the geographic and demographic limitations of existing Hindi ASR benchmarks.

What would settle it

A direct comparison finding that ASR model rankings and error rates on this dataset match those on existing Hindi benchmarks with no added insight from the wider coverage or multiple transcriptions would falsify the claim of improved evaluation robustness.
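
One way to make this test concrete is a rank correlation between model scores on two benchmarks. The sketch below computes Kendall's tau over the models shared by both score tables; the choice of Kendall's tau, the function name, and the model names and WER values in the example are all illustrative assumptions, not results from the paper.

```python
from itertools import combinations
from typing import Dict


def kendall_tau(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> float:
    """Kendall's tau-a between two score tables, over their shared models."""
    models = sorted(set(scores_a) & set(scores_b))
    if len(models) < 2:
        raise ValueError("need at least two shared models")
    concordant = discordant = 0
    for m, n in combinations(models, 2):
        sign = (scores_a[m] - scores_a[n]) * (scores_b[m] - scores_b[n])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


if __name__ == "__main__":
    # Placeholder WERs (percent) for hypothetical models; not real measurements.
    existing_benchmark = {"model_a": 10.0, "model_b": 15.0, "model_c": 20.0}
    new_benchmark = {"model_a": 30.0, "model_b": 25.0, "model_c": 40.0}
    print(kendall_tau(existing_benchmark, new_benchmark))
```

A tau close to 1 would mean the new benchmark orders the models the same way as the existing one. Lower values would indicate that the wider coverage changes the ranking. Error-rate gaps would still need a separate comparison, since tau ignores magnitudes.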

Figures

Figures reproduced from arXiv: 2606.21408 by Agneedh Basu, Nihar Desai, Pavan Kumar J, Pranav Bhat, Prasanta Kumar Ghosh, Saurabh Kumar, Sujith Pulikodan, Visruth Sanka.

Figure 1: Vaani Benchmark Data Preparation Process
… events, making it difficult to systematically analyze model robustness across diverse and realistic acoustic scenarios. Recent models are increasingly moving toward multimodal architectures [19, 20]. Evaluating such models in multimodal scenarios requires datasets that contain aligned multimodal data. However, there is a lack of benchmark datasets for Indic languag…
Figure 2: Districts and States represented in the Vaani Benchmark
Following rigorous quality checks, the benchmark dataset includes speech data from 3,252 speakers, amounting to 20.64 hours of audio across 104 districts in 22 states and Union Territories. The dataset was collected using 8,315 images, with each audio segment transcribed by three independent transcribers. To support open research while preserving t…
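
The totals quoted with Figure 2 imply some rough per-speaker and per-district averages. The short calculation below derives them from the stated figures only. These are plain means and say nothing about how evenly speakers or audio are spread across districts, which the excerpt does not report.

```python
# Totals as quoted in the Figure 2 excerpt.
speakers = 3252
hours = 20.64
districts = 104
images = 8315

seconds_per_speaker = hours * 3600 / speakers   # roughly 23 seconds
speakers_per_district = speakers / districts    # roughly 31 speakers
minutes_per_district = hours * 60 / districts   # roughly 12 minutes
images_per_speaker = images / speakers          # roughly 2.6 images

print(f"audio per speaker:     {seconds_per_speaker:.1f} s")
print(f"speakers per district: {speakers_per_district:.1f}")
print(f"audio per district:    {minutes_per_district:.1f} min")
print(f"images per speaker:    {images_per_speaker:.2f}")
```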
Original abstract

Benchmarking is critical for the systematic evaluation and comparison of automatic speech recognition (ASR) systems. While several open-source datasets are available for Hindi ASR, existing benchmarks remain limited in geographic diversity, demographic representation, and transcription robustness. We introduce an inclusive, multimodal Hindi ASR benchmark collected from 104 districts across India. The dataset consists of spontaneous speech elicited using image prompts and recorded in real-world acoustic conditions across diverse demographic groups. Each audio segment is annotated with three independent transcriptions, enabling multi-reference evaluation that accounts for permissible orthographic and lexical variations. This design supports more robust, inclusive, and realistic ASR evaluation. We benchmark multiple open-source and proprietary ASR models and report their comparative performance on the benchmark dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces Vaani Benchmark V1.0, a multimodal Hindi ASR dataset collected from 104 districts across India. Spontaneous speech is elicited via image prompts and recorded in real-world conditions from diverse demographic groups; each audio segment receives three independent transcriptions to support multi-reference evaluation accounting for orthographic and lexical variation. The authors benchmark multiple open-source and proprietary ASR models and assert that the design enables more robust, inclusive, and realistic evaluation than existing Hindi benchmarks.

Significance. A dataset with verified broad geographic coverage, demographic diversity, real-world acoustics, and multi-reference transcriptions would be a useful addition for Hindi ASR evaluation, as it could expose model weaknesses missed by narrower existing resources. The triple-transcription protocol is a constructive design choice for handling permissible variation. However, the manuscript supplies no quantitative evidence (demographics, coverage metrics, acoustic statistics, or side-by-side comparisons) that these benefits are realized.

major comments (2)
  1. [Abstract] Abstract: the central claim that the collection 'supports more robust, inclusive, and realistic ASR evaluation' is asserted without any speaker-level statistics (age/gender/dialect/education), district-level coverage map or sampling protocol, acoustic-condition quantifiers (SNR, reverberation), or direct comparison of coverage against Common Voice Hindi or IndicTTS; this evidence is required to substantiate the inclusivity improvement over prior benchmarks.
  2. [Abstract] Abstract and methods description: no error analysis, verification procedure, or ablation is reported to confirm that image-prompted spontaneous speech from 104 districts actually overcomes the geographic and demographic limitations cited for existing datasets; benchmarking models on the new set alone does not test the inclusivity claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for quantitative support of our inclusivity claims. We address each major comment below, indicating where revisions will strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the collection 'supports more robust, inclusive, and realistic ASR evaluation' is asserted without any speaker-level statistics (age/gender/dialect/education), district-level coverage map or sampling protocol, acoustic-condition quantifiers (SNR, reverberation), or direct comparison of coverage against Common Voice Hindi or IndicTTS; this evidence is required to substantiate the inclusivity improvement over prior benchmarks.

    Authors: We agree that the manuscript would be strengthened by including quantitative evidence. In the revised version, we will add a dedicated section with available speaker-level demographics (age, gender, education where recorded), a district coverage summary with sampling protocol details, acoustic statistics including average SNR, and a comparison table against Common Voice Hindi and IndicTTS to directly substantiate the coverage improvements.
    Revision: yes

  2. Referee: [Abstract] Abstract and methods description: no error analysis, verification procedure, or ablation is reported to confirm that image-prompted spontaneous speech from 104 districts actually overcomes the geographic and demographic limitations cited for existing datasets; benchmarking models on the new set alone does not test the inclusivity claim.

    Authors: The multi-reference transcriptions and real-world recording conditions are designed to address variation and realism, with benchmarking results showing model performance differences attributable to these factors. We acknowledge the value of additional validation. In revision, we will expand the methods with transcription verification procedures, include an error analysis stratified by district or demographic factors where possible, and add discussion of how the dataset design targets the cited limitations of prior resources.
    Revision: yes
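
The stratified error analysis promised in the second response could be sketched as a per-group aggregation of word errors. The example assumes each utterance has already been scored, so its error count and reference length are known. The record layout, the `district` field, and the placeholder values are hypothetical and not taken from the dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class ScoredUtterance(NamedTuple):
    district: str     # hypothetical metadata field used as the grouping key
    word_errors: int  # edit-distance errors against the chosen reference
    ref_words: int    # number of words in that reference


def wer_by_group(rows: Iterable[ScoredUtterance]) -> Dict[str, float]:
    """Corpus-level WER per group: summed errors over summed reference words."""
    errors: Dict[str, int] = defaultdict(int)
    words: Dict[str, int] = defaultdict(int)
    for row in rows:
        errors[row.district] += row.word_errors
        words[row.district] += row.ref_words
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


if __name__ == "__main__":
    # Placeholder rows for illustration only.
    rows = [
        ScoredUtterance("district_A", 2, 10),
        ScoredUtterance("district_A", 1, 10),
        ScoredUtterance("district_B", 5, 10),
    ]
    print(wer_by_group(rows))
```

Summing errors and reference words before dividing weights each group by its total word count rather than averaging per-utterance rates, so very short utterances do not dominate. The same function works for any other grouping key, such as gender or age band, if that metadata is available.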

Circularity Check

0 steps flagged

No circularity: dataset creation paper with no derivations or fitted predictions

Full rationale

The paper introduces a new Hindi ASR benchmark dataset collected from 104 districts using image prompts, with multi-reference transcriptions, and reports model benchmarks. No equations, parameters, or predictions appear in the provided text. The abstract and description contain only descriptive claims about inclusivity and robustness; these do not reduce to prior fitted values or self-citations by construction. No self-definitional steps, uniqueness theorems, or ansatzes are invoked. This matches the default case of a self-contained dataset paper with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the described collection and annotation protocol yields representative spontaneous speech; no free parameters or invented entities are introduced.

axioms (1)
  • Domain assumption: Three independent transcriptions adequately capture permissible orthographic and lexical variations in Hindi spontaneous speech.
    Invoked in the abstract to justify multi-reference evaluation.
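
One rough way to probe this assumption is to measure how much the three transcriptions of a segment differ from each other. The sketch below averages a word-level similarity over all transcript pairs using Python's standard `difflib.SequenceMatcher`. Using this similarity as an agreement measure is an assumption made here for illustration, not a procedure described in the paper.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List


def pairwise_agreement(transcripts: List[str]) -> float:
    """Mean word-level similarity over all transcript pairs (1.0 = identical)."""
    pairs = list(combinations(transcripts, 2))
    if not pairs:
        raise ValueError("need at least two transcripts")
    ratios = [
        SequenceMatcher(None, a.split(), b.split(), autojunk=False).ratio()
        for a, b in pairs
    ]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Placeholder transcripts; the third differs in one word.
    segment = ["a b c d", "a b c d", "a x c d"]
    print(pairwise_agreement(segment))
```

Segments with low agreement are the ones where a single reference would be most misleading. High agreement across the whole corpus would suggest that three references add little beyond one.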

pith-pipeline@v0.9.1-grok · 5679 in / 1169 out tokens · 42391 ms · 2026-06-26T13:05:14.606948+00:00 · methodology


Reference graph

Works this paper leans on

39 extracted references · 1 canonical work page

  1. [1]

    This process reveals the strengths and limitations of models, enables meaningful comparison across systems, and informs decisions about their real-world applicability

    Introduction. Benchmarking is a critical stage in the model development process, as it evaluates model performance under unseen conditions using standardized benchmark datasets. This process reveals the strengths and limitations of models, enables meaningful comparison across systems, and informs decisions about their real-world applicability. Stat...

  2. [2]

    Vaani Benchmark. The benchmark targets systematic assessment of Hindi ASR robustness by pairing spoken utterances with multiple human references. It comprises 20.64 hours of spontaneous speech collected from an inclusive, geographically distributed speaker population spanning 104 districts across 22 Indian states and Union Territories, enabling fine-graine...

  3. [3]

    The evaluated models comprise base versions as well as fine-tuned variants, and include both monolingual and multilingual architectures

    Evaluations. We evaluated multiple models on the proposed benchmark dataset, including both open-source and commercial systems. The evaluated models comprise base versions as well as fine-tuned variants, and include both monolingual and multilingual architectures. While the Vaani dataset inherently supports multimodal benchmarking—such as speech-image r...

  4. [4]

    We evaluate multiple open-source and proprietary ASR models on this dataset

    Conclusion. We present an inclusive, multimodal Hindi ASR benchmark comprising speech data collected from 104 districts, with three independent reference transcriptions for each segment. We evaluate multiple open-source and proprietary ASR models on this dataset. Our analysis reveals substantial differences in Word Error Rate (WER) when using multi-referen...

  5. [5]

    Quantifying bias in automatic speech recognition,

    S. Feng, O. Kudina, B. M. Halpern, and O. Scharenborg, “Quantifying bias in automatic speech recognition,” 2021. [Online]. Available: https://arxiv.org/abs/2103.15122

  6. [6]

    Hey asr system! why aren’t you more inclusive? automatic speech recognition systems’ bias and proposed bias mitigation techniques. a literature review,

    M. K. Ngueajio and G. Washington, “Hey asr system! why aren’t you more inclusive? automatic speech recognition systems’ bias and proposed bias mitigation techniques. a literature review,” in International conference on human-computer interaction. Springer, 2022, pp. 421–440

  7. [7]

    Census of india 2011: Language data,

    Office of the Registrar General, “Census of india 2011: Language data,” 2011

  8. [8]

    Indicsuperb: A speech processing universal performance benchmark for indian languages,

    T. Javed, K. Bhogale, A. Raman, P. Kumar, A. Kunchukuttan, and M. M. Khapra, “Indicsuperb: A speech processing universal performance benchmark for indian languages,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 11, 2023, pp. 12942–12950

  9. [9]

    Lahaja: A robust multi-accent benchmark for evaluating hindi asr systems,

    T. Javed, J. Nawale, S. Joshi, E. George, K. Bhogale, D. Mehendale, and M. M. Khapra, “Lahaja: A robust multi-accent benchmark for evaluating hindi asr systems,” arXiv preprint arXiv:2408.11440, 2024

  10. [10]

    Vistaar: Diverse benchmarks and training sets for indian language asr,

    K. S. Bhogale, S. Sundaresan, A. Raman, T. Javed, M. M. Khapra, and P. Kumar, “Vistaar: Diverse benchmarks and training sets for indian language asr,” arXiv preprint arXiv:2305.15386, 2023

  11. [11]

    Fleurs: Few-shot learning evaluation of universal representations of speech,

    A. Conneau, M. Ma, S. Khanuja, Y. Zhang, V. Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, “Fleurs: Few-shot learning evaluation of universal representations of speech,” in 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023, pp. 798–805

  12. [12]

    Common voice: A massively-multilingual speech corpus,

    R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” arXiv preprint arXiv:1912.06670, 2019

  13. [13]

    Mucs 2021: Multilingual and code-switching asr challenges for low resource indian languages,

    A. Diwan, R. Vaideeswaran, S. Shah, A. Singh, S. Raghavan, S. Khare, V. Unni, S. Vyas, A. Rajpuria, C. Yarra, A. Mittal, P. K. Ghosh, P. Jyothi, K. Bali, V. Seshadri, S. Sitaram, S. Bharadwaj, J. Nanavati, R. Nanavati, and K. Sankaranarayanan, “Mucs 2021: Multilingual and code-switching asr challenges for low resource indian languages,” in Interspeech 20...

  14. [14]

    Gram vaani asr challenge on spontaneous telephone speech recordings in regional variations of hindi,

    A. R. K. Kumar, N. Ravi, A. Seth, A. Seth, and A. Singh, “Gram vaani asr challenge on spontaneous telephone speech recordings in regional variations of hindi,” 2022

  15. [15]

    Respin-s1.0: A read speech corpus of 10000+ hours in dialects of nine indian languages,

    S. Kumar, A. Singh, J. Bandekar, S. Murthy, S. Sharma, S. Badiger, S. Udupa, A. Nagireddi, S. R. KM, R. Saxena et al., “Respin-s1.0: A read speech corpus of 10000+ hours in dialects of nine indian languages,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track

  16. [16]

    Style-agnostic evaluation of asr using multiple reference transcripts,

    Q. McNamara, M. Ángel del Río Fernández, N. Bhandari, M. Ratajczak, D. Chen, C. Miller, and M. Jetté, “Style-agnostic evaluation of asr using multiple reference transcripts,” 2024. [Online]. Available: https://arxiv.org/abs/2412.07937

  17. [17]

    Multi-reference wer for evaluating asr for languages with no orthographic rules,

    A. Ali, W. Magdy, P. Bell, and S. Renals, “Multi-reference wer for evaluating asr for languages with no orthographic rules,” in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015, pp. 576–580

  18. [18]

    Cultural bias in large language models: Evaluating ai agents through moral questionnaires,

    S. Münker, “Cultural bias in large language models: Evaluating ai agents through moral questionnaires,” 2025. [Online]. Available: https://arxiv.org/abs/2507.10073

  19. [19]

    Survey of hallucination in natural language generation,

    Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung, “Survey of hallucination in natural language generation,” ACM Computing Surveys, vol. 55, no. 12, pp. 1–38, Mar. 2023. [Online]. Available: http://dx.doi.org/10.1145/3571730

  20. [20]

    Code-switching and mixing in communication - a study on language contact in indian media,

    C. Barnali, “Code-switching and mixing in communication - a study on language contact in indian media,” in The Future of Ethics, Education and Research. Scientia Moralitas Research Institute, 2017, pp. 110–123

  21. [21]

    Code-switching in end-to-end automatic speech recognition: A systematic literature review,

    M. T. Agro, A. Kulkarni, K. Kadaoui, Z. Talat, and H. Aldarmaki, “Code-switching in end-to-end automatic speech recognition: A systematic literature review,” arXiv preprint arXiv:2507.07741, 2025

  22. [22]

    Speech recognition in noisy environments: A survey,

    Y. Gong, “Speech recognition in noisy environments: A survey,” Speech communication, vol. 16, no. 3, pp. 261–291, 1995

  23. [23]

    Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,

    G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025

  24. [24]

    Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras,

    Microsoft, A. Abouelenin, A. Ashfaq, A. Atkinson, H. Awadalla, N. Bach, J. Bao, A. Benhaim, M. Cai, V. Chaudhary, C. Chen, D. Chen, D. Chen, J. Chen, W. Chen, Y.-C. Chen, Y. ling Chen, Q. Dai, X. Dai, R. Fan, M. Gao, M. Gao, A. Garg, A. Goswami, J. Hao, A. Hendy, Y. Hu, X. Jin, M. Khademi, D. Kim, Y. J. Kim, G. Lee, J. Li, Y. Li, C. Liang, X. Lin...

  25. [25]

    Available: https://arxiv.org/abs/2503.01743

    [Online]. Available: https://arxiv.org/abs/2503.01743

  26. [26]

    Vaani: Capturing the language landscape for an inclusive digital india,

    S. Pulikodan, A. Singh, A. Basu, N. Desai, P. K. J, P. D. Bhat, R. Dharmaraju, R. Gupta, S. Udupa, S. Kumar, S. Sharma, V. Sanka, D. Tewari, H. Dhand, A. Kamat, S. Singh, S. Vashishth, P. Talukdar, R. Acharya, and P. K. Ghosh, “Vaani: Capturing the language landscape for an inclusive digital india,” 2026. [Online]. Available: https://arxiv.org/abs/2603.28714

  27. [27]

    Indicconformer-600m-multilingual,

    AI4Bharat, “Indicconformer-600m-multilingual,” https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual, 2025

  28. [28]

    ASR Models,

    Speech Lab, IIT Madras, “ASR Models,” https://asr.iitm.ac.in/models/, accessed: March 2, 2026

  29. [29]

    Harveenchadha/hindi large wav2vec2,

    Harveen Singh Chadha, “Harveenchadha/hindi large wav2vec2,” https://huggingface.co/Harveenchadha/hindi large wav2vec2, 2022, Hugging Face model card, Apache-2.0 license; accessed: March 2, 2026

  30. [30]

    Robust speech recognition via large-scale weak supervision,

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” 2022. [Online]. Available: https://arxiv.org/abs/2212.04356

  31. [31]

    Omnilingual asr: Open-source multilingual speech recognition for 1600+ languages,

    O. A. team, G. Keren, A. Kozhevnikov, Y. Meng, C. Ropers, M. Setzler, S. Wang, I. Adebara, M. Auli, C. Balioglu, K. Chan, C. Cheng, J. Chuang, C. Droof, M. Duppenthaler, P.-A. Duquenne, A. Erben, C. Gao, G. M. Gonzalez, K. Lyu, S. Miglani, V. Pratap, K. R. Sadagopan, S. Saleem, A. Turkatenko, A. Ventayol-Boada, Z.-X. Yong, Y.-A. Chung, J. Maillard, R. ...

  32. [32]

    Voxtral,

    A. H. Liu, A. Ehrenberg, A. Lo, C. Denoix, C. Barreau, G. Lample, J.-M. Delignon, K. R. Chandu, P. von Platen, P. R. Muddireddy, S. Gandhi, S. Ghosh, S. Mishra, T. Foubert, A. Rastogi, A. Yang, A. Q. Jiang, A. Sablayrolles, A. Héliou, A. Martin, A. Agarwal, A. Roux, A. Darcet, A. Mensch, B. Bout, B. Rozière, B. D. Monicault, C. Bamford, C. Wallenwein,...

  33. [33]

    Shunyalabs/pingala v1 universal,

    Shunya Labs, “Shunyalabs/pingala v1 universal,” https://huggingface.co/shunyalabs/pingala-v1-universal, 2025, Hugging Face model card, Shunya Labs RAIL-M License; accessed: March 2, 2026

  34. [34]

    Gemini 3 flash,

    Google DeepMind, “Gemini 3 flash,” https://genai.google/api/models/gemini-3-flash, 2025

  35. [35]

    Chirp speech model,

    Google Cloud, “Chirp speech model,” https://cloud.google.com/speech-to-text, 2025, universal multilingual ASR model used in Speech-to-Text API; accessed 2026-02-05

  36. [36]

    Azure speech service,

    Microsoft, “Azure speech service,” https://learn.microsoft.com/azure/ai-services/speech-service/overview, 2025, cloud speech service for ASR and TTS; accessed 2026-02-05

  37. [37]

    Saarika-v2.5 speech recognition model,

    Sarvam AI, “Saarika-v2.5 speech recognition model,” https://docs.sarvam.ai/api-reference-docs/asr/models/saarika, 2025, speech-to-text model supporting multiple Indian languages; accessed 2026-02-05

  38. [38]

    Gpt-4o transcribe model documentation,

    OpenAI, “Gpt-4o transcribe model documentation,” https://developers.openai.com/api/docs/models/gpt-4o-transcribe, 2024, accessed: 2026

  39. [39]

    Introducing saaras v3: Built for the way india speaks,

    Sarvam AI, “Introducing saaras v3: Built for the way india speaks,” https://www.sarvam.ai/blogs/asr/, February 2026, accessed: 2026