IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages
Pith reviewed 2026-06-26 19:13 UTC · model grok-4.3
The pith
A 56-hour benchmark across eight Indic languages shows audio LLMs differ substantially in whether they use supplied context or fall back on pretraining knowledge.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
IndicContextEval is a multilingual benchmark of 56 hours of speech from 555 speakers across eight Indic languages and twenty-three professional domains. A seven-level prompting framework progressively supplies metadata, natural-language descriptions, entity lists in English and native script, and adversarial prompts containing incorrect entities. When five audio LLMs are tested under these conditions, they exhibit substantial differences in context utilisation behaviour.
What carries the argument
The 7-level prompting framework, which progressively introduces contextual signals (metadata, descriptions, entity lists, adversarial incorrect entities) to separate genuine context utilisation from reliance on parametric knowledge.
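To make the progressive design concrete, here is a minimal sketch of a prompt builder. Only L0 through L6 as level labels, and L1 as the language prompt, come from the paper; the assignment of the other signals to level indices, the `Sample` fields, the prompt wording, and the function name `build_prompt` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    # Hypothetical per-utterance metadata; field names are illustrative only.
    language: str
    domain: str
    description: str
    entities_en: list[str] = field(default_factory=list)
    entities_native: list[str] = field(default_factory=list)
    wrong_entities: list[str] = field(default_factory=list)


BASE = "Transcribe the audio."


def build_prompt(level: int, s: Sample) -> str:
    """Return a prompt for one of seven levels (assumed mapping, see lead-in)."""
    if not 0 <= level <= 6:
        raise ValueError("level must be in 0..6")
    parts = [BASE]
    if level >= 1:  # L1: language prompt (stated in the paper)
        parts.append(f"Language: {s.language}.")
    if level >= 2:  # assumed: domain metadata
        parts.append(f"Domain: {s.domain}.")
    if level >= 3:  # assumed: natural-language description
        parts.append(f"Context: {s.description}")
    if level == 4:  # assumed: entity list in English
        parts.append("Entities: " + ", ".join(s.entities_en))
    elif level == 5:  # assumed: entity list in native script
        parts.append("Entities: " + ", ".join(s.entities_native))
    elif level == 6:  # assumed: adversarial list of incorrect entities
        parts.append("Entities: " + ", ".join(s.wrong_entities))
    return " ".join(parts)
```

Keeping one template and adding one signal per level means any change in output between adjacent levels can be attributed to a single added signal, at least as far as the prompt text is concerned.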
If this is right
- AudioLLMs must be evaluated under varying context conditions rather than fixed prompts to measure actual grounding.
- Models that change output when supplied with correct versus incorrect entities demonstrate measurable context utilisation.
- Multilingual benchmarks are required to reveal language-specific patterns in how context is used.
- Adversarial entity lists provide a direct test of whether a model overrides parametric knowledge with supplied information.
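One way to quantify the correct-versus-incorrect comparison is to score the same utterance under both prompts and take the difference in word error rate. The sketch below is an illustration, not the paper's scoring code: `wer` is a plain word-level edit distance, and `adversarial_gap` is a hypothetical name for the difference.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def adversarial_gap(reference: str, hyp_correct_ctx: str, hyp_wrong_ctx: str) -> float:
    """WER under incorrect entities minus WER under correct entities."""
    return wer(reference, hyp_wrong_ctx) - wer(reference, hyp_correct_ctx)


ref = "the patient takes metformin daily"
gap = adversarial_gap(
    ref,
    "the patient takes metformin daily",   # output with the correct entity list
    "the patient takes metoprolol daily",  # output with an incorrect entity list
)
print(round(gap, 2))
```

A gap near zero suggests the model ignored the supplied list; a clearly positive gap suggests the list influenced the output, which is the behaviour the adversarial prompts are meant to expose.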
Where Pith is reading between the lines
- Developers could use level-by-level accuracy curves to decide whether a model is suitable for domain-specific transcription tasks.
- The same progressive-prompt design could be applied to non-Indic languages to check whether context-use differences are universal.
- If context utilisation proves low in many models, training methods that explicitly reward fidelity to supplied metadata may become necessary.
Load-bearing premise
Performance differences across the seven prompt levels reflect changes in context use rather than prompt sensitivity, dataset artifacts, or other model behaviors.
What would settle it
If every model produces statistically identical transcription accuracy and error patterns when moving from no-context prompts to full correct context and then to adversarial incorrect-entity prompts, the claim of substantial differences in context utilisation would not hold.
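"Statistically identical" can be checked with a paired bootstrap over utterances. The following is a sketch under assumptions: the inputs are per-utterance edit-error counts for the same utterances under two prompt levels, and the function name, the 2000 resamples, and the 95% interval are illustrative choices rather than anything specified by the paper.

```python
import random


def bootstrap_wer_diff(errors_a, errors_b, ref_lens, n_boot=2000, alpha=0.05, seed=0):
    """Percentile confidence interval for corpus WER(a) - WER(b), paired by utterance."""
    if not ref_lens or not (len(errors_a) == len(errors_b) == len(ref_lens)):
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(ref_lens)
    diffs = []
    for _ in range(n_boot):
        # Resample utterances with replacement, keeping the a/b pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        words = sum(ref_lens[i] for i in idx)
        diffs.append(sum(errors_a[i] - errors_b[i] for i in idx) / words)
    diffs.sort()
    low = diffs[int((alpha / 2) * n_boot)]
    high = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return low, high
```

If the interval for every pair of levels contains zero for every model, the data would be consistent with the "no difference" outcome described above.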
Original abstract
AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge learned during pretraining. Existing benchmarks cannot answer this question because they evaluate transcription under fixed prompting conditions and rarely include explicit contextual inputs. We introduce IndicContextEval, a 56-hour multilingual benchmark of natural speech from 555 speakers across 8 Indian languages and 23 professional domains. We design a 7-level prompting framework that progressively introduces contextual signals, including metadata, natural-language descriptions, entity lists in English and native script, and adversarial prompts with incorrect entities. Evaluating five models reveals substantial differences in context utilisation behaviour, highlighting the need for explicit evaluation of contextual grounding in AudioLLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces IndicContextEval, a 56-hour benchmark of natural speech from 555 speakers across 8 Indic languages and 23 domains. It defines a 7-level prompting framework that progressively adds contextual signals (metadata, natural-language descriptions, entity lists in English and native script, and adversarial incorrect entities) to test whether AudioLLMs utilize provided context or fall back on parametric knowledge. Evaluation of five models is reported to reveal substantial differences in context-utilization behavior.
Significance. A validated benchmark that isolates context utilization from parametric knowledge would be a useful contribution for AudioLLM development, especially in Indic languages where pretraining data are limited. The scale (8 languages, 555 speakers, 23 domains) and the inclusion of adversarial prompts are positive design choices that could support reproducible evaluation if the isolation claim holds.
Major comments (1)
- [7-level prompting framework (abstract and §3–4)] The central claim that performance deltas across the 7 prompt levels demonstrate genuine context utilization rests on the assumption that these levels isolate contextual grounding from prompt sensitivity, length effects, and language-mixing artifacts. No ablations are described that hold prompt length, English/native-script mixing, and instruction-following structure constant while varying only the contextual content (metadata, entity lists, or adversarial entities). Without such controls, the observed differences could arise from non-contextual factors.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the 7-level prompting framework. We agree that additional controls are needed to strengthen claims about context isolation and will incorporate them in revision.
Point-by-point responses
- Referee: [7-level prompting framework (abstract and §3–4)] The central claim that performance deltas across the 7 prompt levels demonstrate genuine context utilization rests on the assumption that these levels isolate contextual grounding from prompt sensitivity, length effects, and language-mixing artifacts. No ablations are described that hold prompt length, English/native-script mixing, and instruction-following structure constant while varying only the contextual content (metadata, entity lists, or adversarial entities). Without such controls, the observed differences could arise from non-contextual factors.
  Authors: We acknowledge the validity of this concern. The manuscript does not currently include ablations that hold prompt length, mixing ratios, and instruction structure fixed while varying only contextual content. In the revised version we will add matched control sets: (i) fixed-length prompts differing only in entity-list content, (ii) English-only vs. native-script variants with identical token counts, and (iii) instruction templates that differ solely in the presence/absence of adversarial entities. These will be reported alongside the existing 7-level results to isolate contextual effects from prompt artifacts.
  Revision: yes
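Control (i), fixed-length prompts that differ only in entity-list content, could be constructed along these lines. This is a hypothetical sketch: the template text and `matched_pair` are invented for illustration, and whitespace word counts stand in for model token counts, which would need the tokenizer of each evaluated model.

```python
TEMPLATE = "Transcribe the audio. Entities that may occur: {entities}."


def matched_pair(correct: list[str], distractors: list[str]) -> tuple[str, str]:
    """Build two prompts with identical template and length, differing only in entities."""
    if len(correct) != len(distractors):
        raise ValueError("entity lists must have the same number of items")
    for c, d in zip(correct, distractors):
        # Word count is a rough proxy for token count (assumption, see lead-in).
        if len(c.split()) != len(d.split()):
            raise ValueError(f"length mismatch: {c!r} vs {d!r}")
    prompt_correct = TEMPLATE.format(entities=", ".join(correct))
    prompt_wrong = TEMPLATE.format(entities=", ".join(distractors))
    return prompt_correct, prompt_wrong


a, b = matched_pair(["metformin", "insulin"], ["metoprolol", "aspirin"])
print(len(a.split()) == len(b.split()), a != b)
```

Because the two prompts share structure and length, a difference in transcription between them is harder to attribute to prompt length or instruction format.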
Circularity Check
No circularity: benchmark paper with no derivations or self-referential predictions
The paper introduces IndicContextEval as a new benchmark dataset and 7-level prompting framework for evaluating AudioLLMs on context utilization across Indic languages. No equations, fitted parameters, or mathematical derivations are present. The central claims rest on empirical evaluation of five models on the new data, with performance differences reported directly from those runs rather than reduced to inputs by construction. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results occur. This is a standard benchmark contribution that is self-contained against external evaluation.
Reference graph
Works this paper leans on
- [1] Introduction. Automatic speech recognition systems are increasingly deployed in applications where contextual information is available at inference time. For example, meeting transcription systems may know the meeting topic, medical dictation systems have access to domain terminology, and voice assistants often maintain user-specific entity lists. Su...
- [2] Related Work. Contextual Biasing in ASR. Contextual information such as domain terminology or user-specific entities can significantly improve speech recognition. Early approaches incorporated such information through language model fusion, whether at decoding time [1] or by jointly training the sequence model with a... (Source: arXiv:2606.19157v2 [eess.AS], 24 Jun 2026.)
- [3] The IndicContextEval Benchmark. Design Goals: The benchmark is designed to enable controlled evaluation of contextual grounding in AudioLLMs. To support this goal, the dataset satisfies 4 criteria. First, it contains natural speech across 8 Indian languages, covering diverse scripts and linguistic structures. Second, recordings span 23 professional domains,...
- [4] Experimental Setup. 4.1. Models evaluated. We evaluate 5 models on our benchmark, selecting leading proprietary and open-weight AudioLLMs that claim support for all 8 languages in our dataset, alongside a strong standalone ASR baseline
- [5] Standalone ASR baseline (evaluated at L1 only): We evaluate IndicConformer [30], a 600M-parameter multilingual Conformer trained on 22 Indian languages. Since it requires the target language as input, it cannot operate at L0 and is evaluated at L1 as a competitive non-LLM reference
- [6] AudioLLMs (evaluated at all seven levels, L0–L6): We evaluate GPT-4o Transcribe [6], Gemini 3 Flash [7], Sarvam Audio [8] for commercial models and select Gemma-3N [9] (8B-E4B) for open weight models. We do not include models such as Qwen3-Omni [10] and Voxtral [11] because their official documentation does not claim support for all eight Indian languages evalu...
- [7] Results. 5.1. Baseline performance. Table 2 reports the average WER at L1 (language prompt) for all models. Sarvam Audio achieves the lowest WER, followed by ...

  Table 3: WER (%) by prompt level.

  | Model     | L0    | L1    | L2    | L3    | L4    | L5    | L6    |
  |-----------|-------|-------|-------|-------|-------|-------|-------|
  | GPT-4o T  | 29.83 | 28.61 | 28.37 | 26.08 | 27.97 | 26.04 | 28.47 |
  | Gemini 3F | 24.30 | 18.90 | 19.28 | 18.39 | 19.88 | 17.46 | 19.67 |
  | Sarvam    | 20.39 | 16.86 | 16.78 | 16.43 | 16.80 | 15.70 | 1...  |
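The Table 3 figures quoted in this excerpt can be turned into per-model summaries with a few lines of arithmetic. The sketch below reads the fused digits as two-decimal values (so "27.9726.0428.47" becomes 27.97, 26.04, 28.47), leaves the truncated Sarvam L6 cell as missing, and uses a hypothetical helper `summarise`. Which contextual signal each level index carries is not stated in the excerpt, so the summary is expressed purely in level indices.

```python
# WER (%) at L0..L6 as read from the excerpt; None marks the truncated cell.
TABLE3 = {
    "GPT-4o T": [29.83, 28.61, 28.37, 26.08, 27.97, 26.04, 28.47],
    "Gemini 3F": [24.30, 18.90, 19.28, 18.39, 19.88, 17.46, 19.67],
    "Sarvam": [20.39, 16.86, 16.78, 16.43, 16.80, 15.70, None],
}


def summarise(row):
    """Return (best level, WER gain from L0 to best level, WER change from L5 to L6)."""
    known = [(value, level) for level, value in enumerate(row) if value is not None]
    best_wer, best_level = min(known)
    gain = round(row[0] - best_wer, 2)
    rebound = None if row[6] is None else round(row[6] - row[5], 2)
    return best_level, gain, rebound


for model, row in TABLE3.items():
    print(model, summarise(row))
```

For all three rows the lowest available WER falls at L5, and for the two complete rows WER rises again at L6; the size of these two numbers per model is one compact way to compare context sensitivity.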
- [8] Conclusion. We introduced IndicContextEval, a multilingual benchmark and controlled prompt taxonomy for evaluating context utilisation in AudioLLMs. IndicContextEval spans 55.93 hours of natural speech across eight Indian languages and 23 domains, enabling systematic analysis of how different contextual signals affect transcription. Our experiments sho...
- [9] Acknowledgments. We gratefully acknowledge the support of EkStep Foundation and Nilekani Philanthropies, whose generous funding made this work possible by supporting the team, resources, and cloud infrastructure required for the project. We sincerely thank the NPTEL team at IIT Madras for their invaluable assistance in reaching and engaging participants ...
- [10] Generative AI Use Disclosure. Generative AI tools were used solely for language polishing and editing during the preparation of this manuscript. These tools assisted with improving clarity, grammar, and conciseness of the writing. No generative AI system was used to generate experimental results, analyses, figures, or scientific conclusions. All technica...
- [11] D. Zhao, T. N. Sainath, D. Rybach, P. Rondon, D. Bhatia, B. Li, and R. Pang, “Shallow-fusion end-to-end contextual biasing,” in Proc. Interspeech, 2019, pp. 1418–1422.
- [12] A. Sriram, H. Jun, S. Satheesh, and A. Coates, “Cold fusion: Training seq2seq models together with language models,” in Proc. Interspeech, 2018, pp. 387–391.
- [13] G. Pundak, T. N. Sainath, R. Prabhavalkar, A. Kannan, and D. Zhao, “Deep context: End-to-end contextual speech recognition,” in Proc. IEEE SLT, 2018, pp. 418–425.
- [14] J. Drexler Fox and N. Delworth, “Improving contextual recognition of rare words with an alternate spelling prediction model,” arXiv preprint arXiv:2209.01250, 2022.
- [15] J. Tang, K. Kim, S. Shon, F. Wu, and P. Sridhar, “Improving asr contextual biasing with guided attention,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 12096–12100.
- [16] OpenAI, “Gpt-4o transcribe model,” 2024. [Online]. Available: https://developers.openai.com/api/docs/models/gpt-4o-transcribe
- [17] Google DeepMind, “Gemini 3,” 2025. [Online]. Available: https://deepmind.google/models/gemini/
- [18] Sarvam AI, “Sarvam audio: Speech recognition beyond transcription,” 2026. [Online]. Available: https://www.sarvam.ai/blogs/sarvam-audio
- [19], [20] Google, “Gemma 3n: Powerful, efficient, mobile-first ai,” [Online]. Available: https://developers.googleblog.com/en/introducing-gemma-3n
- [21] J. Xu et al., “Qwen3-omni technical report,” arXiv preprint arXiv:2509.17765, 2025.
- [22]
- [23] T. Javed et al., “Indicvoices: Towards building an inclusive multilingual speech dataset for indian languages,” in Findings of ACL, 2024, pp. 10740–10782.
- [24] R. Ardila et al., “Common voice: A massively-multilingual speech corpus,” in Proc. LREC, 2020, pp. 4218–4222.
- [25] A. Conneau, M. Ma, S. Khanuja, Y. Zhang, V. Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, “Fleurs: Few-shot learning evaluation of universal representations of speech,” in 2022 IEEE Spoken Language Technology Workshop (SLT), 2023, pp. 798–805.
- [26] H. Wang, L. Ma, D. Guo, X. Wang, L. Xie, J. Xu, and J. Lin, “Contextasr-bench: A massive contextual speech recognition benchmark,” arXiv preprint arXiv:2507.05727, 2025.
- [27] D. B. Piskala, “Profasr-bench: A benchmark for context-conditioned asr in high-stakes professional speech,” arXiv preprint arXiv:2512.23686, 2025.
- [28] Y. Li et al., “Cb-whisper: Contextual biasing whisper using open-vocabulary keyword-spotting,” in Proc. LREC-COLING, 2024, pp. 2941–2946.
- [29] Y. Sudo, Y. Fujita, A. Kojima, T. Mizumoto, and L. Liu, “Owsm-biasing: Contextualizing open whisper-style speech models for asr with dynamic vocabulary,” arXiv preprint arXiv:2506.09448, 2025.
- [30] V. Lall and Y. Liu, “Contextual biasing to improve domain-specific custom vocabulary audio transcription without explicit fine-tuning of whisper model,” arXiv preprint arXiv:2410.18363, 2024.
- [31] Y. Jogi, V. Aggarwal, S. S. Nair, Y. Verma, and A. Kubba, “Improving rare-word recognition of whisper in zero-shot settings,” arXiv preprint arXiv:2502.11572, 2025.
- [32] G. Sun, X. Zheng, C. Zhang, and P. C. Woodland, “Can contextual biasing remain effective with whisper and gpt-2?” arXiv preprint arXiv:2306.01942, 2023.
- [33] X. Gong, A. Lv, Z. Wang, H. Zhu, and Y. Qian, “Br-asr: Efficient and scalable bias retrieval framework for contextual biasing asr in speech llm,” arXiv preprint arXiv:2505.19179, 2025.
- [34] Y. Kong, J. Hou, J. Tang, B. Zhu, J. Zhang, and S. Xue, “Contextual biasing for llm-based asr with hotword retrieval and reinforcement learning,” arXiv preprint arXiv:2512.21828, 2025.
- [35] B. Ren, Y. Shi, and J. Li, “Lightweight prompt biasing for contextualized end-to-end asr systems,” arXiv preprint arXiv:2506.06252, 2025.
- [36] Y. Li, Y. Wu, J. Li, and S. Liu, “Prompting large language models for zero-shot domain adaptation in speech recognition,” arXiv preprint arXiv:2306.16007, 2023.
- [37] M. H. Hsu and H. Y. Lee, “Smile: Speech meta in-context learning for low-resource language automatic speech recognition,” arXiv preprint arXiv:2409.10429, 2024.
- [38] C.-K. Yang, K.-P. Huang, and H.-Y. Lee, “Do prompts really prompt? exploring the prompt understanding capability of whisper,” in 2024 IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 1–8.
- [39] M. Del Rio, P. Ha, Q. McNamara, C. Miller, and S. Chandra, “Earnings-22: A practical benchmark for accents in the wild,” in Proc. Interspeech, 2022.
- [40] Sarvam AI, “Sarvam-translate,” https://huggingface.co/sarvamai/sarvam-translate, 2025, Hugging Face model repository.
- [41] AI4Bharat, “Indicconformer: Multilingual asr model for 22 indian languages,” 2024. [Online]. Available: https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual
- [42] D. Kakwani, A. Kunchukuttan, S. Golla et al., “Indicnlpsuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for indian languages,” in Findings of EMNLP, 2020, pp. 4948–4961.