arxiv: 2606.19157 · v2 · pith:TVMFOZOBnew · submitted 2026-06-17 · 📡 eess.AS · cs.CL

IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages

Pith reviewed 2026-06-26 19:13 UTC · model grok-4.3

keywords: AudioLLMs · context utilization · Indic languages · benchmark · multilingual speech · prompting framework · contextual grounding · adversarial prompts

The pith

A 56-hour benchmark across eight Indic languages shows audio LLMs differ substantially in whether they use supplied context or fall back on pretraining knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates IndicContextEval to determine whether audio large language models genuinely condition their transcriptions on textual prompts such as domain descriptions or entity lists. Existing tests cannot separate context use from memorized knowledge because they evaluate under fixed prompting conditions and rarely include explicit contextual inputs. The benchmark supplies 56 hours of natural speech from 555 speakers in 8 languages and 23 domains. A seven-level prompting scheme adds metadata, descriptions, correct entity lists in English or native script, and finally incorrect adversarial entities. Evaluation of five models finds large differences in how much each model adjusts its output when context is provided or contradicted.

Core claim

IndicContextEval is a multilingual benchmark of 56 hours of speech from 555 speakers across eight Indic languages and twenty-three professional domains. A seven-level prompting framework progressively supplies metadata, natural-language descriptions, entity lists in English and native script, and adversarial prompts containing incorrect entities. When five audio LLMs are tested under these conditions, they exhibit substantial differences in context utilisation behaviour.

What carries the argument

The 7-level prompting framework that progressively introduces contextual signals (metadata, descriptions, entity lists, adversarial incorrect entities) to isolate genuine context utilization from parametric knowledge.
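A minimal sketch of how such a prompt ladder might be assembled. Only L0, L1 (language prompt), L5 (native-script entities), and L6 (adversarial) are named on this page; the mapping of L2, L3, and L4 to metadata, description, and English entity list is an assumption inferred from the order in which the abstract lists the signals, as is the choice to make levels cumulative. The names `Utterance` and `build_prompt`, the field names, and the prompt wording are hypothetical and are not the paper's templates.

```python
from dataclasses import dataclass, field


@dataclass
class Utterance:
    """Hypothetical per-recording context record (not the paper's schema)."""
    language: str
    domain: str
    description: str
    entities_en: list[str] = field(default_factory=list)
    entities_native: list[str] = field(default_factory=list)
    wrong_entities: list[str] = field(default_factory=list)


BASE = "Transcribe the audio."


def build_prompt(u: Utterance, level: int) -> str:
    """Return the prompt for one level, L0..L6 (assumed level semantics)."""
    if not 0 <= level <= 6:
        raise ValueError("level must be in 0..6")
    parts = [BASE]                                  # L0: no context at all
    if level >= 1:
        parts.append(f"Language: {u.language}.")    # L1: language prompt
    if level >= 2:
        parts.append(f"Domain: {u.domain}.")        # L2: metadata (assumed)
    if level >= 3:
        parts.append(f"Description: {u.description}")  # L3: description (assumed)
    if level == 4:                                  # L4: English entities (assumed)
        parts.append("Entities: " + ", ".join(u.entities_en))
    elif level == 5:                                # L5: native-script entities
        parts.append("Entities: " + ", ".join(u.entities_native))
    elif level == 6:                                # L6: adversarial, incorrect entities
        parts.append("Entities: " + ", ".join(u.wrong_entities))
    return " ".join(parts)
```

Keeping the entity line in the same position and format at L4, L5, and L6 means the three levels differ only in entity content, which is the contrast the adversarial level relies on.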

If this is right

  • AudioLLMs must be evaluated under varying context conditions rather than fixed prompts to measure actual grounding.
  • Models that change output when supplied with correct versus incorrect entities demonstrate measurable context utilisation.
  • Multilingual benchmarks are required to reveal language-specific patterns in how context is used.
  • Adversarial entity lists provide a direct test of whether a model overrides parametric knowledge with supplied information.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers could use level-by-level accuracy curves to decide whether a model is suitable for domain-specific transcription tasks.
  • The same progressive-prompt design could be applied to non-Indic languages to check whether context-use differences are universal.
  • If context utilisation proves low in many models, training methods that explicitly reward fidelity to supplied metadata may become necessary.

Load-bearing premise

Performance differences across the seven prompt levels reflect changes in context use rather than prompt sensitivity, dataset artifacts, or other model behaviors.

What would settle it

If every model produces statistically identical transcription accuracy and error patterns when moving from no-context prompts to full correct context and then to adversarial incorrect-entity prompts, the claim of substantial differences in context utilisation would not hold.
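As a rough illustration of the quantities this test turns on, the sketch below computes two point differences per model: the WER change from L1 to L5, and the change from L5 to L6. The WER values are read from the flattened Table 3 excerpt in the reference graph on this page (GPT-4o T and Gemini 3F rows only; the Sarvam row is truncated there). The function name `context_deltas` and the choice of L1 as the reference level are assumptions. These are plain differences of averages, not a statistical test; a significance check would need per-utterance scores, which this page does not provide.

```python
# WER (%) at L0..L6, as read from the Table 3 excerpt on this page.
WER = {
    "GPT-4o T": [29.83, 28.61, 28.37, 26.08, 27.97, 26.04, 28.47],
    "Gemini 3F": [24.30, 18.90, 19.28, 18.39, 19.88, 17.46, 19.67],
}


def context_deltas(wer_by_level):
    """Point differences between levels (hypothetical summary, not the paper's metric)."""
    l1, l5, l6 = wer_by_level[1], wer_by_level[5], wer_by_level[6]
    return {
        # Positive: WER drops when native-script entities are supplied.
        "gain_L1_to_L5": round(l1 - l5, 2),
        # Positive: WER rises again when the entities are replaced by incorrect ones.
        "reversion_L5_to_L6": round(l6 - l5, 2),
    }


for model, levels in WER.items():
    print(model, context_deltas(levels))
```

If both differences were indistinguishable from zero for every model, that would be the outcome described above under which the paper's claim would not hold.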

Figures

Figures reproduced from arXiv: 2606.19157 by Dhruv Subhash Rathi, Eldho Ittan George, Kaushal Bhogale, Mitesh M. Khapra, R J Hari, Sakshi Joshi, Sanskar Singh.

Figure 1: NEER (%) across context levels. Native-script entities (L5) produce large drops for GPT-4o Transcribe, Gemini 3 Flash, and Gemma-3N, with a smaller effect on Sarvam Audio. L6 (adversarial) returns near L1 for all models. Gemini 3 Flash achieves the best NEER (17.39% at L5).
[…]nical, professional, and creative fields. These include areas such as Core Engineering, Data Science, Medical Sciences, and Robotics …
Original abstract

AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge learned during pretraining. Existing benchmarks cannot answer this question because they evaluate transcription under fixed prompting conditions and rarely include explicit contextual inputs. We introduce IndicContextEval, a 56-hour multilingual benchmark of natural speech from 555 speakers across 8 Indian languages and 23 professional domains. We design a 7-level prompting framework that progressively introduces contextual signals, including metadata, natural-language descriptions, entity lists in English and native script, and adversarial prompts with incorrect entities. Evaluating five models reveals substantial differences in context utilisation behaviour, highlighting the need for explicit evaluation of contextual grounding in AudioLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces IndicContextEval, a 56-hour benchmark of natural speech from 555 speakers across 8 Indic languages and 23 domains. It defines a 7-level prompting framework that progressively adds contextual signals (metadata, natural-language descriptions, entity lists in English and native script, and adversarial incorrect entities) to test whether AudioLLMs utilize provided context or fall back on parametric knowledge. Evaluation of five models is reported to reveal substantial differences in context-utilization behavior.

Significance. A validated benchmark that isolates context utilization from parametric knowledge would be a useful contribution for AudioLLM development, especially in Indic languages where pretraining data are limited. The scale (8 languages, 555 speakers, 23 domains) and the inclusion of adversarial prompts are positive design choices that could support reproducible evaluation if the isolation claim holds.

major comments (1)
  1. [7-level prompting framework (abstract and §3–4)] The central claim that performance deltas across the 7 prompt levels demonstrate genuine context utilization rests on the assumption that these levels isolate contextual grounding from prompt sensitivity, length effects, and language-mixing artifacts. No ablations are described that hold prompt length, English/native-script mixing, and instruction-following structure constant while varying only the contextual content (metadata, entity lists, or adversarial entities). Without such controls, the observed differences could arise from non-contextual factors.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the 7-level prompting framework. We agree that additional controls are needed to strengthen claims about context isolation and will incorporate them in revision.

Point-by-point responses
  1. Referee: [7-level prompting framework (abstract and §3–4)] The central claim that performance deltas across the 7 prompt levels demonstrate genuine context utilization rests on the assumption that these levels isolate contextual grounding from prompt sensitivity, length effects, and language-mixing artifacts. No ablations are described that hold prompt length, English/native-script mixing, and instruction-following structure constant while varying only the contextual content (metadata, entity lists, or adversarial entities). Without such controls, the observed differences could arise from non-contextual factors.

    Authors: We acknowledge the validity of this concern. The manuscript does not currently include ablations that hold prompt length, mixing ratios, and instruction structure fixed while varying only contextual content. In the revised version we will add matched control sets: (i) fixed-length prompts differing only in entity-list content, (ii) English-only vs. native-script variants with identical token counts, and (iii) instruction templates that differ solely in the presence/absence of adversarial entities. These will be reported alongside the existing 7-level results to isolate contextual effects from prompt artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark paper with no derivations or self-referential predictions

full rationale

The paper introduces IndicContextEval as a new benchmark dataset and 7-level prompting framework for evaluating AudioLLMs on context utilization across Indic languages. No equations, fitted parameters, or mathematical derivations are present. The central claims rest on empirical evaluation of five models on the new data, with performance differences reported directly from those runs rather than reduced to inputs by construction. No load-bearing self-citations, ansatz smuggling, or renaming of known results occur. This is a standard benchmark contribution whose claims can be checked by external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Benchmark paper with no mathematical content; no free parameters, axioms, or invented entities are introduced or required.

pith-pipeline@v0.9.1-grok · 5695 in / 1118 out tokens · 17722 ms · 2026-06-26T19:13:00.284954+00:00 · methodology


Reference graph

Works this paper leans on

42 extracted references · 2 linked inside Pith

  1. [1]

    Introduction Automatic speech recognition systems are increasingly deployed in applications where contextual information is available at inference time. For example, meeting transcription systems may know the meeting topic, medical dictation systems have access to domain terminology, and voice assistants often maintain user-specific entity lists. Su...

  2. [2]

    Related Work Contextual Biasing in ASR. Contextual information such as domain terminology or user-specific entities can significantly improve speech recognition. Early approaches incorporated such information through language model fusion, whether at decoding time [1] or by jointly training the sequence model with a...

  3. [3]

    To support this goal, the dataset satisfies 4 criteria

    The IndicContextEval Benchmark Design Goals: The benchmark is designed to enable controlled evaluation of contextual grounding in AudioLLMs. To support this goal, the dataset satisfies 4 criteria. First, it contains natural speech across 8 Indian languages, covering diverse scripts and linguistic structures. Second, recordings span 23 professional domains,...

  4. [4]

    Experimental Setup 4.1. Models evaluated We evaluate 5 models on our benchmark, selecting leading proprietary and open-weight AudioLLMs that claim support for all 8 languages in our dataset, alongside a strong standalone ASR baseline

  5. [5]

    Since it requires the target language as input, it cannot operate at L0 and is evaluated at L1 as a competitive non-LLM reference

    Standalone ASR baseline (evaluated at L1 only): We evaluate IndicConformer [30], a 600M-parameter multilingual Conformer trained on 22 Indian languages. Since it requires the target language as input, it cannot operate at L0 and is evaluated at L1 as a competitive non-LLM reference

  6. [6]

    We do not include models such as Qwen3-Omni [10] and Voxtral [11] because their official documentation does not claim support for all eight Indian languages evaluated in this work

    AudioLLMs (evaluated at all seven levels, L0–L6): We evaluate GPT-4o Transcribe [6], Gemini 3 Flash [7], Sarvam Audio [8] for commercial models and select Gemma-3N [9] (8B-E4B) for open weight models. We do not include models such as Qwen3-Omni [10] and Voxtral [11] because their official documentation does not claim support for all eight Indian languages evalu...

  7. [7]

    Baseline performance Table 2 reports the average WER at L1 (language prompt) for all models

    Results 5.1. Baseline performance Table 2 reports the average WER at L1 (language prompt) for all models. Sarvam Audio achieves the lowest WER, followed by […]

    Table 3: WER (%) by prompt level.

    Model       L0     L1     L2     L3     L4     L5     L6
    GPT-4o T    29.83  28.61  28.37  26.08  27.97  26.04  28.47
    Gemini 3F   24.30  18.90  19.28  18.39  19.88  17.46  19.67
    Sarvam      20.39  16.86  16.78  16.43  16.80  15.70  1...

  8. [8]

    Conclusion We introduced IndicContextEval, a multilingual benchmark and controlled prompt taxonomy for evaluating context utilisation in AudioLLMs. IndicContextEval spans 55.93 hours of natural speech across eight Indian languages and 23 domains, enabling systematic analysis of how different contextual signals affect transcription. Our experiments sho...

  9. [9]

    We sincerely thank the NPTEL team at IIT Madras for their invaluable assistance in reaching and engaging participants for the data collection effort

    Acknowledgments We gratefully acknowledge the support of EkStep Foundation and Nilekani Philanthropies, whose generous funding made this work possible by supporting the team, resources, and cloud infrastructure required for the project. We sincerely thank the NPTEL team at IIT Madras for their invaluable assistance in reaching and engaging participants ...

  10. [10]

    These tools assisted with improving clarity, grammar, and conciseness of the writing

    Generative AI Use Disclosure Generative AI tools were used solely for language polishing and editing during the preparation of this manuscript. These tools assisted with improving clarity, grammar, and conciseness of the writing. No generative AI system was used to generate experimental results, analyses, figures, or scientific conclusions. All technica...

  11. [11]

    Shallow-fusion end-to-end contextual biasing,

    D. Zhao, T. N. Sainath, D. Rybach, P. Rondon, D. Bhatia, B. Li, and R. Pang, “Shallow-fusion end-to-end contextual biasing,” in Proc. Interspeech, 2019, pp. 1418–1422

  12. [12]

    Cold fusion: Training seq2seq models together with language models,

    A. Sriram, H. Jun, S. Satheesh, and A. Coates, “Cold fusion: Training seq2seq models together with language models,” in Proc. Interspeech, 2018, pp. 387–391

  13. [13]

    Deep context: End-to-end contextual speech recognition,

    G. Pundak, T. N. Sainath, R. Prabhavalkar, A. Kannan, and D. Zhao, “Deep context: End-to-end contextual speech recognition,” in Proc. IEEE SLT, 2018, pp. 418–425

  14. [14]

    Improving contextual recognition of rare words with an alternate spelling prediction model,

    J. Drexler Fox and N. Delworth, “Improving contextual recognition of rare words with an alternate spelling prediction model,” arXiv preprint arXiv:2209.01250, 2022

  15. [15]

    Improving asr contextual biasing with guided attention,

    J. Tang, K. Kim, S. Shon, F. Wu, and P. Sridhar, “Improving asr contextual biasing with guided attention,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 12096–12100

  16. [16]

    Gpt-4o transcribe model,

    OpenAI, “Gpt-4o transcribe model,” 2024. [Online]. Available: https://developers.openai.com/api/docs/models/gpt-4o-transcribe

  17. [17]

    Gemini 3,

    Google DeepMind, “Gemini 3,” 2025. [Online]. Available: https://deepmind.google/models/gemini/

  18. [18]

    Sarvam audio: Speech recognition beyond transcription,

    Sarvam AI, “Sarvam audio: Speech recognition beyond transcription,” 2026. [Online]. Available: https://www.sarvam.ai/blogs/sarvam-audio

  19. [19]

    Gemma 3n: Powerful, efficient, mobile-first ai,

    Google, “Gemma 3n: Powerful, efficient, mobile-first ai,”

  20. [20]

    Available: https://developers.googleblog.com/en/introducing-gemma-3n

    [Online]. Available: https://developers.googleblog.com/en/introducing-gemma-3n

  21. [21]

    Qwen3-omni technical report,

    J. Xu et al., “Qwen3-omni technical report,” arXiv preprint arXiv:2509.17765, 2025

  22. [22]

    Voxtral,

    A. H. Liu et al., “Voxtral,” arXiv preprint arXiv:2507.13264, 2025

  23. [23]

    Indicvoices: Towards building an inclusive multilingual speech dataset for indian languages,

    T. Javed et al., “Indicvoices: Towards building an inclusive multilingual speech dataset for indian languages,” in Findings of ACL, 2024, pp. 10740–10782

  24. [24]

    Common voice: A massively-multilingual speech corpus,

    R. Ardila et al., “Common voice: A massively-multilingual speech corpus,” in Proc. LREC, 2020, pp. 4218–4222

  25. [25]

    Fleurs: Few-shot learning evaluation of universal representations of speech,

    A. Conneau, M. Ma, S. Khanuja, Y. Zhang, V. Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, “Fleurs: Few-shot learning evaluation of universal representations of speech,” in 2022 IEEE Spoken Language Technology Workshop (SLT), 2023, pp. 798–805

  26. [26]

    Contextasr-bench: A massive contextual speech recognition benchmark,

    H. Wang, L. Ma, D. Guo, X. Wang, L. Xie, J. Xu, and J. Lin, “Contextasr-bench: A massive contextual speech recognition benchmark,” arXiv preprint arXiv:2507.05727, 2025

  27. [27]

    Profasr-bench: A benchmark for context-conditioned asr in high-stakes professional speech,

    D. B. Piskala, “Profasr-bench: A benchmark for context-conditioned asr in high-stakes professional speech,” arXiv preprint arXiv:2512.23686, 2025

  28. [28]

    Cb-whisper: Contextual biasing whisper using open-vocabulary keyword-spotting,

    Y. Li et al., “Cb-whisper: Contextual biasing whisper using open-vocabulary keyword-spotting,” in Proc. LREC-COLING, 2024, pp. 2941–2946

  29. [29]

    Owsm-biasing: Contextualizing open whisper-style speech models for asr with dynamic vocabulary,

    Y. Sudo, Y. Fujita, A. Kojima, T. Mizumoto, and L. Liu, “Owsm-biasing: Contextualizing open whisper-style speech models for asr with dynamic vocabulary,” arXiv preprint arXiv:2506.09448, 2025

  30. [30]

    Contextual biasing to improve domain-specific custom vocabulary audio transcription without explicit fine-tuning of whisper model,

    V. Lall and Y. Liu, “Contextual biasing to improve domain-specific custom vocabulary audio transcription without explicit fine-tuning of whisper model,” arXiv preprint arXiv:2410.18363, 2024

  31. [31]

    Improving rare-word recognition of whisper in zero-shot settings,

    Y. Jogi, V. Aggarwal, S. S. Nair, Y. Verma, and A. Kubba, “Improving rare-word recognition of whisper in zero-shot settings,” arXiv preprint arXiv:2502.11572, 2025

  32. [32]

    Can contextual biasing remain effective with whisper and gpt-2?

    G. Sun, X. Zheng, C. Zhang, and P. C. Woodland, “Can contextual biasing remain effective with whisper and gpt-2?” arXiv preprint arXiv:2306.01942, 2023

  33. [33]

    Br-asr: Efficient and scalable bias retrieval framework for contextual biasing asr in speech llm,

    X. Gong, A. Lv, Z. Wang, H. Zhu, and Y. Qian, “Br-asr: Efficient and scalable bias retrieval framework for contextual biasing asr in speech llm,” arXiv preprint arXiv:2505.19179, 2025

  34. [34]

    Contextual biasing for llm-based asr with hotword retrieval and reinforcement learning,

    Y. Kong, J. Hou, J. Tang, B. Zhu, J. Zhang, and S. Xue, “Contextual biasing for llm-based asr with hotword retrieval and reinforcement learning,” arXiv preprint arXiv:2512.21828, 2025

  35. [35]

    Lightweight prompt biasing for contextualized end-to-end asr systems,

    B. Ren, Y. Shi, and J. Li, “Lightweight prompt biasing for contextualized end-to-end asr systems,” arXiv preprint arXiv:2506.06252, 2025

  36. [36]

    Prompting large language models for zero-shot domain adaptation in speech recognition,

    Y. Li, Y. Wu, J. Li, and S. Liu, “Prompting large language models for zero-shot domain adaptation in speech recognition,” arXiv preprint arXiv:2306.16007, 2023

  37. [37]

    Smile: Speech meta in-context learning for low-resource language automatic speech recognition,

    M. H. Hsu and H. Y. Lee, “Smile: Speech meta in-context learning for low-resource language automatic speech recognition,” arXiv preprint arXiv:2409.10429, 2024

  38. [38]

    Do prompts really prompt? exploring the prompt understanding capability of whisper,

    C.-K. Yang, K.-P. Huang, and H.-Y. Lee, “Do prompts really prompt? exploring the prompt understanding capability of whisper,” in 2024 IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 1–8

  39. [39]

    Earnings-22: A practical benchmark for accents in the wild,

    M. Del Rio, P. Ha, Q. McNamara, C. Miller, and S. Chandra, “Earnings-22: A practical benchmark for accents in the wild,” in Proc. Interspeech, 2022

  40. [40]

    Sarvam-translate,

    Sarvam AI, “Sarvam-translate,” https://huggingface.co/sarvamai/sarvam-translate, 2025, Hugging Face model repository

  41. [41]

    Indicconformer: Multilingual asr model for 22 indian languages,

    AI4Bharat, “Indicconformer: Multilingual asr model for 22 indian languages,” 2024. [Online]. Available: https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual

  42. [42]

    Indicnlpsuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for indian languages,

    D. Kakwani, A. Kunchukuttan, S. Golla et al., “Indicnlpsuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for indian languages,” in Findings of EMNLP, 2020, pp. 4948–4961