IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages

Dhruv Subhash Rathi; Eldho Ittan George; Kaushal Bhogale; Mitesh M. Khapra; R J Hari; Sakshi Joshi; Sanskar Singh

arxiv: 2606.19157 · v2 · pith:TVMFOZOBnew · submitted 2026-06-17 · 📡 eess.AS · cs.CL

Sakshi Joshi, Dhruv Subhash Rathi, Sanskar Singh, Eldho Ittan George, R J Hari, Kaushal Bhogale, Mitesh M. Khapra

Pith reviewed 2026-06-26 19:13 UTC · model grok-4.3

classification 📡 eess.AS cs.CL

keywords AudioLLMs · context utilization · Indic languages · benchmark · multilingual speech · prompting framework · contextual grounding · adversarial prompts

The pith

A 56-hour benchmark across eight Indic languages shows audio LLMs differ substantially in whether they use supplied context or fall back on pretraining knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates IndicContextEval to determine whether audio large language models genuinely condition their transcriptions on textual prompts such as domain descriptions or entity lists. Existing tests cannot separate context use from memorized knowledge because they use fixed prompts without explicit contextual inputs. The benchmark supplies 56 hours of natural speech from 555 speakers in 8 languages and 23 domains. A seven-level prompting scheme adds metadata, descriptions, correct entity lists in English or native script, and finally incorrect adversarial entities. Evaluation of five models finds large differences in how much each model adjusts its output when context is provided or contradicted.

Core claim

IndicContextEval is a multilingual benchmark of 56 hours of speech from 555 speakers across eight Indic languages and twenty-three professional domains. A seven-level prompting framework progressively supplies metadata, natural-language descriptions, entity lists in English and native script, and adversarial prompts containing incorrect entities. When five audio LLMs are tested under these conditions, they exhibit substantial differences in context utilisation behaviour.

What carries the argument

The 7-level prompting framework that progressively introduces contextual signals (metadata, descriptions, entity lists, adversarial incorrect entities) to isolate genuine context utilization from parametric knowledge.
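
The level structure can be made concrete with a small prompt builder. The sketch below is an illustration only, not the authors' code. The page confirms that L1 is a language prompt, L5 supplies native-script entities, and L6 is adversarial; the assignment of metadata, descriptions, and English entities to L2–L4, the cumulative layering, the prompt wording, and the `Utterance` and `build_prompt` names are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Utterance:
    """Hypothetical per-recording context; field names are illustrative."""
    language: str
    domain: str
    description: str
    entities_en: list = field(default_factory=list)
    entities_native: list = field(default_factory=list)
    wrong_entities: list = field(default_factory=list)


def build_prompt(utt: Utterance, level: int) -> str:
    """Return the text prompt for one context level (0 to 6)."""
    if not 0 <= level <= 6:
        raise ValueError("level must be between 0 and 6")
    parts = ["Transcribe the audio."]  # L0: assumed to carry no context
    if level >= 1:
        parts.append(f"Language: {utt.language}.")  # L1: language prompt
    if level >= 2:
        parts.append(f"Domain: {utt.domain}.")  # L2: metadata (assumed)
    if level >= 3:
        parts.append(f"Description: {utt.description}")  # L3: description (assumed)
    # L4 to L6 differ only in which entity list is attached (L4 is assumed).
    entity_lists = {
        4: utt.entities_en,
        5: utt.entities_native,
        6: utt.wrong_entities,
    }
    if level in entity_lists:
        parts.append("Entities: " + ", ".join(entity_lists[level]) + ".")
    return " ".join(parts)
```

Keeping the instruction text fixed and swapping only the entity list at L4–L6 means any change in output between those levels is tied to the list's content rather than to a new template.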

If this is right

AudioLLMs must be evaluated under varying context conditions rather than fixed prompts to measure actual grounding.
Models that change output when supplied with correct versus incorrect entities demonstrate measurable context utilisation.
Multilingual benchmarks are required to reveal language-specific patterns in how context is used.
Adversarial entity lists provide a direct test of whether a model overrides parametric knowledge with supplied information.
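
One way to make the correct-versus-incorrect comparison measurable is to count how many of the supplied entities surface in the transcript. The sketch below is a hypothetical illustration, not the paper's NEER metric: `entity_uptake` and its case-insensitive substring match are assumptions chosen for brevity, and a real evaluation would need script-aware normalisation and alignment against the reference transcript.

```python
def entity_uptake(hypothesis: str, entities: list) -> float:
    """Fraction of the supplied entities that appear in the transcript.

    Uses a plain case-insensitive substring match (an assumption made
    for this sketch, not a method taken from the paper).
    """
    if not entities:
        return 0.0
    text = hypothesis.casefold()
    hits = sum(1 for entity in entities if entity.casefold() in text)
    return hits / len(entities)


# Hypothetical transcript and entity lists, for illustration only.
transcript = "the patient was given metformin and aspirin"
correct_uptake = entity_uptake(transcript, ["Metformin", "aspirin"])
adversarial_uptake = entity_uptake(transcript, ["metoprolol", "ibuprofen"])
```

A model that leans on supplied context should show high uptake for a correct list; high uptake for an adversarial list would indicate that it copies the prompt even when the audio disagrees.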

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers could use level-by-level accuracy curves to decide whether a model is suitable for domain-specific transcription tasks.
The same progressive-prompt design could be applied to non-Indic languages to check whether context-use differences are universal.
If context utilisation proves low in many models, training methods that explicitly reward fidelity to supplied metadata may become necessary.

Load-bearing premise

Performance differences across the seven prompt levels reflect changes in context use rather than prompt sensitivity, dataset artifacts, or other model behaviors.

What would settle it

If every model produces statistically identical transcription accuracy and error patterns when moving from no-context prompts to full correct context and then to adversarial incorrect-entity prompts, the claim of substantial differences in context utilisation would not hold.
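
A first-pass check of this criterion is to compare WER at the language-only, native-script-entity, and adversarial levels. The sketch below uses the two complete rows readable in the Table 3 excerpt reproduced on this page. The `context_deltas` helper and the choice of L1, L5, and L6 as comparison points are assumptions, and raw differences say nothing about statistical significance.

```python
# WER (%) at levels L0..L6, copied from the Table 3 excerpt on this page.
# Only rows that appear in full are included.
WER_BY_LEVEL = {
    "GPT-4o T": [29.83, 28.61, 28.37, 26.08, 27.97, 26.04, 28.47],
    "Gemini 3F": [24.30, 18.90, 19.28, 18.39, 19.88, 17.46, 19.67],
}


def context_deltas(wer: list) -> dict:
    """Summarise how WER moves between L1, L5, and L6 (hypothetical helper)."""
    l1, l5, l6 = wer[1], wer[5], wer[6]
    return {
        # Positive gain: WER drops when native-script entities are supplied.
        "gain_L1_to_L5": round(l1 - l5, 2),
        # Positive rebound: WER rises again when entities are incorrect.
        "rebound_L5_to_L6": round(l6 - l5, 2),
    }


for model, wer in WER_BY_LEVEL.items():
    print(model, context_deltas(wer))
```

With these numbers, GPT-4o T gains 2.57 WER points from L1 to L5 and gives back 2.43 at L6, while Gemini 3F gains 1.44 and rebounds by 2.21. If every model showed gains and rebounds near zero, the claim would be in trouble.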

Figures

Figures reproduced from arXiv: 2606.19157 by Dhruv Subhash Rathi, Eldho Ittan George, Kaushal Bhogale, Mitesh M. Khapra, R J Hari, Sakshi Joshi, Sanskar Singh.

**Figure 1.** NEER (%) across context levels. Native-script entities (L5) produce large drops for GPT-4o Transcribe, Gemini 3 Flash, and Gemma-3N, with a smaller effect on Sarvam Audio. L6 (adversarial) returns near L1 for all models. Gemini 3 Flash achieves the best NEER (17.39% at L5).

Original abstract

AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge learned during pretraining. Existing benchmarks cannot answer this question because they evaluate transcription under fixed prompting conditions and rarely include explicit contextual inputs. We introduce IndicContextEval, a 56-hour multilingual benchmark of natural speech from 555 speakers across 8 Indian languages and 23 professional domains. We design a 7-level prompting framework that progressively introduces contextual signals, including metadata, natural-language descriptions, entity lists in English and native script, and adversarial prompts with incorrect entities. Evaluating five models reveals substantial differences in context utilisation behaviour, highlighting the need for explicit evaluation of contextual grounding in AudioLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

IndicContextEval introduces a useful new benchmark for context use in Indic AudioLLMs, but the 7-level framework still needs checks to confirm the deltas track genuine context grounding rather than prompt artifacts.

The paper's main contribution is a 56-hour dataset of natural speech across 8 Indic languages and 23 domains, paired with a 7-level prompting setup that adds metadata, descriptions, entity lists, and adversarial incorrect entities. This directly targets whether AudioLLMs actually condition on the supplied context or fall back to pretraining knowledge, which existing fixed-prompt benchmarks do not test.

It fills a clear gap: prior work on AudioLLMs has been light on Indic languages and on explicit context evaluation. The scale (555 speakers) and the progressive design are practical steps forward.

The soft spot is the one the stress-test note flags. Performance shifts across levels could come from models reacting to prompt length, structure, or language mixing rather than the actual contextual content. The abstract does not describe ablations that hold those factors fixed while varying only the signal, so the claim of "substantial differences in context utilisation" rests on an assumption that is not yet shown to hold. Dataset construction details like transcription validation and speaker criteria are also not visible here.

This is for groups building or evaluating multilingual AudioLLMs who need testbeds beyond English-centric setups. It is worth sending to peer review because the benchmark itself is new and the question it asks matters; the evaluation claims will need tightening in revision but the core resource stands on its own.

Referee Report

1 major / 0 minor

Summary. The paper introduces IndicContextEval, a 56-hour benchmark of natural speech from 555 speakers across 8 Indic languages and 23 domains. It defines a 7-level prompting framework that progressively adds contextual signals (metadata, natural-language descriptions, entity lists in English and native script, and adversarial incorrect entities) to test whether AudioLLMs utilize provided context or fall back on parametric knowledge. Evaluation of five models is reported to reveal substantial differences in context-utilization behavior.

Significance. A validated benchmark that isolates context utilization from parametric knowledge would be a useful contribution for AudioLLM development, especially in Indic languages where pretraining data are limited. The scale (8 languages, 555 speakers, 23 domains) and the inclusion of adversarial prompts are positive design choices that could support reproducible evaluation if the isolation claim holds.

major comments (1)

[7-level prompting framework (abstract and §3–4)] The central claim that performance deltas across the 7 prompt levels demonstrate genuine context utilization rests on the assumption that these levels isolate contextual grounding from prompt sensitivity, length effects, and language-mixing artifacts. No ablations are described that hold prompt length, English/native-script mixing, and instruction-following structure constant while varying only the contextual content (metadata, entity lists, or adversarial entities). Without such controls, the observed differences could arise from non-contextual factors.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the 7-level prompting framework. We agree that additional controls are needed to strengthen claims about context isolation and will incorporate them in revision.

Referee: [7-level prompting framework (abstract and §3–4)] The central claim that performance deltas across the 7 prompt levels demonstrate genuine context utilization rests on the assumption that these levels isolate contextual grounding from prompt sensitivity, length effects, and language-mixing artifacts. No ablations are described that hold prompt length, English/native-script mixing, and instruction-following structure constant while varying only the contextual content (metadata, entity lists, or adversarial entities). Without such controls, the observed differences could arise from non-contextual factors.

Authors: We acknowledge the validity of this concern. The manuscript does not currently include ablations that hold prompt length, mixing ratios, and instruction structure fixed while varying only contextual content. In the revised version we will add matched control sets: (i) fixed-length prompts differing only in entity-list content, (ii) English-only vs. native-script variants with identical token counts, and (iii) instruction templates that differ solely in the presence/absence of adversarial entities. These will be reported alongside the existing 7-level results to isolate contextual effects from prompt artifacts. revision: yes
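
Control (i) can be sketched as a pair of prompts built from one template, where only the entity list changes and the number of entities is held fixed. This is a minimal illustration under stated assumptions: `TEMPLATE`, `matched_wrong_entities`, and `control_pair` are hypothetical names, the distractor pool is supplied by the caller, and matching entity counts is only a proxy for the matched token counts described in the rebuttal.

```python
import random

# Hypothetical fixed template; only the entity list varies between conditions.
TEMPLATE = "Transcribe the audio. Language: {language}. Entities: {entities}."


def matched_wrong_entities(correct: list, pool: list, rng: random.Random) -> list:
    """Sample as many distractors as there are correct entities, with no overlap."""
    correct_set = set(correct)
    candidates = [entity for entity in pool if entity not in correct_set]
    if len(candidates) < len(correct):
        raise ValueError("distractor pool is too small")
    return rng.sample(candidates, len(correct))


def control_pair(language: str, correct: list, pool: list, seed: int = 0) -> tuple:
    """Return (correct_prompt, adversarial_prompt) sharing one template."""
    rng = random.Random(seed)  # seeded so the pairing is reproducible
    wrong = matched_wrong_entities(correct, pool, rng)

    def render(entities: list) -> str:
        return TEMPLATE.format(language=language, entities=", ".join(entities))

    return render(correct), render(wrong)
```

Because both prompts share the template and the entity count, a WER gap between them is harder to attribute to prompt length or structure alone.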

Circularity Check

0 steps flagged

No circularity: benchmark paper with no derivations or self-referential predictions

The paper introduces IndicContextEval as a new benchmark dataset and 7-level prompting framework for evaluating AudioLLMs on context utilization across Indic languages. No equations, fitted parameters, or mathematical derivations are present. The central claims rest on empirical evaluation of five models on the new data, with performance differences reported directly from those runs rather than reduced to inputs by construction. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results occur. This is a standard benchmark contribution that is self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Benchmark paper with no mathematical content; no free parameters, axioms, or invented entities are introduced or required.

pith-pipeline@v0.9.1-grok · 5695 in / 1118 out tokens · 17722 ms · 2026-06-26T19:13:00.284954+00:00 · methodology

Reference graph

Works this paper leans on

42 extracted references · 2 linked inside Pith

[1]

Introduction. Automatic speech recognition systems are increasingly deployed in applications where contextual information is available at inference time. For example, meeting transcription systems may know the meeting topic, medical dictation systems have access to domain terminology, and voice assistants often maintain user-specific entity lists. Su...
[2]

arXiv:2606.19157v2 [eess.AS] 24 Jun 2026

Related Work. Contextual Biasing in ASR. Contextual information such as domain terminology or user-specific entities can significantly improve speech recognition. Early approaches incorporated such information through language model fusion, whether at decoding time [1] or by jointly training the sequence model with a...

Pith/arXiv arXiv 2026
[3]

To support this goal, the dataset satisfies 4 criteria

The IndicContextEval Benchmark. Design Goals: The benchmark is designed to enable controlled evaluation of contextual grounding in AudioLLMs. To support this goal, the dataset satisfies 4 criteria. First, it contains natural speech across 8 Indian languages, covering diverse scripts and linguistic structures. Second, recordings span 23 professional domains,...
[4]

Experimental Setup. 4.1. Models evaluated. We evaluate 5 models on our benchmark, selecting leading proprietary and open-weight AudioLLMs that claim support for all 8 languages in our dataset, alongside a strong standalone ASR baseline
[5]

Since it requires the target language as input, it cannot operate at L0 and is evaluated at L1 as a competitive non-LLM reference

Standalone ASR baseline (evaluated at L1 only): We evaluate IndicConformer [30], a 600M-parameter multilingual Conformer trained on 22 Indian languages. Since it requires the target language as input, it cannot operate at L0 and is evaluated at L1 as a competitive non-LLM reference
[6]

We do not include models such as Qwen3-Omni [10] and Voxtral [11] because their official documentation does not claim support for all eight Indian languages evaluated in this work

AudioLLMs (evaluated at all seven levels, L0–L6): We evaluate GPT-4o Transcribe [6], Gemini 3 Flash [7], Sarvam Audio [8] for commercial models and select Gemma-3N [9] (8B-E4B) for open weight models. We do not include models such as Qwen3-Omni [10] and Voxtral [11] because their official documentation does not claim support for all eight Indian languages evalu...
[7]

Baseline performance Table 2 reports the average WER at L1 (language prompt) for all models

Results. 5.1. Baseline performance. Table 2 reports the average WER at L1 (language prompt) for all models. Sarvam Audio achieves the lowest WER, followed by

Table 3: WER (%) by prompt level.

| Model | L0 | L1 | L2 | L3 | L4 | L5 | L6 |
|---|---|---|---|---|---|---|---|
| GPT-4o T | 29.83 | 28.61 | 28.37 | 26.08 | 27.97 | 26.04 | 28.47 |
| Gemini 3F | 24.30 | 18.90 | 19.28 | 18.39 | 19.88 | 17.46 | 19.67 |
| Sarvam | 20.39 | 16.86 | 16.78 | 16.43 | 16.80 | 15.70 | 1... |

arXiv
[8]

Conclusion. We introduced IndicContextEval, a multilingual benchmark and controlled prompt taxonomy for evaluating context utilisation in AudioLLMs. IndicContextEval spans 55.93 hours of natural speech across eight Indian languages and 23 domains, enabling systematic analysis of how different contextual signals affect transcription. Our experiments sho...
[9]

We sincerely thank the NPTEL team at IIT Madras for their invaluable assistance in reaching and engaging participants for the data collection effort

Acknowledgments. We gratefully acknowledge the support of EkStep Foundation and Nilekani Philanthropies, whose generous funding made this work possible by supporting the team, resources, and cloud infrastructure required for the project. We sincerely thank the NPTEL team at IIT Madras for their invaluable assistance in reaching and engaging participants ...
[10]

These tools assisted with improving clarity, grammar, and conciseness of the writing

Generative AI Use Disclosure. Generative AI tools were used solely for language polishing and editing during the preparation of this manuscript. These tools assisted with improving clarity, grammar, and conciseness of the writing. No generative AI system was used to generate experimental results, analyses, figures, or scientific conclusions. All technica...
[11]

Shallow-fusion end-to-end contextual biasing,

D. Zhao, T. N. Sainath, D. Rybach, P. Rondon, D. Bhatia, B. Li, and R. Pang, “Shallow-fusion end-to-end contextual biasing,” in Proc. Interspeech, 2019, pp. 1418–1422

2019
[12]

Cold fusion: Training seq2seq models together with language models,

A. Sriram, H. Jun, S. Satheesh, and A. Coates, “Cold fusion: Training seq2seq models together with language models,” in Proc. Interspeech, 2018, pp. 387–391

2018
[13]

Deep context: End-to-end contextual speech recognition,

G. Pundak, T. N. Sainath, R. Prabhavalkar, A. Kannan, and D. Zhao, “Deep context: End-to-end contextual speech recognition,” in Proc. IEEE SLT, 2018, pp. 418–425

2018
[14]

Improving contextual recognition of rare words with an alternate spelling prediction model,

J. Drexler Fox and N. Delworth, “Improving contextual recognition of rare words with an alternate spelling prediction model,” arXiv preprint arXiv:2209.01250, 2022

arXiv 2022
[15]

Improving asr contextual biasing with guided attention,

J. Tang, K. Kim, S. Shon, F. Wu, and P. Sridhar, “Improving asr contextual biasing with guided attention,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 12096–12100

2024
[16]

Gpt-4o transcribe model,

OpenAI, “Gpt-4o transcribe model,” 2024. [Online]. Available: https://developers.openai.com/api/docs/models/gpt-4o-transcribe

2024
[17]

Gemini 3,

Google DeepMind, “Gemini 3,” 2025. [Online]. Available: https://deepmind.google/models/gemini/

2025
[18]

Sarvam audio: Speech recognition beyond transcription,

Sarvam AI, “Sarvam audio: Speech recognition beyond transcription,” 2026. [Online]. Available: https://www.sarvam.ai/blogs/sarvam-audio

2026
[19]

Gemma 3n: Powerful, efficient, mobile-first ai,

Google, “Gemma 3n: Powerful, efficient, mobile-first ai,”
[20]

Available: https://developers.googleblog.com/en/introducing-gemma-3n

[Online]. Available: https://developers.googleblog.com/en/introducing-gemma-3n
[21]

Qwen3-omni technical report,

J. Xu et al., “Qwen3-omni technical report,” arXiv preprint arXiv:2509.17765, 2025

Pith/arXiv arXiv 2025
[22]

Voxtral,

A. H. Liu et al., “Voxtral,” arXiv preprint arXiv:2507.13264, 2025

arXiv 2025
[23]

Indicvoices: Towards building an inclusive multilingual speech dataset for indian languages,

T. Javed et al., “Indicvoices: Towards building an inclusive multilingual speech dataset for indian languages,” in Findings of ACL, 2024, pp. 10740–10782

2024
[24]

Common voice: A massively-multilingual speech corpus,

R. Ardila et al., “Common voice: A massively-multilingual speech corpus,” in Proc. LREC, 2020, pp. 4218–4222

2020
[25]

Fleurs: Few-shot learning evaluation of universal representations of speech,

A. Conneau, M. Ma, S. Khanuja, Y. Zhang, V. Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, “Fleurs: Few-shot learning evaluation of universal representations of speech,” in 2022 IEEE Spoken Language Technology Workshop (SLT), 2023, pp. 798–805

2023
[26]

Contextasr-bench: A massive contextual speech recognition benchmark,

H. Wang, L. Ma, D. Guo, X. Wang, L. Xie, J. Xu, and J. Lin, “Contextasr-bench: A massive contextual speech recognition benchmark,” arXiv preprint arXiv:2507.05727, 2025

arXiv 2025
[27]

Profasr-bench: A benchmark for context-conditioned asr in high-stakes professional speech,

D. B. Piskala, “Profasr-bench: A benchmark for context-conditioned asr in high-stakes professional speech,” arXiv preprint arXiv:2512.23686, 2025

arXiv 2025
[28]

Cb-whisper: Contextual biasing whisper using open-vocabulary keyword-spotting,

Y. Li et al., “Cb-whisper: Contextual biasing whisper using open-vocabulary keyword-spotting,” in Proc. LREC-COLING, 2024, pp. 2941–2946

2024
[29]

Owsm-biasing: Contextualizing open whisper-style speech models for asr with dynamic vocabulary,

Y. Sudo, Y. Fujita, A. Kojima, T. Mizumoto, and L. Liu, “Owsm-biasing: Contextualizing open whisper-style speech models for asr with dynamic vocabulary,” arXiv preprint arXiv:2506.09448, 2025

arXiv 2025
[30]

Contextual biasing to improve domain-specific custom vocabulary audio transcription without explicit fine-tuning of whisper model,

V. Lall and Y. Liu, “Contextual biasing to improve domain-specific custom vocabulary audio transcription without explicit fine-tuning of whisper model,” arXiv preprint arXiv:2410.18363, 2024

arXiv 2024
[31]

Improving rare-word recognition of whisper in zero-shot settings,

Y. Jogi, V. Aggarwal, S. S. Nair, Y. Verma, and A. Kubba, “Improving rare-word recognition of whisper in zero-shot settings,” arXiv preprint arXiv:2502.11572, 2025

arXiv 2025
[32]

Can contextual biasing remain effective with whisper and gpt-2?

G. Sun, X. Zheng, C. Zhang, and P. C. Woodland, “Can contextual biasing remain effective with whisper and gpt-2?” arXiv preprint arXiv:2306.01942, 2023

arXiv 2023
[33]

Br-asr: Efficient and scalable bias retrieval framework for contextual biasing asr in speech llm,

X. Gong, A. Lv, Z. Wang, H. Zhu, and Y. Qian, “Br-asr: Efficient and scalable bias retrieval framework for contextual biasing asr in speech llm,” arXiv preprint arXiv:2505.19179, 2025

arXiv 2025
[34]

Contextual biasing for llm-based asr with hotword retrieval and reinforcement learning,

Y. Kong, J. Hou, J. Tang, B. Zhu, J. Zhang, and S. Xue, “Contextual biasing for llm-based asr with hotword retrieval and reinforcement learning,” arXiv preprint arXiv:2512.21828, 2025

arXiv 2025
[35]

Lightweight prompt biasing for contextualized end-to-end asr systems,

B. Ren, Y. Shi, and J. Li, “Lightweight prompt biasing for contextualized end-to-end asr systems,” arXiv preprint arXiv:2506.06252, 2025

arXiv 2025
[36]

Prompting large language models for zero-shot domain adaptation in speech recognition,

Y. Li, Y. Wu, J. Li, and S. Liu, “Prompting large language models for zero-shot domain adaptation in speech recognition,” arXiv preprint arXiv:2306.16007, 2023

arXiv 2023
[37]

Smile: Speech meta in-context learning for low-resource language automatic speech recognition,

M. H. Hsu and H. Y. Lee, “Smile: Speech meta in-context learning for low-resource language automatic speech recognition,” arXiv preprint arXiv:2409.10429, 2024

arXiv 2024
[38]

Do prompts really prompt? exploring the prompt understanding capability of whisper,

C.-K. Yang, K.-P. Huang, and H.-Y. Lee, “Do prompts really prompt? exploring the prompt understanding capability of whisper,” in 2024 IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 1–8

2024
[39]

Earnings-22: A practical benchmark for accents in the wild,

M. Del Rio, P. Ha, Q. McNamara, C. Miller, and S. Chandra, “Earnings-22: A practical benchmark for accents in the wild,” in Proc. Interspeech, 2022

2022
[40]

Sarvam-translate,

Sarvam AI, “Sarvam-translate,” https://huggingface.co/sarvamai/sarvam-translate, 2025, Hugging Face model repository

2025
[41]

Indicconformer: Multilingual asr model for 22 indian languages,

AI4Bharat, “Indicconformer: Multilingual asr model for 22 indian languages,” 2024. [Online]. Available: https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual

2024
[42]

Indicnlpsuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for indian languages,

D. Kakwani, A. Kunchukuttan, S. Golla et al., “Indicnlpsuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for indian languages,” in Findings of EMNLP, 2020, pp. 4948–4961

2020
