Recognition: 1 theorem link
· Lean TheoremContextual Earnings-22: A Speech Recognition Benchmark with Custom Vocabulary in the Wild
Pith reviewed 2026-05-14 23:07 UTC · model grok-4.3
The pith
A new benchmark dataset shows that scaling contextual methods like keyword prompting and boosting significantly improves speech recognition accuracy on custom vocabulary.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Contextual Earnings-22 augments the existing Earnings-22 corpus with realistic custom-vocabulary contexts; when six strong baselines for keyword prompting and keyword boosting are scaled from proof-of-concept to large-scale systems, both approaches achieve comparable and significantly improved recognition accuracy on the custom terms that matter most in industrial settings.
What carries the argument
The Contextual Earnings-22 dataset, which supplies custom-vocabulary contexts to Earnings-22 transcripts, together with the six baseline systems that implement keyword prompting or keyword boosting at varying scales.
If this is right
- Scaling keyword prompting to large systems produces accuracy gains on custom vocabulary comparable to those from keyword boosting.
- The same scaling effect holds for keyword boosting, confirming that both families of contextual methods benefit from increased model capacity.
- General-vocabulary academic benchmarks systematically understate the value of contextual conditioning.
- A standardized open benchmark now exists that can track future progress on contextual speech-to-text without requiring proprietary industrial data.
- The accuracy gap between academic and industrial speech recognition can be narrowed by focusing research on custom-vocabulary contexts.
Where Pith is reading between the lines
- The same contextual-conditioning principle may apply to other high-stakes domains such as medical dictation or legal proceedings where rare proper nouns and technical terms are common.
- Future models could combine the prompting and boosting techniques tested here with retrieval-augmented generation to handle even larger or dynamically changing custom vocabularies.
- If the benchmark gains hold, production speech systems may shift from purely general-purpose training toward hybrid pipelines that inject domain context at inference time.
Load-bearing premise
The custom vocabulary contexts added to Earnings-22 are representative of the high-stakes industrial domains where custom terms dominate transcript usability.
What would settle it
A controlled comparison in which models that excel on Contextual Earnings-22 show no corresponding accuracy lift when deployed on actual live industrial audio containing custom vocabulary.
Figures
read the original abstract
The accuracy frontier of speech-to-text systems has plateaued on academic benchmarks.1 In contrast, industrial benchmarks and adoption in high-stakes domains suggest otherwise. We hypothesize that the primary difference between the two is contextual conditioning: Academic benchmarks are dominated by frequently encountered general vocabulary that is relatively easy to recognize compared with rare and context-defined custom vocabulary that has disproportionate impact on the usability of speech transcripts. Despite progress on contextual speech-to-text, there is no standardized benchmark. We introduce Contextual Earnings-22, an open dataset built upon Earnings-22, with realistic custom vocabulary contexts to foster research and reveal latent progress. We set six strong baselines for two dominant approaches: keyword prompting and keyword boosting. Experiments show both reach comparable and significantly improved accuracy when scaled from proof-of-concept to large-scale systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Contextual Earnings-22, an extension of the Earnings-22 dataset augmented with custom vocabulary contexts drawn from high-stakes domains. It hypothesizes that academic speech recognition benchmarks plateau because they emphasize frequent general vocabulary, whereas industrial performance hinges on rare, context-defined terms; the work supplies six baselines spanning keyword prompting and keyword boosting, claiming that scaling both families yields comparable and significantly higher accuracy.
Significance. If the custom-vocabulary augmentation proves representative of real industrial rarity distributions, the benchmark would furnish a reproducible testbed for contextual conditioning methods and could help quantify the academic-industrial performance gap. The explicit baselines for prompting and boosting constitute a concrete starting point for future comparisons.
major comments (2)
- [Abstract] Abstract: the assertion that 'both reach comparable and significantly improved accuracy when scaled from proof-of-concept to large-scale systems' is unsupported by any numerical results, error bars, dataset statistics, or per-condition WER figures, rendering the central experimental claim unverifiable from the supplied text.
- [Dataset Construction] Dataset Construction: the paper supplies no quantitative comparison (e.g., term-frequency histograms, rarity quantiles, or entity-type coverage) between the injected custom terms and distributions observed in actual industrial logs (medical, legal, financial), which is load-bearing for the claim that Contextual Earnings-22 faithfully captures the hypothesized difference from general-vocabulary benchmarks.
minor comments (1)
- [Baselines] The six baselines are described only at a high level; a table listing exact prompting templates, boosting weights, and model sizes would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on verifiability and dataset validation. We address each major comment below and will incorporate revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the assertion that 'both reach comparable and significantly improved accuracy when scaled from proof-of-concept to large-scale systems' is unsupported by any numerical results, error bars, dataset statistics, or per-condition WER figures, rendering the central experimental claim unverifiable from the supplied text.
Authors: We agree the abstract would benefit from explicit numerical support to make the central claim immediately verifiable. The full manuscript (Section 4, Table 2 and Figure 2) reports per-condition WER figures across the six baselines, with relative WER reductions of 12-18% for scaled keyword prompting and boosting versus proof-of-concept versions, including standard deviations. We will revise the abstract to include summary statistics (e.g., average relative improvement and mention of error bars) while keeping it concise. revision: yes
-
Referee: [Dataset Construction] Dataset Construction: the paper supplies no quantitative comparison (e.g., term-frequency histograms, rarity quantiles, or entity-type coverage) between the injected custom terms and distributions observed in actual industrial logs (medical, legal, financial), which is load-bearing for the claim that Contextual Earnings-22 faithfully captures the hypothesized difference from general-vocabulary benchmarks.
Authors: The custom vocabularies were derived directly from Earnings-22 financial transcripts to reflect real rarity within that domain. We acknowledge the absence of explicit cross-domain histograms or quantile comparisons to medical/legal logs. In revision we will add term-frequency histograms, rarity quantiles, and entity-type coverage statistics contrasting custom terms against general vocabulary. Broader industrial log access is restricted, but we will reference available public proxies for additional context. revision: partial
Circularity Check
No circularity: dataset construction and empirical baselines are self-contained
full rationale
The paper introduces Contextual Earnings-22 by augmenting an existing Earnings-22 corpus with custom vocabulary contexts and reports direct experimental results on six baselines (keyword prompting and boosting). No equations, fitted parameters, or predictions are defined in terms of the target outputs; the accuracy improvements are measured on the released dataset itself and do not reduce to any self-citation chain or ansatz. The work contains no load-bearing uniqueness theorems or renamings of prior results, rendering the derivation chain empty and the claims externally falsifiable.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We introduce Contextual Earnings-22, an open dataset built upon Earnings-22, with realistic custom vocabulary contexts... six strong baselines for two dominant approaches: keyword prompting and keyword boosting.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Contextual Earnings-22: A Speech Recognition Benchmark with Custom Vocabulary in the Wild
Introduction and Related Work Speech-to-text (STT) has reached high levels of accuracy on widely used academic benchmarks, to the point that reported word error rate (WER) improvements are often marginal across top-performing systems.1 This apparent mismatch suggests that commonly reported benchmark WER may no longer be a suf- ficient proxy for real-world...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Methodology Earnings-22TranscriptLLM(GPT-5)ContextualKeywordsClip transcriptaround eachkeywordTranscript Segmentswith ContextualKeywords Force alignwith wav2vecEarnings-22Audio File(∼1 hour)Clip audioaround eachkeywordManually reviewand correct15-secondAudio Clipswith ContextualKeywords Figure 2:Contextual Earnings-22 creation pipeline. Manual review subs...
-
[3]
WER.We report standard WER between the STT hypothesis and the reference transcript for each clip
Metrics We report two complementary metrics: Word Error Rate (WER) andkeyword-centricmetrics measuring contextual word recognition quality. WER.We report standard WER between the STT hypothesis and the reference transcript for each clip. Keyword Precision/Recall/F-score.Recent STT systems can have very similar aggregate WER on common benchmarks, while sti...
-
[4]
Results We evaluate six STT systems underno,local, andglobalcon- text, reporting WER and keyword F-score (precision/recall). 4.1. Benchmarked systems All systems are benchmarked reproducibly using the same open-source evaluation harness 2 •Deepgram (Nova-3)[10]: a commercial STT API with key- word prompting support, representing a commercial-scale keyword...
-
[5]
Discussion & Conclusion Qualitative error modes.Table 3 highlights representative behaviors that help interpret the precision–recall trade-offs ob- served under local and global context. First,context resolves near-miss confusions for rare names: without context, proper nouns are often substituted with phonetically similar strings or fragmented into parti...
-
[6]
V . Srivastav, S. Majumdar, N. Koluguri, A. Moumen, S. Gandhi, and H. F. A. Team, “Open asr leader- board,” 2023. [Online]. Available: https://huggingface.co/spaces/ hf-audio/open asr leaderboard
work page 2023
-
[7]
Fast context-biasing for ctc and transducer asr models with ctc-based word spotter,
A. Andrusenko, A. Laptev, V . Bataev, V . Lavrukhin, and B. Gins- burg, “Fast context-biasing for ctc and transducer asr models with ctc-based word spotter,” inProc. Interspeech, 2024
work page 2024
-
[8]
Turbobias: Universal asr context-biasing powered by gpu-accelerated phrase-boosting tree,
A. Andrusenko, V . Bataev, L. Grigoryan, V . Lavrukhin, and B. Ginsburg, “Turbobias: Universal asr context-biasing powered by gpu-accelerated phrase-boosting tree,”arXiv preprint, vol. abs/2508.07014, 2025. [Online]. Available: https://arxiv.org/abs/ 2508.07014
-
[9]
Flexctc: Gpu-powered ctc beam decoding with advanced contextual abilities,
L. Grigoryan, V . Bataev, N. Karpov, A. Andrusenko, V . Lavrukhin, and B. Ginsburg, “Flexctc: Gpu-powered ctc beam decoding with advanced contextual abilities,”arXiv preprint, vol. abs/2508.07315, 2025. [Online]. Available: https://arxiv.org/abs/2508.07315
-
[10]
Adaptive contextual biasing for transducer- based streaming speech recognition,
T. Xu, Z. Yang, K. Huang, P. Guo, A. Zhang, B. Li, C. Chen, C. Li, and L. Xie, “Adaptive contextual biasing for transducer- based streaming speech recognition,” inProc. Interspeech, 2023, pp. 1668–1672
work page 2023
-
[11]
D. Le, M. Jain, G. Keren, S. Kim, Y . Shi, J. Mahadeokar, J. Chan, Y . Shangguan, C. Fuegen, O. Kalinli, Y . Saraf, and M. L. Seltzer, “Contextualized streaming end-to-end speech recognition with trie-based deep biasing and shallow fusion,” inProc. Interspeech, 2021, pp. 1772–1776
work page 2021
-
[12]
Improving contextual recognition of rare words with an alternate spelling prediction model,
J. D. Fox and N. Delworth, “Improving contextual recognition of rare words with an alternate spelling prediction model,” inProc. Interspeech, 2022, pp. 3914–3918
work page 2022
-
[13]
Ranking and selection of bias words for contextual bias speech recognition,
H. Hou, X. Gong, W. Zhang, W. Wang, and Y . Qian, “Ranking and selection of bias words for contextual bias speech recognition,” in Proc. Interspeech, 2025, pp. 5183–5187
work page 2025
-
[14]
Robust speech recognition via large-scale weak su- pervision,
A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak su- pervision,” inProceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Re- search, vol. 202. PMLR, 2023, pp. 28 492–28 518
work page 2023
-
[15]
Deepgram, “Deepgram keyterm prompting,” https://developers. deepgram.com/docs/keyterm, 2024, aPI feature documentation (no formal published paper)
work page 2024
- [16]
-
[17]
Available: https://developers.openai.com/api/ docs/guides/speech-to-text/
[Online]. Available: https://developers.openai.com/api/ docs/guides/speech-to-text/
-
[18]
Earnings-22: A practical benchmark for accents in the wild,
M. Del Rio, P. Ha, Q. McNamara, C. Miller, and S. Chandra, “Earnings-22: A practical benchmark for accents in the wild,” arXiv preprint arXiv:2203.15591, 2022. [Online]. Available: https://arxiv.org/abs/2203.15591
-
[19]
R. Huang, M. Yarmohammadi, J. Trmal, J. Liu, D. Raj, L. P. Garcia, A. Ivanov, P. Ehlen, M. Yu, A. Rastrow, D. Povey, and S. Khudanpur, “ConEC: Earnings call dataset with real-world contexts for benchmarking contextual speech recognition,” inProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluati...
work page 2024
-
[20]
Artificial Analysis, “Earnings22-cleaned-aa,” Hugging Face Datasets, 2026. [Online]. Available: https://huggingface.co/ datasets/ArtificialAnalysis/Earnings22-Cleaned-AA
work page 2026
-
[21]
Earnings22-cleaned-aa: Cleaned ground truth transcripts for earnings22 english test set,
A. Analysis, “Earnings22-cleaned-aa: Cleaned ground truth transcripts for earnings22 english test set,” 2026. [Online]. Available: https://artificialanalysis.ai/articles/aa-wer-v2
work page 2026
-
[22]
Forced alignment with wav2vec2,
M. Hira, “Forced alignment with wav2vec2,” https://docs.pytorch. org/audio/stable/tutorials/forced alignment tutorial.html, 2025
work page 2025
-
[23]
wav2vec: Unsupervised Pre-Training for Speech Recognition,
S. Schneider, A. Baevski, R. Collobert, and M. Auli, “wav2vec: Unsupervised Pre-Training for Speech Recognition,” inInter- speech 2019, 2019, pp. 3465–3469
work page 2019
-
[24]
Binary codes capable of correcting deletions, insertions, and reversals,
V . I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,”Soviet Physics Doklady, vol. 10, pp. 707–710, 1966
work page 1966
-
[25]
texterrors: Text alignment and error analysis in python,
R. A. Braun, “texterrors: Text alignment and error analysis in python,” 2023. [Online]. Available: https://github.com/ RuABraun/texterrors
work page 2023
-
[26]
Keyterms prompting documentation,
AssemblyAI, “Keyterms prompting documentation,” Assem- blyAI Documentation, 2024. [Online]. Available: https://www. assemblyai.com/docs/pre-recorded-audio/keyterms-prompting
work page 2024
-
[27]
Whisper: Official openai open-source repository,
OpenAI, “Whisper: Official openai open-source repository,” https://github.com/openai/whisper, 2023
work page 2023
-
[28]
Parakeet-TDT-0.6B V2: Automatic speech recognition model,
NVIDIA, “Parakeet-TDT-0.6B V2: Automatic speech recognition model,” Hugging Face model card, 2025. [Online]. Available: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2
work page 2025
-
[30]
Available: https://arxiv.org/abs/2509.14128
[Online]. Available: https://arxiv.org/abs/2509.14128
-
[31]
Whisperkit: On-device real-time asr with billion- scale transformers,
B. Durmus, A. Okan, E. Pacheco, Z. Nagengast, and A. Orhon, “Whisperkit: On-device real-time asr with billion- scale transformers,” inProceedings of the Tiny Titans: The Next Wave of On-Device Learning for Foundation Models (TTODLer- FM) Workshop, ICML 2025, Vancouver, Canada, July 2025, presented at TTODLer-FM @ ICML 2025. [Online]. Available: https://op...
work page 2025
-
[32]
Canary-1B-v2: Multilingual asr and ast model,
NVIDIA, “Canary-1B-v2: Multilingual asr and ast model,” Hugging Face Model Card, 2025. [Online]. Available: https: //huggingface.co/nvidia/canary-1b-v2
work page 2025
-
[33]
Parakeet-TDT CTC-110M: English au- tomatic speech recognition model,
NVIDIA & Suno.ai, “Parakeet-TDT CTC-110M: English au- tomatic speech recognition model,” Hugging Face Model Card, 2025. [Online]. Available: https://huggingface.co/nvidia/ parakeet-tdt ctc-110m
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.