pith. machine review for the scientific record.

arxiv: 2604.10736 · v2 · submitted 2026-04-12 · 💻 cs.CL · cs.SD

Recognition: unknown

BlasBench: An Open Benchmark for Irish Speech Recognition

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:50 UTC · model grok-4.3

classification 💻 cs.CL cs.SD
keywords Irish ASR · speech recognition benchmark · text normalization · low-resource languages · generalization gap · Whisper hallucination · multilingual models

The pith

BlasBench supplies an Irish-aware normalizer and scoring harness that makes ASR comparisons for the language reliable and exposes a large generalization gap between fine-tuned and multilingual models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing multilingual ASR benchmarks treat Irish like any other language and apply generic text normalization that ignores fadas, lenition, and eclipsis. This paper releases BlasBench, a standalone open harness containing an Irish-specific normalizer plus reproducible scoring code and per-utterance outputs. When twelve systems are evaluated on both Common Voice ga-IE and FLEURS ga-IE, fine-tuned models degrade by 33 to 43 WER points moving from the first corpus to the second, while massively multilingual models degrade by only 7 to 10 points. The result shows that single-dataset leaderboards hide robustness failures that matter for low-resource languages.
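
The gap itself is plain arithmetic over per-corpus scores; a minimal sketch using only the two systems whose numbers appear in the abstract (the dictionary layout is ours, not BlasBench's released format):

```python
# WER (%) per corpus for the two systems quoted in the abstract;
# the data layout here is illustrative, not the harness's output schema.
scores = {
    "Microsoft Azure":    {"common_voice": 22.20, "fleurs": 57.50},
    "Omnilingual ASR 7B": {"common_voice": 30.65, "fleurs": 39.09},
}

for model, s in scores.items():
    gap = s["fleurs"] - s["common_voice"]
    print(f"{model}: +{gap:.2f} WER points on FLEURS")
# Omnilingual's ~8.4-point gap falls inside the 7-10 range the abstract
# reports for massively multilingual models; Azure jumps ~35.3 points.
```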

Core claim

BlasBench demonstrates that an Irish-aware normalizer preserving fadas, lenition, and eclipsis is required for valid ASR evaluation; with it in place, Whisper variants exceed 100 percent WER through insertion hallucination, Microsoft Azure reaches 22.2 percent WER on Common Voice and 57.5 percent on FLEURS, the best open model, Omnilingual ASR 7B, reaches 30.65 percent and 39.09 percent respectively, and fine-tuned systems degrade far more than multilingual ones when moving between the two corpora.

What carries the argument

BlasBench, an open evaluation harness built around a standalone Irish-aware normalizer that preserves fadas, lenition, and eclipsis and supplies reproducible scoring with released per-utterance predictions.
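
A minimal sketch of what such a normalizer has to get right, written by us for illustration; the function name and exact rule choices are assumptions, not the released BlasBench code:

```python
# Illustrative Irish-aware normalization; names and rules are ours,
# not the BlasBench release.
import re
import unicodedata

def normalize_irish(text: str) -> str:
    """Lowercase and strip punctuation while keeping fadas and the
    written traces of lenition ('bh', 'ch', ...) and eclipsis
    ('bhf', 'gc', 'n-' before vowels, ...)."""
    # NFC keeps each fada as one precomposed code point; a generic
    # ASCII-folding normalizer (NFKD + strip combining marks) would
    # turn 'Seán' into 'Sean' and merge real minimal pairs.
    text = unicodedata.normalize("NFC", text).lower()
    # Keep the hyphen: dropping it would corrupt eclipsed vowels
    # ('n-athair') and t-prefixation ('an t-uisce').
    text = re.sub(r"[^\wáéíóú\s-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

# Mutations are plain letter sequences once diacritics survive, so
# preserving them costs nothing; destroying fadas is the failure mode.
print(normalize_irish("Tá an t-uisce fuar; bhí sé i nGaillimh."))
# -> 'tá an t-uisce fuar bhí sé i ngaillimh'
```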

If this is right

  • Single-dataset leaderboards for Irish ASR become unreliable and must be replaced by multi-corpus evaluation.
  • Fine-tuning on one Irish resource produces brittle models that fail on new domains or recording conditions.
  • Massively multilingual pre-training confers measurable robustness advantages for Irish that single-language fine-tuning does not.
  • Hallucination-driven insertions in Whisper models render their WER scores uninterpretable without an Irish-aware normalizer (a toy illustration follows below).
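
To see why insertions alone can push WER past 100 percent: WER = (S + D + I) / N, where N is the number of reference words, so the insertion count I is unbounded. A toy check with the jiwer library (the sentences are invented by us, not drawn from the paper, and we assume jiwer 3's process_words API):

```python
import jiwer

reference = "tá an lá go breá"  # 5 reference words
hypothesis = "tá an lá go breá agus tá an lá go breá arís agus arís eile"

out = jiwer.process_words(reference, hypothesis)
print(f"WER = {out.wer:.2f}")  # 1.80: nine insertions over five reference words
print(out.insertions, out.substitutions, out.deletions)  # 9 0 0
```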

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar orthography-preserving normalizers may be needed for other languages with diacritics or mutation rules before their ASR benchmarks can be trusted.
  • The generalization gap observed here suggests that low-resource language evaluation should routinely include at least two independent test sets drawn from different sources.
  • Releasing both the normalizer code and the raw predictions allows future researchers to test new models or normalizer variants without re-collecting data.

Load-bearing premise

The custom normalizer fully captures all relevant Irish orthographic rules and the two chosen corpora are representative enough to support claims about generalization.

What would settle it

A manual audit of the normalizer output on held-out Irish text that reveals systematic errors in handling eclipsis or lenition, or a third Irish speech corpus on which the reported 33-43 point degradation for fine-tuned models disappears.

Figures

Figures reproduced from arXiv: 2604.10736 by John Conway, Jyoutir Raj.

Figure 1: BlasBench evaluation pipeline. The Irish nor… (caption truncated; image not reproduced here; view at source ↗)
Original abstract

Existing multilingual benchmarks include Irish among dozens of languages but apply no Irish-aware text normalisation, leaving reliable and reproducible ASR comparison impossible. We introduce BlasBench, an open evaluation harness that provides a standalone Irish-aware normaliser preserving fadas, lenition, and eclipsis; a reproducible scoring harness and per-utterance predictions released for all evaluated runs. We pilot this by benchmarking 12 systems across four architecture families on Common Voice ga-IE and FLEURS ga-IE. All Whisper variants exceed 100% WER through insertion-driven hallucination. Microsoft Azure reaches 22.2% WER on Common Voice and 57.5% on FLEURS; the best open model, Omnilingual ASR 7B, reaches 30.65% and 39.09% respectively. Models fine-tuned on Common Voice degrade 33-43 points moving to FLEURS, while massively multilingual models degrade only 7-10 - a generalisation gap that single-dataset evaluation misses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces BlasBench, an open evaluation harness for Irish ASR that supplies a standalone Irish-aware text normalizer preserving fadas, lenition, and eclipsis, together with a reproducible scoring pipeline and released per-utterance predictions. It benchmarks 12 systems across four architecture families on Common Voice ga-IE and FLEURS ga-IE, reporting concrete WERs (Azure 22.2 % / 57.5 %, best open model Omnilingual ASR 7B at 30.65 % / 39.09 %) and a generalization gap in which fine-tuned models degrade 33–43 points while massively multilingual models degrade only 7–10 points; all Whisper variants exceed 100 % WER via insertion-driven hallucination.

Significance. If the normalizer is shown to be correct and complete, BlasBench would supply a much-needed reproducible resource for Irish ASR that current multilingual benchmarks lack. The public release of the harness, normalizer code, and per-utterance predictions is a clear strength that directly supports the reproducibility claim. The reported generalization gap supplies a concrete, falsifiable observation that single-dataset evaluation misses and that future low-resource ASR work can test.

major comments (2)
  1. [§3] §3 (Normalizer): the manuscript presents the custom normalizer as the load-bearing component that makes all WER numbers reliable and reproducible, yet supplies neither an exhaustive rule list, concrete edge-case examples (dialectal mutations, compound-word handling, punctuation interactions), nor any quantitative validation (e.g., agreement with native-speaker gold normalizations or inter-annotator metrics). Without this, the central claim that observed WER differences reflect model behavior rather than normalization artifacts cannot be evaluated.
  2. [§4–5] §4–5 (Results): the assertion that Whisper variants exceed 100 % WER “through insertion-driven hallucination” is stated without error analysis, utterance-level examples, or breakdown of insertion versus substitution rates. This detail is required to substantiate the architectural-family comparison and to allow readers to judge whether the failure mode is systematic or dataset-specific.
minor comments (2)
  1. [Abstract] The abstract states “four architecture families” but does not enumerate them; the methods section should list the families explicitly for immediate readability.
  2. [Results table] Table 1 (or equivalent results table) reports point WER estimates without confidence intervals or statistical tests; adding these would strengthen the generalization-gap claim (an utterance-level bootstrap sketch follows).
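
One standard route to such intervals is the utterance-level bootstrap of Bisani and Ney [3]. A minimal sketch, assuming per-utterance error counts and reference lengths taken from the released predictions; the function and variable names are ours, not the harness's API:

```python
# Utterance-level bootstrap CI for corpus WER, after Bisani & Ney [3].
import random

def bootstrap_wer_ci(errors, ref_lens, n_resamples=10_000, alpha=0.05):
    """errors[i] = S+D+I for utterance i; ref_lens[i] = reference words."""
    n = len(errors)
    wers = sorted(
        sum(errors[i] for i in idx) / sum(ref_lens[i] for i in idx)
        for idx in (random.choices(range(n), k=n) for _ in range(n_resamples))
    )
    return (wers[int(alpha / 2 * n_resamples)],
            wers[int((1 - alpha / 2) * n_resamples) - 1])
```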

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating the specific revisions we will incorporate to strengthen the presentation of the normalizer and the error analysis.

Point-by-point responses
  1. Referee: [§3] §3 (Normalizer): the manuscript presents the custom normalizer as the load-bearing component that makes all WER numbers reliable and reproducible, yet supplies neither an exhaustive rule list, concrete edge-case examples (dialectal mutations, compound-word handling, punctuation interactions), nor any quantitative validation (e.g., agreement with native-speaker gold normalizations or inter-annotator metrics). Without this, the central claim that observed WER differences reflect model behavior rather than normalization artifacts cannot be evaluated.

    Authors: We agree that the current description of the normalizer in §3 is insufficient for full reproducibility and independent verification. In the revised manuscript we will expand this section to provide an exhaustive enumerated list of all normalization rules, multiple concrete examples addressing dialectal mutations, compound-word handling, and punctuation interactions, and quantitative validation results including agreement metrics with native-speaker gold normalizations and inter-annotator agreement scores. These additions will allow readers to confirm that the reported WER differences arise from model behavior rather than normalization artifacts. revision: yes

  2. Referee: [§4–5] §4–5 (Results): the assertion that Whisper variants exceed 100 % WER “through insertion-driven hallucination” is stated without error analysis, utterance-level examples, or breakdown of insertion versus substitution rates. This detail is required to substantiate the architectural-family comparison and to allow readers to judge whether the failure mode is systematic or dataset-specific.

    Authors: We acknowledge that the manuscript currently states the >100 % WER observation for Whisper variants without supporting error analysis. In the revised version we will add a dedicated error-analysis subsection (or appendix) that reports per-model insertion, substitution, and deletion rates, supplies representative utterance-level examples of the insertion-driven hallucinations, and discusses whether the pattern appears systematic across both Common Voice and FLEURS. This will strengthen the architectural-family comparison and enable readers to assess the generality of the failure mode. revision: yes
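
The promised breakdown is mechanical once the per-utterance predictions are public. A sketch of how a reader could compute it, using jiwer as the alignment tool (our choice, not necessarily the authors'):

```python
# Insertion/substitution/deletion breakdown over released
# (reference, hypothesis) pairs; jiwer is assumed, not confirmed.
import jiwer

def error_breakdown(pairs):
    refs = [ref for ref, _ in pairs]
    hyps = [hyp for _, hyp in pairs]
    out = jiwer.process_words(refs, hyps)
    n_ref = out.hits + out.substitutions + out.deletions  # reference words
    return {
        "wer": out.wer,
        "insertion_rate": out.insertions / n_ref,
        "substitution_rate": out.substitutions / n_ref,
        "deletion_rate": out.deletions / n_ref,
    }

# An insertion rate dominating the other two would support the
# hallucination reading of Whisper's >100% WER; a high substitution
# rate instead would point to ordinary recognition error.
```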

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark with independent components

Full rationale

The paper presents an open evaluation harness, a custom normalizer, and empirical WER results on public datasets (Common Voice ga-IE, FLEURS ga-IE). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The normalizer is introduced as a standalone contribution rather than derived from prior results by the same authors. All reported numbers are direct measurements, not reductions to inputs by construction. This matches the default expectation for non-circular empirical benchmark papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the standard WER metric and the assumption that the two public Irish datasets are suitable test beds; no new free parameters, axioms beyond standard ASR evaluation practice, or invented entities are introduced.

axioms (1)
  • domain assumption: Word error rate remains a valid primary metric once text is properly normalized for Irish orthography.
    Invoked when reporting all WER figures and degradation gaps.

pith-pipeline@v0.9.0 · 5464 in / 1253 out tokens · 39196 ms · 2026-05-10T15:50:33.685650+00:00 · methodology


Reference graph

Works this paper leans on

34 extracted references · 25 canonical work pages · 1 internal anchor

  1. [1]

    WER We Stand: Benchmarking Urdu ASR Models

    Samee Arif, Aamina Jamal Khan, Mustafa Abbas, Agha Ali Raza, and Awais Athar. WER We Stand: Benchmarking Urdu ASR Models. In Proc. COLING, pages 5952--5961, 2025. arXiv:2409.11252

  2. [2]

    Claude Opus 4.6 (Models overview)

    Anthropic. Claude Opus 4.6 (Models overview). https://docs.anthropic.com/en/docs/about-claude/models, 2026

  3. [3]

    Bootstrap estimates for confidence intervals in ASR performance evaluation

    Maximilian Bisani and Hermann Ney. Bootstrap estimates for confidence intervals in ASR performance evaluation. In Proc. ICASSP, 2004

  4. [4]

    Common Voice: A Massively-Multilingual Speech Corpus

    Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. Common Voice: A Massively-Multilingual Speech Corpus. In Proc. LREC, 2020. arXiv:1912.06670

  5. [5]

    How I Built ASR for Endangered Languages with a Spoken Dictionary

    Christopher Bartley and Anton Ragni. How I Built ASR for Endangered Languages with a Spoken Dictionary. arXiv:2510.04832, 2025

  6. [6]

    gaBERT -- an Irish Language Model

    James Barry, Joachim Wagner, Lauren Cassidy, Alan Cowap, Teresa Lynn, Abigail Walsh, Mícheál J. Ó Meachair, and Jennifer Foster. gaBERT -- an Irish Language Model. In Proc. LREC, pages 4774--4788, 2022. arXiv:2107.12930

  7. [7]

    FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech

    Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech. In Proc. SLT, 2022. arXiv:2205.12446

  8. [8]

    XTREME-S: Evaluating Cross-lingual Speech Representations

    Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan van Esch, Vera Axelrod, Simran Khanuja, Jonathan H. Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, and Melvin Johnson. XTREME-S: Evaluating Cross-lingual Speech Representations. In Proc. Interspeech, 2022...

  9. [9]

    Wav2Vec 2.0 for Irish ASR: A Multilingual Approach to Under-Resourced Languages

    S. Faste. Wav2Vec 2.0 for Irish ASR: A Multilingual Approach to Under-Resourced Languages. MSc thesis, University of Groningen, Campus Fryslân, 2022. https://campus-fryslan.studenttheses.ub.rug.nl/234/

  10. [10]

    Some Statistical Issues in the Comparison of Speech Recognition Algorithms

    Larry Gillick and Stephen J. Cox. Some Statistical Issues in the Comparison of Speech Recognition Algorithms. In Proc. ICASSP, pages 532--535, 1989

  11. [11]

    Development and Evaluation of Speech Recognition for the Welsh Language

    Dewi Jones. Development and Evaluation of Speech Recognition for the Welsh Language. In Proc. CLTW, 2022

  12. [12]

    autoresearch: AI agents running research on single-GPU nanochat training automatically

    Andrej Karpathy. autoresearch: AI agents running research on single-GPU nanochat training automatically. https://github.com/karpathy/autoresearch, 2026

  13. [13]

    Omnilingual ASR: Open-source Multilingual Speech Recognition for 1600+ Languages (arXiv:2511.09690, 2025)

    Gil Keren, Artyom Kozhevnikov, Yen Meng, Christophe Ropers, Matthew Setzler, Skyler Wang, Ife Adebara, Michael Auli, Can Balioglu, Kevin Chan, Chierh Cheng, Joe Chuang, Caley Droof, Mark Duppenthaler, Paul-Ambroise Duquenne, Alexander Erben, Cynthia Gao, Gabriel Mejia Gonzalez, Kehan Lyu, Sagar Miglani, Vineel Pratap, Kaushik Ram Sadagopan, Safiyyah Salee...

  14. [14]

    A Practitioner's Guide to Building ASR Models for Low-Resource Languages: A Case Study on Scottish Gaelic

    Ondřej Klejch, William Lamb, and Peter Bell. A Practitioner's Guide to Building ASR Models for Low-Resource Languages: A Case Study on Scottish Gaelic. In Proc. Interspeech, 2025. arXiv:2506.04915

  15. [15]

    Automatic Speech Recognition for Irish: the ABAIR-ÉIST System

    Liam Lonergan, Mengjie Qian, Harald Berthelsen, Andy Murphy, Christoph Wendler, Neasa Ní Chiaráin, Christer Gobl, and Ailbhe Ní Chasaide. Automatic Speech Recognition for Irish: the ABAIR-ÉIST System. In Proc. CLTW, pages 47--51, 2022

  16. [16]

    Cross-dialect lexicon optimisation for an endangered language ASR system: the case of Irish

    Liam Lonergan, Mengjie Qian, Neasa Ní Chiaráin, Christer Gobl, and Ailbhe Ní Chasaide. Cross-dialect lexicon optimisation for an endangered language ASR system: the case of Irish. In Proc. Interspeech, pages 4865--4869, 2022. doi:10.21437/Interspeech.2022-838

  17. [17]

    Towards Dialect-inclusive Recognition in a Low-resource Language: Are Balanced Corpora the Answer?

    Liam Lonergan, Mengjie Qian, Neasa Ní Chiaráin, Christer Gobl, and Ailbhe Ní Chasaide. Towards Dialect-inclusive Recognition in a Low-resource Language: Are Balanced Corpora the Answer? In Proc. Interspeech, pages 5082--5086, 2023. arXiv:2307.07295

  18. [18]

    Towards Spoken Dialect Identification of Irish

    Liam Lonergan, Mengjie Qian, Neasa Ní Chiaráin, Christer Gobl, and Ailbhe Ní Chasaide. Towards Spoken Dialect Identification of Irish. In Proc. SIGUL, pages 63--67, 2023. arXiv:2307.07436

  19. [19]

    Low-resource speech recognition and dialect identification of Irish in a multi-task framework

    Liam Lonergan, Mengjie Qian, Neasa Ní Chiaráin, Christer Gobl, and Ailbhe Ní Chasaide. Low-resource speech recognition and dialect identification of Irish in a multi-task framework. In Proc. Odyssey, pages 67--73, 2024. arXiv:2405.01293

  20. [20]

    Fotheidil: an Automatic Transcription System for the Irish Language

    Liam Lonergan, Ibon Saratxaga, John Sloan, Oscar Maharg Bravo, Mengjie Qian, Neasa Ní Chiaráin, Christer Gobl, and Ailbhe Ní Chasaide. Fotheidil: an Automatic Transcription System for the Irish Language. In Proc. CLTW, pages 35--45, 2025. arXiv:2501.00509

  21. [21]

    What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations

    Kavya Manohar, Leena G. Pillai, and Elizabeth Sherly. What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations. In Proc. EMNLP, 2024. arXiv:2409.02449

  22. [22]

    FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks

    Min Ma, Yuma Koizumi, Shigeki Karita, Heiga Zen, Jason Riesa, Haruko Ishikawa, and Michiel Bacchiani. FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks. In Proc. Interspeech, 2024. arXiv:2408.06227

  23. [23]

    Leveraging Synthetic Audio Data for End-to-End Low-Resource Speech Translation

    Yasmin Moslem. Leveraging Synthetic Audio Data for End-to-End Low-Resource Speech Translation. In Proc. IWSLT, 2024. arXiv:2406.17363

  24. [24]

    Findings of the IWSLT 2024 Evaluation Campaign

    Ibrahim Said Ahmad et al. Findings of the IWSLT 2024 Evaluation Campaign. In Proc. IWSLT, 2024. arXiv:2411.05088

  25. [25]

    OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification

    Yifan Peng, Yui Sudo, Muhammad Shakeel, and Shinji Watanabe. OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification. In Proc. ACL, 2024. arXiv:2402.12654

  26. [26]

    Scaling Speech Technology to 1,000+ Languages

    Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, and Michael Auli. Scaling Speech Technology to 1,000+ Languages. arXiv:2305.13516, 2023

  27. [27]

    Less is More: Accurate Speech Recognition & Translation without Web-Scale Data

    Krishna C. Puvvada, Piotr Żelasko, He Huang, Oleksii Hrinchuk, Nithin Rao Koluguri, Kunal Dhawan, Somshubra Majumdar, Elena Rastorgueva, Zhehuai Chen, Vitaly Lavrukhin, Jagadeesh Balam, and Boris Ginsburg. Less is More: Accurate Speech Recognition & Translation without Web-Scale Data. In Proc. Interspeech, 2024. arXiv:2406.19674

  28. [28]

    Learn and Don't Forget: Adding a New Language to ASR Foundation Models

    Mengjie Qian, Siyuan Tang, Rao Ma, Kate M. Knill, and Mark J. F. Gales. Learn and Don't Forget: Adding a New Language to ASR Foundation Models. In Proc. Interspeech, 2024. arXiv:2407.06800

  29. [29]

    Robust Speech Recognition via Large-Scale Weak Supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust Speech Recognition via Large-Scale Weak Supervision. In Proc. ICML, pages 28492--28518, 2023. arXiv:2212.04356

  30. [30]

    ML-SUPERB: Multilingual Speech Universal PERformance Benchmark

    Jiatong Shi, Dan Berrebbi, William Chen, Ho-Lam Chung, En-Pei Hu, Wei Ping Huang, Xuankai Chang, Shang-Wen Li, Abdelrahman Mohamed, Hung-yi Lee, and Shinji Watanabe. ML-SUPERB: Multilingual Speech Universal PERformance Benchmark. In Proc. Interspeech, 2023. arXiv:2305.10615

  31. [31]

    Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation

    Vaibhav Srivastav, Steven Zheng, Eric Bezzam, Eustache Le Bihan, Nithin Rao Koluguri, Piotr Żelasko, Somshubra Majumdar, Adel Moumen, and Sanchit Gandhi. Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation. arXiv:2510.06961, 2025

  32. [32]

    Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ramabhadran, Tara Sainath, Pedro Moreno, Chung-Cheng Chiu, Johan Schalkwyk, Franço...
