pith. machine review for the scientific record.

arxiv: 2604.04598 · v1 · submitted 2026-04-06 · 💻 cs.CL

Recognition: no theorem link

Benchmarking Multilingual Speech Models on Pashto: Zero-Shot ASR, Script Failure, and Cross-Domain Evaluation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:40 UTC · model grok-4.3

classification 💻 cs.CL
keywords Pashto ASR · zero-shot multilingual ASR · script fidelity · word error rate · cross-domain evaluation · low-resource languages · Whisper models · FLEURS dataset

The pith

Whisper models produce Pashto script in under 1% of cases when transcribing Pashto audio, while other models exceed 93% fidelity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper delivers the first public multi-model benchmarks for automatic speech recognition on Pashto using two read-speech test sets. It shows that zero-shot Whisper models yield word error rates from 90% to over 400% and almost never generate Pashto script, defaulting instead to Arabic script. Other models such as MMS-1B and SeamlessM4T achieve much higher script fidelity and lower error rates, demonstrating that word error rate alone cannot detect when a model has failed to perform actual ASR. Fine-tuned Pashto models suffer large accuracy drops when tested across different data sources, except for one augmented version that holds steady.

Core claim

The paper reports zero-shot Whisper word error rates ranging from 90% to 297% on FLEURS Pashto and up to 461% on Common Voice Pashto, with SeamlessM4T reaching the best zero-shot result of 39.7% on Common Voice. A language-identification audit of outputs reveals that no Whisper model produces Pashto script in more than 0.8% of utterances, whereas MMS-1B, SeamlessM4T, and OmniASR each exceed 93% Pashto-script fidelity. Published word error rates of 14% for fine-tuned models rise to 32.5-59% on out-of-distribution sets, while one augmented model maintains 35.1% on both sets. Errors concentrate on Pashto-unique phonemes such as retroflex consonants and lateral fricatives.
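A note on the arithmetic: WER divides total edit operations (substitutions + deletions + insertions) by the number of reference words, so a looping decoder that emits many spurious words can push the score far past 100%. A minimal sketch of the standard computation (plain whitespace tokenisation; the paper's exact text normalisation is not specified here):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Four insertions against a three-word reference already gives 133% WER,
# which is how a looping decoder can reach figures like 461%.
print(wer("one two three", "one two three four five six seven"))  # 4/3 ≈ 1.33
```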

What carries the argument

The language-identification audit that classifies model outputs by script to measure Pashto-script fidelity, exposing failures invisible to word error rate alone.
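The paper does not name the LID tool it used (the referee report below presses on this point), but the idea is implementable from characters alone: Pashto extends the Arabic script with letters such as ټ ډ ړ ږ ښ ڼ ۍ ې that standard Arabic orthography never uses. A hedged sketch of one such heuristic, not the authors' actual audit:

```python
# Letters specific to the Pashto alphabet (retroflex series, xin/zhe, yeh forms);
# their presence distinguishes Pashto text from plain Arabic-script output.
PASHTO_ONLY = set("ټډړږښڼۍې")
ARABIC_BLOCKS = ((0x0600, 0x06FF), (0x0750, 0x077F))  # Arabic + Arabic Supplement

def classify_script(text: str) -> str:
    """Rough per-utterance script label: 'pashto', 'arabic', or 'other'."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return "other"
    arabic = sum(any(lo <= ord(c) <= hi for lo, hi in ARABIC_BLOCKS) for c in letters)
    if arabic / len(letters) < 0.5:
        return "other"   # predominantly Latin or another script
    if any(c in PASHTO_ONLY for c in letters):
        return "pashto"  # Arabic-block text containing Pashto-specific letters
    return "arabic"      # Arabic script without any Pashto-distinctive letter
```

Counting these labels over a test set yields script-fidelity percentages of the kind the paper reports (0.8% vs. 93%+), whatever LID method the authors actually applied.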

If this is right

  • Word error rate cannot be trusted as a sole metric when the output script does not match the target language.
  • SeamlessM4T currently delivers the strongest zero-shot Pashto result among tested models on Common Voice data.
  • Most fine-tuned Pashto models lose substantial accuracy when moved to a different test set unless they incorporate augmentation.
  • Pashto retroflex and lateral fricative sounds drive a disproportionate share of recognition errors across models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Evaluation of multilingual ASR systems should routinely include script-level checks alongside word error rate to catch hidden failure modes.
  • The observed cross-domain drops suggest that read-speech-only benchmarks underestimate the robustness needed for practical Pashto applications.
  • Similar script mismatches may affect other low-resource languages that use non-Latin scripts, warranting targeted audits.

Load-bearing premise

The FLEURS Pashto test set and filtered Common Voice subset are representative enough to support general claims about zero-shot performance and cross-domain degradation for Pashto.

What would settle it

A language-identification audit on Whisper model outputs from a new Pashto audio corpus that finds more than 10% of utterances in Pashto script would contradict the reported script failure.
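Given a per-utterance script classifier like the sketch above, the settling experiment reduces to a rate and a threshold. A short illustration (whisper_transcribe and new_pashto_corpus are hypothetical placeholders):

```python
def pashto_script_rate(transcripts: list[str]) -> float:
    """Fraction of outputs classed as Pashto script (classify_script from above)."""
    labels = [classify_script(t) for t in transcripts]
    return labels.count("pashto") / max(len(labels), 1)

# transcripts = [whisper_transcribe(clip) for clip in new_pashto_corpus]
# The reported script failure is contradicted if this exceeds the 10% threshold:
# contradicted = pashto_script_rate(transcripts) > 0.10
```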

Figures

Figures reproduced from arXiv: 2604.04598 by Hanif Rahman.

Figure 1. Zero-shot WER for all ten models on FLEURS (left) …
Figure 2. Script output distribution per model on FLEURS
Figure 3. Cross-domain generalisation: FLEURS WER (…)
Figure 5. WER deviation from overall (35.4%) by character class
Original abstract

Pashto is spoken by approximately 60–80 million people but has no published benchmarks for multilingual automatic speech recognition (ASR) on any shared public test set. This paper reports the first reproducible multi-model evaluation on public Pashto data, covering zero-shot ASR, script-level failure, and cross-domain evaluation of fine-tuned models. For zero-shot ASR, ten models (all seven Whisper sizes, MMS-1B, SeamlessM4T-v2-large, and OmniASR-CTC-300M) are evaluated on the FLEURS Pashto test set and a filtered Common Voice 24 subset; zero-shot Whisper WER ranges from 90% to 297%, with the medium model collapsing to 461% on Common Voice 24 consistent with decoder looping. SeamlessM4T achieves 39.7% WER on Common Voice 24 (the best zero-shot result reported to date, as of submission); MMS-1B achieves 43.8% on FLEURS. For script failure, a language-identification audit shows that no Whisper model produces Pashto-script output in more than 0.8% of utterances, while MMS-1B, SeamlessM4T, and OmniASR each exceed 93% Pashto-script fidelity; WER alone does not reveal this failure, since a model generating Arabic-script output on Pashto audio has not achieved ASR in any interpretable sense. For cross-domain evaluation, five fine-tuned Pashto ASR models are evaluated on both test sets: published WER figures of 14% degrade to 32.5–59% on out-of-distribution sets, while one augmented model achieves 35.1% on both sets with zero cross-domain degradation. Character-class error stratification confirms that Pashto-unique phonemes (the retroflex series and lateral fricatives) account for disproportionate error mass. All evaluations cover read speech only. Five structural impediments to cumulative progress are identified and five ordered research priorities are argued.
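For readers who want to probe the zero-shot protocol, the abstract's setup can be approximated with standard open tooling. A minimal sketch assuming the HuggingFace FLEURS config ps_af and openai/whisper-small; the paper's exact decoding parameters and text normalisation are not given here, so scores will differ in detail:

```python
# pip install datasets transformers jiwer torch
from datasets import load_dataset
from transformers import pipeline
import jiwer

fleurs_ps = load_dataset("google/fleurs", "ps_af", split="test")
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

refs, hyps = [], []
for ex in fleurs_ps:
    audio = ex["audio"]
    out = asr(
        {"array": audio["array"], "sampling_rate": audio["sampling_rate"]},
        generate_kwargs={"language": "ps", "task": "transcribe"},  # force Pashto decoding
    )
    refs.append(ex["transcription"])
    hyps.append(out["text"])

print("zero-shot WER:", jiwer.wer(refs, hyps))
# Pair the WER with a script audit of hyps before treating the number as ASR quality.
```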

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper provides the first public benchmarks for Pashto ASR by evaluating ten multilingual speech models (Whisper variants, MMS-1B, SeamlessM4T, OmniASR) on the FLEURS Pashto test set and a filtered Common Voice 24 subset. It reports zero-shot WERs, conducts a language identification audit revealing script output failures in Whisper models (≤0.8% Pashto script) versus high fidelity in others (>93%), demonstrates cross-domain degradation in fine-tuned models, and outlines structural impediments and research priorities. All evaluations are limited to read speech.

Significance. If the empirical results hold, this manuscript makes a valuable contribution by establishing baseline numbers for Pashto, a high-speaker-count language without prior shared benchmarks. The script-failure analysis is particularly insightful, showing that WER metrics alone can obscure complete failures in producing language-appropriate output. The reproducible setup on public datasets and the call for specific research priorities are strengths that can guide future work in multilingual ASR. The limitation to read speech is transparently noted but tempers the generalizability of the zero-shot and cross-domain claims.

major comments (2)
  1. The filtering steps applied to the Common Voice 24 Pashto subset are referenced but not detailed with counts of removed utterances or exact criteria (e.g., audio quality thresholds or duration limits). This information is necessary to evaluate potential selection bias and to enable full reproducibility of the reported WER and script percentages.
  2. The method for the language-identification audit (including the specific LID model or tool employed and its validation on Pashto) is not sufficiently described. Given that the central distinction between WER and interpretable ASR rests on the 0.8% vs. 93%+ figures, providing these details would strengthen confidence in the script-failure claim.
minor comments (3)
  1. Clarify the exact test set for the Whisper WER range of 90% to 297% and specify that the 461% figure for the medium model is on Common Voice 24.
  2. Consider adding the number of utterances in each test set and any statistical tests (e.g., significance of WER differences) to the tables or text reporting the main results.
  3. Ensure consistent terminology for 'Pashto-script output' and expand acronyms like ASR, WER, and LID on first use in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive evaluation of the manuscript's contribution and for the constructive comments on reproducibility. We address each major comment below and will incorporate the requested clarifications to strengthen the paper.

Point-by-point responses
  1. Referee: The filtering steps applied to the Common Voice 24 Pashto subset are referenced but not detailed with counts of removed utterances or exact criteria (e.g., audio quality thresholds or duration limits). This information is necessary to evaluate potential selection bias and to enable full reproducibility of the reported WER and script percentages.

    Authors: We agree that the filtering criteria and counts must be explicitly reported for reproducibility and to allow assessment of selection bias. The revised manuscript will include a dedicated paragraph in the datasets section specifying the exact criteria applied (duration limits, any quality thresholds), along with the number of utterances before and after filtering for the Common Voice 24 Pashto subset. revision: yes

  2. Referee: The method for the language-identification audit (including the specific LID model or tool employed and its validation on Pashto) is not sufficiently described. Given that the central distinction between WER and interpretable ASR rests on the 0.8% vs. 93%+ figures, providing these details would strengthen confidence in the script-failure claim.

    Authors: We concur that additional methodological detail on the language-identification audit is warranted to support the script-failure analysis. The revised manuscript will expand the relevant methods subsection to name the specific LID model or tool, describe its application to the ASR outputs, and report any validation performed on Pashto data. revision: yes

Circularity Check

0 steps flagged

No circularity: pure empirical benchmarking on external public datasets

full rationale

The paper performs direct evaluation of ten ASR models on two public read-speech test sets (FLEURS Pashto and filtered Common Voice 24), reporting measured WER values, a language-ID audit for script output, and degradation of previously published fine-tuned WERs when tested out-of-domain. No equations, fitted parameters, predictions derived from the paper's own results, or self-referential derivations appear. Core claims rest on external benchmarks and explicit measurement protocols rather than any tautological reduction to inputs. The read-speech limitation is stated openly as a boundary condition, not smuggled in as a result. This is a standard empirical evaluation paper whose findings are falsifiable against the cited public corpora.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard ASR evaluation practices and public datasets without introducing new free parameters or postulated entities.

axioms (2)
  • domain assumption Word Error Rate remains the primary metric for ASR even when script output is wrong
    Invoked throughout the zero-shot and fine-tuned sections; the paper itself demonstrates its insufficiency for script failures.
  • domain assumption The chosen public Pashto subsets are adequate proxies for general Pashto speech
    Used to support zero-shot and cross-domain claims; limited to read speech only.

pith-pipeline@v0.9.0 · 5668 in / 1428 out tokens · 67425 ms · 2026-05-10T19:40:41.401309+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Fine-tuning Whisper for Pashto ASR: strategies and scale

cs.CL · 2026-04 · unverdicted · novelty 4.0

    Vanilla fine-tuning of Whisper on Pashto data achieves 21.22% WER on CommonVoice v20, with whisper-small at 24.89% on v24 showing diminishing returns and online augmentation adding 7.25 pp benefit.

Reference graph

Works this paper leans on

28 extracted references · 9 canonical work pages · cited by 1 Pith paper

  1. [1]

    ihanif/ps_base_l1: Fine-tuned Whisper base for Pashto ASR,

H. Rahman, “ihanif/ps_base_l1: Fine-tuned Whisper base for Pashto ASR,” https://huggingface.co/ihanif/ps_base_l1, 2025, Whisper Base (72M) fine-tuned on Common Voice Pashto; WER 14 on the Common Voice test split; model weights publicly available on HuggingFace Hub under Creative Commons licence

  2. [2]

    Benchmarking Whisper for low-resource speech recognition: An N-shot evaluation on Pashto, Punjabi, and Urdu,

N. U. Sehar, A. Khalid, F. Adeeba, and S. Hussain, “Benchmarking Whisper for low-resource speech recognition: An N-shot evaluation on Pashto, Punjabi, and Urdu,” in Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025). Abu Dhabi, UAE: Association for Computational Linguistics, 2025, pp. 202–207. [Online]. Ava...

  3. [3]

Robust speech recognition via large-scale weak supervision,

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proceedings of the 40th International Conference on Machine Learning (ICML), 2023, Whisper; trained on 680k hours; multilingual incl. Pashto; WER on Pashto not reported in original paper

  4. [4]

    Scaling speech technology to 1,000+ languages,

V. Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Xu, A. Baevski, and M. Auli, “Scaling speech technology to 1,000+ languages,” Journal of Machine Learning Research, vol. 25, 2024, arXiv:2305.13516; MMS: Bible-domain training for 1,000+ languages; Pashto included; no general-domain Pashto eval; register mismatch

  5. [5]

Seamless: Multilingual expressive and streaming speech translation,

L. Barrault, Y.-A. Chung, M. C. Meglioli, D. Dale, N. Dong, P.-A. Dupont, P.-A. Duquenne, C. Fox, M. Gao, T. Gojayev, N. Gong et al., “Seamless: Multilingual expressive and streaming speech translation,” arXiv preprint arXiv:2312.05187, 2023, SeamlessM4T v2 and SeamlessStreaming; Pashto S2T included; zero-shot WER evaluated in this paper: 44.0% FLEU...

  6. [6]

Omnilingual ASR: Open-source multilingual speech recognition for 1600+ languages,

G. Keren, A. Kozhevnikov, Y. Meng, C. Ropers, M. Setzler, S. Wang, I. Adebara, M. Auli, C. Balioglu, K. Chan, C. Cheng et al., “Omnilingual ASR: Open-source multilingual speech recognition for 1600+ languages,” arXiv preprint arXiv:2511.09690, 2025, CTC and LLM model families; 1672 languages; Pashto (pbt_Arab, pbu_Arab) included; CTC-300M zero-shot WE...

  7. [7]

FLEURS: Few-shot learning evaluation of universal representations of speech,

A. Conneau, M. Ma, S. Khanuja, Y. Zhang, V. Axelrod, S. Dalmia, J. Riesa, C. Rivera, S. Bhosale, D. Ghosh et al., “FLEURS: Few-shot learning evaluation of universal representations of speech,” in Proceedings of the 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023, pp. 798–805, arXiv:2205.12446

  8. [8]

    Common voice: A massively-multilingual speech corpus,

R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in Proceedings of the 12th Language Resources and Evaluation Conference (LREC). Marseille, France: European Language Resources Association, 2020, pp. 4218–4222, foundational Common...

  9. [9]

    The phonemic value of Pashto language special letters,

    S. K. Shahedkhel, “The phonemic value of Pashto language special letters,” Journal of Arts and Humanities , vol. 8, no. 7, 2019

  10. [10]

A Reference Grammar of Pashto,

H. Tegey and B. Robson, A Reference Grammar of Pashto. Washington, D.C.: Center for Applied Linguistics, 1996, foundational Pashto grammar; phoneme inventory; morphology

  11. [11]

    ihanif/pashto-asr-v3: W2V-BERT 2.0 fine-tuned for Pashto ASR,

    H. Rahman, “ihanif/pashto-asr-v3: W2V-BERT 2.0 fine-tuned for Pashto ASR,” https://huggingface.co/ihanif/pashto-asr-v3, 2024, fine-tuned from facebook/w2v-bert-2.0 (600M parameters; CTC decoder); trained on FLEURS Pashto training split; published WER 13.96%; model weights publicly available on HuggingFace Hub

  12. [12]

    From WER and RIL to MER and WIL: Improved evaluation measures for connected speech recognition,

A. C. Morris, V. Maier, and P. Green, “From WER and RIL to MER and WIL: Improved evaluation measures for connected speech recognition,” in Proceedings of INTERSPEECH 2004, 2004, pp. 2765–2768

  13. [13]

    Advocating character error rate for multilingual ASR evaluation,

T. D K, J. James, D. P. Gopinath, and M. Ashraf K, “Advocating character error rate for multilingual ASR evaluation,” in Findings of the Association for Computational Linguistics: NAACL 2025, 2025, pp. 4941–4950, arXiv:2410.07400; CER correlates more closely with human judgement than WER for Malayalam, English, Arabic; WER structurally fails for aggluti...

  14. [14]

    ihanif/pashto-asr-base: Whisper base fine-tuned for Pashto ASR,

H. Rahman, “ihanif/pashto-asr-base: Whisper base fine-tuned for Pashto ASR,” https://huggingface.co/ihanif/pashto-asr-base, 2023, Whisper Base (72M) fine-tuned on FLEURS Pashto training split; CTC-style output; no published WER on model card; model weights publicly available on HuggingFace Hub

  15. [15]

    ihanif/wav2vec2-xls-r-300m-pashto: XLS-R 300m fine-tuned for Pashto ASR,

——, “ihanif/wav2vec2-xls-r-300m-pashto: XLS-R 300m fine-tuned for Pashto ASR,” https://huggingface.co/ihanif/wav2vec2-xls-r-300m-pashto, 2024, XLS-R 300M fine-tuned on FLEURS Pashto training split; CTC decoder; no published WER on model card; model weights publicly available on HuggingFace Hub

  16. [16]

    ihanif/w2v-bert2-pashto-augmented: Augmented W2V-BERT 2.0 for Pashto ASR,

——, “ihanif/w2v-bert2-pashto-augmented: Augmented W2V-BERT 2.0 for Pashto ASR,” https://huggingface.co/ihanif/w2v-bert2-pashto-augmented, 2024, fine-tuned from facebook/w2v-bert-2.0 (600M parameters; CTC decoder); trained on FLEURS Pashto training split with data augmentation; no published WER on model card; model weights publicly available on HuggingFace Hub

  17. [17]

Differences between acoustic characteristics of spontaneous and read speech and their effects on speech recognition performance,

S. Nakamura, K. Iwano, and S. Furui, “Differences between acoustic characteristics of spontaneous and read speech and their effects on speech recognition performance,” Computer Speech & Language, vol. 22, no. 2, pp. 171–184, 2008

  18. [18]

    Towards unified processing of Perso-Arabic scripts for ASR,

S. Bandarupalli, B. Akkiraju, S. C. Devarakonda, H. Sivaramasethu, V. Narasinga, and A. Vuppala, “Towards unified processing of Perso-Arabic scripts for ASR,” in Proceedings of the 1st Workshop on NLP for Languages Using Arabic Script (AbjadNLP 2025), 2025, addresses data-pipeline challenges for Arabic, Persian, Urdu, Sindhi, Pashto: shared-script langu...

  19. [19]

What is lost in normalization? Exploring pitfalls in multilingual ASR model evaluations,

K. Manohar, L. G. Pillai, and E. Sherly, “What is lost in normalization? Exploring pitfalls in multilingual ASR model evaluations,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, arXiv:2409.02449; Whisper’s normalization removes vowel diacritics from Indic scripts, producing 10.7–34.1 pp artificial WER r...

  20. [20]

WER We Stand: Benchmarking Urdu ASR Models,

S. Arif, S. Farid, A. J. Khan, M. Abbas, A. A. Raza, and A. Athar, “WER we stand: Benchmarking Urdu ASR models,” in Proceedings of the 31st International Conference on Computational Linguistics (COLING 2025), 2025, arXiv:2409.11252; evaluates Whisper, MMS-1B, and SeamlessM4T on read and conversational Urdu (Arabic-script ...

  21. [21]

    WavLM: Large-scale self-supervised pre- training for full stack speech processing,

S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, J. Wu, M. Zeng, X. Yu, and F. Wei, “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” pp. 1505–1518, 2022, WavLM; strong self-supervised speech model

  22. [22]

    Pashto synthetic speech dataset,

    H. Rahman, “Pashto synthetic speech dataset,” Private dataset; available on reasonable request from the author, 2023, approximately 10,000 utterances synthesised from Common Voice Pashto sentence prompts via Microsoft Edge TTS ( ps-AF-GulNawazNeural and ps-AF-LatifaNeural); audio at 24,000 Hz; held at ihanif/pashto_speech_20k (HuggingFace, private); not p...

  23. [23]

    VITS: Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,

J. Kim, J. Kong, and J. Son, “VITS: Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,” in Proceedings of the 38th International Conference on Machine Learning (ICML), 2021, pp. 5530–5540, VITS; single-stage end-to-end TTS; high naturalness; strong for low-resource

  24. [24]

    Kokoro-82M: A small but mighty TTS model,

hexgrad, “Kokoro-82M: A small but mighty TTS model,” https://huggingface.co/hexgrad/Kokoro-82M, 2024, 82M parameter StyleTTS2-based model; Apache 2.0 licence; trained on <100h; no fine-tuning code publicly released at time of writing

  25. [25]

XLS-R: Self-supervised cross-lingual speech representation learning at scale,

A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, and M. Auli, “XLS-R: Self-supervised cross-lingual speech representation learning at scale,” arXiv preprint arXiv:2111.09296, 2021, XLS-R (128 languages); multilingual SSL; Pashto data from VoxLingua107 (YouTube-scraped), NOT...

  26. [26]

    HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,

W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021, HuBERT; strong ASR baseline for low-resource fine-tuning

  27. [27]

    VoxLingua107: a dataset for spoken language recognition,

J. Valk and T. Alumäe, “VoxLingua107: a dataset for spoken language recognition,” in Proceedings of the IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 652–658, 107-language YouTube-scraped corpus; XLS-R uses this as one of five pre-training data sources

  28. [28]

    Pashto Common Voice: Building the first open speech corpus for a 60-million-speaker low-resource language,

    H. Rahman and S. u. Rehman, “Pashto Common Voice: Building the first open speech corpus for a 60-million-speaker low-resource language,” arXiv preprint arXiv:2603.27021, 2026, documents Common Voice Pashto corpus development across CV14–CV23; 107,781 clips (60,337 validated); WER 13.4% fine-tuned Whisper Base on MCV20 test split. [Online]. Available: http...