pith. machine review for the scientific record.

arxiv: 2604.06507 · v1 · submitted 2026-04-07 · 💻 cs.CL

Recognition: no theorem link

Fine-tuning Whisper for Pashto ASR: strategies and scale

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:25 UTC · model grok-4.3

classification 💻 cs.CL
keywords Pashto ASR · Whisper fine-tuning · LoRA · CommonVoice · speech recognition · low-resource languages · model adaptation · data augmentation

The pith

Vanilla full fine-tuning of Whisper reaches 21% word error rate for Pashto ASR and beats other methods

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Pashto was missing from Whisper pre-training, so standard models output wrong scripts and exceed 100% error on Pashto audio. The work tests full fine-tuning against LoRA, encoder freezing, and Urdu transfer on CommonVoice Pashto data. Full updates of all parameters deliver the lowest error rate while the alternatives fall short by large margins. Scaling experiments on more data show whisper-small reaching a practical sweet spot, with the larger model adding only marginal improvement. The findings matter for turning available Pashto recordings into working speech systems without starting from scratch.

Core claim

Vanilla full fine-tuning achieves WER 21.22% on CommonVoice Pashto v20, outperforming LoRA by 33.36 pp, frozen-encoder by 14.76 pp, and Urdu transfer by 44.56 pp. On v24 data with 113 hours, whisper-small reaches 24.89% and whisper-large-v3-turbo reaches 23.37%, indicating diminishing returns and making whisper-small the practical optimum. Online augmentation improves WER by 7.25 pp, while errors concentrate on word-final gender suffixes and substitutions involving the Pashto-unique consonant /ts/.
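For context on the metric behind these numbers, a minimal self-contained word error rate computation is sketched below (word-level edit distance divided by reference length, reported as a percentage). This is an illustrative reimplementation, not the paper's released evaluation script; it also shows why the zero-shot wrong-script outputs can exceed 100% WER when insertions outnumber reference words.

```python
# Minimal WER sketch (illustrative; not the paper's evaluation script).
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + sub) # substitution or match
    return 100.0 * d[len(ref)][len(hyp)] / max(len(ref), 1)

# Because insertions count as errors, a wrong-script transcript longer than the
# reference can push WER above 100%, as the zero-shot Whisper baselines do on Pashto.
```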

What carries the argument

Direct comparison of four fine-tuning strategies on Whisper models: vanilla full-parameter updates, LoRA at rank 64, freezing the first two of six encoder layers, and multistage transfer from Urdu checkpoints.
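To make the strategy arms concrete, here is a minimal sketch, assuming HuggingFace transformers and peft, of how the three single-stage arms could be configured. Only rank 64 and the 2-of-6 frozen encoder layers come from the paper; the LoRA target modules, alpha, and everything else are illustrative assumptions.

```python
# Sketch of three of the four strategy arms on whisper-base (assumptions noted inline).
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

BASE = "openai/whisper-base"  # 74.4M parameters, 6 encoder / 6 decoder layers

# Vanilla full fine-tuning: every parameter stays trainable.
vanilla = WhisperForConditionalGeneration.from_pretrained(BASE)

# LoRA at rank 64; adapter placement (q_proj/v_proj) and alpha are assumptions.
lora_cfg = LoraConfig(r=64, lora_alpha=128, target_modules=["q_proj", "v_proj"])
lora_model = get_peft_model(WhisperForConditionalGeneration.from_pretrained(BASE), lora_cfg)

# Frozen-encoder: freeze the first two of six encoder layers, train the rest.
frozen = WhisperForConditionalGeneration.from_pretrained(BASE)
for layer in frozen.model.encoder.layers[:2]:
    for param in layer.parameters():
        param.requires_grad = False
```

The fourth arm, multistage Urdu transfer, would instead initialise `from_pretrained` on an Urdu checkpoint before the Pashto stage, rather than starting from the multilingual whisper-base weights.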

If this is right

  • Full parameter updates are necessary because the layer-function separation assumed by encoder freezing does not hold in whisper-base's shallow six-layer encoder.
  • Whisper-small provides the best accuracy versus compute trade-off once training data reaches 113 hours.
  • Online data augmentation during training delivers consistent gains across model sizes (a rough augmentation sketch follows this list).
  • Error patterns point to specific morphological and consonant issues that future models could target directly.
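
As a rough illustration of the online augmentation referenced above, the sketch below applies speed perturbation and noise injection per example at load time. The perturbation factors and SNR range are assumptions chosen for illustration; the paper's exact augmentation settings are not reproduced here.

```python
# On-the-fly augmentation sketch: speed perturbation + noise injection (illustrative values).
import numpy as np

def speed_perturb(wave: np.ndarray, factor: float) -> np.ndarray:
    """Stretch or compress a 1-D float waveform by `factor` via linear interpolation."""
    n_out = int(round(len(wave) / factor))
    positions = np.linspace(0, len(wave) - 1, num=n_out)
    return np.interp(positions, np.arange(len(wave)), wave).astype(wave.dtype)

def add_noise(wave: np.ndarray, snr_db: float, rng: np.random.Generator) -> np.ndarray:
    """Add white noise at the requested signal-to-noise ratio (dB)."""
    signal_power = float(np.mean(wave ** 2)) + 1e-12
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(scale=np.sqrt(noise_power), size=wave.shape)
    return (wave + noise).astype(wave.dtype)

def augment(wave: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Randomly perturb speed and inject noise; applied fresh each epoch ('online')."""
    wave = speed_perturb(wave, factor=float(rng.choice([0.9, 1.0, 1.1])))
    return add_noise(wave, snr_db=float(rng.uniform(10.0, 30.0)), rng=rng)
```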

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Languages with phonological features absent from pre-training may require full fine-tuning even when parameter-efficient methods work elsewhere.
  • Public release of the fine-tuned checkpoints makes it possible to test performance on Pashto speech outside CommonVoice (a loading sketch follows this list).
  • The observed plateau in gains suggests that collecting more diverse Pashto data could matter more than scaling model size further.
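
A hypothetical usage sketch for that kind of out-of-domain check, assuming the HuggingFace transformers ASR pipeline; the repository id and audio file name below are placeholders, not checkpoint names from the paper.

```python
# Placeholder repo id and audio path -- substitute the released checkpoint of interest.
from transformers import pipeline

asr = pipeline(
    task="automatic-speech-recognition",
    model="username/whisper-small-pashto",  # hypothetical repo id
)

# chunk_length_s lets the pipeline segment recordings longer than Whisper's 30 s window.
result = asr("field_recording.wav", chunk_length_s=30)
print(result["text"])
```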

Load-bearing premise

That the word error rate gaps between strategies reflect genuine differences in approach rather than random training effects or unintended overlap in the CommonVoice Pashto training and test splits.

What would settle it

Re-run the four strategies several times with fresh random seeds on the same data splits and check whether the same ordering of WER results holds.
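
A sketch of that replication loop, assuming a hypothetical `train_and_eval(strategy, seed)` helper that performs one full fine-tuning and evaluation run on the fixed CV20 splits and returns WER in percent; the seed values are arbitrary.

```python
# Multi-seed replication check: does the strategy ordering survive fresh seeds?
import random
import numpy as np
import torch

STRATEGIES = ["vanilla", "lora_r64", "frozen_encoder_2of6", "urdu_transfer"]

def set_seed(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

def ordering_is_stable(train_and_eval, seeds=(13, 29, 47)) -> bool:
    orderings = []
    for seed in seeds:
        set_seed(seed)
        wer_by_strategy = {s: train_and_eval(s, seed) for s in STRATEGIES}
        orderings.append(tuple(sorted(STRATEGIES, key=wer_by_strategy.get)))
    # The paper's claim holds up if the best-to-worst ordering is identical across seeds.
    return len(set(orderings)) == 1
```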

Figures

Figures reproduced from arXiv: 2604.06507 by Hanif Rahman.

Figure 1. Zero-shot versus fine-tuned WER for all three model sizes on CV24. Fine-tuning reduces WER by 86–144 pp absolute.
Figure 6. Augmentation ablation on CV24 (matched hyperparameters). The clean comparison (matched lr=5e-5, eff_batch=32, BF16) shows online augmentation provides a 7.25 pp WER benefit on CV24 (27.13% vs 34.38%); the benefit holds for CER as well (8.79% vs 11.43%, a 2.64 pp gap). Speed perturbation and noise injection reduce overfitting to a limited speaker distribution even …
Figure 2. Augmentation ablation on CV24 (matched hyperparameters). Augmentation provides a 7.25 pp WER benefit.
Figure 3. Efficiency frontier: zero-shot and fine-tuned WER across model sizes on CV24. X-axis is log-scale in parameters.
Original abstract

Pashto is absent from Whisper's pre-training corpus despite being one of CommonVoice's largest language collections, leaving off-the-shelf models unusable: all Whisper sizes output Arabic, Dari, or Urdu script on Pashto audio, achieving word error rates above 100%. We compare four fine-tuning strategies for whisper-base on CommonVoice Pashto v20: vanilla full fine-tuning, LoRA (rank 64), frozen-encoder (2/6 layers), and multistage Urdu-to-Pashto transfer. We extend vanilla fine-tuning to whisper-small and whisper-large-v3-turbo on CommonVoice Pashto v24 (113 hours). Vanilla fine-tuning achieves WER 21.22% on CV20, outperforming LoRA by 33.36 pp, frozen-encoder by 14.76 pp, and Urdu transfer by 44.56 pp. Frozen-encoder fine-tuning degrades performance on whisper-base (6 encoder layers): layer-function separation does not hold at this depth, and freezing removes a third of trainable capacity. Urdu-to-Pashto transfer fails due to an unverified intermediate checkpoint, phonological mismatch, and insufficient training. On CV24, whisper-small achieves WER 24.89% (2.24 pp over whisper-base at 3.3x parameters); whisper-large-v3-turbo achieves 23.37% (a further 1.52 pp). Diminishing returns indicate whisper-small is the practical optimum at 113 hours. Online augmentation provides 7.25 pp WER benefit over matched training. Error analysis identifies word-final suffix confusion (masculine -ay vs. feminine -a) and retroflex substitutions involving the Pashto-unique consonant /ts/ as dominant failure modes. Fine-tuned checkpoints and evaluation scripts are released on HuggingFace.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript evaluates four fine-tuning strategies for Whisper ASR models on Pashto using CommonVoice v20 and v24 datasets. Vanilla full fine-tuning on whisper-base yields 21.22% WER on CV20, outperforming LoRA (by 33.36 pp), frozen-encoder (by 14.76 pp), and Urdu-to-Pashto transfer (by 44.56 pp). Scaling vanilla fine-tuning to whisper-small and large-v3-turbo on the 113-hour CV24 set shows whisper-small at 24.89% WER as the practical optimum, with diminishing returns for larger models. Online augmentation improves WER by 7.25 pp; error analysis highlights word-final suffix confusion and retroflex /ts/ substitutions. Checkpoints and scripts are released on Hugging Face.

Significance. If the empirical results hold, the work offers concrete, actionable guidance for adapting Whisper to low-resource languages absent from pre-training, showing that full fine-tuning is required over parameter-efficient alternatives when capacity is limited and phonological mismatch is present. The large observed gaps, scaling analysis, released artifacts, and linguistic error breakdown make it a useful reference for practitioners and a reproducible baseline for Pashto ASR.

major comments (2)
  1. [§4] §4 (Results on CV20): The reported WER gaps (14–44 pp) are large enough to exceed typical single-run variance in ASR fine-tuning, but the manuscript does not report multiple random seeds, standard deviations, or statistical significance tests; this leaves open the possibility that training stochasticity or data-split specifics contribute to the observed differences between strategies.
  2. [§3.3] §3.3 (Urdu-to-Pashto transfer): The explanation that failure stems from an 'unverified intermediate checkpoint' is post-hoc; without the exact checkpoint hash, training log, or ablation confirming the checkpoint's validity, it is difficult to isolate whether phonological mismatch or training insufficiency is the dominant cause.
minor comments (3)
  1. [Abstract] Abstract: Training hyperparameters (learning rate, batch size, epochs, optimizer) and exact train/dev/test split sizes for both CV20 and CV24 are not stated, even though the full paper supplies configurations; adding a one-sentence summary would improve standalone readability.
  2. [§5] §5 (Error analysis): The discussion of masculine -ay vs. feminine -a suffix confusion would benefit from one or two concrete Pashto example transcriptions and their erroneous outputs to illustrate the dominant failure mode.
  3. [Table 1] Table 1 or equivalent: The LoRA rank-64 configuration and the exact number of frozen encoder layers (stated as 2/6) should be cross-referenced to the model architecture diagram for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive overall assessment of the work. We address each major comment below with point-by-point responses and indicate where revisions will be made to the manuscript.

Point-by-point responses
  1. Referee: [§4] §4 (Results on CV20): The reported WER gaps (14–44 pp) are large enough to exceed typical single-run variance in ASR fine-tuning, but the manuscript does not report multiple random seeds, standard deviations, or statistical significance tests; this leaves open the possibility that training stochasticity or data-split specifics contribute to the observed differences between strategies.

    Authors: We agree that reporting results across multiple random seeds with standard deviations and statistical tests would strengthen the reliability of the comparisons. The observed WER gaps (14.76–44.56 pp) are substantially larger than typical single-run variance in ASR fine-tuning (commonly 1–2 pp), making it unlikely that stochasticity alone explains the differences. However, each configuration was trained only once with a fixed seed owing to the high computational cost of full fine-tuning on the available hardware. In the revised manuscript we have added an explicit limitations paragraph acknowledging the single-run nature of the experiments and recommending multi-seed evaluation for future replications; no significance tests are reported because multiple independent runs were not performed. revision: partial

  2. Referee: [§3.3] §3.3 (Urdu-to-Pashto transfer): The explanation that failure stems from an 'unverified intermediate checkpoint' is post-hoc; without the exact checkpoint hash, training log, or ablation confirming the checkpoint's validity, it is difficult to isolate whether phonological mismatch or training insufficiency is the dominant cause.

    Authors: We acknowledge that the original phrasing attributing failure to an unverified intermediate checkpoint was based on contemporaneous training observations rather than fully documented artifacts. We do not have the precise checkpoint hash or complete training logs from that run available for release. In the revised §3.3 we have updated the discussion to state that the multistage transfer approach underperformed, most likely due to phonological mismatch between Urdu and Pashto together with possible training issues in the intermediate stage, while explicitly noting the absence of an ablation study and the lack of verifiable checkpoint details as limitations of the current analysis. revision: yes

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The manuscript is a purely empirical comparison of ASR fine-tuning runs on CommonVoice Pashto. It reports measured WER values (e.g., 21.22 % for vanilla full fine-tuning) obtained from concrete training configurations, without any equations, fitted parameters, predictions, or first-principles derivations. All performance gaps are presented as direct experimental outcomes, supported by released checkpoints and scripts. No self-definitional steps, fitted-input predictions, or load-bearing self-citations exist because the paper contains no derivation chain that could reduce its results to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper performs standard supervised fine-tuning and evaluation on public speech data without introducing new mathematical parameters, axioms beyond domain-standard ASR metrics, or postulated entities.

axioms (1)
  • domain assumption: Word Error Rate is a suitable primary metric for comparing ASR systems on Pashto
    All performance claims rest on WER numbers reported for the different fine-tuning runs.

pith-pipeline@v0.9.0 · 5622 in / 1308 out tokens · 50241 ms · 2026-05-10T18:25:44.432052+00:00 · methodology


Reference graph

Works this paper leans on

24 extracted references · 6 canonical work pages · 2 internal anchors

  1. [1]

    Fine-tuning Whisper for Pashto ASR: strategies and scale

    Introduction Automatic speech recognition has advanced rapidly through large-scale weakly-supervised models. Whisper [2], trained on 680,000 hours of multilingual audio, achieves zero-shot transcription across 99 languages. For languages well-represented in that training mix (European languages, Mandarin, Arabic, Japanese) zero-shot WER is often below 20%...

  2. [2]

    The first systematic comparison of four Whisper fine-tuning strategies (vanilla, LoRA, frozen encoder, multistage Urdu transfer) on Pashto, using a fixed pre-augmented training corpus and a held-out test split from CommonVoice Pashto v20

  3. [3]

    Liu et al.’s [3] recommendation applies: the benefit requires models with 12 or more encoder layers

    Evidence that encoder-layer freezing degrades performance on whisper-base (6 encoder layers), bounding the conditions under which Liu et al.’s [3] recommendation applies: the benefit requires models with 12 or more encoder layers

  4. [4]

    A failure analysis of Urdu→Pashto multistage transfer, identifying three candidate mechanisms: unverified intermediate model quality, phonological mismatch, and insufficient second-stage training steps

  5. [5]

    A model-scale efficiency comparison on CommonVoice Pashto v24 (whisper-base, whisper-small, whisper-large-v3-turbo) with approximately 113 hours of training data

  6. [6]

    Sections 2–5 cover related work, datasets, methods, and setup; Sections 6–8 report and discuss results

    Open release of fine-tuned checkpoints and evaluation scripts on HuggingFace, with results on a standardised test split for future Pashto ASR benchmarking. Sections 2–5 cover related work, datasets, methods, and setup; Sections 6–8 report and discuss results

  7. [7]

    [2] trained Whisper on 680,000 hours of multilingual audio spanning 99 languages

    Related work 2.1 Whisper fine-tuning strategies for low-resource ASR Radford et al. [2] trained Whisper on 680,000 hours of multilingual audio spanning 99 languages. Performance degrades sharply for languages underrepresented in the pre-training data. Liu et al

  8. [8]

    confirm this on Fleurs: zero-shot WER ranges from 48% to over 90% for seven low-resource languages, while targeted fine-tuning reduces WER to 12–45%. Pashto has very limited presence in the pre-training corpus; Whisper outputs Arabic, Dari, or Urdu script rather than Pashto text at zero shot, and WER exceeds 100% for all model sizes on CV24 [1]. Vanilla ...

  9. [9]

    Dataset 3.1 Pashto language context Pashto is an Eastern Iranian language spoken primarily in Afghanistan and Pakistan. Its script is an extended Perso-Arabic alphabet of 44 letters, several of which represent phonemes absent from Arabic, Persian, and Urdu: the retroflex consonants and the pharyngeal fricative /ħ/. This extended phoneme...

  10. [10]

    The base model is openai/whisper-base (74.4M parameters, 6 encoder layers, 6 decoder layers)

    Methods 4.1 Common configuration All CV20 strategy arm experiments (exp_001–004) share the configuration in Table 4. The base model is openai/whisper-base (74.4M parameters, 6 encoder layers, 6 decoder layers). Early stopping patience is 5 (2,500 steps without improvement); best checkpoint selected by minimum normalised WER on the validation split. 4.2 Va...

  11. [11]

    Exp_004 ran on an NVIDIA A40 (48 GB VRAM); the model and batch configuration fit within 24 GB, so results are not affected by this difference

    Experimental setup 5.1 Hardware CV20 strategy arm experiments (exp_001–003) ran on a single NVIDIA RTX 4090 (24 GB VRAM). Exp_004 ran on an NVIDIA A40 (48 GB VRAM); the model and batch configuration fit within 24 GB, so results are not affected by this difference. CV24 model scale experiments: • Exp_008 (whisper-base, BF16, eff_batch=32): RTX 4090 or A40....

  12. [12]

    All three models exceed 100% WER because Whisper outputs Arabic, Dari, or Urdu script on Pashto audio; errors exceed the reference word count

    Results 6.1 Zero-shot baselines Zero-shot WER for each model on CV24 (N=13,643) is reported by Rahman [1] and reproduced in Table 2. All three models exceed 100% WER because Whisper outputs Arabic, Dari, or Urdu script on Pashto audio; errors exceed the reference word count. No Whisper model generates Pashto-script output in more than 0.8% of utterances a...

  13. [13]

    This contradicts Liu et al

    Discussion 7.1 Why layer freezing fails on whisper-base The frozen-encoder strategy (exp_003) underperforms vanilla by 14.76 pp WER. This contradicts Liu et al. [3], who report that freezing the bottom encoder layers improves WER for whisper-small and larger models. The reversal at whisper-base has a clear structural explanation. Liu et al.’s CKA analysi...

  14. [14]

    sharjeel103/whisper-base-urdu has no published test-set WER

    Unverified intermediate checkpoint. sharjeel103/whisper-base-urdu has no published test-set WER. If this model has poor Urdu representations, stage-2 training inherits a worse initialisation than the original 680,000-hour multilingual whisper-base. Vanilla starts from those multilingual weights; the multistage approach is entirely dependent on a th...

  15. [15]

    Urdu and Pashto share the Nastaliq script and Persian/Arabic loanword vocabulary, but Pashto’s retroflex and pharyngeal consonants are absent from Urdu’s phoneme set

    Phonological mismatch. Urdu and Pashto share the Nastaliq script and Persian/Arabic loanword vocabulary, but Pashto’s retroflex and pharyngeal consonants are absent from Urdu’s phoneme set. An Urdu-trained model has no signal for these sounds and may have suppressed sensitivity to them. The conservative second-stage LR (5e-6) provides only a weak gradient...

  16. [16]

    Insufficient second-stage training. The training curve shows monotonically declining loss through 10,000 steps, with the best checkpoint at step 7,500 (WER 65.78%) and loss still improving at step 10,000 while WER marginally worsens. The model had not converged within the training budget. For future cross-lingual transfer: use Persian (Dari/Farsi) as the i...

  17. [17]

    Three conclusions follow directly

    Conclusion This paper compared four strategies for fine-tuning Whisper on Pashto ASR and characterised the accuracy-versus-compute trade-off across three model sizes. Three conclusions follow directly. Vanilla full fine-tuning is the correct default for whisper-base on Pashto: WER 21.22% on CV20, 33.36 pp better than LoRA r=64, 14.76 pp better than frozen...

  18. [18]

    Benchmarking Multilingual Speech Models on Pashto: Zero-Shot ASR, Script Failure, and Cross-Domain Evaluation

    H. Rahman, “Benchmarking multilingual speech models on Pashto: Zero-shot ASR, script failure, and cross-domain evaluation,” arXiv preprint arXiv:2604.04598, 2026. Available: https://arxiv.org/abs/2604.04598

  19. [19]

    Robust speech recognition via large-scale weak supervision,

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proceedings of the 40th international conference on machine learning (ICML), 2023, pp. 28492–28518. Available: https://proceedings.mlr.press/v202/radford23a.html

  20. [20]

    Exploration of Whisper fine-tuning strategies for low-resource ASR,

    Y. Liu, X. Yang, and D. Qu, “Exploration of Whisper fine-tuning strategies for low-resource ASR,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2024, no. 29, 2024, doi: 10.1186/s13636-024-00349-3

  21. [21]

    Whispering in Amharic: Fine-tuning Whisper for low-resource language,

    D. K. Gete, B. Y. Ahmed, T. D. Belay, and Y. A. Ejigu, “Whispering in Amharic: Fine-tuning Whisper for low-resource language,” arXiv preprint arXiv:2503.18485, 2025. Available: https://arxiv.org/abs/2503.18485

  22. [22]

    LoRA: Low-rank adaptation of large language models,

    E. J. Hu et al., “LoRA: Low-rank adaptation of large language models,” in International conference on learning representations (ICLR), 2022. Available: https://openreview.net/forum?id=nZeVKeeFYf9

  23. [23]

    Low-resource speech recognition by fine-tuning Whisper with Optuna-LoRA,

    L. Yang, B. Hou, and M. Qin, “Low-resource speech recognition by fine-tuning Whisper with Optuna-LoRA,” Applied Sciences, vol. 15, no. 24, p. 13090, 2025, doi: 10.3390/app152413090

  24. [24]

    Multistage fine-tuning strategies for automatic speech recognition in low-resource languages,

    L. G. Pillai, K. Manohar, B. K. Raju, and E. Sherly, “Multistage fine-tuning strategies for automatic speech recognition in low-resource languages,” arXiv preprint arXiv:2411.04573, 2024. Available: https://arxiv.org/abs/2411.04573