Subtitle-Aligned Fine-Tuning of Whisper for Swiss German ASR: Benchmark Contamination, Convention Mismatch, and an Honest Baseline at 25.6% WER (13.8% cWER)

Felix Akeret

arxiv: 2606.07608 · v1 · pith:4A3FRXHUnew · submitted 2026-05-29 · 💻 cs.CL · cs.AI· cs.LG· cs.SD

Subtitle-Aligned Fine-Tuning of Whisper for Swiss German ASR: Benchmark Contamination, Convention Mismatch, and an Honest Baseline at 25.6% WER (13.8% cWER)

Felix Akeret This is my paper

Pith reviewed 2026-06-28 22:49 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LGcs.SD

keywords Swiss German ASRWhisper fine-tuningbenchmark contaminationword error ratedialect speech recognitionself-trainingcontent WERweak supervision

0 comments

The pith

A vanilla Whisper model self-trained on the test set alone reaches 13.88% WER, beating all published Swiss German ASR systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that current Swiss German ASR benchmarks suffer from contamination, allowing models to perform well by matching evaluation conventions rather than recognizing the dialect. Using subtitle-aligned broadcast data for fine-tuning Whisper, the authors produce an honest baseline of 25.6% WER on strictly disjoint test data, which improves to 13.8% when counting only genuine recognition failures instead of stylistic differences. A decisive experiment shows that even a model with no Swiss German training data achieves 13.88% WER after self-training on the test set, undercutting prior reported figures of 17.1-17.5%. This indicates that the benchmark has been measuring data leakage and convention familiarity more than genuine capability. The released models and metrics supply a reproducible reference point free of such issues.

Core claim

Published state-of-the-art Swiss German ASR results of 17.1-17.5% WER are inflated by benchmark contamination, as demonstrated by a vanilla Whisper model self-trained on the ASGDTS test set with zero Swiss German data achieving 13.88% WER; an honestly evaluated fine-tuned model on disjoint data reaches 25.6% WER or 13.8% cWER after separating genuine errors from valid stylistic variation.

What carries the argument

Self-training a vanilla Whisper model directly on the ASGDTS test set to serve as a contamination probe that isolates the effect of prior exposure to evaluation data.

If this is right

The ASGDTS benchmark primarily measures convention matching rather than dialectal comprehension.
Content WER (cWER) that excludes tense, word order, and orthographic variations provides a stricter measure of actual recognition quality.
Bias-corrected estimates place the true error rate near 8.5%, roughly one third of the raw WER.
Openly released LoRA and full fine-tuned models offer reproducible baselines that require no institutional data agreements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Other low-resource dialect or accented speech benchmarks may contain similar hidden contamination from repeated use of the same evaluation data.
Future ASR evaluations for dialects should employ multiple independent test sets collected at different times to reduce leakage risk.
Subtitle-based weak supervision from broadcast media could be tested on additional language pairs where aligned transcripts are scarce.

Load-bearing premise

The self-training experiment on the test set serves as a valid proxy for the data exposure present in the training histories of the published comparison systems.

What would settle it

If a new Swiss German test set collected after all published systems were trained shows those systems still achieving below 17% WER while the contamination-free baseline remains above 20%, the contamination explanation would fail.

Figures

Figures reproduced from arXiv: 2606.07608 by Felix Akeret.

**Figure 1.** Figure 1: Benchmark dilution: WER vs. dilution factor (total training h / ASGDTS h, log scale). Blue: our self-training experiments on ASGDTS. Red triangles: published systems — D’Intino (17.1% on SRB-300 test, 303h same-convention, proper split) and Michaud (17.5% on ASGDTS, included in training). Both fall within the range of our self-trained models, consistent with convention matching as the dominant factor. Our … view at source ↗

read the original abstract

We present a systematic study of fine-tuning OpenAI's Whisper large-v3 for Swiss German ASR, using 1,367 hours of broadcast speech paired with Standard German subtitles as weak supervision. Through 16 iterative training runs on an NVIDIA DGX Spark (Grace Blackwell, 128 GB unified memory, up to 1 PFLOP FP4), we compare LoRA and full fine-tuning of the 1.55B-parameter model, investigate hallucination root causes, and quantify the effect of data quality, subtitle alignment, and training strategy. Our best model achieves 25.6% measured WER on the All Swiss German Dialects Test Set (ASGDTS) in an honest evaluation on strictly disjoint data. A harmonized error analysis separating genuine errors from valid stylistic variation (tense, word order, Swiss orthography) yields a content WER (cWER) of 13.8%, counting only actual recognition failures. Bias-corrected estimation reduces this to 8.5%, suggesting the true error rate is roughly one third of measured WER. We demonstrate that published state-of-the-art Swiss German ASR results (17.1-17.5% WER) are inflated by benchmark contamination: a vanilla Whisper model self-trained on the ASGDTS test set with zero Swiss German data achieves 13.88% WER, surpassing all published systems. Experiments with Phi-4-multimodal show an even stronger memorization effect (3.9% WER), revealing that the benchmark primarily measures convention matching rather than dialectal comprehension. We release two models, a LoRA adapter (25.32% WER, 13.9% cWER) and a full fine-tuned model (25.60% WER, 13.8% cWER), among the few publicly available, honestly evaluated Whisper models for Swiss German, under Apache 2.0 with full reproducibility, requiring no institutional data agreements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Self-training on the test set beats published Swiss German numbers and flags a real evaluation problem, but the link to prior work stays indirect.

read the letter

The punchline is that a vanilla Whisper model self-trained only on the ASGDTS test set, with zero Swiss German data, reaches 13.88% WER and beats the published 17.1-17.5% figures. That single control experiment is the clearest new piece and shows the benchmark can be gamed by convention matching rather than actual dialect recognition.

The paper runs 16 iterative fine-tuning passes on 1,367 hours of broadcast audio with Standard German subtitles, compares LoRA against full fine-tuning, and separates measured WER from a content WER that ignores tense, word order, and orthography differences. Their best honest model hits 25.6% WER / 13.8% cWER on strictly disjoint data, and they release both the LoRA adapter and full model under Apache 2.0 with reproducibility details. The Phi-4 experiment showing even lower error under the same setup adds weight to the memorization point. These elements are useful and grounded.

The soft spot is the direct claim that published results are inflated by contamination. The self-training run demonstrates what explicit leakage can achieve, but the paper supplies no data-overlap audit, source tracing, or reproduction of the specific pipelines behind the 17.1-17.5% numbers. If those systems used different broadcast material or stricter filtering, the explanation does not transfer. The bias correction down to 8.5% also rests on choices that would need the full methods section to evaluate.

This is for people working on dialectal ASR or benchmark hygiene. It raises a legitimate flag without overclaiming the math or data. A serious editor should send it to review so the alignment and self-training details can be checked.

Referee Report

2 major / 2 minor

Summary. The paper reports a study of fine-tuning Whisper large-v3 on 1,367 hours of Swiss German broadcast speech paired with Standard German subtitles as weak supervision. Across 16 training runs comparing LoRA and full fine-tuning, the best model reaches 25.6% WER (13.8% cWER after harmonized error analysis) on the ASGDTS test set under strictly disjoint evaluation. A bias-corrected estimate lowers this to 8.5%. The central claim is that published Swiss German ASR SOTA results (17.1-17.5% WER) are inflated by benchmark contamination, supported by a vanilla Whisper model self-trained on the ASGDTS test set (zero Swiss German data) achieving 13.88% WER and a Phi-4-multimodal model reaching 3.9% WER. Two models are released under Apache 2.0 with full reproducibility.

Significance. If the contamination claim holds, the work would usefully expose limitations in current Swiss German ASR benchmarks and provide a reproducible honest baseline that separates genuine recognition errors from stylistic variation. The release of models, emphasis on disjoint evaluation, and explicit reporting of 16 runs plus self-training controls are strengths that support reproducibility in the field.

major comments (2)

[Abstract] Abstract and contamination experiment section: the claim that published 17.1-17.5% WER results are inflated by contamination is not directly supported, because the self-training experiment (13.88% WER) only shows performance achievable with explicit test-set exposure; no data-overlap audit, reproduction of the cited pipelines, or analysis of the broadcast sources and splits used in those papers is provided to establish that the published systems actually employed comparable leakage or convention matching.
[Results] Results on bias correction: the reduction of cWER from 13.8% to a bias-corrected 8.5% is load-bearing for the honest baseline claim, yet the manuscript provides insufficient detail on the correction procedure, the exact definition of the bias term, and how it interacts with the harmonized error analysis separating stylistic variation from genuine errors.

minor comments (2)

[Methods] The description of the 16 iterative training runs would benefit from an explicit table listing hyperparameters, LoRA rank, learning rates, and data subsets used in each run to allow independent verification.
Notation for cWER should be defined once in the main text with a clear equation or procedural description rather than only in the abstract.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below and indicate where revisions will be made to the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract and contamination experiment section: the claim that published 17.1-17.5% WER results are inflated by contamination is not directly supported, because the self-training experiment (13.88% WER) only shows performance achievable with explicit test-set exposure; no data-overlap audit, reproduction of the cited pipelines, or analysis of the broadcast sources and splits used in those papers is provided to establish that the published systems actually employed comparable leakage or convention matching.

Authors: We agree that the self-training result provides suggestive rather than definitive evidence of contamination in the specific prior works. The experiment shows that exposure to the test set alone enables a model with zero Swiss German training data to exceed published SOTA performance, which is consistent with convention matching playing a substantial role. We will revise the abstract and contamination experiment section to frame the finding as evidence that benchmark contamination via convention mismatch can produce inflated results, rather than claiming direct inflation of the cited 17.1-17.5% figures. We will also add an explicit limitation statement noting the absence of a full data-overlap audit or pipeline reproduction. This constitutes a partial revision. revision: partial
Referee: [Results] Results on bias correction: the reduction of cWER from 13.8% to a bias-corrected 8.5% is load-bearing for the honest baseline claim, yet the manuscript provides insufficient detail on the correction procedure, the exact definition of the bias term, and how it interacts with the harmonized error analysis separating stylistic variation from genuine errors.

Authors: The referee correctly identifies that the bias-correction procedure is under-specified. We will expand the relevant results section to provide the exact definition of the bias term, the mathematical formulation of the correction, the estimation method used to arrive at 8.5%, and a clear description of how the correction interacts with the harmonized error analysis. This will ensure the honest baseline claim is fully transparent and reproducible. revision_made = 'yes' revision: yes

Circularity Check

0 steps flagged

No significant circularity; central contamination claim rests on independent experimental control

full rationale

The paper's key demonstration—that a vanilla Whisper model self-trained on the ASGDTS test set reaches 13.88% WER—constitutes an explicit, reproducible experiment rather than a fitted parameter or self-referential definition. No equations reduce by construction to inputs, no predictions are statistically forced from subsets of the same data, and no load-bearing self-citations or uniqueness theorems are invoked. The comparison to external published figures (17.1-17.5% WER) relies on reported numbers from other work, not on any internal tautology. Bias correction and cWER adjustments are post-hoc analyses but do not collapse the main result. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The work rests on the domain assumption that Standard German subtitles supply usable weak supervision for Swiss German recognition and on the premise that the self-training control fairly represents contamination in prior published models; no free parameters or invented entities are described in the abstract.

free parameters (1)

training hyperparameters and LoRA configuration
Sixteen iterative runs comparing LoRA and full fine-tuning imply multiple hyperparameters were selected or tuned, though none are named in the abstract.

axioms (2)

domain assumption Standard German subtitles paired with Swiss German audio constitute sufficient weak supervision for ASR fine-tuning
The training pipeline uses 1,367 hours of broadcast speech with Standard German subtitles as weak supervision.
domain assumption The ASGDTS test set was not seen during training of the published systems being compared
The contamination claim is advanced by contrasting the self-trained 13.88% WER against the 17.1-17.5% published figures.

pith-pipeline@v0.9.1-grok · 5920 in / 1640 out tokens · 35292 ms · 2026-06-28T22:49:31.944915+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references · 9 canonical work pages · 2 internal anchors

[1]

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

Abdelrahman Abouelenin et al. Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-LoRAs.arXiv:2503.01743,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Standard-to-Dialect Transfer Trends Differ across Text and Speech: A Case Study on Intent and Topic Classification in German Dialects

5.6B multimodal model with mixture-of-LoRAs for vision and speech. Verena Blaschke, Miriam Winkler, and Barbara Plank. Standard-to-dialect transfer trends differ across text and speech: A case study on intent and topic classification in german dialects. arXiv:2510.07890,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Learning rate scaling across LoRA ranks and transfer to full finetuning.arXiv:2602.06204,

Nan Chen, Soledad Villar, and Soufiane Hayou. Learning rate scaling across LoRA ranks and transfer to full finetuning.arXiv:2602.06204,

work page arXiv
[4]

Advancing STT for low-resource real-world speech

Flavio D’Intino and Hans-Peter Hutter. Advancing STT for low-resource real-world speech. InHCI International 2025, volume 15822 ofLNCS, pages 290–309. Springer,

2025
[5]

Eyal Liron Dolev, Clemens Fidel Lutz, and Noëmi Aepli

doi: 10.1007/978-3-031-93429-2_20. Eyal Liron Dolev, Clemens Fidel Lutz, and Noëmi Aepli. Does whisper understand swiss german? an automatic, qualitative, and human evaluation. InProc. VarDial Workshop,

work page doi:10.1007/978-3-031-93429-2_20
[6]

LoRA-only: 7.19% WER; encoder unfreeze: 3.54% WER on zeroth-test

35K Korean speech samples. LoRA-only: 7.19% WER; encoder unfreeze: 3.54% WER on zeroth-test. 23 Yunpeng Liu, Xukui Yang, and Dan Qu. Exploration of whisper fine-tuning strategies for low-resource ASR.EURASIP Journal on Audio, Speech, and Music Processing, 2024(1):29,

2024
[7]

Michel Plüss, Lukas Neukom, and Manfred Vogel

QLoRA, 17.5% WER on ASGDTS. Michel Plüss, Lukas Neukom, and Manfred Vogel. SwissText 2021 Task 3: Swiss German Speech to Standard German Text. InProceedings of the Swiss Text Analytics Conference,

2021
[8]

Zitha Sasindran, Harsha Yelchuri, T. V. Prabhakar, and Supreeth Rao.Heval: A new hybrid evaluation metric for automatic speech recognition tasks.arXiv:2211.01722,

work page arXiv
[9]

Swiss german speech to text system evaluation.arXiv:2207.00412,

Yanick Schraner, Christian Scheller, Michel Plüss, and Manfred Vogel. Swiss german speech to text system evaluation.arXiv:2207.00412,

work page arXiv
[10]

Fine-tuning whisper on low-resource languages for real-world applications.arXiv:2412.15726,

Vincenzo Timmel, Claudio Paonessa, Reza Kakooee, Manfred Vogel, and Daniel Perruchoud. Fine-tuning whisper on low-resource languages for real-world applications.arXiv:2412.15726,

work page arXiv
[11]

Evaluation of LLMs in speech is often flawed: Test set contamination in large language models for speech recognition.arXiv:2505.22251,

Yuan Tseng, Titouan Parcollet, Rogier van Dalen, Shucong Zhang, and Sourav Bhattacharya. Evaluation of LLMs in speech is often flawed: Test set contamination in large language models for speech recognition.arXiv:2505.22251,

work page arXiv
[12]

Calm-whisper: Reduce whisper hallucination on non-speech by calming crazy heads down

Yingzhi Wang, Anas Alhmoud, Saad Alsahly, Muhammad Alqurishi, and Mirco Ravanelli. Calm-whisper: Reduce whisper hallucination on non-speech by calming crazy heads down. arXiv:2505.12969,

work page arXiv

[1] [1]

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

Abdelrahman Abouelenin et al. Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-LoRAs.arXiv:2503.01743,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Standard-to-Dialect Transfer Trends Differ across Text and Speech: A Case Study on Intent and Topic Classification in German Dialects

5.6B multimodal model with mixture-of-LoRAs for vision and speech. Verena Blaschke, Miriam Winkler, and Barbara Plank. Standard-to-dialect transfer trends differ across text and speech: A case study on intent and topic classification in german dialects. arXiv:2510.07890,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Learning rate scaling across LoRA ranks and transfer to full finetuning.arXiv:2602.06204,

Nan Chen, Soledad Villar, and Soufiane Hayou. Learning rate scaling across LoRA ranks and transfer to full finetuning.arXiv:2602.06204,

work page arXiv

[4] [4]

Advancing STT for low-resource real-world speech

Flavio D’Intino and Hans-Peter Hutter. Advancing STT for low-resource real-world speech. InHCI International 2025, volume 15822 ofLNCS, pages 290–309. Springer,

2025

[5] [5]

Eyal Liron Dolev, Clemens Fidel Lutz, and Noëmi Aepli

doi: 10.1007/978-3-031-93429-2_20. Eyal Liron Dolev, Clemens Fidel Lutz, and Noëmi Aepli. Does whisper understand swiss german? an automatic, qualitative, and human evaluation. InProc. VarDial Workshop,

work page doi:10.1007/978-3-031-93429-2_20

[6] [6]

LoRA-only: 7.19% WER; encoder unfreeze: 3.54% WER on zeroth-test

35K Korean speech samples. LoRA-only: 7.19% WER; encoder unfreeze: 3.54% WER on zeroth-test. 23 Yunpeng Liu, Xukui Yang, and Dan Qu. Exploration of whisper fine-tuning strategies for low-resource ASR.EURASIP Journal on Audio, Speech, and Music Processing, 2024(1):29,

2024

[7] [7]

Michel Plüss, Lukas Neukom, and Manfred Vogel

QLoRA, 17.5% WER on ASGDTS. Michel Plüss, Lukas Neukom, and Manfred Vogel. SwissText 2021 Task 3: Swiss German Speech to Standard German Text. InProceedings of the Swiss Text Analytics Conference,

2021

[8] [8]

Zitha Sasindran, Harsha Yelchuri, T. V. Prabhakar, and Supreeth Rao.Heval: A new hybrid evaluation metric for automatic speech recognition tasks.arXiv:2211.01722,

work page arXiv

[9] [9]

Swiss german speech to text system evaluation.arXiv:2207.00412,

Yanick Schraner, Christian Scheller, Michel Plüss, and Manfred Vogel. Swiss german speech to text system evaluation.arXiv:2207.00412,

work page arXiv

[10] [10]

Fine-tuning whisper on low-resource languages for real-world applications.arXiv:2412.15726,

Vincenzo Timmel, Claudio Paonessa, Reza Kakooee, Manfred Vogel, and Daniel Perruchoud. Fine-tuning whisper on low-resource languages for real-world applications.arXiv:2412.15726,

work page arXiv

[11] [11]

Evaluation of LLMs in speech is often flawed: Test set contamination in large language models for speech recognition.arXiv:2505.22251,

Yuan Tseng, Titouan Parcollet, Rogier van Dalen, Shucong Zhang, and Sourav Bhattacharya. Evaluation of LLMs in speech is often flawed: Test set contamination in large language models for speech recognition.arXiv:2505.22251,

work page arXiv

[12] [12]

Calm-whisper: Reduce whisper hallucination on non-speech by calming crazy heads down

Yingzhi Wang, Anas Alhmoud, Saad Alsahly, Muhammad Alqurishi, and Mirco Ravanelli. Calm-whisper: Reduce whisper hallucination on non-speech by calming crazy heads down. arXiv:2505.12969,

work page arXiv