pith. machine review for the scientific record.

arxiv: 2603.29087 · v2 · submitted 2026-03-31 · 💻 cs.SD · eess.AS

Recognition: 1 theorem link · Lean Theorem

IQRA 2026: Interspeech Challenge on Automatic Pronunciation Assessment for Modern Standard Arabic (MSA)

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 00:10 UTC · model grok-4.3

classification 💻 cs.SD eess.AS
keywords mispronunciation detection · Modern Standard Arabic · Interspeech challenge · pronunciation assessment · MDD · authentic dataset · self-supervised learning

The pith

The second IQRA challenge reports a 0.28 F1-score gain over its first edition in detecting mispronunciations in Modern Standard Arabic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper reports results from the second edition of the IQRA Interspeech Challenge on automatic mispronunciation detection and diagnosis for Modern Standard Arabic. A new dataset of authentic human mispronounced speech was added to the training and evaluation resources. Participant systems using CTC-based self-supervised models, two-stage fine-tuning, and large audio-language models delivered a substantial performance increase over the first edition.

Core claim

The introduction of the Iqra_Extra_IS26 authentic mispronunciation dataset together with novel participant architectures and modeling strategies produced a 0.28 F1-score improvement in MDD for MSA compared with the prior challenge edition.

What carries the argument

The Iqra_Extra_IS26 dataset of authentic human mispronounced speech, which supplies real pronunciation error examples to improve model training and evaluation.

Load-bearing premise

The observed performance jump stems mainly from the new authentic dataset and participant-proposed methods rather than changes in evaluation protocol or data distribution.

What would settle it

Re-evaluating the top systems from this edition on the first edition's dataset and finding little or no F1 improvement.
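
A minimal sketch of that settling experiment, in Python; every dataset handle here is a hypothetical placeholder rather than a real IQRA artifact, and evaluate() is assumed to wrap a team's inference pipeline plus the challenge's official scoring script:

```python
# Hypothetical cross-edition check: all names are placeholders, not real
# IQRA resources or APIs. evaluate(system, testset) -> F1 is assumed to
# wrap a team's inference pipeline plus the challenge scoring script.
def settle_attribution(new_top_systems, edition1_testset, edition1_best_f1, evaluate):
    for name, system in new_top_systems.items():
        f1 = evaluate(system, edition1_testset)  # this edition's system, last edition's data
        delta = f1 - edition1_best_f1
        # A delta near zero would suggest the 0.28 jump rode on evaluation or
        # data-distribution shifts; a large delta would credit the systems.
        print(f"{name}: F1 on edition-1 test = {f1:.3f} (delta vs. old best = {delta:+.3f})")
```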

read the original abstract

We present the findings of the second edition of the IQRA Interspeech Challenge, a challenge on automatic Mispronunciation Detection and Diagnosis (MDD) for Modern Standard Arabic (MSA). Building on the previous edition, this iteration introduces Iqra_Extra_IS26, a new dataset of authentic human mispronounced speech, complementing the existing training and evaluation resources. Submitted systems employed a diverse range of approaches, spanning CTC-based self-supervised learning models, two-stage fine-tuning strategies, and large audio-language models. Compared to the first edition, we observe a substantial jump of 0.28 in F1-score, attributable both to novel architectures and modeling strategies proposed by participants and to the additional authentic mispronunciation data made available. These results demonstrate the growing maturity of Arabic MDD research and establish a stronger foundation for future work in Arabic pronunciation assessment.
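
For readers unfamiliar with the recipe the abstract names, "CTC-based self-supervised learning models" typically means a pretrained speech encoder fine-tuned with a CTC head over the phoneme inventory. A minimal sketch using Hugging Face Transformers follows; the checkpoint echoes the encoder named by the winning team, but the toy phoneme set and training settings are illustrative assumptions, not any participant's actual configuration.

```python
# Sketch of the generic CTC-over-SSL recipe, not a participant system.
# The toy phoneme inventory and hyperparameters are illustrative only.
import torch
from transformers import Wav2Vec2ForCTC

PHONEMES = ["<pad>", "b", "t", "k", "q", "s", "a", "i", "u"]  # toy MSA subset

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",  # multilingual SSL encoder (named by the top team)
    vocab_size=len(PHONEMES),
    ctc_loss_reduction="mean",
)
model.freeze_feature_encoder()  # common practice: keep the CNN front-end fixed

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(waveforms: torch.Tensor, phone_ids: torch.Tensor) -> float:
    """One CTC step on a batch of raw 16 kHz audio; label padding is -100."""
    out = model(input_values=waveforms, labels=phone_ids)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```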

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript reports findings from the second edition of the IQRA Interspeech Challenge on automatic mispronunciation detection and diagnosis (MDD) for Modern Standard Arabic (MSA). It introduces the new Iqra_Extra_IS26 dataset of authentic human mispronounced speech to complement existing resources and states that submitted systems achieved a 0.28 F1-score improvement over the first edition, attributing this gain to novel participant architectures (e.g., CTC-based self-supervised models, two-stage fine-tuning, large audio-language models) and the additional authentic data.

Significance. If the reported improvement and its attribution can be substantiated, the work demonstrates meaningful progress in Arabic MDD by expanding resources with authentic mispronunciations and showcasing diverse modeling strategies. This could provide a stronger empirical foundation for future pronunciation assessment research in under-resourced languages. However, the absence of controlled comparisons limits the ability to isolate the contributions of the new data versus other factors, reducing the immediate impact on the field.

major comments (1)
  1. Abstract: The central claim of a 0.28 F1-score jump 'attributable both to novel architectures and modeling strategies proposed by participants and to the additional authentic mispronunciation data' is not supported by any cross-edition baseline, ablation, or controlled experiment. No re-evaluation of prior top systems on the new test set, no isolation of the Iqra_Extra_IS26 data effect while holding models fixed, and no discussion of possible changes in test-set composition, labeling criteria, or F1 computation protocol is provided, so the delta cannot be confidently attributed to the claimed factors rather than to distribution or evaluation shifts.
minor comments (2)
  1. The manuscript would benefit from explicit reporting of the exact F1-score computation details, the data splits used for the new dataset, statistical significance tests on the improvement, and per-system performance tables to allow readers to assess the diversity and robustness of submitted approaches (a sketch of the standard MDD F1 bookkeeping follows this list).
  2. Error analysis or qualitative examples of remaining mispronunciation types that the new systems still struggle with would strengthen the discussion of growing maturity in Arabic MDD.
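
As context for the first minor comment, MDD F1 is conventionally derived from a three-way comparison of canonical, annotated (verbatim), and predicted phonemes. A minimal sketch of that standard bookkeeping, assuming the three sequences have already been aligned position-by-position; the challenge's exact alignment and scoring rules may differ:

```python
# Standard MDD confusion bookkeeping over pre-aligned phoneme triples.
# Assumes canonical, verbatim (annotated), and predicted sequences are
# already aligned position-by-position; the challenge's own alignment
# and scoring protocol may differ in detail.
def mdd_f1(canonical, verbatim, predicted):
    ta = fr = fa = tr = 0
    for c, v, p in zip(canonical, verbatim, predicted):
        if c == v:                    # phone was pronounced correctly
            if p == c: ta += 1        # true acceptance
            else:      fr += 1        # false rejection
        else:                         # phone was mispronounced
            if p == c: fa += 1        # false acceptance (missed error)
            else:      tr += 1        # true rejection (error detected)
    precision = tr / (tr + fr) if tr + fr else 0.0
    recall    = tr / (tr + fa) if tr + fa else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# e.g. mdd_f1(list("batu"), list("patu"), list("patu")) -> 1.0 (the b->p error is caught)
```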

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript for the IQRA 2026 challenge. We address the major comment on the attribution of the observed F1-score improvement below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: Abstract: The central claim of a 0.28 F1-score jump 'attributable both to novel architectures and modeling strategies proposed by participants and to the additional authentic mispronunciation data' is not supported by any cross-edition baseline, ablation, or controlled experiment. No re-evaluation of prior top systems on the new test set, no isolation of the Iqra_Extra_IS26 data effect while holding models fixed, and no discussion of possible changes in test-set composition, labeling criteria, or F1 computation protocol is provided, so the delta cannot be confidently attributed to the claimed factors rather than to distribution or evaluation shifts.

    Authors: We agree that the manuscript does not include controlled ablations, re-evaluations of prior systems on the updated test set, or explicit analysis of potential shifts in test-set composition, labeling, or F1 protocol. The reported 0.28 F1 improvement is an observed difference between challenge editions, and while the new authentic data and advanced participant systems were introduced this year, we cannot isolate their individual contributions or rule out confounding factors. We will revise the abstract to report the observed gain without claiming direct attribution, instead noting that it coincides with the availability of Iqra_Extra_IS26 and the diverse modeling approaches. We will also add a limitations paragraph discussing possible evaluation shifts and recommending future controlled experiments to isolate data versus architecture effects. revision: yes
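
The controlled experiment the rebuttal promises is, at its simplest, a 2x2 grid crossing model generation with training-data condition. A sketch with entirely hypothetical names for the training and scoring wrappers:

```python
# Hypothetical 2x2 ablation separating data effects from architecture effects.
# train() and score() stand in for a full training run and the challenge's
# scoring script; none of these names are real IQRA artifacts.
from itertools import product

MODELS = ["edition1_style_baseline", "edition2_best_architecture"]
DATA = ["iqra_train_only", "iqra_train_plus_extra_is26"]

def ablate(train, score, testset):
    results = {}
    for model, data in product(MODELS, DATA):
        results[(model, data)] = score(train(model, data), testset)
    # Compare along each axis: same model, +authentic data -> data effect;
    # same data, newer model -> architecture effect.
    return results
```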

Circularity Check

0 steps flagged

No circularity: empirical challenge report with direct observations only

full rationale

The paper is an Interspeech challenge summary that reports participant-submitted system results and an observed 0.28 F1-score increase relative to the prior edition. No derivations, equations, predictions, or parameter fits are present. The attribution statement is an observational claim about submitted systems and added data, with no self-referential reduction, fitted-input-as-prediction, or load-bearing self-citation chain that collapses the result to its own inputs by construction. The derivation chain is therefore empty, and the report stands as self-contained empirical observation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical challenge summary paper with no mathematical derivations, fitted parameters, background axioms, or newly postulated entities.

pith-pipeline@v0.9.0 · 5488 in / 1216 out tokens · 62988 ms · 2026-05-14T00:10:52.043869+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 1 internal anchor

  1. [1]

    Introduction The field of Computer-Aided Pronunciation Training (CAPT) and its core component, Mispronunciation Detection and Diagnosis (MDD), has become indispensable tools for self-directed language learners globally [1, 2]. CAPT systems serve two primary functions: (i) pronunciation assessment, identifying and localising phoneme-level errors in a lea...

  2. [2]

    Task Definition The task follows [21, 20]: given a speech utterance and its reference vowelized transcript, the systems predict the sequence of pronounced phonemes

    Challenge Setup 2.1. Task Definition The task follows [21, 20]: given a speech utterance and its reference vowelized transcript, the systems predict the sequence of pronounced phonemes. Predictions are aligned against both the canonical phoneme sequence and the annotated verbatim sequence (the deliberately mispronounced version) to derive MDD metrics. W...

  3. [3]

    Expert in-house vowelization was applied to all transcriptions, followed by phonetization via the Halabi MSA phonetizer

    augmented with Qur’anic recitation segments. Expert in-house vowelization was applied to all transcriptions, followed by phonetization via the Halabi MSA phonetizer. The resulting 74,000 utterances span diverse speakers with balanced gender distribution. Unlike the internal CMV-Ar corpus used in the first edition, Iqra_train is now publicly accessible, e...

  4. [4]

    Training uses Adam (lr=10^-4), batch size 16, and early stopping on dev-set PER, with Iqra_train and Iqra_TTS as training data

    Baseline System We provide a single SSL-based baseline following the setup in [21]: the multilingual mHuBERT [24] model (pretrained on 147 languages, 94M parameters) with a frozen encoder, a SUPERB-style [25] weighted layer sum, and a 2-layer 1024-unit Bi-LSTM with CTC loss [26]. Training uses Adam (lr=10^-4), batch size 16, and early stopping on dev-set PE...

  5. [5]

    Submitted Systems We summarize the top-6 systems ranked by F1-score on QuranMB.v2. The submissions collectively span three broad paradigms: enhanced CTC-based temporal modeling, SSL fine-tuning with language model integration, and generative large audio-language models (LALMs). whu-iasp (1st, F1=0.7201). This system combines a frozen wav2vec2-xls-r-300m ...

  6. [6]

    Overall Performance Table 2 presents the full leaderboard on QuranMB.v2

    Results and Discussion 5.1. Overall Performance Table 2 presents the full leaderboard on QuranMB.v2. The results demonstrate broad and substantial community-wide progress: 13 of 19 submitted systems surpass the organizer baseline (F1=0.4414), with the best-performing system (whu-iasp) achieving an absolute F1 improvement of 0.2787 over the baseline. Notabl...

  7. [7]

    A recurring theme across top submissions is the disproportionate value of authentic mispronunciation data

    Discussion The IQRA 2026 challenge marks a significant milestone in Arabic pronunciation assessment, yet the results also surface important open questions that the community must address to move this research toward real-world impact. A recurring theme across top submissions is the disproportionate value of authentic mispronunciation data. Despite Iqra E...

  8. [8]

    Conclusion We have presented IQRA 2026; the challenge introduced Iqra_Extra_IS26, the first corpus of authentic human mispronounced MSA speech, alongside the expanded QuranMB.v2 evaluation benchmark, and attracted 19 teams, nearly double previous editions. Submitted systems dramatically surpassed the organizer baseline, with the best system achieving an...

  9. [9]

    The effectiveness of computer assisted pronunciation training for foreign language learning by children,

    A. Neri, O. Mich, M. Gerosa, and D. Giuliani, “The effectiveness of computer assisted pronunciation training for foreign language learning by children,” Computer Assisted Language Learning, vol. 21, no. 5, pp. 393–408, 2008

  10. [10]

    Computer-assisted pronunciation training (CAPT): Current issues and future directions,

    P. M. Rogerson-Revell, “Computer-assisted pronunciation training (CAPT): Current issues and future directions,” RELC Journal, vol. 52, no. 1, pp. 189–205, 2021

  11. [11]

    Automatic pronunciation assessment–a review,

    Y. E. Kheir, A. Ali, and S. A. Chowdhury, “Automatic pronunciation assessment–a review,” arXiv preprint arXiv:2310.13974, 2023

  12. [12]

    Mispronunciation detection and diagnosis in L2 English speech using multidistribution deep neural networks,

    K. Li, X. Qian, and H. Meng, “Mispronunciation detection and diagnosis in L2 English speech using multidistribution deep neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 1, pp. 193–207, 2016

  13. [13]

    CNN-RNN-CTC based end-to-end mispronunciation detection and diagnosis,

    W.-K. Leung, X. Liu, and H. Meng, “CNN-RNN-CTC based end-to-end mispronunciation detection and diagnosis,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 8132–8136

  14. [14]

    Capturing L2 segmental mispronunciations with joint-sequence models in CAPT,

    X. Qian, H. Meng, and F. Soong, “Capturing L2 segmental mispronunciations with joint-sequence models in CAPT,” in Proc. Interspeech, 2010

  15. [15]

    Evaluating Wav2Vec2-BERT for CAPT in low-resource languages,

    S. Fort et al., “Evaluating Wav2Vec2-BERT for CAPT in low-resource languages,” in Proc. Interspeech, 2025

  16. [16]

    Qvoice: Arabic speech pronunciation learning application,

    Y. E. Kheir, F. Khnaisser, S. A. Chowdhury, H. Mubarak, S. Afzal, and A. Ali, “Qvoice: Arabic speech pronunciation learning application,” arXiv preprint arXiv:2305.07445, 2023

  17. [17]

    Beyond orthography: Automatic recovery of short vowels and dialectal sounds in Arabic,

    Y. E. Kheir, H. Mubarak, A. Ali, and S. A. Chowdhury, “Beyond orthography: Automatic recovery of short vowels and dialectal sounds in Arabic,” arXiv preprint arXiv:2408.02430, 2024

  18. [18]

    Improving mispronunciation detection and diagnosis for non-native learners of the Arabic language,

    N. Alrashoudi, H. Al-Khalifa, and Y. Alotaibi, “Improving mispronunciation detection and diagnosis for non-native learners of the Arabic language,” Discover Computing, vol. 28, no. 1, p. 1, 2025

  19. [19]

    Speechocean762: An open-source non-native English speech corpus for pronunciation assessment,

    J. Zhang, C. Ni, S. Zhang et al., “Speechocean762: An open-source non-native English speech corpus for pronunciation assessment,” in Proc. Interspeech, 2021

  20. [20]

    Phonetic RNN-Transducer for mispronunciation detection,

    D. Y. Zhang, S. Saha, and S. Campbell, “Phonetic RNN-Transducer for mispronunciation detection,” in Proc. ICASSP, 2023

  21. [21]

    A novel framework for mispronunciation detection of Arabic phonemes using audio-oriented transformer models,

    Ş. S. Çalık, A. Küçükmanisa, and Z. H. Kilimci, “A novel framework for mispronunciation detection of Arabic phonemes using audio-oriented transformer models,” Applied Acoustics, vol. 215, p. 109711, 2024

  22. [22]

    wav2vec 2.0: A framework for self-supervised learning of speech representations,

    A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020

  23. [23]

    HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,

    W. Hsu, B. Bolte, Y. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 29, pp. 3451–3460, Oct. 2021

  24. [24]

    Available: https://doi.org/10.1109/TASLP.2021.3122291

    [Online]. Available: https://doi.org/10.1109/TASLP.2021.3122291

  25. [25]

    Automatic pronunciation assessment using self-supervised speech representation learning,

    S. Kim et al., “Automatic pronunciation assessment using self-supervised speech representation learning,” in Proc. Interspeech, 2022

  26. [26]

    Assessment of non-native speech intelligibility using Wav2vec2-based MDD and multi-level GOP transformer,

    R. C. Shekar, M. Yang, K. Hirschi et al., “Assessment of non-native speech intelligibility using Wav2vec2-based MDD and multi-level GOP transformer,” in Proc. Interspeech, 2023

  27. [27]

    The MGB-2 challenge: Arabic multi-dialect broadcast media recognition,

    A. Ali, P. Bell, J. Glass, Y. Messaoui, H. Mubarak, S. Renals, and Y. Zhang, “The MGB-2 challenge: Arabic multi-dialect broadcast media recognition,” in 2016 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2016, pp. 279–284

  28. [28]

    Non-native children’s automatic speech assessment challenge (NOCASA),

    P. Gerber et al., “Non-native children’s automatic speech assessment challenge (NOCASA),” in Proc. IEEE MLSP, 2025

  29. [29]

    Iqra’eval: A shared task on Qur’anic pronunciation assessment,

    Y. El Kheir, A. Meghanani, H. O. Toyin, N. Almarwani, O. Ibrahim, Y. A. Elshahawy, M. Shahin, and A. Ali, “Iqra’eval: A shared task on Qur’anic pronunciation assessment,” in Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks, 2025, pp. 443–452

  30. [30]

    Towards a Unified Benchmark for Arabic Pronunciation Assessment: Qur’anic Recitation as Case Study,

    Y. El Kheir, O. Ibrahim, A. Meghanani, N. Almarwani, H. Toyin, S. Alharbi, M. Alfadly, L. Alkanhal, I. Selim, S. Elbatal, S. Mdhaffar, T. Hain, Y. Hifny, M. Shahin, and A. Ali, “Towards a Unified Benchmark for Arabic Pronunciation Assessment: Qur’anic Recitation as Case Study,” in Interspeech 2025, 2025, pp. 2410–2414

  31. [31]

    Phonetic inventory for an Arabic speech corpus,

    N. Halabi and M. Wald, “Phonetic inventory for an Arabic speech corpus,” in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), 2016, pp. 734–738

  32. [32]

    Common Voice: A massively-multilingual speech corpus

    R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,”arXiv preprint arXiv:1912.06670, 2019

  33. [33]

    mHuBERT-147: A compact multilingual HuBERT model,

    M. Zanon Boito, V. Iyer, N. Lagos, L. Besacier, and I. Calapodescu, “mHuBERT-147: A compact multilingual HuBERT model,” in Interspeech 2024, 2024, pp. 3939–3943

  34. [34]

    SUPERB: Speech processing universal performance benchmark,

    S.-W. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, K. Lakhotia et al., “SUPERB: Speech processing universal performance benchmark,” in Interspeech 2021, 2021, pp. 1194–1198

  35. [35]

    Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,

    A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in ICML, ser. ICML ’06. NY, USA: Association for Computing Machinery, 2006, pp. 369–376. [Online]. Available: https://doi.org/10.1145/1143844.1143891

  36. [36]

    Temporal convolutional networks for action segmentation and detection,

    C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks for action segmentation and detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 156–165

  37. [37]

    Modified kneser-ney smoothing of n-gram models,

    F. James, “Modified kneser-ney smoothing of n-gram models,” Research Institute for Advanced Computer Science, Tech. Rep. 00.07, 2000

  38. [38]

    wav2vec 2.0: A framework for self-supervised learning of speech representations,

    A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 12449–12460

  39. [39]

    U2++: Unified two-pass bidirectional end-to-end model for speech recognition,

    D. Wu, B. Zhang, C. Yang, Z. Peng, W. Xia, X. Chen, and X. Lei, “U2++: Unified two-pass bidirectional end-to-end model for speech recognition,”arXiv preprint arXiv:2106.05642, 2021

  40. [40]

    Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,

    J. Kim, J. Kong, and J. Son, “Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,” 2021. [Online]. Available: https://arxiv.org/abs/2106.06103

  41. [41]

    SpecAugment: A simple data augmentation method for automatic speech recognition,

    D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” arXiv preprint arXiv:1904.08779, 2019

  42. [42]

    FLEURS: Few-shot learning evaluation of universal representations of speech,

    A. Conneau, M. Ma, S. Khanuja, Y. Zhang, V. Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, “FLEURS: Few-shot learning evaluation of universal representations of speech,” in 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023, pp. 798–805

  43. [43]

    Common voice: A massively-multilingual speech corpus,

    R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in Proceedings of the Twelfth Language Resources and Evaluation Conference, 2020, pp. 4218–4222