Bangla-WhisperDiar: Fine-Tuning Whisper and PyAnnote for Bangla Long-Form Speech Recognition and Speaker Diarization
Pith reviewed 2026-05-12 01:34 UTC · model grok-4.3
The pith
Fine-tuning Whisper on 15,000 Bangla segments yields 0.2441 WER for long-form ASR and 0.2392 DER for diarization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper shows that full-weight fine-tuning of the tugstugi bengaliai regional asr whisper medium model on a custom-curated dataset of approximately 15,000 chunked and aligned Bangla segments, combined with extensive audio augmentation, produces a word error rate of 0.2441 on the test set. It further shows that fine-tuning the pyannote/segmentation-3.0 model with PyTorch Lightning and inserting the updated segmentation backbone into the pyannote/speaker-diarization-community-1 pipeline yields a diarization error rate of 0.2392 on the same test set. Both results improve on the corresponding pretrained baselines, and the work supplies the complete pipeline details for data preparation, text normalization, audio augmentation, training strategies, inference optimization, and post-processing.
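For readers unfamiliar with the metric, the WER reported above is the word-level edit distance between hypothesis and reference transcripts, normalized by reference length. A minimal pure-Python sketch (illustrative only; the paper does not specify which scoring tool it used):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

On this definition a WER of 0.2441 means roughly one word-level error (substitution, insertion, or deletion) per four reference words.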
What carries the argument
Full-weight fine-tuning of the Whisper medium model with noise injection, reverb, echo, clipping, and pitch/time perturbations for ASR, plus replacement of the segmentation backbone in the PyAnnote diarization pipeline after targeted retraining.
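Two of the listed augmentations, noise injection and clipping distortion, can be sketched in a few lines of NumPy (illustrative only; the SNR and clipping threshold below are assumptions, not the paper's settings):

```python
import numpy as np


def add_noise(wave: np.ndarray, snr_db: float,
              rng: np.random.Generator) -> np.ndarray:
    """Inject white noise at a target signal-to-noise ratio in dB."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise


def clip_distort(wave: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Hard-clip the waveform to simulate clipping distortion."""
    return np.clip(wave, -threshold, threshold)
```

Reverb, echo, and pitch/time perturbation are typically applied with dedicated libraries (e.g. room-impulse-response convolution and resampling-based time stretch) rather than hand-rolled, so they are omitted from this sketch.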
Load-bearing premise
The custom-curated set of 15,000 Bangla segments is representative of real-world long-form conditions and the measured error reductions arise from the fine-tuning steps rather than dataset choice or evaluation details.
What would settle it
Testing the fine-tuned Whisper and PyAnnote models on an independent collection of long-form Bangla recordings that differ in speakers, acoustics, or length; error rates equal to or higher than those of the original pretrained models would undermine the claim.
Original abstract
Automatic Speech Recognition (ASR) and speaker diarization in Bangla remain challenging due to long form recordings, diverse acoustic conditions, and significant speaker variability. This work addresses these two core tasks in Bangla spoken language understanding by developing robust systems for long form ASR and speaker diarization. For ASR (Problem 1), we fine tune the tugstugi bengaliai regional asr whisper medium model on a custom-curated dataset of approximately 15,000 chunked and aligned Bangla audio segments, employing full weight training with extensive data augmentation including noise injection, reverb simulation, echo, clipping distortion, and pitch/time perturbation. For speaker diarization (Problem 2), we fine-tune the pyannote/segmentation-3.0 model using PyTorch Lightning on the competition annotated diarization dataset, swapping the fine-tuned segmentation backbone into the pyannote/speaker-diarization-community-1 pipeline while retaining the pretrained speaker embedding and clustering components. Our ASR system achieves a Word Error Rate (WER) of 0.2441, while our diarization system achieves a Diarization Error Rate (DER) of 0.2392, both evaluated on the test set, demonstrating notable improvements over the respective pretrained baselines. We describe our complete pipeline, including data preprocessing, text normalization, audio augmentation, training strategies, inference optimization, and post-processing for both tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents fine-tuned systems for Bangla long-form ASR and speaker diarization. The ASR component fine-tunes the tugstugi bengaliai regional asr whisper medium model on a custom dataset of ~15,000 chunked and aligned segments using full-parameter training and extensive augmentation (noise, reverb, echo, clipping, pitch/time shifts). The diarization component fine-tunes pyannote/segmentation-3.0 within the pyannote/speaker-diarization-community-1 pipeline. On the test set the ASR system reports WER 0.2441 and the diarization system reports DER 0.2392, both described as notable improvements over pretrained baselines. The paper also details the full pipelines including preprocessing, normalization, augmentation, training, inference, and post-processing.
Significance. If the reported WER and DER values prove robust, reproducible, and generalizable beyond the chunked evaluation regime, the work would supply usable open resources for an under-resourced language and demonstrate a practical recipe for adapting Whisper and PyAnnote to Bangla. The engineering choices (full fine-tuning plus targeted augmentation, modular pipeline reuse) are concrete and could be adopted by others working on similar low-resource settings.
major comments (3)
- [Abstract] Abstract: the statement that the systems demonstrate 'notable improvements over the respective pretrained baselines' is unsupported because no baseline WER or DER numbers, confidence intervals, or statistical tests are supplied. Without these quantities the size and reliability of any gain cannot be evaluated.
- [Abstract] Abstract and evaluation description: all reported metrics come from chunked, aligned segments of the custom 15 k dataset. The paper frames the contribution as addressing long-form recordings, yet supplies no separate long-form test protocol, no ablation on chunk length, and no analysis of cross-chunk speaker or acoustic continuity. Chunk-level scores can be optimistic relative to continuous long-form conditions.
- [Abstract] Abstract: the test-set construction, size, and train/test split details are not described. It is therefore impossible to determine whether the reported numbers reflect genuine generalization or dataset-specific choices.
minor comments (1)
- [Abstract] The model identifier 'tugstugi bengaliai regional asr whisper medium' should be given as an exact Hugging Face repository name or accompanied by a citation for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate where revisions will be made to improve clarity and completeness.
Point-by-point responses
-
Referee: [Abstract] Abstract: the statement that the systems demonstrate 'notable improvements over the respective pretrained baselines' is unsupported because no baseline WER or DER numbers, confidence intervals, or statistical tests are supplied. Without these quantities the size and reliability of any gain cannot be evaluated.
Authors: We agree that the abstract's claim requires explicit support. In the revised version we will report the pretrained baseline WER and DER values evaluated on the identical test set, enabling direct comparison of the observed gains. We will also note any available measures of variability from our runs. revision: yes
-
Referee: [Abstract] Abstract and evaluation description: all reported metrics come from chunked, aligned segments of the custom 15 k dataset. The paper frames the contribution as addressing long-form recordings, yet supplies no separate long-form test protocol, no ablation on chunk length, and no analysis of cross-chunk speaker or acoustic continuity. Chunk-level scores can be optimistic relative to continuous long-form conditions.
Authors: The evaluation is performed on segments obtained by chunking longer recordings, which is required by Whisper's 30-second input constraint. We will revise the manuscript to describe the chunking procedure, its motivation, and the resulting limitations regarding speaker and acoustic continuity across boundaries. A dedicated continuous long-form test protocol and chunk-length ablations are not present in the current experiments; we will add an explicit limitations paragraph acknowledging that chunk-level scores may overestimate performance under fully continuous conditions. revision: partial
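The chunking step the authors describe can be sketched as fixed-length windows with a small overlap (illustrative only; the paper does not state its chunk or overlap lengths, and the 1 s overlap below is an assumption chosen to soften boundary effects):

```python
import numpy as np


def chunk_audio(wave: np.ndarray, sr: int = 16000,
                chunk_s: float = 30.0, overlap_s: float = 1.0) -> list:
    """Split a long waveform into <=30 s chunks with a small overlap,
    so each piece fits Whisper's fixed input window."""
    chunk = int(chunk_s * sr)                 # samples per chunk
    hop = int((chunk_s - overlap_s) * sr)     # stride between chunk starts
    return [wave[s:s + chunk] for s in range(0, len(wave), hop)]
```

Transcripts of overlapping chunks then need to be merged at the boundaries, which is exactly where the cross-chunk continuity concerns raised by the referee arise.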
-
Referee: [Abstract] Abstract: the test-set construction, size, and train/test split details are not described. It is therefore impossible to determine whether the reported numbers reflect genuine generalization or dataset-specific choices.
Authors: We acknowledge the omission. The revised manuscript will expand the data section to specify the total number of segments, the exact train/test split (including counts or ratios), the alignment and chunking criteria, and the selection process used for the test set to support claims of generalization. revision: yes
Circularity Check
No circularity: empirical fine-tuning metrics are self-contained
full rationale
The paper reports standard ML fine-tuning of pretrained Whisper and PyAnnote models on the ~15k-segment dataset, with WER/DER measured on a held-out test split against the original baselines. No mathematical derivations, fitted parameters renamed as predictions, load-bearing self-citations, or ansatzes are present. The claims reduce to observed error rates on an evaluation set that is held out from training by construction.
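For concreteness, the DER cited in the claim decomposes into missed speech, false alarms, and speaker confusion over reference speech time. A minimal frame-level sketch (illustrative only; a full DER implementation such as pyannote's also finds the optimal reference-to-hypothesis speaker mapping and typically applies a forgiveness collar around boundaries):

```python
def frame_der(ref: list, hyp: list) -> float:
    """Frame-level diarization error rate for single-speaker frames.

    ref/hyp: per-frame speaker labels, with None marking non-speech.
    Assumes hypothesis labels are already mapped onto reference labels.
    DER = (missed speech + false alarm + confusion) / reference speech.
    """
    missed = false_alarm = confusion = speech = 0
    for r, h in zip(ref, hyp):
        if r is not None:
            speech += 1
            if h is None:
                missed += 1          # reference speech, no hypothesis
            elif h != r:
                confusion += 1       # wrong speaker attributed
        elif h is not None:
            false_alarm += 1         # hypothesis speech where none exists
    return (missed + false_alarm + confusion) / max(speech, 1)
```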
Axiom & Free-Parameter Ledger
free parameters (2)
- training dataset size
- augmentation parameters
axioms (1)
- domain assumption: fine-tuned models will generalize to unseen long-form Bangla recordings under varied acoustic conditions.
Reference graph
Works this paper leans on
- [1] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," in Proc. ICML, 2023.
- [2] BengaliAI, "Regional Bengali ASR Whisper models," Hugging Face Hub, 2024. https://huggingface.co/bengaliAI
- [3] H. Bredin, "pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe," in Proc. Interspeech, 2023.
- [4] A. Plaquet and H. Bredin, "Powerset multi-class cross entropy loss for neural speaker diarization," in Proc. Interspeech, 2023.
- [5] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," in Proc. Interspeech, 2019.
- [6] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," in Proc. ICASSP, 2017.
- [7] "RapidFuzz: A fast string matching library." https://rapidfuzz.github.io/RapidFuzz/
- [8] W. Falcon and The PyTorch Lightning team, "PyTorch Lightning," 2019. https://github.com/Lightning-AI/lightning
- [9] "bengaliAI: tugstugi-bengaliai-regional-asr-whisper-medium." Hugging Face Hub. https://huggingface.co/bengaliAI/tugstugi_bengaliai-regional-asr_whisper-medium
- [10] Tabib et al., "Bengali-Loop: Community Benchmarks for Long-Form Bangla ASR and Speaker Diarization," arXiv, 2026. https://arxiv.org/abs/2602.14291 (accessed May 06, 2026).