Recognition: unknown
AppTek Call-Center Dialogues: A Multi-Accent Long-Form Benchmark for English ASR
Pith reviewed 2026-05-07 09:07 UTC · model grok-4.3
The pith
A new benchmark of spontaneous call-center dialogues in fourteen English accents shows ASR accuracy varies sharply by accent and segmentation, so strong American-English results do not guarantee broad performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The AppTek Call-Center Dialogues corpus consists of commissioned, spontaneous, role-played English conversations across fourteen accents and sixteen scenarios. When the corpus is used to evaluate open-source ASR systems under multiple segmentation methods, results vary substantially by accent and by segmentation approach, showing that high performance on standard American-English benchmarks does not necessarily generalize to other accents.
What carries the argument
The AppTek Call-Center Dialogues corpus, a collection of spontaneous role-played agent-customer conversations with explicit accent annotations covering fourteen English varieties, serves as the evaluation resource for testing ASR robustness under varied segmentation.
If this is right
- ASR systems that excel on short read American English may show large degradations on spontaneous multi-accent long-form audio.
- Segmentation strategy becomes a first-order variable for recognition accuracy in conversational settings (a minimal sketch of two segmentation regimes follows this list).
- Existing public benchmarks risk overestimating readiness for diverse real-world users in service-oriented domains.
- New evaluation resources that include explicit accent coverage and unsegmented dialogue are required to measure practical robustness.
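To make the segmentation point above concrete: below is a minimal sketch of the two families of segmentation regimes such benchmarks typically contrast, fixed-length chunking with overlap versus voice-activity-driven segmentation. The energy-threshold VAD is a toy stand-in for a learned detector such as Silero VAD, and every parameter value (30 s chunks, 2 s overlap, the RMS threshold) is an illustrative assumption, not a value taken from the paper.

```python
# Illustrative sketch only: two segmentation regimes for long-form ASR audio.
# The energy-threshold "VAD" is a toy stand-in for a learned detector such as
# Silero VAD; all parameter values here are assumptions, not from the paper.
import numpy as np

def fixed_chunks(samples: np.ndarray, sr: int,
                 chunk_s: float = 30.0, overlap_s: float = 2.0):
    """Split long-form audio into overlapping fixed-length chunks."""
    size = int(chunk_s * sr)
    step = int((chunk_s - overlap_s) * sr)
    return [samples[start:start + size]
            for start in range(0, max(len(samples), 1), step)]

def energy_vad_chunks(samples: np.ndarray, sr: int,
                      frame_s: float = 0.03, rms_threshold: float = 0.01,
                      max_chunk_s: float = 30.0):
    """Toy VAD: keep frames whose RMS energy exceeds a threshold and merge
    consecutive speech frames into chunks capped at max_chunk_s seconds."""
    frame = int(frame_s * sr)
    chunks, current = [], []
    for start in range(0, len(samples), frame):
        window = samples[start:start + frame]
        if np.sqrt(np.mean(window ** 2)) > rms_threshold:
            current.extend(window)
            if len(current) >= max_chunk_s * sr:
                chunks.append(np.array(current))
                current = []
        elif current:
            chunks.append(np.array(current))
            current = []
    if current:
        chunks.append(np.array(current))
    return chunks
```

Which regime feeds the recognizer changes both the acoustic context behind each hypothesis and how hypotheses align to reference segments for scoring, which is one way WER can shift even with the model held fixed.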
Where Pith is reading between the lines
- Teams building global voice products should add accent-diverse long-form tests before deployment to reduce failure rates in regions with non-dominant English varieties.
- The observed segmentation sensitivity suggests end-to-end models that natively process longer contexts may close part of the gap without explicit re-segmentation.
- This corpus supplies a ready test set for measuring whether fine-tuning or adaptation techniques reduce accent-specific error rates in dialogue.
- Extending similar commissioned collections to other languages would allow direct comparison of accent robustness across linguistic families.
Load-bearing premise
The role-played dialogues sufficiently resemble real call-center speech and the accent annotations are accurate and consistent.
What would settle it
If ASR systems tested on this corpus achieve uniformly low error rates across all fourteen accents comparable to their American-English scores, or if a new collection of genuine non-role-played call-center recordings yields markedly different accent-wise variation patterns.
Original abstract
Evaluating English ASR systems for conversational AI applications remains difficult, as many publicly available corpora are either pre-segmented into short segments, consist of read or prepared speech, or lack explicit dialect annotations to evaluate robustness for a diverse user base. This work presents the AppTek Call-Center Dialogues corpus, a collection of spontaneous, role-played agent-customer conversations spanning fourteen English accents covering sixteen service-oriented scenarios. The dataset was commissioned specifically for evaluation and none of the audio or text was publicly available prior to release, reducing the risk of overlap with existing large-scale pretraining corpora. We benchmark a set of open-source ASR systems under different segmentation approaches. Results show substantial variation across accents and segmentation methods, indicating that good performance on general American English benchmarks does not necessarily generalize to other accents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the AppTek Call-Center Dialogues corpus: a collection of commissioned, spontaneous role-played agent-customer conversations spanning 14 English accents and 16 service scenarios. It benchmarks open-source ASR systems under multiple segmentation regimes and reports substantial performance variation across accents, concluding that strong results on general American English benchmarks do not necessarily generalize.
Significance. If the accent-driven variation is shown to be robust and not an artifact of the role-play format, the work supplies a useful new long-form, multi-accent conversational benchmark that is explicitly uncontaminated with pretraining data. This directly addresses a known weakness in current ASR evaluation for real-world conversational AI.
major comments (2)
- [Abstract and Data Collection section] The central generalization claim rests on the observed accent and segmentation variation reflecting genuine accent differences rather than data-collection artifacts. However, the manuscript provides no quantitative validation (acoustic feature distributions, expert naturalness ratings, or comparison to authentic call-center recordings) that the commissioned role-played dialogues match real-world call-center conditions such as stress, overlap, or domain jargon. This is load-bearing for the claim in the abstract and §4 results.
- [§4 (Benchmarking Experiments)] Benchmark results are presented without model version numbers, exact segmentation hyperparameters, or statistical significance tests for the reported accent-wise differences. Without these, it is impossible to assess whether the 'substantial variation' is reproducible or merely descriptive.
minor comments (2)
- [Tables in §4] Table captions and axis labels should explicitly state the metric (e.g., WER) and the number of utterances per accent to allow readers to judge the reliability of per-accent scores (a minimal per-accent WER sketch follows these comments).
- [§2] The description of the 16 scenarios would benefit from a short table listing scenario names and approximate durations to clarify coverage.
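To make the requested reporting concrete, here is the per-accent WER sketch referenced in the table comment above. The word-level Levenshtein computation is standard; the record fields ('accent', 'reference', 'hypothesis') and the toy usage are hypothetical, and a real evaluation would additionally apply the paper's scoring-normalization mapping files, typically via an established scorer such as jiwer.

```python
# Minimal sketch: word error rate per accent, computed from scratch so the
# metric behind per-accent tables is explicit. Record fields and toy data are
# hypothetical; real scoring would also normalize the text first.
from collections import defaultdict

def word_errors(ref: str, hyp: str):
    """Word-level Levenshtein distance: returns (edit count, reference length)."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)], len(r)

def wer_by_accent(records):
    """records: iterable of dicts with 'accent', 'reference', 'hypothesis' keys."""
    errors, ref_words = defaultdict(int), defaultdict(int)
    for rec in records:
        e, n = word_errors(rec["reference"], rec["hypothesis"])
        errors[rec["accent"]] += e
        ref_words[rec["accent"]] += n
    # Report reference word counts alongside WER so per-accent reliability is visible.
    return {acc: (errors[acc] / ref_words[acc], ref_words[acc]) for acc in ref_words}

# Hypothetical toy usage:
print(wer_by_accent([
    {"accent": "Indian English", "reference": "please reset my router",
     "hypothesis": "please reset my route"},
    {"accent": "US English", "reference": "please reset my router",
     "hypothesis": "please reset my router"},
]))
```

Returning the reference word count next to each score directly addresses the reliability concern: a 20% WER over a few hundred words is far less informative than the same figure over tens of thousands.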
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. These have prompted us to strengthen the manuscript's discussion of data limitations and to improve the reproducibility of the experimental section. We respond to each major comment below.
Point-by-point responses
-
Referee: [Abstract and Data Collection section] The central generalization claim rests on the observed accent and segmentation variation reflecting genuine accent differences rather than data-collection artifacts. However, the manuscript provides no quantitative validation (acoustic feature distributions, expert naturalness ratings, or comparison to authentic call-center recordings) that the commissioned role-played dialogues match real-world call-center conditions such as stress, overlap, or domain jargon. This is load-bearing for the claim in the abstract and §4 results.
Authors: We acknowledge that the role-play format, while spontaneous and scenario-driven, cannot fully replicate all real-world call-center conditions such as high stress or heavy overlap. Because the corpus was commissioned under controlled conditions, direct paired comparisons to proprietary authentic recordings are not feasible. In the revised manuscript we have added a dedicated Limitations subsection that reports basic acoustic statistics (segment duration, speaking rate, and pause distributions) against the Switchboard and Fisher corpora, and we have revised the abstract and §4 to frame the results as demonstrating accent-related performance gaps within a multi-accent conversational benchmark rather than claiming direct equivalence to production call-center data.
Revision: partial
-
Referee: [§4 (Benchmarking Experiments)] Benchmark results are presented without model version numbers, exact segmentation hyperparameters, or statistical significance tests for the reported accent-wise differences. Without these, it is impossible to assess whether the 'substantial variation' is reproducible or merely descriptive.
Authors: We agree that these details are required for reproducibility. The revised §4 now lists the precise model checkpoints (e.g., openai/whisper-large-v3, facebook/wav2vec2-large-960h), the exact segmentation parameters (maximum chunk length, overlap, and VAD threshold for each regime), and the results of statistical tests (one-way ANOVA followed by Tukey HSD post-hoc tests) with reported p-values confirming that the accent-wise WER differences are statistically significant at p < 0.01 for the majority of pairwise comparisons.
Revision: yes
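The statistical analysis described in this response can be outlined in a few lines. This is a hedged sketch rather than the authors' code: it assumes per-utterance WERs stored with an accent label, and the column names and file path are hypothetical placeholders.

```python
# Sketch of the significance testing the response describes: one-way ANOVA over
# per-utterance WERs grouped by accent, then Tukey HSD post-hoc pairwise tests.
# The CSV path and column names ('accent', 'wer') are hypothetical placeholders.
import pandas as pd
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("per_utterance_wer.csv")  # hypothetical: one row per utterance

# One-way ANOVA: does mean WER differ across the accent groups?
groups = [group["wer"].values for _, group in df.groupby("accent")]
f_stat, p_value = f_oneway(*groups)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4g}")

# Tukey HSD: which accent pairs differ, with family-wise error control.
# alpha = 0.01 mirrors the significance level quoted in the response.
tukey = pairwise_tukeyhsd(endog=df["wer"], groups=df["accent"], alpha=0.01)
print(tukey.summary())
```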
Circularity Check
No circularity: empirical dataset release and benchmark with no derivations
Full rationale
This paper releases a new multi-accent call-center dialogue corpus and reports direct ASR benchmark results across accents and segmentation methods. No equations, fitted parameters, predictions, or derivation chains exist in the work. The central claim (variation across accents indicates limited generalization) rests on empirical measurements from the newly collected data rather than any self-referential construction, self-citation load-bearing step, or renaming of prior results. The dataset novelty argument (no prior public availability) is a factual statement about release timing and does not reduce to any internal loop.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Introduction: Recent advances in automatic speech recognition (ASR) have led to strong performance on standard English benchmarks. However, systematic evaluation under realistic long-form conversational conditions remains limited, particularly across diverse accents, as most public benchmarks emphasize pre-segmented recordings of read or prepared speech...
2026
-
[2]
Related Work: Progress in ASR has been strongly shaped by evaluation on widely used benchmarks, many of which contain short, read speech, such as LibriSpeech [1], Mozilla Common Voice [2], and FLEURS [3]. These datasets reduce transcription effort because the text intended to be spoken is known, but they are less representative of deployed conversational ...
2026
-
[3]
Overview: The dataset is an English ASR test set of spontaneous role-played call-center conversations spanning fourteen English accents and multiple service-oriented domains
Dataset description, 3.1 Overview: The dataset is an English ASR test set of spontaneous role-played call-center conversations spanning fourteen English accents and multiple service-oriented domains. It is designed exclusively for evaluation and analysis rather than model training. In total, the corpus contains 128.6 hours of speech across 156 speakers...
2026
-
[4]
Evaluation Setup: A diverse set of publicly available open-weight ASR systems were evaluated on the test set
Benchmarking, 4.1 Evaluation Setup: A diverse set of publicly available open-weight ASR systems were evaluated on the test set. All models were executed locally using their default inference settings. The evaluated models are NVIDIA Canary-1B v2, Parakeet 0.6B TDT (v2, v3) [8] and NeMo Canary-Qwen-2.5B [9], IBM Granite Speech 3.3 (2B and 8B) [10], Kyutai S...
-
[5]
Thus participants might not be familiar with all technical terms or expressions used in a given domain
Limitations: The dataset is restricted to role-played call-center interactions. Thus participants might not be familiar with all technical terms or expressions used in a given domain. While demographic diversity was encouraged during recruitment, gender distribution is not balanced across all accent groups (see Table 1). In total, 102 female and 54 male ...
-
[6]
The dataset was collected from scratch and does not rely on publicly available sources, minimizing potential overlap with web-scraped training data
Conclusion: This work introduced the AppTek Call-Center Dialogues, a long-form English ASR test set of spontaneous, role-played agent-customer conversations spanning fourteen English accents and sixteen service-oriented scenarios. The dataset was collected from scratch and does not rely on publicly available sources, minimizing potential overlap with web-scraped...
-
[7]
The gpt-oss-120B [18] model was used locally to help generate the mapping files for scoring normalization and to verify proper US English spelling
Generative AI Use Disclosure: OpenAI’s ChatGPT (GPT5.2 [17]) was used to proofread the paper. The gpt-oss-120B [18] model was used locally to help generate the mapping files for scoring normalization and to verify proper US English spelling. Any generative AI output was vetted by at least one of the authors before including it in this work
-
[8]
Librispeech: An ASR corpus based on public domain audio books,
V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210
2015
-
[9]
Common Voice: A Massively-Multilingual Speech Corpus,
R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common Voice: A Massively-Multilingual Speech Corpus,” in Proceedings of The 12th Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association, May 2020, pp. 4218–4222. [Online]. Available: https:...
2020
-
[10]
FLEURS: Few-Shot Learning Evaluation of Universal Representations of Speech,
A. Conneau, M. Ma, S. Khanuja, Y. Zhang, V. Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, “FLEURS: Few-Shot Learning Evaluation of Universal Representations of Speech,” in 2022 IEEE Spoken Language Technology Workshop (SLT), 2023, pp. 798–805
2023
-
[11]
2000 HUB5 English Evaluation Transcripts,
Linguistic Data Consortium, “2000 HUB5 English Evaluation Transcripts,” Philadelphia, 2002, LDC2002T43, Web Download
2000
-
[12]
The AMI Meeting Corpus: A Pre-announcement,
J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal, G. Lathoud, M. Lincoln, A. Lisowska, I. McCowan, W. Post, D. Reidsma, and P. Wellner, “The AMI Meeting Corpus: A Pre-announcement,” in Machine Learning for Multimodal Interaction, S. Renals and S. Bengio, Eds. Berlin, Heidelberg: Springer...
2006
-
[13]
The Fifth ’CHiME’ Speech Separation and Recognition Challenge: Dataset, Task and Baselines,
J. Barker, S. Watanabe, E. Vincent, and J. Trmal, “The Fifth ’CHiME’ Speech Separation and Recognition Challenge: Dataset, Task and Baselines,” in Proc. Interspeech 2018, 2018, pp. 1561–1565
2018
-
[14]
Earnings-22: A Practical Benchmark for Accents in the Wild,
M. D. Rio, P. Ha, Q. McNamara, C. Miller, and S. Chandra, “Earnings-22: A Practical Benchmark for Accents in the Wild,” 2022. [Online]. Available: https://arxiv.org/abs/2203.15591
-
[16]
Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST,
M. Sekoyan, N. R. Koluguri, N. Tadevosyan, P. Zelasko, T. Bartley, N. Karpov, J. Balam, and B. Ginsburg, “Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST,” 2025. [Online]. Available: https://arxiv.org/abs/2509.14128
-
[17]
Canary-Qwen-2.5B: A Speech-Augmented Language Model,
NVIDIA NeMo Team, “Canary-Qwen-2.5B: A Speech-Augmented Language Model,” https://huggingface.co/nvidia/canary-qwen-2.5b, 2025, Accessed: 2025-10-19
2025
-
[18]
Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities,
G. Saon, A. Dekel, A. Brooks, T. Nagano, A. Daniels, A. Satt, A. Mittal, B. Kingsbury, D. Haws, E. Morais, G. Kurata, H. Aronowitz, I. Ibrahim, J. Kuo, K. Soule, L. Lastras, M. Suzuki, R. Hoory, S. Thomas et al., “Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities,” 2025. [Online]. Available: https://arxiv.org/abs/2505.08699
-
[20]
N. Zeghidour, E. Kharitonov, M. Orsini, V. Volhejn, G. de Marmiesse, E. Grave, P. Pérez, L. Mazaré, and A. Défossez, “Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling,” 2025. [Online]. Available: https://arxiv.org/abs/2509.08753
-
[21]
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
A. Abouelenin, A. Ashfaq, A. Atkinson, H. Awadalla, N. Bach, J. Bao, A. Benhaim, M. Cai, V. Chaudhary, C. Chen, D. Chen, D. Chen, J. Chen, W. Chen, Y.-C. Chen, Y. ling Chen, Q. Dai, X. Dai, R. Fan, M. Gao et al., “Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs,” 2025. [Online]. Available: https://arxiv....
-
[22]
X. Shi, X. Wang, Z. Guo, Y. Wang, P. Zhang, X. Zhang, Z. Guo, H. Hao, Y. Xi, B. Yang, J. Xu, J. Zhou, and J. Lin, “Qwen3-ASR Technical Report,” 2026. [Online]. Available: https://arxiv.org/abs/2601.21337
-
[23]
Robust speech recognition via large-scale weak supervision,
A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proceedings of the 40th International Conference on Machine Learning, ser. ICML’23. JMLR.org, 2023
2023
-
[24]
Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier,
Silero Team, “Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier,” https://github.com/snakers4/silero-vad/tree/v5.1.2, 2024
2024
-
[25]
V. Srivastav, S. Zheng, E. Bezzam, E. L. Bihan, A. Moumen, and S. Gandhi, “Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual Speech Recognition Evaluation,” 2025. [Online]. Available: https://arxiv.org/abs/2510.06961
-
[26]
A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram, A. Nathan, A. Luo, A. Helyar, A. Madry, A. Efremov, A. Spyra, A. Baker-Whitcomb, A. Beutel, A. Karpenko, A. Makelov et al., “OpenAI GPT-5 System Card,” 2025. [Online]. Available: https://arxiv.org/abs/2601.03267
-
[27]
gpt-oss-120b & gpt-oss-20b Model Card
S. Agarwal, L. Ahmad, J. Ai, S. Altman, A. Applebaum, E. Arbus, R. K. Arora, Y. Bai, B. Baker, H. Bao, B. Barak, A. Bennett, T. Bertao, N. Brett, E. Brevdo, G. Brockman, S. Bubeck, C. Chang, K. Chen, M. Chen et al., “gpt-oss-120b & gpt-oss-20b Model Card,” 2025. [Online]. Available: https://arxiv.org/abs/2508.10925