Preference-ASR: A Preference-Aware Test Set for Benchmarking ASR in the Era of Speech LLMs
Pith reviewed 2026-06-30 07:23 UTC · model grok-4.3
The pith
ASR model rankings shift when evaluated on following different user preferences for output formatting and style.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PreferenceASR evaluates ASR systems on their ability to follow natural-language preference instructions across normalization, entities, disfluencies, and case; the preference-aware normalizer selectively applies only the matching normalization steps, and benchmarking four models demonstrates that rankings change across preference types in ways traditional fixed-normalizer evaluations miss.
What carries the argument
The preference-aware normalizer that selectively skips normalization steps to match the active instruction.
If this is right
- Models that rank highest under one preference instruction can rank lower under another.
- Fixed normalizers in existing benchmarks mask differences in how systems handle explicit user instructions.
- The released dataset supplies a direct way to measure preference adherence in ASR outputs.
Where Pith is reading between the lines
- Speech LLMs may require explicit preference conditioning at inference time to maintain high scores across categories.
- Evaluation pipelines for future ASR systems could default to testing multiple preference instructions rather than one fixed style.
- The shift in rankings suggests that preference following is an independent capability that standard word-error metrics do not capture.
Load-bearing premise
The two-stage pipeline with human verification produces examples that match real user preferences and the normalizer applies instructions without adding its own systematic bias.
What would settle it
Re-evaluate the same four models on a larger held-out set of preference instructions and find that model rankings remain stable across all four categories.
Figures
read the original abstract
Popular ASR test sets adopt inconsistent conventions for numbers, disfluencies, entities, and casing, while standard normalizers erase the format distinctions users care about. Current benchmarks therefore cannot measure whether a model follows user preferences for output style. We introduce PreferenceASR, a test set evaluating ASR systems on their ability to follow natural-language preference instructions across four categories: normalization, entities, disfluencies, and case. Built from seven open-source corpora via a two-stage LLM-assisted pipeline with human verification, it is evaluated with a preference-aware normalizer that selectively skips steps matching the active instruction. Benchmarking four models shows rankings shift across preference types, exposing quality differences traditional evaluation obscures. We publicly release the dataset.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PreferenceASR, a new ASR test set constructed from seven open-source corpora via a two-stage LLM-assisted pipeline with human verification. It evaluates systems on their ability to follow natural-language preference instructions in four categories (normalization, entities, disfluencies, case) using a preference-aware normalizer that selectively skips matching steps. Benchmarking four models shows that rankings shift across preference types, revealing quality differences that standard evaluation obscures. The dataset is publicly released.
Significance. If the dataset construction and normalizer are validated, the work could meaningfully advance ASR benchmarking for speech LLMs by incorporating user preferences that current metrics ignore. The public release of the dataset is a clear strength that enables follow-up work.
major comments (2)
- Abstract: the central claim that 'benchmarking four models shows rankings shift across preference types' is presented without any quantitative results, specific WER values, error analysis, or tables, preventing assessment of whether the shifts are robust or statistically meaningful.
- Abstract (preference-aware normalizer description): the selective skipping logic is load-bearing for all reported comparisons, yet the abstract provides no validation (e.g., human agreement rates on skipped vs. applied steps or ablation on neutral instructions); without this, observed ranking changes could arise from normalizer artifacts rather than model differences.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below. The full manuscript contains the requested quantitative details and validations, but we agree the abstract can be strengthened for clarity.
read point-by-point responses
-
Referee: Abstract: the central claim that 'benchmarking four models shows rankings shift across preference types' is presented without any quantitative results, specific WER values, error analysis, or tables, preventing assessment of whether the shifts are robust or statistically meaningful.
Authors: The experiments section of the manuscript provides detailed WER tables, ranking comparisons, and error analysis across the four models and preference categories. The abstract is a concise summary and cannot include tables. We will revise the abstract to include specific quantitative highlights (e.g., example WER deltas and ranking reversals) to make the claim more assessable while remaining within length limits. revision: yes
-
Referee: Abstract (preference-aware normalizer description): the selective skipping logic is load-bearing for all reported comparisons, yet the abstract provides no validation (e.g., human agreement rates on skipped vs. applied steps or ablation on neutral instructions); without this, observed ranking changes could arise from normalizer artifacts rather than model differences.
Authors: The manuscript details the normalizer's construction, human verification process, agreement rates, and ablations in the methods and experiments sections. We agree a brief reference to this validation in the abstract would address the concern and will revise the abstract accordingly to note that the normalizer was validated via human agreement and ablations. revision: yes
Circularity Check
No circularity: empirical dataset construction with no derivations or fitted predictions
full rationale
The paper is a dataset construction effort that builds PreferenceASR from existing corpora via an LLM-assisted pipeline plus human verification, then applies a custom normalizer for evaluation. No equations, parameters, or predictions appear in the provided text or abstract. The central claim (ranking shifts across preference categories) rests on direct empirical comparison of four models rather than any self-referential reduction, self-citation chain, or ansatz. The normalizer description is procedural rather than a fitted or derived quantity that could collapse to its inputs by construction. This matches the default expectation of a non-circular empirical paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Human verification of LLM-generated preference examples produces a test set that reflects genuine user preferences.
Reference graph
Works this paper leans on
-
[1]
Preference-ASR: A Preference-Aware Test Set for Benchmarking ASR in the Era of Speech LLMs
Introduction Automatic speech recognition (ASR) has evolved rapidly, from hidden Markov models through end-to-end CTC and RNN-Transducer architectures to the recent class of Speech- augmented Large Language Models (SpeechLLMs) such as SALMONN [1], Qwen-Audio [2], and Qwen2.5-Omni [3]. Un- like earlier systems, SpeechLLMs can follow natural-language instru...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[2]
A preference-annotated English test set covering four pref- erence categories: normalization, entities, disfluencies, and case
-
[3]
A two-stage LLM-assisted pipeline for generating the test set, with human verification and correction
-
[4]
A preference-aware normalizer that selectively skips normal- ization steps matching the active preference instruction, en- abling fair WER computation across diverse formatting re- quirements
-
[5]
We publicly release the test set and evaluation code to support reproducible benchmarking of SpeechLLMs on preference- following
-
[6]
22”) or spoken words (TN, e.g., “twenty-two
Preference-ASR Dataset 2.1. Preference Categories We organize preferences into four categories that capture the most common points of friction in real-world transcription workflows. Each category is further divided into sub-categories that can be combined in a single instruction. Normalization.Normalization deals with non-standard words whose spoken and w...
-
[7]
22nd” following an ITN instruction, a standard normalizer converts it to “twenty second,
Experiment Setup We evaluate the Preference-ASR dataset along two dimensions: (1) how providing a preference instruction affects raw transcrip- tion accuracy under standard normalization, and (2) whether models actually comply with the requested preference when as- sessed with a preference-aware metric. 3.1. Preference-Aware Normalizer Standard WER pipeli...
-
[8]
Standard
Evaluation Results Each model is evaluated under two settings:default(D), a stan- dard ASR prompt without any preference instruction, andin- structed(I), with a preference-specific instruction appended. 3https://huggingface.co/spaces/hf-audio/open_ asr_leaderboard Table 2:WER (%) on Preference-ASR. For each model,Std= standard WER with full normalization;...
-
[9]
Conclusion We introduced Preference-ASR, a test set of 3,210 samples drawn from seven open-source corpora that evaluates whether ASR systems can follow explicit human preference instructions across four categories: normalization, entities, disfluencies, and case. Together with a preference-aware normalizer that selec- tively skips steps matching the activ...
-
[10]
All content was reviewed and validated by the authors
Generative AI Use Disclosure Generative AI tools (LLMs) were used in two capacities: (1) as part of the dataset construction pipeline for preference classifi- cation and instruction generation, and (2) for editing and pol- ishing the manuscript. All content was reviewed and validated by the authors
-
[11]
SALMONN: Towards generic hearing abilities for large language models,
C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “SALMONN: Towards generic hearing abilities for large language models,”Proc. International Conference on Learning Representations (ICLR), 2024
2024
-
[12]
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Y . Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou, “Qwen-Audio: Advancing universal audio understand- ing via unified large-scale audio-language models,”arXiv preprint arXiv:2311.07919, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[13]
J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y . Fan, K. Dang, B. Zhang, X. Wang, Y . Chu, and J. Lin, “Qwen2.5-Omni technical report,”arXiv preprint arXiv:2503.20215, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[14]
Lib- rispeech: An ASR corpus based on public domain audio books,
V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Lib- rispeech: An ASR corpus based on public domain audio books,” inProc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210
2015
-
[15]
SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition,
P. K. O’Neill, V . Lavrukhin, S. Majumdar, V . Noroozi, Y . Zhang, O. Kuchaiev, J. Balam, and B. Ginsburg, “SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition,” inProc. INTERSPEECH, 2021, pp. 1434– 1438
2021
-
[16]
Earnings-22: A practical benchmark for accents in the wild,
M. Del Rio, P. Ha, Q. McNamara, C. Miller, and S. Churi, “Earnings-22: A practical benchmark for accents in the wild,” in Proc. INTERSPEECH, 2022, pp. 4833–4837
2022
-
[17]
GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed au- dio,
G. Chen, S. Chai, G. Wang, J. Du, W.-Q. Zhang, C. Weng, D. Su, D. Povey, J. Trmal, J. Zhanget al., “GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed au- dio,” inProc. INTERSPEECH, 2021, pp. 3670–3674
2021
-
[18]
The ami meeting corpus,
W. Kraaij, T. Hain, M. Lincoln, and W. Post, “The ami meeting corpus,” inProc. International Conference on Methods and Tech- niques in Behavioral Research, 2005, pp. 1–4
2005
-
[19]
V oxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpreta- tion,
C. Wang, M. Riviere, A. Lee, A. Wu, C. Talber, J. Joshi, and J. Pino, “V oxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpreta- tion,” inProc. Annual Meeting of the Association for Computa- tional Linguistics (ACL), 2021, pp. 993–1003
2021
-
[20]
Common V oice: A massively-multilingual speech corpus,
R. Ardila, M. Branwen, L. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, “Common V oice: A massively-multilingual speech corpus,” in Proc. Language Resources and Evaluation Conference (LREC), 2020, pp. 4218–4222
2020
-
[21]
End-to-end rich transcription-style automatic speech recognition with semi- supervised learning,
Z. Meng, S. Parthasarathy, E. Sun, Y . Gaur, N. Kanda, L. Chen, Y . Zhao, J. Huang, Y . Gong, X. Zenget al., “End-to-end rich transcription-style automatic speech recognition with semi- supervised learning,”arXiv preprint arXiv:2107.05382, 2021
-
[22]
MMAU: A holistic bench- mark of agent capabilities across diverse domains,
G. Sakshi, Z. Sun, A. Alomraniet al., “MMAU: A holistic bench- mark of agent capabilities across diverse domains,”arXiv preprint arXiv:2407.18961, 2024
-
[23]
C.-Y . K. Ke-Han Lu and H. yi Lee, “Speech-ifeval: Evaluating instruction-following and quantifying catastrophic forgetting in speech-aware language models,” 2025. [Online]. Available: https://arxiv.org/abs/2505.19037
-
[24]
Instruction-following speech recognition,
C.-I. J. Lai, Z. Lu, L. Cao, and R. Pang, “Instruction-following speech recognition,”arXiv preprint arXiv:2309.09843, 2023
-
[25]
Open ASR leaderboard,
Hugging Face, “Open ASR leaderboard,” https://huggingface.co/ spaces/hf-audio/open asr leaderboard, 2024
2024
-
[26]
RNN Approaches to Text Normalization: A Challenge
R. Sproat and N. Jaitly, “RNN approaches to text normalization: A challenge,”arXiv preprint arXiv:1611.00068, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[27]
Deep context: End-to-end contex- tual speech recognition,
G. Pundak and T. N. Sainath, “Deep context: End-to-end contex- tual speech recognition,” inProc. IEEE Spoken Language Tech- nology Workshop (SLT). IEEE, 2018, pp. 418–425
2018
-
[28]
Spontaneous speech: How people really talk and why engineers should care,
E. Shriberg, “Spontaneous speech: How people really talk and why engineers should care,” inProc. INTERSPEECH, 2005, pp. 1781–1784
2005
-
[29]
Bidirectional recurrent neural network with attention mechanism for punctuation restoration,
O. Tilk and T. Alum ¨ae, “Bidirectional recurrent neural network with attention mechanism for punctuation restoration,” inProc. INTERSPEECH, 2016, pp. 3047–3051
2016
-
[30]
Longer is (not necessarily) stronger: Punc- tuated long-sequence training for enhanced speech recognition and translation,
N. R. Koluguri, T. Bartley, H. Xu, O. Hrinchuk, J. Balam, B. Gins- burg, and G. Kucsko, “Longer is (not necessarily) stronger: Punc- tuated long-sequence training for enhanced speech recognition and translation,” in2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 255–262
2024
-
[31]
Q. Team, “Qwen3 technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2505.09388
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[32]
open asr leaderboard/normalizer,
Hugging Face, “open asr leaderboard/normalizer,” https://github. com/huggingface/open asr leaderboard/tree/main/normalizer, 2026, commit or access date: March 3, 2026
2026
-
[33]
Robust speech recognition via large-scale weak su- pervision,
A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak su- pervision,”Proc. International Conference on Machine Learning (ICML), pp. 28 492–28 518, 2023
2023
-
[34]
What is lost in normalization? Exploring pitfalls in multilingual ASR model evaluations,
K. Guptaet al., “What is lost in normalization? Exploring pitfalls in multilingual ASR model evaluations,” inProc. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
2024
-
[35]
M. Sekoyan, N. R. Koluguri, N. Tadevosyan, P. Zelasko, T. Bart- ley, N. Karpov, J. Balam, and B. Ginsburg, “Canary-1b-v2 & parakeet-tdt-0.6 b-v3: Efficient and high-performance models for multilingual asr and ast,”arXiv preprint arXiv:2509.14128, 2025
-
[36]
Fast conformer with linearly scalable attention for efficient speech recognition,
D. Rekesh, N. R. Koluguri, S. Kriman, S. Majumdar, V . Noroozi, H. Huang, O. Hrinchuk, K. Puvvada, A. Kumar, J. Balamet al., “Fast conformer with linearly scalable attention for efficient speech recognition,” in2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8
2023
-
[37]
Efficient sequence transduction by jointly predicting tokens and durations,
H. Xu, F. Jia, S. Majumdar, H. Huang, S. Watanabe, and B. Gins- burg, “Efficient sequence transduction by jointly predicting tokens and durations,” inInternational Conference on Machine Learn- ing. PMLR, 2023, pp. 38 462–38 484
2023
-
[38]
Qwen2. 5 technical report,
A. Y . Qwen, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Weiet al., “Qwen2. 5 technical report,”arXiv preprint, 2024
2024
-
[39]
Lora: Low-rank adaptation of large language models
E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chenet al., “Lora: Low-rank adaptation of large language models.”Iclr, vol. 1, no. 2, p. 3, 2022
2022
-
[40]
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
A. Abouelenin, A. Ashfaq, A. Atkinson, H. Awadalla, N. Bach et al., “Phi-4-Mini technical report: Compact yet powerful mul- timodal language models via mixture-of-LoRAs,”arXiv preprint arXiv:2503.01743, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[41]
J. Xu, Z. Guo, J. He, H. Hu, Y . Chu, J. Linet al., “Qwen3-Omni technical report,”arXiv preprint arXiv:2509.17765, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[42]
NVIDIA NeMo Canary-Qwen-2.5B,
NVIDIA, “NVIDIA NeMo Canary-Qwen-2.5B,” https: //huggingface.co/nvidia/canary-qwen-2.5b, 2025. Supplementary Material: Preference-ASR
2025
-
[43]
Um, and profit aim is fifty million Euros, which is uh
Additional Preference Examples Figure 3 in the main paper shows one example per preference category. Figure 4 lists a sample from each category along with the full instruction and expected preference text as used during evaluation. Normalization AMI, ITN, numbers & symbols GT:“Um, and profit aim is fifty million Euros, which is uh” Instr:Perform speech re...
-
[44]
Per-Dataset WER Breakdown Table 2 in the main paper reports aggregate WER across all seven source datasets. Tables 3–6 provide the full per-dataset breakdown for each model, reporting both standard WER (Std, full normalization) and preference-aware WER (Pref, selective normalization that skips steps matching the active preference). D denotes a default pro...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.