Preference-ASR: A Preference-Aware Test Set for Benchmarking ASR in the Era of Speech LLMs

Boris Ginsburg; Desh Raj; Jagadeesh Balam; Nikolay Karpov; Nithin Rao Koluguri; Piotr Zelasko; Sasha Meister

arxiv: 2606.29534 · v1 · pith:7R6K5ZP4new · submitted 2026-06-28 · 💻 cs.CL · eess.AS

Pith reviewed 2026-06-30 07:23 UTC · model grok-4.3

keywords: ASR evaluation · preference-aware test set · speech recognition · output style preferences · normalization · disfluency handling · entity rendering

The pith

ASR model rankings shift when systems are evaluated on how well they follow different user preferences for output formatting and style.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard ASR test sets fix conventions for numbers, entities, disfluencies, and casing, then apply normalizers that erase the distinctions users actually request. The paper builds PreferenceASR from seven open-source corpora using an LLM-assisted pipeline followed by human checks, then scores systems with a normalizer that skips steps to match an active preference instruction. Four categories are covered: normalization choices, entity rendering, disfluency handling, and case usage. When four models are tested, their relative order changes depending on which preference is active, showing that conventional metrics hide differences in how well systems adapt to instructions.

Core claim

PreferenceASR evaluates ASR systems on their ability to follow natural-language preference instructions across normalization, entities, disfluencies, and case; the preference-aware normalizer selectively applies only the matching normalization steps, and benchmarking four models demonstrates that rankings change across preference types in ways traditional fixed-normalizer evaluations miss.

What carries the argument

The preference-aware normalizer that selectively skips normalization steps to match the active instruction.
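To make the skipping idea concrete, here is a minimal Python sketch of a normalizer whose steps can be switched off by the active preference. The step names (`disfluencies`, `case`, `punctuation`), the filler list, and the `normalize` function are assumptions made for illustration; the paper's actual step inventory and code are not reproduced on this page.

```python
import re
from typing import Callable, Dict, Iterable

# Hypothetical filler inventory; the paper's real list is not given here.
FILLERS = {"um", "uh", "erm"}


def strip_fillers(text: str) -> str:
    # Compare on letters only so that "Um," is also recognised as a filler.
    kept = [t for t in text.split()
            if re.sub(r"[^a-z]", "", t.lower()) not in FILLERS]
    return " ".join(kept)


def lowercase(text: str) -> str:
    return text.lower()


def strip_punctuation(text: str) -> str:
    # Remove everything except word characters, whitespace, and apostrophes.
    no_punct = re.sub(r"[^\w\s']", "", text)
    return re.sub(r"\s+", " ", no_punct).strip()


# Ordered mapping from (assumed) preference category to normalization step.
STEPS: Dict[str, Callable[[str], str]] = {
    "disfluencies": strip_fillers,
    "case": lowercase,
    "punctuation": strip_punctuation,
}


def normalize(text: str, active_preferences: Iterable[str] = ()) -> str:
    """Apply every step except those matching an active preference."""
    skip = set(active_preferences)
    for name, step in STEPS.items():
        if name in skip:
            continue  # keep the distinction the instruction asked for
        text = step(text)
    return text


sample = "Um, the Profit is uh fifty million."
print(normalize(sample))                    # full normalization
print(normalize(sample, ["disfluencies"]))  # fillers survive and get scored
print(normalize(sample, ["case"]))          # casing survives and gets scored
```

With no active preference the sample collapses to `the profit is fifty million`; with `disfluencies` active the fillers stay in the string, so a model that dropped them would be penalised when WER is computed against the preference reference.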

If this is right

Models that rank highest under one preference instruction can rank lower under another.
Fixed normalizers in existing benchmarks mask differences in how systems handle explicit user instructions.
The released dataset supplies a direct way to measure preference adherence in ASR outputs.
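The first consequence above can be checked mechanically once per-preference WER scores are available. The sketch below ranks models within each preference category and reports whether the order differs anywhere. The model names and WER values are placeholders invented for illustration, not results from the paper, and the function names are hypothetical.

```python
from typing import Dict, List


def rank_models(wer_by_model: Dict[str, float]) -> List[str]:
    # Lower WER is better; ties are broken alphabetically for determinism.
    return sorted(wer_by_model, key=lambda m: (wer_by_model[m], m))


def rankings_by_preference(
    table: Dict[str, Dict[str, float]]
) -> Dict[str, List[str]]:
    # One ranking per preference category.
    return {pref: rank_models(scores) for pref, scores in table.items()}


def rankings_shift(table: Dict[str, Dict[str, float]]) -> bool:
    # True when at least two categories order the models differently.
    orders = {tuple(r) for r in rankings_by_preference(table).values()}
    return len(orders) > 1


# Placeholder WER values (%), NOT taken from the paper.
placeholder = {
    "normalization": {"model_a": 10.0, "model_b": 12.0, "model_c": 15.0},
    "disfluencies": {"model_a": 14.0, "model_b": 9.0, "model_c": 15.5},
}

for pref, order in rankings_by_preference(placeholder).items():
    print(pref, order)
print("rankings shift:", rankings_shift(placeholder))
```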

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Speech LLMs may require explicit preference conditioning at inference time to maintain high scores across categories.
Evaluation pipelines for future ASR systems could default to testing multiple preference instructions rather than one fixed style.
The shift in rankings suggests that preference following is an independent capability that standard word-error metrics do not capture.

Load-bearing premise

The two-stage pipeline with human verification produces examples that match real user preferences and the normalizer applies instructions without adding its own systematic bias.

What would settle it

Re-evaluate the same four models on a larger held-out set of preference instructions and find that model rankings remain stable across all four categories.
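One way to operationalise "rankings remain stable" is a rank-correlation check between every pair of preference categories. The sketch below implements Kendall's tau-a for tie-free rankings in plain Python and returns the lowest pairwise value; a result of 1.0 would mean every category orders the models identically. The function names and the choice of tau-a are assumptions of this sketch, not a procedure described by the paper.

```python
from itertools import combinations
from typing import Dict, List, Sequence


def kendall_tau(rank_a: Sequence[str], rank_b: Sequence[str]) -> float:
    """Kendall tau-a between two tie-free rankings of the same items."""
    if (len(rank_a) != len(rank_b)
            or set(rank_a) != set(rank_b)
            or len(set(rank_a)) != len(rank_a)):
        raise ValueError("rankings must list the same items exactly once")
    pos_a = {m: i for i, m in enumerate(rank_a)}
    pos_b = {m: i for i, m in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # Positive product: both rankings order the pair the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / pairs if pairs else 1.0


def min_pairwise_tau(rankings: Dict[str, List[str]]) -> float:
    """Lowest tau over all pairs of preference categories."""
    taus = [kendall_tau(rankings[p], rankings[q])
            for p, q in combinations(rankings, 2)]
    return min(taus) if taus else 1.0
```

Under this reading, the falsification test amounts to finding `min_pairwise_tau` equal to 1.0 on the larger held-out instruction set.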

Figures

Figures reproduced from arXiv: 2606.29534 by Boris Ginsburg, Desh Raj, Jagadeesh Balam, Nikolay Karpov, Nithin Rao Koluguri, Piotr Zelasko, Sasha Meister.

**Figure 1.** Existing ASR references exhibit inconsistent conventions across datasets: GigaSpeech writes numbers in spoken form while Earnings-22 uses written form; AMI preserves filler words while SPGISpeech strips them. Preference-ASR resolves this by conditioning each reference on an explicit user instruction.

**Figure 2.** Two-stage construction pipeline for the Preference-ASR dataset. Audio samples and ground truth from seven corpora are first manually verified. Stage 1 classifies each sample into preference categories (normalization, entities, disfluencies, case) using an LLM. Stage 2 generates task-specific instructions and reference texts. A final round of manual verification produces the released test set with preference-a…

**Figure 3.** Examples from Preference-ASR. Each box shows ground truth (GT), preference instruction, and expected output (Pref). Bold marks transformations; underlined entities are false positives absent from audio.

**Figure 4.** Additional examples from Preference-ASR (complementary to …

Original abstract

Popular ASR test sets adopt inconsistent conventions for numbers, disfluencies, entities, and casing, while standard normalizers erase the format distinctions users care about. Current benchmarks therefore cannot measure whether a model follows user preferences for output style. We introduce PreferenceASR, a test set evaluating ASR systems on their ability to follow natural-language preference instructions across four categories: normalization, entities, disfluencies, and case. Built from seven open-source corpora via a two-stage LLM-assisted pipeline with human verification, it is evaluated with a preference-aware normalizer that selectively skips steps matching the active instruction. Benchmarking four models shows rankings shift across preference types, exposing quality differences traditional evaluation obscures. We publicly release the dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PreferenceASR releases a new test set for ASR style preferences with a selective normalizer, but the abstract gives no numbers or validation to back the ranking-shift claim.

This paper's key offering is a publicly released test set that evaluates ASR models on following user preferences for output style in four areas, paired with a custom normalizer.

The new part is the combination of preference instructions with a selective normalizer that adjusts based on the active instruction. Prior benchmarks did not do this. The construction from multiple corpora with LLM help and human checks is a reasonable way to build it.

It does a service by pointing out how fixed normalizers in standard tests erase distinctions that users might care about, and by showing that model rankings can change depending on the preference type.

The soft spots are in the evidence presented. The abstract states that rankings shift but supplies no numbers, no breakdown of errors, and no tests of the normalizer against human judgments. That makes it hard to judge if the shifts are robust or if the normalizer introduces artifacts, as the stress-test note suggests could happen with the skipping logic.

The two-stage pipeline sounds practical, but without more on how well it captures natural preferences or inter-annotator agreement, the dataset's quality is not fully clear from the summary.

This paper is for people in the ASR and speech LLM community who want benchmarks that better match real deployment scenarios with user instructions. A reader looking for a new evaluation resource on style preferences would get value from the release.

It deserves a serious referee because the idea addresses a real gap and the dataset is released, which allows others to inspect and use it. I would recommend sending it to peer review, with the expectation that reviewers will ask for more validation on the normalizer and the construction pipeline.

Referee Report

2 major / 0 minor

Summary. The paper introduces PreferenceASR, a new ASR test set constructed from seven open-source corpora via a two-stage LLM-assisted pipeline with human verification. It evaluates systems on their ability to follow natural-language preference instructions in four categories (normalization, entities, disfluencies, case) using a preference-aware normalizer that selectively skips matching steps. Benchmarking four models shows that rankings shift across preference types, revealing quality differences that standard evaluation obscures. The dataset is publicly released.

Significance. If the dataset construction and normalizer are validated, the work could meaningfully advance ASR benchmarking for speech LLMs by incorporating user preferences that current metrics ignore. The public release of the dataset is a clear strength that enables follow-up work.

major comments (2)

Abstract: the central claim that 'benchmarking four models shows rankings shift across preference types' is presented without any quantitative results, specific WER values, error analysis, or tables, preventing assessment of whether the shifts are robust or statistically meaningful.
Abstract (preference-aware normalizer description): the selective skipping logic is load-bearing for all reported comparisons, yet the abstract provides no validation (e.g., human agreement rates on skipped vs. applied steps or ablation on neutral instructions); without this, observed ranking changes could arise from normalizer artifacts rather than model differences.
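The robustness question raised in the first comment is commonly addressed with a paired bootstrap over utterances. The sketch below computes corpus-level WER with a word-level edit distance and then a 95% percentile interval for the WER difference between two systems on the same references. It is an illustration under stated assumptions: whitespace tokenisation, 1,000 resamples, and the function names are choices of this sketch, and nothing here reflects the authors' evaluation code.

```python
import random
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def corpus_wer(pairs: Sequence[Tuple[str, str]]) -> float:
    """Total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words if words else 0.0


def bootstrap_wer_delta(refs: List[str], hyps_a: List[str], hyps_b: List[str],
                        n_resamples: int = 1000,
                        seed: int = 0) -> Tuple[float, float]:
    """95% percentile interval for WER(A) - WER(B), resampling utterances."""
    stats = [(edit_distance(r.split(), a.split()),
              edit_distance(r.split(), b.split()),
              len(r.split()))
             for r, a, b in zip(refs, hyps_a, hyps_b)]
    if not stats or sum(s[2] for s in stats) == 0:
        raise ValueError("need at least one non-empty reference")
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_resamples):
        sample = [stats[rng.randrange(len(stats))] for _ in stats]
        words = sum(s[2] for s in sample)
        if words == 0:
            continue  # resample drew only empty references
        deltas.append((sum(s[0] for s in sample)
                       - sum(s[1] for s in sample)) / words)
    deltas.sort()
    lo = deltas[int(0.025 * len(deltas))]
    hi = deltas[min(len(deltas) - 1, int(0.975 * len(deltas)))]
    return lo, hi
```

If the interval for a given preference category excludes zero, the gap between the two systems is unlikely to be a sampling artifact; an interval that straddles zero would suggest the corresponding ranking reversal is not well supported.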

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below. The full manuscript contains the requested quantitative details and validations, but we agree the abstract can be strengthened for clarity.

Referee: Abstract: the central claim that 'benchmarking four models shows rankings shift across preference types' is presented without any quantitative results, specific WER values, error analysis, or tables, preventing assessment of whether the shifts are robust or statistically meaningful.

Authors: The experiments section of the manuscript provides detailed WER tables, ranking comparisons, and error analysis across the four models and preference categories. The abstract is a concise summary and cannot include tables. We will revise the abstract to include specific quantitative highlights (e.g., example WER deltas and ranking reversals) to make the claim more assessable while remaining within length limits. revision: yes
Referee: Abstract (preference-aware normalizer description): the selective skipping logic is load-bearing for all reported comparisons, yet the abstract provides no validation (e.g., human agreement rates on skipped vs. applied steps or ablation on neutral instructions); without this, observed ranking changes could arise from normalizer artifacts rather than model differences.

Authors: The manuscript details the normalizer's construction, human verification process, agreement rates, and ablations in the methods and experiments sections. We agree a brief reference to this validation in the abstract would address the concern and will revise the abstract accordingly to note that the normalizer was validated via human agreement and ablations. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset construction with no derivations or fitted predictions

The paper is a dataset construction effort that builds PreferenceASR from existing corpora via an LLM-assisted pipeline plus human verification, then applies a custom normalizer for evaluation. No equations, parameters, or predictions appear in the provided text or abstract. The central claim (ranking shifts across preference categories) rests on direct empirical comparison of four models rather than any self-referential reduction, self-citation chain, or ansatz. The normalizer description is procedural rather than a fitted or derived quantity that could collapse to its inputs by construction. This matches the default expectation of a non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the reliability of the LLM-assisted construction and human verification process; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: Human verification of LLM-generated preference examples produces a test set that reflects genuine user preferences.
The two-stage pipeline relies on this step to ensure quality; stated in the abstract description of dataset creation.

Reference graph

Works this paper leans on

44 extracted references · 12 canonical work pages · 7 internal anchors

[1]

Preference-ASR: A Preference-Aware Test Set for Benchmarking ASR in the Era of Speech LLMs

Introduction. Automatic speech recognition (ASR) has evolved rapidly, from hidden Markov models through end-to-end CTC and RNN-Transducer architectures to the recent class of Speech-augmented Large Language Models (SpeechLLMs) such as SALMONN [1], Qwen-Audio [2], and Qwen2.5-Omni [3]. Unlike earlier systems, SpeechLLMs can follow natural-language instru...
[2]

A preference-annotated English test set covering four preference categories: normalization, entities, disfluencies, and case
[3]

A two-stage LLM-assisted pipeline for generating the test set, with human verification and correction
[4]

A preference-aware normalizer that selectively skips normalization steps matching the active preference instruction, enabling fair WER computation across diverse formatting requirements
[5]

We publicly release the test set and evaluation code to support reproducible benchmarking of SpeechLLMs on preference-following
[6]

22”) or spoken words (TN, e.g., “twenty-two

Preference-ASR Dataset. 2.1. Preference Categories. We organize preferences into four categories that capture the most common points of friction in real-world transcription workflows. Each category is further divided into sub-categories that can be combined in a single instruction. Normalization. Normalization deals with non-standard words whose spoken and w...
[7]

22nd” following an ITN instruction, a standard normalizer converts it to “twenty second,

Experiment Setup. We evaluate the Preference-ASR dataset along two dimensions: (1) how providing a preference instruction affects raw transcription accuracy under standard normalization, and (2) whether models actually comply with the requested preference when assessed with a preference-aware metric. 3.1. Preference-Aware Normalizer. Standard WER pipeli...
[8]

Standard

Evaluation Results. Each model is evaluated under two settings: default (D), a standard ASR prompt without any preference instruction, and instructed (I), with a preference-specific instruction appended. [Footnote 3: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard] Table 2: WER (%) on Preference-ASR. For each model, Std = standard WER with full normalization;...
[9]

Conclusion. We introduced Preference-ASR, a test set of 3,210 samples drawn from seven open-source corpora that evaluates whether ASR systems can follow explicit human preference instructions across four categories: normalization, entities, disfluencies, and case. Together with a preference-aware normalizer that selectively skips steps matching the activ...
[10]

All content was reviewed and validated by the authors

Generative AI Use Disclosure. Generative AI tools (LLMs) were used in two capacities: (1) as part of the dataset construction pipeline for preference classification and instruction generation, and (2) for editing and polishing the manuscript. All content was reviewed and validated by the authors
[11]

SALMONN: Towards generic hearing abilities for large language models

C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “SALMONN: Towards generic hearing abilities for large language models,” Proc. International Conference on Learning Representations (ICLR), 2024.
[12]

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Y. Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou, “Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models,” arXiv preprint arXiv:2311.07919, 2023.
[13]

Qwen2.5-Omni Technical Report

J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang, B. Zhang, X. Wang, Y. Chu, and J. Lin, “Qwen2.5-Omni technical report,” arXiv preprint arXiv:2503.20215, 2025.
[14]

Librispeech: An ASR corpus based on public domain audio books

V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
[15]

SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition

P. K. O’Neill, V. Lavrukhin, S. Majumdar, V. Noroozi, Y. Zhang, O. Kuchaiev, J. Balam, and B. Ginsburg, “SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition,” in Proc. INTERSPEECH, 2021, pp. 1434–1438.
[16]

Earnings-22: A practical benchmark for accents in the wild

M. Del Rio, P. Ha, Q. McNamara, C. Miller, and S. Churi, “Earnings-22: A practical benchmark for accents in the wild,” in Proc. INTERSPEECH, 2022, pp. 4833–4837.
[17]

GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio

G. Chen, S. Chai, G. Wang, J. Du, W.-Q. Zhang, C. Weng, D. Su, D. Povey, J. Trmal, J. Zhang et al., “GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio,” in Proc. INTERSPEECH, 2021, pp. 3670–3674.
[18]

The AMI meeting corpus

W. Kraaij, T. Hain, M. Lincoln, and W. Post, “The AMI meeting corpus,” in Proc. International Conference on Methods and Techniques in Behavioral Research, 2005, pp. 1–4.
[19]

VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation

C. Wang, M. Riviere, A. Lee, A. Wu, C. Talber, J. Joshi, and J. Pino, “VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation,” in Proc. Annual Meeting of the Association for Computational Linguistics (ACL), 2021, pp. 993–1003.
[20]

Common Voice: A massively-multilingual speech corpus

R. Ardila, M. Branwen, L. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, “Common Voice: A massively-multilingual speech corpus,” in Proc. Language Resources and Evaluation Conference (LREC), 2020, pp. 4218–4222.
[21]

End-to-end rich transcription-style automatic speech recognition with semi-supervised learning

Z. Meng, S. Parthasarathy, E. Sun, Y. Gaur, N. Kanda, L. Chen, Y. Zhao, J. Huang, Y. Gong, X. Zeng et al., “End-to-end rich transcription-style automatic speech recognition with semi-supervised learning,” arXiv preprint arXiv:2107.05382, 2021.
[22]

MMAU: A holistic benchmark of agent capabilities across diverse domains

G. Sakshi, Z. Sun, A. Alomrani et al., “MMAU: A holistic benchmark of agent capabilities across diverse domains,” arXiv preprint arXiv:2407.18961, 2024.
[23]

Speech-ifeval: Evaluating instruction-following and quantifying catastrophic forgetting in speech-aware language models

K.-H. Lu, C.-Y. Kuan, and H.-y. Lee, “Speech-ifeval: Evaluating instruction-following and quantifying catastrophic forgetting in speech-aware language models,” 2025. [Online]. Available: https://arxiv.org/abs/2505.19037
[24]

Instruction-following speech recognition

C.-I. J. Lai, Z. Lu, L. Cao, and R. Pang, “Instruction-following speech recognition,” arXiv preprint arXiv:2309.09843, 2023.
[25]

Open ASR leaderboard

Hugging Face, “Open ASR leaderboard,” https://huggingface.co/spaces/hf-audio/open_asr_leaderboard, 2024.
[26]

RNN Approaches to Text Normalization: A Challenge

R. Sproat and N. Jaitly, “RNN approaches to text normalization: A challenge,” arXiv preprint arXiv:1611.00068, 2016.
[27]

Deep context: End-to-end contextual speech recognition

G. Pundak and T. N. Sainath, “Deep context: End-to-end contextual speech recognition,” in Proc. IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 418–425.
[28]

Spontaneous speech: How people really talk and why engineers should care

E. Shriberg, “Spontaneous speech: How people really talk and why engineers should care,” in Proc. INTERSPEECH, 2005, pp. 1781–1784.
[29]

Bidirectional recurrent neural network with attention mechanism for punctuation restoration

O. Tilk and T. Alumäe, “Bidirectional recurrent neural network with attention mechanism for punctuation restoration,” in Proc. INTERSPEECH, 2016, pp. 3047–3051.
[30]

Longer is (not necessarily) stronger: Punctuated long-sequence training for enhanced speech recognition and translation

N. R. Koluguri, T. Bartley, H. Xu, O. Hrinchuk, J. Balam, B. Ginsburg, and G. Kucsko, “Longer is (not necessarily) stronger: Punctuated long-sequence training for enhanced speech recognition and translation,” in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 255–262.
[31]

Qwen3 Technical Report

Q. Team, “Qwen3 technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2505.09388
[32]

open_asr_leaderboard/normalizer

Hugging Face, “open_asr_leaderboard/normalizer,” https://github.com/huggingface/open_asr_leaderboard/tree/main/normalizer, 2026, commit or access date: March 3, 2026.
[33]

Robust speech recognition via large-scale weak supervision

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” Proc. International Conference on Machine Learning (ICML), pp. 28492–28518, 2023.
[34]

What is lost in normalization? Exploring pitfalls in multilingual ASR model evaluations

K. Gupta et al., “What is lost in normalization? Exploring pitfalls in multilingual ASR model evaluations,” in Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
[35]

Canary-1b-v2 & parakeet-tdt-0.6b-v3: Efficient and high-performance models for multilingual ASR and AST

M. Sekoyan, N. R. Koluguri, N. Tadevosyan, P. Zelasko, T. Bartley, N. Karpov, J. Balam, and B. Ginsburg, “Canary-1b-v2 & parakeet-tdt-0.6b-v3: Efficient and high-performance models for multilingual ASR and AST,” arXiv preprint arXiv:2509.14128, 2025.
[36]

Fast conformer with linearly scalable attention for efficient speech recognition

D. Rekesh, N. R. Koluguri, S. Kriman, S. Majumdar, V. Noroozi, H. Huang, O. Hrinchuk, K. Puvvada, A. Kumar, J. Balam et al., “Fast conformer with linearly scalable attention for efficient speech recognition,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8.
[37]

Efficient sequence transduction by jointly predicting tokens and durations

H. Xu, F. Jia, S. Majumdar, H. Huang, S. Watanabe, and B. Ginsburg, “Efficient sequence transduction by jointly predicting tokens and durations,” in International Conference on Machine Learning. PMLR, 2023, pp. 38462–38484.
[38]

Qwen2.5 technical report

A. Y. Qwen, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei et al., “Qwen2.5 technical report,” arXiv preprint, 2024.
[39]

LoRA: Low-rank adaptation of large language models

E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., “LoRA: Low-rank adaptation of large language models,” ICLR, vol. 1, no. 2, p. 3, 2022.
[40]

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

A. Abouelenin, A. Ashfaq, A. Atkinson, H. Awadalla, N. Bach et al., “Phi-4-Mini technical report: Compact yet powerful multimodal language models via mixture-of-LoRAs,” arXiv preprint arXiv:2503.01743, 2025.
[41]

Qwen3-Omni Technical Report

J. Xu, Z. Guo, J. He, H. Hu, Y. Chu, J. Lin et al., “Qwen3-Omni technical report,” arXiv preprint arXiv:2509.17765, 2025.
[42]

NVIDIA NeMo Canary-Qwen-2.5B

NVIDIA, “NVIDIA NeMo Canary-Qwen-2.5B,” https://huggingface.co/nvidia/canary-qwen-2.5b, 2025.

Supplementary Material: Preference-ASR
[43]

Um, and profit aim is fifty million Euros, which is uh

Additional Preference Examples. Figure 3 in the main paper shows one example per preference category. Figure 4 lists a sample from each category along with the full instruction and expected preference text as used during evaluation. Normalization (AMI, ITN, numbers & symbols). GT: “Um, and profit aim is fifty million Euros, which is uh” Instr: Perform speech re...
[44]

Per-Dataset WER Breakdown. Table 2 in the main paper reports aggregate WER across all seven source datasets. Tables 3–6 provide the full per-dataset breakdown for each model, reporting both standard WER (Std, full normalization) and preference-aware WER (Pref, selective normalization that skips steps matching the active preference). D denotes a default pro...

[1] [1]

Preference-ASR: A Preference-Aware Test Set for Benchmarking ASR in the Era of Speech LLMs

Introduction Automatic speech recognition (ASR) has evolved rapidly, from hidden Markov models through end-to-end CTC and RNN-Transducer architectures to the recent class of Speech- augmented Large Language Models (SpeechLLMs) such as SALMONN [1], Qwen-Audio [2], and Qwen2.5-Omni [3]. Un- like earlier systems, SpeechLLMs can follow natural-language instru...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

A preference-annotated English test set covering four pref- erence categories: normalization, entities, disfluencies, and case

[3] [3]

A two-stage LLM-assisted pipeline for generating the test set, with human verification and correction

[4] [4]

A preference-aware normalizer that selectively skips normal- ization steps matching the active preference instruction, en- abling fair WER computation across diverse formatting re- quirements

[5] [5]

We publicly release the test set and evaluation code to support reproducible benchmarking of SpeechLLMs on preference- following

[6] [6]

22”) or spoken words (TN, e.g., “twenty-two

Preference-ASR Dataset 2.1. Preference Categories We organize preferences into four categories that capture the most common points of friction in real-world transcription workflows. Each category is further divided into sub-categories that can be combined in a single instruction. Normalization.Normalization deals with non-standard words whose spoken and w...

[7] [7]

22nd” following an ITN instruction, a standard normalizer converts it to “twenty second,

Experiment Setup We evaluate the Preference-ASR dataset along two dimensions: (1) how providing a preference instruction affects raw transcrip- tion accuracy under standard normalization, and (2) whether models actually comply with the requested preference when as- sessed with a preference-aware metric. 3.1. Preference-Aware Normalizer Standard WER pipeli...

[8] [8]

Standard

Evaluation Results Each model is evaluated under two settings:default(D), a stan- dard ASR prompt without any preference instruction, andin- structed(I), with a preference-specific instruction appended. 3https://huggingface.co/spaces/hf-audio/open_ asr_leaderboard Table 2:WER (%) on Preference-ASR. For each model,Std= standard WER with full normalization;...

[9] [9]

Conclusion We introduced Preference-ASR, a test set of 3,210 samples drawn from seven open-source corpora that evaluates whether ASR systems can follow explicit human preference instructions across four categories: normalization, entities, disfluencies, and case. Together with a preference-aware normalizer that selec- tively skips steps matching the activ...

[10] [10]

All content was reviewed and validated by the authors

Generative AI Use Disclosure Generative AI tools (LLMs) were used in two capacities: (1) as part of the dataset construction pipeline for preference classifi- cation and instruction generation, and (2) for editing and pol- ishing the manuscript. All content was reviewed and validated by the authors

[11] [11]

SALMONN: Towards generic hearing abilities for large language models,

C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “SALMONN: Towards generic hearing abilities for large language models,”Proc. International Conference on Learning Representations (ICLR), 2024

2024

[12] [12]

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Y . Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou, “Qwen-Audio: Advancing universal audio understand- ing via unified large-scale audio-language models,”arXiv preprint arXiv:2311.07919, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[13] [13]

Qwen2.5-Omni Technical Report

J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y . Fan, K. Dang, B. Zhang, X. Wang, Y . Chu, and J. Lin, “Qwen2.5-Omni technical report,”arXiv preprint arXiv:2503.20215, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[14] [14]

Lib- rispeech: An ASR corpus based on public domain audio books,

V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Lib- rispeech: An ASR corpus based on public domain audio books,” inProc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210

2015

[15] [15]

SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition,

P. K. O’Neill, V . Lavrukhin, S. Majumdar, V . Noroozi, Y . Zhang, O. Kuchaiev, J. Balam, and B. Ginsburg, “SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition,” inProc. INTERSPEECH, 2021, pp. 1434– 1438

2021

[16] [16]

Earnings-22: A practical benchmark for accents in the wild,

M. Del Rio, P. Ha, Q. McNamara, C. Miller, and S. Churi, “Earnings-22: A practical benchmark for accents in the wild,” in Proc. INTERSPEECH, 2022, pp. 4833–4837

2022

[17] [17]

GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed au- dio,

G. Chen, S. Chai, G. Wang, J. Du, W.-Q. Zhang, C. Weng, D. Su, D. Povey, J. Trmal, J. Zhanget al., “GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed au- dio,” inProc. INTERSPEECH, 2021, pp. 3670–3674

2021

[18] [18]

The ami meeting corpus,

W. Kraaij, T. Hain, M. Lincoln, and W. Post, “The ami meeting corpus,” inProc. International Conference on Methods and Tech- niques in Behavioral Research, 2005, pp. 1–4

2005

[19] C. Wang, M. Riviere, A. Lee, A. Wu, C. Talber, J. Joshi, and J. Pino, “VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation,” in Proc. Annual Meeting of the Association for Computational Linguistics (ACL), 2021, pp. 993–1003.

[20] R. Ardila, M. Branwen, L. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, “Common Voice: A massively-multilingual speech corpus,” in Proc. Language Resources and Evaluation Conference (LREC), 2020, pp. 4218–4222.

[21] Z. Meng, S. Parthasarathy, E. Sun, Y. Gaur, N. Kanda, L. Chen, Y. Zhao, J. Huang, Y. Gong, X. Zeng et al., “End-to-end rich transcription-style automatic speech recognition with semi-supervised learning,” arXiv preprint arXiv:2107.05382, 2021.

[22] G. Sakshi, Z. Sun, A. Alomrani et al., “MMAU: A holistic benchmark of agent capabilities across diverse domains,” arXiv preprint arXiv:2407.18961, 2024.

[23] K.-H. Lu, C.-Y. Kuan, and H.-y. Lee, “Speech-IFEval: Evaluating instruction-following and quantifying catastrophic forgetting in speech-aware language models,” 2025. [Online]. Available: https://arxiv.org/abs/2505.19037

[24] C.-I. J. Lai, Z. Lu, L. Cao, and R. Pang, “Instruction-following speech recognition,” arXiv preprint arXiv:2309.09843, 2023.

[25] Hugging Face, “Open ASR leaderboard,” https://huggingface.co/spaces/hf-audio/open_asr_leaderboard, 2024.

[26] R. Sproat and N. Jaitly, “RNN approaches to text normalization: A challenge,” arXiv preprint arXiv:1611.00068, 2016.

[27] G. Pundak and T. N. Sainath, “Deep context: End-to-end contextual speech recognition,” in Proc. IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 418–425.

[28] E. Shriberg, “Spontaneous speech: How people really talk and why engineers should care,” in Proc. INTERSPEECH, 2005, pp. 1781–1784.

[29] O. Tilk and T. Alumäe, “Bidirectional recurrent neural network with attention mechanism for punctuation restoration,” in Proc. INTERSPEECH, 2016, pp. 3047–3051.

[30] N. R. Koluguri, T. Bartley, H. Xu, O. Hrinchuk, J. Balam, B. Ginsburg, and G. Kucsko, “Longer is (not necessarily) stronger: Punctuated long-sequence training for enhanced speech recognition and translation,” in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 255–262.

[31] Qwen Team, “Qwen3 technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2505.09388

[32] Hugging Face, “open_asr_leaderboard/normalizer,” https://github.com/huggingface/open_asr_leaderboard/tree/main/normalizer, 2026, commit or access date: March 3, 2026.

[33] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proc. International Conference on Machine Learning (ICML), pp. 28 492–28 518, 2023.

[34] K. Gupta et al., “What is lost in normalization? Exploring pitfalls in multilingual ASR model evaluations,” in Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.

[35] M. Sekoyan, N. R. Koluguri, N. Tadevosyan, P. Zelasko, T. Bartley, N. Karpov, J. Balam, and B. Ginsburg, “Canary-1b-v2 & parakeet-tdt-0.6b-v3: Efficient and high-performance models for multilingual ASR and AST,” arXiv preprint arXiv:2509.14128, 2025.

[36] D. Rekesh, N. R. Koluguri, S. Kriman, S. Majumdar, V. Noroozi, H. Huang, O. Hrinchuk, K. Puvvada, A. Kumar, J. Balam et al., “Fast conformer with linearly scalable attention for efficient speech recognition,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8.

[37] H. Xu, F. Jia, S. Majumdar, H. Huang, S. Watanabe, and B. Ginsburg, “Efficient sequence transduction by jointly predicting tokens and durations,” in International Conference on Machine Learning. PMLR, 2023, pp. 38 462–38 484.

[38] A. Y. Qwen, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei et al., “Qwen2.5 technical report,” arXiv preprint, 2024.

[39] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., “LoRA: Low-rank adaptation of large language models,” ICLR, vol. 1, no. 2, p. 3, 2022.

[40] A. Abouelenin, A. Ashfaq, A. Atkinson, H. Awadalla, N. Bach et al., “Phi-4-Mini technical report: Compact yet powerful multimodal language models via mixture-of-LoRAs,” arXiv preprint arXiv:2503.01743, 2025.

[41] J. Xu, Z. Guo, J. He, H. Hu, Y. Chu, J. Lin et al., “Qwen3-Omni technical report,” arXiv preprint arXiv:2509.17765, 2025.

[42] NVIDIA, “NVIDIA NeMo Canary-Qwen-2.5B,” https://huggingface.co/nvidia/canary-qwen-2.5b, 2025.

Supplementary Material: Preference-ASR

Additional Preference Examples

Figure 3 in the main paper shows one example per preference category. Figure 4 lists a sample from each category along with the full instruction and expected preference text as used during evaluation.

Normalization: AMI, ITN, numbers & symbols
GT: “Um, and profit aim is fifty million Euros, which is uh”
Instr: Perform speech re...

Per-Dataset WER Breakdown

Table 2 in the main paper reports aggregate WER across all seven source datasets. Tables 3–6 provide the full per-dataset breakdown for each model, reporting both standard WER (Std, full normalization) and preference-aware WER (Pref, selective normalization that skips steps matching the active preference). D denotes a default pro...
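
The difference between the Std and Pref columns can be illustrated with a small sketch. This is not the paper's normalizer: the step names (`remove_fillers`, `lowercase`, `strip_punct`), the filler list, and the mapping from preference category to skipped steps are all hypothetical, chosen only to show how skipping the steps tied to an active preference can change WER.

```python
import re

# Hypothetical filler inventory; the actual normalizer's list is not given here.
FILLERS = {"um", "uh"}

def remove_fillers(text):
    # Drop filler words, ignoring case and trailing commas/periods.
    return " ".join(w for w in text.split() if w.lower().strip(",.") not in FILLERS)

def lowercase(text):
    return text.lower()

def strip_punct(text):
    # Keep word characters, whitespace, and apostrophes only.
    return re.sub(r"[^\w\s']", "", text)

# Full normalization applies every step in order (the "Std" setting).
STEPS = [("remove_fillers", remove_fillers),
         ("lowercase", lowercase),
         ("strip_punct", strip_punct)]

# Assumed mapping: which steps would erase the distinction a preference asks for.
SKIP_FOR_PREFERENCE = {
    "disfluency": {"remove_fillers"},
    "case": {"lowercase"},
}

def normalize(text, preference=None):
    """Apply all steps except those matching the active preference."""
    skip = SKIP_FOR_PREFERENCE.get(preference, set())
    for name, step in STEPS:
        if name not in skip:
            text = step(text)
    return " ".join(text.split())

def wer(reference, hypothesis):
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / max(len(ref), 1)

def score(reference, hypothesis, preference=None):
    """Std WER when preference is None, Pref WER otherwise."""
    return wer(normalize(reference, preference), normalize(hypothesis, preference))
```

Using the AMI ground truth quoted above as the reference and a hypothesis that silently drops "Um" and "uh", `score(ref, hyp)` is 0 under full normalization, while `score(ref, hyp, "disfluency")` is 2/11, because the two missing fillers now count as deletions against an 11-word reference.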