
arxiv: 1907.01372 · v1 · pith:CPLGYHLQnew · submitted 2019-07-01 · 📡 eess.AS · cs.LG · cs.SD · stat.ML

Improving Performance of End-to-End ASR on Numeric Sequences

Pith reviewed 2026-05-25 11:27 UTC · model grok-4.3

classification 📡 eess.AS · cs.LG · cs.SD · stat.ML
keywords end-to-end ASR · numeric sequences · TTS data augmentation · neural denormalization · on-device speech recognition · word error rate · spoken-to-written conversion

The pith

End-to-end ASR models reduce word error rates on long numeric sequences by up to a factor of eight by augmenting training with TTS data and replacing large FST denormalizers with a small neural network.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

End-to-end ASR models face an out-of-vocabulary problem when spoken numbers, such as prices or phone numbers, must be transcribed in written form, because those numeric sequences are often absent from the training data. Traditional systems solve this by training on spoken forms and then applying a large finite-state transducer to convert back to written form, but that transducer uses too much memory for on-device deployment. The paper demonstrates that generating extra numeric examples with text-to-speech and training a compact neural network to perform the spoken-to-written conversion improves accuracy on multiple numeric categories. The gains are largest on the longest sequences, where word error rate falls by as much as a factor of eight. Readers care because the method keeps the entire system inside the low-memory envelope required for phones and watches.

Core claim

Recognizing written-domain numeric utterances remains difficult for end-to-end models when numeric sequences are absent from training. Conventional pipelines address the issue by training on spoken-domain data and applying an FST verbalizer for denormalization, yet the verbalizer's memory footprint precludes its use in the on-device setting. Generating additional numeric training data with a text-to-speech system and substituting a small-footprint neural network for the FST verbalizer produces measurable gains across several numeric classes. The largest improvement occurs on the longest numeric sequences, where word error rate falls by up to a factor of eight.
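To make the augmentation step concrete, the sketch below samples random digit strings and pairs each written form with a digit-by-digit spoken form that could be handed to a TTS system. It is a minimal illustration under stated assumptions: the names `verbalize_digits` and `sample_numeric_pairs`, the length range, and the digit-by-digit reading are all hypothetical, and the paper's actual sampling and verbalization procedure is not reproduced here. Audio synthesis is deliberately left out.

```python
import random

# Spoken form of each digit; index i holds the word for digit i.
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def verbalize_digits(written):
    """Read a digit string one digit at a time, e.g. '415' -> 'four one five'."""
    return " ".join(DIGIT_WORDS[int(ch)] for ch in written)


def sample_numeric_pairs(count, min_len=3, max_len=10, seed=0):
    """Return `count` (written, spoken) pairs of random digit strings.

    Hypothetical sampler: lengths are drawn uniformly from
    [min_len, max_len]; a real system would sample from realistic
    numeric classes instead.
    """
    rng = random.Random(seed)  # seeded so the sample is reproducible
    pairs = []
    for _ in range(count):
        length = rng.randint(min_len, max_len)
        written = "".join(rng.choice("0123456789") for _ in range(length))
        pairs.append((written, verbalize_digits(written)))
    return pairs
```

Each spoken string would then be synthesized and added to the training set alongside its written or spoken target, depending on which domain the model is trained in.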

What carries the argument

A small-footprint neural network trained to map spoken-domain numeric output to written-domain form, used together with TTS-generated numeric utterances to augment the training set.
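The spoken-to-written mapping itself can be illustrated with the abstract's own example, "one dollar and twenty five cents" becoming "$1.25". The sketch below is a toy rule-based converter for small dollar amounts only. It is not the paper's method, which uses a learned neural network; the function names and the restricted grammar are assumptions made purely to show what the input and output of a denormalizer look like.

```python
# Toy spoken-to-written converter for small dollar amounts.
# Hypothetical illustration: the paper uses a learned neural network,
# not hand-written rules like these.

UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def words_to_int(words):
    """Sum number words below one hundred, e.g. ['twenty', 'five'] -> 25."""
    total = 0
    for word in words:
        if word in UNITS:
            total += UNITS[word]
        elif word in TENS:
            total += TENS[word]
        else:
            raise ValueError(f"not a number word: {word}")
    return total


def denormalize_price(spoken):
    """Map a spoken price such as 'one dollar and twenty five cents' to '$1.25'."""
    tokens = spoken.lower().split()
    # The currency word separates the dollar part from the cent part.
    dollar_idx = next(
        (i for i, t in enumerate(tokens) if t in ("dollar", "dollars")), None
    )
    if dollar_idx is None:
        raise ValueError("expected the word 'dollar' or 'dollars'")
    dollars = words_to_int(tokens[:dollar_idx])
    rest = [t for t in tokens[dollar_idx + 1:] if t not in ("and", "cent", "cents")]
    cents = words_to_int(rest) if rest else 0
    return f"${dollars}.{cents:02d}"
```

A rule set like this grows quickly once dates, times, phone numbers, and larger amounts are covered, which is the kind of footprint the paper aims to avoid by using a compact learned model.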

If this is right

  • Recognition accuracy improves on several distinct numeric classes such as prices, phone numbers, and dates.
  • Word error rate on the longest numeric sequences drops by as much as a factor of eight.
  • The entire pipeline remains compatible with the strict memory limits of on-device ASR because the neural denormalizer replaces the large FST component.
  • End-to-end models can now be trained to handle out-of-vocabulary numeric material without relying on external spoken-domain training pipelines.
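Since every claim in this list is stated in terms of word error rate, a short sketch of the metric may help. The code below computes WER as word-level edit distance divided by reference length, plus the ratio behind a "factor of N" statement. The function names are illustrative, and the WER values in the usage note are made up for the arithmetic; they are not results from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def reduction_factor(baseline_wer, improved_wer):
    """Return how many times smaller the improved WER is than the baseline."""
    if improved_wer <= 0:
        raise ValueError("improved WER must be positive")
    return baseline_wer / improved_wer
```

For example, a hypothetical baseline WER of 0.40 and an improved WER of 0.05 would correspond to a reduction factor of 8. A spoken-domain hypothesis scored against a written-domain reference, such as "i need one dollar" against "i need $1.25", is penalized even though the words were heard correctly, which is why denormalization matters for this metric.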

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same TTS-plus-neural-denormalizer pattern could be applied to other categories of rare tokens, such as proper names or technical terms, that also suffer from domain mismatch.
  • Placing the neural denormalizer inside the end-to-end model itself rather than as a post-processing step might further reduce latency on resource-constrained devices.
  • Measuring how performance changes when the TTS voices are drawn from a wider range of accents would test whether the current gains hold under more varied real-world conditions.

Load-bearing premise

The distribution of numeric utterances produced by the text-to-speech system is close enough to real user speech that models trained on the synthetic data will generalize to actual spoken input.

What would settle it

Evaluating the augmented model on a large set of real-user numeric utterances recorded in the target acoustic conditions and finding no reduction, or an increase, in word error rate relative to the unaugmented baseline would falsify the central claim.

Figures

Figures reproduced from arXiv: 1907.01372 by Cal Peyser, Hao Zhang, Tara N. Sainath, Zelin Wu.

Figure 1: Neural Denormer Architecture. T stands for trivial. N stands for non-trivial. S stands for start. C stands for continuation.

2.2.2. Tagging Layer. We define the “tagger” RNN as $s_i = \mathrm{RNN}_{\mathrm{tag}}(s_{i-1}, t_{i-1}, h_i)$, where $s = s_i, \ldots, s_I$ are hidden tagger states, with corresponding observations, i.e., tag sequence $t = t_i, \ldots, t_I$. Each tag $t_i$ is a joined tag in the cross-product set of {trivial, non-trivial} …

Original abstract
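The tag letters in the caption suggest a small joined tag set, which the sketch below builds as a cross product and then uses to group words into segments. Two things here are assumptions rather than statements from the source: the excerpt is truncated after {trivial, non-trivial}, so pairing it with {start, continuation} is inferred from the caption, and the reading that an S tag opens a new segment while a C tag extends the previous one is a guess at the intended semantics. The example tags are hand-written, not model output.

```python
from itertools import product

KINDS = ("T", "N")      # trivial, non-trivial (from the caption)
POSITIONS = ("S", "C")  # start, continuation (from the caption)

# Joined tags such as "TS" or "NC"; assumed form of the cross-product set.
TAGS = ["".join(pair) for pair in product(KINDS, POSITIONS)]


def group_segments(words, tags):
    """Group words into (kind, [words]) segments.

    Assumed semantics: a tag ending in 'S' opens a new segment and a tag
    ending in 'C' extends the most recent one.
    """
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    segments = []
    for word, tag in zip(words, tags):
        kind, position = tag[0], tag[1]
        if position == "S" or not segments:
            segments.append((kind, [word]))
        else:
            segments[-1][1].append(word)
    return segments
```

Under this reading, tagging "i need one dollar and twenty five cents" as TS TC NS NC NC NC NC NC yields one trivial segment that can be copied through unchanged and one non-trivial segment that would be passed to the spoken-to-written model.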

Recognizing written domain numeric utterances (e.g. I need $1.25.) can be challenging for ASR systems, particularly when numeric sequences are not seen during training. This out-of-vocabulary (OOV) issue is addressed in conventional ASR systems by training part of the model on spoken domain utterances (e.g. I need one dollar and twenty five cents.), for which numeric sequences are composed of in-vocabulary numbers, and then using an FST verbalizer to denormalize the result. Unfortunately, conventional ASR models are not suitable for the low memory setting of on-device speech recognition. E2E models such as RNN-T are attractive for on-device ASR, as they fold the AM, PM and LM of a conventional model into one neural network. However, in the on-device setting the large memory footprint of an FST denormer makes spoken domain training more difficult. In this paper, we investigate techniques to improve E2E model performance on numeric data. We find that using a text-to-speech system to generate additional numeric training data, as well as using a small-footprint neural network to perform spoken-to-written domain denorming, yields improvement in several numeric classes. In the case of the longest numeric sequences, we see reduction of WER by up to a factor of 8.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that for on-device E2E ASR (e.g., RNN-T), augmenting training with TTS-generated numeric utterances and replacing a large FST verbalizer with a small-footprint neural denormalizer improves recognition of written-domain numeric sequences, yielding WER reductions of up to a factor of 8 on the longest numeric classes.

Significance. If the reported gains prove robust, the approach would be significant for memory-constrained on-device ASR by addressing numeric OOV without relying on large FST components; the combination of data augmentation and compact denorming directly targets a practical deployment constraint.

major comments (2)
  1. [Abstract] The claim of up to 8x WER reduction on the longest sequences is presented without any baseline WER values, dataset sizes, model sizes, error bars, or ablation results, so it is impossible to determine whether the data support the stated improvement.
  2. [Abstract] No details are supplied on numeric-sequence sampling for TTS, acoustic conditions modeled by the TTS system, speaker variability, or any side-by-side comparison of TTS-generated versus real numeric test utterances. This leaves the central assumption, that TTS data sufficiently approximates the target user-speech distribution, unverified and load-bearing for the reported gains.
minor comments (1)
  1. [Abstract] The abstract refers to 'several numeric classes' without defining the classes or providing a table that would allow readers to assess per-class results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the comments on the abstract. The full manuscript contains the supporting details and experiments; we address each point below and indicate where revisions to the abstract are feasible.

Point-by-point responses
  1. Referee: [Abstract] The claim of up to 8x WER reduction on the longest sequences is presented without any baseline WER values, dataset sizes, model sizes, error bars, or ablation results, so it is impossible to determine whether the data support the stated improvement.

    Authors: The abstract is a concise summary and therefore omits these specifics, but the full paper reports baseline WER values and the factor-of-8 improvement on the longest numeric class in Table 2, training-set sizes and TTS augmentation volumes in Section 3, model sizes in Section 2.1, and ablation results comparing TTS data and the neural denormalizer in Section 4. Error bars are not included because all runs used fixed random seeds. We can add a brief statement of the key baseline and improved WER numbers to the abstract if space permits.

    Revision: partial

  2. Referee: [Abstract] No details are supplied on numeric-sequence sampling for TTS, acoustic conditions modeled by the TTS system, speaker variability, or any side-by-side comparison of TTS-generated versus real numeric test utterances. This leaves the central assumption, that TTS data sufficiently approximates the target user-speech distribution, unverified and load-bearing for the reported gains.

    Authors: Section 3.1 describes the numeric-sequence sampling procedure used to generate the TTS utterances. Section 3.2 specifies the acoustic conditions and speaker variability (multiple TTS voices) modeled by the TTS system. All reported WER numbers are measured on real user utterances; the performance gains on those real test sets (Section 4) therefore serve as the empirical verification that the TTS-augmented training distribution is sufficiently close to the target domain.

    Revision: no

Circularity Check

0 steps flagged

No circularity: empirical gains from external TTS data and independent NN denormer

Full rationale

The paper reports WER improvements from augmenting training with TTS-generated numeric utterances and replacing an FST verbalizer with a small-footprint neural denormalizer. No equations, fitted parameters, or predictions are defined in terms of the reported results. The approach relies on external TTS synthesis and a separate neural component whose accuracy is evaluated independently on real data. No self-citation chains, ansatzes, or renamings are load-bearing. The derivation is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no mathematical derivations, free parameters, axioms, or invented entities are described.


Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 9 internal anchors

  1. [1]

    Improving Performance of End-to-End ASR on Numeric Sequences

    Introduction. An ongoing challenge of ASR systems is to model transcriptions that do not exactly reflect the words spoken in an utterance. For example, the spoken utterance “set an alarm for four fifteen” is typically decoded in the written form as “set an alarm for 4:15”. Numeric utterances, such as addresses, phone numbers, and postal codes are parti...

  2. [2]

    tagger” RNN to run on the input sequence before the sequence-to-sequence model. The tagger tags each word in the input sequence as either “trivial

    Methods. In this section, we present different ideas explored to address the long-tail numeric issue of our RNN-T system. We give each approach a label, which we will reference in Table 2 below. 2.1. TTS Training Data (W1). To address the numeric data-sparsity issue, we generate additional training data that represent challenging and realistic numeric ...

  3. [3]

    22110” might be verbalized as “double two double one oh

    Experiments. 3.1. Data Sets. Our experiments are conducted on a ∼30,000 hour training set consisting of 43 million English utterances. The training utterances are anonymized and hand-transcribed, and are representative of Google's voice search traffic in the United States. Multi-style training (MTR) data are created by artificially corrupting the clean ut...

  4. [4]

    $180.50 into inr

    Results. Table 2 gives WER results for each of our experiments on the SAMPLED and TAIL test sets, as well as the real-audio VS and NUMERICS test sets. We use the labels given in Section 2 (W1, W2, S1, S2) for convenience. We use W0 to refer to the baseline RNN-T model. The results for the written domain models are characterized by a steep decline in quality...

  5. [5]

    Conclusions. In this paper, we experimented with four approaches for improving end-to-end ASR performance on numeric utterances. We found that all approaches yield improvements, with the largest improvements occurring when TTS training data, spoken domain training, and neural denorming are all used together. The fact that we see the largest improveme...

  6. [6]

    Acknowledgements. We thank Gabriel Mechali, Mark Epstein, Michael Riley, and Richard Sproat for help and comments on this work.

  7. [7]

    Formatting time-aligned ASR transcripts for readability,

    M. Shugrina, “Formatting time-aligned ASR transcripts for readability,” in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, ser. HLT ’10. Stroudsburg, PA, USA: Association for Computational Linguistics, 2010, pp. 198–206. [Online]. Available: http://dl.acm.org/citation.c...

  8. [8]

    Language model verbalization for automatic speech recognition,

    H. Sak, C. Allauzen, K. Nakajima, and F. Beaufay, “Language model verbalization for automatic speech recognition,” in Proc. ICASSP, 2013

  9. [9]

    Query Language Modeling for Voice Search,

    C. Chelba, “Query Language Modeling for Voice Search,” in Proc. IEEE Workshop on Spoken Language Technology, 2010

  10. [10]

    Sequence-based class tagging for robust transcription in asr,

    K. H. Lucy Vasserman, Vlad Schogol, “Sequence-based class tagging for robust transcription in asr,” in INTERSPEECH, 2015

  11. [11]

    Neural models of text normalization for speech applications,

    H. Zhang, R. Sproat, A. Ng, F. Stahlberg, X. Peng, K. Gorman, and B. Roark, “Neural models of text normalization for speech applications,” Computational Linguistics, vol. 45, no. 2, 2019

  12. [12]

    Streaming End-to-end Speech Recognition For Mobile Devices,

    Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, Q. Liang, D. Bhatia, Y. Shangguan, B. Li, G. Pundak, K. Sim, T. Bagby, S. Chang, K. Rao, and A. Gruenstein, “Streaming End-to-end Speech Recognition For Mobile Devices,” 2019

  13. [13]

    State-of-the-art speech recognition with sequence-to-sequence models,

    C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, N. Jaitly, B. Li, and J. Chorowski, “State-of-the-art speech recognition with sequence-to-sequence models,” in Proc. ICASSP, 2018

  14. [14]

    Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition

    M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, “Synthetic data and artificial neural networks for natural scene text recognition,” Conference on Neural Information Processing Systems, vol. abs/1406.2227, 2014. [Online]. Available: http://arxiv.org/abs/1406.2227

  15. [15]

    Synthetic Data for Text Localisation in Natural Images

    A. Gupta, A. Vedaldi, and A. Zisserman, “Synthetic data for text localisation in natural images,” Computer Vision and Pattern Recognition, vol. abs/1604.06646, 2016. [Online]. Available: http://arxiv.org/abs/1604.06646

  16. [16]

    Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization

    J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, and S. Birchfield, “Training deep networks with synthetic data: Bridging the reality gap by domain randomization,” Conference on Computer Vision and Pattern Recognition, vol. abs/1804.06516, 2018. [Online]. Available: http://arxiv.org/abs/1804.06516

  17. [17]

    Streaming End-to-end Speech Recognition For Mobile Devices

    Y. He, T. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, Q. Liang, D. Bhatia, Y. Shangguan, B. Li, G. Pundak, K. C. Sim, T. Bagby, S. yiin Chang, K. Rao, and A. Gruenstein, “Streaming end-to-end speech recognition for mobile devices,” 2019. [Online]. Available: https://arxiv.org/abs/1811.06621

  18. [18]

    A neural probabilistic language model,

    Y. Bengio, R. Ducharme, and P. Vincent, “A neural probabilistic language model,” in Proceedings of the 13th International Conference on Neural Information Processing Systems, ser. Conference on Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2000, pp. 893–899. [Online]. Available: http://dl.acm.org/citation.cfm?id=3008751.3008881

  19. [19]

    Lstm neural networks for language modeling,

    M. Sundermeyer, R. Schlüter, and H. Ney, “Lstm neural networks for language modeling,” 09 2012

  20. [20]

    Recurrent neural network based language modeling in meeting recognition,

    S. Kombrink, T. Mikolov, M. Karafiát, and L. Burget, “Recurrent neural network based language modeling in meeting recognition,” in INTERSPEECH, 2011

  21. [21]

    Multi-domain recurrent neural network language model for medical speech recognition,

    O. Tilk and T. Alumäe, “Multi-domain recurrent neural network language model for medical speech recognition,” 09 2014

  22. [22]

    A Spelling Correction Model for End-to-End Speech Recognition,

    J. Guo, T. N. Sainath, and R. J. Weiss, “A Spelling Correction Model for End-to-End Speech Recognition,” 2019

  23. [23]

    Neural error corrective language models for automatic speech recognition,

    T. Tanaka, R. Masumura, H. Masataki, and Y. Aono, “Neural error corrective language models for automatic speech recognition,” in Proc. Interspeech 2018, 2018, pp. 401–405. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-1430

  24. [24]

    RNN Approaches to Text Normalization: A Challenge

    R. Sproat and N. Jaitly, “RNN approaches to text normalization: A challenge,” arXiv preprint, vol. abs/1611.00068, 2016. [Online]. Available: http://arxiv.org/abs/1611.00068

  25. [25]

    Generation of large-scale simulated utterances in virtual rooms to train deep-neural networks for far-field speech recognition in Google Home,

    C. Kim, A. Misra, K. Chin, T. Hughes, A. Narayanan, T. N. Sainath, and M. Bacchiani, “Generation of large-scale simulated utterances in virtual rooms to train deep-neural networks for far-field speech recognition in Google Home,” in Proc. Interspeech, 2017

  26. [26]

    Hierarchical generative modeling for controllable speech synthesis,

    W.-N. Hsu, Y. Zhang, R. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao, Y. Jia, Z. Chen, J. Shen, P. Nguyen, and R. Pang, “Hierarchical generative modeling for controllable speech synthesis,” in Proc. ICLR, 2019, to appear, 2019

  27. [27]

    Tacotron: Towards End-to-End Speech Synthesis

    Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. V. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: A fully end-to-end text-to-speech synthesis model,” Proc. Interspeech, 2017. [Online]. Available: http://arxiv.org/abs/1703.10135

  28. [28]

    Efficient Neural Audio Synthesis

    N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient neural audio synthesis,” ICML, vol. abs/1802.08435, 2018. [Online]. Available: http://arxiv.org/abs/1802.08435

  29. [30]

    Parallel WaveNet: Fast High-Fidelity Speech Synthesis

    [Online]. Available: http://arxiv.org/abs/1711.10433

  30. [31]

    Recent advances in google real-time hmm-driven unit selection synthesizer,

    X. Gonzalvo, S. Tazari, C. an Chan, M. Becker, A. Gutkin, and H. Silen, “Recent advances in google real-time hmm-driven unit selection synthesizer,” in Proc. Interspeech, 2016

  31. [32]

    Tensorflow: Large-scale machine learning on heterogeneous distributed systems,

    M. Abadi et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” Available online: http://download.tensorflow.org/paper/whitepaper2015.pdf, 2015