LSTM Language Models for LVCSR in First-Pass Decoding and Lattice-Rescoring

Eugen Beck; Hermann Ney; Ralf Schlüter; Wei Zhou

arXiv: 1907.01030 · v1 · pith:KXSDCUSVnew · submitted 2019-07-01 · 📡 eess.AS · cs.LG · cs.SD · stat.ML

Pith reviewed 2026-05-25 11:06 UTC · model grok-4.3

classification 📡 eess.AS · cs.LG · cs.SD · stat.ML

keywords LSTM language model · LVCSR · first-pass decoding · lattice rescoring · hypothesis recombination · speech recognition · GPGPU runtime

The pith

LSTM language models can be used in first-pass LVCSR decoding by recombining hypotheses that share the last two words before lattice rescoring.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a practical way to incorporate LSTM language models into the initial decoding stage of large-vocabulary continuous speech recognition systems. Hypotheses are merged when they end with the same two words, limiting the state that must be tracked from the recurrent model. The resulting lattice is then rescored with the full LSTM. This produces competitive word error rates on the Hub5'00 and Librispeech test sets while running faster than real time on GPU hardware. The work also briefly examines carrying out the full sum over all state sequences that belong to a given word hypothesis during decoding, without recombination.

Core claim

Performing first-pass decoding with an LSTM language model, recombining any hypotheses that share the last two words, and then rescoring the resulting lattice yields competitive recognition accuracy on Hub5'00 and Librispeech with better-than-real-time runtime on GPGPU machines.
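The runtime part of this claim is expressed through the real time factor (RTF). The paper excerpt quoted in the reference graph describes it as total wall-clock processing time divided by total audio duration, so "better than real time" corresponds to an RTF below 1. A minimal sketch of that ratio; the function name and the timings are hypothetical and not taken from the paper:

```python
def real_time_factor(wallclock_seconds: float, audio_seconds: float) -> float:
    """Return wall-clock processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return wallclock_seconds / audio_seconds


# Hypothetical numbers: one hour of audio processed in 30 minutes.
rtf = real_time_factor(wallclock_seconds=1800.0, audio_seconds=3600.0)
faster_than_real_time = rtf < 1.0
```

Which costs count toward the wall-clock time (feature loading, acoustic model forwarding, rescoring) changes the number, so any comparison of RTF values needs the same accounting on both sides.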

What carries the argument

Recombination of hypotheses sharing the last two words, which approximates LSTM state during beam search so that first-pass decoding remains tractable before full lattice rescoring.
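The recombination rule can be illustrated with a short sketch. This is not the authors' decoder: the `Hypothesis` class, the `recombine` function, and the toy scores are assumptions made for illustration, and a real decoder would also condition recombination on the position in the search network and keep merged paths as lattice arcs.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Hypothesis:
    words: Tuple[str, ...]  # full word history of this partial path
    score: float            # accumulated log-probability (higher is better)
    lm_state: object        # opaque LSTM state carried by the path (hypothetical)


def recombine(hyps: List[Hypothesis], limit: int = 2) -> List[Hypothesis]:
    """Keep the best-scoring hypothesis for each distinct suffix of `limit` words."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    best: Dict[Tuple[str, ...], Hypothesis] = {}
    for hyp in hyps:
        key = hyp.words[-limit:]  # the last `limit` words act as the merge key
        kept = best.get(key)
        if kept is None or hyp.score > kept.score:
            best[key] = hyp  # the survivor's LSTM state is the one carried forward
    return list(best.values())


hyps = [
    Hypothesis(("i", "want", "to", "go"), -4.1, "state_a"),
    Hypothesis(("we", "want", "to", "go"), -3.7, "state_b"),
    Hypothesis(("i", "want", "two", "go"), -5.0, "state_c"),
]
survivors = recombine(hyps)
```

In this toy case the first two hypotheses share the suffix ("to", "go"), so only the better-scoring one survives together with its LSTM state. The approximation lies in that choice: the discarded path had a different LSTM state that would have produced different future scores.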

If this is right

Competitive word error rates on the Hub5'00 and Librispeech evaluation sets.
Runtime faster than real time when executed on GPGPU hardware.
The full sum over all state sequences belonging to a given word hypothesis can be explored during decoding without recombination.
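The difference between keeping only the best state sequence and summing over all of them can be sketched in log space. This assumes that "full sum" means adding the probabilities of every state sequence of a word hypothesis instead of taking the maximum; the text reproduced here does not spell out the details, and the function names and scores below are illustrative only.

```python
import math
from typing import Sequence


def best_path_score(path_log_probs: Sequence[float]) -> float:
    """Maximum approximation: score a word hypothesis by its best state sequence."""
    return max(path_log_probs)


def full_sum_score(path_log_probs: Sequence[float]) -> float:
    """Full sum: log of the summed probabilities of all state sequences."""
    m = max(path_log_probs)
    # Subtract the maximum before exponentiating to avoid underflow.
    return m + math.log(sum(math.exp(lp - m) for lp in path_log_probs))


# Hypothetical log-probabilities of three state sequences for one word hypothesis.
paths = [-10.0, -10.5, -12.0]
approx = best_path_score(paths)
exact = full_sum_score(paths)
```

The full sum is never smaller than the best-path score, and the gap grows when many state sequences carry similar probability mass.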

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The two-word recombination rule may generalize to other long-context recurrent language models used in first-pass search.
Systems that already rely on lattice rescoring could adopt this first-pass LSTM stage to reduce overall latency without changing the final rescoring step.
The method could be tested on other LVCSR benchmarks to measure how much the recombination window size affects the gap between first-pass and rescored error rates.

Load-bearing premise

Recombining hypotheses that share only the last two words keeps enough LSTM state information that the first-pass search does not discard high-scoring paths whose final accuracy would suffer after rescoring.

What would settle it

Running the identical first-pass decoder once with two-word recombination and once with full LSTM state tracking on the same Hub5'00 or Librispeech audio, then comparing the word error rates of the two resulting lattices after identical rescoring.
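Such a comparison comes down to scoring two sets of transcripts against the same references. A minimal sketch of word error rate as word-level edit distance divided by reference length; the function name and the example sentences are illustrative, and the scoring tools actually used in the paper are not described in the text reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts from two decoder configurations.
reference = "the cat sat on the mat"
wer_a = word_error_rate(reference, "the cat sat on mat")
wer_b = word_error_rate(reference, "the cat sat on the mat")
```

Corpus-level WER is normally computed from error counts summed over all segments rather than by averaging per-utterance rates.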

Figures

Figures reproduced from arXiv: 1907.01030 by Eugen Beck, Hermann Ney, Ralf Schlüter, Wei Zhou.

**Figure 2.** Time to process one batch for one step with the Switchboard LM, divided by the number of histories in the batch.

**Figure 5.** Comparison of first-pass recognition, lattice rescoring of backoff-model-based lattices, and lattice rescoring of LSTM-LM-based lattices for the Librispeech dev-clean corpus.

read the original abstract

LSTM based language models are an important part of modern LVCSR systems as they significantly improve performance over traditional backoff language models. Incorporating them efficiently into decoding has been notoriously difficult. In this paper we present an approach based on a combination of one-pass decoding and lattice rescoring. We perform decoding with the LSTM-LM in the first pass but recombine hypothesis that share the last two words, afterwards we rescore the resulting lattice. We run our systems on GPGPU equipped machines and are able to produce competitive results on the Hub5'00 and Librispeech evaluation corpora with a runtime better than real-time. In addition we shortly investigate the possibility to carry out the full sum over all state-sequences belonging to a given word-hypothesis during decoding without recombination.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that LSTM LMs can be incorporated into LVCSR via first-pass decoding that recombines hypotheses sharing the last two words, followed by lattice rescoring; this yields competitive results on Hub5'00 and Librispeech with better-than-real-time runtime on GPGPU hardware, while a brief investigation of full summation over state sequences without recombination is also mentioned.

Significance. If the performance claims hold, the work supplies a practical engineering route for deploying strong LSTM LMs inside the first pass rather than only in rescoring, which remains a relevant systems contribution for real-time LVCSR. The use of public corpora and the explicit comparison of recombined versus full-summation decoding are strengths.

major comments (2)

[Abstract / decoding description] The recombination rule (hypotheses sharing only the last two words) is load-bearing for the efficiency claim, yet the manuscript supplies no analysis showing that distinct LSTM hidden states arising from different longer histories are sufficiently similar that high-scoring paths are not lost before rescoring; this directly affects whether the produced lattices remain adequate for the subsequent rescoring step.
[Abstract / evaluation claims] The central empirical claim of 'competitive results' and 'runtime better than real-time' is stated without any numeric WER values, baseline comparisons, or error-bar information in the provided text; without these data the performance assertions cannot be assessed.

minor comments (1)

[Abstract] The phrase 'we shortly investigate' the full-sum case appears in the abstract but no quantitative outcome or section reference is supplied.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below.

Referee: [Abstract / decoding description] The recombination rule (hypotheses sharing only the last two words) is load-bearing for the efficiency claim, yet the manuscript supplies no analysis showing that distinct LSTM hidden states arising from different longer histories are sufficiently similar that high-scoring paths are not lost before rescoring; this directly affects whether the produced lattices remain adequate for the subsequent rescoring step.

Authors: The recombination on the last two words is a standard approximation used to control decoder state space growth when incorporating LSTM LMs. The manuscript does not contain an explicit quantitative analysis of hidden-state similarity or path loss under this rule. The subsequent full LSTM lattice rescoring is intended to recover quality, but we agree an added discussion of the approximation would strengthen the paper. We will include such a discussion in revision. revision: partial

Referee: [Abstract / evaluation claims] The central empirical claim of 'competitive results' and 'runtime better than real-time' is stated without any numeric WER values, baseline comparisons, or error-bar information in the provided text; without these data the performance assertions cannot be assessed.

Authors: The abstract is a high-level summary; the full manuscript (Section 4 and tables) reports the concrete WER numbers on Hub5'00 and Librispeech, baseline comparisons, and GPGPU runtime figures. To make the claims self-contained we will incorporate the key numeric results into the abstract during revision. revision: yes

Circularity Check

0 steps flagged

Empirical systems result with no derivation chain

The paper presents an engineering method for first-pass LSTM-LM decoding with recombination on the last two words followed by lattice rescoring, then reports runtime and WER on Hub5'00 and Librispeech. No equations, fitted parameters, or theorems are claimed to derive a result; the central claims are measured outcomes on public data. No self-citation load-bearing step, uniqueness theorem, or ansatz is invoked to justify the method. The work is self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on standard domain assumptions in automatic speech recognition about the sufficiency of limited-history recombination for neural language model state and on the availability of GPU hardware for real-time inference. No free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: Merging hypotheses that share the last two words preserves enough LSTM state information to avoid major search errors.
This premise is required for the first-pass recombination step to be valid.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 1 internal anchor

[1]

LSTMs thus supersede traditional backoff-models which are based on word counts

Introduction. In recent years, language models (LMs) based on long short-term memory (LSTM) neural networks have become an integral part of many state-of-the-art automatic speech recognition systems [1, 2, 3]. LSTMs thus supersede traditional backoff-models which are based on word counts. For count based models relative frequencies of word n-grams are...

[2]

In this section we give a short overview of how other researchers have dealt with this problem

Related Work. Using Neural Network based Language Models (NN-LMs) in Decoding is computationally more expensive than using backoff Language Models. In this section we give a short overview of how other researchers have dealt with this problem. Early approaches of introducing NN-LMs into decoding include some form of conversion to a more traditional ba...

[3]

The authors of [7] trained feed-forward LMs for different orders and extracted the probabilities for the backoff LM directly from the neural network

the continuous states of an RNN-LM are discretized to create a weighted finite state transducer. The authors of [7] trained feed-forward LMs for different orders and extracted the probabilities for the backoff LM directly from the neural network. [8] compares different techniques for conversion and [9] uses these techniques to investigate conversion ...

[4]

The LSTM Units are replaced with GRUs, NCE replaces the hierarchical softmax and GRU states are quantized to reduce the number of necessary computations

are extended in [26]. The LSTM Units are replaced with GRUs, NCE replaces the hierarchical softmax and GRU states are quantized to reduce the number of necessary computations

[5]

Implementation. For this work we extended the decoder of the RWTH ASR toolkit, described extensively in [27]. The decoder uses tree-conditioned search, which differs from the more common HCLG-based decoder in that we do not do static composition of the grammar WFST with the rest of the search network. Instead hypotheses from the HCL part of the decoder...

[6]

Experiments. 4.1. Hardware and Measurement Methodology. Each node used for our experiments has two sockets with Intel Xeon E5-2620 v4 CPUs with a base-clock speed of 2.1 GHz and 4 Nvidia GeForce 1080 Ti GPUs. Unless stated otherwise, our decoder ran in a single thread. The TensorFlow runtime spawns more threads as it sees fit. As we are primarily using the GPU ...

[7]

This includes loading features from disk, forwarding them through the acoustic model and decoding / rescoring

To compute the real time factor (RTF) we measure the total wallclock time required by the recognizer/rescorer to process all segments within the corpus and divide it by the total duration. This includes loading features from disk, forwarding them through the acoustic model and decoding / rescoring. Startup time is not included. Features are not extra...

[8]

We have shown that first using the LSTM LM with a small recombination limit and doing lattice rescoring afterwards yields the most efficient decoding process

Conclusions. In this paper we have shown how to use LSTM-LMs in decoding using a GPGPU. We have shown that first using the LSTM LM with a small recombination limit and doing lattice rescoring afterwards yields the most efficient decoding process. This approach yields a WER of 11.7% on the Hub5'00 task at an RTF of 1. Further work is required for system...

[9]

Acknowledgments. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No 694537, project "SEQCLAS") and from the European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 644283. The work r...

[10]

The Microsoft 2017 conversational speech recognition system

W. Xiong, L. Wu, F. Alleva, J. Droppo, X. Huang, and A. Stolcke, “The Microsoft 2017 conversational speech recognition system,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018, pp. 5934–5938

[11]

English conversational telephone speech recognition by humans and machines

G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas, D. Dimitriadis, X. Cui, B. Ramabhadran, M. Picheny, L. Lim, B. Roomi, and P. Hall, “English conversational telephone speech recognition by humans and machines,” in Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 2...

[12]

The CAPIO 2017 Conversational Speech Recognition System

K. J. Han, A. Chandrashekaran, J. Kim, and I. R. Lane, “The CAPIO 2017 conversational speech recognition system,” CoRR, vol. abs/1801.00059, 2018

[13]

Improved backing-off for m-gram language modeling

R. Kneser and H. Ney, “Improved backing-off for m-gram language modeling,” in 1995 International Conference on Acoustics, Speech, and Signal Processing, vol. 1, May 1995, pp. 181–184

[14]

Variational approximation of long-span language models for LVCSR

A. Deoras, T. Mikolov, S. Kombrink, M. Karafiát, and S. Khudanpur, “Variational approximation of long-span language models for LVCSR,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2011, pp. 5532–5535

[15]

Conversion of recurrent neural network language models to weighted finite state transducers for automatic speech recognition

G. Lecorvé and P. Motlíček, “Conversion of recurrent neural network language models to weighted finite state transducers for automatic speech recognition,” in INTERSPEECH 2012, 13th Annual Conference of the International Speech Communication Association, Portland, Oregon, USA, September 9-13, 2012. ISCA, 2012, pp. 1668–1671

[16]

Converting neural network language models into back-off language models for efficient decoding in automatic speech recognition

E. Arisoy, S. F. Chen, B. Ramabhadran, and A. Sethy, “Converting neural network language models into back-off language models for efficient decoding in automatic speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 1, pp. 184–192, Jan 2014

[17]

Comparing approaches to convert recurrent neural networks into backoff language models for efficient decoding

H. Adel, K. Kirchhoff, N. T. Vu, D. Telaar, and T. Schultz, “Comparing approaches to convert recurrent neural networks into backoff language models for efficient decoding,” in INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 14-18, 2014, H. Li, H. M. Meng, B. Ma, E. Chng, and L. Xie,...

[18]

Approximated and domain-adapted LSTM language models for first-pass decoding in speech recognition

M. Singh, Y. Oualil, and D. Klakow, “Approximated and domain-adapted LSTM language models for first-pass decoding in speech recognition,” in Proc. Interspeech 2017. ISCA, 2017, pp. 2720–2724

[19]

Noise-contrastive estimation: A new estimation principle for unnormalized statistical models

M. Gutmann and A. Hyvärinen, “Noise-contrastive estimation: A new estimation principle for unnormalized statistical models,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010, ser. JMLR Proceedings, Y. W. Teh and D. M. Titteringt...

[20]

Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics

M. Gutmann and A. Hyvärinen, “Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics,” Journal of Machine Learning Research, vol. 13, pp. 307–361, 2012

[21]

Recurrent neural network language model training with noise contrastive estimation for speech recognition

X. Chen, X. Liu, M. J. F. Gales, and P. C. Woodland, “Recurrent neural network language model training with noise contrastive estimation for speech recognition,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2015, pp. 5411–5415

[22]

Unnormalized exponential and neural network language models

A. Sethy, S. Chen, E. Arisoy, and B. Ramabhadran, “Unnormalized exponential and neural network language models,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2015, pp. 5416–5420

[23]

Fast neural network language model lookups at n-gram speeds

Y. Huang, A. Sethy, and B. Ramabhadran, “Fast neural network language model lookups at n-gram speeds,” in Proc. Interspeech 2017, 2017, pp. 274–278

[24]

A fast re-scoring strategy to capture long-distance dependencies

A. Deoras, T. Mikolov, and K. Church, “A fast re-scoring strategy to capture long-distance dependencies,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, ser. EMNLP ’11. Stroudsburg, PA, USA: Association for Computational Linguistics, 2011, pp. 1116–1127

[25]

Efficient lattice rescoring using recurrent neural network language models

X. Liu, Y. Wang, X. Chen, M. J. F. Gales, and P. C. Woodland, “Efficient lattice rescoring using recurrent neural network language models,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 4908–4912

[26]

Lattice decoding and rescoring with long-span neural network language models

M. Sundermeyer, Z. Tüske, R. Schlüter, and H. Ney, “Lattice decoding and rescoring with long-span neural network language models,” in INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 14-18, 2014, H. Li, H. M. Meng, B. Ma, E. Chng, and L. Xie, Eds. ISCA, 2014, pp. 661–665

[27]

Two efficient lattice rescoring methods using recurrent neural network language models

X. Liu, X. Chen, Y. Wang, M. J. F. Gales, and P. C. Woodland, “Two efficient lattice rescoring methods using recurrent neural network language models,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 8, pp. 1438–1449, Aug 2016

[28]

Lattice rescoring strategies for long short term memory language models in speech recognition

S. Kumar, M. Nirschl, D. N. Holtmann-Rice, H. Liao, A. T. Suresh, and F. X. Yu, “Lattice rescoring strategies for long short term memory language models in speech recognition,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017, Okinawa, Japan, December 16-20, 2017. IEEE, 2017, pp. 165–172

[29]

A pruned RNNLM lattice-rescoring algorithm for automatic speech recognition

H. Xu, T. Chen, D. Gao, Y. Wang, K. Li, N. Goel, Y. Carmiel, D. Povey, and S. Khudanpur, “A pruned RNNLM lattice-rescoring algorithm for automatic speech recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018, pp. 5929–5933

[30]

Cache based recurrent neural network language model inference for first pass speech recognition

Z. Huang, G. Zweig, and B. Dumoulin, “Cache based recurrent neural network language model inference for first pass speech recognition,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 6354–6358

[31]

Real-time one-pass decoding with recurrent neural network language model for speech recognition

T. Hori, Y. Kubo, and A. Nakamura, “Real-time one-pass decoding with recurrent neural network language model for speech recognition,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 6364–6368

[32]

Applying GPGPU to recurrent neural network language model based fast network search in the real-time LVCSR

K. Lee, C. Park, I. Kim, N. Kim, and J. Lee, “Applying GPGPU to recurrent neural network language model based fast network search in the real-time LVCSR,” in Proc. Interspeech 2015. ISCA, 2015, pp. 2102–2106

[33]

Hierarchical probabilistic neural network language model

F. Morin and Y. Bengio, “Hierarchical probabilistic neural network language model,” in Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, AISTATS 2005, Bridgetown, Barbados, January 6-8, 2005, R. G. Cowell and Z. Ghahramani, Eds. Society for Artificial Intelligence and Statistics, 2005

[34]

Strategies for training large scale neural network language models

T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. Černocký, “Strategies for training large scale neural network language models,” in 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, ASRU 2011, Waikoloa, HI, USA, December 11-15, 2011, D. Nahamoo and M. Picheny, Eds. IEEE, 2011, pp. 196–201

[35]

Accelerating recurrent neural network language model based online speech recognition system

K. Lee, C. Park, N. Kim, and J. Lee, “Accelerating recurrent neural network language model based online speech recognition system,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018. IEEE, 2018, pp. 5904–5908

[36]

Progress in decoding for large vocabulary continuous speech recognition

D. Nolden, “Progress in decoding for large vocabulary continuous speech recognition,” Ph.D. dissertation, RWTH Aachen University, Computer Science Department, RWTH Aachen University, Aachen, Germany, Apr. 2017

[37]

Librispeech: an ASR corpus based on public domain audio books

V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210

[38]

Gammatone features and feature combination for large vocabulary speech recognition

R. Schlüter, I. Bezrukov, H. Wagner, and H. Ney, “Gammatone features and feature combination for large vocabulary speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, April 2007, pp. IV-649–IV-652

[39]

Hypothesis spaces for minimum Bayes risk training in large vocabulary speech recognition

M. Gibson and T. Hain, “Hypothesis spaces for minimum Bayes risk training in large vocabulary speech recognition,” in INTERSPEECH 2006 - ICSLP, Ninth International Conference on Spoken Language Processing, Pittsburgh, PA, USA, September 17-21,

[40]

RWTH ASR systems for Librispeech: Hybrid vs attention

C. Lüscher, E. Beck, K. Irie, M. Kitza, W. Michel, A. Zeyer, R. Schlüter, and H. Ney, “RWTH ASR systems for Librispeech: Hybrid vs attention,” in submitted to Interspeech, 2019

[1] [1]

LSTMs thus supersede traditional backoff - models which are based on word counts

Introduction In recent years, language models (LMs) based on long short- term memory (LSTM) neural networks have become an inte- gral part of many state-of-the-art automatic speech recogi tion systems [1, 2, 3]. LSTMs thus supersede traditional backoff - models which are based on word counts. For count based models relative frequencies of word n-grams are...

work page

[2] [2]

In this section we give a short overview of how other researchers have dealt with this problem

Related Work Using Neural Network based Language Models (NN-LMs) in Decoding is computationally more expensive than using back - off Language Models. In this section we give a short overview of how other researchers have dealt with this problem. Early approaches of introducing NN-LMs into decoding in- clude some form of conversion to a more traditional ba...

work page

[3] [3]

The authors of [7] tra ined feed-forward LMs for different orders and extracted the pro ba- bilities for the backoff LM directly from the neural network

the continues states of an RNN-LM are discretized to cre- ate a weighted ﬁnite state transducer. The authors of [7] tra ined feed-forward LMs for different orders and extracted the pro ba- bilities for the backoff LM directly from the neural network. [8] compares different techniques for conversion and [9] uses t hese techniques to investigate conversion ...

work page

[4] [4]

The LSTM Units are replaced with GRUs, NCE replaces the hierarchical softmax and GRU states are quantized to reduce the number of necessary computation s

are extended in [26]. The LSTM Units are replaced with GRUs, NCE replaces the hierarchical softmax and GRU states are quantized to reduce the number of necessary computation s

work page

[5] [5]

Implementation For this work we extended the decoder of the RWTH ASR toolkit, described extensively in [27]. The decoder uses tr ee- conditioned search, which differs from the more common HCLG-based decoder in that we do not do static composition of the grammar WFST with the rest of the search network. In- stead hypotheses from the HCL part of the decoder...

work page

[6] [6]

Experiments 4.1. Hardware and Measurement Methodology Each node used for our experiments has two sockets with Intel Xeon E5-2620 v4 CPUs with a base-clock speed of 2.1Ghz and 4 Nvidia Geforce 1080Ti GPUs. Unless stated otherwise, our decoder ran in a single thread. The tensorﬂow runtime spawns more threads as it sees ﬁt. As we are primarily using the GPU ...

work page

[7] [7]

This includes loading features from disk, forwarding them through the acoustic model and decoding / rescoring

To compute the real time factor (RTF) we measure the total wallclock time required by the recognizer/rescorer to proc ess all segments within the corpus and divide it by the total dura - tion. This includes loading features from disk, forwarding them through the acoustic model and decoding / rescoring. Startu p time is not included. Features are not extra...

work page

[8] [8]

We have shown that ﬁrst using the LSTM LM with a small recombination limit and doing lattice rescor - ing afterwards yields the most efﬁcient decoding process

Conclusions In this paper we have shown how to use LSTM-LMs in decod- ing using a GPGPU. We have shown that ﬁrst using the LSTM LM with a small recombination limit and doing lattice rescor - ing afterwards yields the most efﬁcient decoding process. T his approach yields a WER of 11.7% on the Hub5’00 task at an RTF of 1. Further work is required for system...

work page

[9] [9]

Acknowledgments This project has received funding from the European Researc h Council (ERC) under the European Unions Horizon 2020 re- search and innovation program (grant agreement No 694537, project ”SEQCLAS”) and from the European Unions Hori- zon 2020 research and innovation program under the Marie Skodowska-Curie grant agreement No 644283. The work r...

work page 2020

[10] [10]

The microsoft 2017 conversational speech recognition sys tem,

W. Xiong, L. Wu, F. Alleva, J. Droppo, X. Huang, and A. Stol cke, “The microsoft 2017 conversational speech recognition sys tem,” in 2018 IEEE International Conference on Acoustics, Speech an d Signal Processing (ICASSP), April 2018, pp. 5934–5938

work page 2017

[11] [11]

English conversational telephone speech recognition by humans and machines,

G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas, D. D im- itriadis, X. Cui, B. Ramabhadran, M. Picheny, L. Lim, B. Room i, and P . Hall, “English conversational telephone speech recognition by humans and machines,” in Interspeech 2017, 18th Annual Con- ference of the International Speech Communication Associa tion, Stockholm, Sweden, August 20-24, 2...

work page 2017

[12] [12]

The CAPIO 2017 Conversational Speech Recognition System

K. J. Han, A. Chandrashekaran, J. Kim, and I. R. Lane, “The CA- PIO 2017 conversational speech recognition system,” CoRR, vol. abs/1801.00059, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2017

[13] [13]

Improved backing-off for m-gram la n- guage modeling,

R. Kneser and H. Ney, “Improved backing-off for m-gram la n- guage modeling,” in 1995 International Conference on Acoustics, Speech, and Signal Processing , vol. 1, May 1995, pp. 181–184 vol.1

work page 1995

[14] [14]

V ariational approximation of long-span languagemodels for lvcsr,

A. Deoras, T. Mikolov, S. Kombrink, M. Karaﬁt, and S. Khu- danpur, “V ariational approximation of long-span languagemodels for lvcsr,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , May 2011, pp. 5532– 5535

work page 2011

[15] [15]

Conversion of recurrent neural net- work language models to weighted ﬁnite state transducers fo r au- tomatic speech recognition,

G. Lecorv´ e and P . Motl´ ıcek, “Conversion of recurrent neural net- work language models to weighted ﬁnite state transducers fo r au- tomatic speech recognition,” in INTERSPEECH 2012, 13th An- nual Conference of the International Speech Communication As- sociation, Portland, Oregon, USA, September 9-13, 2012. ISCA, 2012, pp. 1668–1671

[16]

Converting neural network language models into back-off language models for efficient decoding in automatic speech recognition,

E. Arisoy, S. F. Chen, B. Ramabhadran, and A. Sethy, “Converting neural network language models into back-off language models for efficient decoding in automatic speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 1, pp. 184–192, Jan 2014

[17]

Comparing approaches to convert recurrent neural networks into backoff language models for efficient decoding,

H. Adel, K. Kirchhoff, N. T. Vu, D. Telaar, and T. Schultz, “Comparing approaches to convert recurrent neural networks into backoff language models for efficient decoding,” in INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 14-18, 2014, H. Li, H. M. Meng, B. Ma, E. Chng, and L. Xie,...

[18]

Approximated and domain-adapted LSTM language models for first-pass decoding in speech recognition,

M. Singh, Y. Oualil, and D. Klakow, “Approximated and domain-adapted LSTM language models for first-pass decoding in speech recognition,” in Proc. Interspeech 2017. ISCA, 2017, pp. 2720–2724

[19]

Noise-contrastive estimation: A new estimation principle for unnormalized statistical models,

M. Gutmann and A. Hyvärinen, “Noise-contrastive estimation: A new estimation principle for unnormalized statistical models,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010, ser. JMLR Proceedings, Y. W. Teh and D. M. Titteringt...

[20]

Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics,

M. Gutmann and A. Hyvärinen, “Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics,” Journal of Machine Learning Research, vol. 13, pp. 307–361, 2012

[21]

Recurrent neural network language model training with noise contrastive estimation for speech recognition,

X. Chen, X. Liu, M. J. F. Gales, and P. C. Woodland, “Recurrent neural network language model training with noise contrastive estimation for speech recognition,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2015, pp. 5411–5415

[22]

Unnormalized exponential and neural network language models,

A. Sethy, S. Chen, E. Arisoy, and B. Ramabhadran, “Unnormalized exponential and neural network language models,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2015, pp. 5416–5420

[23]

Fast neural network language model lookups at n-gram speeds,

Y. Huang, A. Sethy, and B. Ramabhadran, “Fast neural network language model lookups at n-gram speeds,” in Proc. Interspeech 2017, 2017, pp. 274–278

[24]

A fast re-scoring strategy to capture long-distance dependencies,

A. Deoras, T. Mikolov, and K. Church, “A fast re-scoring strategy to capture long-distance dependencies,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, ser. EMNLP ’11. Stroudsburg, PA, USA: Association for Computational Linguistics, 2011, pp. 1116–1127

[25]

Efficient lattice rescoring using recurrent neural network language models,

X. Liu, Y. Wang, X. Chen, M. J. F. Gales, and P. C. Woodland, “Efficient lattice rescoring using recurrent neural network language models,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 4908–4912

[26]

Lattice decoding and rescoring with long-span neural network language models,

M. Sundermeyer, Z. Tüske, R. Schlüter, and H. Ney, “Lattice decoding and rescoring with long-span neural network language models,” in INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 14-18, 2014, H. Li, H. M. Meng, B. Ma, E. Chng, and L. Xie, Eds. ISCA, 2014, pp. 661–665

[27]

Two efficient lattice rescoring methods using recurrent neural network language models,

X. Liu, X. Chen, Y. Wang, M. J. F. Gales, and P. C. Woodland, “Two efficient lattice rescoring methods using recurrent neural network language models,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 8, pp. 1438–1449, Aug 2016

[28]

Lattice rescoring strategies for long short term memory language models in speech recognition,

S. Kumar, M. Nirschl, D. N. Holtmann-Rice, H. Liao, A. T. Suresh, and F. X. Yu, “Lattice rescoring strategies for long short term memory language models in speech recognition,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017, Okinawa, Japan, December 16-20, 2017. IEEE, 2017, pp. 165–172

[29]

A pruned RNNLM lattice-rescoring algorithm for automatic speech recognition,

H. Xu, T. Chen, D. Gao, Y. Wang, K. Li, N. Goel, Y. Carmiel, D. Povey, and S. Khudanpur, “A pruned RNNLM lattice-rescoring algorithm for automatic speech recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018, pp. 5929–5933

[30]

Cache based recurrent neural network language model inference for first pass speech recognition,

Z. Huang, G. Zweig, and B. Dumoulin, “Cache based recurrent neural network language model inference for first pass speech recognition,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 6354–6358

[31]

Real-time one-pass decoding with recurrent neural network language model for speech recognition,

T. Hori, Y. Kubo, and A. Nakamura, “Real-time one-pass decoding with recurrent neural network language model for speech recognition,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 6364–6368

[32]

Applying GPGPU to recurrent neural network language model based fast network search in the real-time LVCSR,

K. Lee, C. Park, I. Kim, N. Kim, and J. Lee, “Applying GPGPU to recurrent neural network language model based fast network search in the real-time LVCSR,” in Proc. Interspeech 2015. ISCA, 2015, pp. 2102–2106

[33]

Hierarchical probabilistic neural network language model,

F. Morin and Y. Bengio, “Hierarchical probabilistic neural network language model,” in Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, AISTATS 2005, Bridgetown, Barbados, January 6-8, 2005, R. G. Cowell and Z. Ghahramani, Eds. Society for Artificial Intelligence and Statistics, 2005

[34]

Strategies for training large scale neural network language models,

T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. Cernocký, “Strategies for training large scale neural network language models,” in 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, ASRU 2011, Waikoloa, HI, USA, December 11-15, 2011, D. Nahamoo and M. Picheny, Eds. IEEE, 2011, pp. 196–201

[35]

Accelerating recurrent neural network language model based online speech recognition system,

K. Lee, C. Park, N. Kim, and J. Lee, “Accelerating recurrent neural network language model based online speech recognition system,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018. IEEE, 2018, pp. 5904–5908

[36]

Progress in decoding for large vocabulary continuous speech recognition,

D. Nolden, “Progress in decoding for large vocabulary continuous speech recognition,” Ph.D. dissertation, RWTH Aachen University, Computer Science Department, RWTH Aachen University, Aachen, Germany, Apr. 2017

[37]

Librispeech: an ASR corpus based on public domain audio books,

V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210

[38]

Gammatone features and feature combination for large vocabulary speech recognition,

R. Schlüter, I. Bezrukov, H. Wagner, and H. Ney, “Gammatone features and feature combination for large vocabulary speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, April 2007, pp. IV-649–IV-652

[39]

Hypothesis spaces for minimum Bayes risk training in large vocabulary speech recognition,

M. Gibson and T. Hain, “Hypothesis spaces for minimum Bayes risk training in large vocabulary speech recognition,” in INTERSPEECH 2006 - ICSLP, Ninth International Conference on Spoken Language Processing, Pittsburgh, PA, USA, September 17-21,

[40]

RWTH ASR systems for Librispeech: Hybrid vs attention,

C. Lüscher, E. Beck, K. Irie, M. Kitza, W. Michel, A. Zeyer, R. Schlüter, and H. Ney, “RWTH ASR systems for Librispeech: Hybrid vs attention,” in submitted to Interspeech, 2019
