LSTM Language Models for LVCSR in First-Pass Decoding and Lattice-Rescoring
Pith reviewed 2026-05-25 11:06 UTC · model grok-4.3
The pith
LSTM language models can be used in first-pass LVCSR decoding by recombining hypotheses that share the last two words, with lattice rescoring applied afterwards.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Performing first-pass decoding with an LSTM language model, recombining any hypotheses that share the last two words, and then rescoring the resulting lattice yields competitive recognition accuracy on Hub5'00 and Librispeech with better-than-real-time runtime on GPGPU machines.
What carries the argument
Recombination of hypotheses sharing the last two words, which approximates LSTM state during beam search so that first-pass decoding remains tractable before full lattice rescoring.
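The recombination step can be illustrated with a minimal sketch. It assumes a simplified beam in which each partial hypothesis carries its word history, an accumulated log score, and an opaque LSTM state; the `Hypothesis` class, the `recombine` function, and the `limit` parameter are hypothetical names chosen for illustration and are not taken from the paper or from the RWTH ASR toolkit.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Hypothesis:
    words: Tuple[str, ...]   # full word history of this partial path
    score: float             # accumulated log-probability (higher is better)
    lm_state: object = None  # stand-in for the LSTM hidden/cell state


def recombine(hyps: List[Hypothesis], limit: int = 2) -> List[Hypothesis]:
    """Keep only the best-scoring hypothesis per last-`limit`-words key."""
    best: Dict[Tuple[str, ...], Hypothesis] = {}
    for hyp in hyps:
        key = hyp.words[-limit:]  # hypotheses with equal keys are merged
        kept = best.get(key)
        if kept is None or hyp.score > kept.score:
            best[key] = hyp  # the survivor's LSTM state represents the group
    return list(best.values())


# Toy beam: the first two hypotheses share the last two words ("the", "cat").
hyps = [
    Hypothesis(("i", "think", "the", "cat"), -4.1, "state_a"),
    Hypothesis(("we", "saw", "the", "cat"), -3.7, "state_b"),
    Hypothesis(("i", "think", "the", "hat"), -5.0, "state_c"),
]
survivors = recombine(hyps, limit=2)
```

In this toy beam the two hypotheses ending in "the cat" collapse into one, and only the better-scoring one's LSTM state is carried forward. Raising `limit` merges fewer hypotheses, which moves the search closer to full state tracking at higher cost.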
If this is right
- Competitive word error rates on the Hub5'00 and Librispeech evaluation sets.
- Runtime faster than real time when executed on GPGPU hardware.
- A full sum over all state sequences belonging to a given word hypothesis can also be carried out during decoding, in that case without recombination.
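"Faster than real time" corresponds to a real-time factor (RTF) below 1. The paper excerpt in the reference graph describes the RTF as the total wallclock time over all segments divided by the total audio duration. A minimal sketch of that ratio follows; the function name and the timing values are hypothetical and are not measurements from the paper.

```python
from typing import Iterable, Tuple


def real_time_factor(segments: Iterable[Tuple[float, float]]) -> float:
    """segments: (wallclock_seconds, audio_seconds) pairs, one per segment."""
    total_wallclock = 0.0
    total_audio = 0.0
    for wallclock, audio in segments:
        total_wallclock += wallclock
        total_audio += audio
    if total_audio <= 0.0:
        raise ValueError("total audio duration must be positive")
    # One ratio over the whole corpus, not an average of per-segment ratios.
    return total_wallclock / total_audio


# Hypothetical timings for three segments, not results from the paper.
timings = [(4.0, 5.0), (9.0, 10.0), (2.0, 5.0)]
rtf = real_time_factor(timings)
faster_than_real_time = rtf < 1.0
```

Summing before dividing weights long segments more heavily than short ones, which matches a corpus-level definition rather than a mean of per-segment factors.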
Where Pith is reading between the lines
- The two-word recombination rule may generalize to other long-context recurrent language models used in first-pass search.
- Systems that already rely on lattice rescoring could adopt this first-pass LSTM stage to reduce overall latency without changing the final rescoring step.
- The method could be tested on other LVCSR benchmarks to measure how much the recombination window size affects the gap between first-pass and rescored error rates.
Load-bearing premise
Recombining hypotheses that share only the last two words keeps enough LSTM state information that the first-pass search does not discard high-scoring paths whose final accuracy would suffer after rescoring.
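Stated in symbols, the premise can be written as follows. The notation is introduced here for illustration and is not the paper's: $\hat{w}_1^{n}$ denotes the hypothesis kept at recombination and $\tilde{w}_1^{m}$ a hypothesis merged into it.

```latex
\tilde{w}_{m-1} = \hat{w}_{n-1},\quad \tilde{w}_{m} = \hat{w}_{n}
\;\;\Longrightarrow\;\;
p_{\mathrm{LSTM}}\!\left(w \mid \tilde{w}_1^{m}\right)
\approx
p_{\mathrm{LSTM}}\!\left(w \mid \hat{w}_1^{n}\right)
\quad \text{for all words } w .
```

A count-based model with a two-word context satisfies this relation with equality. For an LSTM it is only an approximation, because the hidden state depends on the entire history, which is why the premise needs empirical support.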
What would settle it
Running the identical first-pass decoder once with two-word recombination and once with full LSTM state tracking on the same Hub5'00 or Librispeech audio, then comparing the word error rates of the two resulting lattices after identical rescoring.
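The comparison described above reduces to computing the word error rate of each configuration against the same references. The sketch below assumes transcripts are plain whitespace-tokenized strings; the function names and the toy transcripts are hypothetical, and the standard Levenshtein-based WER is used here rather than any scoring tool named in the paper.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance between reference and hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(refs: List[str], hyps: List[str]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words


# Toy transcripts, not outputs of any real decoder.
refs = ["the cat sat on the mat", "we rescore the lattice"]
two_word_hyps = ["the cat sat on a mat", "we rescore the lattice"]
full_state_hyps = ["the cat sat on the mat", "we rescore the lattice"]
gap = wer(refs, two_word_hyps) - wer(refs, full_state_hyps)
```

A gap near zero on real evaluation data would support the premise; a consistently positive gap would indicate that two-word recombination discards paths that rescoring cannot recover.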
Original abstract
LSTM based language models are an important part of modern LVCSR systems as they significantly improve performance over traditional backoff language models. Incorporating them efficiently into decoding has been notoriously difficult. In this paper we present an approach based on a combination of one-pass decoding and lattice rescoring. We perform decoding with the LSTM-LM in the first pass but recombine hypotheses that share the last two words; afterwards we rescore the resulting lattice. We run our systems on GPGPU equipped machines and are able to produce competitive results on the Hub5'00 and Librispeech evaluation corpora with a runtime better than real-time. In addition we shortly investigate the possibility to carry out the full sum over all state-sequences belonging to a given word-hypothesis during decoding without recombination.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LSTM LMs can be incorporated into LVCSR via first-pass decoding that recombines hypotheses sharing the last two words, followed by lattice rescoring; this yields competitive results on Hub5'00 and Librispeech with better-than-real-time runtime on GPGPU hardware, while a brief investigation of full summation over state sequences without recombination is also mentioned.
Significance. If the performance claims hold, the work supplies a practical engineering route for deploying strong LSTM LMs inside the first pass rather than only in rescoring, which remains a relevant systems contribution for real-time LVCSR. The use of public corpora and the explicit comparison of recombined versus full-summation decoding are strengths.
major comments (2)
- [Abstract / decoding description] The recombination rule (hypotheses sharing only the last two words) is load-bearing for the efficiency claim, yet the manuscript supplies no analysis showing that distinct LSTM hidden states arising from different longer histories are sufficiently similar that high-scoring paths are not lost before rescoring; this directly affects whether the produced lattices remain adequate for the subsequent rescoring step.
- [Abstract / evaluation claims] The central empirical claim of 'competitive results' and 'runtime better than real-time' is stated without any numeric WER values, baseline comparisons, or error-bar information in the provided text; without these data the performance assertions cannot be assessed.
minor comments (1)
- [Abstract] The phrase 'we shortly investigate' the full-sum case appears in the abstract but no quantitative outcome or section reference is supplied.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below.
- Referee: [Abstract / decoding description] The recombination rule (hypotheses sharing only the last two words) is load-bearing for the efficiency claim, yet the manuscript supplies no analysis showing that distinct LSTM hidden states arising from different longer histories are sufficiently similar that high-scoring paths are not lost before rescoring; this directly affects whether the produced lattices remain adequate for the subsequent rescoring step.
  Authors: The recombination on the last two words is a standard approximation used to control decoder state space growth when incorporating LSTM LMs. The manuscript does not contain an explicit quantitative analysis of hidden-state similarity or path loss under this rule. The subsequent full LSTM lattice rescoring is intended to recover quality, but we agree an added discussion of the approximation would strengthen the paper. We will include such a discussion in revision.
  Revision: partial
- Referee: [Abstract / evaluation claims] The central empirical claim of 'competitive results' and 'runtime better than real-time' is stated without any numeric WER values, baseline comparisons, or error-bar information in the provided text; without these data the performance assertions cannot be assessed.
  Authors: The abstract is a high-level summary; the full manuscript (Section 4 and tables) reports the concrete WER numbers on Hub5'00 and Librispeech, baseline comparisons, and GPGPU runtime figures. To make the claims self-contained we will incorporate the key numeric results into the abstract during revision.
  Revision: yes
Circularity Check
Empirical systems result with no derivation chain
The paper presents an engineering method for first-pass LSTM-LM decoding with recombination of hypotheses that share the last two words, followed by lattice rescoring, then reports runtime and WER on Hub5'00 and Librispeech. No equations, fitted parameters, or theorems are claimed to derive a result; the central claims are measured outcomes on public data. No self-citation load-bearing step, uniqueness theorem, or ansatz is invoked to justify the method. The work is self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Merging hypotheses that share the last two words is sufficient to keep LSTM state information without major search errors
Reference graph
Works this paper leans on
-
[1]
LSTMs thus supersede traditional backoff models which are based on word counts
Introduction In recent years, language models (LMs) based on long short-term memory (LSTM) neural networks have become an integral part of many state-of-the-art automatic speech recognition systems [1, 2, 3]. LSTMs thus supersede traditional backoff models which are based on word counts. For count based models relative frequencies of word n-grams are...
-
[2]
In this section we give a short overview of how other researchers have dealt with this problem
Related Work Using Neural Network based Language Models (NN-LMs) in Decoding is computationally more expensive than using back-off Language Models. In this section we give a short overview of how other researchers have dealt with this problem. Early approaches of introducing NN-LMs into decoding include some form of conversion to a more traditional ba...
-
[3]
the continuous states of an RNN-LM are discretized to create a weighted finite state transducer. The authors of [7] trained feed-forward LMs for different orders and extracted the probabilities for the backoff LM directly from the neural network. [8] compares different techniques for conversion and [9] uses these techniques to investigate conversion ...
-
[4]
are extended in [26]. The LSTM Units are replaced with GRUs, NCE replaces the hierarchical softmax and GRU states are quantized to reduce the number of necessary computations
-
[5]
Implementation For this work we extended the decoder of the RWTH ASR toolkit, described extensively in [27]. The decoder uses tree-conditioned search, which differs from the more common HCLG-based decoder in that we do not do static composition of the grammar WFST with the rest of the search network. Instead hypotheses from the HCL part of the decoder...
-
[6]
Experiments 4.1. Hardware and Measurement Methodology Each node used for our experiments has two sockets with Intel Xeon E5-2620 v4 CPUs with a base-clock speed of 2.1 GHz and 4 Nvidia GeForce 1080 Ti GPUs. Unless stated otherwise, our decoder ran in a single thread. The TensorFlow runtime spawns more threads as it sees fit. As we are primarily using the GPU ...
-
[7]
To compute the real time factor (RTF) we measure the total wallclock time required by the recognizer/rescorer to process all segments within the corpus and divide it by the total duration. This includes loading features from disk, forwarding them through the acoustic model and decoding / rescoring. Startup time is not included. Features are not extra...
-
[8]
Conclusions In this paper we have shown how to use LSTM-LMs in decoding using a GPGPU. We have shown that first using the LSTM LM with a small recombination limit and doing lattice rescoring afterwards yields the most efficient decoding process. This approach yields a WER of 11.7% on the Hub5’00 task at an RTF of 1. Further work is required for system...
-
[9]
Acknowledgments This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No 694537, project ”SEQCLAS”) and from the European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 644283. The work r...
-
[10]
The microsoft 2017 conversational speech recognition system,
W. Xiong, L. Wu, F. Alleva, J. Droppo, X. Huang, and A. Stolcke, “The microsoft 2017 conversational speech recognition system,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018, pp. 5934–5938
-
[11]
English conversational telephone speech recognition by humans and machines,
G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas, D. Dimitriadis, X. Cui, B. Ramabhadran, M. Picheny, L. Lim, B. Roomi, and P. Hall, “English conversational telephone speech recognition by humans and machines,” in Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 2...
-
[12]
The CAPIO 2017 Conversational Speech Recognition System
K. J. Han, A. Chandrashekaran, J. Kim, and I. R. Lane, “The CAPIO 2017 conversational speech recognition system,” CoRR, vol. abs/1801.00059, 2018
-
[13]
Improved backing-off for m-gram language modeling,
R. Kneser and H. Ney, “Improved backing-off for m-gram language modeling,” in 1995 International Conference on Acoustics, Speech, and Signal Processing, vol. 1, May 1995, pp. 181–184 vol.1
-
[14]
Variational approximation of long-span language models for lvcsr,
A. Deoras, T. Mikolov, S. Kombrink, M. Karafiát, and S. Khudanpur, “Variational approximation of long-span language models for lvcsr,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2011, pp. 5532–5535
-
[15]
G. Lecorvé and P. Motlícek, “Conversion of recurrent neural network language models to weighted finite state transducers for automatic speech recognition,” in INTERSPEECH 2012, 13th Annual Conference of the International Speech Communication Association, Portland, Oregon, USA, September 9-13, 2012. ISCA, 2012, pp. 1668–1671
-
[16]
E. Arisoy, S. F. Chen, B. Ramabhadran, and A. Sethy, “Converting neural network language models into back-off language models for efficient decoding in automatic speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 1, pp. 184–192, Jan 2014
-
[17]
H. Adel, K. Kirchhoff, N. T. Vu, D. Telaar, and T. Schultz, “Comparing approaches to convert recurrent neural networks into backoff language models for efficient decoding,” in INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 14-18, 2014, H. Li, H. M. Meng, B. Ma, E. Chng, and L. Xie,...
-
[18]
M. Singh, Y. Oualil, and D. Klakow, “Approximated and domain-adapted lstm language models for first-pass decoding in speech recognition,” in Proc. Interspeech 2017. ISCA, 2017, pp. 2720–2724
-
[19]
Noise-contrastive estimation: A new estimation principle for unnormalized statistical models,
M. Gutmann and A. Hyvärinen, “Noise-contrastive estimation: A new estimation principle for unnormalized statistical models,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010, ser. JMLR Proceedings, Y. W. Teh and D. M. Titteringt...
-
[20]
M. Gutmann and A. Hyvarinen, “Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics,” Journal of Machine Learning Research, vol. 13, pp. 307–361, 2012
-
[21]
X. Chen, X. Liu, M. J. F. Gales, and P. C. Woodland, “Recurrent neural network language model training with noise contrastive estimation for speech recognition,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2015, pp. 5411–5415
-
[22]
Unnormalized exponential and neural network language models,
A. Sethy, S. Chen, E. Arisoy, and B. Ramabhadran, “Unnormalized exponential and neural network language models,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2015, pp. 5416–5420
-
[23]
Fast neural network language model lookups at n-gram speeds,
Y. Huang, A. Sethy, and B. Ramabhadran, “Fast neural network language model lookups at n-gram speeds,” in Proc. Interspeech 2017, 2017, pp. 274–278
-
[24]
A fast re-scoring strategy to capture long-distance dependencies,
A. Deoras, T. Mikolov, and K. Church, “A fast re-scoring strategy to capture long-distance dependencies,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, ser. EMNLP ’11. Stroudsburg, PA, USA: Association for Computational Linguistics, 2011, pp. 1116–1127
-
[25]
Efficient lattice rescoring using recurrent neural network language models,
X. Liu, Y. Wang, X. Chen, M. J. F. Gales, and P. C. Woodland, “Efficient lattice rescoring using recurrent neural network language models,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 4908–4912
-
[26]
Lattice decoding and rescoring with long-span neural network language models,
M. Sundermeyer, Z. Tüske, R. Schlüter, and H. Ney, “Lattice decoding and rescoring with long-span neural network language models,” in INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 14-18, 2014, H. Li, H. M. Meng, B. Ma, E. Chng, and L. Xie, Eds. ISCA, 2014, pp. 661–665
-
[27]
Two efficient lattice rescoring methods using recurrent neural network language models,
X. Liu, X. Chen, Y. Wang, M. J. F. Gales, and P. C. Woodland, “Two efficient lattice rescoring methods using recurrent neural network language models,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 8, pp. 1438–1449, Aug 2016
-
[28]
Lattice rescoring strategies for long short term memory language models in speech recognition,
S. Kumar, M. Nirschl, D. N. Holtmann-Rice, H. Liao, A. T. Suresh, and F. X. Yu, “Lattice rescoring strategies for long short term memory language models in speech recognition,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017, Okinawa, Japan, December 16-20, 2017. IEEE, 2017, pp. 165–172
-
[29]
A pruned rnnlm lattice-rescoring algorithm for automatic speech recognition,
H. Xu, T. Chen, D. Gao, Y. Wang, K. Li, N. Goel, Y. Carmiel, D. Povey, and S. Khudanpur, “A pruned rnnlm lattice-rescoring algorithm for automatic speech recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018, pp. 5929–5933
-
[30]
Cache based recurrent neural network language model inference for first pass speech recognition,
Z. Huang, G. Zweig, and B. Dumoulin, “Cache based recurrent neural network language model inference for first pass speech recognition,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 6354–6358
-
[31]
Real-time one-pass decoding with recurrent neural network language model for speech recognition,
T. Hori, Y. Kubo, and A. Nakamura, “Real-time one-pass decoding with recurrent neural network language model for speech recognition,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 6364–6368
-
[32]
K. Lee, C. Park, I. Kim, N. Kim, and J. Lee, “Applying GPGPU to recurrent neural network language model based fast network search in the real-time LVCSR,” in Proc. Interspeech 2015. ISCA, 2015, pp. 2102–2106
-
[33]
Hierarchical probabilistic neural network language model,
F. Morin and Y. Bengio, “Hierarchical probabilistic neural network language model,” in Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, AISTATS 2005, Bridgetown, Barbados, January 6-8, 2005, R. G. Cowell and Z. Ghahramani, Eds. Society for Artificial Intelligence and Statistics, 2005
-
[34]
Strategies for training large scale neural network language models,
T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. Cernocký, “Strategies for training large scale neural network language models,” in 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, ASRU 2011, Waikoloa, HI, USA, December 11-15, 2011, D. Nahamoo and M. Picheny, Eds. IEEE, 2011, pp. 196–201
-
[35]
Accelerating recurrent neural network language model based online speech recognition system,
K. Lee, C. Park, N. Kim, and J. Lee, “Accelerating recurrent neural network language model based online speech recognition system,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018. IEEE, 2018, pp. 5904–5908
-
[36]
Progress in decoding for large vocabulary continuous speech recognition,
D. Nolden, “Progress in decoding for large vocabulary continuous speech recognition,” Ph.D. dissertation, RWTH Aachen University, Computer Science Department, RWTH Aachen University, Aachen, Germany, Apr. 2017
-
[37]
Librispeech: an asr corpus based on public domain audio books,
V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210
-
[38]
Gammatone features and feature combination for large vocabulary speech recognition,
R. Schlüter, I. Bezrukov, H. Wagner, and H. Ney, “Gammatone features and feature combination for large vocabulary speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, April 2007, pp. IV-649–IV-652
-
[39]
Hypothesis spaces for minimum bayes risk training in large vocabulary speech recognition,
M. Gibson and T. Hain, “Hypothesis spaces for minimum bayes risk training in large vocabulary speech recognition,” in INTERSPEECH 2006 - ICSLP, Ninth International Conference on Spoken Language Processing, Pittsburgh, PA, USA, September 17-21,
-
[40]
Rwth asr systems for librispeech: Hybrid vs attention,
C. Lüscher, E. Beck, K. Irie, M. Kitza, W. Michel, A. Zeyer, R. Schlüter, and H. Ney, “Rwth asr systems for librispeech: Hybrid vs attention,” submitted to Interspeech, 2019