Gated Embeddings in End-to-End Speech Recognition for Conversational-Context Fusion

Florian Metze; Siddharth Dalmia; Suyoun Kim

arxiv: 1906.11604 · v1 · pith:GVAIPUBDnew · submitted 2019-06-27 · 💻 cs.CL · cs.SD· eess.AS

Gated Embeddings in End-to-End Speech Recognition for Conversational-Context Fusion

Suyoun Kim , Siddharth Dalmia , Florian Metze This is my paper

Pith reviewed 2026-05-25 15:02 UTC · model grok-4.3

classification 💻 cs.CL cs.SDeess.AS

keywords end-to-end speech recognitionconversational contextgated embeddingsBERTfastTextSwitchboard corpusword error ratecontext fusion

0 comments

The pith

A gated neural network integrates external text embeddings into end-to-end speech recognition to capture conversational context across sentences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a gated neural network that fuses external word and sentence embeddings such as fastText and BERT with speech inputs inside an end-to-end recognizer. This design lets the model learn longer context that crosses sentence boundaries, which standard end-to-end models do not capture as effectively. The approach is tested on the Switchboard conversational corpus and produces lower word error rates than baseline end-to-end systems. The core idea is that gating allows the text embeddings to supply useful conversational information without replacing the acoustic signal.

Core claim

By inserting a gated fusion layer that combines conversational-context, word, and speech embeddings drawn from external text models, the end-to-end recognizer learns longer-range conversational information spanning multiple sentences and thereby reduces word error rate on long conversations in the Switchboard corpus.

What carries the argument

Gated neural network that fuses external text embeddings (fastText, BERT) with acoustic features to supply conversational context.

If this is right

The model learns conversational context that spans sentences rather than remaining limited to single utterances.
Word error rate improves significantly when external embeddings are gated into the end-to-end framework.
The same architecture outperforms standard end-to-end speech recognition models on the Switchboard corpus.
Better conversational-context representation is achieved inside the recognizer itself.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The gating mechanism could be tested on other dialogue corpora to check whether the context benefit generalizes beyond Switchboard.
If the external embeddings carry domain mismatch, performance on short utterances might degrade while long-conversation gains remain.
Replacing fastText or BERT with newer sentence encoders would be a direct way to measure whether stronger text representations amplify the reported gains.

Load-bearing premise

External text embeddings can be added through gating without creating harmful mismatch between text and speech domains or discarding necessary acoustic detail.

What would settle it

Training the gated model on Switchboard and finding that its word error rate on long conversations is equal to or higher than the ungated baseline would falsify the central claim.

Figures

Figures reproduced from arXiv: 1906.11604 by Florian Metze, Siddharth Dalmia, Suyoun Kim.

**Figure 2.** Figure 2: Our contextual gating mechanism in decoder [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: shows the relative improvements in the accuracy on the Dev set (5.2) over the baseline “non-conversational” model. We show the improvements on the two different methods of merging the contextual embeddings, namely mean and concatenation. Typically increasing the receptive field of the conversational-context helps improve the model. However, as the number of utterence history increased, the number of trai… view at source ↗

**Figure 4.** Figure 4: The relative improvement in Development accuracy over 100% sampling rate which was used in (Kim and Metze, 2018) obtained by using conversational-context embeddings with different sampling rate. We also experiment with an utterance level sampling strategy with various sampling ratio, [0.0, 0.2, 0.5, 1.0]. Sampling techniques have been extensively used in sequence prediction tasks to reduce overfitting (B… view at source ↗

**Figure 5.** Figure 5: Comparison of the conversational distance [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

read the original abstract

We present a novel conversational-context aware end-to-end speech recognizer based on a gated neural network that incorporates conversational-context/word/speech embeddings. Unlike conventional speech recognition models, our model learns longer conversational-context information that spans across sentences and is consequently better at recognizing long conversations. Specifically, we propose to use the text-based external word and/or sentence embeddings (i.e., fastText, BERT) within an end-to-end framework, yielding a significant improvement in word error rate with better conversational-context representation. We evaluated the models on the Switchboard conversational speech corpus and show that our model outperforms standard end-to-end speech recognition models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds a gated fusion of external text embeddings into an E2E ASR model to capture cross-sentence context and reports WER gains on Switchboard, but the lack of ablations leaves the source of the improvement unclear.

read the letter

The core idea is straightforward: take an end-to-end speech recognizer, feed it word and sentence embeddings from fastText or BERT through a gate, and hope the model picks up longer conversational context that standard E2E setups miss. That specific combination for multi-turn fusion appears new relative to the cited prior work. The practical motivation is solid—conversational ASR often suffers when the model cannot look beyond the current utterance—and the Switchboard evaluation is a reasonable test bed for the claim. If the full experiments show consistent gains that hold up under different embedding sources, this could be a useful incremental technique for people building production conversational systems. The main weakness is that the abstract supplies no numbers, no baseline definitions, no ablation on the gate itself, and no check on whether the external embeddings actually contribute or whether the gate mostly ignores them. The domain-mismatch concern is real: general-text embeddings trained on web data may not align cleanly with Switchboard acoustics, and without gate-value analysis or a random-vector control it is hard to attribute any WER drop to better context rather than extra capacity. The paper reads as honest empirical work rather than over-claiming, but the current write-up is too thin to judge soundness. This is the kind of paper that belongs in a speech or NLP workshop or a journal like TASLP if the methods section adds the missing controls; a serious editor could send it out for review once those details are in place, though I would not cite it yet without seeing the full results.

Referee Report

2 major / 2 minor

Summary. The paper proposes a novel end-to-end speech recognition architecture that uses a gated neural network to fuse speech embeddings with external text-based word and sentence embeddings (fastText, BERT) in order to incorporate longer conversational context spanning multiple sentences. It claims this yields better recognition of long conversations and a significant WER improvement over standard E2E models when evaluated on the Switchboard corpus.

Significance. If the WER gains are shown to arise specifically from effective cross-sentence context fusion rather than other factors, the work would provide a practical demonstration of integrating pre-trained text embeddings into E2E ASR via gating, which could be useful for conversational tasks. The approach is framed as empirical rather than theoretical, with no parameter-free derivations or machine-checked proofs noted.

major comments (2)

[§4] §4 (Experiments on Switchboard): The reported WER improvements lack any ablation studies that isolate the gated external embeddings (e.g., comparison to the base E2E model without context embeddings, to random vectors of matching dimension, or to in-domain embeddings), making it impossible to confirm that gains are attributable to longer conversational-context information rather than other architectural changes or training effects.
[§3] §3 (Gated fusion architecture): The description of the gating mechanism between speech, word, and sentence embeddings provides no supporting analysis such as gate-value histograms, activation statistics, or tests for domain mismatch between general-text pretraining (fastText/BERT) and Switchboard acoustics; without this, the central claim that the model 'learns longer conversational-context information' cannot be verified.

minor comments (2)

[Abstract] The abstract asserts 'significant improvement in word error rate' but supplies no numerical values, baseline definitions, or statistical test details; these should be added for clarity even if full results appear in §4.
[§3] Notation for the gated fusion (speech/word/sentence embeddings) is introduced without an explicit equation or diagram reference, which could be clarified in §3.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, indicating planned revisions where appropriate.

read point-by-point responses

Referee: [§4] §4 (Experiments on Switchboard): The reported WER improvements lack any ablation studies that isolate the gated external embeddings (e.g., comparison to the base E2E model without context embeddings, to random vectors of matching dimension, or to in-domain embeddings), making it impossible to confirm that gains are attributable to longer conversational-context information rather than other architectural changes or training effects.

Authors: The current manuscript already reports comparisons against the standard E2E model without external embeddings, which provides an initial isolation of the gated fusion contribution. We agree, however, that further ablations using random vectors of matching dimension and in-domain embeddings would more rigorously attribute gains to conversational context. These additional experiments will be included in the revised manuscript. revision: yes
Referee: [§3] §3 (Gated fusion architecture): The description of the gating mechanism between speech, word, and sentence embeddings provides no supporting analysis such as gate-value histograms, activation statistics, or tests for domain mismatch between general-text pretraining (fastText/BERT) and Switchboard acoustics; without this, the central claim that the model 'learns longer conversational-context information' cannot be verified.

Authors: While the WER gains on Switchboard provide empirical support for the claim, we acknowledge that direct analysis of the gating behavior would strengthen verification. We will add gate-value histograms, activation statistics, and a brief discussion of potential domain mismatch in the revised version. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture with external results

full rationale

The paper describes a gated fusion architecture for incorporating fastText/BERT embeddings into an E2E ASR model and reports WER gains on Switchboard. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim rests on experimental comparison rather than any self-definitional or load-bearing reduction. External embeddings are treated as independent inputs; their integration is an architectural choice evaluated empirically, not derived from the model itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; all such elements remain unknown.

pith-pipeline@v0.9.0 · 5640 in / 990 out tokens · 23685 ms · 2026-05-25T15:02:10.303303+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 6 internal anchors

[1]

Uri Alon, Golan Pundak, and Tara N Sainath. 2019. Contextual speech recognition with difficult negative training examples. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6440--6444. IEEE

work page 2019
[2]

John Arevalo, Thamar Solorio, Manuel Montes-y G \'o mez, and Fabio A Gonz \'a lez. 2017. Gated multimodal units for information fusion. arXiv preprint arXiv:1702.01992

work page internal anchor Pith review Pith/arXiv arXiv 2017
[3]

Kartik Audhkhasi, Brian Kingsbury, Bhuvana Ramabhadran, George Saon, and Michael Picheny. 2018. Building competitive direct acoustics-to-word models for english conversational speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4759--4763. IEEE

work page 2018
[4]

Kartik Audhkhasi, Bhuvana Ramabhadran, George Saon, Michael Picheny, and David Nahamoo. 2017. Direct acoustics-to-word models for english conversational speech recognition. CoRR, abs/1703.07754

work page internal anchor Pith review Pith/arXiv arXiv 2017
[5]

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. ICLR

work page 2015
[6]

Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio. 2016. End-to-end attention-based large vocabulary speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4945--4949. IEEE

work page 2016
[7]

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171--1179

work page 2015
[8]

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135--146

work page 2017
[9]

William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4960--4964. IEEE

work page 2016
[10]

Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pages 577--585

work page 2015
[11]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. NAACL

work page 2019
[12]

John Godfrey and Edward Holliman. 1993. Switchboard-1 release 2 ldc97s62. Linguistic Data Consortium, Philadelphia, LDC97S62

work page 1993
[13]

John J Godfrey, Edward C Holliman, and Jane McDaniel. 1992. Switchboard: Telephone speech corpus for research and development. In Acoustics, Speech, and Signal Processing, 1992. ICASSP-92., 1992 IEEE International Conference on, volume 1, pages 517--520. IEEE

work page 1992
[14]

Alex Graves, Santiago Fern \'a ndez, Faustino Gomez, and J \"u rgen Schmidhuber. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pages 369--376. ACM

work page 2006
[15]

Alex Graves and Navdeep Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1764--1772

work page 2014
[16]

Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. 2014. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567

work page internal anchor Pith review Pith/arXiv arXiv 2014
[17]

Takaaki Hori, Shinji Watanabe, Yu Zhang, and William Chan. 2017. Advances in joint ctc-attention based end-to-end speech recognition with a deep cnn encoder and rnn-lm. Interspeech

work page 2017
[18]

Yangfeng Ji, Trevor Cohn, Lingpeng Kong, Chris Dyer, and Jacob Eisenstein. 2016. Document context language models. ICLR (Workshop track)

work page 2016
[19]

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, H \'e rve J \'e gou, and Tomas Mikolov. 2016. Fasttext. zip: Compressing text classification models. arXiv preprint arXiv:1612.03651

work page internal anchor Pith review Pith/arXiv arXiv 2016
[20]

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427--431. Association for Computational Linguistics

work page 2017
[21]

Suyoun Kim, Siddharth Dalmia, and Florian Metze. 2018. Situation informed end-to-end asr for chime-5 challenge. CHiME5 workshop

work page 2018
[22]

Suyoun Kim, Takaaki Hori, and Shinji Watanabe. 2017. Joint ctc-attention based end-to-end speech recognition using multi-task learning. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 4835--4839. IEEE

work page 2017
[23]

Suyoun Kim and Florian Metze. 2018. Dialog-context aware end-to-end speech recognition. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 434--440. IEEE

work page 2018
[24]

Suyoun Kim and Florian Metze. 2019. Acoustic-to-word models with conversational context information. NAACL

work page 2019
[25]

Suyoun Kim and Michael L Seltzer. 2018. Towards language-universal end-to-end speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4914--4918. IEEE

work page 2018
[26]

Jamie Kiros, William Chan, and Geoffrey Hinton. 2018. Illustrative language understanding: Large-scale visual grounding with image search. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 922--933

work page 2018
[27]

Jinyu Li, Guoli Ye, Amit Das, Rui Zhao, and Yifan Gong. 2018. Advancing acoustic-to-word ctc model. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5794--5798. IEEE

work page 2018
[28]

Bing Liu and Ian Lane. 2017. Dialog context language modeling with recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 5715--5719. IEEE

work page 2017
[29]

Yajie Miao, Mohammad Gowayyed, and Florian Metze. 2015. EESEN : End-to-end speech recognition using deep RNN models and WFST -based decoding. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 167--174. IEEE

work page 2015
[30]

Yajie Miao, Mohammad Gowayyed, Xingyu Na, Tom Ko, Florian Metze, and Alexander Waibel. 2016. An empirical exploration of ctc acoustic models. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2623--2627. IEEE

work page 2016
[31]

Tom \'a s Mikolov, Martin Karafi \'a t, Luk \'a s Burget, Jan C ernock \`y , and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association

work page 2010
[32]

Tomas Mikolov and Geoffrey Zweig. 2012. Context dependent recurrent neural network language model. SLT, 12:234--239

work page 2012
[33]

Shruti Palaskar and Florian Metze. 2018. Acoustic-to-word recognition with sequence-to-sequence models. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 397--404. IEEE

work page 2018
[34]

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310--1318

work page 2013
[35]

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in pytorch. In NIPS-W

work page 2017
[36]

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532--1543

work page 2014
[37]

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. NAACL

work page 2018
[38]

Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, and Sanjeev Khudanpur. 2016. Purely sequence-trained neural networks for asr based on lattice-free mmi. In Interspeech, pages 2751--2755

work page 2016
[39]

Golan Pundak, Tara N Sainath, Rohit Prabhavalkar, Anjuli Kannan, and Ding Zhao. 2018. Deep context: end-to-end contextual speech recognition. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 418--425. IEEE

work page 2018
[40]

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8)

work page 2019
[41]

Ramon Sanabria and Florian Metze. 2018. Hierarchical multitask learning with ctc. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 485--490. IEEE

work page 2018
[42]

Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. CoRR, abs/1611.01603

work page internal anchor Pith review Pith/arXiv arXiv 2017
[43]

Hagen Soltau, Hank Liao, and Hasim Sak. 2017. Neural speech recognizer: Acoustic-to-word lstm model for large vocabulary speech recognition. Interspeech

work page 2017
[44]

Ilya Sutskever, Oriol Vinyals, and Quoc VV Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104--3112

work page 2014
[45]

Tian Wang and Kyunghyun Cho. 2016. Larger-context language modelling. ACL

work page 2016
[46]

Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin , Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai. 2018. https://doi.org/10.21437/Interspeech.2018-1456 Espnet: End-to-end speech processing toolkit . In Interspeech, pages 2207--2211

work page doi:10.21437/interspeech.2018-1456 2018
[47]

Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R Hershey, and Tomoki Hayashi. 2017. Hybrid ctc/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8):1240--1253

work page 2017
[48]

Wayne Xiong, Lingfeng Wu, Jun Zhang, and Andreas Stolcke. 2018. Session-level language modeling for conversational speech. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2764--2768

work page 2018
[49]

Matthew D Zeiler. 2012. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701

work page internal anchor Pith review Pith/arXiv arXiv 2012
[50]

Albert Zeyer, Kazuki Irie, Ralf Schl \"u ter, and Hermann Ney. 2018. Improved training of end-to-end attention models for speech recognition. Interspeech

work page 2018
[51]

Yu Zhang, William Chan, and Navdeep Jaitly. 2017. Very deep convolutional networks for end-to-end speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 4845--4849. IEEE

work page 2017
[52]

Geoffrey Zweig, Chengzhu Yu, Jasha Droppo, and Andreas Stolcke. 2017. Advances in all-neural speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 4805--4809. IEEE

work page 2017
[53]

URL: " 'urlintro :=

ENTRY address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type volume year eprint doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRINGS urlintro eprinturl eprintpr...

work page
[54]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[1] [1]

Uri Alon, Golan Pundak, and Tara N Sainath. 2019. Contextual speech recognition with difficult negative training examples. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6440--6444. IEEE

work page 2019

[2] [2]

John Arevalo, Thamar Solorio, Manuel Montes-y G \'o mez, and Fabio A Gonz \'a lez. 2017. Gated multimodal units for information fusion. arXiv preprint arXiv:1702.01992

work page internal anchor Pith review Pith/arXiv arXiv 2017

[3] [3]

Kartik Audhkhasi, Brian Kingsbury, Bhuvana Ramabhadran, George Saon, and Michael Picheny. 2018. Building competitive direct acoustics-to-word models for english conversational speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4759--4763. IEEE

work page 2018

[4] [4]

Kartik Audhkhasi, Bhuvana Ramabhadran, George Saon, Michael Picheny, and David Nahamoo. 2017. Direct acoustics-to-word models for english conversational speech recognition. CoRR, abs/1703.07754

work page internal anchor Pith review Pith/arXiv arXiv 2017

[5] [5]

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. ICLR

work page 2015

[6] [6]

Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio. 2016. End-to-end attention-based large vocabulary speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4945--4949. IEEE

work page 2016

[7] [7]

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171--1179

work page 2015

[8] [8]

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135--146

work page 2017

[9] [9]

William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4960--4964. IEEE

work page 2016

[10] [10]

Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pages 577--585

work page 2015

[11] [11]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. NAACL

work page 2019

[12] [12]

John Godfrey and Edward Holliman. 1993. Switchboard-1 release 2 ldc97s62. Linguistic Data Consortium, Philadelphia, LDC97S62

work page 1993

[13] [13]

John J Godfrey, Edward C Holliman, and Jane McDaniel. 1992. Switchboard: Telephone speech corpus for research and development. In Acoustics, Speech, and Signal Processing, 1992. ICASSP-92., 1992 IEEE International Conference on, volume 1, pages 517--520. IEEE

work page 1992

[14] [14]

Alex Graves, Santiago Fern \'a ndez, Faustino Gomez, and J \"u rgen Schmidhuber. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pages 369--376. ACM

work page 2006

[15] [15]

Alex Graves and Navdeep Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1764--1772

work page 2014

[16] [16]

Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. 2014. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567

work page internal anchor Pith review Pith/arXiv arXiv 2014

[17] [17]

Takaaki Hori, Shinji Watanabe, Yu Zhang, and William Chan. 2017. Advances in joint ctc-attention based end-to-end speech recognition with a deep cnn encoder and rnn-lm. Interspeech

work page 2017

[18] [18]

Yangfeng Ji, Trevor Cohn, Lingpeng Kong, Chris Dyer, and Jacob Eisenstein. 2016. Document context language models. ICLR (Workshop track)

work page 2016

[19] [19]

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, H \'e rve J \'e gou, and Tomas Mikolov. 2016. Fasttext. zip: Compressing text classification models. arXiv preprint arXiv:1612.03651

work page internal anchor Pith review Pith/arXiv arXiv 2016

[20] [20]

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427--431. Association for Computational Linguistics

work page 2017

[21] [21]

Suyoun Kim, Siddharth Dalmia, and Florian Metze. 2018. Situation informed end-to-end asr for chime-5 challenge. CHiME5 workshop

work page 2018

[22] [22]

Suyoun Kim, Takaaki Hori, and Shinji Watanabe. 2017. Joint ctc-attention based end-to-end speech recognition using multi-task learning. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 4835--4839. IEEE

work page 2017

[23] [23]

Suyoun Kim and Florian Metze. 2018. Dialog-context aware end-to-end speech recognition. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 434--440. IEEE

work page 2018

[24] [24]

Suyoun Kim and Florian Metze. 2019. Acoustic-to-word models with conversational context information. NAACL

work page 2019

[25] [25]

Suyoun Kim and Michael L Seltzer. 2018. Towards language-universal end-to-end speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4914--4918. IEEE

work page 2018

[26] [26]

Jamie Kiros, William Chan, and Geoffrey Hinton. 2018. Illustrative language understanding: Large-scale visual grounding with image search. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 922--933

work page 2018

[27] [27]

Jinyu Li, Guoli Ye, Amit Das, Rui Zhao, and Yifan Gong. 2018. Advancing acoustic-to-word ctc model. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5794--5798. IEEE

work page 2018

[28] [28]

Bing Liu and Ian Lane. 2017. Dialog context language modeling with recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 5715--5719. IEEE

work page 2017

[29] [29]

Yajie Miao, Mohammad Gowayyed, and Florian Metze. 2015. EESEN : End-to-end speech recognition using deep RNN models and WFST -based decoding. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 167--174. IEEE

work page 2015

[30] [30]

Yajie Miao, Mohammad Gowayyed, Xingyu Na, Tom Ko, Florian Metze, and Alexander Waibel. 2016. An empirical exploration of ctc acoustic models. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2623--2627. IEEE

work page 2016

[31] [31]

Tom \'a s Mikolov, Martin Karafi \'a t, Luk \'a s Burget, Jan C ernock \`y , and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association

work page 2010

[32] [32]

Tomas Mikolov and Geoffrey Zweig. 2012. Context dependent recurrent neural network language model. SLT, 12:234--239

work page 2012

[33] [33]

Shruti Palaskar and Florian Metze. 2018. Acoustic-to-word recognition with sequence-to-sequence models. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 397--404. IEEE

work page 2018

[34] [34]

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310--1318

work page 2013

[35] [35]

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in pytorch. In NIPS-W

work page 2017

[36] [36]

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532--1543

work page 2014

[37] [37]

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. NAACL

work page 2018

[38] [38]

Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, and Sanjeev Khudanpur. 2016. Purely sequence-trained neural networks for asr based on lattice-free mmi. In Interspeech, pages 2751--2755

work page 2016

[39] [39]

Golan Pundak, Tara N Sainath, Rohit Prabhavalkar, Anjuli Kannan, and Ding Zhao. 2018. Deep context: end-to-end contextual speech recognition. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 418--425. IEEE

work page 2018

[40] [40]

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8)

work page 2019

[41] [41]

Ramon Sanabria and Florian Metze. 2018. Hierarchical multitask learning with ctc. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 485--490. IEEE

work page 2018

[42] [42]

Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. CoRR, abs/1611.01603

work page internal anchor Pith review Pith/arXiv arXiv 2017

[43] [43]

Hagen Soltau, Hank Liao, and Hasim Sak. 2017. Neural speech recognizer: Acoustic-to-word lstm model for large vocabulary speech recognition. Interspeech

work page 2017

[44] [44]

Ilya Sutskever, Oriol Vinyals, and Quoc VV Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104--3112

work page 2014

[45] [45]

Tian Wang and Kyunghyun Cho. 2016. Larger-context language modelling. ACL

work page 2016

[46] [46]

Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin , Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai. 2018. https://doi.org/10.21437/Interspeech.2018-1456 Espnet: End-to-end speech processing toolkit . In Interspeech, pages 2207--2211

work page doi:10.21437/interspeech.2018-1456 2018

[47] [47]

Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R Hershey, and Tomoki Hayashi. 2017. Hybrid ctc/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8):1240--1253

work page 2017

[48] [48]

Wayne Xiong, Lingfeng Wu, Jun Zhang, and Andreas Stolcke. 2018. Session-level language modeling for conversational speech. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2764--2768

work page 2018

[49] [49]

Matthew D Zeiler. 2012. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701

work page internal anchor Pith review Pith/arXiv arXiv 2012

[50] [50]

Albert Zeyer, Kazuki Irie, Ralf Schl \"u ter, and Hermann Ney. 2018. Improved training of end-to-end attention models for speech recognition. Interspeech

work page 2018

[51] [51]

Yu Zhang, William Chan, and Navdeep Jaitly. 2017. Very deep convolutional networks for end-to-end speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 4845--4849. IEEE

work page 2017

[52] [52]

Geoffrey Zweig, Chengzhu Yu, Jasha Droppo, and Andreas Stolcke. 2017. Advances in all-neural speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 4805--4809. IEEE

work page 2017

[53] [53]

URL: " 'urlintro :=

ENTRY address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type volume year eprint doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRINGS urlintro eprinturl eprintpr...

work page

[54] [54]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page