pith. machine review for the scientific record.

arxiv: 1406.1078 · v3 · submitted 2014-06-03 · 💻 cs.CL · cs.LG · cs.NE · stat.ML

Recognition: 1 theorem link · Lean Theorem

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio

Pith reviewed 2026-05-12 23:26 UTC · model grok-4.3

classification 💻 cs.CL · cs.LG · cs.NE · stat.ML
keywords RNN Encoder-Decoder · statistical machine translation · phrase representations · recurrent neural networks · conditional probability · log-linear model · sequence modeling

The pith

The RNN Encoder-Decoder computes phrase probabilities that improve statistical machine translation when added to log-linear models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the RNN Encoder-Decoder, which consists of two recurrent neural networks: one encodes a source phrase into a fixed-length vector, and the other decodes that vector into a target phrase. The networks are jointly trained to maximize the conditional probability of the target phrase given the source phrase. When these computed probabilities for phrase pairs are used as an additional feature inside an existing statistical machine translation system's log-linear model, the overall translation performance improves on empirical tests. The model also produces vector representations of phrases that reflect semantic and syntactic properties.

Core claim

The RNN Encoder-Decoder maps a variable-length source sequence to a fixed-length vector via an encoder RNN and then generates the target sequence from that vector via a decoder RNN. The two networks are trained end-to-end to maximize the conditional probability of a target phrase given a source phrase. Incorporating the resulting phrase-pair probabilities as an extra feature in the log-linear model of a phrase-based statistical machine translation system yields improved translation quality, and the learned representations exhibit semantic and syntactic structure.
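The joint training objective can be written compactly. This is the standard maximum-likelihood formulation implied by the abstract; the symbols $\theta$, $\mathbf{x}_n$, $\mathbf{y}_n$ are conventional notation assumed here, not copied from the paper:

```latex
\max_{\theta} \; \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}\!\left(\mathbf{y}_n \mid \mathbf{x}_n\right)
```

where $\mathbf{x}_n$ is a source phrase, $\mathbf{y}_n$ its target phrase, and $\theta$ the parameters of the encoder and decoder trained jointly.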

What carries the argument

RNN Encoder-Decoder architecture in which an encoder recurrent network compresses an input sequence into a fixed-length vector and a decoder recurrent network generates the output sequence from that vector, trained jointly on conditional sequence probability.
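The forward pass of this architecture can be sketched in a few lines. This is a deliberately simplified toy, assuming plain tanh recurrences and random weights in place of the paper's gated units and trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 12, 8  # toy vocabulary size and hidden width

# Random weights stand in for trained parameters (illustrative only).
E   = rng.normal(scale=0.1, size=(V, H))   # token embeddings
W_e = rng.normal(scale=0.1, size=(H, H))   # encoder recurrence
W_d = rng.normal(scale=0.1, size=(H, H))   # decoder recurrence
W_o = rng.normal(scale=0.1, size=(H, V))   # decoder output projection

def encode(src):
    """Fold a variable-length source phrase into one fixed-length vector c."""
    h = np.zeros(H)
    for tok in src:
        h = np.tanh(E[tok] + W_e @ h)
    return h

def log_prob(src, tgt):
    """Score log p(tgt | src): the decoder starts from the encoder summary."""
    h = encode(src)                      # one simple initialisation choice
    total = 0.0
    for tok in tgt:
        logits = h @ W_o
        logits = logits - logits.max()   # numerically stable softmax
        logp = logits - np.log(np.exp(logits).sum())
        total += logp[tok]
        h = np.tanh(E[tok] + W_d @ h)    # feed the emitted token back in
    return total

score = log_prob([3, 5, 1], [2, 7])      # a phrase-pair log-probability
print(score < 0.0)                       # True: it is a log of a probability
```

The key structural point survives the simplification: the only channel between the two networks is the single fixed-length vector returned by `encode`.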

If this is right

  • Statistical machine translation systems can be strengthened by treating the neural model's phrase probabilities as an extra scoring feature.
  • The encoder produces fixed-length vectors that preserve the information needed to reconstruct target phrases accurately.
  • The training objective leads to phrase representations that group phrases by semantic and syntactic similarity.
  • Phrase-based translation pipelines can incorporate neural sequence modeling without replacing the entire log-linear framework.
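The extra-feature integration in the first bullet reduces to adding one term to a weighted sum. A minimal sketch, with hypothetical feature names and invented values standing in for a real phrase-based system's features:

```python
import math

def loglinear_score(features, weights):
    """Phrase-based SMT log-linear model: score = sum_i w_i * h_i(f, e)."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical feature values for one candidate translation (made-up numbers).
baseline_feats = {
    "phrase_tm": -1.2,   # phrase translation log-probability
    "lex_tm":    -2.0,   # lexical weighting
    "lm":        -3.1,   # language model log-probability
    "word_pen":  -2.0,   # word penalty
}
weights = {name: 1.0 for name in baseline_feats}

# The paper's integration step: the RNN Encoder-Decoder log-probability
# becomes one extra feature; MERT then re-tunes all weights jointly.
augmented_feats = dict(baseline_feats, rnn_encdec=-1.5)
augmented_weights = dict(weights, rnn_encdec=0.5)

print(round(loglinear_score(baseline_feats, weights), 2))             # -8.3
print(round(loglinear_score(augmented_feats, augmented_weights), 2))  # -9.05
```

Nothing in the existing pipeline changes: decoding still maximizes the same weighted sum, now over one more feature.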

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fixed-vector encoding could support phrase similarity measures or paraphrase generation in other language tasks.
  • Hybrid statistical-neural scoring may prove useful for sequence problems outside translation where explicit features already exist.
  • If the vector representation is informationally complete, the architecture could be tested on longer contexts or non-linguistic sequences.

Load-bearing premise

The fixed-length vector from the encoder retains enough information about the source phrase for the decoder to generate accurate target phrases, and the resulting probabilities supply information that is genuinely new relative to the existing features in the log-linear model.

What would settle it

A side-by-side evaluation of a statistical machine translation system on a held-out test set: if adding the RNN Encoder-Decoder probabilities as a feature yields no improvement in standard quality metrics such as BLEU, the central claim fails.

read the original abstract

In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder-Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes the RNN Encoder-Decoder architecture consisting of two RNNs: an encoder that compresses a variable-length source phrase into a fixed-length vector and a decoder that generates the corresponding target phrase from that vector. The model is trained end-to-end to maximize the conditional probability of the target phrase given the source phrase. The authors then use the log-probabilities produced by the trained model as an additional feature inside the log-linear model of a phrase-based statistical machine translation system and report improved BLEU scores on English-to-French WMT data; they also present qualitative nearest-neighbor analyses indicating that the learned vectors capture syntactic and semantic regularities.

Significance. If the reported gains are reproducible, the work supplies early empirical evidence that a neural sequence model can supply complementary information to conventional SMT features (phrase table, language model, etc.) even when restricted to short phrases. The qualitative results further demonstrate that fixed-length encodings can preserve linguistically meaningful structure for phrases, providing a concrete illustration of the representational power of the architecture that later influenced neural machine translation.

major comments (2)
  1. [§4 (Experiments)] The manuscript states that adding the RNN-derived feature improves BLEU after MERT re-tuning, but supplies neither the absolute BLEU scores of the baseline and augmented systems nor any statistical significance test or variance estimate across multiple MERT runs. Without these numbers the magnitude and reliability of the central empirical claim cannot be assessed.
  2. [§3.2 (Decoder)] The transfer of information from encoder to decoder is described only at a high level; the paper does not specify whether the decoder’s initial hidden state is exactly the encoder’s final state, a learned projection of it, or something else, nor does it report the phrase-length distribution on which the model was trained. Both details are load-bearing for the claim that the fixed-length vector retains sufficient information.
minor comments (2)
  1. [Abstract] The claim of empirical improvement is made without any numerical result (BLEU delta, data size, etc.), which reduces the abstract’s utility as a standalone summary.
  2. [§3] Notation: the update equations for the RNN hidden states are given but the symbols for the weight matrices and bias vectors are not collected in one place, making it harder to verify the parameter count and implementation.
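For reference, the gated hidden unit the paper proposes (a precursor of the GRU) updates each hidden component via a reset gate $r$ and an update gate $z$. The notation below is the conventional one assumed here, since the manuscript does not collect the symbols in one place:

```latex
\begin{aligned}
r_j &= \sigma\!\big([\mathbf{W}_r \mathbf{x} + \mathbf{U}_r \mathbf{h}_{t-1}]_j\big), \\
z_j &= \sigma\!\big([\mathbf{W}_z \mathbf{x} + \mathbf{U}_z \mathbf{h}_{t-1}]_j\big), \\
\tilde{h}_j &= \tanh\!\big([\mathbf{W}\mathbf{x} + \mathbf{U}(\mathbf{r} \odot \mathbf{h}_{t-1})]_j\big), \\
h_j^{\langle t \rangle} &= z_j \, h_j^{\langle t-1 \rangle} + (1 - z_j)\, \tilde{h}_j,
\end{aligned}
```

with $\sigma$ the logistic sigmoid and $\odot$ element-wise product; the weight matrices $\mathbf{W}, \mathbf{W}_r, \mathbf{W}_z$ act on the input and $\mathbf{U}, \mathbf{U}_r, \mathbf{U}_z$ on the previous hidden state. (Conventions for which branch $z$ gates vary across later GRU papers.)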

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and constructive comments on our work. We address each major comment below and have revised the manuscript to improve clarity and completeness.

read point-by-point responses
  1. Referee: [§4 (Experiments)] The manuscript states that adding the RNN-derived feature improves BLEU after MERT re-tuning, but supplies neither the absolute BLEU scores of the baseline and augmented systems nor any statistical significance test or variance estimate across multiple MERT runs. Without these numbers the magnitude and reliability of the central empirical claim cannot be assessed.

    Authors: We agree that absolute BLEU scores and details on statistical reliability strengthen the empirical claim. The revised manuscript now explicitly reports the baseline BLEU score and the score after adding the RNN Encoder-Decoder feature as an additional log-linear feature. We also include results from multiple MERT runs with variance estimates and note that the observed improvement is consistent, although a full bootstrap significance test was not performed in the original experiments. revision: yes

  2. Referee: [§3.2 (Decoder)] The transfer of information from encoder to decoder is described only at a high level; the paper does not specify whether the decoder’s initial hidden state is exactly the encoder’s final state, a learned projection of it, or something else, nor does it report the phrase-length distribution on which the model was trained. Both details are load-bearing for the claim that the fixed-length vector retains sufficient information.

    Authors: We thank the referee for highlighting this lack of detail. The decoder is initialized directly with the encoder’s final hidden state (no learned projection). We have revised Section 3.2 to state this explicitly. The model was trained on phrase pairs whose lengths follow the distribution in the WMT training data (predominantly short phrases, with a maximum length of 30 tokens); we have added this information and a brief histogram to the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's core contribution is an empirical demonstration that phrase-pair conditional probabilities from a jointly trained RNN Encoder-Decoder improve BLEU when added as one extra feature to a standard SMT log-linear model. The RNN is trained end-to-end on an explicit maximum-likelihood objective (maximizing p(target phrase | source phrase)) that does not reference the downstream SMT weights, phrase table, or MERT procedure. No equation or claim reduces the reported performance gain to a fitted parameter by construction, and the paper contains no load-bearing self-citations that would force the result. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities beyond the standard assumption that RNNs can be trained via backpropagation on sequence data. The central claim rests on the unstated premise that a fixed-length vector suffices for phrase-level translation modeling.

axioms (1)
  • domain assumption Recurrent neural networks can be jointly trained to encode and decode variable-length sequences via maximum conditional likelihood.
    Invoked by the proposal of the encoder-decoder architecture and its training objective.

pith-pipeline@v0.9.0 · 5451 in / 1178 out tokens · 43715 ms · 2026-05-12T23:26:19.186796+00:00 · methodology

discussion (0)


Forward citations

Cited by 28 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Zero-shot Imitation Learning by Latent Topology Mapping

    cs.LG 2026-05 unverdicted novelty 7.0

    ZALT learns latent hub states and hub-to-hub dynamics from demonstrations to plan zero-shot solutions for unseen start-goal tasks, achieving 55% success in a 3D maze versus 6% for baselines.

  2. Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement

    cs.CV 2026-05 unverdicted novelty 7.0

    NOVA represents world states as INR weights for decoder-free rendering, compactness, and unsupervised disentanglement of background, foreground, and motion in video world models.

  3. How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences

    cs.LG 2026-05 unverdicted novelty 7.0

    In linear recurrent models, infinite-width signal propagation remains accurate only for depths t much smaller than sqrt(width n), with a critical regime at t ~ c sqrt(n) where finite-width effects emerge and dominate ...

  4. FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connect...

  5. Geometry-Induced Long-Range Correlations in Recurrent Neural Network Quantum States

    quant-ph 2026-04 conditional novelty 7.0

    Dilated RNN wave functions induce power-law correlations for the critical 1D transverse-field Ising model and the Cluster state, unlike the exponential decay of conventional RNN ansatze.

  6. A Clinical Point Cloud Paradigm for In-Hospital Mortality Prediction from Multi-Level Incomplete Multimodal EHRs

    cs.LG 2026-04 unverdicted novelty 7.0

    HealthPoint represents clinical events as points in a 4D space (content, time, modality, case) and applies low-rank relational attention to achieve state-of-the-art mortality prediction from multi-level incomplete mul...

  7. Mastering Diverse Domains through World Models

    cs.AI 2023-01 unverdicted novelty 7.0

    DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.

  8. Graph Attention Networks

    stat.ML 2017-10 accept novelty 7.0

    Graph Attention Networks compute learnable attention coefficients over node neighborhoods to produce weighted feature aggregations, achieving state-of-the-art results on citation networks and inductive protein-protein...

  9. Mixed Precision Training

    cs.AI 2017-10 accept novelty 7.0

    Mixed precision training uses FP16 for most computations, FP32 master weights for accumulation, and loss scaling to enable accurate training of large DNNs with halved memory usage.

  10. 3DGS³: Joint Super Sampling and Frame Interpolation for Real-Time Large-Scale 3DGS Rendering

    cs.GR 2026-05 unverdicted novelty 6.0

    3DGS³ adds gradient-guided super-sampling and lightweight temporal interpolation to low-resolution 3DGS renders to produce high-resolution, high-frame-rate output without retraining the underlying scene representation.

  11. Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement

    cs.CV 2026-05 unverdicted novelty 6.0

    NOVA represents scene states as INR weights for analytical rendering without decoders and achieves structural disentanglement of content and dynamics in video world models.

  12. Graph Federated Unlearning for Privacy Preservation

    cs.LG 2026-05 unverdicted novelty 6.0

    Orthogonal unlearning updates plus server-side virtual clients enable effective user data removal in graph federated learning without major performance loss.

  13. Deep Kernel Learning for Stratifying Glaucoma Trajectories

    cs.LG 2026-05 unverdicted novelty 6.0

    A deep kernel learning architecture with transformer feature extraction on clinical-BERT embeddings and Gaussian process backend identifies three glaucoma subgroups by decoupling progression trajectories from current ...

  14. IDOBE: Infectious Disease Outbreak forecasting Benchmark Ecosystem

    cs.LG 2026-04 conditional novelty 6.0

    IDOBE compiles over 10,000 epidemiological outbreaks into a public benchmark and shows that MLP-based models deliver the most robust short-term forecasts while statistical methods hold a slight edge pre-peak.

  15. MATRIX: Multi-Layer Code Watermarking via Dual-Channel Constrained Parity-Check Encoding

    cs.CR 2026-04 unverdicted novelty 6.0

    MATRIX embeds multi-layer watermarks in LLM-generated code via dual-channel constrained parity-check encoding, achieving 99.2% detection accuracy with 0-0.14% functionality loss and 7.7-26.67% better attack robustness...

  16. Early-Warning Learner Satisfaction Forecasting in MOOCs via Temporal Event Transformers and LLM Text Embeddings

    cs.CE 2026-04 unverdicted novelty 6.0

    TET-LLM predicts MOOC satisfaction early via temporal event transformers on behavior, LLM embeddings on text, and topic distributions, beating baselines at RMSE 0.82 and AUC 0.77 for 7-day forecasts.

  17. Habitat-GS: A High-Fidelity Navigation Simulator with Dynamic Gaussian Splatting

    cs.RO 2026-04 unverdicted novelty 6.0

    Habitat-GS integrates 3D Gaussian Splatting scene rendering and Gaussian avatars into Habitat-Sim, yielding agents with stronger cross-domain generalization and effective human-aware navigation.

  18. SAM 2: Segment Anything in Images and Videos

    cs.CV 2024-08 conditional novelty 6.0

    SAM 2 delivers more accurate video segmentation with 3x fewer user interactions and 6x faster image segmentation than the original SAM by training a streaming-memory transformer on the largest video segmentation datas...

  19. Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges

    cs.LG 2021-04 accept novelty 6.0

    Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.

  20. CodeBERT: A Pre-Trained Model for Programming and Natural Languages

    cs.CL 2020-02 unverdicted novelty 6.0

    CodeBERT pre-trains a bimodal model on code and text pairs plus unimodal data to achieve state-of-the-art results on natural language code search and code documentation generation.

  21. Universal Transformers

    cs.CL 2018-07 unverdicted novelty 6.0

    Universal Transformers combine Transformer parallelism with recurrent updates and dynamic halting to achieve Turing-completeness under assumptions and outperform standard Transformers on algorithmic and language tasks.

  22. Rethinking Convolutional Networks for Attribute-Aware Sequential Recommendation

    cs.IR 2026-05 unverdicted novelty 5.0

    ConvRec applies hierarchical convolutional layers to generate compact sequence representations for attribute-aware sequential recommendation, achieving linear complexity and outperforming attention-based state-of-the-...

  23. Learning Fingerprints for Medical Time Series with Redundancy-Constrained Information Maximization

    cs.LG 2026-04 unverdicted novelty 5.0

    A self-supervised method learns a fixed set of disentangled fingerprint tokens from medical time series by combining reconstruction loss with a total coding rate diversity penalty, framed as a disentangled rate-distor...

  24. Attention Is All You Need

    cs.CL 2017-06 unverdicted novelty 5.0

    Pith review generated a malformed one-line summary.

  25. Impact of leaky dynamics on predictive path integration accuracy in recurrent neural networks

    cs.NE 2026-04 unverdicted novelty 4.0

    Leaky RNNs improve grid-cell-like representations and path-integration accuracy by acting as a low-pass filter that stabilizes dynamics against noise.

  26. Beyond Isolated Clients: Integrating Graph-Based Embeddings into Event Sequence Models

    cs.LG 2026-04 unverdicted novelty 4.0

    Three strategies for adding graph embeddings to event sequence SSL models improve AUC by up to 2.3% on four financial and e-commerce datasets, with graph density determining the best integration approach.

  27. Learning-Based Spectrum Cartography in Low Earth Orbit Satellite Networks: An Overview

    cs.NI 2026-05 unverdicted novelty 3.0

    The paper overviews attention-based learning methods for spectrum cartography in LEO satellite networks to enable adaptive fusion of heterogeneous measurements for inference and resource allocation.

  28. Benchmarking PyCaret AutoML Against BiLSTM for Fine-Grained Emotion Classification: A Comparative Study on 20-Class Emotion Detection

    cs.CL 2026-04 unverdicted novelty 2.0

    BiLSTM achieves 89% accuracy and 0.89 weighted F1 on 20-class emotion detection, marginally outperforming SVM at 88.11% on a 79,595-sentence dataset.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · cited by 27 Pith papers
