Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
Pith reviewed 2026-05-12 23:26 UTC · model grok-4.3
The pith
The RNN Encoder-Decoder computes phrase-pair probabilities that improve statistical machine translation quality when added as an extra feature to a phrase-based system's log-linear model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The RNN Encoder-Decoder maps a variable-length source sequence to a fixed-length vector via an encoder RNN and then generates the target sequence from that vector via a decoder RNN. The two networks are trained end-to-end to maximize the conditional probability of a target phrase given a source phrase. Incorporating the resulting phrase-pair probabilities as an extra feature in the log-linear model of a phrase-based statistical machine translation system yields improved translation quality, and the learned representations exhibit semantic and syntactic structure.
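For orientation, the way such a score enters a phrase-based system can be written in standard log-linear form. The notation below (weights w_n, feature functions f_n, phrase-pair set phrases(f, e)) is generic SMT convention rather than the paper's exact symbols, so read it as a sketch.

```latex
% Generic phrase-based SMT log-linear decision rule (standard notation, not the
% paper's exact symbols): the decoder picks the translation e of source sentence f
% that maximizes a weighted sum of feature functions.
\hat{e} = \arg\max_{e} \sum_{n=1}^{N} w_n \, f_n(f, e)
% The claimed improvement comes from adding one extra feature, scored per phrase
% pair (x, y) used in the derivation, and re-tuning the weights (e.g., by MERT):
f_{N+1}(f, e) = \sum_{(x, y) \in \mathrm{phrases}(f, e)} \log p_{\theta}(y \mid x)
% where p_theta is the trained RNN Encoder-Decoder.
```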
What carries the argument
RNN Encoder-Decoder architecture in which an encoder recurrent network compresses an input sequence into a fixed-length vector and a decoder recurrent network generates the output sequence from that vector, trained jointly on conditional sequence probability.
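A minimal NumPy sketch of this encode-then-decode loop is given below. It substitutes plain tanh cells for the paper's gated hidden units and, as a simplification, seeds the decoder only with the encoder's final state rather than feeding the summary vector at every step; all names, sizes, and the toy vocabulary are illustrative.

```python
# Minimal sketch of the encoder-decoder idea with vanilla (tanh) RNN cells.
# Simplification of the paper's model: no gated units, summary used only as
# the decoder's initial state. Shapes and names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
V, H, E = 20, 32, 16          # vocabulary size, hidden size, embedding size

emb = rng.normal(0, 0.1, (V, E))
W_enc, U_enc, b_enc = rng.normal(0, 0.1, (E, H)), rng.normal(0, 0.1, (H, H)), np.zeros(H)
W_dec, U_dec, b_dec = rng.normal(0, 0.1, (E, H)), rng.normal(0, 0.1, (H, H)), np.zeros(H)
W_out, b_out = rng.normal(0, 0.1, (H, V)), np.zeros(V)

def encode(src_ids):
    """Compress a variable-length source phrase into one fixed-length vector."""
    h = np.zeros(H)
    for tok in src_ids:
        h = np.tanh(emb[tok] @ W_enc + h @ U_enc + b_enc)
    return h                                    # the fixed-length summary

def decoder_log_prob(summary, tgt_ids, bos=0):
    """log p(target phrase | source phrase) under the toy decoder."""
    h, prev, logp = summary, bos, 0.0
    for tok in tgt_ids:
        h = np.tanh(emb[prev] @ W_dec + h @ U_dec + b_dec)
        logits = h @ W_out + b_out
        logits -= logits.max()                  # numerically stable softmax
        log_softmax = logits - np.log(np.exp(logits).sum())
        logp += log_softmax[tok]
        prev = tok
    return logp

# Training would adjust all parameters by gradient ascent on this quantity,
# summed over phrase pairs; the trained score is what the paper then treats
# as one extra log-linear feature in the SMT system.
src, tgt = [3, 7, 2], [5, 1]
print(decoder_log_prob(encode(src), tgt))
```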
If this is right
- Statistical machine translation systems can be strengthened by treating the neural model's phrase probabilities as an extra scoring feature.
- The encoder produces fixed-length vectors that preserve the information needed to reconstruct target phrases accurately.
- The training objective leads to phrase representations that group phrases by semantic and syntactic similarity.
- Phrase-based translation pipelines can incorporate neural sequence modeling without replacing the entire log-linear framework.
Where Pith is reading between the lines
- The same fixed-vector encoding could support phrase similarity measures or paraphrase generation in other language tasks.
- Hybrid statistical-neural scoring may prove useful for sequence problems outside translation where explicit features already exist.
- If the vector representation is informationally complete, the architecture could be tested on longer contexts or non-linguistic sequences.
Load-bearing premise
The fixed-length vector from the encoder retains enough information about the source phrase for the decoder to generate accurate target phrases, and the resulting probabilities supply information that is genuinely new relative to the existing features in the log-linear model.
What would settle it
A side-by-side evaluation of a statistical machine translation system on a held-out test set that shows no improvement in standard quality metrics when the RNN Encoder-Decoder probabilities are added as a feature.
read the original abstract
In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder-Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the RNN Encoder-Decoder architecture consisting of two RNNs: an encoder that compresses a variable-length source phrase into a fixed-length vector and a decoder that generates the corresponding target phrase from that vector. The model is trained end-to-end to maximize the conditional probability of the target phrase given the source phrase. The authors then use the log-probabilities produced by the trained model as an additional feature inside the log-linear model of a phrase-based statistical machine translation system and report improved BLEU scores on English-to-French WMT data; they also present qualitative nearest-neighbor analyses indicating that the learned vectors capture syntactic and semantic regularities.
Significance. If the reported gains are reproducible, the work supplies early empirical evidence that a neural sequence model can supply complementary information to conventional SMT features (phrase table, language model, etc.) even when restricted to short phrases. The qualitative results further demonstrate that fixed-length encodings can preserve linguistically meaningful structure for phrases, providing a concrete illustration of the representational power of the architecture that later influenced neural machine translation.
major comments (2)
- §4 (Experiments): The manuscript states that adding the RNN-derived feature improves BLEU after MERT re-tuning, but supplies neither the absolute BLEU scores of the baseline and augmented systems nor any statistical significance test or variance estimate across multiple MERT runs. Without these numbers the magnitude and reliability of the central empirical claim cannot be assessed.
- §3.2 (Decoder): The transfer of information from encoder to decoder is described only at a high level; the paper does not specify whether the decoder’s initial hidden state is exactly the encoder’s final state, a learned projection of it, or something else, nor does it report the phrase-length distribution on which the model was trained. Both details are load-bearing for the claim that the fixed-length vector retains sufficient information.
minor comments (2)
- Abstract: The claim of empirical improvement is made without any numerical result (BLEU delta, data size, etc.), which reduces the abstract’s utility as a standalone summary.
- §3 (Notation): The update equations for the RNN hidden states are given, but the symbols for the weight matrices and bias vectors are not collected in one place, making it harder to verify the parameter count and implementation.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and constructive comments on our work. We address each major comment below and have revised the manuscript to improve clarity and completeness.
read point-by-point responses
-
Referee: §4 (Experiments): The manuscript states that adding the RNN-derived feature improves BLEU after MERT re-tuning, but supplies neither the absolute BLEU scores of the baseline and augmented systems nor any statistical significance test or variance estimate across multiple MERT runs. Without these numbers the magnitude and reliability of the central empirical claim cannot be assessed.
Authors: We agree that absolute BLEU scores and details on statistical reliability strengthen the empirical claim. The revised manuscript now explicitly reports the baseline BLEU score and the score after adding the RNN Encoder-Decoder feature as an additional log-linear feature. We also include results from multiple MERT runs with variance estimates and note that the observed improvement is consistent, although a full bootstrap significance test was not performed in the original experiments. revision: yes
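The paired bootstrap test this response alludes to is straightforward to run after the fact. Below is a minimal, library-agnostic sketch; the corpus_score callable, function name, and the 0.95 reading are illustrative conventions, not anything the paper specifies.

```python
# Hedged sketch of paired bootstrap resampling for comparing two MT systems.
# `corpus_score` is a placeholder for any corpus-level metric (e.g., BLEU),
# passed in so no particular library API is assumed.
import random

def paired_bootstrap(hyps_a, hyps_b, refs, corpus_score, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A scores above system B."""
    assert len(hyps_a) == len(hyps_b) == len(refs)
    rng = random.Random(seed)
    idx = range(len(refs))
    wins = 0
    for _ in range(n_samples):
        sample = [rng.choice(idx) for _ in idx]   # resample sentences with replacement
        a = corpus_score([hyps_a[i] for i in sample], [refs[i] for i in sample])
        b = corpus_score([hyps_b[i] for i in sample], [refs[i] for i in sample])
        wins += a > b
    # Values near 1.0 (conventionally > 0.95) are read as A significantly better.
    return wins / n_samples
```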
-
Referee: §3.2 (Decoder): The transfer of information from encoder to decoder is described only at a high level; the paper does not specify whether the decoder’s initial hidden state is exactly the encoder’s final state, a learned projection of it, or something else, nor does it report the phrase-length distribution on which the model was trained. Both details are load-bearing for the claim that the fixed-length vector retains sufficient information.
Authors: We thank the referee for highlighting this lack of detail. The decoder is initialized directly with the encoder’s final hidden state (no learned projection). We have revised Section 3.2 to state this explicitly. The model was trained on phrase pairs whose lengths follow the distribution in the WMT training data (predominantly short phrases, with a maximum length of 30 tokens); we have added this information and a brief histogram to the revised manuscript. revision: yes
Circularity Check
No significant circularity
full rationale
The paper's core contribution is an empirical demonstration that phrase-pair conditional probabilities from a jointly trained RNN Encoder-Decoder improve BLEU when added as one extra feature to a standard SMT log-linear model. The RNN is trained end-to-end on an explicit maximum-likelihood objective (maximizing p(target phrase | source phrase)) that does not reference the downstream SMT weights, phrase table, or MERT procedure. No equation or claim reduces the reported performance gain to a fitted parameter by construction, and the paper contains no load-bearing self-citations that would force the result. The evidential chain is therefore evaluated against external benchmarks rather than against quantities the model defines for itself.
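For concreteness, the training objective the rationale refers to can be written in one line; the symbols below are generic notation (a sketch), not the paper's exact formulation.

```latex
% theta collects all encoder and decoder parameters; (x_n, y_n) are the
% source/target phrase pairs extracted from the bilingual training corpus.
\max_{\theta} \; \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}\bigl(y_n \mid x_n\bigr)
% Nothing on the right-hand side involves the SMT log-linear weights, the
% phrase-table scores, or MERT, which is the substance of the finding above.
```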
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Recurrent neural networks can be jointly trained to encode and decode variable-length sequences via maximum conditional likelihood.
Forward citations
Cited by 28 Pith papers
-
Zero-shot Imitation Learning by Latent Topology Mapping
ZALT learns latent hub states and hub-to-hub dynamics from demonstrations to plan zero-shot solutions for unseen start-goal tasks, achieving 55% success in a 3D maze versus 6% for baselines.
-
Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement
NOVA represents world states as INR weights for decoder-free rendering, compactness, and unsupervised disentanglement of background, foreground, and motion in video world models.
-
How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences
In linear recurrent models, infinite-width signal propagation remains accurate only for depths t much smaller than sqrt(width n), with a critical regime at t ~ c sqrt(n) where finite-width effects emerge and dominate ...
-
FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning
FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connect...
-
Geometry-Induced Long-Range Correlations in Recurrent Neural Network Quantum States
Dilated RNN wave functions induce power-law correlations for the critical 1D transverse-field Ising model and the Cluster state, unlike the exponential decay of conventional RNN ansatze.
-
A Clinical Point Cloud Paradigm for In-Hospital Mortality Prediction from Multi-Level Incomplete Multimodal EHRs
HealthPoint represents clinical events as points in a 4D space (content, time, modality, case) and applies low-rank relational attention to achieve state-of-the-art mortality prediction from multi-level incomplete mul...
-
Mastering Diverse Domains through World Models
DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.
-
Graph Attention Networks
Graph Attention Networks compute learnable attention coefficients over node neighborhoods to produce weighted feature aggregations, achieving state-of-the-art results on citation networks and inductive protein-protein...
-
Mixed Precision Training
Mixed precision training uses FP16 for most computations, FP32 master weights for accumulation, and loss scaling to enable accurate training of large DNNs with halved memory usage.
-
3DGS$^3$: Joint Super Sampling and Frame Interpolation for Real-Time Large-Scale 3DGS Rendering
3DGS³ adds gradient-guided super-sampling and lightweight temporal interpolation to low-resolution 3DGS renders to produce high-resolution, high-frame-rate output without retraining the underlying scene representation.
-
Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement
NOVA represents scene states as INR weights for analytical rendering without decoders and achieves structural disentanglement of content and dynamics in video world models.
-
Graph Federated Unlearning for Privacy Preservation
Orthogonal unlearning updates plus server-side virtual clients enable effective user data removal in graph federated learning without major performance loss.
-
Deep Kernel Learning for Stratifying Glaucoma Trajectories
A deep kernel learning architecture with transformer feature extraction on clinical-BERT embeddings and Gaussian process backend identifies three glaucoma subgroups by decoupling progression trajectories from current ...
-
IDOBE: Infectious Disease Outbreak forecasting Benchmark Ecosystem
IDOBE compiles over 10,000 epidemiological outbreaks into a public benchmark and shows that MLP-based models deliver the most robust short-term forecasts while statistical methods hold a slight edge pre-peak.
-
MATRIX: Multi-Layer Code Watermarking via Dual-Channel Constrained Parity-Check Encoding
MATRIX embeds multi-layer watermarks in LLM-generated code via dual-channel constrained parity-check encoding, achieving 99.2% detection accuracy with 0-0.14% functionality loss and 7.7-26.67% better attack robustness...
-
Early-Warning Learner Satisfaction Forecasting in MOOCs via Temporal Event Transformers and LLM Text Embeddings
TET-LLM predicts MOOC satisfaction early via temporal event transformers on behavior, LLM embeddings on text, and topic distributions, beating baselines at RMSE 0.82 and AUC 0.77 for 7-day forecasts.
-
Habitat-GS: A High-Fidelity Navigation Simulator with Dynamic Gaussian Splatting
Habitat-GS integrates 3D Gaussian Splatting scene rendering and Gaussian avatars into Habitat-Sim, yielding agents with stronger cross-domain generalization and effective human-aware navigation.
-
SAM 2: Segment Anything in Images and Videos
SAM 2 delivers more accurate video segmentation with 3x fewer user interactions and 6x faster image segmentation than the original SAM by training a streaming-memory transformer on the largest video segmentation datas...
-
Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.
-
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
CodeBERT pre-trains a bimodal model on code and text pairs plus unimodal data to achieve state-of-the-art results on natural language code search and code documentation generation.
-
Universal Transformers
Universal Transformers combine Transformer parallelism with recurrent updates and dynamic halting to achieve Turing-completeness under assumptions and outperform standard Transformers on algorithmic and language tasks.
-
Rethinking Convolutional Networks for Attribute-Aware Sequential Recommendation
ConvRec applies hierarchical convolutional layers to generate compact sequence representations for attribute-aware sequential recommendation, achieving linear complexity and outperforming attention-based state-of-the-...
-
Learning Fingerprints for Medical Time Series with Redundancy-Constrained Information Maximization
A self-supervised method learns a fixed set of disentangled fingerprint tokens from medical time series by combining reconstruction loss with a total coding rate diversity penalty, framed as a disentangled rate-distor...
-
Attention Is All You Need
The Transformer relies entirely on attention mechanisms, dispensing with recurrence and convolution, and achieves state-of-the-art machine translation quality with greater parallelism.
-
Impact of leaky dynamics on predictive path integration accuracy in recurrent neural networks
Leaky RNNs improve grid-cell-like representations and path-integration accuracy by acting as a low-pass filter that stabilizes dynamics against noise.
-
Beyond Isolated Clients: Integrating Graph-Based Embeddings into Event Sequence Models
Three strategies for adding graph embeddings to event sequence SSL models improve AUC by up to 2.3% on four financial and e-commerce datasets, with graph density determining the best integration approach.
-
Learning-Based Spectrum Cartography in Low Earth Orbit Satellite Networks: An Overview
The paper overviews attention-based learning methods for spectrum cartography in LEO satellite networks to enable adaptive fusion of heterogeneous measurements for inference and resource allocation.
-
Benchmarking PyCaret AutoML Against BiLSTM for Fine-Grained Emotion Classification: A Comparative Study on 20-Class Emotion Detection
BiLSTM achieves 89% accuracy and 0.89 weighted F1 on 20-class emotion detection, marginally outperforming SVM at 88.11% on a 79,595-sentence dataset.