arXiv: 1409.0473 · v7 · submitted 2014-09-01 · cs.CL · cs.LG · cs.NE · stat.ML

Neural Machine Translation by Jointly Learning to Align and Translate

Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio

Pith reviewed 2026-05-11 09:16 UTC · model grok-4.3

classification cs.CL · cs.LG · cs.NE · stat.ML

keywords neural machine translation · attention mechanism · encoder-decoder · soft alignment · sequence-to-sequence · machine translation · end-to-end training

The pith

A neural translation model learns to focus on relevant source words while generating each target word.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that the standard encoder-decoder architecture for neural machine translation is limited by forcing all source information into one fixed-length vector. It proposes extending the model so the decoder can automatically and softly search for the most relevant parts of the source sentence when predicting each target word. This joint learning of alignment and translation is meant to remove the bottleneck and produce end-to-end trainable systems. A reader would care because it offers a path to simpler, fully neural translation pipelines that might scale without hand-engineered phrase tables. The authors report that the resulting model reaches translation quality comparable to the best phrase-based systems on English-to-French.

Core claim

The authors conjecture that a fixed-length context vector creates a performance bottleneck in basic encoder-decoder networks for machine translation. They introduce an attention mechanism that computes a distinct context vector for each target word as a weighted sum of the source sentence's hidden states, with the weights learned jointly during training. This allows the model to softly align source and target positions without explicit segmentation. On the WMT English-to-French task the approach reaches BLEU scores that the authors describe as comparable to a strong phrase-based baseline, while producing alignments that match human intuition.
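
In symbols, the mechanism in this paragraph can be sketched as follows. The notation is an assumption layered on this summary, which does not spell out the equations: h_j stands for the encoder hidden state at source position j, s_{i-1} for the decoder state before target word i is emitted, a for the jointly trained scoring network, and T_x for the source length.

```latex
e_{ij} = a(s_{i-1}, h_j), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad
c_i = \sum_{j=1}^{T_x} \alpha_{ij} \, h_j
```

Because the weights \alpha_{ij} are positive and sum to one over j, each context vector c_i is a convex combination of the source states, which is the sense in which the alignment is soft rather than a hard segmentation.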

What carries the argument

The attention-based alignment model that produces a context vector for each decoding step as a weighted combination of encoder hidden states, with weights derived from a feed-forward network trained jointly with the translation objective.
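
A minimal sketch of one such attention step, in plain Python, may make the moving parts concrete. It assumes a single-hidden-layer tanh scorer, which is one common form of feed-forward alignment network; the parameter names `W_a`, `U_a`, `v_a`, the helper names, and the toy values are hypothetical and not taken from this page. The weights here are hand-picked, not trained.

```python
import math


def softmax(scores):
    """Normalize raw scores into positive weights that sum to one."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]


def additive_score(s_prev, h_j, W_a, U_a, v_a):
    """Assumed scorer: v_a . tanh(W_a s_prev + U_a h_j)."""
    ws = matvec(W_a, s_prev)
    uh = matvec(U_a, h_j)
    hidden = [math.tanh(a + b) for a, b in zip(ws, uh)]
    return dot(v_a, hidden)


def attend(s_prev, encoder_states, W_a, U_a, v_a):
    """Return (attention weights, context vector) for one decoding step."""
    scores = [additive_score(s_prev, h, W_a, U_a, v_a) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context


# Toy inputs: three source positions with two-dimensional hidden states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_prev = [0.5, -0.5]
W_a = [[1.0, 0.0], [0.0, 1.0]]
U_a = [[1.0, 0.0], [0.0, 1.0]]
v_a = [1.0, 1.0]

weights, context = attend(s_prev, encoder_states, W_a, U_a, v_a)
print("weights:", weights)
print("context:", context)
```

In a trained model the scorer parameters would be updated by the same gradient signal as the rest of the network, which is what "trained jointly with the translation objective" refers to.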

If this is right

Translation systems can be trained end-to-end as a single network rather than relying on separate alignment and phrase-table components.
Performance on longer sentences should improve because relevance can be selected dynamically instead of being compressed into one vector.
The learned soft alignments provide an interpretable view of which source words influence each target word.
The same joint alignment-and-generation approach can be applied to other sequence tasks where input relevance varies by output position.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The attention weights could serve as a starting point for extracting explicit phrase pairs or for debugging translation errors.
Models that build on this idea might combine soft attention with hard constraints or multiple attention layers to handle very long documents.
The removal of the fixed-vector bottleneck suggests similar gains are possible in any encoder-decoder setting where the input sequence is much longer than the output.

Load-bearing premise

That forcing the entire source sentence into a single fixed-length vector prevents the decoder from accessing the right information when generating different target words.

What would settle it

An experiment in which a basic encoder-decoder model without the soft-search mechanism reaches the same BLEU score as the attention model on the identical English-to-French test set.

Original abstract

Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consists of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper adds soft attention to encoder-decoder NMT so the decoder can dynamically focus on source words, and reports En-Fr BLEU scores that the authors describe as comparable to the best phrase-based systems.

The main takeaway is that this work shows how to replace the single fixed context vector in a basic RNN encoder-decoder with a soft attention mechanism. For each target word the decoder computes weights over the source hidden states and uses a weighted sum, all trained jointly with the translation loss. They get English-to-French BLEU numbers on WMT'14 that sit at the level of the strongest phrase-based systems then available, plus some qualitative plots where the learned alignments look sensible for word pairs and phrases.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes an attention-based extension to the encoder-decoder architecture for neural machine translation. Rather than compressing the source sentence into a single fixed-length vector, the decoder learns to compute soft alignment weights over source positions at each decoding step, allowing it to focus on relevant source words when predicting each target word. The model is trained end-to-end on parallel data. On the WMT 2014 English-to-French task the attention model reaches 28.45 BLEU, described as comparable to a strong phrase-based baseline (Moses at 33.30 BLEU), and qualitative inspection shows that the learned soft alignments are intuitive.
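
The size of the gap between the two BLEU figures quoted in this summary can be checked with a few lines of arithmetic. This is only a sketch using the two numbers above; the variable names are illustrative.

```python
# BLEU scores as quoted in the referee summary.
attention_bleu = 28.45
moses_bleu = 33.30

# Absolute gap in BLEU points, and the gap relative to the Moses score.
gap = round(moses_bleu - attention_bleu, 2)
relative_gap_pct = round(100 * (moses_bleu - attention_bleu) / moses_bleu, 1)

print(gap)               # 4.85
print(relative_gap_pct)  # 14.6
```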

Significance. If the reported BLEU scores and alignment visualizations hold under scrutiny, the work is significant because it supplies the first large-scale empirical demonstration that a neural translation model can learn to perform soft alignment jointly with translation. This directly addresses the fixed-context-vector limitation that motivated the paper and introduces the attention mechanism that later became standard in sequence modeling. The combination of quantitative results on a competitive benchmark and qualitative evidence of sensible alignments gives the central claim a solid empirical footing.

major comments (2)

[§4.1 and Table 1] The claim that performance is 'comparable' to the state-of-the-art phrase-based system rests on a single-run BLEU of 28.45 versus 33.30 for Moses. A 4.85-point gap is large enough that the comparability statement would be strengthened by reporting variance across random seeds, an ensemble result, or a direct comparison against the best contemporaneous neural baselines on the same data split.
[§3.2, Eq. (5)–(7)] The alignment model is a simple feed-forward network whose output is normalized by softmax; the paper does not analyze or mitigate potential gradient vanishing when source sentences exceed the lengths seen in training. Because the motivating conjecture concerns long-sentence performance, this omission is load-bearing for the central architectural claim.
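
One way to picture the length concern in the second comment is to look at how softmax weights spread out as the number of source positions grows. The sketch below is an illustration under simplifying assumptions, not the gradient analysis the referee asks for: it holds the score margin of one "relevant" position fixed and varies only the sentence length. The function name `peak_weight` and the chosen lengths are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw scores into positive weights that sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def peak_weight(length, boost=0.0):
    """Largest attention weight when one position scores `boost` above the rest."""
    scores = [0.0] * length
    scores[0] = boost
    return max(softmax(scores))


# With no margin the weights are uniform (1 / length); with a fixed margin
# the favored position still loses mass as more positions compete.
for n in (10, 50, 200):
    print(n, peak_weight(n), peak_weight(n, boost=2.0))
```

Under these assumptions the favored position receives a smaller share of the attention as the sentence gets longer, unless the learned scorer produces larger margins for longer inputs.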

minor comments (3)

[Abstract] The abstract states the main result without quoting the actual BLEU numbers or naming the test set; adding these two facts would make the abstract self-contained.
[Figure 3] The alignment heatmaps lack explicit word labels on both axes and a color-bar scale, making it harder for readers to verify the claimed agreement with intuition.
[§2.2] The description of the basic RNN encoder-decoder could cite the exact prior work (Sutskever et al., 2014) more explicitly when stating the fixed-vector bottleneck.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate where revisions will be made.

Referee: [§4.1 and Table 1] The claim that performance is 'comparable' to the state-of-the-art phrase-based system rests on a single-run BLEU of 28.45 versus 33.30 for Moses. A 4.85-point gap is large enough that the comparability statement would be strengthened by reporting variance across random seeds, an ensemble result, or a direct comparison against the best contemporaneous neural baselines on the same data split.

Authors: We acknowledge the 4.85 BLEU gap and agree that 'comparable' may overstate the absolute performance relative to the strong Moses baseline. The intent of the claim was to emphasize that an end-to-end neural model without phrase tables or hand-crafted features could reach a level close enough to be practically relevant on a large-scale task. We did not run multiple random seeds or ensembles due to the high computational cost of training on the full WMT data at the time. We will revise the abstract and §4.1 to describe the result as 'competitive with' or 'approaching' the phrase-based system and will add a brief comparison to other neural encoder-decoder baselines available at submission time. This constitutes a partial revision.
revision: partial

Referee: [§3.2, Eq. (5)–(7)] The alignment model is a simple feed-forward network whose output is normalized by softmax; the paper does not analyze or mitigate potential gradient vanishing when source sentences exceed the lengths seen in training. Because the motivating conjecture concerns long-sentence performance, this omission is load-bearing for the central architectural claim.

Authors: The alignment model indeed uses a simple feed-forward scorer followed by softmax normalization over source positions. While this can in principle dilute gradients for source sentences much longer than those seen during training, our experiments were conducted on the standard WMT splits where sentence lengths are bounded, and the attention model showed clear gains over the fixed-vector baseline. We did not include an explicit gradient analysis because the paper's focus was on the empirical demonstration of jointly learned soft alignments. We will add a short discussion in §3.2 noting the potential limitation for extremely long sequences and pointing to the empirical improvement on longer sentences in the test set. This is a partial revision.
revision: partial

Circularity Check

0 steps flagged

No significant circularity

The paper defines an attention-augmented encoder-decoder from first principles (bidirectional RNN encoder, decoder with soft alignment probabilities computed via a feedforward network, trained end-to-end by maximizing log-likelihood on parallel sentence pairs). Performance is measured by standard BLEU on held-out WMT test data; no fitted parameter is defined in terms of BLEU, no self-citation supplies a uniqueness theorem or ansatz, and the fixed-length-vector conjecture is offered only as motivation. All load-bearing steps (model equations, training objective, alignment visualization) are self-contained and externally falsifiable.
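
The ingredients listed in this rationale can be written down compactly. The block below is a sketch in notation assumed for this page rather than quoted from it: the arrows mark the forward and backward encoder states whose concatenation forms h_j, T_y is the target length, \theta collects all model parameters, and the sum runs over N parallel sentence pairs (x^{(n)}, y^{(n)}).

```latex
h_j = \bigl[\, \overrightarrow{h}_j^{\top} \,;\, \overleftarrow{h}_j^{\top} \,\bigr]^{\top}, \qquad
p_\theta(y \mid x) = \prod_{i=1}^{T_y} p_\theta\bigl(y_i \mid y_1, \dots, y_{i-1}, x\bigr), \qquad
\theta^{\ast} = \arg\max_{\theta} \frac{1}{N} \sum_{n=1}^{N} \log p_\theta\bigl(y^{(n)} \mid x^{(n)}\bigr)
```

BLEU does not appear anywhere in this objective, which is consistent with the audit's observation that no fitted parameter is defined in terms of the evaluation metric.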

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The proposal rests on standard neural network training assumptions plus the specific conjecture about fixed vectors; no new physical entities are postulated.

free parameters (1)

attention model parameters
Weights of the alignment model are learned from data during joint training.

axioms (1)

domain assumption: A single neural network can be jointly tuned to maximize translation performance
Stated as the goal of neural machine translation in the abstract.

invented entities (1)

soft alignment weights · no independent evidence
purpose: To allow the decoder to focus on variable relevant source parts
New component introduced to extend the encoder-decoder

pith-pipeline@v0.9.0 · 5489 in / 1063 out tokens · 101742 ms · 2026-05-11T09:16:53.286704+00:00 · methodology

Forward citations

Cited by 50 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Characterizing the Expressivity of Local Attention in Transformers
cs.CL 2026-05 unverdicted novelty 8.0

Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by adding a second past operator in linear temporal logic, with global and local attention being expressive...
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
cs.LG 2017-01 accept novelty 8.0

A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.
Adaptive Computation Time for Recurrent Neural Networks
cs.NE 2016-03 accept novelty 8.0

ACT lets RNNs dynamically adapt computation depth per input via a differentiable halting unit, yielding large gains on synthetic tasks and structural insights on language data.
Neural Turing Machines
cs.NE 2014-10 unverdicted novelty 8.0

Neural Turing Machines augment neural networks with differentiable external memory to learn algorithmic tasks such as copying, sorting, and associative recall from examples.
GravityGraphSAGE: Link Prediction in Directed Attributed Graphs
cs.LG 2026-05 unverdicted novelty 7.0

GravityGraphSAGE adapts GraphSAGE with a gravity-inspired decoder to outperform prior graph deep learning methods on directed link prediction across citation networks and 16 real-world graphs.
PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
cs.LG 2026-05 unverdicted novelty 7.0

PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...
Arbitrarily Conditioned Hierarchical Flows for Spatiotemporal Events
cs.LG 2026-05 unverdicted novelty 7.0

ARCH is a hierarchical flow-based generative model that enables tractable conditional intensity computation and arbitrary conditioning for spatiotemporal event distributions.
Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models
math.PR 2026-04 unverdicted novelty 7.0

Transformers converge pathwise to a stochastic particle system and SPDE in the scaling limit, exhibiting synchronization by noise and exponential energy dissipation when common noise is coercive relative to self-atten...
Selective Contrastive Learning For Gloss Free Sign Language Translation
cs.CL 2026-04 unverdicted novelty 7.0

A pair selection strategy based on negative similarity dynamics strengthens contrastive supervision in gloss-free sign language translation by reducing noisy negatives.
AlphaEvolve: A coding agent for scientific and algorithmic discovery
cs.AI 2025-06 unverdicted novelty 7.0

AlphaEvolve is an LLM-orchestrated evolutionary coding agent that discovered a 4x4 complex matrix multiplication algorithm using 48 scalar multiplications, the first improvement over Strassen's algorithm in 56 years, ...
In-context Learning and Induction Heads
cs.LG 2022-09 unverdicted novelty 7.0

Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning i...
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
cs.CL 2018-08 accept novelty 7.0

SentencePiece trains subword models directly from raw text to enable language-independent neural text processing.
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
cs.CL 2016-11 accept novelty 7.0

MS MARCO is a new large-scale machine reading comprehension dataset built from real Bing search queries, human-generated answers, and web passages, supporting three tasks including answer synthesis and passage ranking.
Cubit: Token Mixer with Kernel Ridge Regression
cs.LG 2026-05 unverdicted novelty 6.0

Cubit replaces Transformer attention with Kernel Ridge Regression token mixing and shows potential gains on longer sequences.
Shortcut Solutions Learned by Transformers Impair Continual Compositional Reasoning
cs.LG 2026-05 unverdicted novelty 6.0

BERT learns shortcut solutions that impair generalization and forward transfer in continual LEGO, while ALBERT learns loop-like solutions for better performance, yet both fail at cross-experience composition, with ALB...
Benchmarking POS Tagging for the Tajik Language: A Comparative Study of Neural Architectures on the TajPersParallel Corpus
cs.CL 2026-05 unverdicted novelty 6.0

mBERT with LoRA achieves the best weighted F1 of 0.62 for Tajik POS tagging on context-free dictionary entries, but macro F1 is only 0.11, with all models failing on rare function words.
A Multimodal Dataset for Visually Grounded Ambiguity in Machine Translation
cs.CL 2026-05 unverdicted novelty 6.0

VIDA provides 2,500 visually-dependent ambiguous MT instances and LLM-judge metrics; chain-of-thought SFT improves disambiguation accuracy over standard SFT, especially out-of-distribution.
Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration
cs.LG 2026-05 unverdicted novelty 6.0

GCL uses a two-stage protocol with Routing, Auditing, Public-Factor, and Aggregation Agents to mitigate modality dominance and spurious coupling in multimodal learning, achieving state-of-the-art results on CMU-MOSI, ...
Jet Quenching Identification via Supervised Learning in Simulated Heavy-Ion Collisions
hep-ph 2026-04 unverdicted novelty 6.0

Sequential machine learning on jet declustering history trees outperforms static models at identifying jet quenching in heavy-ion collision simulations.
An explicit operator explains end-to-end computation in the modern neural networks used for sequence and language modeling
cs.NE 2026-04 unverdicted novelty 6.0

S4D state space models correspond exactly to wave propagation and nonlinear wave interactions in a one-dimensional ring oscillator network, with a closed-form operator describing the complete input-output map.
Graph Transformer-Based Pathway Embedding for Cancer Prognosis
cs.LG 2026-04 unverdicted novelty 6.0

PATH gene embeddings in a graph transformer achieve 0.8766 F1 on pancancer metastasis prediction (8.8% over SOTA) and identify disease-state pathway rewiring.
Neural architectures for resolving references in program code
cs.LG 2026-04 unverdicted novelty 6.0

New seq2seq architectures for permutation indexing outperform baselines on synthetic reference-resolution tasks and reduce real decompilation error rates by 42%.
Enhancing event reconstruction for $\gamma$-ray particle detector arrays using transformers
astro-ph.IM 2026-04 unverdicted novelty 6.0

Transformer models applied to simulated water-Cherenkov array data improve gamma-hadron separation and reconstruction of direction, core position, and energy compared to established techniques.
Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space
cs.CL 2026-04 unverdicted novelty 6.0

PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.
Revisiting Feature Prediction for Learning Visual Representations from Video
cs.CV 2024-02 conditional novelty 6.0

V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
Vision Transformers Need Registers
cs.CV 2023-09 unverdicted novelty 6.0

Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
On the Opportunities and Risks of Foundation Models
cs.LG 2021-08 accept novelty 6.0

Foundation models are large adaptable AI systems with emergent capabilities that offer broad opportunities but carry risks from homogenization, opacity, and inherited defects across downstream applications.
Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
cs.LG 2021-04 accept novelty 6.0

Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
cs.CL 2020-06 unverdicted novelty 6.0

GShard supplies automatic sharding and conditional computation support that enabled training a 600-billion-parameter multilingual translation model on thousands of TPUs with superior quality.
Universal Transformers
cs.CL 2018-07 unverdicted novelty 6.0

Universal Transformers combine Transformer parallelism with recurrent updates and dynamic halting to achieve Turing-completeness under assumptions and outperform standard Transformers on algorithmic and language tasks.
Attention U-Net: Learning Where to Look for the Pancreas
cs.CV 2018-04 unverdicted novelty 6.0

Attention gates added to U-Net automatically focus on target organs in CT images and improve segmentation performance on abdominal datasets.
Kaczmarz Linear Attention
cs.LG 2026-05 unverdicted novelty 5.0

Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...
Boosting Automatic Java-to-Cangjie Translation with Multi-Stage LLM Training and Error Repair
cs.SE 2026-05 unverdicted novelty 5.0

Multi-stage LLM training plus compiler-guided error repair boosts functional equivalence in Java-to-Cangjie translation by 6.06% over prior methods despite scarce parallel data.
Adaptive Memory Decay for Log-Linear Attention
cs.LG 2026-05 conditional novelty 5.0

Making memory decay input-dependent via a lightweight MLP improves log-linear attention performance on associative recall, selective copying, and language modeling, especially for long sequences.
Neural Equalisers for Highly Compressed Faster-than-Nyquist Signalling: Design, Performance, Complexity and Robustness
cs.IT 2026-05 unverdicted novelty 5.0

Deep learning receivers enable reliable FTN signaling with up to 75% spectral compression via sliding-window detection while maintaining low latency and robustness to channel variations.
Beyond the Final Label: Exploiting the Untapped Potential of Classification Histories in Astronomical Light Curve Analysis
astro-ph.IM 2026-04 unverdicted novelty 5.0

An RNN-plus-attention model that ingests classification histories outperforms standard final-label classifiers on ELAsTiCC synthetic data and is accompanied by new Wasserstein-based metrics for temporal stability and ...
Topological Dualities for Modal Algebras
math.CT 2026-04 unverdicted novelty 5.0

A family of dualities links modal frames to relational spaces, with simplifications for semicontinuous relations that match modal axioms to relational properties.
An Explainable Approach to Document-level Translation Evaluation with Topic Modeling
cs.CE 2026-04 unverdicted novelty 5.0

A topic-modeling framework measures document-level thematic consistency in translations by aligning key tokens across languages with a bilingual dictionary and scoring via cosine similarity, providing explainable insi...
MambaSL: Exploring Single-Layer Mamba for Time Series Classification
cs.LG 2026-04 unverdicted novelty 5.0

A single-layer Mamba variant with targeted redesigns sets new state-of-the-art average performance on all 30 UEA time series classification datasets under a unified reproducible protocol.
MyoVision: A Mobile Research Tool and NEATBoost-Attention Ensemble Framework for Real Time Chicken Breast Myopathy Detection
cs.LG 2026-04 unverdicted novelty 5.0

Smartphone transillumination imaging paired with a neuroevolution-tuned ensemble model classifies chicken breast myopathies at 82.4% accuracy on 336 fillets, matching costly hyperspectral systems.
Towards Automated Pentesting with Large Language Models
cs.CR 2026-04 unverdicted novelty 5.0

RedShell fine-tunes LLMs on enhanced malicious PowerShell data to produce syntactically valid offensive code for pentesting, reporting over 90% validity, strong semantic match to references, and better edit-distance s...
Attention Is All You Need
cs.CL 2017-06 unverdicted novelty 5.0

Pith review generated a malformed one-line summary.
Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration
cs.LG 2026-05 unverdicted novelty 4.0

Group Cognition Learning uses governed two-stage agents after separate modality encoding to mitigate dominance and spurious coupling, reporting state-of-the-art results on CMU-MOSI, CMU-MOSEI, and MIntRec for regressi...
Text Style Transfer with Machine Translation for Graphic Designs
cs.CL 2026-04 unverdicted novelty 4.0

Custom tag methods with NMT and LLMs for word alignment in text style transfer perform no better than standard attention-based alignment from NMT models.
JSSFF: A Joint Structural-Semantic Fusion Framework for Remote Sensing Image Captioning
cs.CV 2026-04 unverdicted novelty 4.0

JSSFF improves remote sensing image captioning by fusing structural edge details with semantic features in an encoder-decoder model and using fairness-based beam search, outperforming baselines on quantitative and qua...
Sinkhorn doubly stochastic attention rank decay analysis
cs.LG 2026-04 unverdicted novelty 4.0

Sinkhorn-normalized doubly stochastic attention preserves rank more effectively than Softmax row-stochastic attention, with both showing doubly exponential rank decay to one with network depth.
Video-guided Machine Translation with Global Video Context
cs.CV 2026-04 unverdicted novelty 4.0

A globally video-guided multimodal translation framework retrieves semantically related video segments with a vector database and applies attention mechanisms to improve subtitle translation accuracy in long videos.
LLMs Struggle with Abstract Meaning Comprehension More Than Expected
cs.CL 2026-04 unverdicted novelty 3.0

LLMs struggle with abstract meaning comprehension on SemEval-2021 Task 4 more than fine-tuned models, and a new bidirectional attention classifier yields small accuracy gains of 3-4%.
Adaptive Dual Residual U-Net with Attention Gate and Multiscale Spatial Attention Mechanisms (ADRUwAMS)
cs.CV 2026-04 unverdicted novelty 3.0

ADRUwAMS reports Dice scores of 0.9229 (whole tumor), 0.8432 (tumor core), and 0.8004 (enhancing tumor) on BraTS 2020 after training on BraTS 2019/2020 datasets.
Lecture Notes on Statistical Physics and Neural Networks
cond-mat.dis-nn 2026-05 unverdicted novelty 2.0

Lecture notes that treat statistical physics as probability theory and connect Ising models, spin glasses, and renormalization group ideas to Hopfield networks, restricted Boltzmann machines, and large language models.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 49 Pith papers

[1]

Axelrod, A., He, X., and Gao, J. (2011). Domain adaptation via pseudo in-domain data selection. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 355–362. Association for Computational Linguistics.
[2]

Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.
[3]

Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.
[4]

Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A neural probabilistic language model. J. Mach. Learn. Res., 3, 1137–1155.
[5]

Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.
[6]

Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2013). Audio chord recognition with recurrent neural networks. In ISMIR.
[7]

Cho, K., van Merriënboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio, Y. (2014a). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014). To appear.
[8]

Cho, K., van Merriënboer, B., Bahdanau, D., and Bengio, Y. (2014b). On the properties of neural machine translation: Encoder–Decoder approaches. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. To appear.
[9]

Devlin, J., Zbib, R., Huang, Z., Lamar, T., Schwartz, R., and Makhoul, J. (2014). Fast and robust neural network joint models for statistical machine translation. In Association for Computational Linguistics.
[10]

Forcada, M. L. and Ñeco, R. P. (1997). Recursive hetero-associative memories for translation. In J. Mira, R. Moreno-Díaz, and J. Cabestany, editors, Biological and Artificial Computation: From Neuroscience to Technology, volume 1240 of Lecture Notes in Computer Science, pages 453–462. Springer Berlin Heidelberg.
[11]

Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013). Maxout networks. In Proceedings of The 30th International Conference on Machine Learning, pages 1319–1327.
[12]

Graves, A. (2012). Sequence transduction with recurrent neural networks. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012).
[13]

Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv:1308.0850 [cs.NE].
[14]

Graves, A., Jaitly, N., and Mohamed, A.-R. (2013). Hybrid speech recognition with deep bidirectional LSTM. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 273–278.
[15]

Hermann, K. and Blunsom, P. (2014). Multilingual distributed representations without word alignment. In Proceedings of the Second International Conference on Learning Representations (ICLR 2014).
[16]

Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München.
[17]

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
[18]

Kalchbrenner, N. and Blunsom, P. (2013). Recurrent continuous translation models. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1700–1709. Association for Computational Linguistics.
[19]

Koehn, P. (2010). Statistical Machine Translation. Cambridge University Press, New York, NY, USA.
[20]

Koehn, P., Och, F. J., and Marcu, D. (2003). Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, pages 48–54, Stroudsburg, PA, USA. Association for Computational Linguistics.
[21]

Pascanu, R., Mikolov, T., and Bengio, Y. (2013a). On the difficulty of training recurrent neural networks. In ICML'2013.
[22]

Pascanu, R., Mikolov, T., and Bengio, Y. (2013b). On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013).
[23]

Pascanu, R., Gulcehre, C., Cho, K., and Bengio, Y. (2014). How to construct deep recurrent neural networks. In Proceedings of the Second International Conference on Learning Representations (ICLR 2014).
[24]

Pouget-Abadie, J., Bahdanau, D., van Merriënboer, B., Cho, K., and Bengio, Y. (2014). Overcoming the curse of sentence length for neural machine translation using automatic segmentation. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. To appear.
[25]

Schuster, M. and Paliwal, K. K. (1997). Bidirectional recurrent neural networks. Signal Processing, IEEE Transactions on, 45(11), 2673–2681.
[26]

Schwenk, H. (2012). Continuous space translation models for phrase-based statistical machine translation. In M. Kay and C. Boitet, editors, Proceedings of the 24th International Conference on Computational Linguistics (COLING), pages 1071–1080. Indian Institute of Technology Bombay.
[27]

Schwenk, H., Déchelotte, D., and Gauvain, J.-L. (2006). Continuous space language models for statistical machine translation. In Proceedings of the COLING/ACL on Main conference poster sessions, pages 723–730. Association for Computational Linguistics.
[28]

Sutskever, I., Vinyals, O., and Le, Q. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS 2014).
[29]

Zeiler, M. D. (2012). ADADELTA: An adaptive learning rate method. arXiv:1212.5701 [cs.LG].