Neural Machine Translation by Jointly Learning to Align and Translate
Pith reviewed 2026-05-11 09:16 UTC · model grok-4.3
The pith
A neural translation model learns to focus on relevant source words while generating each target word.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors conjecture that a fixed-length context vector creates a performance bottleneck in basic encoder-decoder networks for machine translation. They introduce an attention mechanism that computes a distinct context vector for each target word as a weighted sum of the source sentence's hidden states, with the weights learned jointly during training. This allows the model to softly align source and target positions without explicit segmentation. On the WMT English-to-French task the approach reaches translation performance that the authors describe as comparable to a strong phrase-based baseline, while producing alignments that agree with human intuition.
What carries the argument
The attention-based alignment model that produces a context vector for each decoding step as a weighted combination of encoder hidden states, with weights derived from a feed-forward network trained jointly with the translation objective.
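A minimal sketch of one decoding step may make this concrete. It assumes a single-hidden-layer tanh scorer (a common way to realize "a feed-forward network"; the summary above does not fix the exact form), and all names, dimensions, and random parameters below are hypothetical choices for illustration, not the authors' implementation.

```python
import math
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(s_prev, H, W_a, U_a, v_a):
    """Return (weights, context) for one decoding step.

    s_prev: previous decoder state, length n
    H:      list of T encoder hidden states, each of length m
    W_a:    d x n matrix, U_a: d x m matrix, v_a: length-d vector
    (assumed parameter shapes for the feed-forward scorer)
    """
    Ws = matvec(W_a, s_prev)
    scores = []
    for h_j in H:
        Uh = matvec(U_a, h_j)
        hidden = [math.tanh(a + b) for a, b in zip(Ws, Uh)]
        scores.append(sum(v * x for v, x in zip(v_a, hidden)))
    weights = softmax(scores)  # one weight per source position
    # Context vector: weighted combination of encoder hidden states.
    context = [sum(w * h_j[k] for w, h_j in zip(weights, H))
               for k in range(len(H[0]))]
    return weights, context

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Toy sizes: T source positions, decoder size n, encoder size m, scorer size d.
rng = random.Random(0)
T, n, m, d = 5, 4, 6, 3
H = random_matrix(T, m, rng)
s_prev = [rng.uniform(-0.5, 0.5) for _ in range(n)]
W_a = random_matrix(d, n, rng)
U_a = random_matrix(d, m, rng)
v_a = [rng.uniform(-0.5, 0.5) for _ in range(d)]

weights, context = additive_attention(s_prev, H, W_a, U_a, v_a)
print("weights:", [round(w, 3) for w in weights])
print("context:", [round(c, 3) for c in context])
```

Because the weights are positive and sum to one, each component of the context vector is a convex combination of the corresponding encoder-state components. In a trained model the scorer parameters would be learned jointly with the rest of the network rather than drawn at random.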
If this is right
- Translation systems can be trained end-to-end as a single network rather than relying on separate alignment and phrase-table components.
- Performance on longer sentences should improve because relevance can be selected dynamically instead of being compressed into one vector.
- The learned soft alignments provide an interpretable view of which source words influence each target word.
- The same joint alignment-and-generation approach can be applied to other sequence tasks where input relevance varies by output position.
Where Pith is reading between the lines
- The attention weights could serve as a starting point for extracting explicit phrase pairs or for debugging translation errors.
- Models that build on this idea might combine soft attention with hard constraints or multiple attention layers to handle very long documents.
- The removal of the fixed-vector bottleneck suggests similar gains are possible in any encoder-decoder setting where the input sequence is much longer than the output.
Load-bearing premise
That forcing the entire source sentence into a single fixed-length vector prevents the decoder from accessing the right information when generating different target words.
What would settle it
An experiment in which a basic encoder-decoder model without the soft-search mechanism reaches the same BLEU score as the attention model on the identical English-to-French test set.
Original abstract
Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consists of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an attention-based extension to the encoder-decoder architecture for neural machine translation. Rather than compressing the source sentence into a single fixed-length vector, the decoder learns to compute soft alignment weights over source positions at each decoding step, allowing it to focus on relevant source words when predicting each target word. The model is trained end-to-end on parallel data. On the WMT 2014 English-to-French task the attention model reaches 28.45 BLEU, described as comparable to a strong phrase-based baseline (Moses at 33.30 BLEU), and qualitative inspection shows that the learned soft alignments are intuitive.
Significance. If the reported BLEU scores and alignment visualizations hold under scrutiny, the work is significant because it supplies the first large-scale empirical demonstration that a neural translation model can learn to perform soft alignment jointly with translation. This directly addresses the fixed-context-vector limitation that motivated the paper and introduces the attention mechanism that later became standard in sequence modeling. The combination of quantitative results on a competitive benchmark and qualitative evidence of sensible alignments gives the central claim a solid empirical footing.
major comments (2)
- [§4.1 and Table 1] The claim that performance is 'comparable' to the state-of-the-art phrase-based system rests on a single-run BLEU of 28.45 versus 33.30 for Moses. A 4.85-point gap is large enough that the comparability statement would be strengthened by reporting variance across random seeds, an ensemble result, or a direct comparison against the best contemporaneous neural baselines on the same data split.
- [§3.2, Eq. (5)–(7)] The alignment model is a simple feed-forward network whose output is normalized by softmax; the paper does not analyze or mitigate potential gradient vanishing when source sentences exceed the lengths seen in training. Because the motivating conjecture concerns long-sentence performance, this omission is load-bearing for the central architectural claim.
minor comments (3)
- [Abstract] The abstract states the main result without quoting the actual BLEU numbers or naming the test set; adding these two facts would make the abstract self-contained.
- [Figure 3] The alignment heatmaps lack explicit word labels on both axes and a color-bar scale, making it harder for readers to verify the claimed agreement with intuition.
- [§2.2] The description of the basic RNN encoder-decoder could cite the exact prior work (Sutskever et al., 2014) more explicitly when stating the fixed-vector bottleneck.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and indicate where revisions will be made.
-
Referee: [§4.1 and Table 1] The claim that performance is 'comparable' to the state-of-the-art phrase-based system rests on a single-run BLEU of 28.45 versus 33.30 for Moses. A 4.85-point gap is large enough that the comparability statement would be strengthened by reporting variance across random seeds, an ensemble result, or a direct comparison against the best contemporaneous neural baselines on the same data split.
Authors: We acknowledge the 4.85 BLEU gap and agree that 'comparable' may overstate the absolute performance relative to the strong Moses baseline. The intent of the claim was to emphasize that an end-to-end neural model without phrase tables or hand-crafted features could reach a level close enough to be practically relevant on a large-scale task. We did not run multiple random seeds or ensembles because of the high computational cost of training on the full WMT data at the time. We will revise the abstract and §4.1 to describe the result as 'competitive with' or 'approaching' the phrase-based system and will add a brief comparison to other neural encoder-decoder baselines available at submission time. Revision: partial
-
Referee: [§3.2, Eq. (5)–(7)] The alignment model is a simple feed-forward network whose output is normalized by softmax; the paper does not analyze or mitigate potential gradient vanishing when source sentences exceed the lengths seen in training. Because the motivating conjecture concerns long-sentence performance, this omission is load-bearing for the central architectural claim.
Authors: The alignment model indeed uses a simple feed-forward scorer followed by softmax normalization over source positions. While this can in principle dilute gradients for source sentences much longer than those seen during training, our experiments were conducted on the standard WMT splits where sentence lengths are bounded, and the attention model showed clear gains over the fixed-vector baseline. We did not include an explicit gradient analysis because the paper's focus was on the empirical demonstration of jointly learned soft alignments. We will add a short discussion in §3.2 noting the potential limitation for extremely long sequences and pointing to the empirical improvement on longer sentences in the test set. Revision: partial
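The dilution concern in this exchange can be illustrated with a toy calculation. The sketch below is not a gradient analysis and does not come from the paper; it only shows, under the assumption that one relevant source position outscores every other position by a fixed margin, how the softmax weight on that position shrinks as the sentence grows. The function name `peak_weight` and the margin value are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def peak_weight(T, margin):
    """Softmax weight on one position that outscores T-1 others by `margin`."""
    scores = [margin] + [0.0] * (T - 1)
    return softmax(scores)[0]

# With a fixed margin, the weight equals e^margin / (e^margin + T - 1),
# so it falls as the number of source positions T increases.
for T in (10, 50, 200):
    print(T, round(peak_weight(T, 3.0), 4))
```

Whether a trained scorer keeps its margins fixed on longer inputs is an empirical question that this toy calculation does not answer.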
Circularity Check
No significant circularity
The paper defines an attention-augmented encoder-decoder from first principles (bidirectional RNN encoder, decoder with soft alignment probabilities computed via a feedforward network, trained end-to-end by maximizing log-likelihood on parallel sentence pairs). Performance is measured by standard BLEU on held-out WMT test data; no fitted parameter is defined in terms of BLEU, no self-citation supplies a uniqueness theorem or ansatz, and the fixed-length-vector conjecture is offered only as motivation. All load-bearing steps (model equations, training objective, alignment visualization) are self-contained and externally falsifiable.
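The training objective and the per-word conditioning mentioned in this rationale can be written compactly as follows. The notation is chosen here for illustration (x and y for source and target sentences, s_i for the decoder state, c_i for the per-step context vector, g for the output function, θ for all parameters, N for the number of sentence pairs) and may differ in detail from the paper's own equations.

```latex
% Factorization of the translation probability over target words,
% each conditioned on its own context vector c_i
p_\theta(y \mid x) = \prod_{i=1}^{T_y} p_\theta(y_i \mid y_1, \dots, y_{i-1}, x),
\qquad
p_\theta(y_i \mid y_1, \dots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)

% End-to-end training: maximize the average log-likelihood
% over N parallel sentence pairs
\theta^{\ast} = \arg\max_{\theta} \frac{1}{N} \sum_{n=1}^{N}
  \log p_\theta\bigl(y^{(n)} \mid x^{(n)}\bigr)
```

BLEU does not appear anywhere in this objective, which is the point the circularity check relies on: the evaluation metric is computed only after training, on held-out data.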
Axiom & Free-Parameter Ledger
free parameters (1)
- attention model parameters
axioms (1)
- domain assumption: A single neural network can be jointly tuned to maximize translation performance
invented entities (1)
- soft alignment weights: no independent evidence
Forward citations
Cited by 50 Pith papers
-
Characterizing the Expressivity of Local Attention in Transformers
Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by adding a second past operator in linear temporal logic, with global and local attention being expressive...
-
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.
-
Adaptive Computation Time for Recurrent Neural Networks
ACT lets RNNs dynamically adapt computation depth per input via a differentiable halting unit, yielding large gains on synthetic tasks and structural insights on language data.
-
Neural Turing Machines
Neural Turing Machines augment neural networks with differentiable external memory to learn algorithmic tasks such as copying, sorting, and associative recall from examples.
-
GravityGraphSAGE: Link Prediction in Directed Attributed Graphs
GravityGraphSAGE adapts GraphSAGE with a gravity-inspired decoder to outperform prior graph deep learning methods on directed link prediction across citation networks and 16 real-world graphs.
-
PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...
-
Arbitrarily Conditioned Hierarchical Flows for Spatiotemporal Events
ARCH is a hierarchical flow-based generative model that enables tractable conditional intensity computation and arbitrary conditioning for spatiotemporal event distributions.
-
Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models
Transformers converge pathwise to a stochastic particle system and SPDE in the scaling limit, exhibiting synchronization by noise and exponential energy dissipation when common noise is coercive relative to self-atten...
-
Selective Contrastive Learning For Gloss Free Sign Language Translation
A pair selection strategy based on negative similarity dynamics strengthens contrastive supervision in gloss-free sign language translation by reducing noisy negatives.
-
AlphaEvolve: A coding agent for scientific and algorithmic discovery
AlphaEvolve is an LLM-orchestrated evolutionary coding agent that discovered a 4x4 complex matrix multiplication algorithm using 48 scalar multiplications, the first improvement over Strassen's algorithm in 56 years, ...
-
In-context Learning and Induction Heads
Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning i...
-
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
SentencePiece trains subword models directly from raw text to enable language-independent neural text processing.
-
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
MS MARCO is a new large-scale machine reading comprehension dataset built from real Bing search queries, human-generated answers, and web passages, supporting three tasks including answer synthesis and passage ranking.
-
Cubit: Token Mixer with Kernel Ridge Regression
Cubit replaces Transformer attention with Kernel Ridge Regression token mixing and shows potential gains on longer sequences.
-
Shortcut Solutions Learned by Transformers Impair Continual Compositional Reasoning
BERT learns shortcut solutions that impair generalization and forward transfer in continual LEGO, while ALBERT learns loop-like solutions for better performance, yet both fail at cross-experience composition, with ALB...
-
Benchmarking POS Tagging for the Tajik Language: A Comparative Study of Neural Architectures on the TajPersParallel Corpus
mBERT with LoRA achieves the best weighted F1 of 0.62 for Tajik POS tagging on context-free dictionary entries, but macro F1 is only 0.11, with all models failing on rare function words.
-
A Multimodal Dataset for Visually Grounded Ambiguity in Machine Translation
VIDA provides 2,500 visually-dependent ambiguous MT instances and LLM-judge metrics; chain-of-thought SFT improves disambiguation accuracy over standard SFT, especially out-of-distribution.
-
Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration
GCL uses a two-stage protocol with Routing, Auditing, Public-Factor, and Aggregation Agents to mitigate modality dominance and spurious coupling in multimodal learning, achieving state-of-the-art results on CMU-MOSI, ...
-
Jet Quenching Identification via Supervised Learning in Simulated Heavy-Ion Collisions
Sequential machine learning on jet declustering history trees outperforms static models at identifying jet quenching in heavy-ion collision simulations.
-
An explicit operator explains end-to-end computation in the modern neural networks used for sequence and language modeling
S4D state space models correspond exactly to wave propagation and nonlinear wave interactions in a one-dimensional ring oscillator network, with a closed-form operator describing the complete input-output map.
-
Graph Transformer-Based Pathway Embedding for Cancer Prognosis
PATH gene embeddings in a graph transformer achieve 0.8766 F1 on pancancer metastasis prediction (8.8% over SOTA) and identify disease-state pathway rewiring.
-
Neural architectures for resolving references in program code
New seq2seq architectures for permutation indexing outperform baselines on synthetic reference-resolution tasks and reduce real decompilation error rates by 42%.
-
Enhancing event reconstruction for $\gamma$-ray particle detector arrays using transformers
Transformer models applied to simulated water-Cherenkov array data improve gamma-hadron separation and reconstruction of direction, core position, and energy compared to established techniques.
-
Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space
PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.
-
Revisiting Feature Prediction for Learning Visual Representations from Video
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
-
Vision Transformers Need Registers
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
-
On the Opportunities and Risks of Foundation Models
Foundation models are large adaptable AI systems with emergent capabilities that offer broad opportunities but carry risks from homogenization, opacity, and inherited defects across downstream applications.
-
Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.
-
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
GShard supplies automatic sharding and conditional computation support that enabled training a 600-billion-parameter multilingual translation model on thousands of TPUs with superior quality.
-
Universal Transformers
Universal Transformers combine Transformer parallelism with recurrent updates and dynamic halting to achieve Turing-completeness under assumptions and outperform standard Transformers on algorithmic and language tasks.
-
Attention U-Net: Learning Where to Look for the Pancreas
Attention gates added to U-Net automatically focus on target organs in CT images and improve segmentation performance on abdominal datasets.
-
Kaczmarz Linear Attention
Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...
-
Boosting Automatic Java-to-Cangjie Translation with Multi-Stage LLM Training and Error Repair
Multi-stage LLM training plus compiler-guided error repair boosts functional equivalence in Java-to-Cangjie translation by 6.06% over prior methods despite scarce parallel data.
-
Adaptive Memory Decay for Log-Linear Attention
Making memory decay input-dependent via a lightweight MLP improves log-linear attention performance on associative recall, selective copying, and language modeling, especially for long sequences.
-
Neural Equalisers for Highly Compressed Faster-than-Nyquist Signalling: Design, Performance, Complexity and Robustness
Deep learning receivers enable reliable FTN signaling with up to 75% spectral compression via sliding-window detection while maintaining low latency and robustness to channel variations.
-
Beyond the Final Label: Exploiting the Untapped Potential of Classification Histories in Astronomical Light Curve Analysis
An RNN-plus-attention model that ingests classification histories outperforms standard final-label classifiers on ELAsTiCC synthetic data and is accompanied by new Wasserstein-based metrics for temporal stability and ...
-
Topological Dualities for Modal Algebras
A family of dualities links modal frames to relational spaces, with simplifications for semicontinuous relations that match modal axioms to relational properties.
-
An Explainable Approach to Document-level Translation Evaluation with Topic Modeling
A topic-modeling framework measures document-level thematic consistency in translations by aligning key tokens across languages with a bilingual dictionary and scoring via cosine similarity, providing explainable insi...
-
MambaSL: Exploring Single-Layer Mamba for Time Series Classification
A single-layer Mamba variant with targeted redesigns sets new state-of-the-art average performance on all 30 UEA time series classification datasets under a unified reproducible protocol.
-
MyoVision: A Mobile Research Tool and NEATBoost-Attention Ensemble Framework for Real Time Chicken Breast Myopathy Detection
Smartphone transillumination imaging paired with a neuroevolution-tuned ensemble model classifies chicken breast myopathies at 82.4% accuracy on 336 fillets, matching costly hyperspectral systems.
-
Towards Automated Pentesting with Large Language Models
RedShell fine-tunes LLMs on enhanced malicious PowerShell data to produce syntactically valid offensive code for pentesting, reporting over 90% validity, strong semantic match to references, and better edit-distance s...
-
Attention Is All You Need
-
Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration
Group Cognition Learning uses governed two-stage agents after separate modality encoding to mitigate dominance and spurious coupling, reporting state-of-the-art results on CMU-MOSI, CMU-MOSEI, and MIntRec for regressi...
-
Text Style Transfer with Machine Translation for Graphic Designs
Custom tag methods with NMT and LLMs for word alignment in text style transfer perform no better than standard attention-based alignment from NMT models.
-
JSSFF: A Joint Structural-Semantic Fusion Framework for Remote Sensing Image Captioning
JSSFF improves remote sensing image captioning by fusing structural edge details with semantic features in an encoder-decoder model and using fairness-based beam search, outperforming baselines on quantitative and qua...
-
Sinkhorn doubly stochastic attention rank decay analysis
Sinkhorn-normalized doubly stochastic attention preserves rank more effectively than Softmax row-stochastic attention, with both showing doubly exponential rank decay to one with network depth.
-
Video-guided Machine Translation with Global Video Context
A globally video-guided multimodal translation framework retrieves semantically related video segments with a vector database and applies attention mechanisms to improve subtitle translation accuracy in long videos.
-
LLMs Struggle with Abstract Meaning Comprehension More Than Expected
LLMs struggle with abstract meaning comprehension on SemEval-2021 Task 4 more than fine-tuned models, and a new bidirectional attention classifier yields small accuracy gains of 3-4%.
-
Adaptive Dual Residual U-Net with Attention Gate and Multiscale Spatial Attention Mechanisms (ADRUwAMS)
ADRUwAMS reports Dice scores of 0.9229 (whole tumor), 0.8432 (tumor core), and 0.8004 (enhancing tumor) on BraTS 2020 after training on BraTS 2019/2020 datasets.
-
Lecture Notes on Statistical Physics and Neural Networks
Lecture notes that treat statistical physics as probability theory and connect Ising models, spin glasses, and renormalization group ideas to Hopfield networks, restricted Boltzmann machines, and large language models.
Reference graph
Works this paper leans on
-
[1]
Axelrod, A., He, X., and Gao, J. (2011). Domain adaptation via pseudo in-domain data selection. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 355–362. Association for Computational Linguistics.
-
[2]
Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.
-
[3]
Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.
-
[4]
Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A neural probabilistic language model. J. Mach. Learn. Res., 3, 1137–1155.
-
[5]
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.
-
[6]
Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2013). Audio chord recognition with recurrent neural networks. In ISMIR.
-
[7]
Cho, K., van Merriënboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio, Y. (2014a). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014). To appear.
-
[8]
Cho, K., van Merriënboer, B., Bahdanau, D., and Bengio, Y. (2014b). On the properties of neural machine translation: Encoder–Decoder approaches. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. To appear.
-
[9]
Devlin, J., Zbib, R., Huang, Z., Lamar, T., Schwartz, R., and Makhoul, J. (2014). Fast and robust neural network joint models for statistical machine translation. In Association for Computational Linguistics.
-
[10]
Forcada, M. L. and Ñeco, R. P. (1997). Recursive hetero-associative memories for translation. In J. Mira, R. Moreno-Díaz, and J. Cabestany, editors, Biological and Artificial Computation: From Neuroscience to Technology, volume 1240 of Lecture Notes in Computer Science, pages 453–462. Springer Berlin Heidelberg.
-
[11]
Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013). Maxout networks. In Proceedings of The 30th International Conference on Machine Learning, pages 1319–1327.
-
[12]
Graves, A. (2012). Sequence transduction with recurrent neural networks. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012).
- [13]
-
[14]
Graves, A., Jaitly, N., and Mohamed, A.-R. (2013). Hybrid speech recognition with deep bidirectional LSTM. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 273–278.
-
[15]
Hermann, K. and Blunsom, P. (2014). Multilingual distributed representations without word alignment. In Proceedings of the Second International Conference on Learning Representations (ICLR 2014).
-
[16]
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München.
-
[17]
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
-
[18]
Kalchbrenner, N. and Blunsom, P. (2013). Recurrent continuous translation models. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1700–1709. Association for Computational Linguistics.
-
[19]
Koehn, P. (2010). Statistical Machine Translation. Cambridge University Press, New York, NY, USA.
-
[20]
Koehn, P., Och, F. J., and Marcu, D. (2003). Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, pages 48–54, Stroudsburg, PA, USA. Association for Computational Linguistics.
-
[21]
Pascanu, R., Mikolov, T., and Bengio, Y. (2013a). On the difficulty of training recurrent neural networks. In ICML'2013.
-
[22]
Pascanu, R., Mikolov, T., and Bengio, Y. (2013b). On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013).
-
[23]
Pascanu, R., Gulcehre, C., Cho, K., and Bengio, Y. (2014). How to construct deep recurrent neural networks. In Proceedings of the Second International Conference on Learning Representations (ICLR 2014).
-
[24]
Pouget-Abadie, J., Bahdanau, D., van Merriënboer, B., Cho, K., and Bengio, Y. (2014). Overcoming the curse of sentence length for neural machine translation using automatic segmentation. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. To appear.
-
[25]
Schuster, M. and Paliwal, K. K. (1997). Bidirectional recurrent neural networks. Signal Processing, IEEE Transactions on, 45(11), 2673–2681.
-
[26]
Schwenk, H. (2012). Continuous space translation models for phrase-based statistical machine translation. In M. Kay and C. Boitet, editors, Proceedings of the 24th International Conference on Computational Linguistics (COLING), pages 1071–1080. Indian Institute of Technology Bombay.
-
[27]
Schwenk, H., Déchelotte, D., and Gauvain, J.-L. (2006). Continuous space language models for statistical machine translation. In Proceedings of the COLING/ACL on Main conference poster sessions, pages 723–730. Association for Computational Linguistics.
-
[28]
Sutskever, I., Vinyals, O., and Le, Q. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS 2014).
- [29]