arxiv: 1508.07909 · v5 · submitted 2015-08-31 · 💻 cs.CL

Neural Machine Translation of Rare Words with Subword Units

Alexandra Birch, Barry Haddow, Rico Sennrich

Pith reviewed 2026-05-12 15:10 UTC · model grok-4.3

keywords neural machine translation · subword units · byte pair encoding · rare words · open vocabulary · WMT 2015 · English-German translation · English-Russian translation

The pith

Encoding rare words as subword sequences enables open-vocabulary neural machine translation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that neural machine translation models can handle rare and unknown words without a back-off dictionary by breaking those words into smaller subword units. It compares segmentation techniques, including simple character n-gram models and byte pair encoding, which splits words into frequent chunks that the model can learn to translate individually. The intuition is that names can be copied or transliterated character by character, compounds can be translated part by part, and cognates and loanwords can be handled through transformations at the subword level. Experiments on the WMT 2015 English-German and English-Russian translation tasks show that the subword approach improves translation quality by 1.1 and 1.3 BLEU points, respectively, over a baseline that backs off to a dictionary for out-of-vocabulary words. A reader would care because it makes neural translation systems more practical for languages with large, open vocabularies.

Core claim

The central claim is that by encoding rare and unknown words as sequences of subword units, the NMT model becomes capable of open-vocabulary translation. Various word segmentation techniques are discussed, including character n-gram models and byte pair encoding. Empirically, subword models outperform a back-off dictionary baseline on WMT 15 English-German and English-Russian tasks by 1.1 and 1.3 BLEU points respectively.
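As an illustration of the simpler alternative named above, here is a minimal Python sketch of character n-gram segmentation. The function name `char_ngram_segment`, the `shortlist` parameter, and the `@@` continuation marker are assumptions made for this sketch, not details taken from the paper.

```python
def char_ngram_segment(word, n=2, shortlist=frozenset(), marker="@@"):
    """Split a word into consecutive character n-grams.

    Words on the (hypothetical) shortlist, and words no longer than n,
    are kept whole. Every non-final piece gets a continuation marker so
    the original word can be restored after translation.
    """
    if word in shortlist or len(word) <= n:
        return [word]
    pieces = [word[i:i + n] for i in range(0, len(word), n)]
    return [p + marker for p in pieces[:-1]] + [pieces[-1]]


pieces = char_ngram_segment("translation", n=2)
print(pieces)  # ['tr@@', 'an@@', 'sl@@', 'at@@', 'io@@', 'n']

# Joining the pieces and dropping the marker restores the word.
restored = " ".join(pieces).replace("@@ ", "")
print(restored)  # translation
```

Frequent words placed on the shortlist stay intact, so only the rare words pay the cost of a longer sequence.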

What carries the argument

Byte pair encoding (BPE) for word segmentation, which builds a vocabulary of subword units by repeatedly merging the most frequent pair of adjacent symbols, starting from single characters, so that translations of rare words can be composed from known units.
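The merge procedure can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `learn_bpe` and `merge_pair`, the `</w>` end-of-word marker, the tie-breaking rule, and the toy word counts are all chosen here for illustration.

```python
from collections import Counter

END = "</w>"  # end-of-word marker (an assumption of this sketch)


def merge_pair(symbols, pair):
    """Replace each left-to-right occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a word -> count dict."""
    # Start from characters, with an end-of-word marker on every word.
    vocab = {tuple(word) + (END,): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Highest count wins; ties go to the lexicographically smallest pair
        # (a choice made here only to keep the sketch deterministic).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            new_symbols = merge_pair(symbols, best)
            new_vocab[new_symbols] = new_vocab.get(new_symbols, 0) + freq
        vocab = new_vocab
    return merges


toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}  # made-up counts
toy_merges = learn_bpe(toy_counts, 5)
print(toy_merges)
# [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w')]
```

Each merge adds one new symbol, so the number of merges is the knob that sets the size of the subword vocabulary.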

If this is right

Names and unknown words can be translated via character copying or transliteration using subword units.
Compound words can be translated compositionally by breaking them into their subword components.
Cognates and loanwords can be handled through subword-level phonological and morphological transformations.
The neural model learns to translate an open vocabulary without needing external dictionary lookups for rare words.
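To make these consequences concrete, the sketch below replays a list of learned merges on words that were not seen when the merges were learned. The function name `segment` and the hard-coded merge list are hypothetical and stand in for the output of a real BPE training run.

```python
def segment(word, merges, end="</w>"):
    """Segment a word by replaying merge operations in the order they were learned."""
    symbols = list(word) + [end]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


# Hypothetical merge list, for illustration only.
merges = [("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w")]

print(segment("lowest", merges))  # ['low', 'est</w>']
print(segment("xyz", merges))     # ['x', 'y', 'z', '</w>']
```

A word built from known parts comes out as a short sequence of familiar units, while a string that matches no merges falls back to single characters, which is what lets a model copy or transliterate a name it has never seen.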

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method could simplify NMT systems by eliminating the need for separate handling of out-of-vocabulary words.
Subword segmentation might improve generalization in other natural language processing tasks involving rare terms.
Further research could test the method on additional language pairs or with different neural architectures.

Load-bearing premise

That breaking words into subword units via byte pair encoding or character n-grams keeps sufficient linguistic information intact for the model to produce accurate translations of rare words.

What would settle it

Observing no BLEU improvement or a drop when using subword units compared to the dictionary back-off baseline on the WMT English-German or English-Russian tasks would indicate the approach does not work as claimed.

Original abstract

Neural machine translation (NMT) models typically operate with a fixed vocabulary, but translation is an open-vocabulary problem. Previous work addresses the translation of out-of-vocabulary words by backing off to a dictionary. In this paper, we introduce a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units. This is based on the intuition that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations). We discuss the suitability of different word segmentation techniques, including simple character n-gram models and a segmentation based on the byte pair encoding compression algorithm, and empirically show that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English-German and English-Russian by 1.1 and 1.3 BLEU, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BPE subword units give NMT a practical open-vocab fix that beats dictionary back-off by 1+ BLEU on WMT15 En-De and En-Ru.

The main thing to know is that this paper shows byte pair encoding can split words into subword units so an NMT model handles rare words end-to-end, beating the back-off dictionary baseline by 1.1 BLEU on English-German and 1.3 on English-Russian for the WMT 2015 tasks. They build a fixed subword vocabulary by merging frequent character pairs from the data, then train the model to translate sequences of these units. This covers names via copying, compounds via parts, and similar words via learned patterns, all without external resources. They also test character n-grams as a simpler alternative and keep the NMT architecture and training data the same for direct comparison. The results are on standard held-out test sets, so the gains trace to the segmentation choice. The work is clean and the empirical side is straightforward. A minor soft spot is that the BLEU lifts are modest and the paper gives limited breakdown by word type or statistical tests, plus the exact BPE merge count feels a bit under-explored. That does not break the central result. This is aimed at anyone building neural translation systems that hit vocabulary walls. Readers working on practical improvements or multilingual setups will get direct value from the method and the numbers. It has enough grounding and novelty in application to deserve a serious referee.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes representing rare and unknown words in neural machine translation as sequences of subword units (via byte-pair encoding or character n-grams) to enable open-vocabulary translation. It evaluates the approach on the WMT 2015 English-German and English-Russian tasks, reporting BLEU gains of 1.1 and 1.3 points over a dictionary back-off baseline, attributing the improvements to better handling of names, compounds, cognates, and loanwords through subword composition.

Significance. If the results hold, the work provides a simple, resource-light method for addressing the open-vocabulary problem in NMT that avoids external dictionaries. The empirical improvements on standard benchmarks demonstrate that a neural model can learn to translate over subword segments produced in a preprocessing step, offering a practical alternative to back-off strategies and influencing later tokenization practices in the field.

major comments (2)

[§4.1] §4.1, BPE description: the number of merge operations and resulting vocabulary size are not specified for the reported experiments; without these hyperparameters the 1.1–1.3 BLEU gains cannot be reproduced or compared directly to the baseline.
[Table 2] Table 2 (En-De results): the subword models show gains, but no statistical significance tests or variance estimates across multiple runs are provided, weakening the claim that the improvements are robust rather than due to random variation.
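One common form of the significance test requested in the second comment is paired bootstrap resampling. The sketch below is an illustration only: the function name `paired_bootstrap` and its parameters are assumptions, and it compares generic per-sentence scores, whereas BLEU is a corpus-level metric, so a real test would resample sentences and recompute BLEU on each resample.

```python
import random


def paired_bootstrap(scores_a, scores_b, num_samples=1000, seed=0):
    """Return the fraction of resamples in which system B outscores system A.

    `scores_a` and `scores_b` hold per-sentence scores for the same test
    sentences, in the same order. Sentences are resampled with replacement.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(num_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_b[i] for i in idx) > sum(scores_a[i] for i in idx):
            wins += 1
    return wins / num_samples


# Made-up scores in which B beats A on every sentence.
baseline = [0.10, 0.20, 0.30]
subword = [0.20, 0.30, 0.40]
print(paired_bootstrap(baseline, subword))  # 1.0
```

A win fraction close to 1.0 suggests the gain is unlikely to come from the particular choice of test sentences; it says nothing about variance across training runs, which would need separately trained models.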

minor comments (2)

[§3.2] §3.2: an explicit example of word segmentation (e.g., how 'unbelievable' or a German compound is split) would clarify the subword representation for readers.
[Abstract] Abstract and §5: the exact test-set conditions and baseline implementation details (e.g., dictionary coverage) are only summarized; a short additional sentence would make the comparison fully transparent.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive review and constructive comments. We address each major point below.

Referee: [§4.1] §4.1, BPE description: the number of merge operations and resulting vocabulary size are not specified for the reported experiments; without these hyperparameters the 1.1–1.3 BLEU gains cannot be reproduced or compared directly to the baseline.

Authors: We agree that these details are necessary for reproducibility. The reported experiments used 59,500 merge operations, yielding a subword vocabulary of 60,000 units (the same setting was applied to both language pairs). We will add this information to §4.1 in the revised manuscript.
revision: yes

Referee: [Table 2] Table 2 (En-De results): the subword models show gains, but no statistical significance tests or variance estimates across multiple runs are provided, weakening the claim that the improvements are robust rather than due to random variation.

Authors: We acknowledge that multiple runs with significance testing would provide stronger evidence of robustness. Training multiple independent NMT models was computationally prohibitive at the time. The 1.1–1.3 BLEU gains are consistent across two language pairs, and the paper includes a qualitative analysis showing systematic improvements on names, compounds, cognates, and loanwords. We will add a brief discussion of these limitations and supporting evidence in the revised version.
revision: partial

Circularity Check

0 steps flagged

No significant circularity identified

The paper presents an algorithmic method for open-vocabulary NMT via subword segmentation (BPE or character n-grams) and reports empirical BLEU gains on held-out WMT15 test sets against a back-off dictionary baseline. No derivations, predictions, or first-principles results reduce to inputs by construction; the central claims are direct experimental comparisons on fixed architectures and data splits. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the argument chain. The approach is self-contained as an engineering solution evaluated externally.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that subword units can compositionally represent translations of rare words, with no free parameters beyond standard model training and no invented entities.

axioms (1)

domain assumption: Various word classes, including names, compounds, cognates, and loanwords, are translatable via smaller units than whole words.
Invoked in the abstract as the core intuition motivating subword encoding.

Forward citations

Cited by 27 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models
cs.LG 2026-05 unverdicted novelty 7.0

Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.
How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them
cs.CL 2026-04 unverdicted novelty 7.0

Subword tokenization impairs phonological knowledge encoding in LMs, but an IPA-based fine-tuning method restores it with minimal impact on other capabilities.
MIXAR: Scaling Autoregressive Pixel-based Language Models to Multiple Languages and Scripts
cs.CL 2026-04 unverdicted novelty 7.0

MIXAR is the first autoregressive pixel-based language model for eight languages and scripts, with empirical gains on multilingual tasks, robustness to unseen languages, and further improvements when scaled to 0.5B pa...
Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings
q-bio.QM 2026-04 unverdicted novelty 7.0

Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...
Chronos: Learning the Language of Time Series
cs.LG 2024-03 conditional novelty 7.0

Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
cs.LG 2022-08 conditional novelty 7.0

LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
OPT: Open Pre-trained Transformer Language Models
cs.CL 2022-05 unverdicted novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
cs.LG 2019-10 unverdicted novelty 7.0

T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...
Fine-Tuning Language Models from Human Preferences
cs.CL 2019-09 unverdicted novelty 7.0

Language models fine-tuned via RL on 5k-60k human preference comparisons produce stylistically better text continuations and human-preferred summaries that sometimes copy input sentences.
Deep Learning Scaling is Predictable, Empirically
cs.LG 2017-12 unverdicted novelty 7.0

Deep learning generalization error follows power-law scaling with training set size across multiple domains, with model size scaling sublinearly with data size.
BabelDOC: Better Layout-Preserving PDF Translation via Intermediate Representation
cs.CV 2026-05 unverdicted novelty 6.0

BabelDOC uses an intermediate representation to decouple layout from content for improved layout-preserving PDF translation.
Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs
cs.CL 2026-04 unverdicted novelty 6.0

Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while pr...
VulStyle: A Multi-Modal Pre-Training for Code Stylometry-Augmented Vulnerability Detection
cs.CR 2026-04 unverdicted novelty 6.0

VulStyle pre-trains on 4.9M functions using code, non-terminal ASTs, and stylometry features, then fine-tunes to achieve SOTA F1 gains of 4-48% on BigVul and VulDeePecker.
Informativity of Data-Knowledge Pairs for Lyapunov Equations
eess.SY 2026-04 unverdicted novelty 6.0

The paper derives an algebraic condition for when data-knowledge pairs are jointly informative enough to uniquely solve Lyapunov equations.
FAST: Efficient Action Tokenization for Vision-Language-Action Models
cs.RO 2025-01 unverdicted novelty 6.0

FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diff...
Language Models (Mostly) Know What They Know
cs.CL 2022-07 unverdicted novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
cs.CV 2022-06 unverdicted novelty 6.0

Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.
LaMDA: Language Models for Dialog Applications
cs.CL 2022-01 unverdicted novelty 6.0

LaMDA shows that fine-tuning on human-value annotations and consulting external knowledge sources significantly improves safety and factual grounding in large dialog models beyond what scaling alone achieves.
A General Language Assistant as a Laboratory for Alignment
cs.CL 2021-12 conditional novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan
cs.CL 2026-05 unverdicted novelty 5.0

An interpretable deep learning framework with a new tokenizer is used to quantify how grammatical gender information is distributed between lemmas and sentential context during the Latin-to-Occitan transition.
In Search of Lost DNA Sequence Pretraining
cs.LG 2026-04 unverdicted novelty 5.0

DNA pretraining suffers from inappropriate evaluation datasets, flawed neighbor-masking, and neglected vocabulary design; the authors supply guidelines and a reproducible testbed to fix them.
Digital Skin, Digital Bias: Uncovering Tone-Based Biases in LLMs and Emoji Embeddings
cs.SI 2026-04 unverdicted novelty 5.0

LLMs handle skin tone emoji modifiers better than dedicated embedding models but display systemic disparities in sentiment and semantic consistency across tones.
Chinese Language Is Not More Efficient Than English in Vibe Coding: A Preliminary Study on Token Cost and Problem-Solving Rate
cs.CL 2026-04 conditional novelty 5.0

Empirical tests on software engineering tasks show Chinese prompts do not reduce token usage or improve success rates over English for LLM coding.
WorldVLA: Towards Autoregressive Action World Model
cs.RO 2025-06 unverdicted novelty 5.0

WorldVLA unifies VLA and world models in one autoregressive system, shows they boost each other, and adds an attention mask to stop error buildup when generating action chunks.
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
cs.SE 2024-01 unverdicted novelty 5.0

DeepSeek-Coder open-source models trained on 2T code tokens with fill-in-the-blank pretraining achieve SOTA results among open models and surpass closed-source Codex and GPT-3.5 on code benchmarks.
Galactica: A Large Language Model for Science
cs.CL 2022-11 unverdicted novelty 5.0

Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
Attention Is All You Need
cs.CL 2017-06 unverdicted novelty 5.0

Pith review generated a malformed one-line summary.
