What do Neural Machine Translation Models Learn about Morphology?

Fahim Dalvi; Hassan Sajjad; James Glass; Nadir Durrani; Yonatan Belinkov

arxiv: 1704.03471 · v3 · pith:FAGLIR5Lnew · submitted 2017-04-11 · 💻 cs.CL

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, James Glass

classification 💻 cs.CL

keywords models · neural · representations · learn · machine · morphology · target · translation

Neural machine translation (MT) models obtain state-of-the-art performance while maintaining a simple, end-to-end architecture. However, little is known about what these models learn about source and target languages during the training process. In this work, we analyze the representations learned by neural MT models at various levels of granularity and empirically evaluate the quality of the representations for learning morphology through extrinsic part-of-speech and morphological tagging tasks. We conduct a thorough investigation along several parameters: word-based vs. character-based representations, depth of the encoding layer, the identity of the target language, and encoder vs. decoder representations. Our data-driven, quantitative evaluation sheds light on important aspects in the neural MT system and its ability to capture word structure.
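The extrinsic evaluation described in the abstract can be pictured as a probing setup: representations from a trained model are held fixed, and a separate classifier is trained to predict a tag for each word from them. The sketch below is only an illustration under stated assumptions. The arrays are synthetic stand-ins for encoder states (no translation model is involved), the tag set is a hypothetical set of five classes, and the linear softmax probe is an assumption, since this page does not say which classifier the authors used.

```python
import numpy as np

def train_probe(features, labels, num_classes, epochs=300, lr=0.01):
    """Fit a linear softmax probe on fixed features with batch gradient descent."""
    n, d = features.shape
    weights = np.zeros((d, num_classes))
    bias = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ weights + bias
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - one_hot) / n  # gradient of mean cross-entropy w.r.t. logits
        weights -= lr * (features.T @ grad)
        bias -= lr * grad.sum(axis=0)
    return weights, bias

def probe_accuracy(weights, bias, features, labels):
    """Fraction of items whose highest-scoring class matches the gold tag."""
    preds = (features @ weights + bias).argmax(axis=1)
    return float((preds == labels).mean())

# Synthetic stand-in for frozen word representations (assumption, not real data):
# each hypothetical tag has a random center, and each "word" is center + noise.
rng = np.random.default_rng(0)
num_tags, dim = 5, 16
centers = rng.normal(scale=3.0, size=(num_tags, dim))

def sample(n):
    tags = rng.integers(0, num_tags, size=n)
    states = centers[tags] + rng.normal(size=(n, dim))
    return states, tags

train_x, train_y = sample(500)
test_x, test_y = sample(200)

weights, bias = train_probe(train_x, train_y, num_tags)
accuracy = probe_accuracy(weights, bias, test_x, test_y)
print(f"probe accuracy on held-out items: {accuracy:.3f}")
```

In this kind of setup, held-out probe accuracy is read as a proxy for how much tag information the fixed representations carry, so the same probe can be retrained on features from different layers or models and the accuracies compared.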

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Toward Calibrated, Fair, and accurate Deepfake Detection
cs.LG 2026-06 unverdicted novelty 7.0

Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.

Behavioral and Representational Evidence of Binomial Ordering Preferences in Large Language Models
cs.CL 2026-06 unverdicted novelty 6.0

LLMs recover dominant binomial orders from corpora but align less closely with exact preference distributions, with preference strength partially encoded in middle-to-late layers and manipulable via steering.

Widening the Representation Bottleneck in Neural Machine Translation with Lexical Shortcuts
cs.CL 2019-06 conditional novelty 6.0

Gated lexical shortcut connections added to the transformer yield 0.9 BLEU average gains on five WMT directions while lowering the lexical content stored in hidden states.