Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures

Annette Rios; Gongbo Tang; Mathias Müller; Rico Sennrich

arxiv: 1808.08946 · v3 · pith:VTMTFKLKnew · submitted 2018-08-27 · 💻 cs.CL


classification 💻 cs.CL

keywords: cnns, networks, rnns, self-attentional, been, ability, agreement, architectures


Recently, non-recurrent architectures (convolutional, self-attentional) have outperformed RNNs in neural machine translation. CNNs and self-attentional networks can connect distant words via shorter network paths than RNNs, and it has been speculated that this improves their ability to model long-range dependencies. However, this theoretical argument has not been tested empirically, nor have alternative explanations for their strong performance been explored in depth. We hypothesize that the strong performance of CNNs and self-attentional networks could also be due to their ability to extract semantic features from the source text, and we evaluate RNNs, CNNs and self-attentional networks on two tasks: subject-verb agreement (where capturing long-range dependencies is required) and word sense disambiguation (where semantic feature extraction is required). Our experimental results show that: 1) self-attentional networks and CNNs do not outperform RNNs in modeling subject-verb agreement over long distances; 2) self-attentional networks perform distinctly better than RNNs and CNNs on word sense disambiguation.
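
The path-length argument in the abstract can be illustrated with a rough sketch. The functions below are hypothetical helpers, not code from the paper, and they rest on simplifying assumptions: a unidirectional RNN needs one step per intervening position, a stack of undilated convolutions with kernel width `k` needs enough layers for a single receptive field to span both words, and a self-attention layer connects any two positions directly.

```python
def rnn_path_length(distance: int) -> int:
    """Steps a unidirectional RNN needs to carry information across `distance` positions."""
    if distance < 1:
        raise ValueError("distance must be at least 1")
    return distance


def cnn_path_length(distance: int, kernel_width: int) -> int:
    """Layers of stacked, undilated convolutions needed so that one
    receptive field covers two words `distance` positions apart.

    Assumption: each layer widens the receptive field by (kernel_width - 1).
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    if kernel_width < 2:
        raise ValueError("kernel_width must be at least 2")
    step = kernel_width - 1
    # Integer ceiling division: smallest L with L * step >= distance.
    return (distance + step - 1) // step


def self_attention_path_length(distance: int) -> int:
    """A single self-attention layer links any two positions directly."""
    if distance < 1:
        raise ValueError("distance must be at least 1")
    return 1


if __name__ == "__main__":
    for d in (2, 10, 20):
        print(d, rnn_path_length(d), cnn_path_length(d, 3), self_attention_path_length(d))
```

Under these assumptions the path grows linearly with distance for the RNN, grows more slowly for the CNN, and stays constant for self-attention. The abstract's finding is that this shorter path does not by itself translate into better subject-verb agreement over long distances.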


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Widening the Representation Bottleneck in Neural Machine Translation with Lexical Shortcuts
cs.CL 2019-06 conditional novelty 6.0

Gated lexical shortcut connections added to the transformer yield 0.9 BLEU average gains on five WMT directions while lowering the lexical content stored in hidden states.