Massively Multilingual Word Embeddings

Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, Noah A. Smith

classification cs.CL

keywords methods, data, embeddings, evaluation, along, area, better, categorization

Abstract

We introduce new methods for estimating and evaluating embeddings of words in more than fifty languages in a single shared embedding space. Our estimation methods, multiCluster and multiCCA, use dictionaries and monolingual data; they do not require parallel data. Our new evaluation method, multiQVEC-CCA, is shown to correlate better than previous ones with two downstream tasks (text categorization and parsing). We also describe a web portal for evaluation that will facilitate further research in this area, along with open-source releases of all our methods.
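
As a rough illustration of how bilingual dictionaries alone, without parallel text, can tie words from many languages together, the sketch below groups dictionary-linked words into shared clusters with a union-find structure. It is a minimal sketch under stated assumptions, not the authors' released implementation: the function name `cluster_by_dictionary`, the `(language, word)` node format, and the toy dictionary entries are all hypothetical.

```python
from collections import defaultdict


def cluster_by_dictionary(pairs):
    """Group (language, word) nodes that are linked by dictionary entries.

    `pairs` is an iterable of ((lang_a, word_a), (lang_b, word_b)) tuples,
    a hypothetical format chosen for this illustration only.
    """
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b in pairs:
        union(a, b)

    # Collect every node under its root to form the clusters.
    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return list(clusters.values())


# Toy dictionary entries (invented for this example).
toy_pairs = [
    (("en", "dog"), ("fr", "chien")),
    (("en", "dog"), ("de", "Hund")),
    (("en", "cat"), ("fr", "chat")),
]
clusters = cluster_by_dictionary(toy_pairs)
```

Each resulting cluster could then be treated as a single vocabulary item when training embeddings on monolingual text, so that words in different languages share one vector. Whether this matches the paper's exact procedure should be checked against the paper itself.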

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer for Code RL
cs.CL 2026-04 unverdicted novelty 7.0

Parallel-SFT mixes parallel programs across languages during SFT to produce more transferable RL initializations, yielding better zero-shot generalization to unseen programming languages.