Enriching Word Vectors with Subword Information

Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov

classification cs.CL cs.LG

keywords word, representations, words, large, tasks, character, corpora, languages

Continuous word representations trained on large unlabeled corpora are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character $n$-grams. A vector representation is associated with each character $n$-gram, and each word is represented as the sum of these representations. Our method is fast, allowing models to be trained on large corpora quickly, and it allows us to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, on both word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.
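
The bag-of-character-$n$-grams idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the details, so the boundary markers `<` and `>`, the default range of n from 3 to 6, and the helper names `char_ngrams` and `word_vector` are all assumptions made for illustration; the toy table of vectors stands in for parameters that a real model would learn during skipgram training.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word`.

    Assumption: the word is padded with '<' and '>' so that prefixes
    and suffixes are distinguishable from word-internal n-grams.
    """
    padded = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams


def word_vector(word, table, dim):
    """Sum the vectors of the word's n-grams that are present in `table`.

    `table` is a hypothetical dict mapping n-gram -> list of floats.
    N-grams missing from the table are skipped.
    """
    total = [0.0] * dim
    for gram in char_ngrams(word):
        vec = table.get(gram)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total


# Toy table with two hand-written n-gram vectors (not learned values).
toy_table = {"<wh": [1.0, 0.0], "whe": [0.0, 2.0]}

print(char_ngrams("where", 3, 3))
print(word_vector("where", toy_table, dim=2))
# "when" is absent from the table as a word, but it shares n-grams with "where".
print(word_vector("when", toy_table, dim=2))
```

Because a word vector here is built only from n-gram vectors, a word that never appeared in training still receives a representation as long as some of its n-grams did, which is the out-of-vocabulary property the abstract describes.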

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Atropos: Improving Cost-Benefit Trade-off of LLM-based Agents under Self-Consistency with Early Termination and Model Hotswap
cs.SE 2026-04 unverdicted novelty 7.0
Atropos uses GCN on inference graphs for early failure prediction and hotswaps to larger LLMs, achieving 74% of large-model performance at 24% cost.

Eliciting Latent Predictions from Transformers with the Tuned Lens
cs.LG 2023-03 accept novelty 7.0
Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.

Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan
cs.CL 2026-05 unverdicted novelty 5.0
An interpretable deep learning framework with a new tokenizer is used to quantify how grammatical gender information is distributed between lemmas and sentential context during the Latin-to-Occitan transition.

Digital Skin, Digital Bias: Uncovering Tone-Based Biases in LLMs and Emoji Embeddings
cs.SI 2026-04 unverdicted novelty 5.0
LLMs handle skin tone emoji modifiers better than dedicated embedding models but display systemic disparities in sentiment and semantic consistency across tones.

Skeleton-based Coherence Modeling in Narratives
cs.CL 2026-04 unverdicted novelty 4.0
Sentence-level models outperform skeleton-based approaches for narrative coherence despite a new SSN network improving on cosine and Euclidean baselines.