pith. machine review for the scientific record.

arxiv: 1607.04606 · v2 · submitted 2016-07-15 · 💻 cs.CL · cs.LG

Recognition: unknown

Enriching Word Vectors with Subword Information

Authors on Pith: no claims yet
classification: 💻 cs.CL · cs.LG
keywords: word representations · words · large · tasks · character · corpora · languages
0 comments
original abstract

Continuous word representations, trained on large unlabeled corpora, are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character $n$-grams. A vector representation is associated with each character $n$-gram; words are represented as the sum of these representations. Our method is fast, allowing models to be trained on large corpora quickly, and it allows us to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, on both word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.
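The subword scheme described in the abstract is straightforward to sketch. The snippet below is a minimal Python illustration, not the authors' implementation: it extracts character n-grams using the paper's word-boundary markers '<' and '>' and its default n-gram lengths of 3 to 6, then builds a word vector as the sum of its n-gram vectors. The helper names and the random stand-in embeddings are hypothetical; in the actual model the n-gram vectors are learned with the skipgram objective.

import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, with the boundary markers '<' and '>'
    used in the paper, plus the special sequence for the whole word."""
    marked = f"<{word}>"
    grams = {marked}  # include the whole word as its own unit
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams

# Hypothetical embedding table: random vectors stand in for the n-gram
# embeddings that would be learned with the skipgram objective.
rng = np.random.default_rng(0)
dim = 100
ngram_vectors = {}

def word_vector(word):
    """Represent a word as the sum of its character n-gram vectors."""
    vecs = []
    for gram in char_ngrams(word):
        if gram not in ngram_vectors:
            ngram_vectors[gram] = rng.normal(scale=0.1, size=dim)
        vecs.append(ngram_vectors[gram])
    return np.sum(vecs, axis=0)

print(sorted(char_ngrams("where")))   # includes '<wh', 'whe', ..., '<where>'
print(word_vector("paris").shape)     # (100,) even for an unseen word

Because every word, seen or unseen, decomposes into n-grams that overlap with the training vocabulary, this construction yields representations for out-of-vocabulary words, which is the property the abstract highlights.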

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Atropos: Improving Cost-Benefit Trade-off of LLM-based Agents under Self-Consistency with Early Termination and Model Hotswap

    cs.SE 2026-04 unverdicted novelty 7.0

    Atropos uses a GCN on inference graphs for early failure prediction and hotswaps to larger LLMs, achieving 74% of large-model performance at 24% of the cost.

  2. Eliciting Latent Predictions from Transformers with the Tuned Lens

    cs.LG 2023-03 accept novelty 7.0

    Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.

  3. Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan

    cs.CL 2026-05 unverdicted novelty 5.0

    An interpretable deep learning framework with a new tokenizer is used to quantify how grammatical gender information is distributed between lemmas and sentential context during the Latin-to-Occitan transition.

  4. Digital Skin, Digital Bias: Uncovering Tone-Based Biases in LLMs and Emoji Embeddings

    cs.SI 2026-04 unverdicted novelty 5.0

    LLMs handle skin tone emoji modifiers better than dedicated embedding models but display systemic disparities in sentiment and semantic consistency across tones.

  5. Skeleton-based Coherence Modeling in Narratives

    cs.CL 2026-04 unverdicted novelty 4.0

    Sentence-level models outperform skeleton-based approaches for narrative coherence despite a new SSN network improving on cosine and Euclidean baselines.