pith. machine review for the scientific record.

arxiv: 1310.4546 · v1 · submitted 2013-10-16 · 💻 cs.CL · cs.LG · stat.ML

Recognition: unknown

Distributed Representations of Words and Phrases and their Compositionality

Authors on Pith: no claims yet
classification: 💻 cs.CL · cs.LG · stat.ML
keywords: representations, phrases, word, canada, distributed, example, learning, method
Original abstract

The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
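The abstract names three concrete mechanisms: subsampling of frequent words, negative sampling as a cheaper alternative to the hierarchical softmax, and a bigram scoring rule for finding phrases such as "Air Canada". The sketch below illustrates all three on a toy corpus. It is a minimal illustration, not the paper's released code: the corpus, the thresholds `t` and `delta`, the learning rate, and the uniform negative-sampling distribution are assumptions chosen for readability (the paper samples negatives from the unigram distribution raised to the 3/4 power).

```python
import math
import random
from collections import Counter

random.seed(0)

# Toy corpus; real training uses billions of tokens.
corpus = ("the quick brown fox jumps over the lazy dog "
          "air canada flies the quick brown route and "
          "air canada serves the lazy dog").split()

counts = Counter(corpus)
freq = {w: c / len(corpus) for w, c in counts.items()}

# --- Subsampling of frequent words ----------------------------------------
# Discard each occurrence of w with probability 1 - sqrt(t / f(w)).
# The paper suggests t around 1e-5; a much larger t fits this tiny corpus.
t = 0.05

def keep(word):
    p_discard = max(0.0, 1.0 - math.sqrt(t / freq[word]))
    return random.random() > p_discard

subsampled = [w for w in corpus if keep(w)]

# --- Phrase detection ------------------------------------------------------
# score(wi, wj) = (count(wi wj) - delta) / (count(wi) * count(wj));
# bigrams scoring above a threshold are merged into single tokens.
delta = 0.5  # discounting coefficient; illustrative value
bigrams = Counter(zip(corpus, corpus[1:]))

def phrase_score(wi, wj):
    return (bigrams[(wi, wj)] - delta) / (counts[wi] * counts[wj])

# --- Skip-gram with negative sampling --------------------------------------
# For each (center, context) pair, push sigma(v_out[ctx] . v_in[center])
# toward 1 and sigma(v_out[neg] . v_in[center]) toward 0 for K negatives.
DIM, K, LR = 8, 5, 0.05
vocab = list(counts)
vec_in = {w: [random.uniform(-0.5, 0.5) for _ in range(DIM)] for w in vocab}
vec_out = {w: [0.0] * DIM for w in vocab}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_pair(center, context):
    v = vec_in[center]
    # One positive target plus K negatives; the paper draws negatives from
    # the unigram distribution to the 3/4 power, uniform here for brevity.
    targets = [(context, 1.0)] + [(random.choice(vocab), 0.0) for _ in range(K)]
    grad_v = [0.0] * DIM
    for word, label in targets:
        u = vec_out[word]
        g = LR * (label - sigmoid(sum(vi * ui for vi, ui in zip(v, u))))
        for i in range(DIM):
            grad_v[i] += g * u[i]
            u[i] += g * v[i]
    for i in range(DIM):
        v[i] += grad_v[i]

WINDOW = 2
for pos, center in enumerate(subsampled):
    for ctx in subsampled[max(0, pos - WINDOW):pos + WINDOW + 1]:
        if ctx != center:
            train_pair(center, ctx)

# "air canada" scores above an ordinary bigram like "the quick", so a
# preprocessing pass would merge it into the token "air_canada" and learn
# a single vector for the phrase.
print(phrase_score("air", "canada"), phrase_score("the", "quick"))
```

In the paper, bigrams scoring above a threshold are merged into single tokens and the pass is repeated a few times with decreasing thresholds, which is how phrases longer than two words are built up.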

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

    cs.CL 2026-05 unverdicted novelty 8.0

    REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...

  2. SimCSE: Simple Contrastive Learning of Sentence Embeddings

    cs.CL 2021-04 conditional novelty 8.0

    SimCSE achieves 76.3% unsupervised and 81.6% supervised Spearman's correlation on STS tasks with BERT-base, improving prior best results by 4.2% and 2.2% via simple contrastive learning.

  3. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    cs.CL 2019-09 unverdicted novelty 7.0

    Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.

  4. Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

    cs.LG 2026-05 unverdicted novelty 6.0

    A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

  5. MIPIC: Matryoshka Representation Learning via Self-Distilled Intra-Relational and Progressive Information Chaining

    cs.CL 2026-04 unverdicted novelty 6.0

    MIPIC trains nested Matryoshka representations via self-distilled intra-relational alignment with top-k CKA and progressive information chaining across depths, yielding competitive performance especially at extreme lo...

  6. Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

    cs.LG 2026-04 unverdicted novelty 6.0

    DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...

  7. A Multi-head-based architecture for effective morphological tagging in Russian with open dictionary

    cs.CL 2026-04 unverdicted novelty 6.0

    A multi-head attention model for Russian morphological tagging supports open dictionaries via subtoken splitting and reports 98-99% accuracy on grammatical categories while running efficiently on consumer hardware.

  8. ASTRA: Mapping Art-Technology Institutions via Conceptual Axes, Text Embeddings, and Unsupervised Clustering

    cs.DL 2026-03 accept novelty 6.0

    ASTRA combines an eight-axis conceptual framework with text embeddings and unsupervised clustering to map and group 78 art-technology institutions into coherent thematic clusters.

  9. Revisiting Feature Prediction for Learning Visual Representations from Video

    cs.CV 2024-02 conditional novelty 6.0

    V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.

  10. Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task

    cs.CL 2026-04 unverdicted novelty 4.0

    Supervised models using embeddings like jina and e5 reach up to 92% accuracy on multilingual hate speech detection, substantially outperforming anomaly detection, while PCA to 64 dimensions preserves most performance ...