word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method

Yoav Goldberg , Omer Levy

Authors on Pith no claims yet

classification 💻 cs.CL cs.LGstat.ML

keywords mikolovword2vecbehindmodelssoftwaretomasattemptchen

read the original abstract

The word2vec software of Tomas Mikolov and colleagues (https://code.google.com/p/word2vec/ ) has gained a lot of traction lately, and provides state-of-the-art word embeddings. The learning models behind the software are described in two research papers. We found the description of the models in these papers to be somewhat cryptic and hard to follow. While the motivations and presentation may be obvious to the neural-networks language-modeling crowd, we had to struggle quite a bit to figure out the rationale behind the equations. This note is an attempt to explain equation (4) (negative sampling) in "Distributed Representations of Words and Phrases and their Compositionality" by Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Linear Representation Hypothesis and the Geometry of Large Language Models
cs.CL 2023-11 conditional novelty 8.0

Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.
TIDE: Every Layer Knows the Token Beneath the Context
cs.CL 2026-05 unverdicted novelty 5.0

TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.