The article by Yoav Goldberg and Omer Levy aims to explain the negative-sampling word-embedding method used in Tomas Mikolov's word2vec software. The authors find the original descriptions in Mikolov et al.'s papers somewhat cryptic and difficult to follow. They focus on equation (4) from the paper "Distributed Representations of Words and Phrases and their Compositionality."
The article begins by explaining the skip-gram model, in which the goal is to maximize the conditional probabilities of observed contexts given their words across the corpus. The parameters are set to maximize the product of these probabilities, but this objective is computationally expensive because the softmax normalization sums over every possible context. Hierarchical softmax is mentioned as one way around this cost.
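Concretely, the corpus objective and its softmax parameterization take roughly the following form (a sketch in the note's notation: D is the set of observed word-context pairs, C the set of all possible contexts, and v_w, v_c the vectors for word w and context c):

```latex
% Skip-gram corpus objective with the softmax parameterization
% (notation roughly follows Goldberg and Levy's note)
\arg\max_{\theta} \prod_{(w,c) \in D} p(c \mid w; \theta),
\qquad
p(c \mid w; \theta) = \frac{e^{v_c \cdot v_w}}{\sum_{c' \in C} e^{v_{c'} \cdot v_w}}
```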
The negative-sampling approach is then introduced as a more efficient alternative. It optimizes a different objective: randomly generated word-context pairs (negative samples) are treated as incorrect, and their presence rules out the trivial solution in which all vectors are identical and every dot product is maximal. The objective combines the observed (positive) pairs with the sampled negative ones, and it is cheap to compute because it involves only these pairs rather than a normalization over all contexts.
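Written out, the objective the note arrives at (essentially equation (4) of the paper, up to notation) is:

```latex
% Negative-sampling objective: D holds observed word-context pairs,
% D' holds randomly sampled (negative) pairs, and sigma is the logistic function
\arg\max_{\theta}
  \sum_{(w,c) \in D}  \log \sigma(v_c \cdot v_w)
+ \sum_{(w,c) \in D'} \log \sigma(-v_c \cdot v_w),
\qquad
\sigma(x) = \frac{1}{1 + e^{-x}}
```

Maximizing the first sum pushes observed pairs toward high dot products, while the second pushes the sampled pairs toward low ones, which is what prevents the degenerate all-vectors-equal solution.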
The authors also discuss the context definitions used in word2vec, including dynamic (randomly sized) context windows, subsampling of frequent words, and rare-word pruning. Subsampling removes frequent, less informative words before contexts are extracted, which effectively widens the window, while the dynamic window weights nearby context words more heavily than distant ones; together these choices improve the quality of the resulting embeddings.
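A minimal sketch of these two heuristics, assuming the formulas described in the paper (the function names and the t threshold are illustrative, and the released word2vec code implements the subsampling decision with a slightly different expression):

```python
import math
import random


def effective_window(max_window):
    """Dynamic window: for each focus word, sample the window size
    uniformly from 1..max_window, so closer context words are used
    more often than distant ones."""
    return random.randint(1, max_window)


def keep_probability(word_count, total_tokens, t=1e-5):
    """Subsampling of frequent words: the paper discards a token of word w
    with probability 1 - sqrt(t / f(w)), where f(w) is w's relative
    frequency; this returns the corresponding keep probability."""
    f = word_count / total_tokens
    return min(1.0, math.sqrt(t / f))
```

For example, a word making up 1% of the corpus would be kept with probability sqrt(1e-5 / 0.01) ≈ 0.03, while words rarer than the threshold are always kept.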
Despite the improvements, the authors note that the intuition behind why these methods produce good word representations remains somewhat hand-wavy and could benefit from more formal explanations.