The paper "Improving Distributional Similarity with Lessons Learned from Word Embeddings" by Omer Levy, Yoav Goldberg, and Ido Dagan explores the performance differences between neural-network-inspired word embedding models and traditional distributional models. The authors argue that the superior performance of word embeddings is often due to specific system design choices and hyperparameter optimizations rather than the embedding algorithms themselves. They demonstrate that these modifications can be transferred to traditional distributional models, leading to similar gains. The study also shows that there is no consistent advantage to any single approach over others, contradicting claims that embeddings are inherently superior to count-based methods.
The paper begins with background on four word representation methods: the explicit PPMI matrix, its SVD factorization, Skip-Gram with Negative Sampling (SGNS), and GloVe. It then examines the hyperparameters that affect their performance, such as dynamic context windows, subsampling, deleting rare words, shifted PMI, context distribution smoothing, adding context vectors, eigenvalue weighting, and vector normalization.
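Two of these hyperparameters, shifted PMI and context distribution smoothing, originate in SGNS but transfer directly to count-based models. Below is a minimal sketch of applying both to a toy co-occurrence matrix; the matrix and the `alpha` and `k` values are illustrative choices, not taken from the paper (though α = 0.75 mirrors word2vec's smoothing exponent):

```python
import numpy as np

# Toy word-context co-occurrence counts (rows: words, columns: contexts).
# In practice these are collected from a corpus with a sliding window.
counts = np.array([
    [10.0, 2.0, 0.0],
    [ 3.0, 8.0, 1.0],
    [ 0.0, 1.0, 6.0],
])

def shifted_ppmi(counts, alpha=0.75, k=5):
    """Shifted PPMI with context distribution smoothing.

    alpha smooths the context distribution (counts raised to 0.75,
    as in word2vec's negative-sampling distribution); subtracting
    log(k) mirrors SGNS trained with k negative samples.
    """
    total = counts.sum()
    p_w = counts.sum(axis=1) / total                    # P(w)
    smoothed = counts.sum(axis=0) ** alpha
    p_c = smoothed / smoothed.sum()                     # P_alpha(c)
    p_wc = counts / total                               # P(w, c)
    with np.errstate(divide="ignore"):                  # log(0) -> -inf is fine
        pmi = np.log(p_wc / (p_w[:, None] * p_c[None, :]))
    return np.maximum(pmi - np.log(k), 0.0)             # shift, then clip at 0

M = shifted_ppmi(counts)
```

Zero co-occurrence cells stay zero after clipping, so the resulting matrix remains sparse, which is what makes the explicit PPMI representation practical at scale.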
The experimental setup involves evaluating various word representations on eight datasets covering word similarity and analogy tasks. The results show that different hyperparameter configurations have a substantial impact on performance, sometimes exceeding the benefits of switching to a different representation method. The authors also find that careful hyperparameter tuning can outweigh the benefits of adding more data in some cases.
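Analogy questions of the form "a is to a* as b is to ?" are commonly scored by maximizing cosine similarity over normalized vectors (the 3CosAdd rule). A toy sketch under that assumption, with a hypothetical four-word vocabulary and random unit vectors standing in for trained embeddings:

```python
import numpy as np

# Hypothetical tiny embedding table; rows are L2-normalized word vectors.
vocab = ["king", "queen", "man", "woman"]
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 50))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # vector normalization

def analogy_3cosadd(a, a_star, b):
    """Answer 'a is to a_star as b is to ?' by maximizing cosine similarity
    to the offset vector a_star - a + b (question words excluded)."""
    idx = {w: i for i, w in enumerate(vocab)}
    target = W[idx[a_star]] - W[idx[a]] + W[idx[b]]
    target /= np.linalg.norm(target)
    scores = W @ target                          # unit vectors: dot = cosine
    for w in (a, a_star, b):                     # exclude the question words
        scores[idx[w]] = -np.inf
    return vocab[int(np.argmax(scores))]
```

Because the question words are excluded from the candidates, evaluation is sensitive to such protocol details as much as to the embeddings themselves, which is part of the paper's point about system design choices.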
The paper re-evaluates prior claims about the superiority of certain methods, such as embeddings over count-based distributional methods and GloVe over SGNS. It concludes that no single approach shows a consistent, significant advantage over the others, challenging the notion that prediction-based methods are inherently superior.
Finally, the authors provide practical recommendations for tuning hyperparameters and suggest that SGNS is a robust baseline method that performs well across various tasks.
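One such recommendation, adding context vectors, represents each word by the sum w + c of its word and context vectors rather than by w alone. A minimal sketch, assuming trained matrices `W` and `C` (e.g. from SGNS); random values stand in for trained parameters here:

```python
import numpy as np

# Hypothetical trained word matrix W and context matrix C (e.g. from SGNS);
# random stand-ins for illustration.
rng = np.random.default_rng(1)
W = rng.normal(size=(1000, 100))
C = rng.normal(size=(1000, 100))

# "Adding context vectors": represent each word by w + c instead of w alone,
# then renormalize rows so cosine similarity reduces to a dot product.
combined = W + C
combined /= np.linalg.norm(combined, axis=1, keepdims=True)
```

The combined representation changes nothing about training; it is a cheap post-processing step, which is why the paper treats it as a hyperparameter rather than a separate method.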