March 11, 2024 | Harald Steck, Chaitanya Ekanadham, Nathan Kallus
The paper investigates whether the cosine similarity of learned embeddings truly reflects semantic similarity. Cosine similarity is the cosine of the angle between two vectors, or equivalently the dot product of their normalized versions, and it is widely used to measure semantic similarity between high-dimensional objects via their learned embeddings. However, it can yield arbitrary or even meaningless results where the unnormalized dot product does not.

The authors analyze embeddings obtained from regularized linear models and show that cosine similarity can produce arbitrary results because of a degree of freedom in the learned embeddings, even when their unnormalized dot products are well-defined and unique. They derive analytical solutions for linear matrix-factorization (MF) models, demonstrating that cosine similarity is not inherently meaningful and depends on the regularization: different regularization schemes yield different cosine similarities even when the underlying model is invariant to the corresponding transformations. Concretely, the paper contrasts two MF training objectives that differ only in how the L2 regularization is applied: the first leaves the embeddings free to be rescaled arbitrarily, so their cosine similarities are not unique, while the second removes this freedom and yields unique cosine similarities.
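To make the definition and the rescaling degree of freedom concrete, the following is a sketch in generic matrix-factorization notation; the symbols X (data matrix), A and B (embedding matrices), λ, and D are shorthand chosen here, not necessarily the paper's exact formulation.

```latex
% Cosine similarity: dot product of the L2-normalized vectors.
\mathrm{cosSim}(u, v) \;=\; \frac{u^{\top} v}{\|u\|_{2}\,\|v\|_{2}}

% First regularization scheme: L2 penalty on the product of the factors.
\min_{A,B}\;\|X - AB^{\top}\|_F^{2} \;+\; \lambda\,\|AB^{\top}\|_F^{2}

% Second regularization scheme: L2 penalty on each factor separately.
\min_{A,B}\;\|X - AB^{\top}\|_F^{2} \;+\; \lambda\left(\|A\|_F^{2} + \|B\|_F^{2}\right)

% The first objective depends on A and B only through AB^T, which is
% invariant under rescaling by any invertible diagonal matrix D:
(AD)\,(BD^{-1})^{\top} \;=\; A\,D\,D^{-1}\,B^{\top} \;=\; AB^{\top}

% so the fit and the penalty are unchanged while the cosine similarities
% between rows of A (or of B) vary with the arbitrary choice of D. The
% per-factor penalties in the second scheme are not invariant under this
% rescaling, which removes the arbitrariness.
```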
The paper cautions against using cosine similarity blindly and suggests alternatives, such as training models directly with respect to cosine similarity or avoiding the embedding space altogether. Experiments on simulated data show that cosine similarities can vary significantly with the model and regularization used, underscoring the need for care when applying cosine similarity. The findings suggest that, while cosine similarity is widely used, it may not always capture semantic similarity accurately, especially in deep learning models where multiple regularization techniques are applied.
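As a concrete footnote to the rescaling argument above, here is a minimal numerical sketch, assuming nothing beyond NumPy; the matrices A, B and the diagonal D are illustrative stand-ins rather than the paper's experimental setup.

```python
# Rescaling A -> A @ D and B -> B @ inv(D) with a diagonal D leaves the
# dot-product predictions A @ B.T unchanged, but changes the cosine
# similarities between the rows of A (illustrative matrices, not the paper's).
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))    # e.g. "user" embeddings
B = rng.normal(size=(5, 3))    # e.g. "item" embeddings
D = np.diag([0.1, 1.0, 10.0])  # arbitrary positive diagonal rescaling

A2 = A @ D
B2 = B @ np.linalg.inv(D)

def cos_sim(M):
    """Pairwise cosine similarities between the rows of M."""
    N = M / np.linalg.norm(M, axis=1, keepdims=True)
    return N @ N.T

print(np.allclose(A @ B.T, A2 @ B2.T))       # True: dot-product model is unchanged
print(np.allclose(cos_sim(A), cos_sim(A2)))  # False: cosine similarities differ
```

A rotation of the embedding space would leave cosine similarities intact; it is the non-uniform diagonal rescaling, which some regularization schemes fail to pin down, that makes them arbitrary.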