15 Aug 2014 | Felix Hill, Roi Reichart, Anna Korhonen
SimLex-999 is a gold standard resource for evaluating distributional semantic models, designed to quantify similarity better than existing resources such as WordSim-353 and MEN. Unlike these, SimLex-999 explicitly measures similarity rather than association or relatedness, so word pairs that are associated but not similar (e.g., Freud and psychology) receive low ratings. This focus on similarity incentivizes the development of models with broader applications than those that merely reflect conceptual association. SimLex-999 includes a diverse range of concrete and abstract adjective, noun, and verb pairs, along with independent ratings of concreteness and association strength for each pair. This diversity enables fine-grained analyses of model performance on different concept types and provides insights into how architectures can be improved. In contrast to existing gold standards, state-of-the-art models perform well below the inter-annotator agreement ceiling on SimLex-999, indicating significant room for improvement.
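For concreteness, here is a minimal sketch of how such pairs, with their similarity and association ratings, could be represented and loaded programmatically. It assumes the released SimLex-999.txt file is a tab-separated table with header columns named word1, word2, POS, SimLex999, and Assoc(USF); these column names are assumptions about the distribution format, not details quoted from the paper.

```python
# Hypothetical loader for SimLex-999 pair records. The column names below are
# assumptions about the released tab-separated file; adjust to the real header.
import csv
from dataclasses import dataclass

@dataclass
class SimLexPair:
    word1: str
    word2: str
    pos: str            # "A", "N", or "V"
    similarity: float   # gold SimLex-999 similarity rating
    association: float  # independent association-strength rating

def load_simlex(path="SimLex-999.txt"):
    pairs = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            pairs.append(SimLexPair(
                word1=row["word1"],
                word2=row["word2"],
                pos=row["POS"],
                similarity=float(row["SimLex999"]),
                association=float(row["Assoc(USF)"]),
            ))
    return pairs
```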
SimLex-999 was created by 500 paid native English speakers recruited via Amazon Mechanical Turk, who rated the similarity of concept pairs through a simple visual interface. The pairs were selected based on empirical evidence that humans represent concepts differently depending on their part of speech (POS) and their concreteness. SimLex-999 therefore includes a principled selection of adjective, verb, and noun pairs covering the full concreteness spectrum, enabling more nuanced analyses of how computational models handle different concept types.
Quantitative and qualitative analyses of the SimLex-999 ratings show that participants could consistently quantify the similarity of the full range of concepts and distinguish it from association. Unlike existing datasets, SimLex-999 contains many pairs, such as [movie, theater], that are strongly associated but receive low similarity scores.
The second main contribution of the paper is an evaluation of state-of-the-art distributional semantic models using SimLex-999, including neural language models (NLMs), vector space models (VSMs), and Latent Semantic Analysis (LSA). The analysis shows how SimLex-999 can be used to uncover substantial differences in the ability of models to represent concepts of different types. Despite these differences, all of the models considered capture association better than similarity. Their difficulty in estimating similarity is driven primarily by pairs that have high association ratings but low similarity scores in SimLex-999. As a result, current models achieve notably lower scores on SimLex-999 than on existing gold standards.
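Evaluation against a gold standard of this kind is conventionally done by scoring every pair with the model and correlating the scores with the human ratings, typically via Spearman's rho. The sketch below assumes word vectors in a plain {word: numpy array} dictionary, a hypothetical stand-in for the output of any NLM, VSM, or LSA model, and reuses the SimLexPair records from the loader sketch above.

```python
# Sketch of evaluating word vectors against SimLex-999: cosine similarity per
# pair, then Spearman correlation with the gold ratings. `vectors` is a
# hypothetical {word: np.ndarray} mapping from any distributional model.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate(pairs, vectors):
    gold, pred = [], []
    for p in pairs:
        if p.word1 in vectors and p.word2 in vectors:  # skip out-of-vocabulary pairs
            gold.append(p.similarity)
            pred.append(cosine(vectors[p.word1], vectors[p.word2]))
    rho, _ = spearmanr(gold, pred)
    return rho
```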
The paper also explores how distributional models might improve their similarity modelling. It evaluates models on subsets of SimLex-999, including abstract and concrete subsets and subsets of more and less strongly associated pairs. The analysis confirms the hypothesis that models trained on input informed by dependency parsing produce better similarity estimates, but finds no evidence for the related hypothesis that smaller context windows do the same.
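Subset analyses of this kind can be run with the same machinery: partition the pairs, here by association strength, and report the correlation on each partition. The top-third split below is an illustrative choice for "strongly associated" pairs rather than the paper's exact protocol, and load_simlex and evaluate come from the earlier sketches.

```python
# Illustrative subset analysis: how well does a model rank the most strongly
# associated pairs versus the rest? Splitting at the top third is an
# illustrative assumption, not necessarily the paper's exact partition.
def association_split(pairs, fraction=1 / 3):
    ranked = sorted(pairs, key=lambda p: p.association, reverse=True)
    cut = int(len(ranked) * fraction)
    return ranked[:cut], ranked[cut:]   # (strongly associated, remainder)

def report(path, vectors):
    pairs = load_simlex(path)
    strong, rest = association_split(pairs)
    print(f"all pairs:           rho = {evaluate(pairs, vectors):.3f}")
    print(f"strongly associated: rho = {evaluate(strong, vectors):.3f}")
    print(f"less associated:     rho = {evaluate(rest, vectors):.3f}")
```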