This paper presents a method for detecting and disambiguating named entities in open-domain text using knowledge from an online encyclopedia, specifically Wikipedia. The method trains a Support Vector Machine (SVM) with a disambiguation kernel that exploits Wikipedia's broad coverage and structured information. The goal is to resolve name ambiguity: a single proper name can denote several entities, such as different people who share a name, or entities of different types like a snake, a programming language, or a movie.
The paper introduces a dictionary of named entities, where each entry maps to a set of entities that can be denoted by a proper name. The dictionary is constructed by identifying named entities from Wikipedia, considering their titles, redirect names, and disambiguation names. The method also leverages Wikipedia's categories and hyperlinks to create a dataset of disambiguated queries, which are used to train the SVM kernel.
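The dictionary construction described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name `build_name_dictionary`, the input shapes, and the toy titles are all assumptions made for the example.

```python
from collections import defaultdict


def build_name_dictionary(titles, redirects, disambiguations):
    """Map each proper name to the set of entities it may denote.

    titles: iterable of article titles (each title names its own entity)
    redirects: dict mapping a redirect name to its target article title
    disambiguations: dict mapping an ambiguous name to a list of article titles
    All three inputs are hypothetical, pre-extracted views of Wikipedia.
    """
    name_to_entities = defaultdict(set)
    for title in titles:
        name_to_entities[title].add(title)
    for alias, target in redirects.items():
        name_to_entities[alias].add(target)
    for name, targets in disambiguations.items():
        name_to_entities[name].update(targets)
    return name_to_entities


# Toy data, invented for illustration only.
titles = ["John Williams (composer)", "John Williams (wrestler)"]
redirects = {"John Towner Williams": "John Williams (composer)"}
disambiguations = {
    "John Williams": ["John Williams (composer)", "John Williams (wrestler)"],
}

name_dict = build_name_dictionary(titles, redirects, disambiguations)
```

A name with more than one entity in its set, such as `"John Williams"` here, is ambiguous and needs disambiguation; a name with a single entity does not.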
The key contribution of the paper is the development of a taxonomy kernel that combines cosine similarity between query contexts and article texts with correlations between context words and Wikipedia categories. This kernel is designed to improve the accuracy of named entity disambiguation by considering both the textual context and the hierarchical structure of Wikipedia categories. The method is evaluated on different scenarios, showing that the taxonomy kernel significantly outperforms the cosine similarity baseline in most cases.
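One way to picture how the two signals combine is a linear scoring function over a feature vector that holds the cosine similarity plus one indicator per (context word, category) pair. The sketch below assumes that form; the helper names, the hand-set `weights`, and the toy candidates are hypothetical, and in the paper the weights would be learned by the SVM rather than set by hand.

```python
import math
from collections import Counter


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bags of words (raw term counts)."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pair_features(context, article_tokens, categories):
    """Cosine feature plus one indicator per (context word, category) pair."""
    feats = {"cos": cosine(context, article_tokens)}
    for word in set(context):
        for category in categories:
            feats[(word, category)] = 1.0
    return feats


def score(weights, feats):
    """Linear score; features without a weight contribute nothing."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())


def rank(context, candidates, weights):
    """Return the candidate entity with the highest score."""
    return max(
        candidates,
        key=lambda e: score(weights, pair_features(context, e["text"], e["categories"])),
    )


# Toy data and hand-set weights, invented for illustration only.
context = "williams conducted the orchestra".split()
candidates = [
    {"title": "John Williams (composer)",
     "text": "american composer who conducted the boston pops orchestra".split(),
     "categories": ["Composers"]},
    {"title": "John Williams (wrestler)",
     "text": "canadian professional wrestler".split(),
     "categories": ["Wrestlers"]},
]
weights = {"cos": 1.0, ("orchestra", "Composers"): 0.5}

best = rank(context, candidates, weights)
```

The word-category weights let a context word such as "orchestra" vote for an entity through its categories even when the article text itself shares few words with the query.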
The paper also addresses the challenge of detecting out-of-Wikipedia entities, that is, entities that Wikipedia does not cover. A special entity is introduced to represent them, and the ranking function is adjusted so that this special entity competes with the known candidates. The results show that the taxonomy kernel provides better performance in disambiguating named entities, especially when the query context is highly correlated with the categories of the entities.
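A simple way to read the special-entity idea is as one extra candidate whose score acts as a threshold: if no known entity scores above it, the query is labelled out-of-Wikipedia. The sketch below assumes exactly that; the constant `OUT_OF_WIKIPEDIA`, the function `disambiguate`, and the fixed threshold `tau` are hypothetical names, and the summary does not specify how the adjustment is actually made.

```python
OUT_OF_WIKIPEDIA = "<out-of-Wikipedia>"  # hypothetical label for the special entity


def disambiguate(scores, tau):
    """Pick the best entity, or the special entity if nothing beats tau.

    scores: dict mapping each candidate entity to its ranking score
    tau: score assigned to the special out-of-Wikipedia entity (assumed fixed here)
    """
    candidates = dict(scores)
    candidates[OUT_OF_WIKIPEDIA] = tau
    # max() returns the first maximum, so a tie with tau favours a known entity.
    return max(candidates, key=candidates.get)
```

For instance, `disambiguate({"John Williams (composer)": 0.3}, tau=0.5)` returns the special entity because the only known candidate scores below the threshold.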
The experiments demonstrate that the proposed method improves the accuracy of named entity disambiguation, particularly when leveraging the structured information from Wikipedia. The method has the potential to enhance the performance of search engines and other natural language processing tasks by providing more accurate and relevant results. The paper concludes that the approach effectively utilizes the knowledge from an online encyclopedia to improve the disambiguation of named entities in open-domain text.