This paper presents a large-scale system for named entity recognition and semantic disambiguation using information from Wikipedia and web search results. The system aims to identify and label named entities in text, such as people, locations, and organizations, while resolving ambiguities in their surface forms. The approach involves maximizing agreement between contextual information from Wikipedia and the document's context, as well as among category tags of candidate entities. The system achieves high disambiguation accuracy on both news stories and Wikipedia articles.
The paper situates the work within prior research on named entity recognition, including the ENAMEX, TIMEX, and NUMEX annotation categories defined in evaluations such as MUC and ACE. It highlights the importance of semantic disambiguation when scaling entity tracking to large document collections or the web, since many surface forms are ambiguous. For example, "Texas" can refer to multiple entities, including a U.S. state, a band, and a TV series.
The system uses Wikipedia as a comprehensive source of entity information, including entity pages, redirect pages, disambiguation pages, and list pages. It extracts surface forms, category tags, and contextual information from these sources. The system then uses these to disambiguate entities in text by maximizing agreement between contextual and category information.
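The mapping from surface forms to candidate entities can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation; the three input structures are hypothetical simplifications of what would be extracted from a Wikipedia dump.

```python
from collections import defaultdict

def build_surface_form_index(entity_pages, redirects, disambiguation_pages):
    """Map each surface form to the set of entities it may denote.

    Hypothetical simplified inputs:
      entity_pages:         {page_title: set_of_category_tags}
      redirects:            {alias_title: canonical_title}
      disambiguation_pages: {surface_form: [candidate_titles]}
    """
    index = defaultdict(set)
    # An entity page's own title is a surface form for that entity.
    for title in entity_pages:
        index[title].add(title)
    # Redirect pages contribute alternative surface forms (aliases).
    for alias, target in redirects.items():
        index[alias].add(target)
    # Disambiguation pages list several candidates for one ambiguous form.
    for form, candidates in disambiguation_pages.items():
        index[form].update(candidates)
    return index
```

For the "Texas" example from the paper, the index would map the single surface form to all of its candidate entities (the state, the band, the TV series), which downstream disambiguation then resolves.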
The disambiguation process uses a vector space model in which the document is compared with vectors representing Wikipedia entities. The system builds an extended document vector from the document's context together with the category tags of the candidate entities, and selects the entity assignments that maximize the agreement between this vector and the entity vectors. The approach is evaluated on both Wikipedia articles and news stories, showing high accuracy.
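The core comparison step can be sketched as choosing, for an ambiguous surface form, the candidate whose Wikipedia context vector has the highest cosine similarity with the document vector. This is a simplified sketch of the vector-space idea only; the paper's actual system also incorporates category-tag agreement and an extended document vector.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * v.get(term, 0) for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def disambiguate(document_terms, candidate_contexts):
    """Pick the candidate entity whose context best agrees with the document.

    document_terms:     list of tokens from the document's context
    candidate_contexts: {entity_title: list of context tokens from Wikipedia}
    (hypothetical simplified inputs)
    """
    doc_vec = Counter(document_terms)
    return max(candidate_contexts,
               key=lambda entity: cosine(doc_vec, Counter(candidate_contexts[entity])))
```

With a document mentioning "Austin" and "Oklahoma", this sketch would resolve the surface form "Texas" to the U.S. state rather than the band or the TV series, since the state's context vector shares the most terms with the document.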
The system is implemented as a web browser application that can analyze any web page or client text document. It has the potential to move from a word-based space to a concept-based space, enabling new research directions in entity-based indexing, searching, and personalized web views. The system uses minimal language-dependent resources beyond Wikipedia, making it adaptable to other languages. The results show that the system achieves high accuracy in disambiguation, with 91.4% accuracy on news stories and 88.3% on Wikipedia articles.