November 1, 2004 | Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates
This paper presents an experimental study of the KNOWITALL system, which automatically extracts large collections of facts from the Web in an unsupervised, domain-independent, and scalable manner. The system uses a generate-and-test architecture to extract information in two stages. First, it generates candidate facts using domain-independent extraction patterns. Second, it tests the plausibility of these facts using pointwise mutual information (PMI) statistics derived from Web search engine results. The system also incorporates three methods to improve recall and extraction rate: Pattern Learning, Subclass Extraction, and List Extraction. These methods do not require hand-labeled training examples and instead bootstrap from the system's domain-independent methods. The paper reports experiments that measure the relative efficacy of these methods and demonstrate their synergy. The results show that the methods significantly increase recall while maintaining high precision, with the system discovering over 10,000 cities missing from the Tipster Gazetteer. The paper also discusses the system's design, including its extraction rules, discriminators, and the use of PMI statistics for validation. The system's performance is evaluated using precision and recall metrics, and the results show that the system achieves high precision and recall for various classes, including City, USState, Country, Actor, and Film. The study highlights the effectiveness of the system in extracting high-quality information from the Web without supervision.
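The generate-and-test idea can be illustrated with a small Python sketch. It is not the KNOWITALL implementation: the "such as" pattern, the function names, the hit counts, and the threshold are all hypothetical, and the hit counts are hard-coded rather than fetched from a search engine. The PMI score is assumed here to be the ratio of hits for the instance together with a discriminator phrase to hits for the instance alone, and a single threshold stands in for whatever validation step the full system applies.

```python
import re

# Hypothetical generic pattern: "<class plural> such as A, B and C".
def generate_candidates(text, class_plural):
    """Generate stage: pull capitalized names that follow '<class> such as'."""
    pattern = re.compile(
        re.escape(class_plural) + r" such as ([^.;:!?]+)", re.IGNORECASE
    )
    candidates = []
    for match in pattern.finditer(text):
        # Split the captured list on commas and on the word "and".
        for item in re.split(r",\s*(?:and\s+)?|\s+and\s+", match.group(1)):
            # Keep only the leading run of capitalized words, e.g. "New York".
            name = re.match(r"[A-Z][\w'-]*(?: [A-Z][\w'-]*)*", item.strip())
            if name:
                candidates.append(name.group(0))
    return candidates


def pmi_score(hits_with_discriminator, hits_instance_alone):
    """Assumed PMI form: joint hit count divided by the instance's hit count."""
    if hits_instance_alone == 0:
        return 0.0
    return hits_with_discriminator / hits_instance_alone


def filter_candidates(candidates, hit_counts, discriminator, threshold):
    """Test stage: keep candidates whose PMI score reaches the threshold."""
    accepted = []
    for name in candidates:
        joint = hit_counts.get((discriminator, name), 0)
        alone = hit_counts.get(name, 0)
        if pmi_score(joint, alone) >= threshold:
            accepted.append(name)
    return accepted


if __name__ == "__main__":
    text = "Flights serve cities such as Paris, New York and Rome."
    found = generate_candidates(text, "cities")
    # Made-up hit counts standing in for search engine results.
    hits = {
        ("city", "Paris"): 5000, "Paris": 100000,
        ("city", "New York"): 9000, "New York": 150000,
        ("city", "Rome"): 10, "Rome": 200000,
    }
    print(filter_candidates(found, hits, "city", threshold=0.01))
```

In this toy run, "Rome" is rejected only because its invented joint hit count is low; with real counts the outcome would depend entirely on the search engine statistics.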