November 1, 2004 | Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates
The KNOWITALL system aims to automate the extraction of large collections of facts from the Web in an unsupervised, domain-independent, and scalable manner. The paper presents an overview of KNOWITALL's architecture and design principles, emphasizing its ability to extract information without hand-labeled training examples. In its first major run, KNOWITALL extracted over 50,000 facts, but the run raised a challenge: how to improve recall and extraction rate without sacrificing precision. To address this, the paper introduces three methods, Pattern Learning, Subclass Extraction, and List Extraction, evaluates their performance, and shows that they are synergistic. Each method bootstraps from KNOWITALL's domain-independent methods, and together they significantly increased recall while maintaining high precision.
The paper also discusses the use of Web-based mutual information statistics for validating the output of the information extraction system and provides a comprehensive overview of KNOWITALL, its design decisions, and experimental justification.
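To make the idea of Web-based mutual information validation concrete, here is a minimal sketch, not the system's actual implementation. It assumes a score of the form "hits for a discriminator phrase containing the candidate, divided by hits for the candidate alone". The function names, the "city of {}" discriminator template, and the hit counts are all hypothetical and chosen only for illustration; a real system would obtain counts from a search engine.

```python
def pmi_score(hits_with_discriminator: int, hits_instance: int) -> float:
    """Ratio of co-occurrence hits to instance-only hits.

    Returns 0.0 when the instance has no hits, to avoid dividing by zero.
    """
    if hits_instance <= 0:
        return 0.0
    return hits_with_discriminator / hits_instance


# Hypothetical hit counts, invented for this example (not real search data).
HIT_COUNTS = {
    "Paris": 1000,
    "city of Paris": 50,
    "Banana": 2000,
    "city of Banana": 1,
}


def score_candidate(candidate, template="city of {}", counts=HIT_COUNTS):
    """Score a candidate instance against one discriminator phrase template."""
    phrase = template.format(candidate)
    return pmi_score(counts.get(phrase, 0), counts.get(candidate, 0))


if __name__ == "__main__":
    for name in ("Paris", "Banana"):
        print(name, score_candidate(name))
```

Under these made-up counts, a plausible class instance scores higher than an implausible one, which is the kind of signal a validation step could threshold or feed into a classifier.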