2004 | Oren Etzioni, Michael Cafarella, Doug Downey, Stanley Kok, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, Alexander Yates
This paper introduces KNOWITALL, a system designed to automate the process of extracting large collections of facts from the web in an autonomous, domain-independent, and scalable manner. KNOWITALL leverages existing web search engines to compute statistics efficiently and associates a probability with each extracted fact, enabling it to trade off precision and recall. The system uses a domain- and language-independent architecture, with modules for extraction, search engine interface, assessment, and database storage.
KNOWITALL's main modules are the Extractor, which instantiates extraction rules for each class and relation; the Search Engine Interface, which formulates queries based on these rules; the Assessor, which uses pointwise mutual information (PMI) to assess the likelihood of extracted facts; and the Database, which stores the extracted information. The paper presents preliminary experiments in which KNOWITALL extracted over 50,000 facts in four days, demonstrating its effectiveness in extracting information from the web. The authors also discuss design choices and lessons learned, including the use of recursive query expansion to access more search engine results and the effect of different feature selection methods on precision and recall.
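
To make the Assessor's role more concrete, the following is a minimal sketch of PMI-style scoring from search engine hit counts. It assumes one common web-based formulation, in which the score is the ratio of hits for a discriminator phrase containing the candidate to hits for the candidate alone; the summary above does not state the exact formula. The names `pmi_score`, `fake_hits`, and `FAKE_HITS`, the discriminator phrase `"city of {}"`, and all hit counts are hypothetical and stand in for a real search engine interface.

```python
from typing import Callable, Dict

def pmi_score(instance: str, discriminator: str, hits: Callable[[str], int]) -> float:
    """Approximate PMI between a candidate instance and a discriminator phrase.

    Assumed formulation (not taken from the summary):
        hits(discriminator phrase with instance) / hits(instance alone)
    """
    instance_hits = hits(f'"{instance}"')
    if instance_hits == 0:
        # No evidence for the instance at all: treat the score as zero.
        return 0.0
    joint_hits = hits(f'"{discriminator.format(instance)}"')
    return joint_hits / instance_hits

# Hypothetical hit counts standing in for real search engine results.
FAKE_HITS: Dict[str, int] = {
    '"Paris"': 1000,
    '"city of Paris"': 50,
    '"Banana"': 2000,
    '"city of Banana"': 1,
}

def fake_hits(query: str) -> int:
    """Look up a made-up hit count; unknown queries return zero hits."""
    return FAKE_HITS.get(query, 0)

paris = pmi_score("Paris", "city of {}", fake_hits)
banana = pmi_score("Banana", "city of {}", fake_hits)
print(f"Paris: {paris:.4f}, Banana: {banana:.4f}")
```

In this sketch a plausible city name scores much higher than an implausible one, which is the kind of signal an assessor could use as a feature. How scores are turned into the per-fact probability mentioned above is not covered here.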