May 17-22, 2004 | Oren Etzioni, Michael Cafarella, Doug Downey, Stanley Kok, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, Alexander Yates
KNOWITALL is a system that automatically extracts large collections of facts from the web in an autonomous, domain-independent, and scalable manner. Preliminary experiments showed that an instance of KNOWITALL, running for four days on a single machine, extracted 54,753 facts. The system associates a probability with each fact, enabling it to trade off precision and recall. The paper analyzes the architecture of KNOWITALL and reports on lessons learned for the design of large-scale information extraction systems.
KNOWITALL uses statistics computed by treating the web as a large corpus of text to evaluate the information it extracts. It leverages existing web search engines to compute these statistics efficiently. Based on its evaluation, KNOWITALL associates a probability with every fact it extracts, enabling it to automatically trade recall for precision. In our experiments, KNOWITALL ran for four days and extracted over 50,000 facts regarding cities, states, countries, actors, and films. We analyze the extraction rate and the precision/recall achieved in this run in Section 3.
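To make the recall-for-precision trade concrete, here is a minimal sketch of filtering facts by a probability threshold. The facts, probabilities, and correctness labels are invented for illustration and are not results from the paper. The function name `precision_recall` is likewise hypothetical, and recall is measured only against the correct facts in this toy list.

```python
from typing import List, Tuple

# (fact, assigned probability, is the fact actually correct?) -- made-up data
Fact = Tuple[str, float, bool]

def precision_recall(facts: List[Fact], threshold: float) -> Tuple[float, float]:
    """Keep facts whose probability is at least `threshold`, then score the kept set."""
    kept = [f for f in facts if f[1] >= threshold]
    total_correct = sum(1 for f in facts if f[2])
    kept_correct = sum(1 for f in kept if f[2])
    precision = kept_correct / len(kept) if kept else 1.0
    recall = kept_correct / total_correct if total_correct else 0.0
    return precision, recall

facts = [
    ("City(Paris)", 0.95, True),
    ("City(Nice)", 0.80, True),
    ("City(Tours)", 0.70, False),
    ("City(Monte Carlo)", 0.60, True),
    ("City(Seattle)", 0.40, True),
    ("City(Europe)", 0.20, False),
]

# Raising the threshold keeps fewer facts: precision rises, recall falls.
for threshold in (0.0, 0.5, 0.9):
    print(threshold, precision_recall(facts, threshold))
```

In this toy data, a threshold of 0.9 keeps only one fact (precision 1.0, recall 0.25), while a threshold of 0.0 keeps everything (recall 1.0 at lower precision).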
KNOWITALL is an autonomous system that extracts facts, concepts, and relationships from the web. It is seeded with an extensible ontology and a small number of generic rule templates from which it creates text extraction rules for each class and relation in its ontology. The system relies on a domain- and language-independent architecture to populate the ontology with specific facts and relations. KNOWITALL is designed to support scalability and high throughput. Each KNOWITALL module runs as a thread, and modules communicate through asynchronous message passing.
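The thread-per-module design with asynchronous message passing can be sketched as follows. This is not KNOWITALL's actual code: the `module` loop, the `STOP` sentinel, and the two placeholder handlers are assumptions chosen only to show the pattern, using Python's standard `threading` and `queue` modules.

```python
import queue
import threading

STOP = object()  # sentinel message telling a module to shut down

def module(handler, inbox, outbox):
    """One module loop: read a message, handle it, forward the result."""
    while True:
        msg = inbox.get()        # blocks until a message arrives
        if msg is STOP:
            outbox.put(STOP)     # propagate shutdown to the next module
            return
        outbox.put(handler(msg))

def run_pipeline(messages):
    q_in, q_mid, q_out = queue.Queue(), queue.Queue(), queue.Queue()

    # Placeholder handlers standing in for real module logic (hypothetical).
    def extract(text):
        return ("candidate", text.strip())

    def assess(candidate):
        return (candidate[1], 0.5)  # dummy probability, not a real assessment

    threads = [
        threading.Thread(target=module, args=(extract, q_in, q_mid)),
        threading.Thread(target=module, args=(assess, q_mid, q_out)),
    ]
    for t in threads:
        t.start()

    # The sender enqueues messages without waiting for them to be processed.
    for m in messages:
        q_in.put(m)
    q_in.put(STOP)

    results = []
    while True:
        item = q_out.get()
        if item is STOP:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results

print(run_pipeline([" Paris ", "Nice"]))
```

Because each module only touches its own inbox and outbox, a slow stage delays its consumers without blocking the producers feeding it, which is one plausible reason to prefer queues over direct calls in a high-throughput design.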
KNOWITALL’s main modules are described below:
- Extractor: KNOWITALL instantiates a set of extraction rules for each class and relation from a set of generic, domain-independent templates. For example, the generic template "NP1 such as NPList2" indicates that the head of each simple noun phrase (NP) in NPList2 is an instance of the class named in NP1. This template can be instantiated to find city names from such sentences as "We provide tours to cities such as Paris, Nice, and Monte Carlo." KNOWITALL would extract three instances of the class City from this sentence.
- Search Engine Interface: KNOWITALL automatically formulates queries based on its extraction rules. Each rule has an associated search query composed of the keywords in the rule. For example, the above rule would lead KNOWITALL to issue the query "cities such as" to a search engine, download in parallel each of the pages named in the engine's results, and apply the Extractor to the appropriate sentences on each downloaded page. KNOWITALL makes use of up to 12 search engines including Google, Alta Vista, Fast, and others.
- Assessor: KNOWITALL uses statistics computed by querying search engines to assess the likelihood that the Extractor's conjectures are correct. Specifically, the Assessor uses a form of point
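
The three modules above can be sketched end to end under some loudly stated assumptions. The regular expression is a crude stand-in for real noun-phrase chunking. The names `instantiate_rule`, `extract`, `pmi_score`, and `fake_hits` are hypothetical, and the hit counts are invented. The Assessor's statistic is not fully spelled out above, so the sketch assumes a pointwise-mutual-information-style ratio: hits for the rule keywords together with the instance, divided by hits for the instance alone.

```python
import re

def instantiate_rule(class_name, plural):
    """Instantiate the generic template "NP1 such as NPList2" for one class."""
    keywords = f"{plural} such as"
    return {"class": class_name, "keywords": keywords, "query": f'"{keywords}"'}

# Approximates a simple proper noun phrase as a run of capitalized words.
PROPER_NP = re.compile(r"[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*")

def extract(rule, sentence):
    """Return candidate instances that follow the rule keywords in a sentence."""
    m = re.search(re.escape(rule["keywords"]) + r"\s+([^.;!?]+)",
                  sentence, flags=re.IGNORECASE)
    if not m:
        return []
    # Split the list on commas and on the conjunctions "and" / "or".
    parts = re.split(r",\s*(?:and\s+|or\s+)?|\s+and\s+|\s+or\s+", m.group(1))
    return [p.strip() for p in parts if PROPER_NP.fullmatch(p.strip())]

def pmi_score(instance, discriminator, hits):
    """PMI-style ratio: hits(discriminator + instance) / hits(instance)."""
    alone = hits(instance)
    if alone == 0:
        return 0.0
    return hits(f"{discriminator} {instance}") / alone

# Invented hit counts; a real system would query a search engine here.
FAKE_HITS = {"Paris": 1000, "cities such as Paris": 50}

def fake_hits(query):
    return FAKE_HITS.get(query, 0)

rule = instantiate_rule("City", "cities")
sentence = "We provide tours to cities such as Paris, Nice, and Monte Carlo."
print(rule["query"])               # query string sent to the search engine
print(extract(rule, sentence))     # candidate City instances
print(pmi_score("Paris", rule["keywords"], fake_hits))
```

The raw ratio is only a score, not a probability. Turning such scores into the per-fact probabilities described earlier would need an additional step that this sketch does not attempt.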