Open Information Extraction from the Web

Michele Banko, Michael J. Cafarella, Stephen Soderland, Matt Broadhead, and Oren Etzioni
This paper introduces Open Information Extraction (OIE), an extraction paradigm that enables domain-independent discovery of relations from text and scales to the diversity and size of the Web corpus. Unlike traditional Information Extraction (IE) systems, which require users to specify relations in advance, an OIE system discovers possible relations automatically in a single pass over the corpus.

The paper also introduces TEXTRUNNER, a fully implemented, highly scalable OIE system that assigns probabilities to extracted tuples and indexes them so that users can extract and explore them efficiently through queries. TEXTRUNNER consists of three key modules: a self-supervised learner that labels candidate extractions as trustworthy or not, a single-pass extractor that identifies tuples for all possible relations, and a redundancy-based assessor that assigns probabilities to tuples based on a probabilistic model of redundancy. The system processes text at 0.036 CPU seconds per sentence, making it significantly faster than traditional IE systems.

On a corpus of 9 million Web pages, TEXTRUNNER outperforms the state-of-the-art Web IE system, KNOWITALL, in both efficiency and scalability. It achieves a 33% lower error rate than KNOWITALL while finding an almost identical number of correct extractions, and it extracts a far broader set of facts, reflecting orders of magnitude more relations, in the time KNOWITALL takes to perform extraction for a handful of pre-specified relations. TEXTRUNNER's 11 million highest-probability tuples contain over 1 million concrete facts and over 6.5 million abstract assertions. Of these 11 million tuples, 7.8 million have well-formed relations and arguments with a probability of at least 0.8; human reviewers judged 80.4% of those to be correct.

Because TEXTRUNNER extracts information for all relations at once, without requiring them to be named explicitly in its input, it has a significant scalability advantage over previous IE systems. An inverted index distributed over a pool of machines lets it answer queries over millions of tuples at interactive speeds, and it supports complex relational queries that are not possible with the standard inverted index used by today's search engines.

The paper concludes that TEXTRUNNER is a scalable, efficient, and effective system for Open Information Extraction from the Web. It could be integrated with methods for detecting synonyms and resolving multiple mentions of entities, as well as with methods for learning the types of entities that relations commonly take.
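
The assess-then-index idea described above can be illustrated with a small Python sketch. This is not TEXTRUNNER's implementation: the `Triple` type, the `assess` function, the `TupleIndex` class, the fixed per-sentence precision `p`, and the sample tuples are all hypothetical names and data chosen for illustration. In particular, the noisy-or formula is a simple stand-in for the paper's probabilistic model of redundancy, whose details are not given in this summary, and the index is a single in-memory dictionary rather than one distributed over a pool of machines.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """A hypothetical (argument, relation, argument) extraction."""
    arg1: str
    rel: str
    arg2: str


def assess(extractions, p=0.5):
    """Map each distinct tuple to a noisy-or confidence.

    Assumption: every supporting sentence is independently correct with
    probability p, so a tuple seen k times scores 1 - (1 - p) ** k.
    """
    counts = Counter(extractions)
    return {t: 1.0 - (1.0 - p) ** k for t, k in counts.items()}


class TupleIndex:
    """Toy inverted index: (field, token) -> set of tuples containing it."""

    def __init__(self, tuples):
        self.postings = defaultdict(set)
        for t in tuples:
            for field, text in t._asdict().items():
                for token in text.lower().split():
                    self.postings[(field, token)].add(t)

    def query(self, **fields):
        """Return tuples matching every keyword, e.g. query(rel="invented")."""
        sets = [
            self.postings.get((field, token), set())
            for field, text in fields.items()
            for token in text.lower().split()
        ]
        return set.intersection(*sets) if sets else set()


# Made-up extractions; the repeated tuple models redundancy across sentences.
extractions = [
    Triple("Edison", "invented", "the light bulb"),
    Triple("Edison", "invented", "the light bulb"),
    Triple("Edison", "invented", "the phonograph"),
    Triple("Bell", "invented", "the telephone"),
]

scores = assess(extractions)
index = TupleIndex(scores)  # index the distinct tuples
hits = index.query(arg1="Edison", rel="invented")
for t in sorted(hits, key=scores.get, reverse=True):
    print(t, round(scores[t], 2))
```

Because postings are keyed by field as well as token, a query can constrain the relation and either argument separately, which a plain keyword index over documents cannot express. The tuple supported by two sentences ranks above the one supported by a single sentence, which is the intuition behind scoring by redundancy.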