Michele Banko, Michael J Cafarella, Stephen Soderland, Matt Broadhead and Oren Etzioni
This paper introduces Open Information Extraction (OIE), a novel paradigm that enables domain-independent discovery of relations from text, scaling to the diversity and size of web corpora. OIE systems make a single pass over their corpus to extract a large set of relational tuples without requiring any human input. The paper also presents TEXTRUNNER, a fully implemented OIE system that assigns probabilities to tuples and supports efficient extraction and exploration via user queries.
TEXTRUNNER's architecture consists of three key modules: a self-supervised learner, a single-pass extractor, and a redundancy-based assessor. The self-supervised learner uses a small corpus sample to train a classifier that labels candidate extractions as trustworthy or not. The single-pass extractor then runs once over the entire corpus, using lightweight NLP techniques to extract tuples for all possible relations and keeping those the classifier labels trustworthy. The redundancy-based assessor assigns probabilities to retained tuples based on a probabilistic model of redundancy in text.
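The three-module flow can be illustrated with a toy Python sketch. Everything in it is an assumption made for illustration and not the system's actual implementation: sentences arrive pre-tagged as (token, tag) pairs with the made-up tags "N" (noun-phrase token) and "V" (relation token), a hand-written rule (is_trustworthy) stands in for the trained classifier, and a simple noisy-or formula with an arbitrary p_single stands in for the probabilistic redundancy model.

```python
from collections import Counter
from itertools import groupby


def chunk(tagged):
    """Merge consecutive tokens that share a tag into (tag, text) chunks."""
    return [
        (tag, " ".join(tok for tok, _ in group))
        for tag, group in groupby(tagged, key=lambda pair: pair[1])
    ]


def extract_candidates(tagged):
    """Yield (arg1, rel, arg2) for every noun / relation / noun chunk triple."""
    chunks = chunk(tagged)
    for (t1, arg1), (t2, rel), (t3, arg2) in zip(chunks, chunks[1:], chunks[2:]):
        if (t1, t2, t3) == ("N", "V", "N"):
            yield (arg1, rel, arg2)


def is_trustworthy(arg1, rel, arg2):
    """Hypothetical stand-in for the learned classifier: a fixed rule."""
    return bool(arg1 and arg2) and 0 < len(rel.split()) <= 4


def single_pass(corpus):
    """Visit each sentence once and count the tuples that pass the filter."""
    counts = Counter()
    for tagged in corpus:
        for tup in extract_candidates(tagged):
            if is_trustworthy(*tup):
                counts[tup] += 1
    return counts


def assess(counts, p_single=0.6):
    """Noisy-or stand-in: chance that at least one of k sightings is correct."""
    return {tup: 1.0 - (1.0 - p_single) ** k for tup, k in counts.items()}


if __name__ == "__main__":
    edison = [("Edison", "N"), ("invented", "V"), ("the", "N"), ("phonograph", "N")]
    einstein = [("Einstein", "N"), ("was", "V"), ("born", "V"), ("in", "V"), ("Ulm", "N")]
    for tup, prob in assess(single_pass([edison, edison, einstein])).items():
        print(tup, round(prob, 2))
```

Note that no relation names are supplied in advance: the relation string is whatever text sits between two noun chunks, which is the sense in which the extraction is open. A tuple seen in more sentences receives a higher score, which mirrors the role of redundancy in the assessor.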
Experiments on a 9 million Web page corpus show that TEXTRUNNER achieves a 33% reduction in error rate compared to KNOWITALL, a state-of-the-art Web IE system, while extracting a broader set of facts. The paper also reports statistics on TEXTRUNNER's 11 million highest probability tuples, demonstrating its scalability and quality. The system can respond to queries over millions of tuples at interactive speeds due to efficient indexing. The key to TEXTRUNNER's scalability is its linear processing time in the number of documents, making it significantly faster and more efficient than traditional IE systems.
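To make the indexing point concrete, here is a minimal sketch of how an inverted index lets queries touch only matching tuples instead of scanning all of them. The summary does not describe the index itself, so the TupleIndex class, its method names, and the in-memory design are all assumptions for illustration.

```python
from collections import defaultdict


class TupleIndex:
    """Hypothetical in-memory inverted index over (arg1, rel, arg2, prob)."""

    def __init__(self):
        self._tuples = []                   # tuple id -> stored tuple
        self._postings = defaultdict(set)   # term -> ids of tuples containing it

    def add(self, arg1, rel, arg2, prob):
        tuple_id = len(self._tuples)
        self._tuples.append((arg1, rel, arg2, prob))
        # Index every lowercased word of every field.
        for field in (arg1, rel, arg2):
            for term in field.lower().split():
                self._postings[term].add(tuple_id)

    def query(self, text):
        """Return tuples containing all query terms, most probable first."""
        terms = text.lower().split()
        if not terms:
            return []
        # Intersect posting sets; a term never seen matches nothing.
        ids = set.intersection(*(self._postings.get(t, set()) for t in terms))
        return sorted((self._tuples[i] for i in ids), key=lambda t: -t[3])


if __name__ == "__main__":
    index = TupleIndex()
    index.add("Edison", "invented", "the phonograph", 0.84)
    index.add("Edison", "was born in", "Ohio", 0.60)
    index.add("Einstein", "was born in", "Ulm", 0.90)
    print(index.query("edison"))
    print(index.query("born in"))
```

Lookup cost here depends on the size of the posting sets for the query terms, not on the total number of stored tuples, which is the general reason an index of this kind can support interactive queries.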