2-7 August 2009 | Mike Mintz, Steven Bills, Rion Snow, Dan Jurafsky
This paper presents a method for relation extraction without labeled data, using distant supervision from a large semantic database, Freebase. The approach combines the advantages of supervised and unsupervised information extraction. For each pair of entities in a Freebase relation, the method finds all sentences containing those entities in a large unlabeled corpus and extracts textual features to train a relation classifier. The algorithm uses logistic regression to combine features and achieves a precision of 67.6% for 10,000 instances of 102 relations. The method also analyzes the performance of different features, showing that syntactic parse features are particularly helpful for ambiguous or lexically distant relations.
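The training-data generation step described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the tiny knowledge base, corpus, and relation names are invented stand-ins for Freebase and the real unlabeled corpus, and entity matching is reduced to substring lookup.

```python
# Illustrative stand-in for Freebase: (entity1, entity2) -> relation name.
KB = {
    ("Barack Obama", "Honolulu"): "/people/person/place_of_birth",
    ("Sofia Coppola", "Lost in Translation"): "/film/director/film",
}

# Illustrative stand-in for the large unlabeled corpus.
corpus = [
    "Barack Obama was born in Honolulu .",
    "Sofia Coppola directed Lost in Translation .",
    "Barack Obama visited Honolulu last week .",  # noisy match: mentions both entities but not the birth relation
]

def label_sentences(kb, sentences):
    """Pair each KB entity pair with every sentence mentioning both entities,
    treating the KB relation as the (distant, possibly noisy) label."""
    training = []
    for (e1, e2), relation in kb.items():
        for sent in sentences:
            if e1 in sent and e2 in sent:
                training.append((e1, e2, relation, sent))
    return training

examples = label_sentences(KB, corpus)
```

Note the third sentence: it mentions both entities without expressing the birth relation, which is exactly the kind of noise the method tolerates by pooling features across many sentences per entity pair.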
The paper discusses three learning paradigms for relation extraction: supervised, unsupervised, and bootstrapping. Supervised methods require labeled data, which is expensive and domain-dependent. Unsupervised methods can use large amounts of data but may not map well to knowledge bases. Bootstrapping methods use a small number of seed instances to iteratively extract more relations, but often suffer from low precision and semantic drift.
The proposed method, distant supervision, uses Freebase to provide supervision for relation extraction. It leverages the assumption that any sentence containing a pair of entities that participate in a known Freebase relation is likely to express that relation in some way. The method uses a logistic regression classifier to combine features and extract relations from large amounts of unlabeled data. It is not limited to a specific domain and can use a corpus of any size. The method also allows for the use of syntactic features, which are particularly helpful for ambiguous or lexically distant relations.
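A hallmark of the method is that each feature is a large conjunction of cues, so individual features are high-precision even if low-recall. The sketch below builds one conjunctive lexical feature (words between the entity pair plus a one-word window on either side) in the spirit of the paper; the exact feature encoding and the entity spans are illustrative assumptions.

```python
def lexical_feature(tokens, e1_span, e2_span):
    """Build one conjunctive lexical feature string for an entity pair.

    e1_span/e2_span are (start, end) token indices, end-exclusive,
    with e1 assumed to precede e2 in the sentence.
    """
    between = tokens[e1_span[1]:e2_span[0]]        # words between the entities
    left = tokens[max(0, e1_span[0] - 1):e1_span[0]]  # one word to the left of e1
    right = tokens[e2_span[1]:e2_span[1] + 1]         # one word to the right of e2
    # Conjoin all parts into a single feature; it only fires when every part matches.
    return "|".join([" ".join(left), "E1", " ".join(between), "E2", " ".join(right)])

tokens = "Barack Obama was born in Honolulu .".split()
feat = lexical_feature(tokens, (0, 2), (5, 6))
print(feat)  # '|E1|was born in|E2|.'
```

Features like this would then be pooled across all sentences for an entity pair and fed to a multiclass logistic regression classifier; syntactic features conjoin dependency-path information in the same all-or-nothing way.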
The paper evaluates the method in two ways: a held-out evaluation against withheld Freebase facts and a human evaluation of the top extractions. The results show that the method achieves high precision for a reasonably large number of relations. The held-out evaluation suggests that the combination of syntactic and lexical features performs better than either feature set on its own. The human evaluation shows that syntactic features are especially helpful for certain relations, such as director-film and writer-film. The method is able to extract a large number of relations from unlabeled data and is not limited to a specific domain. The paper concludes that syntactic features are useful in distantly supervised information extraction, especially in cases where the individual lexical patterns are ambiguous.
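The held-out evaluation style described above amounts to precision-at-k: rank extracted instances by classifier confidence and check the top k against facts withheld from training. A minimal sketch, with invented entity pairs, relation names, and confidence scores standing in for real classifier output:

```python
# Facts withheld from training (illustrative).
held_out = {("A", "B", "r1"), ("C", "D", "r2")}

# Hypothetical classifier output: (entity1, entity2, predicted relation, confidence).
predictions = [
    ("A", "B", "r1", 0.9),
    ("E", "F", "r1", 0.8),  # not in the held-out set, counted as incorrect
    ("C", "D", "r2", 0.7),
]

def precision_at_k(preds, gold, k):
    """Fraction of the k most confident predictions found in the gold set."""
    top = sorted(preds, key=lambda p: -p[3])[:k]
    correct = sum(1 for e1, e2, r, _ in top if (e1, e2, r) in gold)
    return correct / k

p_at_3 = precision_at_k(predictions, held_out, 3)
```

One caveat the paper itself notes: held-out precision underestimates true precision, because a correct extraction absent from the knowledge base is counted as an error, which is why the human evaluation is also reported.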