Learning Classifiers from Only Positive and Unlabeled Data

August 24-27, 2008 | Charles Elkan, Keith Noto
This paper presents a method for learning a binary classifier from positive and unlabeled examples alone, with no labeled negative examples. The key assumption is that the labeled positive examples are selected uniformly at random from the set of all positive examples. Under this assumption, a "nontraditional" classifier trained to separate the labeled examples from the unlabeled ones predicts probabilities that differ from the true conditional probabilities of being positive by only a constant factor.

The main contribution is Lemma 1: if positive training examples are labeled at random, then the conditional probabilities produced by a model trained on the labeled and unlabeled examples differ by only a constant factor from those produced by a model trained on fully labeled positive and negative examples; the short derivation is reproduced below. The authors show two ways to use this result to learn a classifier from such a nontraditional training set: rescale the nontraditional classifier's outputs by an estimate of the constant, or use those outputs to assign per-example weights to the unlabeled data and retrain.

They apply both methods to a real-world problem: identifying protein records for inclusion in a specialized molecular biology database. In their experiments, models trained with the new methods are both faster to train and more accurate than the current state-of-the-art biased SVM method for learning from positive and unlabeled examples. The paper also surveys related work and compares the new methods to existing approaches.
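
To make the constant-factor claim concrete, here is the core calculation behind Lemma 1 as a short LaTeX sketch. Following the paper's setup, y ∈ {0, 1} is the true class, s ∈ {0, 1} indicates whether an example is labeled, and only positive examples can ever be labeled.

```latex
% Lemma 1: under the selected-completely-at-random assumption,
% p(s=1 | x, y=1) = p(s=1 | y=1) = c, a constant independent of x.
\begin{align*}
p(s=1 \mid x) &= p(y=1 \land s=1 \mid x)             && \text{only positives are ever labeled} \\
              &= p(y=1 \mid x)\, p(s=1 \mid y=1, x)  && \text{chain rule} \\
              &= c \, p(y=1 \mid x).                 && \text{labeling is random given } y=1
\end{align*}
```

Rearranging gives p(y = 1 | x) = p(s = 1 | x) / c, so the true-class probability is recovered from the nontraditional classifier by dividing by the single constant c = p(s = 1 | y = 1).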
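
The rescaling approach is straightforward to implement. Below is a minimal sketch in Python using scikit-learn; the synthetic data, the logistic regression model, and the variable names are illustrative assumptions rather than the paper's actual experimental setup. Estimating c as the average nontraditional score over held-out labeled positives corresponds to the paper's estimator e1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X: feature matrix; s: 1 if an example is a labeled positive, 0 if unlabeled.
# (Synthetic placeholders; in practice these come from your PU dataset.)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_y = (X[:, 0] + X[:, 1] > 0).astype(int)                   # hidden true labels
s = np.where((true_y == 1) & (rng.random(1000) < 0.3), 1, 0)   # label 30% of positives at random

X_train, X_val, s_train, s_val = train_test_split(X, s, test_size=0.25, random_state=0)

# Step 1: nontraditional classifier g(x) ~ p(s=1 | x), trained to
# separate the labeled positives from everything else.
g = LogisticRegression().fit(X_train, s_train)

# Step 2: estimate c = p(s=1 | y=1) as the mean of g(x) over
# held-out labeled positives (estimator e1 in the paper).
val_pos = X_val[s_val == 1]
c = g.predict_proba(val_pos)[:, 1].mean()

# Step 3: by Lemma 1, p(y=1 | x) = p(s=1 | x) / c.
def predict_positive_proba(X_new):
    return np.clip(g.predict_proba(X_new)[:, 1] / c, 0.0, 1.0)

print("estimated c:", c)
print("corrected probabilities:", predict_positive_proba(X_val[:5]))
```

The second approach described in the paper instead uses g(x) to weight each unlabeled example as a mixture of a positive and a negative example before retraining; the rescaling version above is the simpler of the two.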