Learning Classifiers from Only Positive and Unlabeled Data

August 24-27, 2008 | Charles Elkan, Keith Noto
This paper presents a method for learning a binary classifier from positive and unlabeled examples alone, with no labeled negative examples. The key assumption is that the labeled positive examples are selected uniformly at random from the set of all positive examples. Under this assumption, a "nontraditional" classifier trained to separate the labeled examples from the unlabeled ones predicts probabilities that differ from the true conditional probabilities of being positive by only a constant factor.

The main contribution is Lemma 1: if positive training examples are labeled at random, then the conditional probabilities produced by a model trained on the labeled and unlabeled examples differ by only a constant factor from those produced by a model trained on fully labeled positive and negative examples; the short derivation is reproduced below. The authors show two ways to use this result to learn a classifier from such a nontraditional training set: rescale the nontraditional classifier's outputs by an estimate of the constant, or use those outputs to assign per-example weights to the unlabeled data and retrain.

They apply both methods to a real-world problem: identifying protein records for inclusion in a specialized molecular biology database. In their experiments, models trained with the new methods are both faster to train and more accurate than the current state-of-the-art biased SVM method for learning from positive and unlabeled examples. The paper also surveys related work and compares the new methods to existing approaches.
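
To make the constant-factor claim concrete, here is the core calculation behind Lemma 1 as a short LaTeX sketch. Following the paper's setup, y ∈ {0, 1} is the true class, s ∈ {0, 1} indicates whether an example is labeled, and only positive examples can ever be labeled.

```latex
% Lemma 1: under the selected-completely-at-random assumption,
% p(s=1 | x, y=1) = p(s=1 | y=1) = c, a constant independent of x.
\begin{align*}
p(s=1 \mid x) &= p(y=1 \land s=1 \mid x)             && \text{only positives are ever labeled} \\
              &= p(y=1 \mid x)\, p(s=1 \mid y=1, x)  && \text{chain rule} \\
              &= c \, p(y=1 \mid x).                 && \text{labeling is random given } y=1
\end{align*}
```

Rearranging gives p(y = 1 | x) = p(s = 1 | x) / c, so the true-class probability is recovered from the nontraditional classifier by dividing by the single constant c = p(s = 1 | y = 1).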
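
The rescaling approach is straightforward to implement. Below is a minimal sketch in Python using scikit-learn; the synthetic data, the logistic regression model, and the variable names are illustrative assumptions rather than the paper's actual experimental setup. Estimating c as the average nontraditional score over held-out labeled positives corresponds to the paper's estimator e1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X: feature matrix; s: 1 if an example is a labeled positive, 0 if unlabeled.
# (Synthetic placeholders; in practice these come from your PU dataset.)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_y = (X[:, 0] + X[:, 1] > 0).astype(int)                   # hidden true labels
s = np.where((true_y == 1) & (rng.random(1000) < 0.3), 1, 0)   # label 30% of positives at random

X_train, X_val, s_train, s_val = train_test_split(X, s, test_size=0.25, random_state=0)

# Step 1: nontraditional classifier g(x) ~ p(s=1 | x), trained to
# separate the labeled positives from everything else.
g = LogisticRegression().fit(X_train, s_train)

# Step 2: estimate c = p(s=1 | y=1) as the mean of g(x) over
# held-out labeled positives (estimator e1 in the paper).
val_pos = X_val[s_val == 1]
c = g.predict_proba(val_pos)[:, 1].mean()

# Step 3: by Lemma 1, p(y=1 | x) = p(s=1 | x) / c.
def predict_positive_proba(X_new):
    return np.clip(g.predict_proba(X_new)[:, 1] / c, 0.0, 1.0)

print("estimated c:", c)
print("corrected probabilities:", predict_positive_proba(X_val[:5]))
```

The second approach described in the paper instead uses g(x) to weight each unlabeled example as a mixture of a positive and a negative example before retraining; the rescaling version above is the simpler of the two.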