Correcting Sample Selection Bias by Unlabeled Data

Jiayuan Huang, Alexander J. Smola, Arthur Gretton, Karsten M. Borgwardt, Bernhard Schölkopf
This paper addresses sample selection bias, the setting in which training and test data are drawn from different distributions. Traditional methods first estimate the sampling distributions and then correct for the bias, an approach that can be inefficient and error-prone. The authors propose a nonparametric method that produces resampling weights directly, without estimating the distributions. The method, called Kernel Mean Matching (KMM), matches the distributions of the training and test sets in feature space: it reweights the training points so that the mean of the reweighted training points and the mean of the test points are close in a reproducing kernel Hilbert space (RKHS).
The optimization problem is a simple quadratic program, and the reweighted sample can be incorporated into a variety of regression and classification algorithms. Experimental results on both synthetic and real-world datasets show that KMM significantly improves learning performance compared with unweighted data, even when the key assumption of the method is not strictly satisfied. The method is particularly effective in scenarios with biased sampling, such as gene expression studies and tumor diagnosis using microarrays.
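To make the reweighting idea concrete, here is a minimal sketch in Python rather than the authors' implementation. It assumes the usual KMM objective, minimizing 0.5 * b^T K b - kappa^T b over weights b with 0 <= b_i <= B, where K is the kernel matrix on the training points and kappa_i = (n_train / n_test) * sum_j k(x_i, x'_j). The Gaussian kernel, the names `gaussian_kernel` and `kmm_weights`, and the parameters `sigma`, `B`, and `steps` are illustrative assumptions. Two simplifications are made on purpose: the constraint that keeps the weights summing to roughly n_train is dropped, and projected gradient descent stands in for a proper quadratic programming solver.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two points given as sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kmm_weights(train, test, sigma=1.0, B=10.0, steps=2000):
    """Approximate KMM weights by projected gradient descent.

    Minimizes 0.5 * b^T K b - kappa^T b subject to 0 <= b_i <= B.
    Simplification (assumption of this sketch): the constraint on
    sum(b) is omitted, and no QP solver is used.
    """
    n_tr, n_te = len(train), len(test)
    # Kernel matrix between all pairs of training points.
    K = [[gaussian_kernel(xi, xj, sigma) for xj in train] for xi in train]
    # kappa_i measures how close training point i is to the test sample.
    kappa = [
        (n_tr / n_te) * sum(gaussian_kernel(xi, xj, sigma) for xj in test)
        for xi in train
    ]
    # K has nonnegative entries, so its largest eigenvalue is at most the
    # largest row sum; 1 / that bound is a safe gradient step size.
    step = 1.0 / max(sum(row) for row in K)
    beta = [1.0] * n_tr  # start from the unweighted sample
    for _ in range(steps):
        grad = [
            sum(K[i][j] * beta[j] for j in range(n_tr)) - kappa[i]
            for i in range(n_tr)
        ]
        # Gradient step followed by projection onto the box [0, B].
        beta = [min(B, max(0.0, b - step * g)) for b, g in zip(beta, grad)]
    return beta


# Toy example: most training points lie near 0, the test points lie near 3.
train = [[0.0], [0.2], [0.4], [0.6], [3.0]]
test = [[2.8], [3.0], [3.2], [3.1]]
weights = kmm_weights(train, test)
```

In this toy setup the single training point near the test cluster should receive the largest weight, while the points near 0 are down-weighted. The resulting weights could then be passed to any learner that accepts per-sample weights.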