Jiayuan Huang, Alexander J. Smola, Arthur Gretton, Karsten M. Borgwardt, Bernhard Schölkopf
This paper presents a nonparametric method for correcting sample selection bias by using unlabeled data. The method, called kernel mean matching (KMM), directly computes resampling weights without estimating the underlying distributions. It works by matching the distributions of training and test data in feature space. The key idea is to reweight the training samples so that the means of the training and test samples in a reproducing kernel Hilbert space (RKHS) are close. This approach avoids the need to estimate biased densities or selection probabilities, and does not assume knowledge of class probabilities. Instead, it accounts for the difference between the training and test distributions by reweighting the training points. The method is applied to various regression and classification benchmarks, as well as to microarray data from prostate and breast cancer patients. Experimental results show that KMM significantly improves learning performance compared to training on unweighted data, and in some cases outperforms reweighting using the true sample bias distribution. The method is shown to be effective even when the key assumption about the relationship between the training and test distributions is not valid. The paper also discusses the convergence of reweighted means in feature space and provides theoretical guarantees for the method. The results demonstrate that KMM is a promising approach for correcting sample selection bias in machine learning.
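The reweighting idea described above reduces to a convex quadratic program: choose weights β that minimize the RKHS distance between the weighted training mean and the test mean, subject to bound and normalization constraints. The following is a minimal sketch of that idea, not the authors' implementation; the Gaussian (RBF) kernel, the bandwidth `gamma`, the bound `B`, the tolerance `eps`, and the use of a general-purpose `scipy` solver in place of a dedicated QP solver are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize


def rbf_kernel(A, C, gamma=1.0):
    """Gaussian kernel k(a, c) = exp(-gamma * ||a - c||^2)."""
    d2 = ((A[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)


def kmm_weights(X_tr, X_te, gamma=1.0, B=10.0, eps=None):
    """Sketch of kernel mean matching: find beta >= 0 minimizing
    || (1/n) sum_i beta_i phi(x_i) - (1/m) sum_j phi(x'_j) ||^2
    in the RKHS, which expands to the QP
    0.5 * beta' K beta - kappa' beta  (up to constants),
    with K over training points and kappa coupling training to test points."""
    n, m = len(X_tr), len(X_te)
    if eps is None:
        eps = B / np.sqrt(n)  # assumed default, shrinking with sample size
    K = rbf_kernel(X_tr, X_tr, gamma)                              # n x n
    kappa = (n / m) * rbf_kernel(X_tr, X_te, gamma).sum(axis=1)    # length n
    obj = lambda b: 0.5 * b @ K @ b - kappa @ b
    grad = lambda b: K @ b - kappa
    # Keep the average weight near 1: n(1 - eps) <= sum(beta) <= n(1 + eps).
    cons = [
        {"type": "ineq", "fun": lambda b: n * (1 + eps) - b.sum()},
        {"type": "ineq", "fun": lambda b: b.sum() - n * (1 - eps)},
    ]
    res = minimize(obj, np.ones(n), jac=grad, method="SLSQP",
                   bounds=[(0.0, B)] * n, constraints=cons)
    return res.x
```

As a sanity check, if the training sample over-represents a region the test sample rarely visits, the solver should assign larger weights to training points lying where the test mass is concentrated, which is exactly the bias correction the abstract describes.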