A Kernel Method for the Two-Sample Problem

04/08 | Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, Alexander Smola
This paper introduces a kernel method for the two-sample problem: statistical tests that determine whether two samples are drawn from different distributions. The test statistic is the maximum mean discrepancy (MMD), defined as the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS). Two tests are based on large deviation bounds for the MMD, while a third is based on the asymptotic distribution of this statistic. The MMD can be computed in quadratic time, although efficient linear-time approximations are available. When the function class is allowed to be more general, the MMD recovers several classical metrics on distributions.
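As a sketch of the definition of the MMD given in words here, it can be written out as follows. The symbols are assumptions taken from standard notation rather than from this summary: p and q are the two distributions, x and x' are independent draws from p, y and y' are independent draws from q, F is the function class, and k is the kernel of the RKHS.

```latex
\[
\mathrm{MMD}[\mathcal{F}, p, q]
  = \sup_{f \in \mathcal{F}}
    \left( \mathbf{E}_{x \sim p}[f(x)] - \mathbf{E}_{y \sim q}[f(y)] \right)
\]
% When F is the unit ball of an RKHS with kernel k, the squared MMD
% reduces to expectations of kernel evaluations:
\[
\mathrm{MMD}^2[\mathcal{F}, p, q]
  = \mathbf{E}_{x, x'}[k(x, x')]
  - 2\,\mathbf{E}_{x, y}[k(x, y)]
  + \mathbf{E}_{y, y'}[k(y, y')]
\]
```

The second form involves only kernel evaluations, which is why the statistic can be estimated from pairwise kernel values between sample points.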
The method is applied to various problems, including attribute matching for databases and comparing distributions over graphs, where it performs strongly. The paper also discusses the theoretical properties of the MMD, including its consistency and asymptotic behavior, and compares it with other approaches to the two-sample problem. The results show that the MMD-based tests are effective at distinguishing distributions, particularly on high-dimensional data with small sample sizes and on graph data.
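To make the quadratic-time and linear-time computations mentioned in this summary concrete, here is a minimal sketch in plain Python. It is illustrative only and is not the authors' code. The function names (`gaussian_kernel`, `mmd2_unbiased`, `mmd2_linear`), the choice of a Gaussian kernel, and the bandwidth `sigma` are all assumptions made for this example. The estimators follow the standard unbiased form of the squared MMD and its pairwise linear-time variant. Turning either statistic into a test still requires a threshold, which this sketch does not compute.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel between two equal-length sequences of floats.

    The kernel choice and the bandwidth sigma are assumptions for this sketch.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2_unbiased(xs, ys, kernel=gaussian_kernel):
    """Quadratic-time unbiased estimate of the squared MMD.

    Averages the kernel over distinct pairs within xs, distinct pairs
    within ys, and all cross pairs, then combines them as xx + yy - 2 * xy.
    The result can be slightly negative when the samples are similar.
    """
    m, n = len(xs), len(ys)
    if m < 2 or n < 2:
        raise ValueError("need at least two points in each sample")
    xx = sum(kernel(xs[i], xs[j]) for i in range(m) for j in range(m) if i != j)
    yy = sum(kernel(ys[i], ys[j]) for i in range(n) for j in range(n) if i != j)
    xy = sum(kernel(x, y) for x in xs for y in ys)
    return xx / (m * (m - 1)) + yy / (n * (n - 1)) - 2.0 * xy / (m * n)


def mmd2_linear(xs, ys, kernel=gaussian_kernel):
    """Linear-time estimate of the squared MMD for equal-size samples.

    Uses disjoint consecutive pairs of points, so the cost grows linearly
    with the sample size at the price of a higher-variance estimate.
    """
    if len(xs) != len(ys):
        raise ValueError("the linear-time estimate assumes equal sample sizes")
    half = len(xs) // 2
    if half == 0:
        raise ValueError("need at least two points in each sample")
    total = 0.0
    for i in range(half):
        x1, x2 = xs[2 * i], xs[2 * i + 1]
        y1, y2 = ys[2 * i], ys[2 * i + 1]
        total += kernel(x1, x2) + kernel(y1, y2) - kernel(x1, y2) - kernel(x2, y1)
    return total / half


if __name__ == "__main__":
    rng = random.Random(0)
    # Two 1-D samples whose means differ by 1 (toy data for illustration).
    sample_p = [[rng.gauss(0.0, 1.0)] for _ in range(200)]
    sample_q = [[rng.gauss(1.0, 1.0)] for _ in range(200)]
    print("quadratic-time estimate:", mmd2_unbiased(sample_p, sample_q))
    print("linear-time estimate:", mmd2_linear(sample_p, sample_q))
```

The quadratic-time version uses every pair of points and so has lower variance. The linear-time version touches each point once, which suits large samples where the full kernel matrix would be too costly to evaluate.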