2001 | Robert Tibshirani, Guenther Walther and Trevor Hastie
The gap statistic is a method for estimating the number of clusters in a dataset. It compares the within-cluster dispersion of the data with the dispersion expected under a reference null distribution, and it can be applied to the output of any clustering algorithm, such as K-means or hierarchical clustering. For each candidate number of clusters k, the gap statistic is the difference between the expected logarithm of the within-cluster dispersion under the reference distribution and the logarithm of the dispersion observed in the data. The estimated number of clusters is the value of k at which the gap statistic is largest, after allowing for the sampling variability of the reference estimate. In the authors' simulations the method generally outperforms other methods for choosing the number of clusters.
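The computation described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the small Lloyd-style k-means, the function names (`kmeans_dispersion`, `gap_statistic`, `choose_k`), the uniform reference over the bounding box of the data, and the default parameter values are all choices made for this example. The selection rule used here, taking the smallest k with Gap(k) >= Gap(k+1) - s(k+1), is one way to allow for simulation error in the reference estimate and is likewise an assumption of the sketch.

```python
import numpy as np


def kmeans_dispersion(X, k, rng, n_init=10, n_iter=100):
    """Best within-cluster sum of squares W_k over several runs of Lloyd's k-means."""
    best = np.inf
    for _run in range(n_init):
        # Initialise centers at k distinct data points chosen at random.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _step in range(n_iter):
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # Recompute cluster means; an empty cluster keeps its previous center.
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        # Sum of squared distances from each point to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, d2.min(axis=1).sum())
    return best


def gap_statistic(X, k_max=6, n_ref=10, seed=0):
    """Return (ks, gap, s) using a uniform reference over the data's bounding box."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ks = np.arange(1, k_max + 1)
    log_w = np.array([np.log(kmeans_dispersion(X, k, rng)) for k in ks])
    ref_log_w = np.empty((n_ref, k_max))
    for b in range(n_ref):
        ref = rng.uniform(lo, hi, size=X.shape)  # one reference dataset
        ref_log_w[b] = [np.log(kmeans_dispersion(ref, k, rng)) for k in ks]
    gap = ref_log_w.mean(axis=0) - log_w
    # Standard deviation of the reference log-dispersions, inflated for simulation error.
    s = ref_log_w.std(axis=0) * np.sqrt(1.0 + 1.0 / n_ref)
    return ks, gap, s


def choose_k(ks, gap, s):
    """Smallest k with gap[k] >= gap[k+1] - s[k+1]; falls back to the largest k tried."""
    for i in range(len(ks) - 1):
        if gap[i] >= gap[i + 1] - s[i + 1]:
            return int(ks[i])
    return int(ks[-1])


# Illustrative data (an assumption of this example): three well-separated 2-D blobs.
rng = np.random.default_rng(1)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((50, 2)) for c in blob_centers])
ks, gap, s = gap_statistic(X, k_max=5, n_ref=10, seed=0)
print(choose_k(ks, gap, s))
```

For squared Euclidean distance, the pooled within-cluster sum of squares around the cluster means is the dispersion measure being compared, so the k-means objective can be reused directly. In practice a library clustering routine would replace the hand-written k-means; it is written out here only to keep the example self-contained.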
The gap statistic is most effective when clusters are well separated; it may perform less well when clusters overlap or when the data lie near a lower-dimensional subspace. The reference distribution is typically uniform over the range of the observed features, or uniform over a box aligned with the principal components of the data, which is better suited to data concentrated near a subspace. The authors apply the gap statistic to several datasets, including DNA microarray data, where it is reported to identify the number of clusters successfully. The method is described as robust and provides a statistical basis for choosing the number of clusters in unsupervised learning.
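A reference sample aligned with the principal components can be drawn roughly as follows. This is a hedged sketch rather than the authors' code: the function name `pca_aligned_reference` is hypothetical, and adding the column means back at the end is a convenience chosen for this example (the dispersion itself does not depend on location). The idea is to rotate the centered data onto its principal axes, sample uniformly over the range of each rotated coordinate, and rotate the sample back.

```python
import numpy as np


def pca_aligned_reference(X, rng):
    """Draw one reference dataset, uniform over a box aligned with X's principal axes."""
    mean = X.mean(axis=0)
    Xc = X - mean                        # center the columns
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    proj = Xc @ vt.T                     # coordinates in the principal-axis frame
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    z = rng.uniform(lo, hi, size=proj.shape)  # uniform over the aligned box
    return z @ vt + mean                 # rotate back to the original frame


# Illustrative data (an assumption of this example): points close to a line in 2-D.
rng = np.random.default_rng(0)
t = rng.uniform(-5.0, 5.0, size=200)
X = np.column_stack([t, t]) + 0.1 * rng.standard_normal((200, 2))
reference = pca_aligned_reference(X, rng)
print(reference.shape)
```

Compared with sampling over the axis-aligned bounding box, this keeps the reference points close to the subspace occupied by the data, which is the situation in which the simpler uniform reference is said to struggle.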