Understanding "Estimating the number of clusters in a data set via the gap statistic"

The paper introduces the "gap statistic" as a method for estimating the number of clusters in a data set. The technique takes the output of any clustering algorithm (e.g., K-means or hierarchical) and compares the change in within-cluster dispersion with that expected under an appropriate reference null distribution. The authors develop some theory and show through simulations that the gap statistic generally outperforms other methods for estimating the number of clusters. The method is designed to work with a range of clustering methods and distance measures. The paper also discusses the choice of reference distribution and gives a computational implementation of the gap statistic, including two methods for generating reference data. The simulations show the method to be effective at identifying well-separated clusters. The paper concludes with directions for further research, such as handling overlapping clusters and using adaptive versions of clustering algorithms.
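The comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`within_dispersion`, `kmeans_labels`, `gap_statistic`, `choose_k`) are hypothetical, a plain Lloyd's K-means with random restarts stands in for "any clustering algorithm", dispersion is measured with squared Euclidean distance, and reference data are drawn uniformly over the bounding box of the observed features (one of the simpler ways to generate reference data). The selection rule shown, the smallest k with Gap(k) >= Gap(k+1) - s_{k+1}, is the one usually quoted from the paper.

```python
import numpy as np


def within_dispersion(X, labels):
    """W_k: pooled within-cluster sum of squared distances to cluster means."""
    return sum(
        ((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
        for j in np.unique(labels)
    )


def kmeans_labels(X, k, rng, n_init=5, n_iter=50):
    """Plain Lloyd's K-means (a stand-in clusterer); keeps the best restart."""
    best_labels, best_w = None, np.inf
    for _ in range(n_init):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Squared distance from every point to every current center.
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        w = within_dispersion(X, labels)
        if w < best_w:
            best_labels, best_w = labels, w
    return best_labels


def gap_statistic(X, k_max=6, n_refs=10, seed=0):
    """Return (gaps, s) for k = 1..k_max, stored at indices 0..k_max-1."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Assumption: reference data are uniform over the feature bounding box.
    refs = [rng.uniform(lo, hi, size=X.shape) for _ in range(n_refs)]
    gaps, sks = [], []
    for k in range(1, k_max + 1):
        log_w = np.log(within_dispersion(X, kmeans_labels(X, k, rng)))
        ref_log_w = np.array([
            np.log(within_dispersion(R, kmeans_labels(R, k, rng))) for R in refs
        ])
        # Gap(k) = mean reference log-dispersion minus observed log-dispersion.
        gaps.append(ref_log_w.mean() - log_w)
        # Standard deviation inflated to account for simulation error.
        sks.append(ref_log_w.std() * np.sqrt(1 + 1 / n_refs))
    return np.array(gaps), np.array(sks)


def choose_k(gaps, sks):
    """Smallest k such that Gap(k) >= Gap(k+1) - s_{k+1}; else the largest k tried."""
    for i in range(len(gaps) - 1):
        if gaps[i] >= gaps[i + 1] - sks[i + 1]:
            return i + 1
    return len(gaps)
```

A typical call would be `gaps, sks = gap_statistic(X, k_max=8, n_refs=20)` followed by `choose_k(gaps, sks)`, where `X` is an `(n_samples, n_features)` float array. Swapping in a different clusterer only requires replacing `kmeans_labels` with any function that returns one integer label per row.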

Estimating the number of clusters in a data set via the gap statistic

[Received February 2000. Final revision November 2000] | Robert Tibshirani, Guenther Walther and Trevor Hastie