Supervised and Unsupervised Discretization of Continuous Features

1995 | James Dougherty, Ron Kohavi, Mehran Sahami
This paper reviews and empirically evaluates methods for discretizing continuous features in supervised and unsupervised learning. The authors compare binning (an unsupervised method) with entropy-based and purity-based methods (supervised). They find that discretizing features with an entropy-based method significantly improves the performance of the Naive-Bayes algorithm, which then outperforms C4.5 on average across 16 datasets. They also show that discretizing features before applying C4.5 does not degrade its performance and can even improve it in some cases.

The paper examines three main discretization methods: equal width interval binning, Holte's 1R discretizer, and recursive minimal entropy partitioning (minimal sketches of each appear below). Equal width binning, the simplest method, divides the range of observed values into a fixed number of equal-sized bins. Holte's 1R discretizer aims to create bins with a pure class distribution, subject to a minimum bin size. Recursive minimal entropy partitioning selects cut points that minimize the class entropy of the resulting partitions, recursing until a stopping criterion halts the process.

The authors evaluate these methods on 16 datasets from the UCI repository and find that entropy-based discretization significantly improves the accuracy of both Naive-Bayes and C4.5. Because the entropy-based method is global, discretizing each feature once before induction, it does not suffer from the data fragmentation that local, per-node discretization can cause. The authors conclude that entropy-based discretization is the best of the methods tested, giving the strongest performance for both Naive-Bayes and C4.5. They also note that supervised methods generally outperform unsupervised ones, although even simple binning can significantly improve the performance of the Naive-Bayes classifier.

The paper highlights the importance of discretization in improving the accuracy of induction algorithms and suggests further research into dynamic discretization methods.
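Equal width binning is simple enough to state precisely. The sketch below is a minimal Python/NumPy illustration; the function name `equal_width_bins` and the default of ten bins are our choices (ten bins is one of the settings the paper evaluates), not code from the paper.

```python
import numpy as np

def equal_width_bins(values, k=10):
    """Unsupervised discretization: split [min, max] into k
    equal-width intervals and map each value to a bin index."""
    lo, hi = values.min(), values.max()
    if lo == hi:                       # degenerate feature: a single bin
        return np.zeros(len(values), dtype=int)
    width = (hi - lo) / k
    # clip so the maximum value falls in the last bin (index k - 1)
    return np.clip(((values - lo) / width).astype(int), 0, k - 1)

feature = np.array([4.9, 5.1, 6.3, 7.0, 5.8])
print(equal_width_bins(feature, k=3))  # -> [0 0 2 2 1]
```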
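Holte's 1R discretizer is described only at a high level above; the following is a simplified sketch of the idea, assuming a minimum bin size of six (the SMALL parameter commonly cited for 1R). It omits 1R's refinement of never placing a cut between identical feature values, and the helper name `one_r_cuts` is ours, not the paper's.

```python
from collections import Counter

def one_r_cuts(values, labels, min_size=6):
    """Greedy sweep over the sorted feature: grow a bin until it holds
    at least `min_size` examples of its majority class, then close the
    bin at the next class change.  Returns the cut points."""
    pairs = sorted(zip(values, labels))
    cuts, counts = [], Counter()
    for i, (v, y) in enumerate(pairs):
        counts[y] += 1
        majority, m = counts.most_common(1)[0]
        # close the bin just before an example of a different class,
        # but only once the majority class has reached min_size
        if m >= min_size and i + 1 < len(pairs) and pairs[i + 1][1] != majority:
            cuts.append((v + pairs[i + 1][0]) / 2)  # midpoint cut point
            counts = Counter()
    return cuts
```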
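Recursive minimal entropy partitioning can also be sketched compactly. The code below is our illustrative implementation, not the authors' code: it picks the boundary that minimizes the weighted class entropy of the two sides and recurses on each side, stopping when Fayyad and Irani's MDL criterion (the stopping rule the paper adopts) says a split does not pay for its description cost.

```python
import numpy as np

def entropy(y):
    """Class entropy in bits."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mdl_cuts(x, y):
    """Recursively split on the boundary that minimizes the weighted
    class entropy of the two sides; stop when the MDL criterion says
    the split does not pay for its description cost."""
    x, y = np.asarray(x), np.asarray(y)
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(y)
    best = None
    for i in range(1, n):              # candidate boundaries between
        if x[i] == x[i - 1]:           # adjacent *distinct* values
            continue
        e = (i * entropy(y[:i]) + (n - i) * entropy(y[i:])) / n
        if best is None or e < best[0]:
            best = (e, i)
    if best is None:                   # constant feature: nothing to split
        return []
    e_split, i = best
    gain = entropy(y) - e_split
    k, k1, k2 = (len(np.unique(v)) for v in (y, y[:i], y[i:]))
    delta = np.log2(3.0**k - 2) - (k * entropy(y)
                                   - k1 * entropy(y[:i]) - k2 * entropy(y[i:]))
    if gain <= (np.log2(n - 1) + delta) / n:   # MDL stopping test
        return []
    cut = (x[i - 1] + x[i]) / 2
    return mdl_cuts(x[:i], y[:i]) + [cut] + mdl_cuts(x[i:], y[i:])

# e.g. mdl_cuts([1., 2., 3., 10., 11.], [0, 0, 0, 1, 1]) -> [6.5]
```

Because the cut points are computed once per feature over the whole training set, this is a global method in the paper's terminology, in contrast to the local, per-node discretization performed inside a decision-tree learner such as C4.5.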