Andrew Estabrooks, Taeho Jo, and Nathalie Japkowicz
This paper presents an experimental study on the effectiveness of combining different resampling methods to address the class imbalance problem in data mining. The study focuses on the decision tree induction system C4.5 and evaluates oversampling, undersampling, and combinations of the two at various resampling rates. The research concludes that combining multiple resampling approaches is an effective answer to the problem of tuning the resampling rate on class-imbalanced datasets.
The paper first discusses the challenges posed by class imbalance in various domains, including artificial and real-world data sets. It then presents an experimental study comparing oversampling and undersampling methods on different data sets, including the Reuters-21578 text collection. The results show that oversampling and undersampling can have varying effectiveness depending on the class imbalance ratio and the specific domain.
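For readers unfamiliar with the two basic strategies the paper compares, the sketch below shows plain random oversampling and undersampling of a binary data set. It is an illustrative implementation only, not the authors' exact procedure; the function names, the NumPy-array inputs `X` and `y`, and the `minority_label` argument are assumptions made for the example.

```python
import numpy as np

def random_oversample(X, y, minority_label, rng=None):
    """Duplicate minority-class examples at random until the classes are balanced."""
    rng = np.random.default_rng(rng)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    extra = rng.choice(minority_idx, size=len(majority_idx) - len(minority_idx), replace=True)
    keep = np.concatenate([majority_idx, minority_idx, extra])
    return X[keep], y[keep]

def random_undersample(X, y, minority_label, rng=None):
    """Discard majority-class examples at random until the classes are balanced."""
    rng = np.random.default_rng(rng)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    keep = np.concatenate([kept_majority, minority_idx])
    return X[keep], y[keep]
```

Oversampling duplicates minority examples, while undersampling discards majority examples and so loses information; this trade-off is one reason their relative effectiveness differs across imbalance ratios and domains.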
The study further explores the impact of different resampling rates on classification performance. It finds that resampling to full balance is not always optimal and that the best resampling rate varies across domains. The paper then proposes a combination scheme that integrates multiple resampling methods to improve classification performance. This combination scheme is tested on artificial and real-world data sets and is shown to outperform plain oversampling and undersampling as well as other combination methods such as AdaBoost.
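To make the notion of a resampling rate concrete, here is a hedged sketch of oversampling toward a target ratio rather than toward full balance. The helper name `oversample_to_rate` and its `rate` parameter are hypothetical; the point, following the paper, is simply that the degree of rebalancing is a tunable quantity and that fixing it at 1.0 is not always best.

```python
import numpy as np

def oversample_to_rate(X, y, minority_label, rate, rng=None):
    """Oversample the minority class until it reaches `rate` times the majority size
    (rate = 1.0 means full balance; smaller values leave a residual imbalance)."""
    rng = np.random.default_rng(rng)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    target = int(rate * len(majority_idx))
    n_extra = max(target - len(minority_idx), 0)
    extra = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([majority_idx, minority_idx, extra])
    return X[keep], y[keep]
```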
The combination scheme is based on a hierarchical structure that combines the results of multiple resampling methods. It includes an elimination process to ensure that only reliable classifiers are used in the decision-making process. The scheme is tested on the Reuters-21578 text classification domain and is shown to perform well in terms of both error rates and ROC curves.
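The sketch below conveys the general flavor of such a scheme: train one tree per resampling rate, eliminate classifiers that look unreliable, and combine the survivors. It is a simplified stand-in, not the paper's architecture, which organizes classifiers into separate oversampling and undersampling experts with its own elimination and combination rules. Here scikit-learn's `DecisionTreeClassifier` replaces C4.5, the elimination test (dropping trees that predict a single class on the training data) is only one plausible criterion, and an unweighted majority vote replaces the hierarchical output rule. The code reuses the `oversample_to_rate` helper sketched above and assumes integer class labels.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_resampled_ensemble(X, y, minority_label, rates, resample_fn, rng=None):
    """Train one decision tree per resampling rate and keep only the trees that
    do not collapse to a single-class predictor on the training data."""
    classifiers = []
    for rate in rates:
        Xr, yr = resample_fn(X, y, minority_label, rate, rng=rng)
        clf = DecisionTreeClassifier(random_state=0).fit(Xr, yr)
        if len(np.unique(clf.predict(X))) > 1:  # illustrative elimination test
            classifiers.append(clf)
    return classifiers

def predict_majority(classifiers, X):
    """Combine the surviving classifiers with an unweighted majority vote."""
    votes = np.stack([clf.predict(X) for clf in classifiers]).astype(int)
    # Per-column mode of the vote matrix (rows = classifiers, columns = samples).
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```

A typical call would be `predict_majority(train_resampled_ensemble(X, y, 1, [0.25, 0.5, 0.75, 1.0], oversample_to_rate), X_test)`, with the rate grid chosen per domain.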
The paper also discusses related work and concludes that the proposed combination method is effective in improving classification performance on imbalanced datasets. Future work includes further research into the components of the combination method and its application to other domains with large class imbalances.