Class overlap handling methods in imbalanced domain: A comprehensive survey

11 January 2024 | Anil Kumar · Dinesh Singh · Rama Shankar Yadav
Class overlap in imbalanced datasets is a significant challenge in machine learning (ML), deep learning (DL), and big data (BD) applications, because both overlap and imbalance degrade classification performance. Existing solutions fall into data-level, algorithm-level, ensemble, and hybrid methods. Data-level methods alter the class distribution, which can cause information loss (when instances are removed) or overfitting (when instances are replicated). Algorithm-level methods adjust the model itself to prioritize minority class instances, but they are less user-friendly. This survey reviews state-of-the-art methods for handling class overlap in imbalanced datasets, discussing their advantages, disadvantages, limitations, and performance metrics, and it analyzes recent research to highlight gaps and future directions for ML, DL, and BD applications.

Class overlap occurs when instances of multiple classes share similar feature values in the data space, which complicates classification. Imbalance and overlap each degrade performance, and the impact is more severe when they occur together. Existing algorithms struggle in overlapped regions because minority instances are poorly visible there, which leads to biased decision boundaries and high misclassification rates. This is particularly problematic in fields such as medical science, anomaly detection, and financial analysis.

Solutions in the literature are categorized by whether they target class distribution or class overlap. Instance distribution-based methods use resampling (undersampling, oversampling, or hybrid sampling), but suffer from information loss and overfitting. Class overlap-based methods work in two phases: identifying the overlapped region, then handling it. Handling approaches include discarding, merging, or separating the overlapped region. Discarding ignores the overlapped region, merging treats it as a new class, and separating uses two models for learning and testing.

Imbalance and overlap are critical issues in ML, DL, and BD. DL and BD are rapidly growing fields, with convolutional neural networks (CNNs) becoming popular for time series analysis.
This survey summarizes recent methods in DL and ML across small, medium, and BD domains, highlighting research gaps and future directions. The survey structure is shown in Fig. 1.
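To make the resampling trade-off mentioned in the summary concrete, here is a minimal sketch of random undersampling and random oversampling in plain Python. It is an illustration only, not a method taken from the survey: the function names, the toy data, and the choice of purely random selection are all assumptions.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Drop instances at random until every class matches the smallest class."""
    rng = random.Random(seed)
    target = min(Counter(y).values())
    keep = []
    for label in set(y):
        idx = [i for i, lab in enumerate(y) if lab == label]
        keep.extend(rng.sample(idx, target))  # discarded instances are lost
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


def random_oversample(X, y, seed=0):
    """Duplicate instances at random until every class matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        extra = rng.choices(idx, k=target - n)  # exact copies, no new information
        X_out.extend(X[i] for i in extra)
        y_out.extend(y[i] for i in extra)
    return X_out, y_out


# Hypothetical toy data: six majority (0) and two minority (1) instances.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1), (5, 5), (5, 6)]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_under, y_under = random_undersample(X, y)
X_over, y_over = random_oversample(X, y)
print(Counter(y_under), Counter(y_over))
```

Undersampling balances the classes by throwing away four of the six majority instances, which is the information loss the summary refers to. Oversampling balances them with exact copies of the two minority instances, so a classifier can memorize those points, which is the overfitting risk.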
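The two-phase idea behind class overlap-based methods (identify the overlapped region, then discard, merge, or separate it) can be sketched as follows. The identification rule used here, flagging an instance when any of its k nearest neighbours belongs to another class, is an assumption chosen for brevity and is not necessarily the rule used by the methods the survey covers. All function names and the toy data are hypothetical.

```python
from math import dist


def overlap_mask(X, y, k=2):
    """Phase 1: flag instances whose k nearest neighbours include another class."""
    mask = []
    for i, xi in enumerate(X):
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: dist(xi, X[j]))[:k]
        mask.append(any(y[j] != y[i] for j in nearest))
    return mask


def discard(X, y, mask):
    """Phase 2a: ignore the overlapped region by dropping flagged instances."""
    kept = [(x, lab) for x, lab, m in zip(X, y, mask) if not m]
    return [x for x, _ in kept], [lab for _, lab in kept]


def merge(y, mask, overlap_label="overlap"):
    """Phase 2b: treat the overlapped region as a new class."""
    return [overlap_label if m else lab for lab, m in zip(y, mask)]


def separate(X, y, mask):
    """Phase 2c: split the data so two models can be trained, one per region."""
    clean = [(x, lab) for x, lab, m in zip(X, y, mask) if not m]
    overlapped = [(x, lab) for x, lab, m in zip(X, y, mask) if m]
    return clean, overlapped


# Hypothetical toy data: one majority and one minority point sit close together.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (2.2, 2.2), (5, 5), (5, 6), (6, 5)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1]

mask = overlap_mask(X, y, k=2)
X_kept, y_kept = discard(X, y, mask)
y_merged = merge(y, mask)
clean, overlapped = separate(X, y, mask)
print(mask)
print(y_merged)
```

In this sketch only the points at (2, 2) and (2.2, 2.2) are flagged, since each has a neighbour from the other class. The brute-force neighbour search is quadratic in the number of instances, so a real implementation would likely use a spatial index or a library nearest-neighbour routine.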