Penerapan Metode SMOTE Untuk Mengatasi Imbalanced Data Pada Klasifikasi Ujaran Kebencian (Application of the SMOTE Method to Address Imbalanced Data in Hate Speech Classification)

January 2024 | Ridwan, Eni Heni Hermaliani, Muji Ernawati
This study investigates the application of the Synthetic Minority Oversampling Technique (SMOTE) to address imbalanced data in hate speech classification on Twitter. The research compares the performance of four SMOTE variants (SMOTE, SVM-SMOTE, KMeans-SMOTE, and Borderline-SMOTE) in handling imbalanced data, and evaluates the effectiveness of four machine learning algorithms (Random Forest, Support Vector Machine, Logistic Regression, and Naive Bayes) for hate speech detection. The dataset consists of 10,535 tweets: 4,449 labeled hate speech and 6,086 non-hate speech. Preprocessing comprises cleaning, case folding, text normalization, stemming, and stopword removal. Features are extracted with Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). The results show that Borderline-SMOTE outperforms the other SMOTE variants, achieving the highest accuracy (84.09%), recall (85.25%), precision (84.55%), and F1-score (81.16%). Among the classifiers, Random Forest performs best, with 84.58% accuracy, 77.96% recall, 84.33% precision, and an 81.02% F1-score. The study concludes that Borderline-SMOTE is the most effective oversampling method for imbalanced hate speech data and that Random Forest is the best classifier for this task; combining the two can significantly improve the performance of hate speech detection systems. Future research could explore GloVe embeddings for feature extraction and data augmentation techniques to further enhance model accuracy.
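
The preprocessing pipeline named in the abstract (cleaning, case folding, text normalization, stemming, stopword removal) can be sketched as below. The paper does not publish its code, so the regexes, slang map, and stopword list are illustrative assumptions; a real pipeline for Indonesian tweets would typically plug in a dedicated stemmer such as Sastrawi at the marked step.

```python
import re

SLANG_MAP = {"gk": "tidak", "yg": "yang"}       # hypothetical normalization table
STOPWORDS = {"dan", "di", "ke", "yang", "itu"}  # tiny illustrative stopword set

def preprocess(tweet: str) -> list[str]:
    text = re.sub(r"http\S+|@\w+|#\w+", " ", tweet)  # cleaning: URLs, mentions, hashtags
    text = re.sub(r"[^a-zA-Z\s]", " ", text)         # keep letters only
    text = text.lower()                              # case folding
    tokens = text.split()
    tokens = [SLANG_MAP.get(t, t) for t in tokens]   # text normalization
    # stemming omitted here; an Indonesian stemmer (e.g. Sastrawi) would go here
    return [t for t in tokens if t not in STOPWORDS] # stopword removal

print(preprocess("RT @user: Gk suka sama yg begini!! http://t.co/x"))
```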
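The two feature extraction schemes, BoW and TF-IDF, map directly onto scikit-learn's vectorizers; the two-document corpus below is a placeholder standing in for the preprocessed tweets.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["placeholder tweet satu", "placeholder tweet dua"]  # stands in for the 10,535 tweets
bow = CountVectorizer().fit_transform(corpus)    # Bag of Words: raw term counts
tfidf = TfidfVectorizer().fit_transform(corpus)  # TF-IDF: counts reweighted by inverse document frequency
print(bow.shape, tfidf.shape)
```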
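The comparison of the four oversampling variants against a Random Forest classifier can be reproduced in outline with imbalanced-learn and scikit-learn. This is a minimal sketch under stated assumptions: a synthetic dataset with the paper's class ratio stands in for the TF-IDF matrix, and all hyperparameters are library defaults, so the scores will not match the reported ones.

```python
from imblearn.over_sampling import SMOTE, SVMSMOTE, KMeansSMOTE, BorderlineSMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the TF-IDF matrix: 10,535 samples with the paper's
# class ratio (4,449 hate speech vs. 6,086 non-hate speech).
X, y = make_classification(n_samples=10_535, weights=[0.578, 0.422],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

samplers = [SMOTE(random_state=42), SVMSMOTE(random_state=42),
            KMeansSMOTE(random_state=42), BorderlineSMOTE(random_state=42)]
for sampler in samplers:
    try:
        # Oversample the training split only, never the test split.
        X_res, y_res = sampler.fit_resample(X_tr, y_tr)
    except RuntimeError as err:  # KMeansSMOTE can fail to find usable clusters
        print(type(sampler).__name__, "failed:", err)
        continue
    clf = RandomForestClassifier(random_state=42).fit(X_res, y_res)
    print(type(sampler).__name__)
    print(classification_report(y_te, clf.predict(X_te)))
```

Note the design choice made explicit in the loop: resampling is applied after the train/test split, so synthetic minority samples never leak into evaluation.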