Application of the SMOTE Method to Address Imbalanced Data in Hate Speech Classification


Vol. 4 No. 1, January 2024 | Ridwan¹, Eni Heni Herbaliani²*, Muji Ernawati³
This paper addresses hate speech classification, using the Synthetic Minority Oversampling Technique (SMOTE) to handle imbalanced data. Hate speech, which can lead to discrimination, violence, and social conflict, is prevalent on social media platforms such as Twitter. The study uses a dataset of tweets collected from various sources and annotated by 30 annotators with diverse backgrounds. The dataset is preprocessed to clean, normalize, and tokenize the text, and features are extracted using Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). SMOTE variants, including Borderline-SMOTE, SVM-SMOTE, and KMeans-SMOTE, are applied to oversample the minority class. Classification algorithms, namely Random Forest, Support Vector Machine (SVM), Logistic Regression, and Naive Bayes, are then used to classify the tweets, and model performance is evaluated using accuracy, precision, recall, and F1-score. The results show that Borderline-SMOTE outperforms the other SMOTE variants in accuracy (84.09%), recall (85.25%), precision (84.55%), and F1-score (81.16%), and that the Random Forest algorithm consistently performs better than the other classifiers. The study concludes that Borderline-SMOTE is the most effective method for handling imbalanced data in hate speech classification and that Random Forest is the best-performing machine learning algorithm. Future work could incorporate GloVe embeddings and data augmentation to further improve model accuracy.
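The TF-IDF feature extraction mentioned in the abstract can be sketched in plain Python. This is a minimal illustration of the weighting scheme only, not the paper's implementation; real pipelines would typically use a library vectorizer (e.g. scikit-learn's `TfidfVectorizer`, which adds smoothing), and the function name `tf_idf` and the sample documents here are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF vectors for a list of tokenised documents.

    Minimal sketch: tf = term count / document length,
    idf = log(N / document frequency), no smoothing.
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return vectors
```

Note that a term appearing in every document gets an IDF of log(1) = 0, so it contributes nothing to the feature vector; this is the property that lets TF-IDF down-weight uninformative common words relative to plain BoW counts.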
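The core idea behind SMOTE-style oversampling, interpolating new minority-class points between existing neighbours, can be sketched as follows. This is an illustrative toy in plain Python under assumed inputs (feature tuples for the minority class only), not the paper's code; in practice one would use the `imbalanced-learn` library, which also provides the Borderline-SMOTE, SVM-SMOTE, and KMeans-SMOTE variants compared in the study.

```python
import math
import random

def smote(minority, n_synthetic, k=3, seed=42):
    """Generate synthetic minority samples by interpolating between
    each sampled point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest neighbours of `base` within the minority class.
        neighbours = sorted(
            (s for s in minority if s is not base),
            key=lambda s: dist(base, s),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic
```

Because each synthetic point lies on the segment between two real minority samples, the method enlarges the minority class without simply duplicating rows; the borderline variant restricts `base` to samples near the class boundary, which is where misclassification is most likely.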