February 14, 2024 | Noura A. Semary, Wesam Ahmed, Khalid Amin, Paweł Plawiak, Mohamed Hammad
This study investigates the effectiveness of various feature extraction techniques in enhancing the performance of machine learning-based sentiment analysis. The research focuses on selecting the most suitable feature extraction method to improve the accuracy and efficiency of sentiment classification tasks. The study evaluates six feature extraction methods: Bag-of-words (BOW), Term Frequency-Inverse Document Frequency (TF-IDF), n-grams, Hashing Vectorizer (HV), Global Vector for Word Representation (GloVe), and Word2Vec. These methods are applied to two datasets: Twitter US Airlines and Amazon Musical Instrument Reviews. The results show that the TF-IDF technique achieves the highest accuracy, with 99% on the Amazon reviews dataset and 96% on the Twitter US airlines dataset. The study also addresses the issue of class imbalance in the datasets and applies the SMOTE technique to balance the data. The random forest classifier is used for classification, and the performance is evaluated using metrics such as accuracy, precision, recall, and F1-measure. The findings indicate that TF-IDF provides a good balance between performance and computational efficiency, making it a suitable choice for sentiment analysis tasks. The study highlights the importance of feature extraction in sentiment analysis and provides practical insights for improving model performance and guiding future research.
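To make the TF-IDF weighting mentioned above concrete, the following is a minimal pure-Python sketch rather than the authors' implementation. The function name `tfidf`, the whitespace tokenizer, the smoothed IDF formula, the L2 normalization, and the three toy documents are all assumptions chosen for illustration; the abstract does not specify which TF-IDF variant or preprocessing was used.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights.

    Assumptions (not taken from the paper): lowercase whitespace
    tokenization and the smoothed IDF ln((1 + n) / (1 + df)) + 1.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))

    # Rarer terms receive a larger inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)  # raw term frequency within this document
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocab, rows


# Hypothetical toy reviews, not drawn from either dataset in the study.
docs = ["great flight great crew", "terrible flight", "great guitar"]
vocab, rows = tfidf(docs)
```

In this sketch, a term that appears in fewer documents (such as "terrible") is weighted more heavily within its document than a term shared across documents (such as "flight"). The resulting rows could then be passed to a classifier; the study reports using a random forest after balancing classes with SMOTE.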