22 March 2024 | Muhamet Kastrati¹ · Zenun Kastrati² · Ali Shariq Imran³ · Marenglen Biba¹
This study presents a large-scale dataset of 17.5 million tweets labeled with Ekman's six basic emotions using distant supervision based on emojis. The dataset was created to address the challenge of limited labeled data in sentiment and emotion classification tasks on short texts, such as Twitter posts. The researchers used a combination of conventional machine learning models and deep learning approaches, including transformer-based models, to evaluate the performance of different classifiers on the dataset. The results showed that the BiLSTM model with FastText and an attention mechanism achieved the highest performance, with an F1-score of 70.92% for sentiment classification and 54.85% for emotion detection.
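To make the idea of emoji-based distant supervision concrete, here is a minimal sketch, not the authors' pipeline. The emoji-to-emotion lexicon `EMOJI_TO_EMOTION` and the helper `distant_label` are hypothetical. The rule of discarding tweets whose emojis point to more than one emotion is also an assumption made for illustration.

```python
from typing import Optional, Tuple

# Hypothetical lexicon: one example emoji per Ekman emotion.
# The mapping actually used in the study is not given here.
EMOJI_TO_EMOTION = {
    "😠": "anger",
    "🤢": "disgust",
    "😱": "fear",
    "😂": "joy",
    "😢": "sadness",
    "😲": "surprise",
}


def distant_label(tweet: str) -> Optional[Tuple[str, str]]:
    """Return (text without label emojis, emotion), or None if no usable label.

    Assumption: a tweet is kept only when all of its lexicon emojis
    agree on a single emotion.
    """
    found = {EMOJI_TO_EMOTION[ch] for ch in tweet if ch in EMOJI_TO_EMOTION}
    if len(found) != 1:
        return None  # no emoji from the lexicon, or conflicting emotions
    # Strip the label-bearing emojis so a classifier cannot read the label
    # directly from the input text.
    text = "".join(ch for ch in tweet if ch not in EMOJI_TO_EMOTION)
    return " ".join(text.split()), found.pop()
```

Removing the emojis from the text after labeling matters in a setup like this; otherwise the model could learn to read the label straight from the input.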
The study also investigated the impact of various factors on the performance of classifiers, including the size of the training data, class imbalance, and the use of pre-trained word embeddings. The results indicated that increasing the size of the training data and using pre-trained word embeddings such as GloVe, GloVe Twitter, and FastText significantly improved the performance of the classifiers. Additionally, the study found that class imbalance negatively affected the performance of the classifiers, and the dataset was intentionally balanced to mitigate this issue.
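One common way to balance a labeled corpus is random undersampling, where every class is cut down to the size of the smallest one. The summary does not say which balancing procedure the authors used, so the sketch below is only an illustration under that assumption; the function name `undersample` and the `(text, label)` tuple format are hypothetical.

```python
import random
from collections import defaultdict


def undersample(examples, seed=0):
    """Balance (text, label) pairs by random undersampling.

    Every class is reduced to the size of the rarest class.
    This is an assumed strategy, not necessarily the study's.
    """
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    smallest = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible

    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, smallest))
    rng.shuffle(balanced)  # avoid blocks of identical labels
    return balanced
```

Undersampling discards data from the larger classes, which is usually acceptable when the corpus is very large, as it is here.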
The researchers also explored the effects of different numbers of emotion classes on the performance of the classifiers. The results showed that reducing the number of emotion classes from six to two improved the accuracy of the classifiers. The study concluded that the BiLSTM model with FastText and an attention mechanism is the most effective for both sentiment polarity and emotion classification tasks on the dataset. The findings of this study contribute to the field of sentiment and emotion analysis by providing a large-scale, well-labeled dataset and demonstrating the effectiveness of deep learning models in handling the challenges of short text classification.
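Reducing six emotion classes to two can be pictured as relabeling each example with a sentiment polarity. The summary does not state how the authors grouped the emotions, so the mapping below is an assumption made purely for illustration: joy is treated as positive, while anger, disgust, fear, and sadness are treated as negative. Surprise is dropped because its polarity is ambiguous. The names `EMOTION_TO_POLARITY` and `collapse` are hypothetical.

```python
from typing import List, Optional, Tuple

# Assumed grouping of Ekman emotions into two polarity classes.
# "surprise" is intentionally absent because its polarity is ambiguous.
EMOTION_TO_POLARITY = {
    "joy": "positive",
    "anger": "negative",
    "disgust": "negative",
    "fear": "negative",
    "sadness": "negative",
}


def to_polarity(emotion: str) -> Optional[str]:
    """Map an emotion label to a polarity, or None if it has no mapping."""
    return EMOTION_TO_POLARITY.get(emotion)


def collapse(examples: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Relabel (text, emotion) pairs as (text, polarity), skipping unmapped ones."""
    relabeled = []
    for text, emotion in examples:
        polarity = to_polarity(emotion)
        if polarity is not None:
            relabeled.append((text, polarity))
    return relabeled
```

With fewer, coarser classes there are fewer ways to be wrong, which is consistent with the reported gain in accuracy for the two-class setting.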