This study evaluates the performance of four large language models (LLMs) in multilingual sentiment analysis: ChatGPT 3.5, ChatGPT 4, Gemini Pro, and LLaMA2 7B. The research assesses how well these models handle ambiguous and ironic text across 10 languages. Twenty scenarios are constructed, translated into each language, and presented to the models for sentiment prediction, and the models' responses are validated against human ratings to assess accuracy and identify biases.
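As a rough illustration of this evaluation pipeline, the core loop might look like the minimal Python sketch below. The model list, language codes, 1-5 rating scale, and the `query_model` helper are all assumptions made for illustration, not the study's actual harness:

```python
# Minimal sketch of the evaluation loop: every scenario is translated into
# each language, scored by each model, and compared to human ratings.
# All names below (MODELS, LANGUAGES, query_model) are illustrative assumptions.
from statistics import mean

MODELS = ["chatgpt-3.5", "chatgpt-4", "gemini-pro", "llama2-7b"]
# Hypothetical language set; the study's actual 10 languages may differ.
LANGUAGES = ["en", "de", "fr", "es", "it", "pl", "ru", "zh", "ja", "ar"]

def query_model(model: str, text: str) -> int:
    """Placeholder: ask `model` to rate the sentiment of `text` on a 1-5 scale."""
    return 3  # dummy neutral rating so the sketch runs; swap in the real API call

def evaluate(scenarios: dict[str, dict[str, str]],
             human_ratings: dict[str, float]) -> dict[tuple[str, str], float]:
    """Mean absolute error per (model, language) pair against human ratings.

    `scenarios` maps scenario id -> {language code -> translated text};
    `human_ratings` maps scenario id -> mean human sentiment rating.
    """
    errors: dict[tuple[str, str], list[float]] = {}
    for sid, translations in scenarios.items():
        for lang in LANGUAGES:
            for model in MODELS:
                pred = query_model(model, translations[lang])
                errors.setdefault((model, lang), []).append(
                    abs(pred - human_ratings[sid])
                )
    return {key: mean(vals) for key, vals in errors.items()}
```

A lower error for a (model, language) pair indicates closer agreement with human raters, and comparing errors across languages for the same model makes language-level differences directly visible.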
The study finds that ChatGPT and Gemini generally perform well in sentiment analysis, but they often fail to recognize nuances like irony or sarcasm. LLaMA2 consistently rates all scenarios positively, showing an optimistic bias. Gemini exhibits significant language-related rating differences and unexplained censorship behavior. The study also highlights that LLMs may have biases based on language families and that safety filters can affect model outputs.
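One simple way to surface the kind of language-related rating differences reported for Gemini is to compare each language's mean rating against the cross-language mean. The sketch below is a hedged illustration (the 0.5-point threshold and the data layout are assumptions), not the study's actual statistical analysis:

```python
# Illustrative bias check: flag languages whose mean rating deviates from
# the cross-language mean by more than `threshold` rating points.
from statistics import mean

def language_bias(ratings: dict[str, list[int]],
                  threshold: float = 0.5) -> dict[str, float]:
    """`ratings` maps language code -> all ratings a model gave in that language."""
    overall = mean(r for vals in ratings.values() for r in vals)
    deviations = {lang: mean(vals) - overall for lang, vals in ratings.items()}
    return {lang: dev for lang, dev in deviations.items() if abs(dev) >= threshold}

# e.g. language_bias({"en": [4, 4, 3], "pl": [2, 2, 3]})
# overall mean is 3.0, so "en" deviates by +0.67 and "pl" by -0.67; both are flagged.
```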
The research emphasizes the importance of evaluating LLMs in diverse linguistic contexts to improve their interpretability, accuracy, and applicability. It calls for further investigation into the underlying factors affecting model performance, including training data, biases, and decision-making processes. The study provides a standardized methodology for evaluating LLMs in sentiment analysis and encourages improvements in algorithms and training data to enhance performance and reliability. The findings suggest that while LLMs show promise in sentiment analysis, significant challenges remain in handling ambiguous and culturally nuanced scenarios.