25 Jan 2024 | Alessio Buscemi*, Daniele Proverbio†
This study evaluates the performance of four leading Large Language Models (LLMs)—ChatGPT 3.5, ChatGPT 4, Gemini Pro, and LLaMA2 7b—in automated sentiment analysis across 20 nuanced and ambiguous scenarios translated into 10 different languages. The primary goal is to assess how these models handle complex and ambiguous text, including irony and sarcasm, and to compare their performance with human responses. The study finds that while ChatGPT and Gemini generally perform well, they exhibit significant biases and inconsistent performance across models and languages. LLaMA2 consistently rates all scenarios positively, showing an "optimistic bias." Gemini, in particular, shows notable differences in ratings across languages and exhibits unexplained censorship behavior. The study provides a standardized methodology for evaluating LLMs and highlights the need for further improvements in algorithm development, data quality, and interpretability to enhance the reliability and applicability of automated sentiment analysis in various contexts.
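To make the evaluation setup concrete, the following is a minimal sketch of how one might query an LLM for a numeric sentiment rating of a translated scenario and compare it with a human baseline. This is not the authors' code: `query_llm` is a hypothetical wrapper around whichever model API is being tested (ChatGPT, Gemini, LLaMA2, ...), and the 1–10 rating scale is illustrative only.

```python
# Sketch (assumptions labeled above): prompt an LLM for a single sentiment
# rating of a scenario in a given language, averaging over repeated queries.

from statistics import mean


def build_prompt(scenario: str, language: str) -> str:
    # Ask the model for a bare numeric sentiment rating (scale is illustrative).
    return (
        f"Rate the sentiment of the following text (written in {language}) "
        f"on a scale from 1 (very negative) to 10 (very positive). "
        f"Reply with the number only.\n\n{scenario}"
    )


def score_scenario(query_llm, scenario: str, language: str, repeats: int = 3) -> float:
    # Query the model several times and average, since LLM outputs can vary.
    ratings = []
    for _ in range(repeats):
        reply = query_llm(build_prompt(scenario, language))
        try:
            ratings.append(float(reply.strip()))
        except ValueError:
            # Skip replies that are not a bare number (e.g. refusals or censorship).
            continue
    return mean(ratings) if ratings else float("nan")


# Usage sketch: compare a model's rating with a human reference for the same scenario.
# model_rating = score_scenario(my_llm_client, scenario_text, "Italian")
# bias = model_rating - human_rating
```

Averaging repeated queries and subtracting a human reference rating is one straightforward way to expose the per-language and per-model biases (e.g. LLaMA2's uniformly positive ratings) that the study reports.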