How Well Can LLMs Echo Us? Evaluating AI Chatbots' Role-Play Ability with ECHO

22 Apr 2024 | Man Tik Ng, Hui Tung Tse, Jen-tse Huang, Jingjing Li, Wenxuan Wang, Michael R. Lyu
This paper introduces ECHO, an evaluative framework inspired by the Turing test for assessing the role-playing abilities of Large Language Models (LLMs). Rather than imitating historical or fictional figures, ECHO focuses on simulating average individuals, enabling a more realistic evaluation of how well LLMs mimic human behavior. The study evaluates three role-playing LLMs from OpenAI: GPT-3.5, GPT-4, and GPTs.

Results show that GPT-4 deceives human evaluators most effectively, achieving a 48.3% success rate in passing as a real individual. GPT-4 can also identify differences between human- and machine-generated texts, yet it cannot reliably determine which texts were produced by humans. The study further explores LLMs as evaluators, assessing their ability to distinguish human from machine-generated responses: GPT-4 and GPT-4-Turbo perform strongly on this task, while Gemini-1.0-Pro performs no better than random guessing.

The findings highlight the importance of specific instructions in improving LLMs' role-playing abilities and the difficulty of capturing the nuances of human behavior. The paper also discusses limitations of current LLMs, including their inability to fully replicate human communication styles and the potential for bias in their evaluations, and calls for further research to improve LLMs' ability to mimic human behavior accurately and to ensure fair, unbiased evaluations. Overall, the study offers valuable insights into the capabilities and limitations of LLMs in role-playing scenarios and into their potential as evaluators for distinguishing human from machine-generated text.
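To make the LLM-as-evaluator idea concrete, below is a minimal sketch of how a judge model could be asked to pick the human-written response from a pair, in the spirit of the paper's evaluator experiment. The prompt wording, the `judge_pair` helper, the example data, and the use of the OpenAI chat completions API are illustrative assumptions, not the paper's actual prompts or pipeline.

```python
# Minimal sketch of an LLM-as-evaluator setup in the spirit of ECHO.
# The prompt, model choice, and data format are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You are shown two answers to the same question. Exactly one was "
    "written by a human and one by an AI chatbot role-playing as that "
    "human. Reply with 'A' or 'B' to indicate the human-written answer."
)

def judge_pair(question: str, answer_a: str, answer_b: str,
               model: str = "gpt-4") -> str:
    """Ask an LLM judge which of two answers is human-written."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": (
                f"Question: {question}\n\n"
                f"Answer A: {answer_a}\n\n"
                f"Answer B: {answer_b}"
            )},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Example usage: accuracy over a set of labeled pairs approximates how
# well the judge distinguishes human from machine-generated text.
if __name__ == "__main__":
    verdict = judge_pair(
        "What did you do last weekend?",
        "Mostly laundry and a hike with my dog, nothing exciting.",
        "I engaged in a variety of enriching recreational activities.",
    )
    print("Judge picks:", verdict)
```

Running such a judge over many labeled pairs and comparing its picks against the ground truth would yield the kind of evaluator accuracy the paper reports for GPT-4, GPT-4-Turbo, and Gemini-1.0-Pro.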