22 Apr 2024 | Man Tik Ng, Hui Tung Tse, Jen-tse Huang, Jingjing Li, Wenxuan Wang, Michael R. Lyu
The paper introduces ECHO, a framework inspired by the Turing Test to evaluate the role-playing abilities of Large Language Models (LLMs). Unlike previous studies that focus on imitating well-known public figures or fictional characters, ECHO assesses LLMs' capability to simulate ordinary individuals. The framework enlists acquaintances of the target individuals to distinguish between human and machine-generated responses. The study evaluates three role-playing LLMs using ECHO: GPT-3.5, GPT-4, and OpenAI's online application GPTs. The results show that GPT-4 is the most effective at deceiving human evaluators, achieving a leading success rate of 48.3%. Additionally, the study investigates whether LLMs can discern between human and machine-generated texts, finding that while GPT-4 can identify differences, it cannot reliably determine which texts are human-produced.
The paper concludes by highlighting the contributions of ECHO and the limitations of the study, emphasizing the need for further research to address the challenges of capturing the complexities of human interaction.