25 Mar 2024 | Se-eun Yoon, Zhankui He, Jessica Maria Echterhoff, Julian McAuley
This paper introduces a new evaluation protocol for assessing large language models (LLMs) as generative user simulators in conversational recommendation systems (CRSs). The protocol comprises five tasks that target key properties of synthetic users: item selection, binary preference expression, open-ended preference expression, request generation, and feedback provision. The tasks are grounded in real-world datasets such as ReDial, Reddit, MovieLens, and IMDB. The study finds that LLM-based simulators often deviate from human behavior, for example by favoring popular items, generating requests that lack personalization, and providing incoherent feedback. These deviations can be reduced through prompting strategies and model selection. The paper offers insights into improving the realism of LLM-based user simulators and underscores the importance of interactive evaluation in CRSs.
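As an illustration only (not code from the paper), the sketch below shows how one of the protocol's tasks, binary preference expression, might be posed to an LLM acting as a synthetic user. The prompt wording and the `call_llm` and `binary_preference` helpers are assumptions for this sketch, not the authors' implementation.

```python
# Illustrative sketch: posing the binary preference-expression task
# to an LLM-based user simulator. `call_llm` is a placeholder for
# whatever chat-completion client is actually used.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to a chat model and return its reply."""
    raise NotImplementedError

def binary_preference(liked_items: list[str], candidate: str) -> str:
    """Ask the simulated user whether they would like `candidate`,
    conditioning only on items the user is known to have liked."""
    history = "\n".join(f"- {title}" for title in liked_items)
    prompt = (
        "You are a movie watcher. Here are movies you have enjoyed:\n"
        f"{history}\n\n"
        f'Would you like to watch "{candidate}"? Answer Yes or No.'
    )
    return call_llm(prompt).strip()

# The simulator's Yes/No answers could then be compared against held-out
# human ratings (e.g., from MovieLens) to measure how human-like it is.
```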