Evaluating Large Language Models as Generative User Simulators for Conversational Recommendation

25 Mar 2024 | Se-eun Yoon, Zhankui He, Jessica Maria Echterhoff, Julian McAuley
This paper introduces an evaluation protocol for large language models (LLMs) as generative user simulators in conversational recommendation. The protocol comprises five tasks, each probing a key property of a synthetic user: choosing items to talk about, expressing binary preferences, expressing open-ended preferences, requesting recommendations, and giving feedback. Together, the tasks measure how closely an LLM-based simulator reproduces human behavior in conversational recommendation. The protocol is designed to be automatic and reproducible, is grounded in real user data, and is run on several datasets, including ReDial, Reddit, MovieLens, and IMDB.

Comparing simulators against real users, the study finds that baseline simulators fall short on several counts: they fail to capture the diversity of items humans mention, their stated preferences correlate poorly with human preferences, their recommendation requests lack personalization, and their feedback is often incoherent. The study also identifies ways to narrow these gaps, such as better prompting strategies and careful model selection. In particular, assigning simulators a "pickiness" personality trait improves their preference alignment with humans, and suitable prompting yields more diverse and personalized requests, though capturing nuanced preferences and producing coherent feedback remain difficult.

Overall, the work provides a framework for evaluating LLMs as user simulators in conversational recommendation, and it highlights the importance of using diverse datasets and accounting for the characteristics of each domain when assessing realism. The findings suggest that further research is needed to develop simulators that better represent the complexity of human preferences and behavior.
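To make the binary-preference task and the pickiness persona concrete, here is a minimal, illustrative sketch in Python. The prompt wording and the generic `complete_fn` backend are assumptions for illustration only, not the authors' released code or exact prompts.

```python
# Illustrative sketch of an LLM user simulator for a binary-preference task,
# with an optional "pickiness" persona. Prompt text and backend are assumed.

from typing import Callable, List, Optional

def build_simulator_prompt(liked_items: List[str], candidate: str,
                           pickiness: Optional[str] = None) -> str:
    """Assemble a prompt asking the simulated user to accept or reject an item."""
    persona = f"You are a {pickiness} movie watcher. " if pickiness else ""
    history = ", ".join(liked_items)
    return (
        f"{persona}You are a user talking to a movie recommender.\n"
        f"Movies you have enjoyed: {history}.\n"
        f"The recommender suggests: {candidate}.\n"
        "Answer 'yes' if you would watch it, otherwise 'no'."
    )

def simulate_binary_preference(complete_fn: Callable[[str], str],
                               liked_items: List[str], candidate: str,
                               pickiness: Optional[str] = None) -> bool:
    """Query any text-completion backend and parse the yes/no reply."""
    reply = complete_fn(build_simulator_prompt(liked_items, candidate, pickiness))
    return reply.strip().lower().startswith("yes")

if __name__ == "__main__":
    # Stub backend so the sketch runs offline; swap in a real LLM client.
    stub = lambda prompt: "yes"
    print(simulate_binary_preference(stub, ["Alien", "Blade Runner"],
                                     "The Matrix", pickiness="very picky"))
```

Keeping the backend as a plain callable makes it easy to swap in any LLM client and to compare the same prompts across models, in the spirit of the paper's model comparisons.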
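The first task, choosing items to talk about, is scored by how diverse a simulator's item mentions are relative to real users'. The sketch below shows the kind of diversity check involved; the two specific statistics are assumptions for illustration, not necessarily the paper's exact metrics.

```python
# Illustrative diversity check for mentioned items: a simulator that keeps
# naming the same popular title scores low on distinct_ratio and high on
# top1_share compared with real users. Metrics here are assumed examples.

from collections import Counter
from typing import List

def distinct_ratio(mentions: List[str]) -> float:
    """Share of unique items among all mentions; 1.0 means no repetition."""
    return len(set(mentions)) / len(mentions) if mentions else 0.0

def top1_share(mentions: List[str]) -> float:
    """Fraction of mentions taken by the single most frequent item."""
    if not mentions:
        return 0.0
    return Counter(mentions).most_common(1)[0][1] / len(mentions)

if __name__ == "__main__":
    simulator = ["Inception", "Inception", "Interstellar", "Inception"]
    humans = ["Heat", "Alien", "Clue", "Tangled"]
    print(distinct_ratio(simulator), top1_share(simulator))  # 0.5 0.75
    print(distinct_ratio(humans), top1_share(humans))        # 1.0 0.25
```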