17 Jun 2024 | Yijiang River Dong* and Tiancheng Hu* and Nigel Collier
Can LLM be a Personalized Judge?
This paper investigates the reliability of using large language models (LLMs) as personalized judges, where the LLM is asked to infer user preferences from personas. The study finds that directly applying LLM-as-a-Personalized-Judge is less reliable than previously assumed, showing low and inconsistent agreement with human ground truth. The personas typically used are often overly simplistic and therefore have low predictive power. To address these issues, the paper introduces verbal uncertainty estimation into the LLM-as-a-Personalized-Judge pipeline, allowing the model to express low confidence on uncertain judgments. This adjustment leads to much higher agreement (above 80%) on high-certainty samples for binary tasks. In a human evaluation, LLM-as-a-Personalized-Judge achieves performance comparable to third-party human evaluation and even surpasses human performance on high-certainty samples. The study indicates that certainty-enhanced LLM-as-a-Personalized-Judge offers a promising direction for developing more reliable and scalable methods for evaluating LLM personalization. The code is available at https://github.com/dong-river/Personalized-Judge.
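To make the described pipeline concrete, here is a minimal Python sketch of how a persona-conditioned judge with verbal uncertainty estimation could be wired up. The prompt wording, the 1-10 certainty scale, and the `call_llm` helper are illustrative assumptions, not the authors' actual implementation from the linked repository.

```python
# Minimal sketch of an LLM-as-a-Personalized-Judge call with verbal uncertainty
# estimation. All names (call_llm, JudgeResult, the prompt text) are illustrative
# assumptions, not code from the authors' repository.
import re
from dataclasses import dataclass

@dataclass
class JudgeResult:
    choice: str      # "A" or "B" for a binary preference task
    certainty: int   # verbal certainty on an assumed 1-10 scale

def build_prompt(persona: str, question: str, response_a: str, response_b: str) -> str:
    # Ask the judge to adopt the persona, pick the preferred response, and state
    # how certain it is -- the "verbal uncertainty estimation" step.
    return (
        f"You are judging on behalf of a user with this persona:\n{persona}\n\n"
        f"Question: {question}\n\n"
        f"Response A: {response_a}\n\nResponse B: {response_b}\n\n"
        "Which response would this user prefer? Answer with 'A' or 'B', then "
        "rate your certainty from 1 (guessing) to 10 (certain), e.g. 'A, 7'."
    )

def judge(persona: str, question: str, response_a: str, response_b: str,
          call_llm) -> JudgeResult:
    # call_llm is any function mapping a prompt string to the model's text reply.
    reply = call_llm(build_prompt(persona, question, response_a, response_b))
    match = re.search(r"\b([AB])\b\D*(\d+)", reply)
    if match is None:
        # If the reply cannot be parsed, treat it as a maximally uncertain guess.
        return JudgeResult(choice="A", certainty=1)
    return JudgeResult(choice=match.group(1), certainty=int(match.group(2)))
```

The key design point is that certainty is elicited alongside the verdict, so low-confidence judgments can later be filtered out rather than counted against the judge.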
The paper discusses the challenges of using LLMs for personalization tasks, highlighting the problem of persona sparsity: the available persona information is often insufficient for accurate predictions. To address this, the study incorporates verbal uncertainty estimation into the LLM-as-a-Personalized-Judge process, which significantly improves performance on high-certainty samples, as sketched below. Compared with third-party human evaluation, LLM-as-a-Personalized-Judge achieves comparable performance and even surpasses human performance on high-certainty samples. The paper concludes that certainty-aware LLM-as-a-Personalized-Judge is a promising alternative for evaluating personalization tasks, especially when first-person data is not available. It also highlights the limitations of current datasets and the need for more diverse and comprehensive data to evaluate LLM personalization effectively, and it emphasizes ethical considerations, including user autonomy, privacy, and the potential for social biases in LLMs. The study advocates for collecting more first-person personalization data and developing more reliable and scalable methods for evaluating LLM personalization.
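A hedged sketch of the certainty-aware filtering step follows: judgments below a certainty threshold are set aside, and agreement with ground truth is measured only on the high-certainty subset. The threshold value of 8 and the coverage metric are assumptions for illustration, not figures from the paper.

```python
# Sketch of certainty-aware filtering: keep only judgments the model reports high
# certainty for, then measure agreement with the human ground truth on that subset.
# The threshold of 8 (on a 1-10 scale) and the field layout are assumptions.
from typing import List, Optional, Tuple

def agreement_on_high_certainty(
    judgments: List[Tuple[str, int]],   # (predicted choice, verbal certainty 1-10)
    gold_labels: List[str],             # ground-truth choices, e.g. "A" or "B"
    threshold: int = 8,
) -> Tuple[Optional[float], float]:
    """Return (agreement on high-certainty samples, fraction of samples kept)."""
    kept = [(choice, gold)
            for (choice, certainty), gold in zip(judgments, gold_labels)
            if certainty >= threshold]
    if not kept:
        return None, 0.0  # the judge was never confident enough to evaluate
    agreement = sum(choice == gold for choice, gold in kept) / len(kept)
    coverage = len(kept) / len(judgments)
    return agreement, coverage

# Example: agreement is computed only over the two high-certainty judgments.
print(agreement_on_high_certainty([("A", 9), ("B", 3), ("B", 8)], ["A", "A", "B"]))
# -> (1.0, 0.666...)
```

Reporting coverage alongside agreement matters: filtering to high-certainty samples raises agreement, but only the retained fraction of the data is actually being evaluated.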