Can LLM be a Personalized Judge?

17 Jun 2024 | Yijiang River Dong*, Tiancheng Hu*, and Nigel Collier
The paper investigates the reliability of using large language models (LLMs) as personalized judges, a common approach to evaluating LLM personalization. The authors find that this method is less reliable than commonly assumed, showing low and inconsistent agreement with human ground truth. They attribute this to the simplicity of the personas typically used, which often lack the predictive power needed for the judgment task. To address this issue, the authors introduce verbal uncertainty estimation, which lets the LLM express low confidence on uncertain judgments. This adjustment significantly improves performance, achieving over 80% agreement with human ground truth on high-certainty samples in binary tasks. Through human evaluation, LLM-as-a-Personalized-Judge is found to achieve performance comparable to third-party human evaluation, and it even surpasses human performance on high-certainty samples. The study suggests that certainty-enhanced LLM-as-a-Personalized-Judge offers a promising direction for developing more reliable and scalable methods of evaluating LLM personalization. The code for the experiments is available at <https://github.com/dong-river/Personalized-Judge>.
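
To make the pipeline concrete, the sketch below illustrates one way such a certainty-enhanced personalized judge could be wired up: the model is given a persona and two candidate responses, returns a binary choice plus a verbalized certainty score, and only high-certainty judgments are kept when computing agreement with human labels. This is a minimal sketch, not the paper's actual implementation; the OpenAI chat client as the judge backend, the `gpt-4o` model name, the exact prompt wording, the 1-10 certainty scale, and the threshold of 7 are all illustrative assumptions.

```python
# Minimal sketch of a certainty-enhanced LLM-as-a-Personalized-Judge.
# Assumptions (not from the paper): the OpenAI chat API as the judge,
# the prompt wording, and the 1-10 certainty scale with threshold 7.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are judging which response better fits the user below.
Persona: {persona}

Response A: {response_a}
Response B: {response_b}

Answer with the better response ("A" or "B") and your certainty on a
1-10 scale, formatted exactly as: Choice: <A or B>, Certainty: <1-10>"""


def personalized_judge(persona: str, response_a: str, response_b: str):
    """Ask the judge for a binary preference plus a verbalized certainty."""
    completion = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice; the paper evaluates several LLMs
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            persona=persona, response_a=response_a, response_b=response_b)}],
        temperature=0,
    )
    text = completion.choices[0].message.content
    match = re.search(r"Choice:\s*([AB]).*?Certainty:\s*(\d+)", text, re.S)
    if match is None:
        return None, 0  # unparseable output counts as lowest certainty
    return match.group(1), int(match.group(2))


def agreement_on_high_certainty(samples, threshold=7):
    """Agreement with human labels, keeping only high-certainty judgments."""
    kept = []
    for persona, resp_a, resp_b, human_label in samples:
        choice, certainty = personalized_judge(persona, resp_a, resp_b)
        if choice is not None and certainty >= threshold:
            kept.append(choice == human_label)
    return sum(kept) / len(kept) if kept else float("nan")
```

Filtering on the verbalized certainty is what the paper's finding motivates: agreement exceeds 80% on high-certainty samples in binary tasks, while low-certainty judgments could instead be deferred to human annotators.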