3 Jul 2024 | Qian Pan, Zahra Ashktorab, Michael Desmond, Martin Santillan Cooper, James Johnson, Rahul Nair, Elizabeth Daly, Werner Geyer
This paper explores the challenges and opportunities of using Large Language Models (LLMs) as evaluators (LLM-as-a-judge) for assessing natural language generation (NLG) outputs. Traditional metrics like BLEU and ROUGE are inadequate for evaluating highly creative or superior-quality text, and human evaluation is costly and difficult to scale. The authors present EvaluLLM, a tool that enables users to leverage LLMs as customizable judges, integrating human input to ensure alignment with human preferences and to keep evaluations robust and consistent. Through interviews with eight domain experts, the study identifies a need for assistance in developing effective evaluation criteria. The paper offers design recommendations for human-assisted LLM-as-a-judge systems, including efficient criteria iteration, structured and customizable templates, interactive criteria refinement, consistency across evaluations, support for different evaluation setups, adaptable reference-based evaluation, greater system transparency, and proactive mitigation of potential biases. The findings highlight the potential of LLMs as customizable judges and emphasize the importance of interactive, transparent, and user-centered evaluation processes.
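To make the LLM-as-a-judge setup concrete, below is a minimal sketch of a pairwise comparison judge driven by a user-defined criterion, in the spirit of what EvaluLLM supports. This is not EvaluLLM's actual implementation: the `call_llm` function is a placeholder for whatever model API is available, and the prompt template, criterion handling, and the order-swapping check (one common way to counter position bias, a well-known LLM-judge issue in line with the paper's recommendation to proactively mitigate biases) are illustrative assumptions.

```python
# Minimal sketch of a pairwise LLM-as-a-judge evaluation with a custom criterion.
# NOTE: `call_llm`, the prompt template, and the verdict parsing are illustrative
# assumptions, not EvaluLLM's actual internals.

JUDGE_PROMPT = """You are an impartial evaluator of natural language generation.
Evaluation criterion: {criterion}

Candidate A:
{output_a}

Candidate B:
{output_b}

Which candidate better satisfies the criterion? Answer with exactly one of
"A", "B", or "Tie", followed by a one-sentence justification."""


def call_llm(prompt: str) -> str:
    """Stand-in for a real model call (API or local model); plug in your client."""
    raise NotImplementedError("Replace with your LLM client call.")


def judge_pair(criterion: str, output_a: str, output_b: str) -> dict:
    """Compare two outputs under one criterion, running both candidate orders
    and accepting the verdict only when the two runs agree (position-bias check)."""
    verdict_1 = call_llm(JUDGE_PROMPT.format(
        criterion=criterion, output_a=output_a, output_b=output_b)).strip()
    verdict_2 = call_llm(JUDGE_PROMPT.format(
        criterion=criterion, output_a=output_b, output_b=output_a)).strip()

    first = verdict_1.split()[0].strip('".,')
    second = verdict_2.split()[0].strip('".,')
    # In the swapped run, a consistent judge must prefer the *other* label.
    swapped = {"A": "B", "B": "A", "Tie": "Tie"}
    consistent = swapped.get(first) == second
    return {"winner": first if consistent else "Inconsistent",
            "raw_verdicts": (verdict_1, verdict_2)}
```

A human-in-the-loop workflow like the one the paper describes would layer on top of this: users inspect individual judge verdicts, revise the criterion text, and re-run the comparison until the judge's decisions align with their own preferences.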