3 Jul 2024 | Qian Pan, Zahra Ashktorab, Michael Desmond, Martin Santillan Cooper, James Johnson, Rahul Nair, Elizabeth Daly, Werner Geyer
This paper presents EvaluLLM and a user study exploring the design of LLM-as-a-Judge systems that integrate human input to balance the cost-saving potential of automated evaluation against the caution needed to trust its results. The study addresses the challenges of evaluating large language models (LLMs) on creative or open-ended text generation tasks, where traditional metrics such as BLEU and ROUGE are less effective. The research highlights the need for customizable evaluation criteria that align with human preferences, and the importance of human involvement in keeping evaluations robust and consistent.

Through interviews with eight domain experts, the study identifies key challenges in model evaluation, including the need for rapid performance comparison, structured and customizable evaluation templates, and strategies for integrating LLM-as-a-Judge into existing workflows. The findings suggest that LLMs can be effective evaluators when paired with human oversight, enabling more accurate and reliable assessments.

The paper proposes design recommendations for LLM-as-a-Judge systems, emphasizing interactive, transparent, and user-centered evaluation processes. These include structured and customizable criteria templates, interactive criteria iteration, consistency checks, support for different evaluation setups, adaptable reference-based evaluation, and greater system transparency. The study also discusses the need for bias mitigation strategies and further automation to improve the efficiency and effectiveness of LLM-as-a-Judge systems. Overall, the research underscores the importance of human-in-the-loop approaches to evaluating LLMs, ensuring that evaluations are aligned with human preferences and that the systems are both reliable and adaptable to diverse use cases.
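To make the described workflow concrete, the sketch below shows one way a pairwise LLM-as-a-Judge loop could combine a customizable criteria template with a human spot-check hook. It is a minimal illustration under stated assumptions, not the EvaluLLM implementation: the prompt wording, the call_judge_model stub, and the review-sampling rate are all hypothetical placeholders to be replaced by the reader's own model client and review policy.

```python
# Minimal sketch: pairwise LLM-as-a-Judge with a customizable criteria
# template and a human spot-check hook. The prompt text, the
# call_judge_model stub, and the review rate are illustrative assumptions,
# not the system described in the paper.

import random
from dataclasses import dataclass

CRITERIA_TEMPLATE = """You are an impartial evaluator.
Criteria: {criteria}

Task prompt:
{task}

Output A:
{output_a}

Output B:
{output_b}

Which output better satisfies the criteria? Answer with exactly "A" or "B"."""


@dataclass
class Comparison:
    task: str
    output_a: str
    output_b: str


def call_judge_model(prompt: str) -> str:
    """Placeholder for a call to whichever LLM serves as the judge."""
    raise NotImplementedError("Plug in your model client here.")


def judge_pair(item: Comparison, criteria: str) -> str:
    """Return 'A' or 'B' according to the judge model's stated preference."""
    prompt = CRITERIA_TEMPLATE.format(
        criteria=criteria,
        task=item.task,
        output_a=item.output_a,
        output_b=item.output_b,
    )
    verdict = call_judge_model(prompt).strip().upper()
    return verdict if verdict in {"A", "B"} else "A"  # conservative fallback


def evaluate(items: list[Comparison], criteria: str, review_rate: float = 0.1):
    """Judge every pair and flag a random sample for human review."""
    results = []
    for item in items:
        verdict = judge_pair(item, criteria)
        results.append({
            "verdict": verdict,
            "needs_human_review": random.random() < review_rate,
        })
    return results
```

In practice one would also apply bias mitigation of the kind the paper raises, for example judging each pair twice with the output order swapped and discarding or escalating inconsistent verdicts to the human reviewer.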