18 Apr 2024 | Shreya Shankar, J.D. Zamfirescu-Pereira, Björn Hartmann, Aditya G. Parameswaran, Ian Arawjo
The paper "Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences" addresses the challenge of validating the outputs of Large Language Models (LLMs) by proposing a mixed-initiative approach called EvalGEN. The authors, Shreya Shankar, J.D. Zamfirescu-Pereira, Björn Hartmann, Aditya G. Parameswaran, and Ian Arawjo, highlight the limitations of current evaluation methods, which often rely on code-based or LLM-generated evaluators that themselves suffer from biases and inaccuracies.
EvalGen aims to align LLM-generated evaluation functions with human preferences through a user-friendly interface. The system provides automated assistance in generating evaluation criteria and implementing assertions, while also allowing users to grade a subset of LLM outputs to refine those criteria. EvalGen is integrated into the existing ChainForge system for prompt engineering and auditing.
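To make this workflow concrete, here is a minimal, hypothetical Python sketch, not the paper's implementation: the assertion functions, data, and selection rule below are invented for illustration. It shows one piece of the idea, namely scoring candidate assertions for a single criterion against a user's thumbs-up/down grades on a small sample of outputs and keeping the one that agrees best.

```python
# Illustrative sketch only (assumed workflow, not EvalGen's actual code):
# pick the candidate assertion that best matches human grades on a sample.
from typing import Callable, Dict, List

# Hypothetical candidate assertions for one criterion, e.g. "response is concise".
def assert_under_100_words(output: str) -> bool:
    return len(output.split()) < 100

def assert_no_bullet_lists(output: str) -> bool:
    return "\n- " not in output

candidate_assertions: List[Callable[[str], bool]] = [
    assert_under_100_words,
    assert_no_bullet_lists,
]

def agreement(assertion: Callable[[str], bool], graded: List[Dict]) -> float:
    """Fraction of graded outputs where the assertion's pass/fail verdict
    matches the user's thumbs-up/down grade."""
    matches = sum(assertion(g["output"]) == g["thumbs_up"] for g in graded)
    return matches / len(graded)

# In practice these grades would come from the user grading a subset of LLM outputs.
graded_sample = [
    {"output": "A short, direct answer.", "thumbs_up": True},
    {"output": "A rambling answer...\n- with\n- bullets", "thumbs_up": False},
]

best = max(candidate_assertions, key=lambda a: agreement(a, graded_sample))
print(best.__name__, agreement(best, graded_sample))
```

The point of the sketch is only the shape of the loop: criteria are turned into executable checks, human grades on a few outputs serve as the reference, and the checks most aligned with those grades are retained.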
The paper presents a qualitative study with nine industry practitioners who used EvalGen to evaluate LLM outputs. The study found overall support for EvalGen but also highlighted the subjectivity and iterative nature of the alignment process. The authors observed a phenomenon they call *criteria drift*: users needed criteria in order to grade outputs, yet the act of grading led them to revise those criteria. This suggests that evaluation criteria cannot be defined independently of the specific LLM outputs being observed.
The authors discuss the implications of these findings for the design of future LLM evaluation assistants, emphasizing the need for mixed-initiative approaches that embrace the messiness and iteration of the alignment process. They also raise broader questions about what constitutes "alignment with user preferences" in the context of LLM evaluation.