Humans or LLMs as the Judge? A Study on Judgement Bias
Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, Benyou Wang
The Chinese University of Hong Kong, Shenzhen
Shenzhen Research Institute of Big Data
Abstract: Using humans and large language models (LLMs) as judges for evaluating LLM performance has gained attention, but this approach introduces potential biases from both humans and LLMs, calling into question the reliability of evaluation results. This paper proposes a novel framework for investigating biases in LLM and human judges without relying on ground-truth annotations. We curate a dataset based on the revised Bloom's Taxonomy and conduct thousands of evaluations. The results show that both human and LLM judges are vulnerable to various perturbations, and that even the most advanced judges exhibit considerable bias. We further exploit these biases to attack LLM judges. Our work aims to raise awareness of the biases and vulnerabilities of human- and LLM-as-a-judge, as well as the urgency of developing robust evaluation systems.
Introduction: Proprietary models such as GPT-4, Claude, and GeminiPro showcase strong abilities across NLP tasks and are widely used. Open-source communities are trying to replicate these models and democratize LLMs. To track LLM advancements, the community evaluates model performance through benchmarks, which can be categorized as close-ended or open-ended. While close-ended benchmarks are convenient, they often suffer from data contamination. Open-ended benchmarks, such as MTBench and Alpaca-Eval, test models via free-form generation, which is more consistent with real-world use cases; moreover, they suffer less from data contamination because there are no standard answers to leak into training data.
Open-ended benchmarks often rely on human evaluation. With the emergence of human-aligned LLMs, LLM-as-a-judge has become an alternative to human judges. However, both types of judges have been found to exhibit certain biases, which calls into question the validity of human- and LLM-as-a-judge. An important question therefore arises: how biased are humans and LLMs when judging open-ended generation?
Existing bias evaluation frameworks require a gold standard, either in the form of ground truth or human-provided reference answers. But what if we want to study the effect of perturbations when no gold standard is available?
In this paper, we first identify four biases that are crucial in NLG evaluation: Misinformation Oversight Bias, Gender Bias, Authority Bias, and Beauty Bias. Inspired by intervention studies, we investigate these biases by applying four corresponding perturbations to raw answers.
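To make the setup concrete, the snippet below sketches what such perturbations could look like in code. The function names and the exact textual edits are illustrative assumptions for exposition, not the paper's released implementation.

```python
# Illustrative perturbation functions, one per bias; the concrete edits below
# are assumptions chosen for exposition, not the authors' exact perturbations.

def perturb_misinformation(answer: str, wrong_fact: str) -> str:
    # Misinformation Oversight Bias: slip a factual error into the answer.
    return answer + " " + wrong_fact

def perturb_gender(answer: str, author_gender: str) -> str:
    # Gender Bias: attribute the answer to an author of a stated gender.
    return f"This answer was written by a {author_gender} author. " + answer

def perturb_authority(answer: str, fake_citation: str) -> str:
    # Authority Bias: append a fabricated citation as a spurious source.
    return answer + " " + fake_citation

def perturb_beauty(answer: str) -> str:
    # Beauty Bias: add rich-text decoration without changing the content.
    return "\n".join(f"**{line}**" if line.strip() else line
                     for line in answer.splitlines())
```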
To fill this gap, we propose a novel reference-free framework for evaluating bias in human and LLM judges. We construct a control group and an experimental group: each sample in the control group contains a pair of raw answers to the same question, while each pair in the experimental group consists of one raw answer from the control group and a perturbed version of the other.
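A minimal sketch of this pairing scheme is given below, assuming hypothetical helper names (Answer, build_groups) and a generic perturb callable such as one of the functions above; it illustrates the construction rather than reproducing the authors' code.

```python
# Minimal sketch of the reference-free control/experimental pairing.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Answer:
    question_id: str
    text: str

def build_groups(
    answers_a: List[Answer],
    answers_b: List[Answer],
    perturb: Callable[[str], str],
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Pair two answer sets per question into a control group (raw vs. raw)
    and an experimental group (raw vs. perturbed counterpart)."""
    control, experimental = [], []
    for a, b in zip(answers_a, answers_b):
        assert a.question_id == b.question_id, "answers must share a question"
        control.append((a.text, b.text))                # both answers untouched
        experimental.append((a.text, perturb(b.text)))  # perturb one side only
    return control, experimental

# Example: probe authority bias by appending a fabricated citation.
# control, experimental = build_groups(
#     model_a_answers, model_b_answers,
#     lambda t: perturb_authority(t, "(see Smith et al., 2021)"))
```

Comparing judge decisions between the two groups isolates the effect of each perturbation without requiring any ground-truth answer.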