16 Jun 2024 | Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, Benyou Wang
The paper "Humans or LLMs as the Judge? A Study on Judgement Bias" by Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang explores the biases of human judges and large language models (LLMs) in evaluating LLM performance. The authors propose a novel reference-free framework to investigate four types of biases: Misinformation Oversight Bias, Gender Bias, Authority Bias, and Beauty Bias. They curate a dataset based on the revised Bloom's Taxonomy and conduct extensive evaluations. The results show that both human and LLM judges are susceptible to various biases, with LLMs exhibiting significant biases in all four categories. The study also demonstrates how these biases can be exploited to conduct attacks on LLM judges, achieving an Attack Successful Rate (ASR) of 50% on GPT-4.
The authors aim to raise awareness about the biases and vulnerabilities of human and LLM judges and emphasize the need for developing more robust evaluation systems. The paper includes a detailed methodology, experimental results, and discussions on the limitations and ethical considerations.
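The attack idea described above can be sketched in a few lines: query a pairwise judge twice, once on the original answer pair and once after appending a bias-triggering perturbation (here, a fabricated authority citation) to the losing answer, then report the fraction of verdicts that flip as the attack success rate (ASR). This is a minimal illustrative sketch, not the authors' implementation: the `judge` function is a trivial length heuristic standing in for a real LLM call, and `add_fake_citation` is an assumed example of an authority-bias perturbation.

```python
# Hypothetical sketch of a bias-exploiting attack on a pairwise judge.
# `judge` is a stand-in heuristic, not a real LLM; `add_fake_citation`
# is an assumed authority-bias perturbation, not the paper's code.

def add_fake_citation(answer: str) -> str:
    """Append a fabricated citation (an authority-bias perturbation)."""
    return answer + " (Smith et al., 2021)"

def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Stand-in pairwise judge: returns 'A' or 'B'.
    A trivial longer-answer-wins heuristic so the sketch runs offline."""
    return "A" if len(answer_a) >= len(answer_b) else "B"

def attack_success_rate(cases, perturb) -> float:
    """Fraction of cases where perturbing the losing answer flips the verdict."""
    flips = 0
    for question, a, b in cases:
        baseline = judge(question, a, b)
        # Perturb whichever answer lost the baseline comparison.
        attacked = (a, perturb(b)) if baseline == "A" else (perturb(a), b)
        if judge(question, *attacked) != baseline:
            flips += 1
    return flips / len(cases)

cases = [
    ("Q1", "A fairly detailed answer.", "Brief."),
    ("Q2", "A very long and thorough answer indeed.", "Brief."),
]
print(attack_success_rate(cases, add_fake_citation))  # → 0.5 with this toy judge
```

Any number this toy produces is an artifact of the length heuristic and the two hand-picked cases; the paper's reported 50% ASR on GPT-4 comes from its own evaluation setup, where `judge` would be a call to the model under test.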