[slides and audio] LLM Evaluators Recognize and Favor Their Own Generations

This paper investigates the phenomenon of self-preference in large language models (LLMs), where LLMs rate their own outputs higher than those of other LLMs or humans. The authors explore whether this bias is due to LLMs recognizing their own outputs. They find that LLMs like GPT-4 and Llama 2 have non-trivial accuracy in distinguishing their own outputs from others, with GPT-4 achieving 73.5% accuracy. Fine-tuning these models further enhances their self-recognition capability, leading to a linear correlation between self-recognition and self-preference. The study also controls for potential confounders and invalidates the reverse causal relationship, providing evidence that self-recognition causes self-preference. The findings highlight the importance of addressing self-preference in LLMs to ensure unbiased evaluations and AI safety.This paper investigates the phenomenon of self-preference in large language models (LLMs), where LLMs rate their own outputs higher than those of other LLMs or humans. The authors explore whether this bias is due to LLMs recognizing their own outputs. They find that LLMs like GPT-4 and Llama 2 have non-trivial accuracy in distinguishing their own outputs from others, with GPT-4 achieving 73.5% accuracy. Fine-tuning these models further enhances their self-recognition capability, leading to a linear correlation between self-recognition and self-preference. The study also controls for potential confounders and invalidates the reverse causal relationship, providing evidence that self-recognition causes self-preference. The findings highlight the importance of addressing self-preference in LLMs to ensure unbiased evaluations and AI safety.

LLM Evaluators Recognize and Favor Their Own Generations

15 Apr 2024 | Arjun Panickssery, Samuel R. Bowman, Shi Feng