Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators


6 Apr 2024 | Yann Dubois, Balázs Galambosi, Percy Liang and Tatsunori B. Hashimoto
The paper "Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators" addresses spurious correlations in automated evaluation metrics, focusing on the length bias of AlpacaEval, a popular benchmark for chat LLMs. The authors propose a simple regression-based approach to controlling for biases in auto-evaluations: they fit a generalized linear model (GLM) that predicts auto-annotator preferences from mediators such as the length difference between outputs, alongside other relevant features. Conditioning the GLM on a length difference of zero yields length-controlled preferences, which make the metric robust to manipulations of model verbosity. The resulting length-controlled AlpacaEval (AlpacaEval-LC) correlates more strongly with human evaluations (Chatbot Arena) and is more robust to adversarial attacks such as truncation. The paper also discusses the limitations of other length-correction methods and suggests potential applications in reinforcement learning from human feedback (RLHF). Overall, the proposed method provides a principled and interpretable way to mitigate length bias in automated evaluations.
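To make the mechanism concrete, here is a minimal sketch of the regression-based debiasing step in Python. The DataFrame columns `preference` and `len_diff` are hypothetical names for illustration, and the paper's actual GLM also includes model-identity and per-instruction terms that are omitted here.

```python
# Minimal sketch of length-controlled preference estimation, assuming a
# pandas DataFrame `df` with two hypothetical columns:
#   preference: 1 if the auto-annotator preferred the model over the
#               baseline on that example, else 0
#   len_diff:   length difference between the model's and the baseline's
#               outputs on that example
# The tanh normalization below is illustrative, not the paper's exact form.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def length_controlled_winrate(df: pd.DataFrame) -> float:
    # Mediator: normalized length difference, bounded to (-1, 1).
    x = np.tanh(df["len_diff"] / df["len_diff"].std())
    # Logistic GLM: the intercept absorbs model quality, the slope absorbs
    # the auto-annotator's length bias.
    X = sm.add_constant(x)
    glm = sm.GLM(df["preference"], X, family=sm.families.Binomial()).fit()
    # Condition on a length difference of zero: predict with the mediator
    # set to 0, so only the (debiased) intercept term contributes.
    return float(glm.predict([[1.0, 0.0]])[0])
```

Under this sketch, a model that wins mostly by producing longer outputs would see its raw win rate exceed the length-controlled estimate, since the verbosity effect is captured by the slope rather than the intercept.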