Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

6 Apr 2024 | Yann Dubois, Balázs Galambosi, Percy Liang and Tatsunori B. Hashimoto
This paper introduces a method for debiasing automatic evaluation metrics by controlling for spurious correlations, focusing on length bias in the AlpacaEval benchmark. AlpacaEval is an automated evaluation metric for chat LLMs that uses an LLM annotator to estimate response quality, but it is known to favor models that generate longer outputs, a spurious correlate of quality. To address this, the authors propose a length-controlled AlpacaEval that answers the counterfactual question: "What would the preference be if the model's and the baseline's outputs had the same length?"

The approach fits a generalized linear model (GLM) that predicts the biased quantity of interest (the auto-annotator's preference) from the mediator to be controlled for (the length difference between the two outputs) and other relevant features. Conditioning the fitted GLM on a zero length difference then yields length-controlled preferences. This not only makes the metric more robust to manipulations of model verbosity but also increases its Spearman correlation with LMSYS' Chatbot Arena from 0.94 to 0.98.

The authors show that length-controlled AlpacaEval (AlpacaEval-LC) is more robust to length-based spurious correlates and correlates better with human evaluations of model rankings. They also show that AlpacaEval-LC remains interpretable as a win rate and is robust to adversarial attacks. The method is simple, interpretable, and effective at reducing the length bias of automated evaluations. The code and resulting leaderboard are publicly released for further research and evaluation.
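The sketch below illustrates the core idea in simplified form; it is not the authors' exact GLM (which also conditions on model identity and instruction difficulty). It assumes per-example records of annotator preference and output lengths, and uses scikit-learn's LogisticRegression as the GLM: fit the preference on the length difference, then predict the counterfactual win probability at a length difference of zero.

```python
# Minimal sketch of length-controlled preference estimation (simplified; the
# hypothetical helper below is not the paper's implementation).
import numpy as np
from sklearn.linear_model import LogisticRegression


def length_controlled_winrate(preferences, model_lengths, baseline_lengths):
    """preferences: 1 if the annotator preferred the model, 0 otherwise.
    model_lengths / baseline_lengths: output lengths (e.g. in characters)."""
    model_lengths = np.asarray(model_lengths, dtype=float)
    baseline_lengths = np.asarray(baseline_lengths, dtype=float)

    # Normalized length difference: the mediator we want to control for.
    delta = np.tanh((model_lengths - baseline_lengths) / model_lengths.std())
    X = delta.reshape(-1, 1)

    # GLM (logistic regression) predicting the annotator's preference.
    glm = LogisticRegression().fit(X, preferences)

    # Counterfactual question: what would the preference be if both outputs
    # had the same length? Predict at a length difference of zero.
    return glm.predict_proba([[0.0]])[0, 1]


# Toy usage with a purely length-driven annotator (hypothetical data):
rng = np.random.default_rng(0)
lens_model = rng.integers(200, 2000, size=500)
lens_baseline = rng.integers(200, 2000, size=500)
prefs = (lens_model > lens_baseline).astype(int)
print(length_controlled_winrate(prefs, lens_model, lens_baseline))  # ~0.5
```

On toy data where preferences are driven entirely by length, the length-controlled win rate collapses toward 0.5, which is the intended behavior: once the length mediator is held fixed, no genuine quality signal remains.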