5 Jun 2024 | Melissa Ailem and Katerina Marazopoulou and Charlotte Siska and James Bono
This paper examines the robustness of Large Language Model (LLM) evaluations to the distributional assumptions of benchmarks. The authors argue that benchmarks often assume that test prompts are representative samples from a real-world distribution, but this is not always the case. Instead, the distribution of interest varies depending on the specific use case. The study finds that model performance across test prompts is not random, and that accounting for correlations between prompts can change model rankings on major benchmarks. Explanatory factors for these correlations include semantic similarity and common LLM failure points.
The paper presents a novel approach to assessing the robustness and adequacy of benchmarks used in evaluating LLMs by analyzing the performance of multiple LLMs on four major benchmarks. The key contributions are: (1) showing that standard evaluation implicitly assumes all prompts contribute equally, (2) demonstrating how model scores and rankings change when prompts are instead weighted to reflect alternative distributional assumptions, and (3) showing that similarity in model performance across prompts can be explained by semantic similarity or common failure points.
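To make the first two points concrete, here is a minimal sketch (not the authors' code, and using entirely synthetic correctness data) contrasting the standard equal-weight benchmark score with a score computed under an alternative prompt weighting:

```python
import numpy as np

rng = np.random.default_rng(0)
n_prompts, n_models = 1000, 3

# correct[i, j] = 1 if model j answers prompt i correctly (synthetic data)
correct = rng.integers(0, 2, size=(n_prompts, n_models))

# (1) Standard benchmark score: every prompt contributes equally.
equal_weight_acc = correct.mean(axis=0)

# (2) Use-case-specific score: prompts re-weighted toward a hypothetical target
#     distribution (random Dirichlet weights here, purely as a placeholder).
weights = rng.dirichlet(np.ones(n_prompts))
weighted_acc = weights @ correct

print("equal-weight accuracy:", np.round(equal_weight_acc, 3))
print("weighted accuracy:    ", np.round(weighted_acc, 3))
print("ranking (equal):   ", np.argsort(-equal_weight_acc) + 1)
print("ranking (weighted):", np.argsort(-weighted_acc) + 1)
```

Under equal weighting every prompt counts the same; under an alternative weighting, a model that happens to do well on the up-weighted prompts can overtake a model with higher equal-weight accuracy.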
The authors evaluate the performance of LLMs on four major benchmarks: ANLI, HellaSwag, CommonsenseQA, and CNN/Daily Mail. They find that the correlation of model performance across prompts is statistically significant (p-value < 0.05) on all four benchmarks. They also find that model performance can change substantially under different distributional assumptions, with accuracy shifts as large as 10% and ranking changes of up to 5 positions (out of 14 models).
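The flavor of this significance claim can be illustrated with a simple permutation test on two models' per-prompt correctness vectors. The data below are synthetic and the paper's exact test statistic may differ; this is only a sketch of the idea that shared failure points produce non-random correlation across prompts.

```python
import numpy as np

rng = np.random.default_rng(1)
n_prompts = 500

# Synthetic per-prompt correctness for two models; model_b shares most of
# model_a's failure points, so their outcomes are correlated across prompts.
model_a = rng.integers(0, 2, size=n_prompts)
model_b = np.where(rng.random(n_prompts) < 0.8, model_a, 1 - model_a)

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

observed = corr(model_a, model_b)

# Permutation null: shuffling one model's outcomes destroys any prompt-level link.
n_perm = 10_000
null = np.array([corr(model_a, rng.permutation(model_b)) for _ in range(n_perm)])
p_value = (np.abs(null) >= abs(observed)).mean()

print(f"observed correlation = {observed:.3f}, permutation p-value = {p_value:.4f}")
```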
The study shows that model performance similarity across prompts can be partly explained by semantic similarity, but is more likely driven by common LLM failure points. The authors also find that model performance is significantly affected by how prompts are weighted, with different weighting schemes leading to different model rankings.
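The semantic-similarity analysis can be sketched roughly as follows: embed each prompt, build one prompt-by-prompt similarity matrix in embedding space and another from the prompts' correctness profiles across models, and correlate the two. Everything below (random vectors standing in for a sentence encoder, synthetic correctness data) is an illustrative assumption, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(2)
n_prompts, n_models, dim = 200, 14, 32

# Stand-ins: random vectors in place of sentence embeddings, random 0/1
# correctness in place of real per-prompt model outcomes.
embeddings = rng.normal(size=(n_prompts, dim))
correct = rng.integers(0, 2, size=(n_prompts, n_models)).astype(float)

def cosine_sim(x):
    """Pairwise cosine similarity between the rows of x."""
    x = x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)
    return x @ x.T

semantic_sim = cosine_sim(embeddings)                         # prompt-prompt, embedding space
performance_sim = cosine_sim(correct - correct.mean(axis=0))  # prompt-prompt, correctness profiles

# Correlate the two similarity structures over all prompt pairs. With real data,
# a positive correlation would indicate that semantically similar prompts also
# tend to be answered similarly across models.
iu = np.triu_indices(n_prompts, k=1)
r = np.corrcoef(semantic_sim[iu], performance_sim[iu])[0, 1]
print(f"correlation between semantic and performance similarity: {r:.3f}")
```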
The paper concludes that the robustness of LLM evaluations to the distributional assumptions of benchmarks is an important issue that needs to be addressed. The authors suggest that future work should focus on identifying additional factors that may explain these biases and developing solutions to improve benchmark robustness.