5 Jun 2024 | Melissa Ailem and Katerina Marazopoulou and Charlotte Siska and James Bono
This paper examines the robustness of Large Language Models (LLMs) to the distributional assumptions of benchmarks. The authors argue that the assumption that test prompts within a benchmark are randomly sampled from a real-world distribution is often not valid, as the distribution of interest varies according to specific use cases. They find that:
1. Model performance across test prompts is significantly correlated, indicating non-random relationships between prompts (a minimal sketch of this kind of check follows the list).
2. Accounting for these correlations can change model rankings on major benchmarks.
3. Explanatory factors for these correlations include semantic similarity and common LLM failure points.
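Finding 1 can be made concrete with a small amount of code. The sketch below is illustrative only, not the authors' implementation, and the correctness matrix is made-up data: it treats a benchmark as a models-by-prompts matrix of correct/incorrect outcomes and checks how strongly performance on one prompt correlates with performance on another.

```python
# Minimal sketch (hypothetical data, not the paper's code): correlation of
# model performance across test prompts within a benchmark.
import numpy as np
from scipy.stats import pearsonr

# rows = models, columns = prompts; 1 = correct answer, 0 = incorrect
correct = np.array([
    [1, 1, 0, 1, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 1, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1],
])

n_prompts = correct.shape[1]
for i in range(n_prompts):
    for j in range(i + 1, n_prompts):
        # correlate the column of outcomes for prompt i with that for prompt j
        r, p = pearsonr(correct[:, i], correct[:, j])
        print(f"prompt {i} vs prompt {j}: r = {r:+.2f}, p = {p:.3f}")
```

Systematically strong pairwise correlations of this kind are what the paper reports as evidence of non-random relationships between prompts.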
The paper introduces a novel method to assess the robustness and adequacy of benchmarks by analyzing the performance of multiple LLMs on four major benchmarks. Key contributions include:
- Observing significant correlations in model performance across prompts.
- Exploring the impact of different distributional assumptions on model comparisons, showing performance shifts of up to 10% and ranking changes of up to five positions (a sketch of this kind of re-weighting follows this list).
- Characterizing performance over the distribution of all possible prompt weights.
- Showing that model performance similarity is driven by semantic similarity and common failure points (a sketch relating the two appears at the end of this summary).
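The re-weighting contributions (the second and third bullets above) can be sketched as follows. This is an assumed illustration rather than the paper's implementation: prompt weights are drawn from a flat Dirichlet distribution so that every weighting of the test prompts is equally likely, and model scores and rankings are recomputed under each weighting; the correctness matrix is synthetic.

```python
# Minimal sketch (assumptions: flat Dirichlet over prompt weights, synthetic data):
# how different distributional assumptions over prompts shift scores and rankings.
import numpy as np

rng = np.random.default_rng(0)

# rows = models, columns = prompts; 1 = correct, 0 = incorrect (synthetic)
correct = rng.integers(0, 2, size=(5, 100)).astype(float)

# Standard benchmark scoring: every prompt gets equal weight.
uniform_score = correct.mean(axis=1)
uniform_rank = np.argsort(-uniform_score)

# Sample many prompt-weight vectors; each one encodes a different assumption
# about which prompts matter most for a given use case.
weights = rng.dirichlet(np.ones(correct.shape[1]), size=10_000)
weighted_scores = correct @ weights.T  # shape: (n_models, n_weightings)

# Count how often the model ranking differs from the equal-weight ranking.
changed = sum(
    not np.array_equal(np.argsort(-weighted_scores[:, s]), uniform_rank)
    for s in range(weighted_scores.shape[1])
)
print(f"ranking differs from the equal-weight ranking in {changed / 10_000:.1%} of weightings")
```

Summarizing the spread of `weighted_scores` across all sampled weightings is one way to read the "distribution of all possible prompt weights" contribution.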
The study highlights the importance of considering the distributional assumptions of benchmarks to ensure the reliability of LLM evaluations.
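As a final illustration of the explanatory factors, the sketch below relates semantic similarity between prompts to the correlation of model performance on them. Everything here is an assumption for illustration: the embedding model (`all-MiniLM-L6-v2` via sentence-transformers), the prompts, and the correctness matrix are not taken from the paper, which may use different similarity measures.

```python
# Minimal sketch (illustrative assumptions throughout): do semantically similar
# prompts tend to have more correlated model performance?
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import SentenceTransformer

prompts = [
    "What is the capital of France?",
    "Name the capital city of France.",
    "Compute 12 * 7.",
    "What is 13 * 6?",
]
# rows = models, columns = prompts; 1 = correct, 0 = incorrect (made-up)
correct = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
])

encoder = SentenceTransformer("all-MiniLM-L6-v2")
emb = encoder.encode(prompts, normalize_embeddings=True)

sem_sims, perf_corrs = [], []
for i in range(len(prompts)):
    for j in range(i + 1, len(prompts)):
        sem_sims.append(float(emb[i] @ emb[j]))  # cosine similarity (embeddings are normalized)
        perf_corrs.append(pearsonr(correct[:, i], correct[:, j])[0])

# A positive rank correlation here would echo the paper's finding that semantic
# similarity helps explain performance similarity across prompts.
print(spearmanr(sem_sims, perf_corrs))
```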