tinyBenchmarks: evaluating LLMs with fewer examples

2024 | Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, Mikhail Yurochkin
This paper introduces tinyBenchmarks, a method for evaluating large language models (LLMs) with far fewer examples than traditional benchmarks require. Popular benchmarks such as MMLU, HELM, and AlpacaEval 2.0 contain thousands to tens of thousands of examples, making evaluation expensive and time-consuming. The authors propose strategies for reducing the number of examples needed for accurate performance estimation; for instance, they show that evaluating an LLM on just 100 curated examples is enough to estimate its performance on MMLU with an average error below 2%.

The paper compares several evaluation strategies: stratified random sampling, clustering of examples based on model correctness, and Item Response Theory (IRT). The IRT-based approach is particularly effective: it fits a statistical model linking each example's difficulty to a model's latent ability, so that performance on the full benchmark can be predicted from responses to a small, carefully chosen anchor set. The authors also release tiny versions of popular benchmarks, each containing 100 examples per scenario, along with IRT-based tools for improving performance estimation.

The strategies are evaluated on four benchmarks: the Open LLM Leaderboard, MMLU, HELM, and AlpacaEval 2.0. Results show that 100 examples per scenario are sufficient to estimate LLM performance reliably, with an average error of about 2%, substantially reducing the computational, environmental, and financial costs of evaluation. The methods remain effective even under distribution shift between the models used to build the tiny benchmarks and the models being evaluated, a common situation in practice.

The paper concludes that a small number of examples, combined with IRT-based methods, enables efficient and reliable LLM evaluation. Beyond lowering cost, this makes frequent testing feasible during fine-tuning and prompt engineering. The authors release tinyBenchmarks and an IRT-based tool to support efficient evaluation of future LLMs.
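The IRT idea can be made concrete with a minimal sketch. The snippet below is illustrative rather than the authors' released code: it assumes a two-parameter logistic (2PL) response model with item parameters `a` (discrimination) and `b` (difficulty) already fit on the correctness records of previously evaluated models, estimates a new model's latent ability from its answers to 100 anchor examples, and predicts full-benchmark accuracy as the mean predicted probability over all items. The synthetic item parameters and all function names are assumptions for illustration only.

```python
import numpy as np
from scipy.special import expit          # logistic sigmoid
from scipy.optimize import minimize_scalar

def estimate_ability(responses, a, b):
    """Maximum-likelihood estimate of a model's latent ability (theta)
    from its 0/1 correctness on a small anchor set, under a 2PL IRT model.
    a, b: discrimination / difficulty of the anchor items (assumed pre-fit)."""
    def neg_log_lik(theta):
        p = np.clip(expit(a * (theta - b)), 1e-9, 1 - 1e-9)
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    return minimize_scalar(neg_log_lik, bounds=(-6, 6), method="bounded").x

def predict_benchmark_accuracy(theta, a_full, b_full):
    """Predicted full-benchmark accuracy: mean predicted probability of
    answering each item correctly at the estimated ability level."""
    return expit(a_full * (theta - b_full)).mean()

# Toy usage with synthetic item parameters (illustrative only).
rng = np.random.default_rng(0)
a_full = rng.uniform(0.5, 2.0, 1000)                 # 1,000-item "benchmark"
b_full = rng.normal(0.0, 1.0, 1000)
anchor_idx = rng.choice(1000, size=100, replace=False)   # 100 anchor examples
true_theta = 0.8
responses = rng.binomial(1, expit(a_full[anchor_idx] * (true_theta - b_full[anchor_idx])))
theta_hat = estimate_ability(responses, a_full[anchor_idx], b_full[anchor_idx])
print(predict_benchmark_accuracy(theta_hat, a_full, b_full))
```

The key design point, as described in the paper, is that full-benchmark results from many existing models are used once to calibrate the items; each new model then only needs to answer the small anchor set.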
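The clustering strategy can be sketched in a similar hedged way. The example below approximates the general idea rather than reproducing the released tinyBenchmarks code: it assumes a binary correctness matrix from previously evaluated models, clusters benchmark items by their correctness profiles, and keeps one representative item per cluster, weighted by cluster size, so that a weighted accuracy over the anchors approximates the full-benchmark score. All names here are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_anchor_items(correctness, n_anchors=100, seed=0):
    """Pick representative benchmark examples by clustering items according
    to which previously evaluated models answered them correctly.

    correctness: (n_models, n_items) binary matrix of past evaluations.
    Returns anchor item indices and weights proportional to cluster sizes.
    """
    item_profiles = correctness.T                      # one row per item
    km = KMeans(n_clusters=n_anchors, n_init=10, random_state=seed).fit(item_profiles)
    anchors, weights = [], []
    for c in range(n_anchors):
        members = np.where(km.labels_ == c)[0]
        # The item closest to the cluster centroid represents the cluster.
        dist = np.linalg.norm(item_profiles[members] - km.cluster_centers_[c], axis=1)
        anchors.append(members[np.argmin(dist)])
        weights.append(len(members) / item_profiles.shape[0])
    return np.array(anchors), np.array(weights)

# Estimated accuracy of a new model from its results on the anchors only
# (new_model_correctness is a hypothetical 0/1 vector over all items):
# anchors, weights = select_anchor_items(correctness)
# est_accuracy = np.dot(weights, new_model_correctness[anchors])
```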