7 Jun 2024 | Felipe Maia Polo, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson Flavio Melo de Oliveira, Yuekai Sun, Mikhail Yurochkin
PromptEval is a method for efficiently evaluating the performance of large language models (LLMs) across many prompt templates. It fits a parametric Item Response Theory (IRT) model that borrows strength across prompts and examples, estimating the full distribution of performance and its quantiles under a limited evaluation budget. On three prominent benchmarks (MMLU, BIG-bench Hard, and LMentry), PromptEval accurately estimates performance quantiles across 100 prompt templates with a budget equivalent to just two single-prompt evaluations. The estimator is theoretically shown to be consistent for the performance distribution and is effective in practice, yielding more accurate and robust estimates than traditional single-prompt evaluation. It also enables identifying the best-performing prompt for a given task, which matters wherever prompt engineering is important. Overall, PromptEval offers a more comprehensive and efficient way to evaluate LLMs across multiple prompt templates.
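To make the borrowing-strength idea concrete, here is a minimal sketch (not the authors' implementation) of a Rasch-style IRT approach on synthetic data: correctness for prompt i on example j is modeled as sigmoid(theta_i - beta_j), fit by logistic regression on a small random subset of prompt-example cells, and the fitted model is then used to impute the unobserved cells and estimate per-prompt accuracy quantiles. The specific model, budget, and hyperparameters below are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic ground truth: 100 prompt templates x 500 examples (illustrative sizes).
n_prompts, n_examples = 100, 500
theta = rng.normal(0.0, 1.0, n_prompts)   # prompt "ability" parameters
beta = rng.normal(0.0, 1.0, n_examples)   # example difficulty parameters
p_true = 1 / (1 + np.exp(-(theta[:, None] - beta[None, :])))

# Budget: observe only as many cells as two full single-prompt evaluations.
budget = 2 * n_examples
rows = rng.integers(0, n_prompts, budget)
cols = rng.integers(0, n_examples, budget)
y = rng.binomial(1, p_true[rows, cols])   # observed 0/1 correctness

# Design matrix: one-hot prompt id + one-hot example id.
# A plain logistic regression on these features is a Rasch-style fit
# (the example coefficients absorb the sign of the difficulty term).
X = np.zeros((budget, n_prompts + n_examples))
X[np.arange(budget), rows] = 1.0
X[np.arange(budget), n_prompts + cols] = 1.0

model = LogisticRegression(C=1.0, fit_intercept=False, max_iter=1000)
model.fit(X, y)

# Predict every prompt-example cell, then average over examples per prompt.
w = model.coef_.ravel()
logits = w[:n_prompts, None] + w[n_prompts:][None, :]
acc_hat = (1 / (1 + np.exp(-logits))).mean(axis=1)

# Compare estimated vs. true quantiles of per-prompt accuracy.
qs = [0.05, 0.25, 0.5, 0.75, 0.95]
print("estimated quantiles:", np.quantile(acc_hat, qs).round(3))
print("true quantiles     :", np.quantile(p_true.mean(axis=1), qs).round(3))
```

The key point the sketch illustrates is that, because every observation informs both a prompt parameter and an example parameter, even a sparse set of evaluations constrains the whole performance matrix, which is what allows quantiles across all 100 templates to be recovered from roughly two prompts' worth of budget.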