7 Jun 2024 | Felipe Maia Polo, Ronald Xu, Lucas Weber, Mirian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson Flavio Melo de Oliveira, Yuekai Sun, Mikhail Yurochkin
The paper "Efficient Multi-Prompt Evaluation of LLMs" addresses the issue of evaluating large language models (LLMs) using a limited set of prompt templates, which can lead to unreliable and inconsistent results. The authors introduce PromptEval, a method that estimates the performance distribution across a large set of prompts, borrowing strength across prompts and examples to produce accurate estimates within practical evaluation budgets. PromptEval is based on Item Response Theory (IRT) and can estimate performance quantiles, providing a more comprehensive evaluation framework. The method is theoretically proven to be consistent and empirically validated on three prominent benchmarks: MMLU, BIG-bench Hard, and LMentry. The paper also discusses the effectiveness of different variations of PromptEval and its application in best-prompt identification tasks. Additionally, it analyzes prompt sensitivity on the MMLU dataset, highlighting the variability in performance across different prompts. Overall, PromptEval offers a more robust and efficient approach to evaluating LLMs, addressing the limitations of traditional evaluation methods.The paper "Efficient Multi-Prompt Evaluation of LLMs" addresses the issue of evaluating large language models (LLMs) using a limited set of prompt templates, which can lead to unreliable and inconsistent results. The authors introduce PromptEval, a method that estimates the performance distribution across a large set of prompts, borrowing strength across prompts and examples to produce accurate estimates within practical evaluation budgets. PromptEval is based on Item Response Theory (IRT) and can estimate performance quantiles, providing a more comprehensive evaluation framework. The method is theoretically proven to be consistent and empirically validated on three prominent benchmarks: MMLU, BIG-bench Hard, and LMentry. The paper also discusses the effectiveness of different variations of PromptEval and its application in best-prompt identification tasks. Additionally, it analyzes prompt sensitivity on the MMLU dataset, highlighting the variability in performance across different prompts. Overall, PromptEval offers a more robust and efficient approach to evaluating LLMs, addressing the limitations of traditional evaluation methods.