What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering

18 Jun 2024 | Federico Errica, Giuseppe Siracusano, Davide Sanvito, Roberto Bifulco
This paper introduces two metrics for evaluating Large Language Models (LLMs): sensitivity and consistency. Sensitivity measures how much predictions change under variations of the prompt, while consistency measures how predictions vary across examples of the same class. The two metrics complement traditional performance measures such as accuracy and help identify failure modes in LLMs.

The authors empirically compare the metrics on text classification tasks across five datasets and three prompting strategies, finding that sensitivity and consistency convey distinct information about LLM behavior and can be used to identify problematic samples and improve prompt engineering. They argue that both metrics should be incorporated into automatic prompt engineering frameworks to improve LLM robustness and performance. They also discuss limitations, including the metrics' applicability to classification tasks only and the trade-off between approximation quality and computational cost. The authors conclude that sensitivity and consistency are valuable tools for evaluating and improving LLMs, particularly in production environments where prompt variations can lead to significant performance differences, and in applications where safety and accuracy are critical.
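The paper's formal definitions are not reproduced here, but the intuition behind both metrics lends itself to a simple empirical estimate. Below is a minimal Python sketch, assuming a hypothetical `predict(prompt, example)` function that returns the model's predicted class label for a given prompt and input; the function names and the majority-vote formulation are illustrative assumptions, not the authors' exact estimators.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical interface: predict(prompt, example) -> predicted class label.
Predictor = Callable[[str, str], str]

def sensitivity(predict: Predictor, prompts: Sequence[str], example: str) -> float:
    """Estimate sensitivity for one example: the fraction of prompt
    variations whose prediction disagrees with the majority prediction.
    0.0 means the prediction is stable across all rephrasings;
    values near 1.0 mean the prediction flips with the prompt."""
    preds = [predict(p, example) for p in prompts]
    _, majority_count = Counter(preds).most_common(1)[0]
    return 1.0 - majority_count / len(preds)

def consistency(predict: Predictor, prompt: str, examples: Sequence[str]) -> float:
    """Estimate consistency for one class under a fixed prompt: the
    fraction of same-class examples that receive the majority prediction.
    1.0 means the model treats all examples of the class alike."""
    preds = [predict(prompt, ex) for ex in examples]
    _, majority_count = Counter(preds).most_common(1)[0]
    return majority_count / len(preds)
```

Under this sketch, low sensitivity and high consistency together indicate a prompt whose predictions are stable both across rephrasings and within a class; examples or prompts scoring poorly on either metric are candidates for the problematic samples the authors describe.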