18 Jun 2024 | Federico Errica, Giuseppe Siracusano, Davide Sanvito, and Roberto Bifulco
The paper "What Did I Do Wrong? Quantifying LLMs’ Sensitivity and Consistency to Prompt Engineering" by Federico Errica, Giuseppe Siracusano, Davide Sanvito, and Roberto Bifulco from NEC Laboratories Europe introduces two new metrics—*sensitivity* and *consistency*—to evaluate the behavior of Large Language Models (LLMs) in response to variations in prompts. These metrics complement traditional task performance metrics like accuracy, providing a more comprehensive understanding of LLMs' robustness and consistency.
**Sensitivity** measures how much predictions change when the prompt is rephrased, and it requires no ground-truth labels. **Consistency** assesses how predictions vary across samples belonging to the same class. The authors argue that LLMs with low sensitivity and high consistency are more reliable in production environments, as they are less prone to hallucinations and show a better understanding of the task.
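To make the two metrics concrete, here is a minimal Python sketch of how they could be estimated for a classification task. It assumes a user-supplied `predict(prompt, text)` callable that wraps the LLM and returns a predicted label; the simplified definitions used here (sensitivity as the fraction of prompt rephrasings that flip a prediction, consistency as the average within-class agreement with the majority prediction) are illustrative stand-ins, not the paper's exact formulations.

```python
from collections import Counter
from typing import Callable, Sequence


def sensitivity(predict: Callable[[str, str], str],
                prompts: Sequence[str],
                texts: Sequence[str]) -> float:
    """Label-free: fraction of (text, rephrasing) pairs whose prediction
    differs from the one obtained with the reference prompt (prompts[0])."""
    base, rephrasings = prompts[0], prompts[1:]
    flips = total = 0
    for x in texts:
        base_pred = predict(base, x)
        for p in rephrasings:
            flips += predict(p, x) != base_pred
            total += 1
    return flips / max(total, 1)


def consistency(predict: Callable[[str, str], str],
                prompt: str,
                texts: Sequence[str],
                labels: Sequence[str]) -> float:
    """Per-class agreement: how often samples of the same ground-truth class
    receive that class's majority prediction, averaged over classes."""
    by_class: dict[str, list[str]] = {}
    for x, y in zip(texts, labels):
        by_class.setdefault(y, []).append(predict(prompt, x))
    agreements = [
        Counter(preds).most_common(1)[0][1] / len(preds)
        for preds in by_class.values()
    ]
    return sum(agreements) / len(agreements)
```

Under these simplified definitions, a low sensitivity score means rephrasing the prompt rarely changes predictions, and a high consistency score means samples of the same class are treated alike; both can be tracked alongside accuracy while iterating on a prompt.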
The paper includes an empirical comparison of these metrics on text classification tasks, covering five datasets and four LLMs (two open-source and two closed-source). The results show that sensitivity and consistency offer complementary insights into LLM behavior, and that developers can use them to guide prompt engineering and to choose the most suitable LLM for a given use case.
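As a hypothetical illustration of that workflow, a developer could collect the three scores for each candidate prompt (or model) and rank the candidates. The `PromptReport` container and the ranking criterion below are assumptions made for the example, not a selection rule prescribed by the paper.

```python
from dataclasses import dataclass


@dataclass
class PromptReport:
    """Hypothetical per-candidate record: one prompt (or model) and its scores."""
    name: str
    accuracy: float      # higher is better
    sensitivity: float   # lower is better
    consistency: float   # higher is better


def rank_candidates(reports: list[PromptReport]) -> list[PromptReport]:
    """Sort by accuracy, breaking ties with low sensitivity, then high consistency."""
    return sorted(reports, key=lambda r: (-r.accuracy, r.sensitivity, -r.consistency))
```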
The authors also discuss related work on spurious features, uncertainty quantification, and prompt optimization, highlighting the limitations and future directions of their proposed metrics. They conclude by emphasizing the importance of integrating these metrics into automatic prompt engineering frameworks to improve the reliability and trustworthiness of AI systems.