Mind Your Format: Towards Consistent Evaluation of In-Context Learning Improvements

6 Jun 2024 | Anton Voronov, Lena Wolf, Max Ryabinin
This paper investigates how the choice of prompt template affects in-context learning (ICL) performance in large language models (LLMs). Evaluating 21 models across four classification datasets, the authors find that template choice has a large impact on accuracy: a poor template can reduce even the strongest models to random-guess performance, and the best template for one model or method rarely transfers to another setup or model.

Because current evaluation practices often ignore template selection, reported improvements can be misleading: the relative performance of ICL methods can vary significantly depending on the template used. To address this, the authors propose Template Ensembles, a test-time augmentation method that aggregates a model's predictions across multiple templates. Template Ensembles increase average performance while remaining robust to the choice of random templates, and the authors put them forward as a baseline for improving template robustness in ICL. The paper concludes by calling for more consistent evaluation practices that explicitly account for template sensitivity.
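To make the aggregation idea concrete, below is a minimal sketch of template ensembling for a classification task: the same input is rendered with several prompt templates, per-class probabilities are scored with a causal language model, and the class probabilities are averaged before the final prediction. The model name, template strings, and label verbalizers are illustrative assumptions, not the paper's exact choices, and single-token verbalizers are assumed for simplicity.

```python
# Sketch of Template Ensembles: average class probabilities over several
# prompt templates at test time. Templates, verbalizers, and the model are
# placeholders chosen for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# Hypothetical templates for binary sentiment classification.
TEMPLATES = [
    "Review: {text}\nSentiment:",
    "Input: {text}\nLabel:",
    "{text}\nThe sentiment of this review is",
]
LABEL_WORDS = [" negative", " positive"]  # one verbalizer per class


@torch.no_grad()
def class_log_probs(prompt: str) -> torch.Tensor:
    """Log-probabilities of each label word as the next token after the prompt."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    next_token_logits = model(input_ids).logits[0, -1]
    log_probs = torch.log_softmax(next_token_logits, dim=-1)
    label_ids = [tokenizer.encode(w)[0] for w in LABEL_WORDS]
    return log_probs[label_ids]


def ensemble_predict(text: str) -> int:
    """Average per-class probabilities across all templates, then take argmax."""
    per_template = torch.stack(
        [class_log_probs(t.format(text=text)).softmax(dim=-1) for t in TEMPLATES]
    )
    return int(per_template.mean(dim=0).argmax())


print(ensemble_predict("The movie was a complete waste of time."))
```

Averaging over templates smooths out the variance introduced by any single (possibly poor) template, which is the robustness effect the paper attributes to Template Ensembles.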