Mind Your Format: Towards Consistent Evaluation of In-Context Learning Improvements
6 Jun 2024 | Anton Voronov, Lena Wolf, Max Ryabinin
This paper investigates the impact of prompt template format on the performance of in-context learning (ICL) in large language models. The authors conduct a comprehensive study across 21 models and 4 standard classification datasets, demonstrating that a poor template choice can degrade even the strongest models to random-guess performance. They also find that the best templates do not transfer between setups, or even between models of the same family. To mitigate this template sensitivity, the paper proposes *Template Ensembles*, a method that aggregates model predictions across multiple templates. This approach improves average performance and is robust to the choice of random templates. The findings highlight the need for more consistent evaluation methods in ICL research to avoid misleading results caused by varying template selections.
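Since this summary only states that Template Ensembles aggregate predictions across multiple templates, the Python sketch below shows one plausible instantiation: averaging per-class probabilities over a set of templates and predicting the argmax. The template strings, label set, and `score_classes` function are hypothetical placeholders for illustration, not the authors' implementation.

```python
# Minimal sketch of a template ensemble: average a model's per-class
# probabilities over several prompt templates, then predict the argmax.
# All names here are illustrative assumptions, not the paper's code.

from collections import defaultdict

# Hypothetical templates for a sentiment classification task.
TEMPLATES = [
    "Review: {text}\nSentiment: ",
    "Input: {text}\nLabel: ",
    "{text}\nThe sentiment of this review is ",
]

LABELS = ["positive", "negative"]


def score_classes(prompt: str) -> dict[str, float]:
    """Hypothetical model call: return P(label | prompt) for each label.

    Replace this stub with real per-label probabilities from your model,
    e.g. a softmax over the logits of the label verbalizer tokens.
    """
    # Dummy scores so the sketch runs end to end.
    return {"positive": 0.6, "negative": 0.4}


def template_ensemble_predict(text: str) -> str:
    """Aggregate per-class probabilities across templates, then argmax."""
    totals: dict[str, float] = defaultdict(float)
    for template in TEMPLATES:
        prompt = template.format(text=text)
        for label, prob in score_classes(prompt).items():
            totals[label] += prob
    # Averaging divides every total by the same constant, so it does not
    # change the argmax; we divide anyway for clarity.
    return max(LABELS, key=lambda label: totals[label] / len(TEMPLATES))


if __name__ == "__main__":
    print(template_ensemble_predict("A delightful, sharply written film."))
```

In this formulation, a single badly performing template is diluted by the others, which is consistent with the paper's finding that the ensemble is robust to randomly chosen templates.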
[slides and audio] Mind Your Format: Towards Consistent Evaluation of In-Context Learning Improvements