Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity


3 Mar 2022 | Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, Pontus Stenetorp
This paper investigates how sensitive few-shot learning performance is to the order of the training samples in the prompt for large pretrained language models (PLMs) such as GPT-3. It shows that sample order can significantly affect performance, with some permutations yielding near-state-of-the-art results and others performing close to random chance. This order sensitivity is present across model sizes and tasks, and a permutation that works well for one model is not necessarily good for another; performant prompts also do not transfer reliably across model sizes or tasks, so order sensitivity remains a challenge even for the largest models.

To address this without needing additional labeled data, the paper proposes automatically constructing a "probing set" by exploiting the generative nature of the language model itself, and then using entropy-based metrics over that probing set to identify performant prompt orderings. This method achieves an average 13% relative improvement across eleven text classification tasks, and the paper concludes that it is applicable and effective across different sizes of pretrained language models and different types of datasets.
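To make the ranking step described above concrete, the sketch below is an illustrative, simplified rendering of an entropy-based (GlobalE-style) ordering score, not the authors' implementation. It assumes a hypothetical `label_probabilities(prompt, query, labels)` callable standing in for a real language-model scoring call, takes the model-generated probing inputs as given rather than showing how they are sampled, and uses a placeholder prompt template.

```python
# Illustrative sketch: rank few-shot prompt orderings by the entropy of the
# model's predicted-label distribution over a probing set (GlobalE-style).
# `label_probabilities` is a hypothetical stand-in for a real LM scoring call.
import itertools
import math
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[str, str]  # (input text, label)
LabelProbs = Callable[[str, str, Sequence[str]], Dict[str, float]]


def build_prompt(ordering: Sequence[Example],
                 template: str = "input: {x}\ntype: {y}\n") -> str:
    """Concatenate the training examples, in the given order, into a prompt."""
    return "".join(template.format(x=x, y=y) for x, y in ordering)


def global_entropy(ordering: Sequence[Example],
                   probing_inputs: Sequence[str],
                   labels: Sequence[str],
                   label_probabilities: LabelProbs) -> float:
    """Entropy of the predicted-label distribution over the probing set.

    A low score means the prompt pushes the model toward one label regardless
    of the input, which is the failure mode a bad ordering tends to produce.
    """
    prompt = build_prompt(ordering)
    predictions = Counter()
    for x in probing_inputs:
        probs = label_probabilities(prompt, x, labels)
        predictions[max(labels, key=probs.get)] += 1
    total = sum(predictions.values())
    return -sum((c / total) * math.log(c / total)
                for c in predictions.values() if c > 0)


def select_ordering(train_examples: Sequence[Example],
                    probing_inputs: Sequence[str],
                    labels: Sequence[str],
                    label_probabilities: LabelProbs) -> List[Example]:
    """Return the permutation of training examples with the highest entropy score.

    Enumerating all permutations is only feasible for small shot counts
    (4-shot gives 24 orderings); for larger sets one would sample candidates.
    """
    candidates = itertools.permutations(train_examples)
    best = max(candidates,
               key=lambda perm: global_entropy(perm, probing_inputs,
                                               labels, label_probabilities))
    return list(best)
```

The intuition behind scoring by entropy is that a heavily skewed predicted-label distribution over the probing set signals a biased prompt, so orderings whose predictions are more evenly spread across the label set are preferred.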