Calibrate Before Use: Improving Few-Shot Performance of Language Models


10 Jun 2021 | Tony Z. Zhao * 1 Eric Wallace * 1 Shi Feng 2 Dan Klein 1 Sameer Singh 3
This paper investigates the instability of few-shot learning in large language models (LLMs) like GPT-3, where accuracy can vary significantly with the prompt format, the choice of training examples, and their order. The authors show that GPT-3's performance is highly sensitive to these factors, with accuracy ranging from near chance to near state-of-the-art.

They identify three main biases in LMs that contribute to this instability: majority label bias (favoring labels that appear frequently in the prompt), recency bias (favoring answers near the end of the prompt), and common token bias (favoring tokens that are frequent in the pre-training data). These biases shift the model's output distribution, leading to inconsistent performance.

To address this, the authors propose "contextual calibration," a method that adjusts the model's output probabilities to reduce bias. The model's bias is first estimated using a content-free input (e.g., "N/A"), and calibration parameters are then fit so that the model's predictions for that input are uniform across the answer options. This approach significantly improves accuracy and reduces variance across different prompts and training examples.

Experiments show that contextual calibration improves GPT-3 and GPT-2's average accuracy by up to 30.0% and reduces variance. It also mitigates the need for prompt engineering, making LMs more reliable for few-shot learning. The method is simple, data-free, and effective across various tasks, including text classification, fact retrieval, and information extraction. The results suggest that contextual calibration is a valuable tool for improving the performance of LMs in few-shot learning scenarios.
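As a rough illustration of the procedure described above, the sketch below estimates the bias from a content-free input and reweights the label probabilities so that the content-free prediction becomes uniform. The `get_label_probs` helper and the prompt-template interface are assumptions made for illustration, not the authors' released code.

```python
# Minimal sketch of contextual calibration, assuming a helper
# get_label_probs(prompt) -> np.ndarray that queries the language model and
# returns the (unnormalized) probabilities it assigns to each candidate
# answer token. This helper is hypothetical and must be supplied by the user.
import numpy as np

def fit_contextual_calibration(get_label_probs, prompt_template,
                               content_free_inputs=("N/A",)):
    """Estimate the model's bias from content-free input(s) and return a
    diagonal reweighting matrix W such that W @ p_cf is uniform."""
    p_cf = np.mean([get_label_probs(prompt_template.format(x))
                    for x in content_free_inputs], axis=0)
    p_cf = p_cf / p_cf.sum()          # normalized bias estimate
    return np.diag(1.0 / p_cf)        # W = diag(p_cf)^-1, with b = 0

def calibrated_predict(get_label_probs, prompt_template, W, test_input):
    """Apply the calibration to a real test input and return the argmax label index."""
    p = get_label_probs(prompt_template.format(test_input))
    p = p / p.sum()
    q = W @ p                         # reweight the raw label probabilities
    return int(np.argmax(q))          # renormalizing q is unnecessary for argmax
```

Here `prompt_template` would contain the few-shot training examples followed by a `{}` slot for the test input, and `get_label_probs` would wrap whatever LM API is in use; only the reweighting step itself reflects the calibration idea described in the paper.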