Calibrate Before Use: Improving Few-Shot Performance of Language Models


10 Jun 2021 | Tony Z. Zhao * 1 Eric Wallace * 1 Shi Feng 2 Dan Klein 1 Sameer Singh 3
This paper investigates the instability of few-shot learning in large language models (LLMs) like GPT-3, where accuracy can vary significantly with the prompt format, the choice of training examples, and their order. The authors show that GPT-3's performance is highly sensitive to these factors, with accuracy ranging from near chance to near state-of-the-art.

They identify three main biases in LMs that contribute to this instability: majority label bias (favoring labels that appear frequently in the prompt), recency bias (favoring answers near the end of the prompt), and common token bias (favoring tokens that are frequent in the pre-training data). These biases shift the model's output distribution, leading to inconsistent performance.

To address this, the authors propose "contextual calibration," a method that adjusts the model's output probabilities to reduce bias. The model's bias is first estimated using a content-free input (e.g., "N/A"), and calibration parameters are then fit so that the model's predictions for that input are uniform across the answer options. This approach significantly improves accuracy and reduces variance across different prompts and training examples.

Experiments show that contextual calibration improves GPT-3 and GPT-2's average accuracy by up to 30.0% and reduces variance. It also mitigates the need for prompt engineering, making LMs more reliable for few-shot learning. The method is simple, data-free, and effective across various tasks, including text classification, fact retrieval, and information extraction. The results suggest that contextual calibration is a valuable tool for improving the performance of LMs in few-shot learning scenarios.
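As a rough illustration of the procedure described above, the sketch below estimates the bias from a content-free input and reweights the label probabilities so that the content-free prediction becomes uniform. The `get_label_probs` helper and the prompt-template interface are assumptions made for illustration, not the authors' released code.

```python
# Minimal sketch of contextual calibration, assuming a helper
# get_label_probs(prompt) -> np.ndarray that queries the language model and
# returns the (unnormalized) probabilities it assigns to each candidate
# answer token. This helper is hypothetical and must be supplied by the user.
import numpy as np

def fit_contextual_calibration(get_label_probs, prompt_template,
                               content_free_inputs=("N/A",)):
    """Estimate the model's bias from content-free input(s) and return a
    diagonal reweighting matrix W such that W @ p_cf is uniform."""
    p_cf = np.mean([get_label_probs(prompt_template.format(x))
                    for x in content_free_inputs], axis=0)
    p_cf = p_cf / p_cf.sum()          # normalized bias estimate
    return np.diag(1.0 / p_cf)        # W = diag(p_cf)^-1, with b = 0

def calibrated_predict(get_label_probs, prompt_template, W, test_input):
    """Apply the calibration to a real test input and return the argmax label index."""
    p = get_label_probs(prompt_template.format(test_input))
    p = p / p.sum()
    q = W @ p                         # reweight the raw label probabilities
    return int(np.argmax(q))          # renormalizing q is unnecessary for argmax
```

Here `prompt_template` would contain the few-shot training examples followed by a `{}` slot for the test input, and `get_label_probs` would wrap whatever LM API is in use; only the reweighting step itself reflects the calibration idea described in the paper.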