Date:2022-12-06
Authors:Hyung Won Chung*, Le Hou*, Shayne Longpre*, Barret Zoph†, Yi Tay†, William Fedus†, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, Jason Wei*
Pages:54
Summary:This paper studies instruction finetuning of language models along three axes: scaling the number of finetuning tasks, scaling the model size, and finetuning on chain-of-thought (CoT) data. The authors find that instruction finetuning with these ingredients substantially improves performance across model families (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation, RealToxicityPrompts). Notably, Flan-PaLM 540B outperforms PaLM 540B by a large margin (+9.4% on average) and achieves state-of-the-art results on several benchmarks, such as 75.2% on five-shot MMLU. The paper also highlights the importance of CoT data: including just nine CoT datasets in the finetuning mixture improves performance on all evaluations. The authors additionally release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models such as PaLM 62B. Overall, the results show that instruction finetuning is a general method for improving the performance and usability of pretrained language models.
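Since the summary notes that the Flan-T5 checkpoints were publicly released, here is a minimal sketch of querying one zero-shot via the Hugging Face transformers library; it assumes the public `google/flan-t5-*` checkpoint names on the Hugging Face Hub and an illustrative prompt (neither is stated in this summary):

```python
# A minimal sketch: zero-shot inference with a released Flan-T5 checkpoint.
# Assumes the "google/flan-t5-large" Hub name; smaller/larger variants
# (flan-t5-small/base/xl/xxl) follow the same pattern.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical prompt. Appending "Let's think step by step." asks the model
# for chain-of-thought reasoning, one of the behaviors the paper evaluates.
prompt = (
    "Q: A robe takes 2 bolts of blue fiber and half that much white fiber. "
    "How many bolts does it take in total? Let's think step by step."
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the model was instruction-finetuned, a plain natural-language instruction suffices; no few-shot exemplars or task-specific finetuning are required to get a reasonable answer.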