Scaling Instruction-Finetuned Language Models

6 Dec 2022 | Hyung Won Chung*, Le Hou*, Shayne Longpre*, Barret Zoph†, Yi Tay†, William Fedus†, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, Jason Wei*
This paper explores instruction finetuning of language models along three axes: scaling the number of finetuning tasks, scaling model size, and incorporating chain-of-thought (CoT) data. Instruction finetuning substantially improves performance across model families (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation, RealToxicityPrompts). For example, Flan-PaLM 540B, instruction-finetuned on 1.8K tasks, outperforms PaLM 540B by +9.4% on average and reaches state-of-the-art results on several benchmarks, including 75.2% on five-shot MMLU. The Flan-T5 checkpoints also show strong few-shot performance, even compared with much larger models such as PaLM 62B.

The study further shows that instruction finetuning strengthens reasoning: including CoT data in the finetuning mixture markedly improves performance on CoT tasks and on multilingual benchmarks. It also improves usability, enabling zero-shot reasoning without prompt engineering or few-shot exemplars.

Instruction finetuning is also computationally efficient, requiring only a small fraction of the pretraining compute, which makes it a practical way to improve language models without increasing model size. Overall, the results indicate that instruction finetuning is a general method for improving the performance and usability of pretrained language models, and they underscore the value of scaling both the number of tasks and model size and of mixing in CoT data. The authors also call for further research into how instruction finetuning generalizes across model architectures and pretraining objectives.
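To make the finetuning recipe concrete, the snippet below is a minimal sketch, assuming Hugging Face transformers and PyTorch, of seq2seq instruction finetuning on a toy task mixture that combines plain instruction examples with chain-of-thought examples. The checkpoint name, examples, and hyperparameters are illustrative stand-ins, not the paper's actual 1.8K-task Flan collection or training setup.

```python
# Minimal sketch of instruction finetuning a seq2seq model on a mixture of
# plain instruction examples and chain-of-thought (CoT) examples.
# Assumes Hugging Face `transformers` and `torch`; data and hyperparameters
# below are illustrative, not the Flan collection used in the paper.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "t5-small"  # illustrative stand-in; the paper finetunes T5/PaLM-scale models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Toy "task mixture": a plain instruction example plus a CoT example whose
# target spells out the rationale before the final answer.
examples = [
    {"input": "Translate to German: The weather is nice today.",
     "target": "Das Wetter ist heute schön."},
    {"input": "Q: Roger has 3 apples and buys 2 more. How many apples does he "
              "have? Let's think step by step.",
     "target": "Roger starts with 3 apples and buys 2 more, so 3 + 2 = 5. "
               "The answer is 5."},
]

def collate(batch):
    # Tokenize inputs and targets; mask padding tokens out of the loss.
    enc = tokenizer([ex["input"] for ex in batch], padding=True,
                    truncation=True, return_tensors="pt")
    labels = tokenizer([ex["target"] for ex in batch], padding=True,
                       truncation=True, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100
    enc["labels"] = labels
    return enc

loader = DataLoader(examples, batch_size=2, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

model.train()
for epoch in range(3):  # finetuning uses far less compute than pretraining
    for batch in loader:
        loss = model(**batch).loss  # standard cross-entropy on target tokens
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```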
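Because the resulting checkpoints follow instructions directly, they can be prompted zero-shot without exemplars. The sketch below uses the publicly released google/flan-t5-base checkpoint via Hugging Face transformers; the checkpoint size and prompt are illustrative.

```python
# Minimal sketch of zero-shot chain-of-thought prompting with a released
# Flan-T5 checkpoint. Assumes Hugging Face `transformers`; no few-shot
# exemplars are provided, only the instruction itself.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# The instruction plus "Let's think step by step." is enough to elicit a
# rationale from the instruction-finetuned model.
prompt = ("A juggler has 16 balls. Half of the balls are golf balls, and half "
          "of the golf balls are blue. How many blue golf balls are there? "
          "Let's think step by step.")

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```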