Date:2022-12-06
Authors:Hyung Won Chung*, Le Hou*, Shayne Longpre*, Barret Zoph†, Yi Tay†, William Fedus†, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, Jason Wei*
Pages:54
Summary:This paper studies instruction finetuning of language models along three axes: scaling the number of finetuning tasks, scaling the model size, and finetuning on chain-of-thought (CoT) data. The authors find that instruction finetuning with these ingredients substantially improves performance across model families (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation, RealToxicityPrompts). Notably, Flan-PaLM 540B outperforms PaLM 540B by a large margin (+9.4% on average) and achieves state-of-the-art results on several benchmarks, such as 75.2% on five-shot MMLU. The paper also highlights the importance of CoT data: including just nine CoT datasets in the finetuning mixture improves performance on all evaluations. The authors additionally release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models such as PaLM 62B. Overall, the results show that instruction finetuning is a general method for improving the performance and usability of pretrained language models.
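Since the summary notes that the Flan-T5 checkpoints were publicly released, here is a minimal sketch of querying one zero-shot via the Hugging Face transformers library; it assumes the public `google/flan-t5-*` checkpoint names on the Hugging Face Hub and an illustrative prompt (neither is stated in this summary):

```python
# A minimal sketch: zero-shot inference with a released Flan-T5 checkpoint.
# Assumes the "google/flan-t5-large" Hub name; smaller/larger variants
# (flan-t5-small/base/xl/xxl) follow the same pattern.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical prompt. Appending "Let's think step by step." asks the model
# for chain-of-thought reasoning, one of the behaviors the paper evaluates.
prompt = (
    "Q: A robe takes 2 bolts of blue fiber and half that much white fiber. "
    "How many bolts does it take in total? Let's think step by step."
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the model was instruction-finetuned, a plain natural-language instruction suffices; no few-shot exemplars or task-specific finetuning are required to get a reasonable answer.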