8 Feb 2022 | Jason Wei*, Maarten Bosma*, Vincent Y. Zhao*, Kelvin Guu*, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le
This paper presents FLAN, a language model trained through instruction tuning, which significantly improves zero-shot learning performance on unseen tasks. The model is created by fine-tuning a 137B parameter pretrained language model on over 60 NLP datasets described via natural language instructions. FLAN outperforms GPT-3 on 20 out of 25 evaluated datasets and surpasses few-shot GPT-3 on several tasks. Ablation studies show that the number of instruction tuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.
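The core data-construction step is rendering each labeled dataset as natural language instructions, with several templates per task to add phrasing diversity. A minimal sketch of this idea for an NLI dataset follows; the template wordings and field names here are illustrative assumptions, not the paper's exact templates.

```python
# Hypothetical sketch of FLAN-style instruction formatting.
# Templates and the example schema are illustrative, not the paper's exact ones.

NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? OPTIONS: yes, no",
    "{premise}\nBased on the paragraph above, can we conclude that "
    "\"{hypothesis}\"? OPTIONS: yes, no",
    "Read the premise and decide whether the hypothesis follows.\n"
    "Premise: {premise}\nHypothesis: {hypothesis}\nOPTIONS: yes, no",
]

def to_instruction_examples(example: dict, templates: list) -> list:
    """Render one labeled NLI example under every instruction template,
    producing (input, target) text pairs for finetuning."""
    answer = "yes" if example["label"] == "entailment" else "no"
    return [
        {
            "input": t.format(premise=example["premise"],
                              hypothesis=example["hypothesis"]),
            "target": answer,
        }
        for t in templates
    ]

sample = {
    "premise": "A dog is running in the park.",
    "hypothesis": "An animal is outside.",
    "label": "entailment",
}
formatted = to_instruction_examples(sample, NLI_TEMPLATES)
print(formatted[0]["input"])
print("Target:", formatted[0]["target"])
```

Each training example thus appears under multiple instruction phrasings, which is what lets the model generalize to instruction wordings it has never seen at evaluation time.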
FLAN is evaluated on tasks such as natural language inference (NLI), reading comprehension, closed-book QA, translation, commonsense reasoning, and coreference resolution. It performs well on tasks naturally verbalized as instructions, such as NLI, QA, and translation, but less so on tasks directly formulated as language modeling, such as commonsense reasoning and coreference resolution. FLAN also performs well with few-shot exemplars and shows improved performance when combined with prompt tuning.
The study shows that instruction tuning is effective in improving zero-shot learning performance, especially for large models; for smaller models, it can actually hurt performance. The paper also discusses the ethical and environmental considerations of using large language models, including the potential for bias in labeled datasets and the energy cost of training large models. The authors conclude that instruction tuning is a promising approach for improving the ability of language models to perform zero-shot tasks based on instructions.