6 Apr 2020 | Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou
This paper introduces MINILM, a task-agnostic knowledge distillation method for compressing large pre-trained Transformer-based language models. The key idea is to train a small student model by deeply mimicking the self-attention modules of the large teacher model. Only the self-attention module of the teacher's last Transformer layer is distilled, which is effective and also flexible for the student, since it avoids mapping teacher layers to student layers and places no constraint on the number of student layers. In addition to the attention distributions used in existing work, the scaled dot-product between values in the self-attention module is introduced as new deep self-attention knowledge, called the value relation. The paper also shows that introducing a teacher assistant, an intermediate-size model distilled from the teacher and then used to teach the final student, helps the distillation of large pre-trained Transformer models.
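To make the objective concrete, here is a minimal PyTorch-style sketch of the training loss described above. The function names and tensor layout are illustrative rather than taken from the released code; it assumes the queries, keys, and values of the last Transformer layer are available for both models as [batch, heads, seq_len, head_dim] tensors, and that teacher and student use the same number of attention heads and the same input sequence.

```python
import torch
import torch.nn.functional as F

def self_attention_relations(q, k, v):
    """Attention distributions and value relations from one self-attention module.

    q, k, v: [batch, num_heads, seq_len, head_dim] taken from the last
    Transformer layer. Both outputs are [batch, num_heads, seq_len, seq_len],
    so they are comparable across models with different hidden sizes.
    """
    scale = q.size(-1) ** 0.5
    attn = F.softmax(q @ k.transpose(-1, -2) / scale, dim=-1)     # queries-keys
    val_rel = F.softmax(v @ v.transpose(-1, -2) / scale, dim=-1)  # values-values
    return attn, val_rel


def deep_self_attention_distillation_loss(teacher_qkv, student_qkv):
    """KL(teacher || student) over attention distributions plus value relations,
    averaged over heads and positions (a sketch of the training objective)."""
    t_attn, t_vr = self_attention_relations(*teacher_qkv)
    s_attn, s_vr = self_attention_relations(*student_qkv)

    def kl(p_teacher, p_student):
        # F.kl_div expects log-probabilities on the student side.
        log_q = p_student.clamp_min(1e-12).log()
        return F.kl_div(log_q.flatten(0, 2), p_teacher.flatten(0, 2),
                        reduction="batchmean")

    return kl(t_attn, s_attn) + kl(t_vr, s_vr)
```

Because both the attention distributions and the value relations are seq_len-by-seq_len matrices per head, the loss does not depend on either model's hidden size; this is what lets the student use an arbitrary hidden dimension without any extra projection parameters.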
Experimental results demonstrate that the monolingual MINILM outperforms state-of-the-art baselines across different student model sizes. Using 50% of the teacher model's parameters and computations, it retains more than 99% accuracy on SQuAD 2.0 and several GLUE benchmark tasks, while running faster than the original BERT_BASE. The multilingual MINILM also achieves competitive results, so the method is effective for both monolingual and multilingual pre-trained models, and the student can use arbitrary hidden dimensions without introducing additional parameters. Compared with previous approaches, the method achieves better performance on various downstream tasks, including extractive question answering and the GLUE benchmark. It is also effective for smaller student models and can be applied to compress larger pre-trained models. The paper concludes that the proposed deep self-attention distillation is a simple and effective approach for compressing large pre-trained Transformer-based language models.