MINILM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers


6 Apr 2020 | Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou
The paper introduces Deep Self-Attention Distillation (MiniLM), a task-agnostic approach for compressing large pre-trained Transformer models. The key idea is to have the student deeply mimic the self-attention module of the last Transformer layer of the teacher. In addition to the teacher's self-attention distributions, the method transfers the scaled dot-product between values in the self-attention module (the value relation) as extra knowledge. The approach also benefits from a teacher assistant, which helps bridge the size gap between the teacher and the student. Experimental results show that the student, using only 50% of the teacher's parameters and computations, outperforms state-of-the-art baselines and retains more than 99% of the teacher's accuracy on SQuAD 2.0 and several GLUE benchmark tasks. The method is also effective for compressing multilingual pre-trained models, achieving competitive performance with far fewer parameters.
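
At a high level, the training objective consists of two KL-divergence terms computed on the last Transformer layer: one between the teacher's and student's self-attention distributions, and one between their value relations (the softmax of the scaled dot-product of the values with themselves). The PyTorch-style sketch below illustrates this objective; the function names, tensor shapes, and the averaging scheme are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn.functional as F


def attention_distribution(queries, keys, head_dim):
    # Softmax of the scaled dot-product between queries and keys:
    # shape (batch, heads, seq_len, seq_len).
    scores = torch.matmul(queries, keys.transpose(-1, -2)) / head_dim ** 0.5
    return F.softmax(scores, dim=-1)


def value_relation(values, head_dim):
    # Softmax of the scaled dot-product between the values and themselves,
    # the additional knowledge transferred alongside the attention distributions.
    scores = torch.matmul(values, values.transpose(-1, -2)) / head_dim ** 0.5
    return F.softmax(scores, dim=-1)


def kl_divergence(p_teacher, p_student, eps=1e-12):
    # KL(teacher || student), averaged over batch, heads, and query positions.
    kl = (p_teacher * (torch.log(p_teacher + eps) - torch.log(p_student + eps))).sum(-1)
    return kl.mean()


def minilm_distillation_loss(q_t, k_t, v_t, q_s, k_s, v_s, d_t, d_s):
    # q_t/k_t/v_t: per-head queries, keys, values from the teacher's LAST layer,
    # shape (batch, heads, seq_len, head_dim); q_s/k_s/v_s: the student's.
    # d_t and d_s are the per-head dimensions of the teacher and the student.
    attn_loss = kl_divergence(attention_distribution(q_t, k_t, d_t),
                              attention_distribution(q_s, k_s, d_s))
    value_loss = kl_divergence(value_relation(v_t, d_t),
                               value_relation(v_s, d_s))
    return attn_loss + value_loss
```

Because both transferred quantities are seq_len x seq_len relation matrices per attention head, the student's hidden size can differ from the teacher's without any extra projection layer, which is one of the practical advantages the paper highlights over layer-to-layer hidden-state distillation.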