TinyBERT: Distilling BERT for Natural Language Understanding
November 16-20, 2020 | Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu
TinyBERT is a compact model derived from BERT through knowledge distillation, designed to retain most of BERT's accuracy while substantially reducing model size and inference time. The paper introduces a Transformer-based knowledge distillation method and a two-stage learning framework. In the first stage, general distillation, a pre-trained BERT serves as the teacher on a large-scale general-domain corpus and transfers general-domain knowledge to TinyBERT. In the second stage, task-specific distillation, a fine-tuned BERT transfers task-specific knowledge to TinyBERT on augmented task datasets produced by data augmentation.

The distillation objective combines attention-based and hidden state-based distillation of the Transformer layers with embedding-layer and prediction-layer distillation. TinyBERT$_{4}$ achieves over 96.8% of BERT$_{BASE}$'s performance on the GLUE benchmark while being 7.5x smaller and 9.4x faster at inference, and TinyBERT$_{6}$ performs on par with BERT$_{BASE}$ on GLUE tasks. Experiments show that TinyBERT outperforms several state-of-the-art baselines in both performance and efficiency. The two-stage learning framework lets TinyBERT capture both general-domain and task-specific knowledge from BERT, making it a competitive option for deploying BERT-based NLP models on resource-constrained devices.
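To make the layer-wise objectives concrete, below is a minimal PyTorch-style sketch of the three distillation losses mentioned above (attention-based, hidden state-based, and prediction-layer). The function names, the `proj` linear layer, the tensor shapes, and the layer-mapping helper are illustrative assumptions rather than code from the paper; the embedding-layer loss takes the same MSE-with-projection form as the hidden-state loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention_loss(student_attn, teacher_attn):
    """MSE between a student layer's attention matrices and the matched
    teacher layer's, averaged over heads and positions.
    Assumed shape: (batch, heads, seq_len, seq_len)."""
    return F.mse_loss(student_attn, teacher_attn)

def hidden_loss(student_hidden, teacher_hidden, proj):
    """MSE between student hidden states projected into the teacher's
    hidden size (via a learnable linear map) and the teacher's hidden states.
    Assumed shapes: (batch, seq_len, d_student) vs. (batch, seq_len, d_teacher).
    The embedding-layer loss has the same form."""
    return F.mse_loss(proj(student_hidden), teacher_hidden)

def prediction_loss(student_logits, teacher_logits, temperature=1.0):
    """Soft cross-entropy between the teacher's softened output
    distribution and the student's log-probabilities."""
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return -(t_probs * s_log_probs).sum(dim=-1).mean()

# Illustrative layer mapping for a 4-layer student distilled from a
# 12-layer teacher: student layer m learns from teacher layer 3 * m.
def teacher_layer_for(student_layer, ratio=3):
    return student_layer * ratio

# Example projection from a 312-dim student space to a 768-dim teacher space
# (the hidden sizes of TinyBERT_4 and BERT_BASE, respectively).
proj = nn.Linear(312, 768, bias=False)
```

In a training loop, the per-layer attention and hidden-state losses would be summed over the mapped layer pairs during both general and task-specific distillation, with the prediction-layer loss typically added only in the task-specific stage.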