1 Mar 2020 | Victor SANH, Lysandre DEBUT, Julien CHAUMOND, Thomas WOLF
DistilBERT is a smaller, faster, and lighter version of BERT obtained through knowledge distillation. It retains 97% of BERT's language understanding capabilities while running 60% faster at inference and using 40% fewer parameters. The model is pre-trained with a triple loss that combines the masked language modeling loss, a distillation loss over the teacher's soft target probabilities, and a cosine-distance loss that aligns the student's hidden states with the teacher's (a rough sketch of such an objective appears below).

DistilBERT was trained on the same corpus as BERT, on 8 16GB V100 GPUs for approximately 90 hours. On the GLUE benchmark it achieves results comparable to BERT, stays within 0.6% of BERT's accuracy on the IMDb sentiment benchmark, and comes within 3.9 points of BERT on SQuAD.

The model's size and speed make it well suited to edge applications: a demonstration mobile question-answering app runs the full model on-device at a weight of 207 MB. The study shows that knowledge distillation applied during pre-training is an effective way to build smaller, more efficient models that retain most of the performance of larger ones, making DistilBERT a compelling option for on-device and edge deployments.
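To make the triple loss described above more concrete, here is a minimal PyTorch sketch of a distillation objective with the same three terms. It is not the authors' training code: the function name, argument shapes, temperature `T`, and the weights `alpha_ce`, `alpha_mlm`, `alpha_cos` are illustrative assumptions, not the values used to train DistilBERT.

```python
import torch
import torch.nn.functional as F

def triple_loss(student_logits, teacher_logits,
                student_hidden, teacher_hidden,
                labels, T=2.0,
                alpha_ce=1.0, alpha_mlm=1.0, alpha_cos=1.0):
    """Sketch of a triple loss: soft-target distillation +
    masked language modeling + cosine alignment of hidden states.
    Weights and temperature are illustrative only."""
    # Distillation loss: KL divergence between the student's and
    # teacher's softened output distributions (temperature T),
    # rescaled by T**2 as is standard for distillation.
    loss_ce = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)

    # Masked language modeling loss against the true token ids;
    # positions labeled -100 (non-masked tokens) are ignored.
    loss_mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    # Cosine-distance loss pulling the student's hidden states
    # toward the teacher's, token by token.
    s_h = student_hidden.view(-1, student_hidden.size(-1))
    t_h = teacher_hidden.view(-1, teacher_hidden.size(-1))
    target = s_h.new_ones(s_h.size(0))
    loss_cos = F.cosine_embedding_loss(s_h, t_h, target)

    return alpha_ce * loss_ce + alpha_mlm * loss_mlm + alpha_cos * loss_cos
```

In this formulation the distillation term transfers the teacher's full output distribution rather than only its argmax, which is what lets the much smaller student recover most of the teacher's behavior.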
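For readers who want to try the distilled model for question answering (the task used in the on-device demo), here is a minimal sketch using the Hugging Face transformers pipeline API with a DistilBERT checkpoint fine-tuned on SQuAD; the question and context strings are made up for illustration.

```python
from transformers import pipeline

# Load a DistilBERT checkpoint fine-tuned on SQuAD for extractive QA.
qa = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad",
)

result = qa(
    question="How much faster is DistilBERT than BERT?",
    context=(
        "DistilBERT retains 97% of BERT's language understanding "
        "capabilities while being 60% faster and 40% smaller."
    ),
)
print(result["answer"], result["score"])
```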