DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

1 Mar 2020 | Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf
DistilBERT is a distilled version of BERT, designed to be smaller, faster, and cheaper while retaining 97% of BERT's language understanding capabilities. The authors propose a method to pre-train a smaller general-purpose language representation model, DistilBERT, which can then be fine-tuned on a wide range of tasks with good performance. By applying knowledge distillation during the pre-training phase, they reduce the model size by 40% while making it 60% faster. The model is trained with a triple loss combining language modeling, distillation, and cosine-distance losses to leverage the inductive biases learned by the larger teacher model. DistilBERT is shown to perform well on edge devices, with significant speed-ups and reduced computational requirements. The paper also includes ablation studies analyzing the contribution of each component of the triple loss and of student initialization.
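As a rough illustration of how such a triple loss could be assembled, here is a minimal PyTorch sketch. It is not the paper's reference implementation: the function name, the loss weights, and the temperature value are illustrative assumptions, and only the general structure (soft-target distillation + masked language modeling + cosine alignment of hidden states) follows the description above.

```python
import torch
import torch.nn.functional as F

def distillation_triple_loss(student_logits, teacher_logits,
                             student_hidden, teacher_hidden, labels,
                             temperature=2.0,
                             alpha_ce=1.0, alpha_mlm=1.0, alpha_cos=1.0):
    """Sketch of a triple loss for distillation pre-training.
    Weights (alpha_*) and temperature are illustrative, not the paper's values."""
    # 1) Distillation loss: KL divergence between temperature-softened
    #    student and teacher output distributions (scaled by T^2).
    loss_ce = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # 2) Masked language modeling loss on the hard labels
    #    (non-masked positions carry ignore_index=-100).
    loss_mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    # 3) Cosine-distance loss aligning the directions of the student's and
    #    teacher's hidden-state vectors.
    s_h = student_hidden.view(-1, student_hidden.size(-1))
    t_h = teacher_hidden.view(-1, teacher_hidden.size(-1))
    target = torch.ones(s_h.size(0), device=s_h.device)
    loss_cos = F.cosine_embedding_loss(s_h, t_h, target)

    return alpha_ce * loss_ce + alpha_mlm * loss_mlm + alpha_cos * loss_cos
```

In this sketch, the student is trained on the same masked-language-modeling data as the teacher while also matching the teacher's softened output distribution and hidden-state directions, which is how the student can inherit the teacher's inductive biases despite having fewer layers.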