Sequence-Level Knowledge Distillation

22 Sep 2016 | Yoon Kim, Alexander M. Rush
This paper explores the application of knowledge distillation to Neural Machine Translation (NMT) in order to reduce the size and computational requirements of NMT models. The authors introduce two novel sequence-level variants of knowledge distillation, which improve performance and eliminate the need for beam search at test time. The first, sequence-level knowledge distillation, trains the student model on the output of beam search from the teacher model; the second, sequence-level interpolation, trains the student on the teacher-generated sequence that is most similar to the gold target sequence. These methods are shown to be effective: the best student model runs 10 times faster than the state-of-the-art teacher model with minimal loss in performance. In addition, weight pruning is applied to the student model, yielding a model with 13 times fewer parameters than the original teacher model at a cost of only 0.4 BLEU. The paper also discusses the practical implications of these techniques for running NMT systems on devices such as smartphones.
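A brief sketch of the two training objectives as the paper describes them (the notation here is assumed for exposition, not taken from this summary): let q be the teacher distribution, p the student distribution, x the source sentence, y the gold target, ŷ the teacher's beam search output, and 𝒴_K the teacher's K-best list. Sequence-level knowledge distillation approximates the intractable sum over all output sequences with the teacher's approximate mode, while sequence-level interpolation trains on the teacher hypothesis most similar (e.g. by sentence-level BLEU) to the gold target:

\[
\mathcal{L}_{\text{SEQ-KD}} \approx -\log p(\hat{y} \mid x),
\qquad \hat{y} \approx \operatorname*{arg\,max}_{y'} q(y' \mid x) \quad \text{(teacher beam search output)}
\]
\[
\mathcal{L}_{\text{SEQ-INTER}} \approx -\log p(\tilde{y} \mid x),
\qquad \tilde{y} = \operatorname*{arg\,max}_{y' \in \mathcal{Y}_K} \operatorname{sim}(y', y) \quad \text{(K-best hypothesis closest to the gold target)}
\]

In both cases the student is trained with ordinary negative log-likelihood; only the target side of the training data changes, which is what allows cheap greedy decoding to replace beam search at test time.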
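As a complementary illustration, here is a minimal Python sketch of the data-generation view of sequence-level distillation. The teacher.beam_search and student.train_step calls are hypothetical placeholder APIs, not the paper's actual code release.

def build_seq_kd_dataset(teacher, source_sentences, beam_size=5):
    """Replace gold targets with the teacher's beam search outputs (Seq-KD)."""
    distilled = []
    for src in source_sentences:
        # Approximate the teacher's mode with beam search (hypothetical API).
        teacher_hyp = teacher.beam_search(src, beam_size=beam_size)
        distilled.append((src, teacher_hyp))
    return distilled

def train_student(student, distilled_pairs, epochs=10):
    """Ordinary cross-entropy training, but on teacher-generated targets."""
    for _ in range(epochs):
        for src, tgt in distilled_pairs:
            # Negative log-likelihood of the teacher-generated target
            # (hypothetical API).
            student.train_step(src, tgt)

At test time the student can then decode greedily, which is the source of the roughly 10x speedup reported over the beam-searched teacher.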