Sequence-Level Knowledge Distillation


22 Sep 2016 | Yoon Kim, Alexander M. Rush
This paper introduces sequence-level knowledge distillation for neural machine translation (NMT), improving performance while significantly reducing model size. The authors demonstrate that standard knowledge distillation applied to word-level prediction is effective for NMT, and propose two novel sequence-level versions of knowledge distillation that further improve performance. These sequence-level methods seem to eliminate the need for beam search, even when applied to the original teacher model. The best student model runs 10 times faster than its state-of-the-art teacher with little loss in performance. It is also significantly better than a baseline model trained without knowledge distillation, improving by 4.2 BLEU with greedy decoding and 1.7 BLEU with beam search. Applying weight pruning on top of knowledge distillation yields a student model with 13× fewer parameters than the original teacher model, at a cost of only 0.4 BLEU.

The paper explores three ways of applying knowledge distillation to NMT: word-level, sequence-level, and sequence-level interpolation. Word-level knowledge distillation trains the student to match the teacher's per-word output distribution, i.e. a cross-entropy loss against the teacher's soft targets, used in place of (or interpolated with) the usual word-level NLL. Sequence-level knowledge distillation instead trains the student on full translations generated by the teacher with beam search, minimizing sequence-level NLL on these pseudo-targets. Sequence-level interpolation combines the original training data with the teacher-generated data, training the student on the teacher beam hypothesis that is closest to the gold reference.
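To make the three variants concrete, the sketch below illustrates the word-level loss (matching the teacher's per-word distribution), sequence-level KD (training on the teacher's beam-search outputs), and sequence-level interpolation (training on the beam hypothesis closest to the gold reference). This is a minimal PyTorch-style sketch under stated assumptions, not the authors' released code: the tensor shapes, the `beam_search` helper, and the `sim` similarity function (e.g. sentence-level BLEU) are hypothetical stand-ins.

```python
# Hedged sketch of the three distillation variants discussed above.
# `beam_search` and `sim` are hypothetical placeholders, not real library calls.
import torch
import torch.nn.functional as F


def word_level_kd_loss(student_logits, teacher_logits, gold=None, alpha=0.5):
    """Word-level KD: match the teacher's per-word distribution.

    student_logits, teacher_logits: (batch * seq_len, vocab_size)
    gold: optional (batch * seq_len,) gold word indices, for interpolating
          the distillation term with the usual word-level NLL.
    """
    teacher_probs = F.softmax(teacher_logits, dim=-1)      # soft targets
    student_logp = F.log_softmax(student_logits, dim=-1)
    # Cross-entropy between the teacher distribution and the student prediction.
    kd = -(teacher_probs * student_logp).sum(dim=-1).mean()
    if gold is None:
        return kd
    nll = F.nll_loss(student_logp, gold)                    # standard word NLL
    return alpha * kd + (1.0 - alpha) * nll


def make_sequence_level_kd_targets(teacher, src_batches, beam_size=5):
    """Sequence-level KD: decode the training sources with the teacher and
    use the 1-best beam-search outputs as the student's training targets."""
    pseudo_targets = []
    with torch.no_grad():
        for src in src_batches:
            hyps = beam_search(teacher, src, beam_size=beam_size)  # placeholder decoder
            pseudo_targets.append(hyps[0])                         # keep the 1-best hypothesis
    return pseudo_targets


def interpolation_target(teacher, src, gold, sim, beam_size=5):
    """Sequence-level interpolation: among the teacher's beam hypotheses,
    train on the one closest to the gold reference under `sim`
    (e.g. sentence-level BLEU)."""
    hyps = beam_search(teacher, src, beam_size=beam_size)
    return max(hyps, key=lambda h: sim(h, gold))
```

In practice the student is then trained with ordinary sequence NLL on whichever targets the chosen variant produces, so the training loop itself is unchanged from the baseline.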
The authors find that sequence-level knowledge distillation is effective because it allows the student network to model only the relevant parts (essentially the mode) of the teacher distribution, leading to better performance and faster decoding. The results show that the sequence-level approach outperforms the baseline model, and that greedy decoding with the distilled student achieves performance comparable to beam search with a much smaller model. The student model can even be run efficiently on a standard smartphone. Weight pruning is applied on top of distillation to further reduce model size, resulting in a model with 13× fewer parameters than the original teacher. The authors have released all the code for the models described in this paper.
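The pruning step can be approximated with simple magnitude pruning: zero out the weights with the smallest absolute values and then retrain. The sketch below assumes a generic PyTorch model and is not necessarily the exact pruning scheme used in the paper.

```python
# Hedged sketch of magnitude-based weight pruning applied to a distilled student.
import torch


def prune_by_magnitude(model, sparsity=0.8):
    """Zero out the `sparsity` fraction of weight entries with the smallest
    absolute value, across all weight matrices of the model.

    A real setup would fine-tune (retrain) the pruned model afterwards to
    recover most of the lost accuracy.
    """
    # Collect the magnitudes of all weight matrices (skip biases / 1-D params).
    all_weights = torch.cat([p.detach().abs().flatten()
                             for p in model.parameters() if p.dim() > 1])
    k = max(1, int(sparsity * all_weights.numel()))
    threshold = all_weights.kthvalue(k).values      # global magnitude cutoff
    with torch.no_grad():
        for p in model.parameters():
            if p.dim() > 1:
                p.mul_((p.abs() >= threshold).float())  # mask out small weights
    return model
```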