17 Dec 2019 | Seyed Iman Mirzadeh*, Mehrdad Farajtabar*, Ang Li, Nir Levine, Akihiro Matsukawa†, Hassan Ghasemzadeh
This paper introduces a new knowledge distillation framework, Teacher Assistant Knowledge Distillation (TAKD), to improve the performance of student networks when the gap between student and teacher is large. Knowledge distillation is a popular method for compressing deep neural networks, in which a large pre-trained network (the teacher) is used to train a smaller network (the student). However, the student's performance can degrade when the capacity gap between the two is too large. To address this issue, TAKD introduces an intermediate-sized network, called a teacher assistant (TA), that bridges the gap: the TA is first distilled from the teacher, and the student is then trained using the TA as its teacher. This two-step transfer improves knowledge transfer and enhances the performance of the student network.
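As a rough illustration of the underlying mechanism, the sketch below shows the standard knowledge-distillation loss applied at each teacher-to-TA and TA-to-student step: hard-label cross-entropy combined with a KL divergence between temperature-softened teacher and student outputs. The function name, temperature `T`, and weight `lambda_` are illustrative assumptions, not the paper's exact hyper-parameters.

```python
# Minimal PyTorch sketch of the distillation loss used at each step of the chain.
# T (temperature) and lambda_ (soft-loss weight) are assumed values for illustration.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, lambda_=0.9):
    # Hard-target loss: ordinary cross-entropy against ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    # Soft-target loss: KL divergence between temperature-softened distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # scale by T^2 to keep gradient magnitudes comparable
    return (1.0 - lambda_) * hard_loss + lambda_ * soft_loss
```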
The paper shows that the size (capacity) gap between the teacher and student strongly affects the effectiveness of knowledge distillation. The authors propose TAKD as a way to improve student accuracy under extreme compression. They also extend the framework to a chain of multiple TAs between the teacher and the student, further improving knowledge transfer, and provide insights into choosing the optimal TA size. Theoretical analysis and extensive experiments on the CIFAR-10, CIFAR-100, and ImageNet datasets, using both CNN and ResNet architectures, substantiate the effectiveness of the proposed approach.
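To make the multi-step idea concrete, here is a hedged sketch of the distillation chain: each network in a size-ordered list is trained with the previously distilled, larger network as its teacher. `distill_chain` and `train_with_kd` are hypothetical helpers (the latter could minimize a loss like the one sketched above); the names and signatures are assumptions, not the authors' code.

```python
# Hedged sketch of a TAKD-style distillation chain.
# `models` is ordered from the large trained teacher, through intermediate TAs,
# down to the small student; `train_with_kd` is a hypothetical KD training loop.
def distill_chain(models, train_loader, train_with_kd):
    teacher = models[0]  # assumed already trained on the target task
    for next_net in models[1:]:
        # Distill the current teacher into the next (smaller) network.
        train_with_kd(student=next_net, teacher=teacher, loader=train_loader)
        teacher = next_net  # the freshly distilled network teaches the next one
    return models[-1]  # the final compressed student
```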
The paper also discusses related work in knowledge distillation and neural network compression, and compares TAKD with other distillation methods. The results show that TAKD outperforms baseline knowledge distillation and normal training in various scenarios. The authors also provide a theoretical analysis of the effectiveness of TAKD, showing that introducing an intermediate TA network improves the distillation performance. The paper concludes that TAKD is a promising approach for improving the performance of student networks in knowledge distillation tasks.