17 Dec 2019 | Seyed Iman Mirzadeh*, Mehrdad Farajtabar*, Ang Li, Nir Levine, Akihiro Matsukawa†, Hassan Ghasemzadeh
The paper "Improved Knowledge Distillation via Teacher Assistant" addresses the issue of performance degradation in knowledge distillation when the gap between the student and teacher networks is large. The authors propose a new framework called Teacher Assistant Knowledge Distillation (TAKD), which introduces intermediate-sized networks (teacher assistants, or TAs) to bridge the gap between the student and teacher networks. This approach aims to improve the accuracy of the student network, especially in scenarios where the fixed student and teacher network sizes are not optimal for effective distillation.
The paper begins by discussing a limitation of traditional knowledge distillation: the student's performance can actually degrade as the capacity gap between student and teacher grows. The authors then introduce TAKD, in which one or more TAs are first distilled from the teacher, and the student is then distilled from the (last) TA rather than directly from the teacher. They argue that because a TA sits between the two in capacity, it can follow the teacher closely enough to absorb its knowledge while producing targets that the much smaller student is actually able to mimic, thus improving the distillation process.
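Continuing the sketch above, the TAKD pipeline can be summarized as a chain of ordinary distillation steps. The helper train_with_distillation, the make_cnn constructor, the layer counts, and train_loader are all placeholders chosen here for illustration; they are not the authors' code or exact architectures.

```python
# Illustrative TAKD pipeline: teacher -> teacher assistant -> student.
# make_cnn, train_loader, and the layer counts are hypothetical placeholders.
def train_with_distillation(student, teacher, data_loader, epochs=160):
    optimizer = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)
    teacher.eval()
    for _ in range(epochs):
        for images, labels in data_loader:
            with torch.no_grad():
                teacher_logits = teacher(images)  # frozen teacher provides soft targets
            loss = kd_loss(student(images), teacher_logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student

teacher = make_cnn(num_layers=10)   # assumed pre-trained on the labeled data
ta      = make_cnn(num_layers=4)    # intermediate-capacity teacher assistant
student = make_cnn(num_layers=2)    # small deployment model

# Each hop is a standard distillation step; only the last TA teaches the student.
ta      = train_with_distillation(ta, teacher, train_loader)
student = train_with_distillation(student, ta, train_loader)
```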
The effectiveness of TAKD is evaluated through extensive experiments on the CIFAR-10, CIFAR-100, and ImageNet datasets using plain CNN and ResNet architectures. The results show that TAKD consistently outperforms traditional knowledge distillation and normal training without distillation. The authors also study how to choose the TA's size and find that a TA whose accuracy lies roughly midway between the teacher's and the student's is generally a good choice.
The paper further discusses the theoretical underpinnings of TAKD, using VC theory to explain why introducing a TA can improve the distillation process. Additionally, empirical analysis of the loss landscape shows that TAKD leads to flatter surfaces around local minima, which can enhance generalization.
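The VC-theory argument builds on the distillation bounds of Lopez-Paz et al. (2016). A hedged sketch of its shape is given below: R denotes risk, f_r the target function, f_t, f_a, f_s the teacher, TA, and student, |F|_C a capacity measure, n the sample size, alpha in [1/2, 1] the estimation rate, and epsilon the approximation error. The exact constants and terms in the paper may differ; this is only the general form of the argument.

```latex
% Sketch of the VC-style comparison (notation roughly follows Lopez-Paz et al., 2016).
\begin{align*}
  \text{learning directly from data:}\quad &
  R(f_s) - R(f_r) \le O\!\Big(\tfrac{|F_s|_C}{n^{\alpha_{sr}}}\Big) + \epsilon_{sr},\\
  \text{distilling through a TA:}\quad &
  R(f_s) - R(f_r) \le
  O\!\Big(\tfrac{|F_t|_C}{n^{\alpha_{tr}}}\Big) + \epsilon_{tr}
  + O\!\Big(\tfrac{|F_a|_C}{n^{\alpha_{at}}}\Big) + \epsilon_{at}
  + O\!\Big(\tfrac{|F_s|_C}{n^{\alpha_{sa}}}\Big) + \epsilon_{sa}.
\end{align*}
```

The second bound follows by telescoping the risk gap along the chain teacher → TA → student. The intuition is that each hop spans a smaller capacity gap, so each estimation rate alpha is closer to 1 and each approximation error epsilon stays small, which can make the summed bound tighter than the direct one when the teacher-student gap is large.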
Overall, the paper provides a comprehensive study of the benefits of TAKD and its potential for improving knowledge distillation in various scenarios.