Knowledge Distillation Based on Transformed Teacher Matching

2024 | Kaixiang Zheng & En-Hui Yang
This paper proposes a knowledge distillation (KD) variant called transformed teacher matching (TTM), which departs from standard KD by dropping temperature scaling on the student side. Instead, the student is trained to match a power-transformed version of the teacher's output distribution, a transform that is equivalent to temperature scaling of the teacher's logits. Reinterpreted this way, the objective contains an inherent Rényi entropy term that acts as an extra regularizer, and the authors show that TTM can be decomposed into standard KD plus this Rényi entropy regularizer, which improves the student's generalization. Building on TTM, the paper further introduces a sample-adaptive weighting coefficient, yielding weighted TTM (WTTM). WTTM is simple and has almost the same computational cost as KD, yet achieves state-of-the-art accuracy.

Extensive experiments on two image classification datasets, CIFAR-100 and ImageNet, show that TTM and WTTM outperform competing distillation methods, with the gains being especially notable when the teacher model is more accurate. Among all the methods tested, WTTM achieves the best performance.
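To make the mechanism concrete, below is a minimal PyTorch sketch of the objectives as summarized above: the teacher distribution is power-transformed and renormalized (equivalent to temperature-scaling only the teacher's logits), the student distribution is left at temperature 1, and WTTM adds a per-sample weight. The function names and the particular weight used here (the power-transform normalizer of the teacher distribution) are illustrative assumptions; the paper defines its own sample-adaptive coefficient.

```python
import torch
import torch.nn.functional as F


def ttm_loss(student_logits, teacher_logits, gamma=0.5):
    """Sketch of transformed teacher matching (TTM).

    The teacher's probabilities are raised to the power gamma and renormalized,
    which is equivalent to temperature-scaling the teacher's logits by T = 1/gamma.
    The student distribution is NOT temperature-scaled. The per-sample loss is the
    cross-entropy between the transformed teacher and the raw student distribution.
    """
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    transformed = teacher_probs.pow(gamma)
    transformed = transformed / transformed.sum(dim=-1, keepdim=True)

    student_log_probs = F.log_softmax(student_logits, dim=-1)  # temperature 1

    # H(p^(gamma), q): minimizing this matches the student to the transformed teacher.
    return -(transformed * student_log_probs).sum(dim=-1)


def wttm_loss(student_logits, teacher_logits, gamma=0.5):
    """Sketch of weighted TTM (WTTM): each sample's TTM loss is scaled by an
    adaptive coefficient. The coefficient below (the teacher's power-transform
    normalizer) is an assumed stand-in, not the paper's exact definition."""
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    weight = teacher_probs.pow(gamma).sum(dim=-1).detach()  # assumed per-sample weight
    return (weight * ttm_loss(student_logits, teacher_logits, gamma)).mean()


# Usage sketch: combine with the usual cross-entropy on ground-truth labels.
if __name__ == "__main__":
    student_logits = torch.randn(8, 100)
    teacher_logits = torch.randn(8, 100)
    labels = torch.randint(0, 100, (8,))
    loss = F.cross_entropy(student_logits, labels) + wttm_loss(student_logits, teacher_logits)
    print(loss.item())
```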