Logit Standardization in Knowledge Distillation


3 Mar 2024 | Shangquan Sun, Wenqi Ren, Jingzhi Li, Rui Wang, Xiaochun Cao
This paper proposes a novel approach to knowledge distillation (KD): logit standardization, which addresses a limitation of conventional methods that assume a single temperature shared by teacher and student. Under a shared temperature, the student is implicitly pushed to match the teacher's logits in magnitude, which is not necessary for effective learning. The authors argue that what matters is preserving the intrinsic relations among the logits, not reproducing their magnitudes.

The proposed method applies a Z-score transformation to both the teacher's and the student's logits before the softmax and the Kullback-Leibler divergence are computed. This preprocessing step lets the student focus on the essential relations among the logits instead of being forced into a magnitude match. The softmax temperature is then set to the weighted standard deviation of the logits, yielding adaptive, sample-specific temperatures that can differ between teacher and student. The authors show that the conventional shared-temperature setting can produce misleading evaluations of student performance, and that Z-score standardization resolves this issue while significantly improving existing logit-based KD methods.

The method is evaluated on CIFAR-100 and ImageNet, where it performs favorably against state-of-the-art approaches and delivers considerable gains when plugged into other logit-based distillation variants. The paper also provides theoretical support: deriving the softmax temperature from the principle of entropy maximization shows that the teacher's and student's temperatures may legitimately differ, and that the standard deviation of the logits is a natural choice of temperature. The key insight is that preserving the intrinsic relations among logits, rather than enforcing a magnitude match, is what makes knowledge distillation effective.
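Concretely, the description above suggests the following form for the standardization and the resulting distillation loss. This is a sketch reconstructed from the summary rather than the paper's exact notation: z^T and z^S denote the teacher and student logit vectors for one sample, K is the number of classes, and the base-temperature weight τ is an assumption implied by the phrase "weighted standard deviation".

\[
  \mathcal{Z}(\mathbf{v};\tau) \;=\; \frac{\mathbf{v}-\overline{v}}{\tau\,\sigma(\mathbf{v})},
  \qquad
  \overline{v} \;=\; \frac{1}{K}\sum_{k=1}^{K} v_k,
  \qquad
  \sigma(\mathbf{v}) \;=\; \sqrt{\frac{1}{K}\sum_{k=1}^{K}\bigl(v_k-\overline{v}\bigr)^{2}},
\]
\[
  \mathcal{L}_{\mathrm{KD}}
  \;=\;
  \mathrm{KL}\!\Bigl(
    \operatorname{softmax}\bigl(\mathcal{Z}(\mathbf{z}^{\mathcal{T}};\tau)\bigr)
    \,\Big\Vert\,
    \operatorname{softmax}\bigl(\mathcal{Z}(\mathbf{z}^{\mathcal{S}};\tau)\bigr)
  \Bigr).
\]

Because σ(z^T) and σ(z^S) generally differ, each network (and indeed each sample) effectively receives its own temperature τ·σ(v), while the standardized softmax depends only on how the logits relate to one another, not on their absolute scale.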
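A minimal PyTorch sketch of this preprocessing follows, assuming the standardization is applied to both teacher and student logits and that base_temp plays the role of the weighting factor (both assumptions; the names are illustrative and not taken from the paper's code):

import torch
import torch.nn.functional as F

def zscore(logits, base_temp=2.0, eps=1e-6):
    """Standardize logits per sample: subtract the mean, divide by the
    (base-temperature-weighted) standard deviation."""
    mean = logits.mean(dim=-1, keepdim=True)
    std = logits.std(dim=-1, keepdim=True)
    return (logits - mean) / (base_temp * (std + eps))

def kd_loss(student_logits, teacher_logits, base_temp=2.0):
    """KL divergence between the softened teacher and student
    distributions after Z-score standardization of both logit sets."""
    s = F.log_softmax(zscore(student_logits, base_temp), dim=-1)
    t = F.softmax(zscore(teacher_logits, base_temp), dim=-1)
    # "batchmean" averages the per-sample KL over the batch.
    return F.kl_div(s, t, reduction="batchmean")

if __name__ == "__main__":
    # Toy batch: 8 samples, 100 classes. The teacher's logits have a
    # larger magnitude, but only their relative structure matters here.
    student_logits = torch.randn(8, 100)
    teacher_logits = torch.randn(8, 100) * 3.0
    print(kd_loss(student_logits, teacher_logits).item())

Because both logit sets are standardized before the softmax, the loss is insensitive to any global shift or rescaling of the teacher's logits, which is the behavior the summary attributes to the method.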