Logit Standardization in Knowledge Distillation

3 Mar 2024 | Shangquan Sun, Wenqi Ren, Jingzhi Li, Rui Wang, Xiaochun Cao
The paper "Logit Standardization in Knowledge Distillation" addresses the limitations of traditional knowledge distillation (KD) methods, which assume a shared temperature between the teacher and student models. This assumption often results in an exact match between the logits of the teacher and student, limiting the performance of the student model due to the capacity discrepancy between them. The authors propose a novel approach called Z-score logit standardization to mitigate this issue. Key contributions of the paper include: 1. **Theoretical Analysis**: The authors derive the softmax function in KD based on entropy maximization, showing that the temperature can be different for the teacher and student, and can vary across samples. 2. **Z-score Logit Standardization**: They introduce a preprocessing step that uses the weighted standard deviation of logits as the temperature, allowing the student to focus on essential logit relations from the teacher rather than requiring a magnitude match. 3. **Experiments**: Extensive experiments on CIFAR-100 and ImageNet datasets demonstrate the effectiveness of the proposed method, showing significant improvements over existing logit-based KD methods. The paper also highlights a toy case where the conventional KD pipeline with shared temperature can lead to misleading performance evaluations, while the proposed Z-score pre-process resolves this issue. The authors conclude that their method enhances the performance of existing logit-based KD methods and provides a more flexible and effective approach to knowledge distillation.The paper "Logit Standardization in Knowledge Distillation" addresses the limitations of traditional knowledge distillation (KD) methods, which assume a shared temperature between the teacher and student models. This assumption often results in an exact match between the logits of the teacher and student, limiting the performance of the student model due to the capacity discrepancy between them. The authors propose a novel approach called Z-score logit standardization to mitigate this issue. Key contributions of the paper include: 1. **Theoretical Analysis**: The authors derive the softmax function in KD based on entropy maximization, showing that the temperature can be different for the teacher and student, and can vary across samples. 2. **Z-score Logit Standardization**: They introduce a preprocessing step that uses the weighted standard deviation of logits as the temperature, allowing the student to focus on essential logit relations from the teacher rather than requiring a magnitude match. 3. **Experiments**: Extensive experiments on CIFAR-100 and ImageNet datasets demonstrate the effectiveness of the proposed method, showing significant improvements over existing logit-based KD methods. The paper also highlights a toy case where the conventional KD pipeline with shared temperature can lead to misleading performance evaluations, while the proposed Z-score pre-process resolves this issue. The authors conclude that their method enhances the performance of existing logit-based KD methods and provides a more flexible and effective approach to knowledge distillation.