1 May 2019 | Wonpyo Park, Dongju Kim, Yan Lu, Minsu Cho
The paper introduces Relational Knowledge Distillation (RKD), an approach that transfers knowledge from a teacher model to a student model by focusing on the structural relations between data examples rather than on individual outputs. Traditional knowledge distillation trains the student to mimic the teacher's output activations point by point, whereas RKD aims to preserve the relational structure of the teacher's embedding space. The authors propose two loss functions, a distance-wise and an angle-wise distillation loss, that penalize structural differences between the teacher's and the student's embeddings. Experiments on metric learning, image classification, and few-shot learning show that RKD significantly improves student performance, with students sometimes outperforming their teachers, especially in metric learning. The method remains effective when the student has far fewer parameters than the teacher, and it can be combined with other knowledge distillation techniques for further gains.
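The summary above does not spell out the two losses, so here is a minimal PyTorch sketch of how the paper defines them: the distance-wise loss matches mean-normalized pairwise distances between embeddings, and the angle-wise loss matches the cosines of angles formed by embedding triplets, both under a Huber penalty. The function names and the (N, D) batch-of-embeddings convention are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def rkd_distance_loss(teacher, student):
    """Distance-wise RKD loss: match pairwise distances, each set
    normalized by its own mean distance, under a Huber penalty."""
    def normalized_pdist(e):
        d = torch.cdist(e, e, p=2)                          # (N, N) pairwise L2 distances
        off_diag = ~torch.eye(len(e), dtype=torch.bool, device=e.device)
        return d / d[off_diag].mean()                        # scale by mean off-diagonal distance
    with torch.no_grad():                                    # teacher provides fixed targets
        t = normalized_pdist(teacher)
    return F.smooth_l1_loss(normalized_pdist(student), t)   # smooth_l1 == Huber with delta = 1

def rkd_angle_loss(teacher, student):
    """Angle-wise RKD loss: match cosines of the angle at x_j in every
    embedding triplet (x_i, x_j, x_k), under a Huber penalty."""
    def angle_cosines(e):
        diff = e.unsqueeze(0) - e.unsqueeze(1)               # diff[j, i] = e_i - e_j
        unit = F.normalize(diff, p=2, dim=2)                 # unit direction vectors from e_j
        return torch.bmm(unit, unit.transpose(1, 2))         # [j, i, k] = cos(angle i-j-k)
    with torch.no_grad():
        t = angle_cosines(teacher)
    return F.smooth_l1_loss(angle_cosines(student), t)
```

In training, these terms would typically be added to the primary task loss with scalar weights, e.g. `task_loss + lambda_d * rkd_distance_loss(t_emb, s_emb) + lambda_a * rkd_angle_loss(t_emb, s_emb)`; the paper reports combining the distance- and angle-wise terms, with the weights treated as hyperparameters.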