Similarity-Preserving Knowledge Distillation


1 Aug 2019 | Frederick Tung and Greg Mori
Similarity-preserving knowledge distillation is a novel approach that guides the student to preserve pairwise similarities between inputs in its own representation space, rather than to mimic the teacher's representation space directly. The key observation is that semantically similar inputs produce similar activation patterns in a trained network, so the distillation loss is defined on pairwise similarity matrices computed from the activations of the teacher and student networks over each mini-batch. Experiments on three public datasets, CIFAR-10, the describable textures dataset, and CINIC-10, show that similarity-preserving distillation improves student network performance and complements traditional distillation methods. It achieves lower error rates than conventional training and attention transfer baselines, and remains effective in transfer learning scenarios with limited data. The method also enables significant network compression with minimal accuracy loss. It is versatile and can be applied to a range of tasks, including model compression, privileged learning, adversarial defense, and learning with noisy data. The results indicate that similarity-preserving distillation is robust to domain shift and complements state-of-the-art attention transfer methods. Future work includes exploring the method in semi-supervised and omni-supervised learning settings.
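To make the idea of a loss on pairwise similarity matrices concrete, below is a minimal PyTorch sketch of one such loss term for a single pair of teacher/student layers. It is an illustration under stated assumptions, not the authors' reference implementation: the row-wise L2 normalization, the 1/b² scaling, and the `gamma` weighting in the usage line are choices consistent with the description above but not spelled out in this summary.

```python
import torch
import torch.nn.functional as F

def sp_loss(feat_t: torch.Tensor, feat_s: torch.Tensor) -> torch.Tensor:
    """Similarity-preserving distillation loss for one teacher/student layer pair.

    feat_t: teacher activations for a mini-batch, shape (b, c_t, h_t, w_t)
    feat_s: student activations for the same mini-batch, shape (b, c_s, h_s, w_s)
    Channel and spatial sizes may differ; only the batch dimension must match.
    """
    b = feat_t.size(0)

    # Flatten each sample's activation map into a row vector: (b, c*h*w).
    q_t = feat_t.reshape(b, -1)
    q_s = feat_s.reshape(b, -1)

    # Pairwise similarity matrices (b x b), then row-wise L2 normalization
    # (assumption: this normalization mirrors the paper's formulation).
    g_t = F.normalize(q_t @ q_t.t(), p=2, dim=1)
    g_s = F.normalize(q_s @ q_s.t(), p=2, dim=1)

    # Squared Frobenius distance between the two similarity matrices,
    # averaged over the b*b entries.
    return ((g_t - g_s) ** 2).sum() / (b * b)


# Hypothetical usage: add the term to the usual task loss, detaching the
# teacher so gradients flow only into the student.
# total_loss = cross_entropy_loss + gamma * sp_loss(teacher_feats.detach(), student_feats)
```

In practice the term would be summed over several matched layer pairs and balanced against the cross-entropy loss by a weighting hyperparameter (written here as the assumed `gamma`).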