24 Jan 2022 | Yonglong Tian, Dilip Krishnan, Phillip Isola
Contrastive Representation Distillation (CRD) is an approach to knowledge distillation that captures structural knowledge in the teacher network. Whereas standard knowledge distillation minimizes the KL divergence between teacher and student outputs, CRD trains the student with a contrastive objective that maximizes a lower bound on the mutual information between teacher and student representations, which better captures correlations and higher-order dependencies in the data. Evaluated on model compression, cross-modal transfer, and ensemble distillation, CRD outperforms the original KL-based knowledge distillation as well as other state-of-the-art distillation methods, and when combined with standard knowledge distillation the student sometimes even surpasses its teacher. The method builds on contrastive learning, which has proven effective in representation learning and self-supervised settings, and its results show that a contrastive objective is a powerful tool for knowledge transfer.
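To make the idea concrete, here is a minimal sketch of a contrastive distillation loss in PyTorch. It is not the authors' exact implementation (the released CRD code uses a memory bank of negatives and an NCE-based critic); this simplified version uses the other samples in the batch as negatives, InfoNCE-style, and the projection heads, dimensions, and temperature are illustrative assumptions.

```python
# Sketch of an in-batch contrastive distillation loss (simplified, not the official CRD code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveDistillLoss(nn.Module):
    def __init__(self, s_dim, t_dim, embed_dim=128, temperature=0.1):
        super().__init__()
        # Linear heads project student and teacher features into a shared embedding space.
        self.embed_s = nn.Linear(s_dim, embed_dim)
        self.embed_t = nn.Linear(t_dim, embed_dim)
        self.temperature = temperature

    def forward(self, feat_s, feat_t):
        # feat_s: (B, s_dim) student features; feat_t: (B, t_dim) teacher features.
        z_s = F.normalize(self.embed_s(feat_s), dim=1)
        z_t = F.normalize(self.embed_t(feat_t), dim=1).detach()  # teacher is not updated
        # Similarity matrix: entry (i, j) compares student embedding i with teacher embedding j.
        logits = z_s @ z_t.t() / self.temperature
        # Positive pairs sit on the diagonal (same input seen by both networks);
        # all other entries in the row act as negatives.
        labels = torch.arange(z_s.size(0), device=z_s.device)
        return F.cross_entropy(logits, labels)

# Hypothetical usage: add to the usual supervised loss (and optionally KL distillation),
# e.g. loss = ce_loss + beta * crd_loss(student_features, teacher_features)
```

Maximizing agreement between matched teacher/student pairs while pushing apart mismatched pairs is what ties the objective to a lower bound on the mutual information between the two representations.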