CONTRASTIVE REPRESENTATION DISTILLATION

24 Jan 2022 | Yonglong Tian, Dilip Krishnan, Phillip Isola
The paper introduces a novel approach called Contrastive Representation Distillation (CRD) for transferring knowledge between neural networks. Unlike traditional knowledge distillation, which minimizes the KL divergence between the outputs of a teacher and student network, CRD focuses on capturing structural knowledge by maximizing a lower bound on the mutual information between the teacher and student representations. This approach is motivated by the observation that the original knowledge distillation objective ignores important structural dependencies in the teacher's representation. CRD is formulated as a contrastive learning problem, where the goal is to pull representations of the same input (teacher and student views) closer together and push representations of different inputs apart in a metric space.

The paper demonstrates that CRD outperforms other distillation methods, including the original knowledge distillation objective, on various tasks such as model compression, cross-modal transfer, and ensemble distillation. Experiments show that CRD consistently improves performance, sometimes even outperforming the teacher network when combined with knowledge distillation. The method sets a new state of the art on many transfer tasks and draws a connection between knowledge distillation and representation learning.
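To make the contrastive formulation concrete, below is a minimal PyTorch sketch of a contrastive distillation loss between teacher and student features. It uses an in-batch InfoNCE-style objective, whereas the paper's implementation draws negatives from a large memory buffer with an NCE-based critic; the module names, projection dimension, and temperature here are illustrative assumptions, not the authors' exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveDistillLoss(nn.Module):
    """Sketch of a contrastive loss that aligns student and teacher embeddings."""

    def __init__(self, student_dim, teacher_dim, embed_dim=128, temperature=0.1):
        super().__init__()
        # Linear heads map both networks' penultimate features into a shared space.
        self.proj_s = nn.Linear(student_dim, embed_dim)
        self.proj_t = nn.Linear(teacher_dim, embed_dim)
        self.temperature = temperature

    def forward(self, feat_s, feat_t):
        # feat_s: (B, student_dim) student features; feat_t: (B, teacher_dim) teacher features.
        z_s = F.normalize(self.proj_s(feat_s), dim=1)
        z_t = F.normalize(self.proj_t(feat_t.detach()), dim=1)  # teacher is not updated
        # Similarity of every student embedding to every teacher embedding in the batch.
        logits = z_s @ z_t.t() / self.temperature  # shape (B, B)
        # Positives are the matching (same-input) teacher/student pairs on the diagonal;
        # all other pairs in the batch act as negatives.
        targets = torch.arange(feat_s.size(0), device=feat_s.device)
        return F.cross_entropy(logits, targets)

# Usage (illustrative): add the contrastive term to the student's task loss, e.g.
#   loss = F.cross_entropy(student_logits, labels) + beta * crd_loss(feat_s, feat_t)
```

In this sketch, maximizing agreement between the paired teacher/student embeddings while contrasting against other samples serves as a tractable lower bound on their mutual information, which is the intuition behind CRD's objective.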