2024 | Linfeng Ye, Shayan Mohajer Hamidi, Renhao Tan & En-Hui Yang
This paper introduces a novel approach to enhance knowledge distillation (KD) by improving the estimation of the Bayes conditional probability distribution (BCPD) used in student training. The proposed method, called Maximum Conditional Mutual Information (MCMI), simultaneously maximizes both the log-likelihood and conditional mutual information (CMI) during teacher training. Unlike conventional methods that rely on maximum log-likelihood (MLL) estimation, MCMI captures contextual information in images, as visualized through Eigen-CAM. Extensive experiments across various state-of-the-art KD frameworks demonstrate that using a teacher trained with MCMI leads to a consistent increase in student accuracy. The results show that the MCMI method provides a more accurate estimate of the BCPD compared to MLL, particularly in zero-shot and few-shot settings. For example, when only 5% of the training samples are available to the student (few-shot), the student's accuracy increases by up to 5.72%. In zero-shot settings, the student's accuracy increases from 0% to as high as 84% for an omitted class. The code is available at https://github.com/iclr2024mcmi/ICLRMCMI.
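The teacher objective described above can be sketched as standard cross-entropy minus a weighted CMI term, so that minimizing the loss maximizes both log-likelihood and CMI. The sketch below is illustrative only: the weight `lam` and the per-class estimator of CMI (average KL divergence between each sample's softmax output and the mean output of its ground-truth class) are simplifying assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mcmi_loss(logits, labels, lam=0.1, eps=1e-12):
    """Illustrative MCMI-style teacher loss: cross-entropy minus
    a weighted CMI estimate (minimizing this maximizes both the
    log-likelihood and the CMI term)."""
    p = softmax(logits)
    n = logits.shape[0]
    # Maximum log-likelihood part: average cross-entropy on true labels.
    ce = -np.mean(np.log(p[np.arange(n), labels] + eps))
    # Simplified CMI estimate: for each class, KL divergence between each
    # sample's output distribution and that class's mean output distribution.
    cmi = 0.0
    for c in np.unique(labels):
        pc = p[labels == c]
        mean_c = pc.mean(axis=0, keepdims=True)
        kl = np.sum(pc * (np.log(pc + eps) - np.log(mean_c + eps)), axis=1)
        cmi += kl.sum()
    cmi /= n
    return ce - lam * cmi
```

With `lam = 0` the loss reduces to plain cross-entropy (the conventional MLL teacher); a positive `lam` rewards teachers whose within-class output distributions stay diverse, which is one way to read the CMI term.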