1 Jun 2017 | Ying Zhang, Tao Xiang, Timothy M. Hospedales, Huchuan Lu
The paper introduces Deep Mutual Learning (DML), a novel approach that improves deep neural networks by training them collaboratively in a cohort rather than through one-way knowledge transfer from a pre-trained teacher. Unlike traditional model distillation, which relies on a powerful, pre-trained teacher network, DML starts from a pool of untrained student networks that learn from each other throughout training. Each student is trained with a conventional supervised loss plus a Kullback-Leibler (KL) divergence-based mimicry loss that aligns its class posterior with the class probabilities predicted by its peers. This peer-teaching scheme improves generalization, yielding better results on the CIFAR-100 image classification and Market-1501 person re-identification benchmarks. The paper shows that DML outperforms both conventional distillation and independent learning, even when the cohort consists of large networks. Performance also improves as the number of networks in the cohort grows, and an ensemble of DML-trained networks raises accuracy further. The authors offer insight into why DML works, suggesting that it finds more robust solutions with higher posterior entropy, which generalize better.
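The heart of DML is the per-student objective: a supervised cross-entropy term plus a KL-divergence mimicry term toward each peer's predicted class distribution. The following is a minimal PyTorch-style sketch for a two-network cohort, assuming `logits_1` and `logits_2` are the raw outputs of the two students on the same mini-batch; the function name `dml_losses` and the choice to detach the peer's posterior are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def dml_losses(logits_1, logits_2, targets):
    """Per-student losses for a two-network DML cohort (illustrative sketch)."""
    # Conventional supervised cross-entropy losses.
    ce_1 = F.cross_entropy(logits_1, targets)
    ce_2 = F.cross_entropy(logits_2, targets)

    # Class posteriors of each student; detached so each network treats its
    # peer's predictions as a fixed target for the current step (assumption).
    p_1 = F.softmax(logits_1, dim=1).detach()
    p_2 = F.softmax(logits_2, dim=1).detach()

    # Mimicry losses: F.kl_div(log_q, p) computes KL(p || q), so each term
    # pulls the student's posterior toward its peer's.
    kl_1 = F.kl_div(F.log_softmax(logits_1, dim=1), p_2, reduction="batchmean")
    kl_2 = F.kl_div(F.log_softmax(logits_2, dim=1), p_1, reduction="batchmean")

    return ce_1 + kl_1, ce_2 + kl_2
```

In practice the two students are updated in alternating steps on each mini-batch, each minimizing its own combined loss; with a cohort of more than two networks, the mimicry term for each student becomes the average KL divergence to its remaining peers.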