Born-Again Neural Networks


29 Jun 2018 | Tommaso Furlanello, Zachary C. Lipton, Michael Tschannen, Laurent Itti, Anima Anandkumar
The paper introduces Born-Again Networks (BANs), a variant of Knowledge Distillation (KD). Unlike traditional KD, which compresses a high-capacity teacher into a more compact student, BANs train students with architectures identical to their teachers. Surprisingly, these students outperform their teachers by significant margins on both computer vision and language modeling tasks. The authors report state-of-the-art results on the CIFAR-10 and CIFAR-100 datasets using DenseNets. To isolate the essential components of KD, they study two modified distillation objectives: Confidence-Weighted by Teacher Max (CWTM) and Dark Knowledge with Permuted Predictions (DKPP). The paper also examines the stability of the BAN procedure under variations in depth and width, and shows that ResNet students trained from DenseNet teachers improve over ResNets trained from scratch. Experiments on the Penn Treebank dataset further validate the effectiveness of BANs for language modeling. The results suggest that KD can be beneficial even when the student matches the teacher's architecture, and that the teacher's output distribution provides a valuable training signal.
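
As a rough illustration of the born-again training loop (this is not the authors' code; the PyTorch function below, the equal loss weighting, and the temperature value are illustrative assumptions), a student with the same architecture as its teacher can be fit to both the hard labels and the teacher's soft output distribution:

```python
import torch
import torch.nn.functional as F

def ban_distillation_step(student, teacher, images, labels, optimizer, temperature=1.0):
    """One born-again training step: the student (same architecture as the
    teacher) is trained on the ground-truth labels plus the teacher's output
    distribution. Loss weighting and temperature are illustrative choices."""
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(images)

    student_logits = student(images)

    # Standard cross-entropy against the hard labels.
    ce_loss = F.cross_entropy(student_logits, labels)

    # Distillation term: KL divergence between the softened teacher and
    # student distributions (the "dark knowledge" signal).
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2

    loss = ce_loss + kd_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Once the student converges, it can itself serve as the teacher for a further born-again generation; the details of generation count and ensembling are described in the paper.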