Born-Again Neural Networks


29 Jun 2018 | Tommaso Furlanello, Zachary C. Lipton, Michael Tschannen, Laurent Itti, Anima Anandkumar
Born-Again Neural Networks (BANs) are a novel approach to Knowledge Distillation (KD) in which the student is trained with an architecture identical to its teacher's and nevertheless achieves significant performance improvements. Unlike traditional KD, which compresses a large model into a smaller one, BAN training has the student match the teacher's output distribution purely to improve accuracy, and it yields better results on computer vision and language modeling tasks. Experiments with DenseNet-based BANs reach test errors of 3.5% on CIFAR-10 and 15.5% on CIFAR-100, outperforming their teachers.

Two distillation objectives, Confidence-Weighted by Teacher Max (CWTM) and Dark Knowledge with Permuted Predictions (DKPP), are explored to reveal how the teacher's outputs on the predicted and non-predicted classes each contribute to the student's gains. The gradient from KD can be decomposed into a dark-knowledge component, carried by the probabilities the teacher assigns to the incorrect classes, and a ground-truth component that corresponds to training on the real labels.

BANs are applied to DenseNets, ResNets, and LSTM-based models, consistently achieving lower validation errors than their teachers; in language modeling they also reach lower perplexity on the Penn Treebank dataset. The results demonstrate that KD can enhance model performance without requiring a stronger teacher, and that BANs can be used to improve simpler architectures such as ResNets. Across tasks and architectures, BANs achieve state-of-the-art results without ensemble methods.
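To make the training objective above concrete, here is a minimal PyTorch sketch of one born-again training step: a student with the same architecture as its (frozen) teacher is fit to the teacher's output distribution, mixed with the ordinary cross-entropy on the ground-truth labels. This is an illustrative sketch, not the paper's exact recipe; the `temperature` and `alpha` hyperparameters and the function names are assumptions introduced here.

```python
# Hypothetical sketch of a born-again (self-distillation) training step in PyTorch.
# `teacher` and `student` share the same architecture; the teacher is frozen.
import torch
import torch.nn.functional as F

def ban_distillation_loss(student_logits, teacher_logits, labels,
                          temperature=1.0, alpha=0.5):
    """KD loss: KL(student || teacher soft targets) mixed with hard-label CE.

    `temperature` and `alpha` are illustrative hyperparameters, not values
    taken from the Born-Again Networks paper.
    """
    # Soft targets from the frozen teacher (no gradient flows into it).
    soft_targets = F.softmax(teacher_logits / temperature, dim=1)
    log_student = F.log_softmax(student_logits / temperature, dim=1)

    # KL divergence between student and teacher distributions ("dark knowledge").
    kd_term = F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2

    # Ordinary cross-entropy against the ground-truth labels.
    ce_term = F.cross_entropy(student_logits, labels)

    return alpha * kd_term + (1.0 - alpha) * ce_term


def train_step(student, teacher, optimizer, images, labels):
    teacher.eval()
    with torch.no_grad():                      # teacher provides fixed targets
        teacher_logits = teacher(images)
    student_logits = student(images)
    loss = ban_distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the BAN setting the teacher and student instantiate the same architecture, so the only differences between them are the random initialization and the targets the student is trained against.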