10 Jun 2020 | Rafael Müller, Simon Kornblith, Geoffrey Hinton
Label smoothing, a technique that softens the one-hot targets used to train neural networks, has been widely adopted to improve generalization and learning speed. This paper examines the effects of label smoothing on model calibration and knowledge distillation. Empirically, label smoothing not only enhances generalization but also improves model calibration, which can significantly benefit beam search in tasks such as machine translation. However, the paper also finds that knowledge distillation becomes less effective when the teacher network is trained with label smoothing.

To explain these observations, the authors visualize the representations learned by the penultimate layer of the network, revealing that label smoothing encourages training examples from the same class to cluster tightly while erasing information about how the examples relate to the other classes. This loss of information in the logits hinders distillation but does not hurt generalization or calibration. The paper concludes by suggesting directions for further research, including the relationship between label smoothing and the information bottleneck principle, and by highlighting the importance of calibrated likelihoods in downstream tasks.
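To make "softening the targets" concrete, here is a minimal NumPy sketch of the standard smoothing rule: the true class gets weight 1 − α and the remaining mass α is spread uniformly over all classes. This is an illustration, not the authors' code; the function name and α = 0.1 are assumptions.

```python
import numpy as np

def smooth_labels(labels, num_classes, alpha=0.1):
    """Turn integer class labels into smoothed target distributions:
    (1 - alpha) on the true class, alpha / num_classes everywhere."""
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - alpha) * one_hot + alpha / num_classes

# Example: 3 classes, true label 0, alpha = 0.1
targets = smooth_labels(np.array([0]), num_classes=3, alpha=0.1)
# targets[0] ≈ [0.9333, 0.0333, 0.0333]; each row still sums to 1
```

Training against these soft targets with the usual cross-entropy loss is what prevents the network from producing the extremely confident logits that, per the paper, help distillation but hurt calibration.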