When Does Label Smoothing Help?


10 Jun 2020 | Rafael Müller, Simon Kornblith, Geoffrey Hinton
Label smoothing, a technique that softens a network's training targets by taking a weighted average of the one-hot labels and a uniform distribution over the classes, is widely used to improve generalization and model calibration, but its effect on knowledge distillation is less well understood. This paper studies the behavior of neural networks trained with label smoothing and shows that while it improves generalization and calibration, it harms distillation. The reason is geometric: label smoothing encourages the penultimate-layer representations of training examples from the same class to cluster tightly, which erases information in the logits about how similar each example is to the remaining classes, and that similarity information is precisely what distillation transfers.

To make this visible, the paper introduces a visualization method based on linear projections of penultimate-layer activations, showing how label smoothing reshapes learned representations. It also demonstrates that label smoothing implicitly calibrates models, aligning prediction confidence with accuracy. The sketches below illustrate the smoothing operation itself and the projection used for the visualization.
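Concretely, with smoothing weight α and K classes, the smoothed target for class k is y_k^LS = y_k(1 − α) + α/K. A minimal NumPy sketch of this operation (the function name and the value of α are illustrative, not taken from the paper):

```python
import numpy as np

def smooth_labels(hard_targets: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """Mix one-hot targets with a uniform distribution over the K classes:
    y_ls = y * (1 - alpha) + alpha / K."""
    num_classes = hard_targets.shape[-1]
    return hard_targets * (1.0 - alpha) + alpha / num_classes

# Example: a 3-class one-hot target for class 0.
y = np.array([1.0, 0.0, 0.0])
print(smooth_labels(y, alpha=0.1))  # approx. [0.9333 0.0333 0.0333]
```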
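The visualization scheme is simple: pick three classes, find an orthonormal basis of the plane passing through the templates (final-layer weight vectors) of those classes, and project each example's penultimate-layer activation onto that plane. A sketch of that projection, assuming the array shapes noted in the docstring:

```python
import numpy as np

def project_to_template_plane(activations: np.ndarray,
                              templates: np.ndarray) -> np.ndarray:
    """Project penultimate-layer activations onto the plane spanned by
    three class templates.

    activations: (n, d) penultimate-layer activations of n examples.
    templates:   (3, d) final-layer weight vectors of the chosen classes.
    Returns (n, 2) coordinates in an orthonormal basis of the plane.
    """
    # Two directions spanning the plane through the three templates.
    v1 = templates[1] - templates[0]
    v2 = templates[2] - templates[0]
    # Gram-Schmidt to obtain an orthonormal basis of the plane.
    b1 = v1 / np.linalg.norm(v1)
    b2 = v2 - (v2 @ b1) * b1
    b2 /= np.linalg.norm(b2)
    return activations @ np.stack([b1, b2], axis=1)
```

Plotted this way, clusters from hard-target training stay spread out, while label-smoothed clusters collapse into much tighter groups.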
However, this calibration comes at the cost of distillation effectiveness: the information erased from the logits is exactly what a teacher network would otherwise pass on to a student. In experiments on image classification and machine translation, label smoothing reduces the expected calibration error (ECE) and yields well-calibrated models without the need for post-hoc temperature scaling. At the same time, it can produce worse negative log-likelihoods (NLL) than hard-target training, so better calibration does not always come with gains on task metrics such as BLEU.

In knowledge distillation, teachers trained with label smoothing produce worse student networks than teachers trained with hard targets. The authors attribute this to the loss of inter-class similarity information in the logits, and they further show that label smoothing reduces the mutual information between input examples and output logits. The study thus highlights a trade-off: label smoothing improves generalization and calibration, but it can hinder knowledge distillation by erasing the information needed for effective transfer. Its impact on model behavior is context-dependent, with implications for interpretability and for downstream tasks that rely on calibrated predictions. The sketches below make the calibration metric and the distillation objective concrete.
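Calibration is measured with the expected calibration error: predictions are grouped into confidence bins, and ECE is the sample-weighted average of the gap between each bin's accuracy and its mean confidence. A minimal sketch, assuming softmax probabilities and integer labels; the bin count and equal-width binning are conventional choices rather than details taken from this paper:

```python
import numpy as np

def expected_calibration_error(probs: np.ndarray, labels: np.ndarray,
                               n_bins: int = 15) -> float:
    """ECE with equal-width confidence bins:
    sum over bins of (bin size / n) * |accuracy - mean confidence|."""
    confidences = probs.max(axis=1)           # confidence of the top prediction
    predictions = probs.argmax(axis=1)        # predicted class
    accuracies = (predictions == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap        # in_bin.mean() = |bin| / n
    return ece
```

Temperature scaling, the post-hoc alternative mentioned above, simply divides the logits by a scalar fit on held-out data before the softmax; the observation here is that label smoothing reaches a low ECE during training, without that extra step.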
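The distillation setup is the standard one due to Hinton et al.: the student is trained to match the teacher's temperature-softened output distribution in addition to the hard labels. A minimal PyTorch sketch of that objective; the temperature and mixing weight are illustrative defaults, not values from the paper's experiments:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      beta: float = 0.5) -> torch.Tensor:
    """Mix hard-target cross-entropy with the KL divergence between
    temperature-softened teacher and student distributions."""
    hard_loss = F.cross_entropy(student_logits, labels)
    soft_student = F.log_softmax(student_logits / temperature, dim=1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    # The T^2 factor keeps soft-target gradients on the same scale as hard ones.
    soft_loss = F.kl_div(soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2
    return (1.0 - beta) * hard_loss + beta * soft_loss
```

A teacher trained with label smoothing assigns nearly uniform probability to the wrong classes, so the soft term of this loss carries little of the inter-class structure the student is meant to absorb.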
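For reference, the mutual information in question is the standard quantity I(X; Z) = H(Z) − H(Z | X) between the input X and the logits Z. Because label smoothing collapses same-class representations together, the logits retain less information about which example produced them, and the estimated I(X; Z) drops; the paper's exact estimator is not reproduced here.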