REGULARIZING NEURAL NETWORKS BY PENALIZING CONFIDENT OUTPUT DISTRIBUTIONS

23 Jan 2017 | Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, Geoffrey Hinton
This paper explores regularizing neural networks by penalizing low-entropy output distributions. The authors show that penalizing low-entropy output distributions, a technique previously shown to improve exploration in reinforcement learning, acts as a strong regularizer in supervised learning, and they connect the resulting maximum-entropy-based confidence penalty to label smoothing through the direction of the KL divergence. They evaluate the confidence penalty and label smoothing on six common benchmarks: image classification (MNIST and CIFAR-10), language modeling (Penn Treebank), machine translation (WMT'14 English-to-German), and speech recognition (TIMIT and WSJ), and find that both methods improve state-of-the-art models across benchmarks without modifying existing hyperparameters.

The confidence penalty discourages confident output distributions by adding the negative entropy of the network's predictions to the negative log-likelihood during training; this helps prevent overfitting and improves generalization. A minimal sketch of this loss appears below.
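The following is one way such a loss could be implemented, written in PyTorch purely for illustration; the paper does not prescribe a framework, and the weight beta (the paper's β hyperparameter) is a placeholder that would need task-specific tuning.

```python
import torch
import torch.nn.functional as F

def confidence_penalty_loss(logits, targets, beta=0.1):
    """Negative log-likelihood minus beta times the entropy of the
    predicted distribution, so low-entropy (over-confident) outputs
    are penalized. beta=0.1 is a placeholder, not a value from the paper.
    """
    nll = F.cross_entropy(logits, targets)            # -log p(y|x), batch-averaged
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return nll - beta * entropy                       # adding -H(p) = subtracting H(p)
```

The gradient of the entropy term pushes probability mass away from one-hot outputs, which is exactly the over-confidence the penalty is meant to discourage.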
The connection to label smoothing runs through the direction of the KL divergence: since H(p) = log K - KL(p || u) for a K-class prediction p and uniform distribution u, penalizing low entropy is equivalent (up to a constant) to adding KL(p || u) to the loss, whereas label smoothing can be seen as adding the reverse divergence, KL(u || p), to the negative log-likelihood. This interpretation suggests further confidence regularizers that use alternative target distributions in place of the uniform one. The authors also study the confidence penalty with and without dropout and find it effective in both settings. Across image classification, language modeling, machine translation, and speech recognition, both methods improve performance, with label smoothing giving slightly better results in some cases; the authors conclude that both label smoothing and the confidence penalty are effective regularizers for a wide range of tasks. A corresponding sketch of label smoothing in this KL form follows.
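As a companion to the sketch above, here is label smoothing expressed as a KL(u || p) penalty, again in PyTorch and with a placeholder epsilon; constant terms (the entropy of the uniform distribution) are dropped since they do not affect gradients.

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, targets, epsilon=0.1):
    """Negative log-likelihood plus a KL(u || p) term, with constants
    dropped. epsilon=0.1 is a placeholder value, not taken from the paper.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    nll = F.nll_loss(log_probs, targets)
    # KL(u || p) = -H(u) - (1/K) * sum_k log p_k; keep only the p-dependent part
    uniform_ce = -log_probs.mean(dim=-1).mean()
    return (1.0 - epsilon) * nll + epsilon * uniform_ce
```

This form is algebraically identical to taking the cross-entropy against smoothed targets (1 - epsilon) * one-hot + epsilon / K, which is the more common implementation.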