REGULARIZING NEURAL NETWORKS BY PENALIZING CONFIDENT OUTPUT DISTRIBUTIONS

23 Jan 2017 | Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, Geoffrey Hinton
The paper explores penalizing low-entropy output distributions as a regularization technique for neural networks. The authors show that this approach, which has been effective for improving exploration in reinforcement learning, also acts as a strong regularizer in supervised learning. They evaluate two specific regularizers, a maximum-entropy confidence penalty and label smoothing (both uniform and unigram), on six common benchmarks: image classification (MNIST and CIFAR-10), language modeling (Penn Treebank), machine translation (WMT'14 English-to-German), and speech recognition (TIMIT and WSJ). The results show that both label smoothing and the confidence penalty improve state-of-the-art models across these benchmarks without requiring adjustments to existing hyperparameters, suggesting wide applicability. The paper also discusses the connection between the two techniques, showing that the confidence penalty can be interpreted as a form of label smoothing with a specific target distribution. Experiments on a range of datasets and models, including fully-connected networks, convolutional neural networks, and sequence-to-sequence models, confirm the effectiveness of these regularizers.
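To make the two regularizers concrete, here is a minimal sketch in PyTorch (an assumption; the paper does not prescribe a framework). The confidence penalty subtracts β times the entropy of the model's predictive distribution from the usual negative log-likelihood, so overconfident (low-entropy) outputs incur a higher loss; label smoothing instead mixes the one-hot targets with a prior distribution (uniform or unigram) weighted by ε. The β and ε defaults below are illustrative, not the paper's tuned settings.

```python
import torch
import torch.nn.functional as F

def confidence_penalty_loss(logits, targets, beta=0.1):
    """Negative log-likelihood minus beta times the entropy of the
    model's output distribution, so low-entropy (overconfident)
    predictions are penalized. beta=0.1 is an illustrative default."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = F.nll_loss(log_probs, targets)
    # Entropy H(p) = -sum_k p_k log p_k, averaged over the batch.
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return nll - beta * entropy

def label_smoothing_loss(logits, targets, epsilon=0.1, prior=None):
    """Cross-entropy against smoothed targets:
    (1 - epsilon) * one-hot + epsilon * prior,
    where prior is uniform by default or a unigram label distribution."""
    num_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    if prior is None:
        # Uniform smoothing; pass a (num_classes,) tensor of label
        # frequencies for the unigram variant.
        prior = torch.full_like(log_probs, 1.0 / num_classes)
    one_hot = F.one_hot(targets, num_classes).float()
    smoothed = (1 - epsilon) * one_hot + epsilon * prior
    return -(smoothed * log_probs).sum(dim=-1).mean()

# Example usage with random data:
logits = torch.randn(8, 10)               # batch of 8, 10 classes
targets = torch.randint(0, 10, (8,))
loss = confidence_penalty_loss(logits, targets)
```

The connection the paper draws can be seen from the identity H(p) = log K − KL(p ‖ u) for a uniform distribution u over K classes: up to a constant, the confidence penalty adds β·KL(p ‖ u) to the loss, which is the reverse-direction KL of the KL(u ‖ p) term that uniform label smoothing adds.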