22 Jun 2018 | Christos Louizos*, Max Welling, Diederik P. Kingma
The paper proposes a practical method for $L_0$ norm regularization in neural networks: the network is pruned during training by encouraging weights to become exactly zero. This can significantly speed up training and inference and improve generalization. To cope with the non-differentiability of the $L_0$ norm, the authors attach a collection of non-negative stochastic gates to the weights; the gates collectively determine which weights are set to zero. They introduce the *hard concrete* distribution, a stretched and clamped version of the binary concrete distribution, which allows efficient gradient-based optimization while still producing exact zeros in the parameters. The parameters of the distribution over the gates are optimized jointly with the original network parameters, so model structures can be learned straightforwardly with stochastic gradient descent. Experiments on MNIST and CIFAR-10/CIFAR-100 classification show performance competitive with current approaches and a potential training-time speedup.
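To make the gating mechanism concrete, below is a minimal, illustrative PyTorch sketch of a hard concrete gate and the corresponding expected-$L_0$ penalty. The constants $\gamma = -0.1$, $\zeta = 1.1$, $\beta = 2/3$ and the formulas follow the paper's notation; the function names and the training-step snippet in the comments are my own placeholders, not the authors' reference implementation.

```python
import math
import torch

# Stretch interval (gamma, zeta) and temperature beta; values suggested in the paper.
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0

def sample_hard_concrete(log_alpha, beta=BETA, gamma=GAMMA, zeta=ZETA):
    """Sample gates z in [0, 1] that can be exactly 0 or exactly 1.

    log_alpha holds the learnable location parameters, one per weight
    (or per group of weights, e.g. per output neuron).
    """
    # u ~ Uniform(0, 1), clamped away from the endpoints for numerical safety.
    u = torch.rand_like(log_alpha).clamp(1e-6, 1.0 - 1e-6)
    # Binary concrete sample in (0, 1).
    s = torch.sigmoid((torch.log(u) - torch.log(1.0 - u) + log_alpha) / beta)
    # Stretch to (gamma, zeta), then clamp with a hard sigmoid -> exact zeros/ones.
    return torch.clamp(s * (zeta - gamma) + gamma, 0.0, 1.0)

def expected_l0(log_alpha, beta=BETA, gamma=GAMMA, zeta=ZETA):
    """Differentiable surrogate for the expected number of non-zero gates."""
    return torch.sigmoid(log_alpha - beta * math.log(-gamma / zeta)).sum()

# Illustrative training step (task_loss, weights and lam are placeholders):
#   z = sample_hard_concrete(log_alpha)
#   loss = task_loss(weights * z) + lam * expected_l0(log_alpha)
```

At test time the paper replaces sampling with the deterministic estimator $\hat z = \min(1, \max(0, \operatorname{sigmoid}(\log\alpha)\,(\zeta-\gamma)+\gamma))$, so the zeros learned during training carry over directly to inference.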