22 Jun 2018 | Christos Louizos*, Max Welling, Diederik P. Kingma
This paper proposes a practical method for $ L_0 $ norm regularization in neural networks: encouraging weights to become exactly zero during training. Pruning weights in this way speeds up training and inference and can improve generalization. Because the $ L_0 $ norm is non-differentiable, it cannot be used directly as a penalty with gradient-based optimization. Instead, the authors introduce a collection of non-negative stochastic gates that determine which weights are set to zero, and they show that for suitable distributions over the gates the expected $ L_0 $-regularized objective is differentiable with respect to the distribution parameters. For the gates they propose the hard concrete distribution, obtained by stretching a binary concrete distribution and transforming its samples with a hard-sigmoid. This yields efficient gradient-based optimization and enables conditional computation.
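To make the gating mechanism concrete, here is a minimal PyTorch sketch of hard concrete gate sampling and the differentiable expected-$L_0$ penalty, following the stretch-and-clip construction described above. The constants `GAMMA`, `ZETA`, and `BETA` are the stretch limits and temperature suggested by the authors, treated here as assumed defaults; this is an illustrative sketch, not the authors' reference implementation.

```python
import math
import torch

# Assumed defaults for the stretch limits (gamma, zeta) and temperature (beta).
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0

def sample_hard_concrete(log_alpha: torch.Tensor) -> torch.Tensor:
    """Sample stochastic gates z in [0, 1] from the hard concrete distribution."""
    u = torch.rand_like(log_alpha)
    # Reparameterized binary concrete sample, so gradients flow into log_alpha.
    s = torch.sigmoid((torch.log(u) - torch.log(1 - u) + log_alpha) / BETA)
    # Stretch to (gamma, zeta), then clip with a hard-sigmoid so mass lands on {0, 1}.
    s_stretched = s * (ZETA - GAMMA) + GAMMA
    return torch.clamp(s_stretched, 0.0, 1.0)

def expected_l0(log_alpha: torch.Tensor) -> torch.Tensor:
    """Expected L0 penalty: the (differentiable) probability that each gate is nonzero."""
    return torch.sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA)).sum()
```

Because the sample is a reparameterized transformation of uniform noise and the penalty is a smooth function of `log_alpha`, both terms can be optimized jointly with the task loss by stochastic gradient descent.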
The authors demonstrate that their method allows straightforward and efficient learning of model structures with stochastic gradient descent, and they validate it empirically. On MNIST and CIFAR classification the approach achieves results competitive with other sparsification methods, while offering significant computational benefits: pruning reduces the number of parameters and hence the expected cost of training and inference. The penalty also combines naturally with other norms, such as the $ L_2 $ norm, and can be applied at the group level to induce group sparsity in neural networks. The authors conclude that their method provides a principled and effective way to regularize neural networks with the $ L_0 $ norm, improving both performance and efficiency.
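As an illustration of how the gates plug into a layer and into the training objective, below is a sketch of an $L_0$-regularized linear layer with one gate per input feature, a group-sparsity choice made here for illustration. It reuses `sample_hard_concrete` and `expected_l0` from the snippet above; the `L0Linear` name and the regularization strength are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class L0Linear(nn.Module):
    """Linear layer with one hard concrete gate per input feature (group sparsity sketch)."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))
        # One gate per input column: a zero gate removes an entire weight group.
        self.log_alpha = nn.Parameter(torch.zeros(in_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = sample_hard_concrete(self.log_alpha)  # gates in [0, 1]
        return F.linear(x * z, self.weight, self.bias)

    def l0_penalty(self) -> torch.Tensor:
        return expected_l0(self.log_alpha)

# Training objective (illustrative regularization strength):
#   layer = L0Linear(784, 300)
#   loss = criterion(layer(x), y) + 1e-3 * layer.l0_penalty()
```

Gates that settle at exactly zero can be skipped at inference time, which is where the parameter and compute savings come from.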