27 Oct 2014 | Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, Nando de Freitas
This paper presents a method for reducing the number of parameters in deep learning models by exploiting the structured nature of learned weights. The key idea is to represent each weight matrix as a low-rank product of two smaller matrices, which allows for a significant reduction in parameters while maintaining model performance. The technique is general, applies to a variety of architectures, and is compatible with other recent training techniques such as dropout, rectified linear units, and maxout. Because most of the weights are predicted from a small set of learned parameters, the number of parameters that must be learned can be reduced by more than 95% without affecting accuracy.
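To make the parameter savings concrete, here is a minimal numpy sketch (not code from the paper) that compares the parameter count of a fully parameterized weight matrix against a rank-k factorization; the layer sizes and rank are illustrative choices.

```python
import numpy as np

# Illustrative layer: 1024 inputs -> 1024 outputs, factorized at rank 16.
n_in, n_out, rank = 1024, 1024, 16
rng = np.random.default_rng(0)

# Full parameterization: one learned entry per connection.
full_params = n_in * n_out

# Low-rank parameterization W ~= U @ V with U: (n_in, rank), V: (rank, n_out).
U = rng.standard_normal((n_in, rank))
V = rng.standard_normal((rank, n_out))
W_approx = U @ V                      # still an (n_in, n_out) weight matrix
factored_params = U.size + V.size

print(f"full: {full_params}, factored: {factored_params}, "
      f"saved: {1 - factored_params / full_params:.1%}")
# -> full: 1048576, factored: 32768, saved: 96.9%
```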
The paper demonstrates that the weights learned by deep networks tend to be highly structured; for example, first-layer filters trained on natural image patches are spatially smooth, and this structure can be exploited to reduce the number of parameters. Using a low-rank factorization of the weight matrix makes this parameter prediction efficient: the first factor is constructed from prior knowledge about the structure of the data (for instance, a smoothness kernel over pixel locations), while the second factor is learned. The approach is shown to be effective in both convolutional and multi-layer perceptron (MLP) networks.
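The following is a minimal sketch of that idea for a single hidden unit, assuming a squared-exponential kernel over 2D pixel coordinates to encode smoothness; the patch size, the number of learned "anchor" weights, and helper names such as se_kernel are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def se_kernel(coords_a, coords_b, lengthscale=2.0):
    """Squared-exponential kernel between sets of 2D pixel coordinates (assumed choice)."""
    d2 = ((coords_a[:, None, :] - coords_b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * lengthscale ** 2))

# Coordinates of a 16x16 input patch, one row per pixel.
side = 16
coords = np.array([(i, j) for i in range(side) for j in range(side)], dtype=float)
n_pix = side * side

rng = np.random.default_rng(0)

# A small subset of "anchor" pixel locations whose weights are actually learned.
n_anchor = 32
anchor = rng.choice(n_pix, size=n_anchor, replace=False)

# Stand-in for the learned weights of one hidden unit at the anchor locations.
w_anchor = rng.standard_normal(n_anchor)

# Predict the remaining weights by kernel ridge regression over pixel coordinates:
# the smoothness encoded by the kernel fills in the unlearned entries.
K_aa = se_kernel(coords[anchor], coords[anchor]) + 1e-6 * np.eye(n_anchor)
K_xa = se_kernel(coords, coords[anchor])
w_full = K_xa @ np.linalg.solve(K_aa, w_anchor)   # shape (256,)

print(w_full.shape, "weights predicted from", n_anchor, "learned parameters")
```

In this sketch only w_anchor would receive gradient updates; the kernel matrices play the role of the fixed first factor, which connects to the static versus dynamic parameter distinction discussed next.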
The paper also discusses the distinction between dynamic and static parameters. Dynamic parameters are updated repeatedly by the optimizer during training, while static parameters (such as the constructed first factor) are computed once and remain fixed. Static parameters are easier to handle in distributed systems, as they never need to be synchronized between machines. The technique is orthogonal to the choice of activation function and can be used alongside other recent advances in neural network training.
Experiments on various models, including MLPs, convolutional networks, and Reconstruction ICA, show that the parameter prediction technique is highly effective. In some cases, over 95% of the parameters can be predicted without any drop in accuracy. The technique is also shown to be applicable to different types of data and architectures, including image patches and speech data. The paper concludes that the method provides a valuable tool for reducing the number of parameters in deep learning models, which can lead to more efficient training and deployment of large-scale networks.