Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction

2011 | Jonathan Masci, Ueli Meier, Dan Cireșan, and Jürgen Schmidhuber
This paper introduces the Convolutional Auto-Encoder (CAE), a hierarchical unsupervised feature extractor that scales well to high-dimensional inputs. The CAE learns non-trivial features using plain stochastic gradient descent and discovers good CNN initializations that help avoid the numerous distinct local minima of the highly non-convex objectives arising in deep learning. Training uses conventional on-line gradient descent without additional regularization terms. A max-pooling layer proves essential for learning biologically plausible features consistent with those found by previous approaches. Initializing a CNN with the filters of a trained CAE stack yields superior performance on a digit recognition benchmark (MNIST) and an object recognition benchmark (CIFAR10).

The CAE architecture is similar to that of the denoising auto-encoder, except that the weights are shared. For a mono-channel input x, the latent representation of the k-th feature map is given by h^k = σ(x * W^k + b^k), where σ is an activation function and * denotes 2D convolution. The reconstruction is obtained as y = σ(∑_{k∈H} h^k * W̃^k + c), where H is the set of latent feature maps, c is a bias per input channel, and W̃ denotes the weights flipped over both dimensions. The cost function minimized is the mean squared error (MSE) between the input and its reconstruction (see the first sketch below).

A max-pooling layer is introduced to obtain translation-invariant representations. Max-pooling down-samples the latent representation by a constant factor, usually by taking the maximum value over non-overlapping sub-regions. This improves filter selectivity and forces the feature detectors to become more broadly applicable. The max-pooling layer is also an elegant way of enforcing the sparse code needed to deal with the overcomplete representations of convolutional architectures (see the second sketch below).

CAEs can be stacked to form a deep hierarchy, each layer receiving its input from the latent representation of the layer below. Unsupervised pre-training is performed in a greedy, layer-wise fashion. Afterwards the weights can be fine-tuned using back-propagation, or the top-level activations can be used as feature vectors for SVMs or other classifiers. Analogously, a CAE stack (CAES) can be used to initialize a CNN with identical topology prior to a supervised training stage (see the third sketch below).

The CAE is evaluated on the MNIST and CIFAR10 datasets. It performs well on both, achieving the best CIFAR10 result of any unsupervised architecture trained on non-whitened data, and CNNs initialized from a trained CAE stack outperform randomly initialized CNNs. The results indicate that the CAE is an effective method for hierarchical feature extraction and that max-pooling is essential for learning biologically plausible features.
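The following is a minimal sketch (not the authors' code) of a single tied-weight CAE layer in PyTorch, used here purely for illustration: the encoder computes h^k = σ(x * W^k + b^k), the decoder reconstructs y = σ(∑_k h^k * W̃^k + c) by reusing the same weight tensor through a transposed convolution, and training minimizes the MSE with plain SGD. The filter count, kernel size, learning rate, and batch shape are assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_filters, k = 8, 5                                           # assumed number of filters / kernel size
W = (0.1 * torch.randn(n_filters, 1, k, k)).requires_grad_()  # shared weights W^k
b = torch.zeros(n_filters, requires_grad=True)                # one bias b^k per feature map
c = torch.zeros(1, requires_grad=True)                        # one bias c per input channel

def cae_forward(x):
    h = torch.sigmoid(F.conv2d(x, W, b))            # h^k = sigmoid(x * W^k + b^k)
    # conv_transpose2d with the same W plays the role of convolving h with the
    # flipped filters W~ (tied weights), giving a full-size reconstruction.
    y = torch.sigmoid(F.conv_transpose2d(h, W, c))  # y = sigmoid(sum_k h^k * W~^k + c)
    return h, y

x = torch.rand(16, 1, 28, 28)                       # e.g. a mini-batch of MNIST-sized images
opt = torch.optim.SGD([W, b, c], lr=0.1)            # plain SGD, no extra regularization terms
for _ in range(10):                                 # a few illustrative steps
    opt.zero_grad()
    _, y = cae_forward(x)
    loss = F.mse_loss(y, x)                         # mean squared reconstruction error
    loss.backward()
    opt.step()
```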
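Continuing the sketch above (same W, b, c), the max-pooling step can be added by keeping only the maximum of each non-overlapping 2×2 region and zeroing out all non-maximal latent values before decoding, so the reconstruction must come from a sparse code. Using max_unpool2d for this is one possible implementation, not necessarily the authors'.

```python
def cae_forward_pooled(x, pool=2):
    h = torch.sigmoid(F.conv2d(x, W, b))                          # latent feature maps
    hp, idx = F.max_pool2d(h, pool, return_indices=True)          # down-sample by a constant factor
    hs = F.max_unpool2d(hp, idx, pool, output_size=h.shape[-2:])  # sparse code: non-maxima set to 0
    y = torch.sigmoid(F.conv_transpose2d(hs, W, c))               # reconstruct from the sparse code
    return hp, y
```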
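Finally, an illustrative sketch of the stacking recipe: train the first CAE layer on the raw inputs, train the next on the pooled latent representation of the layer below, and copy the learned filters into a CNN of identical topology before supervised fine-tuning. The layer sizes, pooling factor, and nn.Sequential layout are assumptions; the code reuses the imports and the input batch x from the first sketch.

```python
import torch.nn as nn

def train_cae_layer(data, in_ch, n_filt, k=5, steps=10):
    """Train one tied-weight CAE layer on `data`; return its filters and pooled codes."""
    W = (0.1 * torch.randn(n_filt, in_ch, k, k)).requires_grad_()
    b = torch.zeros(n_filt, requires_grad=True)
    c = torch.zeros(in_ch, requires_grad=True)
    opt = torch.optim.SGD([W, b, c], lr=0.1)
    for _ in range(steps):
        opt.zero_grad()
        h = torch.sigmoid(F.conv2d(data, W, b))
        y = torch.sigmoid(F.conv_transpose2d(h, W, c))
        F.mse_loss(y, data).backward()
        opt.step()
    with torch.no_grad():                                          # codes fed to the next layer
        codes = F.max_pool2d(torch.sigmoid(F.conv2d(data, W, b)), 2)
    return W.detach(), b.detach(), codes

# Greedy, layer-wise pre-training: each CAE sees the codes of the layer below.
W1, b1, h1 = train_cae_layer(x, in_ch=1, n_filt=8)
W2, b2, h2 = train_cae_layer(h1, in_ch=8, n_filt=16)

# Initialize a CNN of identical topology from the CAE stack, then fine-tune it
# with back-propagation on labels (or feed h2 to an SVM / another classifier).
cnn = nn.Sequential(
    nn.Conv2d(1, 8, 5), nn.Sigmoid(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 5), nn.Sigmoid(), nn.MaxPool2d(2),
)
with torch.no_grad():
    cnn[0].weight.copy_(W1); cnn[0].bias.copy_(b1)
    cnn[3].weight.copy_(W2); cnn[3].bias.copy_(b2)
```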