12 Jul 2012 | Quoc V. Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeff Dean, Andrew Y. Ng
This paper presents a method for learning high-level, class-specific feature detectors from unlabeled data. The approach trains a deep autoencoder with pooling and local contrast normalization on a large dataset of unlabeled images, using model parallelism and asynchronous stochastic gradient descent on a cluster of 1,000 machines (16,000 cores) for three days. The results show that it is possible to learn a face detector without any labeled data, and that the resulting feature detector is robust to translation, scaling, and out-of-plane rotation. The same network also learns detectors for other high-level concepts, such as cat faces and human bodies, and the experiments confirm that it can pick out faces, human bodies, and cat faces in random frames of YouTube videos. Using the learned features, the network achieves 15.8% accuracy in recognizing 22,000 object categories from ImageNet, a 70% relative improvement over the previous state of the art. The learned representations are discriminative and work well for object recognition, outperforming baselines such as standard deep autoencoders and K-means in recognition rate. The paper concludes that neurons can be trained to be selective for high-level concepts using entirely unlabeled data.
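As a rough illustration of the filtering, pooling, and local contrast normalization stages mentioned above, the sketch below shows one such sub-layer. It is not the authors' implementation: the original model uses untied, locally connected receptive fields trained as a sparse autoencoder rather than a convolution, and the layer sizes, kernel sizes, and use of PyTorch here are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class FilterPoolNormalize(nn.Module):
    """One filtering -> pooling -> normalization stage (hypothetical sketch)."""
    def __init__(self, in_channels=3, out_channels=8):
        super().__init__()
        # Filtering stage: the paper learns untied local receptive fields with a
        # sparse autoencoder objective; a plain convolution stands in here.
        self.encode = nn.Conv2d(in_channels, out_channels, kernel_size=18)
        # Pooling stage: L2 pooling over small neighborhoods gives partial
        # invariance to local translation.
        self.pool = nn.LPPool2d(norm_type=2, kernel_size=5, stride=5)
        # Local contrast normalization stage, approximated here with PyTorch's
        # local response normalization across channels.
        self.norm = nn.LocalResponseNorm(size=5)

    def forward(self, x):
        return self.norm(self.pool(self.encode(x)))

# Example: a batch of one 200x200 RGB image, the input resolution used in the paper.
features = FilterPoolNormalize()(torch.randn(1, 3, 200, 200))
print(features.shape)  # torch.Size([1, 8, 36, 36])
```

The full model stacks three such stages and trains them layer by layer as an autoencoder with a sparsity-encouraging reconstruction objective; the pooling and normalization stages are what give the learned detectors their reported invariance to translation and scaling.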