12 Jul 2012 | Quoc V. Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeff Dean, Andrew Y. Ng
This paper explores the feasibility of building high-level, class-specific feature detectors from unlabeled data. The authors trained a nine-layer locally connected sparse autoencoder with pooling and local contrast normalization on a dataset of 10 million 200x200-pixel images downloaded from the Internet. Using model parallelism and asynchronous SGD on a cluster of 1,000 machines (16,000 cores), they trained the network for three days. Remarkably, the network learned to detect faces without any labeled data, achieving 81.7% accuracy in classifying faces against random distractors. The learned features were also robust to translation, scaling, and out-of-plane rotation. The network additionally learned to detect cat faces and human bodies, with 74.8% and 76.7% accuracy, respectively. Starting from these learned features, the network achieved 15.8% accuracy in recognizing 22,000 object categories from ImageNet, a 70% relative improvement over the previous state of the art. The paper demonstrates that high-level features can be built from unlabeled data, providing an inexpensive way to develop features and addressing questions about the learnability of "grandmother neurons" in the human brain.
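To make the filtering / pooling / normalization sub-layers mentioned above more concrete, here is a minimal NumPy sketch of one such layer. The sizes, strides, and random weights are illustrative assumptions for a toy single-feature-map example, not the paper's configuration: the actual network uses 200x200 inputs, 18x18 untied receptive fields, multiple maps per location, and learns its filters with a sparse-autoencoder-style reconstruction objective rather than using random weights.

```python
import numpy as np

def local_filtering(image, filters, field, stride):
    """Untied local receptive fields: every output location has its own filter."""
    h, w = image.shape
    out_h = (h - field) // stride + 1
    out_w = (w - field) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+field, j*stride:j*stride+field]
            out[i, j] = np.sum(patch * filters[i, j])
    return out

def l2_pooling(features, pool):
    """Square root of the sum of squares over local neighborhoods."""
    h, w = features.shape
    out = np.zeros((h - pool + 1, w - pool + 1))
    for i in range(h - pool + 1):
        for j in range(w - pool + 1):
            out[i, j] = np.sqrt(np.sum(features[i:i+pool, j:j+pool] ** 2))
    return out

def local_contrast_norm(features, pool, eps=1e-8):
    """Subtract the local mean and divide by the local standard deviation."""
    h, w = features.shape
    pad = pool // 2
    padded = np.pad(features, pad, mode="reflect")
    out = np.zeros_like(features)
    for i in range(h):
        for j in range(w):
            window = padded[i:i+pool, j:j+pool]
            out[i, j] = (features[i, j] - window.mean()) / (window.std() + eps)
    return out

# Toy sizes chosen for speed; one layer = filtering -> pooling -> normalization,
# and the paper stacks three such layers to get its nine sub-layers.
rng = np.random.default_rng(0)
image = rng.standard_normal((40, 40))
field, stride, pool = 10, 2, 3
n = (40 - field) // stride + 1
filters = 0.1 * rng.standard_normal((n, n, field, field))  # one filter per output location
h1 = local_filtering(image, filters, field, stride)
h2 = l2_pooling(h1, pool)
h3 = local_contrast_norm(h2, pool)
print(h1.shape, h2.shape, h3.shape)  # (16, 16) (14, 14) (14, 14)
```

The asynchronous-SGD idea can likewise be illustrated with a toy, single-machine sketch: several worker threads share one parameter vector and apply unsynchronized updates computed on their own data shards. This is only a caricature of the asynchronous updates; it does not reproduce the paper's parameter-server infrastructure or its partitioning of the model itself across machines.

```python
import threading
import numpy as np

# Synthetic least-squares problem standing in for the real objective.
rng = np.random.default_rng(0)
true_w = rng.standard_normal(5)
X = rng.standard_normal((1000, 5))
y = X @ true_w

params = np.zeros(5)  # shared parameters: a "parameter server" in miniature

def worker(seed, shard, lr=0.01, steps=500):
    local_rng = np.random.default_rng(seed)
    Xs, ys = X[shard], y[shard]
    for _ in range(steps):
        i = local_rng.integers(len(Xs))
        residual = Xs[i] @ params - ys[i]
        params -= lr * residual * Xs[i]  # unsynchronized in-place update

shards = np.array_split(np.arange(len(X)), 4)  # each replica works on its own slice of data
threads = [threading.Thread(target=worker, args=(s, shard)) for s, shard in enumerate(shards)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("distance to true weights:", np.linalg.norm(params - true_w))
```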