Learning Methods for Generic Object Recognition with Invariance to Pose and Lighting

Yann LeCun, Fu Jie Huang, and Léon Bottou
This paper evaluates several learning methods for recognizing generic visual categories with invariance to pose, lighting, and surrounding clutter. The authors collected the NORB dataset: 97,200 stereo image pairs (194,400 images) of 50 uniform-colored toys, each imaged from 9 elevations and 36 azimuths under 6 lighting conditions. The toys fall into five generic categories: four-legged animals, human figures, airplanes, trucks, and cars. Five instances of each category were used for training and the other five for testing, so every test object is unseen during training. Low-resolution grayscale images with varying degrees of variability and clutter were used for training and testing. The dataset was designed to reflect real imaging situations by preserving natural variabilities while eliminating irrelevant clues.

The methods evaluated were K-Nearest Neighbors, Support Vector Machines (SVM), and Convolutional Networks (CNN), operating either on raw pixels or on PCA-derived features. On uniform backgrounds, test error rates for unseen object instances were around 13% for SVM and below 7% for CNN. On a combined segmentation/recognition task with highly cluttered images, SVM proved computationally impractical, while the CNN yielded 14% error. A real-time system built on the CNN detects and classifies objects in natural scenes at around 10 frames per second.

Overall, CNNs outperformed SVMs and K-NN in both accuracy and efficiency, especially in handling complex transformations such as changes in pose and lighting, and were more robust to variations in the input data. These results underscore the importance of trainable local feature extractors for robust, invariant recognition.
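To make the baseline concrete, here is a minimal sketch of nearest-neighbor classification on PCA-derived features, in the spirit of the K-NN baseline described above. It uses randomly generated synthetic "images" as a stand-in for NORB (the real data is not reproduced here), with a 50/50 instance split echoing the paper's five-train / five-test protocol; all names and parameters are illustrative, not the paper's.

```python
# Sketch: 1-NN on PCA features (numpy only; synthetic stand-in data, not NORB).
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 5 classes, 20 low-res "images" each (16x16 = 256 dims),
# with a class-dependent mean so the problem is learnable.
n_classes, per_class, dim = 5, 20, 256
means = rng.normal(0, 1, (n_classes, dim))
X = np.vstack([m + 0.3 * rng.normal(0, 1, (per_class, dim)) for m in means])
y = np.repeat(np.arange(n_classes), per_class)

# Split: alternate samples into train/test, echoing the paper's
# 5-train / 5-test instance split per category.
train = np.arange(len(y)) % 2 == 0
X_tr, y_tr, X_te, y_te = X[train], y[train], X[~train], y[~train]

# PCA via SVD of the centered training data; keep the top 30 components
# (the paper projects raw pixels onto leading principal components).
mu = X_tr.mean(axis=0)
_, _, Vt = np.linalg.svd(X_tr - mu, full_matrices=False)
P = Vt[:30].T                          # projection matrix, dim x 30
Z_tr, Z_te = (X_tr - mu) @ P, (X_te - mu) @ P

# 1-nearest-neighbor classification in PCA space.
d2 = ((Z_te[:, None, :] - Z_tr[None, :, :]) ** 2).sum(-1)
pred = y_tr[d2.argmin(axis=1)]
error_rate = (pred != y_te).mean()
print(f"1-NN on PCA features, test error: {error_rate:.2%}")
```

On real NORB the variability comes from pose and lighting rather than additive noise, which is precisely why such template-matching baselines fare worse than convolutional networks.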
The paper concludes that the NORB dataset provides a benchmark for evaluating learning-based approaches to invariant object recognition, and it emphasizes the importance of trainable local feature extractors for robust and invariant recognition. The authors suggest that future work will use trainable classifiers that incorporate explicit models of image formation and geometry.
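The "trainable local feature extractor" the conclusion refers to is the convolutional stage of a CNN: local filters with shared weights, a nonlinearity, and subsampling. The sketch below illustrates one such stage in plain numpy; it is not the paper's architecture, and the filters are random placeholders where training would normally set them.

```python
# Illustrative sketch of one convolutional feature-extraction stage
# (conv -> tanh -> 2x2 average subsampling). Numpy only; random filters
# stand in for trained ones.
import numpy as np

def conv2d_valid(img, kernel):
    """Naive 2-D valid cross-correlation of one image with one kernel."""
    H, W = img.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + kh, j:j + kw] * kernel).sum()
    return out

def feature_stage(img, kernels):
    """Apply each shared-weight filter, squash, then subsample 2x2."""
    maps = []
    for k in kernels:
        m = np.tanh(conv2d_valid(img, k))
        h, w = m.shape[0] // 2 * 2, m.shape[1] // 2 * 2
        m = m[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        maps.append(m)
    return np.stack(maps)

rng = np.random.default_rng(1)
img = rng.normal(size=(96, 96))        # grayscale input, NORB-like resolution
kernels = rng.normal(size=(8, 5, 5))   # 8 trainable 5x5 filters (random here)
maps = feature_stage(img, kernels)
print(maps.shape)                       # 8 feature maps at reduced resolution
```

Because the same small filters scan the whole image and the maps are subsampled, the extracted features shift gracefully with the object, which is the source of the robustness to pose variation discussed above.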