This paper introduces transforming auto-encoders, a new approach to learning visual features that deals with variations in position, orientation, scale, and lighting better than the scalar-output feature detectors typically used in neural networks. The authors argue that it is also more promising than the complicated, hand-engineered features (such as SIFT) used in computer vision, because learned features can be efficiently adapted to a new domain. They therefore propose capsules that output explicit instantiation parameters, making the spatial relationships between visual entities explicit and easy to compute with.
Capsules are local units that perform complex internal computations on their inputs and then encapsulate the results into a small vector of highly informative outputs. Each capsule learns to recognize an implicitly defined visual entity over a limited domain of viewing conditions and deformations. It outputs both the probability that the entity is present and a set of instantiation parameters, which may include the precise pose, lighting, and deformation of the visual entity relative to an implicitly defined canonical version.
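To make this concrete, the following is a minimal sketch (not the authors' code) of a single capsule's recognition pass, assuming made-up layer sizes and using only a 2-D position (x, y) as the instantiation parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class Capsule:
    """One capsule: complex internal computation -> (presence probability, pose vector)."""
    def __init__(self, n_pixels, n_hidden, n_pose=2, seed=0):
        rng = np.random.default_rng(seed)
        # Recognition weights: pixels -> hidden units -> (presence logit, pose)
        self.W_h = rng.normal(0.0, 0.01, (n_pixels, n_hidden))
        self.W_p = rng.normal(0.0, 0.01, (n_hidden, 1))       # presence probability
        self.W_x = rng.normal(0.0, 0.01, (n_hidden, n_pose))  # instantiation parameters

    def forward(self, image):
        h = sigmoid(image @ self.W_h)   # internal computation on the capsule's domain
        p = sigmoid(h @ self.W_p)[0]    # probability that the entity is present
        pose = h @ self.W_x             # e.g. the (x, y) position of the entity
        return p, pose                  # small vector of highly informative outputs

# Usage: one capsule looking at a flattened 10x10 image patch.
cap = Capsule(n_pixels=100, n_hidden=20)
p, pose = cap.forward(np.zeros(100))
```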
The paper shows that capsules can be used to recognize wholes by recognizing their parts. If a capsule can learn to output the pose of its visual entity in a vector that is linearly related to the "natural" representations of pose used in computer graphics, there is a simple and highly selective test for whether the visual entities represented by two active capsules have the right spatial relationship to activate a higher-level capsule.
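As a numerical illustration of this agreement test (the matrices, part names, and numbers below are invented for the example, not taken from the paper), suppose each capsule reports its pose as a 3x3 homogeneous matrix and the part-whole relations are fixed matrices associated with the higher-level capsule:

```python
import numpy as np

def pose(tx, ty, angle=0.0, scale=1.0):
    """2-D similarity transform as a 3x3 homogeneous matrix."""
    c, s = scale * np.cos(angle), scale * np.sin(angle)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0.0, 0.0, 1.0]])

# Fixed, viewpoint-invariant part-whole relations (learned weights in the model):
# where parts A and B sit within the whole's canonical frame.
R_A = pose(-1.0,  0.5)             # e.g. a nose relative to a face
R_B = pose( 1.0, -0.5, angle=0.1)  # e.g. a mouth relative to a face

# Poses reported by the two active part capsules for one image.
T_C = pose(4.0, 3.0, angle=0.4, scale=1.2)  # (unknown) pose of the whole
T_A = T_C @ R_A
T_B = T_C @ R_B

# Each part predicts the pose of the whole; tight agreement between the
# predictions is the simple, highly selective evidence for activating the
# higher-level (whole) capsule.
pred_A = T_A @ np.linalg.inv(R_A)
pred_B = T_B @ np.linalg.inv(R_B)
print("disagreement:", np.linalg.norm(pred_A - pred_B))  # ~0 -> whole is present
```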
The authors also show that transforming auto-encoders can be used to learn the first level of capsules, which convert pixel intensities into explicit representations of the pose of the visual entities they detect. The transforming auto-encoder is told which transformation has been applied to its input and is trained to output the correctly transformed image; by making the output of each capsule a 3x3 matrix rather than just a pair of coordinates, the same scheme handles a full 2-D affine transformation (translation, rotation, scaling, and shearing).
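A rough sketch of one forward pass through such a transforming auto-encoder is given below; the layer sizes, the weight layout, and the exact order in which the known transformation T is applied to each capsule's matrix are assumptions made for illustration, not details taken from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_pix, n_rec, n_gen, n_caps = 100, 10, 20, 30   # assumed sizes

# Per-capsule weights: recognition units -> (gate, 3x3 pose); generation units -> pixels.
W_rec  = rng.normal(0.0, 0.01, (n_caps, n_pix, n_rec))
W_gate = rng.normal(0.0, 0.01, (n_caps, n_rec, 1))
W_pose = rng.normal(0.0, 0.01, (n_caps, n_rec, 9))
W_gen1 = rng.normal(0.0, 0.01, (n_caps, 9, n_gen))
W_gen2 = rng.normal(0.0, 0.01, (n_caps, n_gen, n_pix))

def forward(image, T):
    """image: flattened input; T: the known 3x3 affine transformation given to the net."""
    output = np.zeros(n_pix)
    for k in range(n_caps):
        h = sigmoid(image @ W_rec[k])              # recognition units
        p = sigmoid(h @ W_gate[k])                 # probability the capsule is active
        M = (h @ W_pose[k]).reshape(3, 3)          # capsule's pose as a 3x3 matrix
        M_out = M @ T                              # apply the demanded transformation
        g = sigmoid(M_out.reshape(9) @ W_gen1[k])  # generation units
        output += p * (g @ W_gen2[k])              # gated contribution to the prediction
    return output                                  # compared with the transformed image

# Smoke test: identity transformation, random input.
recon = forward(rng.normal(size=n_pix), np.eye(3))
```

Training would then minimize the squared error between this reconstruction and the actual transformed image, backpropagating through every capsule.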
The paper concludes that transforming auto-encoders have an interesting relationship to Kalman filters and that small transformations between adjacent time-frames can be modeled as zero-mean Gaussian noise. The authors also compare their approach with other models and argue that it is more effective at learning explicit representations of spatial transformations.