This paper introduces transforming auto-encoders, a new approach to learning visual features that deals with variations in position, orientation, scale, and lighting better than the scalar-output feature detectors typically used in neural networks. The authors argue that it is also more promising than the complicated, hand-engineered features (such as SIFT) used in computer vision, because learned features can be efficiently adapted to a new domain. They therefore propose capsules that output explicit instantiation parameters, making the spatial relationships between visual entities explicit and easy to compute with.
Capsules are local units that perform complex internal computations on their inputs and then encapsulate the results into a small vector of highly informative outputs. Each capsule learns to recognize an implicitly defined visual entity over a limited domain of viewing conditions and deformations. It outputs both the probability that the entity is present and a set of instantiation parameters, which may include the precise pose, lighting, and deformation of the visual entity relative to an implicitly defined canonical version.
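To make this concrete, the following is a minimal sketch (not the authors' code) of a single capsule's recognition pass, assuming made-up layer sizes and using only a 2-D position (x, y) as the instantiation parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class Capsule:
    """One capsule: complex internal computation -> (presence probability, pose vector)."""
    def __init__(self, n_pixels, n_hidden, n_pose=2, seed=0):
        rng = np.random.default_rng(seed)
        # Recognition weights: pixels -> hidden units -> (presence logit, pose)
        self.W_h = rng.normal(0.0, 0.01, (n_pixels, n_hidden))
        self.W_p = rng.normal(0.0, 0.01, (n_hidden, 1))       # presence probability
        self.W_x = rng.normal(0.0, 0.01, (n_hidden, n_pose))  # instantiation parameters

    def forward(self, image):
        h = sigmoid(image @ self.W_h)   # internal computation on the capsule's domain
        p = sigmoid(h @ self.W_p)[0]    # probability that the entity is present
        pose = h @ self.W_x             # e.g. the (x, y) position of the entity
        return p, pose                  # small vector of highly informative outputs

# Usage: one capsule looking at a flattened 10x10 image patch.
cap = Capsule(n_pixels=100, n_hidden=20)
p, pose = cap.forward(np.zeros(100))
```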
The paper shows that capsules can be used to recognize wholes by recognizing their parts. If a capsule can learn to output the pose of its visual entity in a vector that is linearly related to the "natural" representations of pose used in computer graphics, there is a simple and highly selective test for whether the visual entities represented by two active capsules have the right spatial relationship to activate a higher-level capsule.
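As a numerical illustration of this agreement test (the matrices, part names, and numbers below are invented for the example, not taken from the paper), suppose each capsule reports its pose as a 3x3 homogeneous matrix and the part-whole relations are fixed matrices associated with the higher-level capsule:

```python
import numpy as np

def pose(tx, ty, angle=0.0, scale=1.0):
    """2-D similarity transform as a 3x3 homogeneous matrix."""
    c, s = scale * np.cos(angle), scale * np.sin(angle)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0.0, 0.0, 1.0]])

# Fixed, viewpoint-invariant part-whole relations (learned weights in the model):
# where parts A and B sit within the whole's canonical frame.
R_A = pose(-1.0,  0.5)             # e.g. a nose relative to a face
R_B = pose( 1.0, -0.5, angle=0.1)  # e.g. a mouth relative to a face

# Poses reported by the two active part capsules for one image.
T_C = pose(4.0, 3.0, angle=0.4, scale=1.2)  # (unknown) pose of the whole
T_A = T_C @ R_A
T_B = T_C @ R_B

# Each part predicts the pose of the whole; tight agreement between the
# predictions is the simple, highly selective evidence for activating the
# higher-level (whole) capsule.
pred_A = T_A @ np.linalg.inv(R_A)
pred_B = T_B @ np.linalg.inv(R_B)
print("disagreement:", np.linalg.norm(pred_A - pred_B))  # ~0 -> whole is present
```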
The authors also show that transforming auto-encoders can be used to learn the first level of capsules, which convert pixel intensities into explicit representations of the pose of the visual entities they detect. The transforming auto-encoder is told which transformation has been applied to its input and is trained to output the correctly transformed image; by making the output of each capsule a 3x3 matrix rather than just a pair of coordinates, the same scheme handles a full 2-D affine transformation (translation, rotation, scaling, and shearing).
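A rough sketch of one forward pass through such a transforming auto-encoder is given below; the layer sizes, the weight layout, and the exact order in which the known transformation T is applied to each capsule's matrix are assumptions made for illustration, not details taken from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_pix, n_rec, n_gen, n_caps = 100, 10, 20, 30   # assumed sizes

# Per-capsule weights: recognition units -> (gate, 3x3 pose); generation units -> pixels.
W_rec  = rng.normal(0.0, 0.01, (n_caps, n_pix, n_rec))
W_gate = rng.normal(0.0, 0.01, (n_caps, n_rec, 1))
W_pose = rng.normal(0.0, 0.01, (n_caps, n_rec, 9))
W_gen1 = rng.normal(0.0, 0.01, (n_caps, 9, n_gen))
W_gen2 = rng.normal(0.0, 0.01, (n_caps, n_gen, n_pix))

def forward(image, T):
    """image: flattened input; T: the known 3x3 affine transformation given to the net."""
    output = np.zeros(n_pix)
    for k in range(n_caps):
        h = sigmoid(image @ W_rec[k])              # recognition units
        p = sigmoid(h @ W_gate[k])                 # probability the capsule is active
        M = (h @ W_pose[k]).reshape(3, 3)          # capsule's pose as a 3x3 matrix
        M_out = M @ T                              # apply the demanded transformation
        g = sigmoid(M_out.reshape(9) @ W_gen1[k])  # generation units
        output += p * (g @ W_gen2[k])              # gated contribution to the prediction
    return output                                  # compared with the transformed image

# Smoke test: identity transformation, random input.
recon = forward(rng.normal(size=n_pix), np.eye(3))
```

Training would then minimize the squared error between this reconstruction and the actual transformed image, backpropagating through every capsule.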
The paper concludes that transforming auto-encoders have an interesting relationship to Kalman filters and that small transformations between adjacent time-frames can be modeled as zero-mean Gaussian noise. The authors also compare their approach with other models and argue that it is more effective at learning explicit representations of spatial transformations.