Efficient Object Localization Using Convolutional Networks

9 Jun 2015 | Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, Christoph Bregler
This paper presents a novel ConvNet architecture for efficient human joint localization in monocular RGB images, achieving high spatial accuracy without significant computational overhead. Traditional ConvNet architectures use pooling layers to reduce computational requirements, introduce invariance, and prevent over-training, but pooling also reduces spatial localization accuracy. The proposed architecture addresses this trade-off in two stages: a multi-resolution ConvNet produces a coarse heat-map output, and an efficient 'position refinement' model then estimates the joint offset location within a small image region. This allows increased pooling for computational efficiency while maintaining high spatial precision. On the FLIC dataset the model's variance approaches that of human annotations, and on the MPII-human-pose dataset it outperforms existing approaches.
The coarse heat-map regression model produces one heat-map per joint, describing the likelihood of that joint occurring at each spatial location. It is trained by minimizing the Mean-Squared-Error (MSE) distance between the predicted and ground-truth heat-maps, and during training images are randomly rotated, scaled, and flipped to improve generalization performance. The position refinement model is trained jointly with this state-of-the-art coarse model to improve localization accuracy. The model also incorporates an MRF-based spatial model to infer the most likely joint locations given noisy input distributions from the ConvNet.
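The coarse-then-refine idea described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the Gaussian heat-maps stand in for network outputs, and the image size, `stride`, window radius, `sigma`, and all function names are assumptions chosen only for illustration. It shows a ground-truth style heat-map, the MSE objective, and how a low-resolution argmax plus a small-window offset recovers a full-resolution joint location.

```python
import numpy as np

def gaussian_heatmap(height, width, center_xy, sigma=1.0):
    """Heat-map with a 2D Gaussian bump centred on (x, y); a stand-in for a network output."""
    ys, xs = np.mgrid[0:height, 0:width]
    cx, cy = center_xy
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def mse_loss(pred, target):
    """Mean-squared error between a predicted and a ground-truth heat-map."""
    return float(np.mean((pred - target) ** 2))

def argmax_xy(heatmap):
    """Location of the heat-map maximum as (x, y)."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(col), int(row)

def refine(coarse_heatmap, fine_heatmap, stride):
    """Combine a coarse argmax (scaled by the pooling stride) with a fine offset.

    fine_heatmap is assumed to be a square, odd-sized window centred on the
    coarse estimate, at full image resolution.
    """
    cx, cy = argmax_xy(coarse_heatmap)
    coarse_xy = (cx * stride, cy * stride)
    radius = fine_heatmap.shape[0] // 2
    fx, fy = argmax_xy(fine_heatmap)
    offset = (fx - radius, fy - radius)
    refined_xy = (coarse_xy[0] + offset[0], coarse_xy[1] + offset[1])
    return coarse_xy, offset, refined_xy

# Toy setup (all values hypothetical): 64x64 image, 8x pooling, 9x9 refinement window.
true_xy, stride, radius = (37, 22), 8, 4
coarse = gaussian_heatmap(8, 8, (true_xy[0] / stride, true_xy[1] / stride))

# Stand-in for the refinement model: a heat-map over the window around the coarse estimate.
cx, cy = argmax_xy(coarse)
origin = (cx * stride - radius, cy * stride - radius)
fine = gaussian_heatmap(9, 9, (true_xy[0] - origin[0], true_xy[1] - origin[1]))

coarse_xy, offset, refined_xy = refine(coarse, fine, stride)
print(coarse_xy, offset, refined_xy)
```

In this toy case the coarse estimate can only land on multiples of the stride, so it misses the true joint by a few pixels; the window-level offset accounts for that residual without evaluating a full-resolution heat-map over the whole image.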
The fine heat-map regression model is a Siamese network that uses shared convolutional features to reduce the number of trainable parameters and prevent over-training. Like the coarse model, it is trained by minimizing the distance between the predicted and ground-truth heat-maps. Evaluated on the FLIC and MPII datasets, the model outperforms previous state-of-the-art results in joint localization accuracy, particularly in the high-precision region. The architecture therefore allows efficient localization of human joints while retaining the computational benefits of pooling.
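The parameter saving from Siamese-style weight sharing can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's network: the single 3x3 filter, the 9x9 crops, the four joints, and the class name `SharedBranchBank` are all hypothetical. The point is only that every branch applies the same parameters, so the trainable parameter count does not grow with the number of joints.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 'valid' 2D cross-correlation of one image with one kernel."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

class SharedBranchBank:
    """Applies one shared convolution + ReLU to every per-joint crop."""

    def __init__(self, kernel):
        self.kernel = kernel  # the only trainable tensor, reused by all branches

    def forward(self, crops):
        return [np.maximum(conv2d_valid(c, self.kernel), 0.0) for c in crops]

    def num_parameters(self):
        return int(self.kernel.size)

rng = np.random.default_rng(0)
num_joints = 4                                   # hypothetical joint count
crops = [rng.standard_normal((9, 9)) for _ in range(num_joints)]
bank = SharedBranchBank(rng.standard_normal((3, 3)))

features = bank.forward(crops)
shared_params = bank.num_parameters()            # one kernel, shared
unshared_params = num_joints * shared_params     # one kernel per joint instead
print(shared_params, unshared_params)
```

With separate weights per joint, the same layer would need `num_joints` times as many parameters, which is the over-training risk the shared design is meant to reduce.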