4 Oct 2017 | Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt
The paper presents a CNN-based approach for 3D human body pose estimation from single RGB images, addressing the poor generalizability of models trained on scarce 3D pose data. The authors introduce a new training set, MPI-INF-3DHP, which captures real humans with ground truth 3D annotations from a multi-camera markerless motion capture system, providing greater diversity in pose, appearance, clothing, occlusion, and viewpoints. They also propose a new benchmark covering outdoor and indoor scenes. The method uses transfer learning from 2D pose data to improve 3D pose estimation accuracy and generalizability. The contributions include:
1. **Transfer Learning**: The authors use transfer learning to carry mid- and high-level features learned on 2D pose datasets over to 3D pose estimation. They show that this approach significantly improves accuracy and generalizability compared to naive weight initialization and domain adaptation methods.
2. **New Dataset**: The MPI-INF-3DHP dataset captures real humans with ground truth 3D annotations, providing a more diverse range of poses, clothing, interactions, and viewpoints compared to existing datasets. The dataset is captured in a multi-camera studio with ground truth from a state-of-the-art markerless motion capture system.
3. **CNN Architecture**: The authors propose a CNN architecture for 3D pose estimation, which includes multi-level corrective skip connections and multi-modal pose fusion to improve performance. The architecture is trained using a combination of ImageNet pre-trained weights and transfer learning from 2D pose data.
4. **Evaluation**: The method is evaluated on established benchmarks (Human3.6M and HumanEva) and a new in-the-wild benchmark. The results show that the proposed method achieves state-of-the-art performance on these benchmarks and generalizes well to in-the-wild scenes.
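The transfer-learning idea in the first contribution can be illustrated with a minimal sketch. It is not the authors' implementation: the layer names (`conv1`, `res4`, `heatmap_2d`, `pose_3d`), the flat-list weight layout, and the learning-rate multipliers are all hypothetical, chosen only to show the mechanics. Layers shared with a 2D pose network are initialised from its weights and fine-tuned gently, while the 3D-specific layers keep their fresh initialisation.

```python
import random


def transfer_weights(source, target, lr_transferred=0.1, lr_new=1.0):
    """Initialise a 3D-pose net from a 2D-pose net where layers match.

    `source` and `target` map layer names to flat lists of weights
    (a hypothetical layout used only for this sketch).
    Returns the initialised weights and a per-layer learning-rate multiplier.
    """
    init, lr_mult = {}, {}
    for name, weights in target.items():
        src = source.get(name)
        if src is not None and len(src) == len(weights):
            # Shared layer: reuse the features learned on 2D pose data.
            init[name] = list(src)
            lr_mult[name] = lr_transferred
        else:
            # Task-specific layer: keep its fresh initialisation.
            init[name] = list(weights)
            lr_mult[name] = lr_new
    return init, lr_mult


rng = random.Random(0)

# Hypothetical 2D-pose network; values stand in for trained weights.
pose2d_net = {
    "conv1": [0.5, -0.2, 0.1],
    "res4": [0.3, 0.7],
    "heatmap_2d": [0.9],  # 2D-only head, not transferred
}

# Hypothetical 3D-pose network with small random initial weights.
pose3d_net = {
    "conv1": [rng.gauss(0, 0.01) for _ in range(3)],
    "res4": [rng.gauss(0, 0.01) for _ in range(2)],
    "pose_3d": [rng.gauss(0, 0.01) for _ in range(4)],  # new 3D head
}

init, lr_mult = transfer_weights(pose2d_net, pose3d_net)
for name in init:
    print(name, lr_mult[name])
```

In a real framework the same pattern usually amounts to loading a subset of a checkpoint and giving the transferred and new layers separate optimizer settings. Which layers to transfer and how aggressively to fine-tune them are design choices that have to be tuned empirically.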
The paper concludes by discussing the limitations of the method, such as the bias towards chest-height cameras and the need for further improvements in real-time performance. The authors argue that combining transfer learning with algorithmic and data contributions is crucial for advancing 3D body pose estimation.