Monocular 3D Human Pose Estimation In The Wild Using Improved CNN Supervision

4 Oct 2017 | Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt
This paper presents a CNN-based approach to monocular 3D human pose estimation that addresses the limited generalizability of models trained solely on publicly available 3D pose data. Using existing 3D and 2D pose data, the method achieves state-of-the-art performance on established benchmarks while generalizing to in-the-wild scenes. The paper argues that transfer learning of representations, together with algorithmic and data contributions, is crucial for general 3D body pose estimation.
A new training set, MPI-INF-3DHP, provides ground truth 3D annotations from a markerless motion capture system. It complements existing corpora with greater diversity in pose, human appearance, clothing (including everyday clothing), occlusion, and viewpoints, and it covers a large range of motions and interactions with objects. The data capture approach eases appearance augmentation to extend the captured variability, and improvements to existing augmentation methods add foreground texture variation. A new benchmark covering outdoor and indoor scenes is also introduced; on it, the new 3D pose dataset yields better in-the-wild performance than existing annotated data, with further gains from transfer learning from 2D pose data.
The method estimates 3D human body pose from a single image in three stages: (1) extraction of the actor bounding box from 2D detections; (2) direct CNN-based 3D pose regression; and (3) computation of the global root position in the original footage by aligning the 3D pose to the 2D pose. The 2D pose is estimated by a CNN called 2DPoseNet, and the 3D pose by a second CNN called 3DPoseNet.
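As an illustration of the three-stage pipeline described above, the following is a minimal sketch of stages (1) and (3) only; stage (2), the CNN regression, is left out. It is not the paper's implementation. The function names, the confidence threshold, the box margin, and the weak-perspective least-squares alignment are all assumptions made for this example, and the 2D keypoints in stage (3) are assumed to be given in pixels relative to the principal point.

```python
def bounding_box_from_keypoints(keypoints, min_confidence=0.1, margin=0.2):
    """Stage 1 sketch: square box around the confident 2D detections.

    keypoints: iterable of (x, y, confidence) tuples in pixels.
    min_confidence and margin are illustrative defaults (assumptions),
    not values taken from the paper.
    Returns (x_min, y_min, x_max, y_max).
    """
    pts = [(x, y) for x, y, c in keypoints if c >= min_confidence]
    if not pts:
        raise ValueError("no keypoint passed the confidence threshold")
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    cx = (min(xs) + max(xs)) / 2.0
    cy = (min(ys) + max(ys)) / 2.0
    # Square side: the larger extent, enlarged by the margin on each side.
    extent = max(max(xs) - min(xs), max(ys) - min(ys))
    half = extent * (1.0 + 2.0 * margin) / 2.0
    return (cx - half, cy - half, cx + half, cy + half)


def estimate_root_translation(pose3d, pose2d, focal_length):
    """Stage 3 sketch: place a root-relative 3D pose in camera space.

    Assumes weak perspective: every joint is treated as lying at the root
    depth tz, so a joint (x, y, z) with root translation (tx, ty, tz)
    projects to (f / tz) * (x + tx, y + ty). The joint depths z are
    therefore ignored. The scale s = f / tz and the offset are fitted by
    ordinary least squares.

    pose3d: list of (x, y, z) root-relative joints (e.g. in millimetres).
    pose2d: list of (u, v) pixels relative to the principal point.
    Returns (tx, ty, tz) in the units of pose3d.
    """
    n = len(pose3d)
    if n != len(pose2d) or n < 2:
        raise ValueError("need at least two matching 3D/2D joints")
    px = sum(p[0] for p in pose3d) / n
    py = sum(p[1] for p in pose3d) / n
    kx = sum(k[0] for k in pose2d) / n
    ky = sum(k[1] for k in pose2d) / n
    # Least-squares scale between the centred 3D (x, y) and centred 2D points.
    num = sum((p[0] - px) * (k[0] - kx) + (p[1] - py) * (k[1] - ky)
              for p, k in zip(pose3d, pose2d))
    den = sum((p[0] - px) ** 2 + (p[1] - py) ** 2 for p in pose3d)
    if den == 0 or num <= 0:
        raise ValueError("degenerate pose: cannot estimate scale")
    scale = num / den
    tz = focal_length / scale
    return (kx / scale - px, ky / scale - py, tz)
```

In use, the box from `bounding_box_from_keypoints` would define the crop passed to the 3D regression network, and `estimate_root_translation` would then position the regressed root-relative pose in the original camera frame. The weak-perspective fit is closed-form and cheap, but it becomes less accurate when the subject is close to the camera relative to their depth extent.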
The new test set includes outdoor sequences with accurate annotation, and the method is validated on it. The components of the method are also evaluated thoroughly on existing test datasets, showing state-of-the-art results in controlled settings and, more importantly, improvements over existing solutions on in-the-wild sequences thanks to better generalization. Transfer learning lets the method leverage features learned on in-the-wild 2D pose datasets in conjunction with existing annotated 3D pose datasets. The paper also introduces a new CNN architecture for 3D pose estimation that uses multi-level corrective skip connections and multi-modal pose fusion to improve accuracy and generalization. Evaluated on several benchmarks, including Human3.6m, HumanEva, and MPI-INF-3DHP, the method shows significant improvements over existing methods. The paper concludes that transfer learning of representations, along with algorithmic and data contributions, is crucial for general 3D body pose estimation.