29 Mar 2019 | Dario Pavllo*, Christoph Feichtenhofer, David Grangier*, Michael Auli
This paper presents a fully convolutional model for 3D human pose estimation in video, built on dilated temporal convolutions and semi-supervised training. The model takes sequences of 2D keypoints as input and outputs 3D pose estimates, so it is compatible with any 2D keypoint detector, and the dilated convolutions let it exploit long temporal contexts. It outperforms previous state-of-the-art results in both supervised and semi-supervised settings. In the supervised setting it improves on the previous best result on Human3.6M by 6 mm mean per-joint position error, an 11% error reduction, and it also shows significant improvements on HumanEva-I. For the semi-supervised setting, the authors introduce back-projection, a training method that leverages unlabeled video: it requires only the camera intrinsic parameters rather than ground-truth 2D annotations or multi-view imagery with extrinsic calibration, and it outperforms previous semi-supervised methods when labeled data is scarce. Compared with RNN-based models, the convolutional architecture has lower computational complexity and fewer parameters, and it allows faster training and inference.

The model is implemented as a fully convolutional architecture with residual connections and dilated convolutions, trained on 2D keypoint data, with the semi-supervised back-projection objective used when labeled data is scarce. It is evaluated on two motion-capture datasets, Human3.6M and HumanEva-I, under multiple protocols: mean per-joint position error (MPJPE), P-MPJPE (after rigid alignment), and N-MPJPE (after scale normalization). The model achieves significant improvements under all protocols, with the best published results on Human3.6M and HumanEva-I at the time, and it is also more efficient than previous methods in terms of computational complexity.
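The temporal model stacks 1D convolutions over the frame axis with geometrically growing dilation factors, so a handful of layers covers hundreds of frames of context. The following is a minimal NumPy sketch (not the authors' implementation) of a valid dilated 1D convolution over a keypoint sequence, plus the receptive-field arithmetic; kernel width 3 with dilations 1, 3, 9, 27, 81 yields the 243-frame receptive field reported for the largest model:

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Valid 1D dilated convolution along the time axis.

    x: (T, C_in) per-frame features, e.g. 17 joints x 2 coords flattened.
    w: (K, C_in, C_out) temporal filter of width K.
    Returns an array of shape (T - (K - 1) * dilation, C_out).
    """
    K = w.shape[0]
    T_out = x.shape[0] - (K - 1) * dilation
    out = np.zeros((T_out, w.shape[2]))
    for t in range(T_out):
        for k in range(K):
            # Sample every `dilation`-th frame within the kernel window.
            out[t] += x[t + k * dilation] @ w[k]
    return out

def receptive_field(kernel_sizes, dilations):
    """Frames of context seen by a stack of valid dilated convolutions."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf
```

For example, `receptive_field([3, 3, 3, 3, 3], [1, 3, 9, 27, 81])` returns 243: each layer adds `(k - 1) * d` frames of context, so exponentially growing dilations give exponential context growth at linear parameter cost, which is the key efficiency advantage over RNNs noted above.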
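The back-projection idea can be summarized in a few lines: predicted 3D poses from unlabeled video are projected back into the image with a pinhole camera model and penalized against off-the-shelf 2D detections, which is why only intrinsics are needed. A minimal sketch, assuming a simple linear pinhole model without lens distortion (the intrinsic values in the example are illustrative, not taken from the paper):

```python
import numpy as np

def project(points_3d, f, c):
    """Pinhole projection of (J, 3) camera-space points to pixel coordinates,
    given focal lengths f = (fx, fy) and principal point c = (cx, cy)."""
    xy = points_3d[:, :2] / points_3d[:, 2:3]  # perspective divide by depth
    return xy * np.asarray(f) + np.asarray(c)

def backprojection_loss(pred_3d, detected_2d, f, c):
    """Mean 2D reprojection error between the projected 3D prediction and
    the 2D keypoints produced by a detector on unlabeled video."""
    reproj = project(pred_3d, f, c)
    return np.mean(np.linalg.norm(reproj - detected_2d, axis=-1))
```

A prediction that is consistent with the 2D detections incurs zero loss, so this term supervises the 3D branch on unlabeled footage without ever requiring ground-truth 3D poses, multi-view imagery, or extrinsic calibration.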
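The three evaluation protocols differ only in how the prediction is aligned to ground truth before averaging per-joint distances: MPJPE aligns root joints, P-MPJPE applies a full rigid (Procrustes) alignment in scale, rotation, and translation, and N-MPJPE normalizes scale only. A NumPy sketch of the first two (standard formulations, not code from the paper):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: mean Euclidean distance over joints.
    pred, gt: (J, 3) root-relative 3D joint positions."""
    return np.mean(np.linalg.norm(pred - gt, axis=-1))

def p_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment: the optimal similarity transform
    (scale, rotation, translation) mapping pred onto gt, found via SVD."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    P, G = pred - mu_p, gt - mu_g
    U, s, Vt = np.linalg.svd(P.T @ G)
    # Correct an improper rotation (reflection) if the determinant is negative.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    scale = np.trace(np.diag(s) @ D) / (P ** 2).sum()
    aligned = scale * P @ R.T + mu_g
    return mpjpe(aligned, gt)
```

Because P-MPJPE factors out any global similarity transform, it isolates errors in pose *shape*; a prediction that is a rotated, scaled, shifted copy of the ground truth scores zero under P-MPJPE but not under MPJPE.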