29 Apr 2020 | Muhammed Kocabas, Nikos Athanasiou, Michael J. Black
VIBE (Video Inference for Human Body Pose and Shape Estimation) is a video-based method for estimating 3D human body pose and shape from monocular video. It leverages the large-scale AMASS motion capture dataset together with unpaired 2D keypoint annotations to train a temporal generative adversarial network (GAN): an adversarial loss pushes the generator to produce motion sequences that are indistinguishable from real human motion, while the motion discriminator, trained on ground-truth motion capture data, learns the kinematics of plausible human movement.

The generator consists of a convolutional neural network pretrained for single-image pose estimation, followed by a recurrent temporal encoder and a body parameter regressor. The motion discriminator is built from gated recurrent units (GRUs) with a self-attention mechanism that amplifies the contribution of distinctive frames. The full model is trained with a combination of regression and adversarial losses that minimize the error between predicted and ground-truth keypoints, pose, and shape parameters.

Evaluated on multiple benchmarks, including 3DPW and MPI-INF-3DHP, VIBE outperforms previous state-of-the-art methods and shows significant improvements in pose and shape estimation over single-frame approaches. It handles in-the-wild videos with complex motion and occlusions, as well as motion sequences with varying speeds and poses. The code is available for research purposes at https://github.com/mkocabas/VIBE.
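To make the generator side concrete, here is a minimal PyTorch sketch: per-frame CNN features are passed through a GRU temporal encoder, and an iterative regressor predicts SMPL pose, shape, and camera parameters. The dimensions, the residual connection, and the 85-parameter layout follow common SMPL-based pipelines; module and parameter names are illustrative, not the paper's exact code.

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """GRU over per-frame CNN features; a residual connection keeps
    the single-frame features as a baseline (dimensions illustrative)."""
    def __init__(self, feat_dim=2048, hidden_dim=1024):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, feat_dim)

    def forward(self, feats):            # feats: (B, T, feat_dim)
        out, _ = self.gru(feats)         # temporal context per frame
        return feats + self.proj(out)    # residual over static features

class SMPLRegressor(nn.Module):
    """Iterative regressor for 85 SMPL parameters:
    72 pose (24 joints x 3 axis-angle) + 10 shape + 3 weak-perspective camera."""
    def __init__(self, feat_dim=2048, n_params=85, n_iter=3):
        super().__init__()
        self.n_iter = n_iter
        self.fc = nn.Sequential(
            nn.Linear(feat_dim + n_params, 1024),
            nn.ReLU(),
            nn.Linear(1024, n_params),
        )
        self.register_buffer("init_theta", torch.zeros(1, n_params))

    def forward(self, feats):            # feats: (N, feat_dim), N = B*T
        theta = self.init_theta.expand(feats.size(0), -1)
        for _ in range(self.n_iter):     # refine the estimate step by step
            theta = theta + self.fc(torch.cat([feats, theta], dim=1))
        return theta
```

In use, the temporal encoder's output of shape (B, T, feat_dim) would be flattened to (B*T, feat_dim) before the regressor, then reshaped back into per-frame predictions.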
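The motion discriminator can be sketched similarly: a GRU consumes a sequence of pose parameters (predicted or real mocap), and a learned self-attention pooling weights the hidden states so that distinctive frames contribute more than a plain average would. Layer sizes here are assumptions.

```python
class MotionDiscriminator(nn.Module):
    """GRU over a pose sequence, followed by self-attention pooling
    and a linear real/fake head (layer sizes are assumptions)."""
    def __init__(self, pose_dim=72, hidden_dim=1024, attn_dim=1024):
        super().__init__()
        self.gru = nn.GRU(pose_dim, hidden_dim, batch_first=True)
        self.attention = nn.Sequential(   # scores each frame's hidden state
            nn.Linear(hidden_dim, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, poses):             # poses: (B, T, pose_dim)
        h, _ = self.gru(poses)            # (B, T, hidden_dim)
        w = torch.softmax(self.attention(h), dim=1)  # (B, T, 1) frame weights
        pooled = (w * h).sum(dim=1)       # attention-weighted pooling
        return self.out(pooled)           # (B, 1) real/fake score
```

Pooling with learned attention weights, rather than keeping only the final hidden state, is what lets the discriminator amplify the frames that are most informative about whether a motion is realistic.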
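Finally, the training objective combines the regression terms with the adversarial term. A least-squares (LSGAN-style) formulation is a natural fit for this setup; the loss weights below are placeholders, not tuned values from the paper.

```python
import torch.nn.functional as F

def discriminator_loss(disc, real_motion, fake_motion):
    """Push D(real) toward 1 and D(fake) toward 0 (least-squares GAN)."""
    d_real = disc(real_motion)            # real mocap sequences (e.g. AMASS)
    d_fake = disc(fake_motion.detach())   # detach: don't update the generator
    return ((d_real - 1.0) ** 2).mean() + (d_fake ** 2).mean()

def generator_loss(disc, fake_motion, kp_pred, kp_gt,
                   theta_pred=None, theta_gt=None,
                   w_kp=300.0, w_adv=2.0):  # placeholder weights
    """Keypoint regression + adversarial term; SMPL parameter supervision
    is added only for samples where ground truth exists."""
    loss = w_kp * F.mse_loss(kp_pred, kp_gt)            # keypoint error
    if theta_gt is not None:
        loss = loss + F.mse_loss(theta_pred, theta_gt)  # pose/shape error
    loss = loss + w_adv * ((disc(fake_motion) - 1.0) ** 2).mean()
    return loss
```

This split mirrors the training regime described above: 2D keypoint supervision is always available, 3D supervision is used where datasets provide it, and the adversarial term supplies a motion prior everywhere else.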