25 Apr 2017 | Tomas Simon, Hanbyul Joo, Iain Matthews, Yaser Sheikh
This paper presents a method for hand keypoint detection in single images using multiview bootstrapping. The approach trains a keypoint detector using multiple camera views to improve detection accuracy, especially for occluded hand joints. The process starts with an initial detector that produces noisy labels in multiple views. These noisy detections are then either triangulated in 3D using multiview geometry or marked as outliers. The reprojected triangulations serve as new labeled training data to improve the detector, and the process repeats, generating more labeled data in each iteration. The resulting hand keypoint detector for single images runs in real time on RGB images with accuracy comparable to methods that use depth sensors. Triangulated over multiple views, the single-view detector enables 3D markerless hand motion capture with complex object interactions.
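The detect-triangulate-reproject-retrain cycle described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names (`triangulate_point`, `bootstrap_labels`) and the reprojection threshold are hypothetical, a single keypoint is triangulated with plain linear (DLT) triangulation, and the paper's robust RANSAC-based triangulation, confidence-weighted view scoring, and frame selection are omitted.

```python
import numpy as np

def triangulate_point(projections, detections):
    """Linear (DLT) triangulation of one keypoint from multiple views.

    projections: list of 3x4 camera projection matrices.
    detections:  list of (x, y) pixel detections, one per view.
    """
    rows = []
    for P, (x, y) in zip(projections, detections):
        # Each view contributes two linear constraints on the
        # homogeneous 3D point X: x * (P3 . X) = P1 . X, etc.
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.asarray(rows)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                       # null vector of A
    return X[:3] / X[3]              # inhomogeneous 3D point

def reproject(P, X):
    """Project a 3D point X back to pixel coordinates with camera P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def bootstrap_labels(projections, detect, images, reproj_thresh=5.0):
    """One bootstrapping pass for a single keypoint: run the current
    detector in every view, triangulate, and relabel each view with the
    reprojection, flagging views whose detection disagrees as outliers."""
    detections = [detect(im) for im in images]       # noisy 2D labels
    X = triangulate_point(projections, detections)   # 3D estimate
    new_labels = []
    for P, d in zip(projections, detections):
        r = reproject(P, X)
        inlier = np.linalg.norm(r - np.asarray(d)) < reproj_thresh
        new_labels.append((r, inlier))               # reprojected label
    return X, new_labels
```

In the full method this relabeling is applied to all keypoints across many frames, and the reprojected labels are fed back as training data for the next, improved detector; iterating the loop is what turns a weak initial detector into an accurate one.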
The paper also discusses related work in hand pose estimation, highlighting the challenges of generating annotated datasets for hand keypoints. It introduces the concept of multiview bootstrapping, which uses geometric constraints from multiple views to generate labeled data for training. The method is evaluated on publicly available datasets, showing improvements in performance compared to depth-based methods. The results demonstrate that multiview bootstrapping can produce hand keypoint detectors that rival the performance of RGB-D detectors and enable markerless 3D hand motion capture in challenging scenarios. The paper concludes with a discussion on the potential of multiview bootstrapping for building rich training sets and improving weakly supervised learning.