Julieta Martinez, Rayat Hossain, Javier Romero, and James J. Little (4 Aug 2017)
A simple yet effective baseline for 3D human pose estimation is presented: a deep feedforward network that predicts 3D joint positions directly from 2D joint locations. The system outperforms existing methods by about 30% on the Human3.6M dataset, suggesting that lifting 2D joint positions into 3D space is a relatively simple task. When trained on the output of an off-the-shelf 2D detector, the network surpasses end-to-end systems trained on raw pixels. It is also fast: a forward pass takes around 3 ms on a batch of size 64, allowing batched processing at roughly 300 fps.

The network is designed to be simple, efficient, and easy to reproduce, and it shows that a well-designed network can perform competitively at 2D-to-3D keypoint regression. Evaluated on the Human3.6M and HumanEva datasets, it achieves high accuracy, outperforms previous methods in several cases, and remains robust to detector noise on both synthetic and real-world data. These results suggest that a large portion of the error in modern 3D pose estimation systems stems from their visual analysis, and that improving 2D human pose estimation is a promising path toward better 3D results. The work provides a high-performance, lightweight baseline that sets a new standard for future research in 3D human pose estimation.
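To make the 2D-to-3D lifting idea concrete, below is a minimal sketch of such a feedforward lifting network in PyTorch. The joint count (16), hidden width (1024), dropout rate, and number of residual blocks are assumptions chosen to roughly match the kind of architecture the paper describes, not values stated in this summary; the class names `ResidualBlock` and `Lifter2Dto3D` are hypothetical.

```python
# Minimal sketch of a 2D-to-3D pose "lifting" network (PyTorch).
# Assumptions (not taken from this summary): 16 joints, hidden width 1024,
# batch norm + ReLU + dropout inside two residual blocks.
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Two linear layers with batch norm, ReLU, and dropout, plus a skip connection."""

    def __init__(self, width: int = 1024, dropout: float = 0.5):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(width, width), nn.BatchNorm1d(width), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(width, width), nn.BatchNorm1d(width), nn.ReLU(), nn.Dropout(dropout),
        )

    def forward(self, x):
        return x + self.block(x)


class Lifter2Dto3D(nn.Module):
    """Maps flattened 2D joint coordinates (n_joints * 2) to 3D ones (n_joints * 3)."""

    def __init__(self, n_joints: int = 16, width: int = 1024, n_blocks: int = 2):
        super().__init__()
        self.inp = nn.Linear(n_joints * 2, width)
        self.blocks = nn.Sequential(*[ResidualBlock(width) for _ in range(n_blocks)])
        self.out = nn.Linear(width, n_joints * 3)

    def forward(self, x):
        return self.out(self.blocks(self.inp(x)))


if __name__ == "__main__":
    model = Lifter2Dto3D().eval()
    pose_2d = torch.randn(64, 16 * 2)   # a batch of 64 detected 2D poses
    with torch.no_grad():
        pose_3d = model(pose_2d)         # predicted 3D joint positions
    print(pose_3d.shape)                 # torch.Size([64, 48])
```

Under these assumptions, the residual connections and batch normalization are what allow such a small fully connected network to train stably on noisy 2D detections, which is consistent with the robustness to detector noise noted above.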