4 Mar 2024 | Zhishan Zhou*, Shihao Zhou*, Zhi Lv, Minqiang Zou, Yao Tang, Jiajun Liang†
This paper presents a simple yet effective baseline for efficient hand mesh reconstruction that outperforms state-of-the-art (SOTA) methods while running in real time. The authors decompose the mesh decoder into two components: the token generator and the mesh regressor. Through extensive ablation experiments, they find that the token generator should select discriminative and representative points, while the mesh regressor should upsample sparse keypoints into a dense mesh over multiple stages. The proposed method achieves SOTA results on multiple datasets with minimal computational resources. On the FreiHAND dataset, it records a PA-MPJPE of 5.8 mm and a PA-MPVPE of 6.1 mm; on the DexYCB dataset, it achieves a PA-MPJPE and a PA-MPVPE of 5.5 mm each. The method is also efficient, reaching up to 33 frames per second (fps) with HRNet and up to 70 fps with FastViT-MA36. The paper's contributions are abstracting existing methods into token generator and mesh regressor modules, revealing their core structures, and developing a streamlined, real-time hand mesh regression module.
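To make the two-module decomposition concrete, the following is a minimal sketch of a token generator followed by a multi-stage mesh regressor. It is an illustration under stated assumptions, not the authors' implementation: the function names, the nearest-cell feature lookup, the intermediate stage sizes (98 and 389), and the random weights standing in for learned ones are all hypothetical. Only the 21 hand keypoints and the 778 vertices of the MANO hand mesh are standard values.

```python
import random

NUM_JOINTS = 21   # standard 21-keypoint hand skeleton
NUM_VERTS = 778   # vertex count of the MANO hand mesh
# Hypothetical upsampling schedule, chosen only for illustration.
STAGES = [NUM_JOINTS, 98, 389, NUM_VERTS]


def sample_tokens(feature_map, keypoints_uv):
    """Token generator sketch: take one feature vector per 2D keypoint.

    feature_map: H x W x C nested lists; keypoints_uv: (u, v) pairs in [0, 1].
    Nearest-cell lookup is a simplification chosen for brevity.
    """
    height, width = len(feature_map), len(feature_map[0])
    tokens = []
    for u, v in keypoints_uv:
        col = min(int(u * width), width - 1)
        row = min(int(v * height), height - 1)
        tokens.append(list(feature_map[row][col]))
    return tokens


def random_matrix(rows, cols, rng):
    """Stand-in for a learned weight matrix (random here, trained in practice)."""
    return [[rng.uniform(-1.0, 1.0) / cols for _ in range(cols)]
            for _ in range(rows)]


def matmul(a, b):
    """Plain (n x k) @ (k x m) matrix product on nested lists."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a]


def build_decoder(channels, seed=0):
    """Create one upsampling matrix per stage plus a shared 3D output head."""
    rng = random.Random(seed)
    upsamplers = [random_matrix(n_out, n_in, rng)
                  for n_in, n_out in zip(STAGES, STAGES[1:])]
    head = random_matrix(channels, 3, rng)
    return upsamplers, head


def regress_mesh(tokens, upsamplers, head):
    """Mesh regressor sketch: grow the token count stage by stage, then map
    each final token to an (x, y, z) vertex with a shared linear head."""
    for weights in upsamplers:            # weights: n_out x n_in
        tokens = matmul(weights, tokens)  # (n_out x n_in) @ (n_in x C)
    return matmul(tokens, head)           # (778 x C) @ (C x 3)


# Usage: a toy 8 x 8 feature map with 4 channels and 21 random keypoints.
rng = random.Random(1)
feature_map = [[[rng.random() for _ in range(4)] for _ in range(8)]
               for _ in range(8)]
keypoints = [(rng.random(), rng.random()) for _ in range(NUM_JOINTS)]
upsamplers, head = build_decoder(channels=4)
mesh = regress_mesh(sample_tokens(feature_map, keypoints), upsamplers, head)
print(len(mesh), len(mesh[0]))  # 778 3
```

The sketch keeps the two responsibilities separate: `sample_tokens` decides which features represent the hand, and `regress_mesh` turns those sparse tokens into a dense vertex set in several steps rather than one. In a trained model the matrices would be learned, and the backbone (for example HRNet or FastViT-MA36) would supply the feature map and keypoint estimates.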