Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot

24 Jul 2024 | Fabien Baradel*, Matthieu Armando, Salma Galaoui, Romain Brégier, Philippe Weinzaepfel, Grégory Rogez, and Thomas Lucas*
Multi-HMR is a single-shot model for multi-person whole-body human mesh recovery from a single RGB image. For every person in the scene, it predicts a whole-body mesh, including hands and facial expression, parameterized by the SMPL-X model, together with the person's 3D location in the camera coordinate system.

The architecture is conceptually simple. A Vision Transformer (ViT) backbone encodes the image into patch tokens, from which the model detects people by predicting a coarse 2D heatmap of person locations. For each detected person, a newly introduced cross-attention module, the Human Prediction Head (HPH), then regresses whole-body pose, shape, and 3D location. When camera intrinsics are available, Multi-HMR can optionally exploit them by encoding the camera ray direction associated with each image token.

To improve predictions, particularly for hands, the authors introduce the CUFFS dataset, which contains humans close to the camera with diverse hand poses.

Because the approach is single-shot, inference is fast enough for real-time processing: a ViT-S backbone operating on 448x448 images already yields a fast and competitive model, while larger backbones and higher resolutions achieve state-of-the-art results on both whole-body and body-only HMR benchmarks. Across these benchmarks, Multi-HMR produces accurate 3D meshes and 3D positions in the scene, outperforming prior methods on each sub-problem, and it adapts to camera information when available, making it versatile for various applications.
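Since the summary describes the pipeline in prose only, the following minimal PyTorch sketch may help make the dataflow concrete. Everything below is an assumption-laden illustration, not the paper's implementation: the names token_ray_directions, HumanPredictionHead, and MultiHMRSketch, the layer sizes, the SMPL-X parameter dimension, and the detection threshold are all hypothetical choices made for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def token_ray_directions(K, n_tokens_h, n_tokens_w, patch_size=14):
    """Unit ray direction per image token under a pinhole camera model (assumption).

    K is the 3x3 intrinsics matrix. One ray is computed at the center of each
    patch; these directions can then be embedded (e.g. with a Fourier encoding)
    and added to the ViT tokens. The exact encoding used by the paper is not
    reproduced here.
    """
    ys, xs = torch.meshgrid(
        (torch.arange(n_tokens_h, dtype=torch.float32) + 0.5) * patch_size,
        (torch.arange(n_tokens_w, dtype=torch.float32) + 0.5) * patch_size,
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)  # homogeneous pixel coords
    rays = pix.reshape(-1, 3) @ torch.linalg.inv(K).T         # back-project through K^-1
    return F.normalize(rays, dim=-1)                          # (n_tokens_h * n_tokens_w, 3)


class HumanPredictionHead(nn.Module):
    """Sketch of the cross-attention head (HPH): one query per detected person
    attends to all image tokens, then regresses SMPL-X parameters and a 3D
    location. Layer sizes and the parameter layout are assumptions."""

    def __init__(self, dim=384, num_heads=8, smplx_dim=169):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.smplx_head = nn.Linear(dim, smplx_dim)  # flattened pose/shape/expression (placeholder size)
        self.loc_head = nn.Linear(dim, 3)            # 3D location in camera coordinates

    def forward(self, person_tokens, image_tokens):
        # person_tokens: (1, P, dim); image_tokens: (1, N, dim)
        x, _ = self.cross_attn(person_tokens, image_tokens, image_tokens)
        return self.smplx_head(x), self.loc_head(x)


class MultiHMRSketch(nn.Module):
    """End-to-end single-shot sketch: patch tokens -> coarse person heatmap
    -> per-person cross-attention regression. The conv stem stands in for a
    real ViT backbone; the detection threshold is an arbitrary choice."""

    def __init__(self, dim=384, patch_size=14):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.heatmap_head = nn.Linear(dim, 1)  # per-token person-presence logit
        self.hph = HumanPredictionHead(dim)

    def forward(self, image, ray_embed=None, thresh=0.5):
        # image: (1, 3, H, W); a single image for simplicity.
        tokens = self.patchify(image).flatten(2).transpose(1, 2)  # (1, N, dim)
        if ray_embed is not None:
            # ray_embed: (1, N, dim), e.g. ray directions projected to the token width.
            tokens = tokens + ray_embed
        scores = self.heatmap_head(tokens).squeeze(-1).sigmoid()  # coarse 2D heatmap, (1, N)
        person_tokens = tokens[scores > thresh].unsqueeze(0)      # queries at detected locations
        smplx_params, locations = self.hph(person_tokens, tokens)
        return scores, smplx_params, locations
```

In a full system, detections would more plausibly be taken as local maxima of the heatmap rather than a plain threshold, and the SMPL-X outputs would be decoded into a mesh by the SMPL-X body model; both steps are simplified away in this sketch.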