24 Jul 2024 | Fabien Baradel*, Matthieu Armando, Salma Galaaoui, Romain Brégier, Philippe Weinzaepfel, Grégory Rogez, and Thomas Lucas*
Multi-HMR is a single-shot model designed to recover whole-body 3D human meshes from a single RGB image, focusing on four key aspects: capturing expressive body poses (including hands and facial expressions), efficiently processing images with variable numbers of people, estimating each person's location in camera space, and adapting to camera information when available. The model uses a Vision Transformer (ViT) backbone to extract features from the input image, followed by a Human Perception Head (HPH) that employs cross-attention to predict pose, shape parameters, and 3D location for each detected person. The HPH lets the model attend to all image patches for each detected person, enabling efficient and accurate predictions, and Multi-HMR can optionally incorporate camera intrinsics to further improve performance.

The model is trained on a combination of real-world and synthetic datasets, including the CUFFS dataset, which contains synthetic renderings of people with diverse hand poses. Evaluations on various benchmarks show that Multi-HMR outperforms existing methods in both body-only and whole-body mesh recovery, achieving competitive or superior results even at lower resolutions and with smaller backbones. The model is also efficient, achieving real-time inference on an NVIDIA V100 GPU.
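To make the cross-attention idea concrete, here is a minimal NumPy sketch of a single-head attention step in the spirit of the HPH: one query token per detected person attends over all ViT patch features to pool a per-person representation. This is an illustrative simplification, not Multi-HMR's actual head; the array shapes, the single head, and the absence of learned query/key/value projections and of the SMPL-X parameter regressors are all assumptions made for brevity.

```python
import numpy as np

def cross_attention(queries, patches):
    """Single-head cross-attention: each person query attends over all patches.

    queries: (P, d) array, one token per detected person (hypothetical).
    patches: (N, d) array of ViT patch features.
    Returns: (P, d) per-person features pooled from the whole image.
    """
    d = queries.shape[1]
    # Scaled dot-product scores between person queries and image patches.
    scores = queries @ patches.T / np.sqrt(d)            # (P, N)
    # Numerically stable softmax over the patch dimension.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)        # rows sum to 1
    # Weighted sum of patch features per person.
    return weights @ patches                             # (P, d)

rng = np.random.default_rng(0)
d = 16
patches = rng.normal(size=(196, d))   # e.g. a 14x14 patch grid from the ViT
queries = rng.normal(size=(3, d))     # 3 detected people in the image
person_feats = cross_attention(queries, patches)
print(person_feats.shape)             # one pooled feature vector per person
```

In the real model, the pooled per-person features would then feed regressors for pose, shape, and 3D location; attending over all patches is what lets a single forward pass handle a variable number of people.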