2024-04-05 | George Retsinas, Panagiotis P. Filntisis, Radek Dančík, Victoria F. Abrevaya, Anastasios Roussos, Timo Bolkart, Petros Maragos
SMIRK is a novel method for 3D facial expression reconstruction from monocular images. It improves upon existing methods by replacing differentiable rendering with a neural rendering module that generates face images from the predicted mesh geometry and sparsely sampled input pixels. This allows for more accurate geometry supervision and enables generating images with varying expressions during training. The neural renderer is U-Net-based, and training is supervised with a combination of photometric, VGG, landmark, and emotion losses.

SMIRK also introduces a cycle-based expression consistency loss: an augmented expression cycle path modifies the predicted expression, renders a new image, and enforces consistency between the modified expression and the one reconstructed from that image. This augmentation increases training-data diversity and improves generalization to diverse expressions.

SMIRK is trained on datasets such as FFHQ, CelebA, LRS3, and MEAD, and is compared with recent state-of-the-art methods such as DECA, EMOCA v2, Deep3DFace, and FOCUS. It achieves better reconstruction and perceptual scores than these methods, as demonstrated by qualitative, quantitative, and perceptual evaluations, and it captures a wide range of facial expressions, including challenging cases such as asymmetric and subtle expressions. SMIRK also outperforms existing methods in user studies, generates realistic images under notable expression manipulation, and is effective in handling occlusions, with potential applications in 3D facial animation and video editing. The research was supported by the European Union-NextGenerationEU.
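The supervision described above combines several loss terms into one training objective. A minimal sketch of that weighted combination is below; the weights and the simple L1/L2 bodies are illustrative placeholders, not SMIRK's actual values or implementations (the real VGG and emotion terms come from pretrained networks):

```python
import numpy as np

def photometric_loss(rendered, target):
    # Pixel-wise L1 distance between the neurally rendered face and the input.
    return np.abs(rendered - target).mean()

def landmark_loss(pred_lmk, gt_lmk):
    # Mean squared distance between predicted and detected 2D landmarks.
    return np.square(pred_lmk - gt_lmk).sum(axis=-1).mean()

def total_loss(rendered, target, pred_lmk, gt_lmk, vgg_term, emotion_term,
               w_photo=1.0, w_lmk=1.0, w_vgg=0.1, w_emo=0.1):
    # Hypothetical weights w_*; vgg_term and emotion_term would be computed
    # by pretrained perceptual and emotion-recognition networks.
    return (w_photo * photometric_loss(rendered, target)
            + w_lmk * landmark_loss(pred_lmk, gt_lmk)
            + w_vgg * vgg_term
            + w_emo * emotion_term)
```

In practice each term is differentiable so gradients flow back through the renderer to the encoder's expression, pose, and shape predictions.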
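The augmented expression cycle path can be sketched as follows. This is an assumption-laden illustration, not SMIRK's code: `perturb_expression` stands in for the paper's expression augmentations, and `encoder`/`renderer` are opaque callables for the expression encoder and the neural rendering module:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_expression(expr, scale=0.5):
    # Modify the predicted expression; a simple additive jitter stands in
    # for the paper's augmentation strategies (hypothetical choice).
    return expr + scale * rng.standard_normal(expr.shape)

def cycle_consistency_loss(encoder, renderer, image):
    expr = encoder(image)                 # 1. predict expression from input
    expr_aug = perturb_expression(expr)   # 2. modify the predicted expression
    generated = renderer(expr_aug)        # 3. neural-render a new face image
    expr_rec = encoder(generated)         # 4. re-encode the generated image
    # 5. the re-encoded expression should match the augmented one
    return np.square(expr_rec - expr_aug).mean()
```

Because the renderer can synthesize faces with expressions never seen in the input, this cycle effectively enlarges the training distribution and pushes the encoder toward consistent expression estimates.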