29 Jul 2020 | Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, Matthias Nießner
Face2Face: Real-time Face Capture and Reenactment of RGB Videos
This paper presents a novel approach for real-time facial reenactment of a monocular target video sequence. The source sequence is also a monocular video stream, captured live with a commodity webcam. The goal is to animate the facial expressions of the target video by a source actor and re-render the manipulated output video in a photo-realistic fashion. The method involves facial identity recovery from monocular video using non-rigid model-based bundling, facial expression tracking using a dense photometric consistency measure, and fast deformation transfer between source and target. The mouth interior that best matches the re-targeted expression is retrieved from the target sequence and warped to produce an accurate fit. The synthesized target face is then re-rendered on top of the corresponding video stream to seamlessly blend with the real-world illumination.
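To make the flow of these stages concrete, the following is a minimal per-frame sketch in Python. The stage functions are passed in as callables and are hypothetical placeholders for the paper's components (dense expression tracking, deformation transfer, mouth retrieval, and re-rendering), not the authors' implementation.

    # Hypothetical per-frame reenactment loop; the stage callables are
    # placeholders for the paper's components, not the authors' code.
    def reenact_frame(source_frame, target_frame, source_params, target_params,
                      track_expression, transfer_expression,
                      retrieve_mouth, render_composite):
        # 1. Track the source actor's expression coefficients in the live frame,
        #    keeping identity, albedo, and lighting fixed (estimated beforehand).
        delta_source = track_expression(source_frame, source_params)
        # 2. Map the tracked expression into the target actor's expression space.
        delta_target = transfer_expression(delta_source, source_params, target_params)
        # 3. Retrieve the target mouth interior that best fits the new expression.
        mouth_frame = retrieve_mouth(delta_target, target_params)
        # 4. Re-render the modified target face under the target's estimated
        #    illumination and composite it over the original target frame.
        return render_composite(target_frame, target_params, delta_target, mouth_frame)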
The method is demonstrated in a live setup where YouTube videos are reenacted in real time. The approach is compared against state-of-the-art reenactment methods, which it outperforms in terms of video quality and runtime. The key contributions include: (1) dense, global non-rigid model-based bundling; (2) accurate tracking, appearance, and lighting estimation in unconstrained live RGB video; (3) person-dependent expression transfer using subspace deformations; and (4) a novel mouth synthesis approach.
The method uses a multi-linear PCA model to parametrize a face, with the first two dimensions representing facial identity (geometric shape and skin reflectance) and the third controlling the facial expression. The synthesis of facial imagery is based on a statistical prior that assumes a multivariate normal probability distribution of shape and reflectance around the average face. The energy formulation combines data and prior terms: the data term measures dense photo-consistency and sparse facial-landmark alignment, while the prior term enforces statistical regularization of the model parameters.
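A minimal sketch of this parametrization and the statistical prior is given below, with random placeholder bases and unit standard deviations standing in for the actual statistical face model:

    import numpy as np

    # Placeholder dimensions and random bases; the actual model is learned
    # from face scans and uses much larger bases.
    n_vertices, n_id, n_exp = 1000, 80, 76
    rng = np.random.default_rng(0)
    mean_shape  = rng.standard_normal(3 * n_vertices)
    mean_albedo = rng.standard_normal(3 * n_vertices)
    E_id  = rng.standard_normal((3 * n_vertices, n_id))   # identity (shape) basis
    E_alb = rng.standard_normal((3 * n_vertices, n_id))   # identity (albedo) basis
    E_exp = rng.standard_normal((3 * n_vertices, n_exp))  # expression basis
    sigma_id, sigma_alb, sigma_exp = np.ones(n_id), np.ones(n_id), np.ones(n_exp)

    def synthesize_face(alpha, beta, delta):
        """Linear model: geometry from identity and expression, albedo from identity."""
        geometry = mean_shape + E_id @ alpha + E_exp @ delta
        albedo = mean_albedo + E_alb @ beta
        return geometry, albedo

    def prior_energy(alpha, beta, delta):
        """Statistical regularizer: squared coefficients normalized by the
        per-dimension standard deviations of the assumed normal distribution."""
        return (np.sum((alpha / sigma_id) ** 2)
                + np.sum((beta / sigma_alb) ** 2)
                + np.sum((delta / sigma_exp) ** 2))

    # Example: synthesize a face for small random coefficients.
    geometry, albedo = synthesize_face(0.1 * rng.standard_normal(n_id),
                                       0.1 * rng.standard_normal(n_id),
                                       0.1 * rng.standard_normal(n_exp))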
A data-parallel optimization strategy is used to minimize the objective function in real time. To estimate the actor's identity despite the ambiguities of monocular reconstruction, the method uses non-rigid model-based bundling, which jointly solves for identity over multiple keyframes of the input sequence. Expression transfer is achieved through subspace deformation transfer, which runs at real-time rates. Mouth retrieval uses an appearance graph to find the best-matching mouth frame from the target sequence, with a similarity metric based on geometric and photometric features.
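A simplified sketch of expression transfer in the blendshape subspace is shown below: the source expression displacement is matched in a least-squares sense by target expression coefficients, and the pseudo-inverse of the target basis is precomputed once so that each per-frame transfer reduces to a small matrix product. This sketch matches raw vertex displacements for illustration; the paper's deformation transfer operates on per-triangle deformations rather than displacements, but the reduction to a precomputable linear solve is the same idea.

    import numpy as np

    def precompute_transfer(E_exp_target):
        """Precompute the pseudo-inverse of the target expression basis once;
        the per-frame transfer then costs a single small matrix-vector product."""
        return np.linalg.pinv(E_exp_target)

    def transfer_expression(delta_source, E_exp_source, E_exp_target_pinv):
        """Find target expression coefficients whose vertex displacements best
        match the source displacements (least squares in the expression subspace)."""
        source_displacement = E_exp_source @ delta_source
        return E_exp_target_pinv @ source_displacement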
The method is evaluated on a variety of target YouTube videos, demonstrating highly realistic reenactment results. The approach is compared to other reenactment methods, showing similar or better quality in terms of visual results and runtime. The method is able to preserve the identity of the target actor while altering the expression with respect to the source actor, leading to more plausible results. The method is also evaluated using cross-validation, showing a small mean photometric error in the self-reenactment of an actor.
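For reference, the mean photometric error reported in such a self-reenactment cross-validation can be computed as the average per-pixel color distance between synthesized and ground-truth frames. The sketch below is a generic version of this measurement, not the authors' exact evaluation code.

    import numpy as np

    def mean_photometric_error(synth_frames, real_frames):
        """Average per-pixel RGB distance between synthesized and ground-truth
        frames, both given as arrays of shape [num_frames, height, width, 3]."""
        diff = np.linalg.norm(synth_frames.astype(np.float64)
                              - real_frames.astype(np.float64), axis=-1)
        return float(diff.mean())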