Understanding V-Express: Conditional Dropout for Progressive Training of Portrait Video Generation

V-Express is a portrait video generation method that balances control signals of different strengths through progressive training and conditional dropout. It lets weaker conditions, such as audio, exert effective control while preserving the influence of stronger signals such as facial pose and the reference image. Frames are generated with a Latent Diffusion Model (LDM) that incorporates three modules, ReferenceNet, V-Kps Guider, and Audio Projection, to handle the control inputs. Together, progressive training and conditional dropout mitigate the dominance of the stronger signals so that weaker conditions, particularly audio, have a more pronounced influence. The approach improves the overall quality of the generated videos as well as synchronization and control.

Training proceeds in three stages: single-frame generation, multi-frame generation, and global fine-tuning. The first stage trains only ReferenceNet, V-Kps Guider, and the denoising U-Net. The second stage trains only the Audio Projection, Audio Attention Layers, and Motion Attention Layers. The third stage updates all parameters. To balance the control signals, conditional dropout disrupts a shortcut pattern in which the model directly copies the V-Kps-affected reference image to the generated frames.

The method is evaluated on two public datasets, TalkingHead1KH and AVSpeech. V-Express excels in video quality and in alignment with the other control signals, although it does not achieve the best lip synchronization. The results confirm that it can generate portrait videos controlled by audio and V-Kps. They also show that changing the weight of the cross-attention hidden states varies the strength of the corresponding control signal: a larger audio attention weight produces more pronounced mouth movements, and a smaller reference attention weight reduces the influence of the reference image.

Potential future improvements include multilingual support, a lower computational burden, and explicit face attribute control. V-Express provides a solution for the simultaneous and effective use of diverse control signals, paving the way for more advanced and balanced portrait video generation systems.
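
The three-stage schedule described above can be pictured as a table of which module groups receive gradient updates in each stage. The following is a minimal sketch, not the authors' code: the module names are hypothetical identifiers, and treating the denoising U-Net, audio attention layers, and motion attention layers as disjoint groups is an assumption made for illustration. Only the grouping per stage is taken from the text.

```python
# Hypothetical module names; only the per-stage grouping comes from the summary.
ALL_MODULES = (
    "reference_net",
    "v_kps_guider",
    "denoising_unet",
    "audio_projection",
    "audio_attention_layers",
    "motion_attention_layers",
)

# Stage 1: single-frame generation; stage 2: multi-frame generation;
# stage 3: global fine-tuning (all parameters updated).
STAGE_TRAINABLE = {
    1: {"reference_net", "v_kps_guider", "denoising_unet"},
    2: {"audio_projection", "audio_attention_layers", "motion_attention_layers"},
    3: set(ALL_MODULES),
}


def trainable_flags(stage: int) -> dict:
    """Map every module name to True if it is trained in `stage`, else False."""
    if stage not in STAGE_TRAINABLE:
        raise ValueError(f"unknown stage {stage!r}; expected 1, 2, or 3")
    return {name: name in STAGE_TRAINABLE[stage] for name in ALL_MODULES}


if __name__ == "__main__":
    for stage in (1, 2, 3):
        trained = [name for name, on in trainable_flags(stage).items() if on]
        print(f"stage {stage}: {', '.join(trained)}")
```

In a PyTorch training loop, flags like these could be applied by calling `requires_grad_(flag)` on each submodule before building the optimizer for that stage.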

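The conditional dropout idea from the summary above can be illustrated with a small sketch: during training, a strong condition is occasionally blanked out so the model cannot rely on copying it and must use the remaining signals, such as audio. This is an illustrative assumption, not the paper's implementation. The function name, the per-signal probabilities, and the choice to replace a dropped signal with zeros are all hypothetical.

```python
import random


def conditional_dropout(conditions, drop_probs, rng):
    """Return a copy of `conditions` with each signal zeroed at its own probability.

    conditions: dict mapping a signal name to a list of floats (a stand-in
                for a feature tensor).
    drop_probs: dict mapping a signal name to a drop probability in [0, 1];
                signals not listed are never dropped.
    rng:        a random.Random instance, so runs can be made reproducible.
    """
    result = {}
    for name, values in conditions.items():
        p = drop_probs.get(name, 0.0)
        if rng.random() < p:
            # Dropped: replace the signal with zeros of the same length.
            result[name] = [0.0] * len(values)
        else:
            # Kept: copy so the caller's data is never mutated.
            result[name] = list(values)
    return result


if __name__ == "__main__":
    rng = random.Random(0)
    batch = {
        "reference": [0.3, 0.7, 0.1],
        "v_kps": [1.0, 0.0, 1.0],
        "audio": [0.2, 0.5, 0.9],
    }
    # Hypothetical probabilities: drop only the strong signals, never audio.
    probs = {"reference": 0.5, "v_kps": 0.5}
    print(conditional_dropout(batch, probs, rng))
```

Dropping only the stronger signals in this sketch mirrors the stated goal of letting audio have a more pronounced influence; the actual probabilities and which signals are dropped would need to be taken from the paper.
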
V-Express: Conditional Dropout for Progressive Training of Portrait Video Generation

2024 | Cong Wang, Kuan Tian, Jun Zhang, Yonghang Guan, Feng Luo, Fei Shen, Zhiwei Jiang, Qing Gu