UniAnimate is a novel framework designed to address two limitations of existing diffusion-based human image animation techniques: the need for an additional reference model to align the identity image with the main video branch, which increases optimization complexity and model parameters, and the restriction to short videos (typically 24 frames), which hinders practical application. To overcome these issues, UniAnimate introduces a unified video diffusion model that maps the reference image, pose guidance, and noised video into a common feature space, reducing optimization complexity and ensuring temporal coherence. It further proposes a unified noise input that supports both randomly noised input and first-frame-conditioned input, enhancing the ability to generate long-term videos. To improve efficiency, UniAnimate also employs an alternative temporal modeling architecture based on a state space model (Mamba) in place of the computationally expensive temporal Transformer. Extensive experiments on the TikTok and Fashion datasets demonstrate that UniAnimate achieves superior performance in both quantitative and qualitative evaluations, generating highly consistent one-minute videos with smooth transitions. The framework is publicly available.
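To illustrate the unified noise input idea described above, the sketch below shows how a single denoising branch can accept either pure random noise or noise whose first frame is replaced by a known latent, so long videos can be produced segment by segment. This is a minimal PyTorch sketch under assumed conventions (latent tensor shapes, an extra mask channel, and the helper name `build_unified_noise_input` are illustrative, not the released UniAnimate code).

```python
# Minimal sketch of a "unified noise input": the same video branch consumes
# either randomly noised input or first-frame-conditioned input.
# Shapes, the mask channel, and names are assumptions for illustration only.
from typing import Optional

import torch


def build_unified_noise_input(
    noise: torch.Tensor,                              # (B, C, T, H, W) latent-space Gaussian noise
    first_frame_latent: Optional[torch.Tensor] = None,  # (B, C, H, W) clean latent of the conditioning frame
) -> torch.Tensor:
    """Return the video-branch input with a per-frame condition mask channel appended."""
    b, c, t, h, w = noise.shape
    mask = torch.zeros(b, 1, t, h, w, dtype=noise.dtype, device=noise.device)
    video = noise.clone()
    if first_frame_latent is not None:
        # First-frame-conditioned input: keep the clean latent at frame 0 and
        # flag that frame in the mask so the model knows it is given.
        video[:, :, 0] = first_frame_latent
        mask[:, :, 0] = 1.0
    # Concatenate the mask as an extra channel; the diffusion backbone is
    # assumed to accept C + 1 input channels.
    return torch.cat([video, mask], dim=1)


# Usage: generate a long video segment by segment, conditioning each new
# segment on the last frame of the previous one.
noise = torch.randn(1, 4, 16, 32, 32)
segment_0_input = build_unified_noise_input(noise)                   # randomly noised input
prev_last_frame = torch.randn(1, 4, 32, 32)                          # stand-in for a previously generated latent
segment_1_input = build_unified_noise_input(noise, prev_last_frame)  # first-frame-conditioned input
```

In this reading, switching between the two input modes requires no architectural change, which is what allows one model to handle both the initial segment and all continuation segments of a long video.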