UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation

3 Jun 2024 | Xiang Wang, Shiwei Zhang, Changxin Gao, Jiayu Wang, Xiaoqiang Zhou, Yingya Zhang, Luxin Yan & Nong Sang
UniAnimate is a framework for generating high-quality, temporally consistent human image animations. It addresses two key limitations of existing diffusion-based approaches: the need for an extra reference network to align the identity image with the main video branch, and the short duration of generated videos, typically around 24 frames.

To overcome these issues, UniAnimate employs a unified video diffusion model that maps the reference image and pose guidance into a shared feature space, enabling efficient appearance alignment and long-term video generation. It also introduces a unified noise input that supports both random noise and first-frame conditioned inputs, which allows video segments to be chained into longer clips with smooth transitions. In addition, UniAnimate replaces the original temporal Transformer with a temporal Mamba, significantly improving efficiency and allowing longer sequences to be processed.

Experiments on the TikTok and Fashion datasets show that UniAnimate outperforms existing state-of-the-art methods in both quantitative and qualitative evaluations, with superior visual quality, structural preservation, and temporal consistency, and that it can produce high-fidelity one-minute videos through iterative first-frame conditioning. Human evaluations further confirm favorable visual aesthetics and controllability. Together, the unified video diffusion model and the temporal Mamba architecture enable effective appearance alignment and motion modeling, making UniAnimate a promising solution for long-term human image animation.
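To make the iterative first-frame conditioning concrete, below is a minimal, hypothetical Python sketch of how segments might be chained: each segment's last generated frame is fed back as the first-frame condition of the next segment. The `model` callable, argument names, and latent shapes are assumptions for illustration, not the authors' actual API.

```python
import torch


def generate_segment(model, reference, poses, first_frame=None):
    """Generate one short video segment from a reference image and pose guidance.

    If `first_frame` is given, the unified noise input conditions the segment
    on that frame instead of starting from pure random noise, so consecutive
    segments join smoothly. (Hypothetical interface; shapes are illustrative.)
    """
    noise = torch.randn(len(poses), 4, 32, 32)  # one latent per frame
    if first_frame is not None:
        noise[0] = first_frame  # replace the first latent with the condition
    return model(reference=reference, poses=poses, noise=noise)


def generate_long_video(model, reference, pose_sequence, segment_len=32):
    """Chain segments via first-frame conditioning to reach, e.g., one minute."""
    segments, first_frame = [], None
    for start in range(0, len(pose_sequence), segment_len):
        segment = generate_segment(
            model, reference, pose_sequence[start:start + segment_len], first_frame
        )
        segments.append(segment)
        first_frame = segment[-1]  # condition the next segment on the last frame
    return torch.cat(segments, dim=0)
```

Because every segment after the first is generated conditioned on the previous segment's final frame rather than on fresh random noise, the chained output avoids abrupt appearance jumps at segment boundaries, which is what enables the smooth, long-duration results described above.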