Lumiere: A Space-Time Diffusion Model for Video Generation

5 Feb 2024 | Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, Yuanzhen Li, Michael Rubinstein, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, Inbar Mosseri
Lumiere is a text-to-video diffusion model designed to generate realistic, diverse, and coherent motion in videos. Unlike existing models that synthesize distant keyframes and then apply temporal super-resolution, Lumiere uses a Space-Time U-Net (STUNet) architecture that generates the entire duration of the video in a single pass. By incorporating both spatial and temporal down- and up-sampling, the network processes the video at multiple space-time scales and learns to generate a full-frame-rate, low-resolution video directly, building on a pre-trained text-to-image diffusion model.

Lumiere demonstrates state-of-the-art text-to-video generation results and supports a wide range of content creation tasks, including image-to-video generation, video inpainting, and stylized generation. Because the full video is produced at once, the model yields consistent motion and high visual quality, and it can be adapted to video editing applications using off-the-shelf methods.
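To make the space-time down- and up-sampling idea concrete, the following is a minimal PyTorch sketch of a block that reduces, and later restores, both the temporal and spatial resolution of a video tensor. The module names, layer choices, and shapes are illustrative assumptions for exposition, not the authors' implementation.

```python
# Illustrative sketch of space-time down/up-sampling blocks (assumed layout,
# not the paper's actual STUNet code).
import torch
import torch.nn as nn

class SpaceTimeDownBlock(nn.Module):
    """Downsamples a video tensor of shape (B, C, T, H, W) in both time and space."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Factorized convolutions: spatial (1x3x3) followed by temporal (3x1x1).
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        # Strided convolution halves both the temporal and the spatial resolution.
        self.down = nn.Conv3d(out_ch, out_ch, kernel_size=(2, 2, 2), stride=(2, 2, 2))
        self.act = nn.SiLU()

    def forward(self, x):
        x = self.act(self.spatial(x))
        x = self.act(self.temporal(x))
        return self.down(x)

class SpaceTimeUpBlock(nn.Module):
    """Mirrors the down block: doubles the temporal and spatial resolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=(2, 2, 2), stride=(2, 2, 2))
        self.refine = nn.Conv3d(out_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.act = nn.SiLU()

    def forward(self, x):
        return self.refine(self.act(self.up(x)))

if __name__ == "__main__":
    video = torch.randn(1, 64, 16, 32, 32)        # (batch, channels, frames, H, W)
    coarse = SpaceTimeDownBlock(64, 128)(video)   # -> (1, 128, 8, 16, 16)
    restored = SpaceTimeUpBlock(128, 64)(coarse)  # -> (1, 64, 16, 32, 32)
    print(coarse.shape, restored.shape)
```

Processing the majority of the video at this coarser space-time resolution is what makes generating the whole clip in a single pass tractable.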
The paper reviews related work in text-to-image and text-to-video generation, highlighting the challenges and limitations of existing approaches, and then presents the Lumiere framework in detail, including the STUNet architecture and the use of MultiDiffusion for spatial super-resolution. The model is evaluated qualitatively and quantitatively against other text-to-video models; it achieves competitive visual quality and motion coherence and is preferred by users in both text-to-video and image-to-video generation. The authors conclude that Lumiere offers a new approach to text-to-video generation with potential for a wide range of applications, that its design principles carry over to latent video diffusion models, and that it can motivate further research on text-to-video model design. They also discuss the societal impact of the technology, emphasizing the need for tools that detect biases and malicious use cases to ensure safe and fair use.
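The MultiDiffusion idea can be illustrated with a short sketch: at each denoising step, the spatial super-resolution model is applied to overlapping temporal windows, and the overlapping predictions are averaged so that segment boundaries stay consistent. The function below is a hedged illustration; `denoise_segment`, the window size, and the stride are hypothetical stand-ins rather than the paper's actual configuration.

```python
# Minimal sketch of MultiDiffusion-style blending along the temporal axis.
# `denoise_segment` is a stand-in (assumption) for one denoising step of a
# spatial super-resolution diffusion model applied to a short clip of frames.
import torch

def multidiffusion_step(video, denoise_segment, window=8, stride=4):
    """Denoise overlapping temporal windows and average them where they overlap.

    video: tensor of shape (B, C, T, H, W) at the current noise level.
    Returns a tensor of the same shape after one blended denoising step.
    """
    B, C, T, H, W = video.shape
    out = torch.zeros_like(video)
    weight = torch.zeros(1, 1, T, 1, 1, device=video.device)

    last_start = max(T - window, 0)
    starts = list(range(0, last_start + 1, stride))
    if starts[-1] != last_start:            # make sure the final frames are covered
        starts.append(last_start)

    for s in starts:
        segment = video[:, :, s:s + window]
        out[:, :, s:s + window] += denoise_segment(segment)
        weight[:, :, s:s + window] += 1.0

    return out / weight.clamp(min=1.0)      # average where windows overlap

if __name__ == "__main__":
    # Identity "denoiser" just to demonstrate the blending bookkeeping.
    x = torch.randn(1, 3, 16, 64, 64)
    y = multidiffusion_step(x, denoise_segment=lambda seg: seg)
    print(torch.allclose(x, y))             # True: overlapping averages recover the input
```

Averaging the per-window predictions in pixel space at every step is what lets a super-resolution model trained on short segments be applied to the full clip without visible seams between segments.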