Lumiere: A Space-Time Diffusion Model for Video Generation


5 Feb 2024 | Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, Yuanzhen Li, Michael Rubinstein, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, Inbar Mosseri
**Abstract:** Lumiere is a text-to-video diffusion model designed to synthesize videos with realistic, diverse, and coherent motion. It introduces a Space-Time U-Net (STUNet) architecture that generates the entire temporal duration of the video in a single pass, addressing the key challenge of global temporal consistency in video synthesis. By down- and up-sampling the signal in both space and time, and by building on a pre-trained text-to-image diffusion model, Lumiere learns to directly generate full-frame-rate, low-resolution videos. The model demonstrates state-of-the-art text-to-video generation results and supports a wide range of content creation tasks, including image-to-video, video inpainting, and stylized generation.

**Introduction:** Generative models for images have advanced rapidly, but training large-scale text-to-video (T2V) models remains challenging due to the added complexity of motion. Existing T2V models typically use cascaded designs, in which a base model generates sparse keyframes and temporal super-resolution models fill in the intermediate frames, a factorization that makes global temporal consistency difficult to achieve. Lumiere instead generates the full temporal duration of the video at once, using STUNet to downsample the signal in both space and time, which yields more coherent motion than previous methods.

**Related Work:** The paper reviews existing T2V models, highlighting their limitations and the need for a new approach. It discusses the constraints of the common cascaded design as well as the use of spatial super-resolution (SSR) models, which can introduce artifacts and limit the ability to generate coherent full-length clips.

**Lumiere:** Lumiere uses diffusion probabilistic models to approximate the data distribution over videos. The framework consists of a base model and a spatial super-resolution (SSR) model: the base model generates full clips at a coarse spatial resolution, which are then upscaled by a temporally-aware SSR model. The STUNet architecture interleaves temporal blocks with the spatial resizing modules of the pre-trained text-to-image U-Net, allowing the bulk of the computation to operate on a compact space-time representation of the video (an illustrative sketch of this idea is given after this summary).

**Applications:** Lumiere is adapted to various downstream applications, including stylized generation, image-to-video, video inpainting, and cinemagraphs, and produces high-quality, coherent videos compared to existing methods.

**Evaluation and Comparisons:** Lumiere is trained on a dataset of 30M videos paired with text captions. It achieves competitive FVD and IS scores and is preferred by users in both text-to-video and image-to-video generation, outperforming the baselines in visual quality and motion magnitude.

**Conclusion:** Lumiere presents a text-to-video generation framework built on a pre-trained text-to-image diffusion model. Its Space-Time U-Net addresses the challenge of learning globally coherent motion by generating the full clip in a single pass, and the model achieves state-of-the-art results while remaining versatile across a wide range of applications.
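To make the space-time down-sampling idea in the **Lumiere** section concrete, below is a minimal PyTorch sketch of one STUNet-style encoder stage: a per-frame spatial convolution (standing in for a layer inherited from a text-to-image backbone) followed by a temporal convolution and a pooling step that halves the clip in both time and space. Module names, channel counts, and shapes are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of the core STUNet idea: a U-Net stage that downsamples
# the video in BOTH space and time, so the coarsest level processes a compact
# space-time representation of the full clip. Shapes and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpaceTimeDownBlock(nn.Module):
    """Per-frame spatial conv followed by a temporal conv, then space-time pooling."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Spatial convolution applied per frame (kernel 1x3x3 over T, H, W).
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # Temporal convolution mixing information across frames (kernel 3x1x1).
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        x = F.silu(self.spatial(x))
        x = F.silu(self.temporal(x))
        # Halve the clip in time AND space -- the key difference from inflated
        # text-to-image U-Nets that keep a fixed number of frames throughout.
        return F.avg_pool3d(x, kernel_size=2, stride=2)


if __name__ == "__main__":
    block = SpaceTimeDownBlock(in_ch=8, out_ch=16)
    clip = torch.randn(1, 8, 16, 64, 64)   # 16 frames at 64x64
    print(block(clip).shape)                # -> torch.Size([1, 16, 8, 32, 32])
```

Stacking such stages (with matching space-time up-sampling blocks on the decoder side) is what lets the coarsest U-Net level attend to the entire clip at once.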
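The summary also describes a two-stage pipeline: the base model generates the full-duration clip at coarse spatial resolution, and a temporally-aware SSR model then upscales it. The sketch below shows that flow under stated assumptions: `base_model`, `ssr_model`, and their `denoise_step` interface are hypothetical placeholders, and all resolutions, frame counts, and step counts are illustrative values rather than the paper's settings.

```python
# Hedged sketch of the two-stage inference flow described above.
# `base_model`, `ssr_model`, and `denoise_step` are hypothetical placeholders,
# not Lumiere's actual API.
import torch
import torch.nn.functional as F


@torch.no_grad()
def generate_video(base_model, ssr_model, text_embedding,
                   frames=16, low_res=64, high_res=256, steps=50):
    # Stage 1: the base model denoises the ENTIRE low-resolution clip in a
    # single pass per step, rather than keyframes plus temporal interpolation.
    video = torch.randn(1, 3, frames, low_res, low_res)   # start from noise
    for t in reversed(range(steps)):
        video = base_model.denoise_step(video, t, text_embedding)

    # Stage 2: a temporally-aware SSR model upscales the clip spatially,
    # conditioned here on a naive trilinear upsampling of the coarse result.
    coarse_up = F.interpolate(video, size=(frames, high_res, high_res),
                              mode="trilinear", align_corners=False)
    refined = torch.randn_like(coarse_up)
    for t in reversed(range(steps)):
        refined = ssr_model.denoise_step(refined, t, text_embedding,
                                         condition=coarse_up)
    return refined


if __name__ == "__main__":
    class _Stub:
        """Trivial stand-in so the sketch runs end to end; a real model
        would be a trained denoising network."""
        def denoise_step(self, x, t, text_embedding, condition=None):
            return 0.98 * x  # placeholder: just shrink the noise

    clip = generate_video(_Stub(), _Stub(), text_embedding=None)
    print(clip.shape)  # -> torch.Size([1, 3, 16, 256, 256])
```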