22 Feb 2024 | Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Ekaterina Deyneka, Tsai-Shien Chen, Anil Kag, Yuwei Fang, Aleksei Stoliar, Elisa Ricci, Jian Ren, Sergey Tulyakov
Snap Video is a video-first model designed to address the challenges of generating high-quality, temporally coherent videos from text prompts. It extends the EDM diffusion framework to account for spatially and temporally redundant pixels, enabling joint video-image training. It also introduces a scalable transformer-based architecture built on FITs (Far-reaching Interleaved Transformers), which trains about 3.31 times faster and runs about 4.5 times faster at inference than a comparable U-Net. This efficiency allows Snap Video to train models with billions of parameters, achieving state-of-the-art results on benchmarks such as UCF101 and MSR-VTT. In user studies, Snap Video outperforms recent methods in photorealism, video-text alignment, and motion quality, with participants showing a significant preference for its generated videos. Its ability to model large motions while maintaining temporal consistency makes it a notable advance in text-to-video synthesis.
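To make the EDM connection concrete, here is a minimal sketch of EDM-style preconditioning (Karras et al., 2022), which Snap Video builds on. The `redundancy_scale` knob is a hypothetical stand-in for the paper's rescaling of the noise process for spatially and temporally redundant pixels; the exact formulation in Snap Video differs in detail.

```python
import torch

def edm_denoise(model, x_noisy, sigma, sigma_data=0.5, redundancy_scale=1.0):
    """Standard EDM preconditioning around a raw network `model`.

    `redundancy_scale` is an illustrative assumption, not the paper's exact
    mechanism: it rescales the noise level so that redundant (e.g. upsampled
    or temporally repeated) pixels see an appropriately harder corruption.
    """
    sigma = sigma * redundancy_scale
    c_skip = sigma_data**2 / (sigma**2 + sigma_data**2)
    c_out = sigma * sigma_data / (sigma**2 + sigma_data**2).sqrt()
    c_in = 1.0 / (sigma**2 + sigma_data**2).sqrt()
    c_noise = sigma.log() / 4.0
    # Skip connection keeps the network's prediction target well-scaled
    # across noise levels, as in the EDM formulation.
    return c_skip * x_noisy + c_out * model(c_in * x_noisy, c_noise)
```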
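The efficiency claim rests on the FIT design: instead of running quadratic self-attention over every spatiotemporal patch, a small set of learned latent tokens reads from the patches, computes among themselves, and writes back. Below is a minimal PyTorch sketch of one such read-compute-write round, assuming a Perceiver/RIN-style structure; `FITBlock`, `n_latents`, and the layer choices are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class FITBlock(nn.Module):
    """One read-compute-write round of a FIT-style block (a sketch)."""

    def __init__(self, dim=512, n_latents=256, n_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(1, n_latents, dim) * 0.02)
        self.read = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.compute = nn.TransformerEncoderLayer(
            dim, n_heads, dim_feedforward=4 * dim, batch_first=True)
        self.write = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, patch_tokens):  # (B, N_patches, dim); N_patches can be huge
        b = patch_tokens.shape[0]
        z = self.latents.expand(b, -1, -1)
        # Read: latents cross-attend to all patch tokens
        # (cost scales as n_latents * N_patches, not N_patches**2).
        z = z + self.read(z, patch_tokens, patch_tokens, need_weights=False)[0]
        # Compute: quadratic self-attention only over the small latent set.
        z = self.compute(z)
        # Write: patch tokens cross-attend to latents to receive the update.
        patch_tokens = patch_tokens + self.write(
            patch_tokens, z, z, need_weights=False)[0]
        return patch_tokens
```

Because the quadratic self-attention runs only over the compact latent set, compute grows gently with spatiotemporal resolution, which is the kind of property that lets an architecture like this scale to billions of parameters.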