22 Feb 2024 | Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Ekaterina Deyneka, Tsai-Shien Chen, Anil Kag, Yuwei Fang, Aleksei Stoliar, Elisa Ricci, Jian Ren, Sergey Tulyakov
Snap Video is a video-first model that addresses key challenges in text-to-video generation. It extends the EDM diffusion framework to account for spatiotemporal redundancy and to support video generation natively, and it introduces a transformer-based architecture inspired by FIT that trains 3.31 times faster than comparable U-Nets and runs inference 4.49 times faster. This efficiency makes it practical to train a text-to-video model with billions of parameters, yielding state-of-the-art results on standard benchmarks and higher-quality, temporally consistent videos with complex motion.

The architecture treats the spatial and temporal dimensions jointly, compressing them into a single 1D sequence of latent tokens so that computation is performed spatio-temporally rather than in separate spatial and temporal layers. The mismatch between image and video training data is handled by treating images as high frame-rate videos, and a two-stage cascaded design is used for high-resolution video generation.

Snap Video is evaluated on the UCF101 and MSR-VTT datasets, showing strong performance in motion quality and text alignment. In user studies it outperforms recent methods in photorealism, text-video alignment, and motion quality, and it produces more temporally coherent motion than the baselines.
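The summary does not reproduce the modified diffusion equations, but Snap Video's starting point is the standard EDM preconditioning of Karras et al., which can be sketched as below. The rescaling for redundant, high-dimensional video inputs is the paper's contribution and is only flagged as an assumption in the comments; the formulas shown are the unmodified EDM baseline.

```python
import torch

def edm_precondition(sigma, sigma_data=0.5):
    """Standard EDM preconditioning coefficients (Karras et al., 2022).

    Snap Video reportedly adapts this framework so the signal-to-noise
    behavior reflects the redundancy of video pixels; that rescaling is
    not given in this summary, so only the baseline form is shown.
    `sigma` is a tensor broadcastable against the input video tensor.
    """
    c_skip = sigma_data**2 / (sigma**2 + sigma_data**2)
    c_out = sigma * sigma_data / (sigma**2 + sigma_data**2).sqrt()
    c_in = 1.0 / (sigma**2 + sigma_data**2).sqrt()
    c_noise = sigma.log() / 4.0
    return c_skip, c_out, c_in, c_noise

def denoise(model, x_noisy, sigma, sigma_data=0.5):
    """Wrap a raw network F into the EDM denoiser D(x; sigma)."""
    c_skip, c_out, c_in, c_noise = edm_precondition(sigma, sigma_data)
    return c_skip * x_noisy + c_out * model(c_in * x_noisy, c_noise)
```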
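The joint spatio-temporal computation can be illustrated with a minimal read-compute-write block in the spirit of FIT: patch tokens from all frames are flattened into one sequence, a small set of learned latent tokens reads from them via cross-attention, self-attention runs on the compressed latents, and the result is written back to the patch tokens. This is a sketch of the general FIT pattern under assumed dimensions and module names, not Snap Video's exact block.

```python
import torch
import torch.nn as nn

class FITBlockSketch(nn.Module):
    """Minimal read-compute-write block in the spirit of FIT.

    Hypothetical sketch: token counts, widths, and layer layout are
    illustrative assumptions, not Snap Video's actual configuration.
    """
    def __init__(self, dim=512, num_latents=256, heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.read = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.compute = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.write = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, patch_tokens):
        # patch_tokens: (B, T*H*W, dim) -- video patches from all frames
        # flattened into a single 1D sequence (joint space-time tokens).
        b = patch_tokens.shape[0]
        z = self.latents.unsqueeze(0).expand(b, -1, -1)
        # Read: latents attend to every spatio-temporal patch token.
        z = z + self.read(z, patch_tokens, patch_tokens)[0]
        # Compute: self-attention on the compressed latent sequence.
        z = self.compute(z)
        # Write: patch tokens attend back to the updated latents.
        out = self.write(patch_tokens, z, z)[0]
        return patch_tokens + out

# Usage: 16 frames of 8x8 patches processed as one joint sequence.
video_tokens = torch.randn(2, 16 * 8 * 8, 512)
out = FITBlockSketch()(video_tokens)
```

Because the expensive self-attention runs only on the compressed latent tokens rather than on every spatio-temporal patch, this pattern is what allows a transformer to scale to full videos, which is the efficiency argument the paper makes against U-Net-based designs.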