Video Interpolation with Diffusion Models

2024-01-01 | Siddhant Jain*, Daniel Watson*, Eric Tabellion, Aleksander Hołynski, Ben Poole, Janne Kontkanen
VIDIM is a generative model for video interpolation: given a start and an end frame, it synthesizes the short video between them. It uses cascaded diffusion models, first generating the target video at low resolution and then generating the high-resolution video conditioned on the low-resolution result. High fidelity is achieved through classifier-free guidance on the start and end frames and by conditioning the super-resolution model on the original high-resolution frames without any additional parameters. Unlike previous state-of-the-art methods, VIDIM handles complex, nonlinear, and ambiguous motion. It is fast to sample from, requires fewer than a billion parameters per diffusion model, and scales well to larger parameter counts. On two curated benchmarks, Davis-7 and UCF101-7, it outperforms prior methods on both quantitative and qualitative metrics and is preferred by human observers.
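To make the cascade concrete, here is a minimal sketch of VIDIM-style sampling. The function names (downsample, base_denoiser, sr_denoiser, ddpm_sample) and the numpy stand-ins for the trained networks are assumptions for illustration only, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def downsample(img, size):
    # Nearest-neighbour resize; a real pipeline would use an anti-aliased filter.
    h, w = img.shape[:2]
    return img[np.arange(size) * h // size][:, np.arange(size) * w // size]

def base_denoiser(x, t, cond):   # stand-in for the trained low-resolution UNet
    return x * (1.0 - t)

def sr_denoiser(x, t, cond):     # stand-in for the trained super-resolution UNet
    return x * (1.0 - t)

def ddpm_sample(denoiser, shape, cond, steps=64):
    """Generic ancestral-style sampling loop; the denoiser predicts the clean video."""
    x = rng.standard_normal(shape).astype(np.float32)
    for t in reversed(range(1, steps + 1)):
        x0_hat = denoiser(x, t / steps, cond)                  # predicted clean sample
        noise = rng.standard_normal(shape).astype(np.float32) if t > 1 else 0.0
        x = x0_hat + ((t - 1) / steps) * noise                 # crude re-noising schedule
    return x

def interpolate(start_hi, end_hi, num_frames=7, lo=64, hi=256):
    """Cascade: the base model generates the low-resolution video, then the
    super-resolution model upscales it, conditioned on both the low-resolution
    result and the original high-resolution endpoint frames."""
    start_lo, end_lo = downsample(start_hi, lo), downsample(end_hi, lo)
    video_lo = ddpm_sample(base_denoiser, (num_frames, lo, lo, 3),
                           cond=dict(start=start_lo, end=end_lo))
    video_hi = ddpm_sample(sr_denoiser, (num_frames, hi, hi, 3),
                           cond=dict(low_res=video_lo, start=start_hi, end=end_hi))
    return video_hi

start = rng.random((256, 256, 3)).astype(np.float32)
end = rng.random((256, 256, 3)).astype(np.float32)
print(interpolate(start, end).shape)  # (7, 256, 256, 3)
```

Passing the original high-resolution start and end frames directly into the super-resolution stage is what the paper refers to as parameter-free frame conditioning: the conditioning signal enters through the inputs rather than through extra learned parameters.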
Training follows the cascaded approach with separate base and super-resolution models. The base model generates the intermediate frames between the two input frames at low resolution, and the super-resolution model upscales them, using parameter-free frame conditioning on the original high-resolution start and end frames and classifier-free guidance to improve sample quality. The models are trained on a mixture of publicly available and internal video datasets and evaluated on the Davis and UCF101 benchmarks with PSNR, SSIM, LPIPS, FID, and FVD, where VIDIM outperforms the baselines, particularly on the large and ambiguous motion that other methods struggle with. Ablation studies confirm the importance of explicit frame conditioning and classifier-free guidance, scalability studies show that larger models achieve better results, and in human evaluation VIDIM samples are strongly preferred by raters.
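The classifier-free guidance applied to the conditioning frames can be illustrated with a short sketch. The denoise stand-in and the guidance_weight value below are assumptions for illustration; as in standard classifier-free guidance training, the unconditional prediction corresponds to running the model with the frame conditioning dropped.

```python
import numpy as np

def denoise(x, t, cond):
    # Stand-in for a trained diffusion UNet. When `cond` is None the model
    # runs unconditionally (conditioning frames dropped).
    scale = 0.9 if cond is not None else 0.8
    return x * scale * (1.0 - t)

def guided_denoise(x, t, cond, guidance_weight=2.0):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the frame-conditional one by `guidance_weight`."""
    uncond = denoise(x, t, None)
    cond_pred = denoise(x, t, cond)
    return uncond + guidance_weight * (cond_pred - uncond)

rng = np.random.default_rng(0)
x_t = rng.standard_normal((7, 64, 64, 3)).astype(np.float32)
frames = dict(start=rng.random((64, 64, 3)), end=rng.random((64, 64, 3)))
x0_hat = guided_denoise(x_t, t=0.5, cond=frames)
print(x0_hat.shape)  # (7, 64, 64, 3)
```

A guidance weight above 1 pushes samples to agree more strongly with the given start and end frames, which is where the reported fidelity gains come from.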