3 Dec 2018 | Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, Bryan Catanzaro
This paper addresses the problem of video-to-video synthesis: learning a mapping function that converts an input video into a photorealistic output video. Unlike image-to-image translation, which has been studied extensively, video-to-video synthesis remains relatively unexplored. The authors propose a generative adversarial network (GAN) framework with carefully designed generators and discriminators and a spatio-temporal adversarial objective, which together produce high-resolution, photorealistic, and temporally coherent videos from diverse input formats such as segmentation masks, sketches, and poses. Experiments on multiple benchmarks show that the method outperforms strong baselines, synthesizing 2K-resolution videos up to 30 seconds long and advancing the state of the art in video synthesis. The method also extends to future video prediction, where it outperforms existing systems. The paper additionally reviews related work on GANs, image-to-image translation, unconditional video synthesis, and future video prediction.
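To make the idea of a spatio-temporal adversarial objective concrete, below is a minimal PyTorch sketch of one plausible setup: a conditional per-frame discriminator that judges individual (label map, frame) pairs for photorealism, paired with a temporal discriminator that judges short clips of K consecutive frames for coherence. This is an illustrative assumption, not the authors' released vid2vid architecture; the class names (`ImageD`, `VideoD`), layer sizes, and the least-squares GAN loss are all stand-ins.

```python
import torch
import torch.nn as nn

class ImageD(nn.Module):
    """Conditional per-frame discriminator: sees a label map concatenated
    with a single frame and outputs PatchGAN-style realism logits."""
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, padding=1),
        )

    def forward(self, label, frame):
        return self.net(torch.cat([label, frame], dim=1))

class VideoD(nn.Module):
    """Temporal discriminator: K consecutive frames stacked along the
    channel axis, so the network can penalize flicker between frames."""
    def __init__(self, frame_ch, k):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(frame_ch * k, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 4, padding=1),
        )

    def forward(self, clip):  # clip: (B, K, C, H, W)
        b, k, c, h, w = clip.shape
        return self.net(clip.reshape(b, k * c, h, w))

def gan_loss(logits, is_real):
    """Least-squares GAN loss; one common choice among several."""
    target = torch.ones_like(logits) if is_real else torch.zeros_like(logits)
    return ((logits - target) ** 2).mean()
```

In training, the generator would be optimized to fool both discriminators at once, so the per-frame term drives image quality while the clip-level term drives temporal consistency; the actual paper also adds components this sketch omits, such as optical-flow-based warping and coarse-to-fine, multi-scale generators and discriminators.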