Video-to-Video Synthesis

3 Dec 2018 | Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, Bryan Catanzaro
This paper presents a video-to-video synthesis approach based on generative adversarial networks (GANs). The goal is to learn a mapping that converts an input video (e.g., a sequence of semantic segmentation masks) into a photorealistic output video that faithfully depicts the input content. The key challenge is generating videos that are both high-resolution and temporally coherent, which existing image synthesis techniques fail to achieve when applied to video frame by frame. The proposed approach uses a conditional GAN framework with carefully designed generators and discriminators, together with a spatio-temporal adversarial objective, to produce high-quality, temporally coherent results.

The method is tested on several input formats, including segmentation masks, sketches, and poses, and outperforms existing baselines. The model can generate 2K-resolution videos up to 30 seconds long, significantly advancing the state of the art in video synthesis. The approach is also applied to future video prediction, where it outperforms several competing systems. Evaluations on multiple benchmarks show that the synthesized videos are more photorealistic and temporally coherent than those produced by strong baselines. In addition, the method supports multimodal synthesis, generating videos with diverse appearances from the same input.

The paper also discusses limitations of the approach, such as difficulty synthesizing turning cars and maintaining consistent object appearances across an entire video. Overall, the proposed method provides a general solution for video-to-video synthesis that achieves high-quality, temporally coherent results.
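As a rough illustration of the spatio-temporal adversarial objective described above (the notation here is a simplified sketch, not necessarily the paper's exact formulation): the generator F synthesizes each output frame conditioned on the current and recent source frames and on the frames it has already generated, and is trained against both a per-frame (image) discriminator D_I and a multi-frame (video) discriminator D_V, along with an auxiliary optical-flow term:

\tilde{x}_t = F\big(\tilde{x}_{t-L}^{\,t-1},\, s_{t-L}^{\,t}\big), \qquad
\min_F \Big( \max_{D_I} \mathcal{L}_I(F, D_I) \;+\; \max_{D_V} \mathcal{L}_V(F, D_V) \Big) \;+\; \lambda_W \mathcal{L}_W(F)

Here s_1^T denotes the input (source) video, \tilde{x}_1^T the synthesized video, \mathcal{L}_I a conditional GAN loss on individual frames, \mathcal{L}_V a conditional GAN loss on short clips of consecutive frames (encouraging temporal coherence), and \mathcal{L}_W a flow-warping loss weighted by \lambda_W. The per-frame term pushes each output toward photorealism, while the clip-level term and flow loss penalize flicker and motion inconsistencies between neighboring frames.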