26 Feb 2016 | Michael Mathieu, Camille Couprie & Yann LeCun
This paper presents a deep multi-scale video prediction method that improves upon predictions trained with the traditional Mean Squared Error (MSE) loss by introducing three complementary strategies: a multi-scale architecture, adversarial training, and an image gradient difference loss. The goal is to generate sharp, accurate future video frames without requiring explicit tracking of pixel trajectories. The method is evaluated on the UCF101 and Sports-1M datasets, and compared to previous approaches such as recurrent neural networks and LSTM-based models.
The proposed approach uses a convolutional network to predict future frames from a sequence of input frames. To address the inherent blurriness of MSE-based predictions, the authors introduce a multi-scale architecture that allows the model to capture both short- and long-range dependencies. They also employ adversarial training, where a discriminative network is trained to distinguish between real and generated frames, forcing the generator to produce more realistic predictions. Additionally, a new loss function based on image gradient differences is introduced to preserve the sharpness of the predicted frames.
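To make the gradient difference loss concrete, here is a minimal PyTorch sketch of the loss as described in the paper: it compares the finite differences (image gradients) of the predicted frame against those of the ground truth. This is an illustrative implementation, not the authors' code; the function name and the use of a mean rather than a sum reduction are our own choices.

```python
import torch

def gradient_difference_loss(pred, target, alpha=1):
    """Penalize mismatches between the image gradients of the prediction
    and of the ground truth, which keeps edges sharper than plain MSE.
    pred, target: tensors of shape (batch, channels, height, width).
    """
    # Horizontal and vertical finite differences of each image.
    pred_dx = torch.abs(pred[:, :, :, 1:] - pred[:, :, :, :-1])
    pred_dy = torch.abs(pred[:, :, 1:, :] - pred[:, :, :-1, :])
    target_dx = torch.abs(target[:, :, :, 1:] - target[:, :, :, :-1])
    target_dy = torch.abs(target[:, :, 1:, :] - target[:, :, :-1, :])
    # Differences between gradient magnitudes, raised to the power alpha.
    grad_diff_x = torch.abs(target_dx - pred_dx)
    grad_diff_y = torch.abs(target_dy - pred_dy)
    return (grad_diff_x ** alpha).mean() + (grad_diff_y ** alpha).mean()
```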
The results show that the combination of these strategies significantly improves the quality of video predictions compared to traditional MSE-based methods. The multi-scale model, adversarial training, and gradient difference loss all contribute to sharper and more accurate predictions. The best results are achieved by combining the multi-scale architecture, $\ell_1$ norm, gradient difference loss, and adversarial training.
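A rough sketch of how these terms could be combined into a single generator objective is shown below. It reuses the gradient_difference_loss sketch above; the weighting coefficients are placeholders rather than the values used in the paper, and disc_scores stands for the discriminator's output on the generated frames.

```python
import torch
import torch.nn.functional as F

def generator_loss(pred, target, disc_scores,
                   lambda_l1=1.0, lambda_gdl=1.0, lambda_adv=0.05):
    """Combined generator objective: l1 reconstruction term + gradient
    difference loss + adversarial term. The lambda weights here are
    illustrative placeholders, not the paper's reported settings.
    """
    l1 = F.l1_loss(pred, target)
    gdl = gradient_difference_loss(pred, target)
    # Adversarial term: push the discriminator's scores on the generated
    # frames toward the "real" label (1).
    adv = F.binary_cross_entropy(disc_scores, torch.ones_like(disc_scores))
    return lambda_l1 * l1 + lambda_gdl * gdl + lambda_adv * adv
```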
The paper also compares the proposed method to previous approaches, including those based on recurrent neural networks and LSTM models. The results demonstrate that the proposed method outperforms these approaches in terms of PSNR, SSIM, and sharpness measures. The method is particularly effective in capturing motion in dynamic scenes, as it focuses on the moving areas of the images rather than the static background.
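The idea of scoring only the moving regions can be illustrated with a masked PSNR, sketched below under the assumption that a binary motion mask (for example, thresholded optical flow computed on the ground-truth frames) is produced elsewhere; the function name and mask convention are ours.

```python
import torch

def masked_psnr(pred, target, motion_mask, max_val=1.0):
    """PSNR restricted to moving pixels (motion_mask == 1), mirroring the
    evaluation of prediction quality only where the scene actually changes.
    pred, target, motion_mask share the same shape; max_val is the peak
    pixel intensity (1.0 for images scaled to [0, 1]).
    """
    mask = motion_mask.bool()
    mse = ((pred - target)[mask] ** 2).mean()
    return 10 * torch.log10(max_val ** 2 / mse)
```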
The authors conclude that their approach provides a more accurate and sharper prediction of future video frames compared to traditional methods. The model is fully differentiable, allowing it to be fine-tuned for other tasks if needed. Future work will focus on evaluating the classification performance of the learned representations in a weakly supervised context, such as on the UCF101 dataset. The method could also be extended to combine with optical flow predictions or used in applications where next frame prediction is required without explicit optical flow.