10 Apr 2018 | Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, Oliver Wang
The paper "The Unreasonable Effectiveness of Deep Features as a Perceptual Metric" by Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang explores the effectiveness of deep features in capturing human perceptual similarity judgments. The authors introduce a new dataset, the Berkeley-Adobe Perceptual Patch Similarity (BAPPS) dataset, which contains 484k human judgments on a wide range of distortions and real algorithm outputs. They find that deep features, trained on various architectures and supervision types (supervised, self-supervised, and unsupervised), outperform traditional metrics like SSIM and PSNR in predicting human perceptual similarity. Specifically, they show that deep features from networks like VGG, AlexNet, and SqueezeNet, even without further calibration, perform better than these traditional metrics. The study also demonstrates that network architecture and training signal are crucial, with untrained networks achieving much lower performance. The authors conclude that perceptual similarity is an emergent property shared across deep visual representations, suggesting that features effective at semantic prediction tasks are also good at capturing human perceptual behavior.The paper "The Unreasonable Effectiveness of Deep Features as a Perceptual Metric" by Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang explores the effectiveness of deep features in capturing human perceptual similarity judgments. The authors introduce a new dataset, the Berkeley-Adobe Perceptual Patch Similarity (BAPPS) dataset, which contains 484k human judgments on a wide range of distortions and real algorithm outputs. They find that deep features, trained on various architectures and supervision types (supervised, self-supervised, and unsupervised), outperform traditional metrics like SSIM and PSNR in predicting human perceptual similarity. Specifically, they show that deep features from networks like VGG, AlexNet, and SqueezeNet, even without further calibration, perform better than these traditional metrics. The study also demonstrates that network architecture and training signal are crucial, with untrained networks achieving much lower performance. The authors conclude that perceptual similarity is an emergent property shared across deep visual representations, suggesting that features effective at semantic prediction tasks are also good at capturing human perceptual behavior.