Understanding Pros and Cons of GAN Evaluation Measures

This paper reviews and critically discusses more than 24 quantitative and 5 qualitative measures for evaluating generative models, with an emphasis on GAN-derived models. The author stresses the importance of settling on a few good measures to steer progress in the field. The paper is organized into two main sections, one on quantitative measures and one on qualitative measures. Each measure is assessed against a set of desiderata, which include favoring models that generate high-fidelity samples, being robust to overfitting, and detecting mode collapse. Key measures discussed include Average Log-likelihood, Inception Score (IS), Mode Score, AM Score, Fréchet Inception Distance (FID), Maximum Mean Discrepancy (MMD), and the Wasserstein Critic. The paper also covers the Birthday Paradox Test, Classifier Two-sample Tests (C2ST), Classification Performance, GAN Quality Index (GQI), Data Augmentation Utility, Boundary Distortion, Number of Statistically-Different Bins (NDB), Image Retrieval Performance, Generative Adversarial Metric (GAM), and Tournament Win Rate and Skill Rating. The strengths and limitations of each measure are laid out, giving a picture of the current state of GAN evaluation methods.

Pros and Cons of GAN Evaluation Measures

October 25, 2018 | Ali Borji