Pros and Cons of GAN Evaluation Measures

October 25, 2018 | Ali Borji
This paper reviews and critically discusses over 24 quantitative and 5 qualitative measures for evaluating generative models, with a focus on GANs. It outlines 7 desiderata for effective GAN evaluation measures and assesses whether existing measures align with these criteria. The paper highlights the challenges in evaluating GANs, including the lack of consensus on a best evaluation metric and the difficulty of capturing both quantitative and qualitative aspects of model performance.

The measures discussed include average log-likelihood, the coverage metric, Inception Score (IS), Modified Inception Score (m-IS), Mode Score, AM Score, Fréchet Inception Distance (FID), Maximum Mean Discrepancy (MMD), the Wasserstein Critic, the Birthday Paradox Test, Classifier Two-sample Tests (C2ST), Classification Performance, Boundary Distortion, Number of Statistically-Different Bins (NDB), Image Retrieval Performance, the Generative Adversarial Metric (GAM), and Tournament Win Rate and Skill Rating. Each measure is analyzed in terms of its strengths, limitations, and compatibility with the desiderata.

The paper concludes that while no single measure is perfect, some measures, such as FID and MMD, are more robust and reliable for evaluating GANs. It also suggests that future research should focus on developing more efficient and fair evaluation measures.
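As a concrete illustration of one of the measures the paper highlights as robust, below is a minimal Python sketch of the Fréchet (Wasserstein-2) distance between Gaussian fits of real and generated feature statistics. It assumes the features have already been extracted with a pretrained Inception network; the function name `frechet_distance` and the NumPy/SciPy choices are illustrative, not taken from the paper.

```python
import numpy as np
from scipy import linalg


def frechet_distance(feats_real, feats_gen):
    """FID-style distance between two feature sets (rows = samples).

    Fits a Gaussian to each set and returns
    ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^{1/2}).
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)

    # Matrix square root of the covariance product; discard tiny imaginary
    # parts introduced by numerical error.
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```

Note that FID estimates are sensitive to the number of samples used for the mean and covariance estimates, so models should be compared with the same sample size.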