24 Apr 2016 | Lucas Theis*, Aäron van den Oord*,†, Matthias Bethge
The article "A Note on the Evaluation of Generative Models" by Lucas Theis, Aäron van den Oord, and Matthias Bethge reviews the evaluation and interpretation of generative models, with a particular focus on image models. It highlights the heterogeneity in how these models are formulated, trained, and evaluated, which makes direct comparisons difficult. The authors examine three commonly used criteria (average log-likelihood, Parzen window estimates, and visual fidelity of samples) and argue that these are largely independent of one another when the data is high-dimensional: good performance on one criterion does not imply good performance on the others.

The article stresses that generative models should be evaluated directly with respect to their intended applications, and it gives examples showing that Parzen window estimates should generally be avoided because of their limitations. The introduction motivates the use of different training objectives and the trade-offs between them, while the evaluation section examines the relationship between log-likelihood and sample quality, and shows how these metrics can be misleading in high-dimensional spaces. The conclusion argues for context-specific evaluation methods and for aligning training and evaluation with the target application.
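To make the Parzen window criterion concrete, here is a minimal sketch of how such an estimate is typically computed: a Gaussian kernel density estimate is built from model samples, and the average log-likelihood of held-out data under that density is reported. The function name, the synthetic Gaussian data, and the bandwidth value are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def parzen_log_likelihood(samples, test_points, sigma):
    """Average log-likelihood of test_points under a Gaussian
    Parzen window (kernel density) estimate built from samples."""
    n, d = samples.shape
    # Squared distances between every test point and every model sample.
    diffs = test_points[:, None, :] - samples[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)          # shape (num_test, n)
    # Log Gaussian kernel and the d-dimensional normalising constant.
    log_kernel = -sq_dists / (2.0 * sigma ** 2)
    log_norm = np.log(n) + 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
    # log p(x) = logsumexp over samples, minus the normaliser
    # (computed stably by subtracting the per-row maximum).
    m = log_kernel.max(axis=1, keepdims=True)
    log_p = m[:, 0] + np.log(np.exp(log_kernel - m).sum(axis=1)) - log_norm
    return log_p.mean()

rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 2))   # stand-in for model samples
test = rng.normal(size=(200, 2))       # stand-in for held-out data
ll = parzen_log_likelihood(samples, test, sigma=0.2)
print(ll)
```

The estimate depends heavily on the kernel bandwidth `sigma` and on the number of samples, and in high dimensions it can rank models very differently from their true log-likelihood, which is one of the limitations the article uses to argue against this criterion.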