2024 | Denis Blessing, Xiaogang Jia, Johannes Esslinger, Francisco Vargas, Gerhard Neumann
This paper addresses the challenge of sampling from intractable probability distributions, focusing on the evaluation of variational methods. Current studies lack a unified framework, leading to inconsistent performance measures and limited comparisons across diverse tasks. To address these issues, the authors introduce a benchmark that evaluates sampling methods using a standardized task suite and a broad range of performance criteria. They also propose new metrics to quantify mode collapse, a common issue in sampling methods. The findings provide insights into the strengths and weaknesses of existing sampling methods, serving as a valuable reference for future developments. The code for the benchmark is publicly available.
Sampling methods are designed to generate approximate samples from a probability density or to estimate its normalization constant. Monte Carlo (MC) methods, including Annealed Importance Sampling (AIS) and its Sequential Monte Carlo (SMC) extensions, have traditionally been considered the gold standard. Variational Inference (VI) is an alternative approach that parameterizes a tractable family of distributions and optimizes the parameters so that the variational distribution matches the target as closely as possible, typically by maximizing the evidence lower bound (ELBO). Recent advances combine MC and VI techniques to approximate complex, potentially multimodal distributions.
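As a concrete, if simplified, illustration of the VI setup, the sketch below fits a Gaussian variational distribution to an unnormalized bimodal target by stochastic ascent on a Monte Carlo estimate of the ELBO, using reparameterized samples and crude finite-difference gradients. The target, the optimizer, and all hyperparameters are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Illustrative unnormalized 1-D target: a mixture of two Gaussians (assumed, not from the paper).
def log_unnormalized_target(x):
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

def elbo_estimate(mu, log_sigma, eps):
    """Monte Carlo ELBO: E_q[log p_tilde(x) - log q(x)] with q = N(mu, sigma^2)."""
    sigma = np.exp(log_sigma)
    x = mu + sigma * eps                                  # reparameterization trick
    log_q = -0.5 * ((x - mu) / sigma) ** 2 - log_sigma - 0.5 * np.log(2 * np.pi)
    return np.mean(log_unnormalized_target(x) - log_q)

# Crude finite-difference ascent on the ELBO, using common random numbers per step.
rng = np.random.default_rng(0)
mu, log_sigma, lr, h = 0.0, 0.0, 0.05, 1e-3
for _ in range(500):
    eps = rng.standard_normal(256)
    g_mu = (elbo_estimate(mu + h, log_sigma, eps) - elbo_estimate(mu - h, log_sigma, eps)) / (2 * h)
    g_ls = (elbo_estimate(mu, log_sigma + h, eps) - elbo_estimate(mu, log_sigma - h, eps)) / (2 * h)
    mu, log_sigma = mu + lr * g_mu, log_sigma + lr * g_ls

print(f"fitted q: mean={mu:.2f}, std={np.exp(log_sigma):.2f}")
```

Because the ELBO is, up to the unknown log normalizer, the negative reverse KL divergence, a unimodal q trained this way can collapse onto a single mode of a multimodal target; this is the failure mode the paper's mode-coverage metric is designed to expose.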
However, evaluating these methods is challenging: there is no standardized set of tasks, and performance criteria vary widely across studies. Commonly reported criteria such as the evidence lower bound (ELBO) rely only on samples drawn from the model itself, which limits what they can reveal about approximation quality. To overcome this, integral probability metrics (IPMs) such as the maximum mean discrepancy and the Wasserstein distance are used, but these metrics involve subjective design choices, for example the choice of kernel or ground cost.
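To make the IPM-based evaluation concrete, here is a minimal sketch of the standard (biased) MMD² estimator between model samples and ground-truth target samples. It assumes an RBF kernel with a median-heuristic bandwidth, a common but illustrative choice rather than the benchmark's exact configuration.

```python
import numpy as np

def mmd_squared_rbf(x, y, bandwidth=None):
    """Biased estimator of MMD^2 between sample sets x (n, d) and y (m, d) with an RBF kernel.

    The bandwidth defaults to the median pairwise distance (the "median heuristic"),
    one of the subjective design choices mentioned above.
    """
    x, y = np.atleast_2d(x), np.atleast_2d(y)
    z = np.concatenate([x, y], axis=0)
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    if bandwidth is None:
        bandwidth = np.sqrt(np.median(sq_dists[sq_dists > 0]) / 2.0)
    k = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    n = len(x)
    k_xx, k_yy, k_xy = k[:n, :n], k[n:, n:], k[:n, n:]
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()

# Usage: compare model samples against ground-truth samples from the target.
rng = np.random.default_rng(0)
true_samples = rng.normal(0.0, 1.0, size=(500, 2))
model_samples = rng.normal(0.5, 1.0, size=(500, 2))   # slightly biased model
print(mmd_squared_rbf(model_samples, true_samples))
```

Note that such metrics require ground-truth samples from the target, which is one reason they are mainly applicable on benchmark tasks where such samples are available.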
To address these challenges, the authors introduce a comprehensive set of tasks for evaluating variational methods. They explore existing evaluation criteria and propose a novel metric, *entropic mode coverage* (EMC), to quantify mode collapse. Through this evaluation, they aim to provide valuable insights into the strengths and weaknesses of current sampling methods, contributing to the future design of more effective techniques and the establishment of standardized evaluation protocols.
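The exact definition of EMC is not reproduced here; purely as an illustration of the idea, the sketch below implements a hypothetical mode-coverage score under the assumption that the target's mode locations are known (as they are for synthetic mixture targets): each sample is assigned to its nearest mode, and the entropy of the assignment histogram is normalized so that 1 means all modes are covered uniformly and 0 means the sampler has collapsed onto a single mode.

```python
import numpy as np

def mode_coverage_score(samples, mode_locations):
    """Hypothetical mode-coverage score (illustrative, not the paper's exact EMC definition).

    Each sample is assigned to the nearest known mode; the normalized entropy of the
    assignment histogram is 1 for uniform coverage and 0 for collapse onto one mode.
    """
    samples = np.atleast_2d(samples)
    modes = np.atleast_2d(mode_locations)
    dists = np.linalg.norm(samples[:, None, :] - modes[None, :, :], axis=-1)
    assignments = np.argmin(dists, axis=1)
    counts = np.bincount(assignments, minlength=len(modes))
    probs = counts / counts.sum()
    entropy = -np.sum(probs[probs > 0] * np.log(probs[probs > 0]))
    return entropy / np.log(len(modes))

# Usage: a sampler that only visits one of four well-separated modes scores near 0.
modes = np.array([[-5, -5], [-5, 5], [5, -5], [5, 5]], dtype=float)
collapsed = np.random.default_rng(0).normal([5, 5], 1.0, size=(1000, 2))
print(mode_coverage_score(collapsed, modes))   # close to 0
```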
The paper provides an overview of Monte Carlo methods, Variational Inference, and their combinations. It discusses the limitations of existing evaluation protocols and introduces new metrics to address these issues. The authors categorize the included methods into three groups: tractable density models, sequential importance sampling methods, and diffusion-based methods. They detail the hyperparameters of each method and present the evaluation protocol, including how the performance criteria are computed and how hyperparameters are tuned.
The evaluation is conducted on synthetic and real-world target densities and includes both quantitative and qualitative analyses. The results highlight the strengths and weaknesses of different methods, such as the effectiveness of GMMVI and FAB on high-dimensional problems and the limited sample efficiency of diffusion-based methods. The paper concludes with general observations and method-specific observations, offering valuable insights for future research.
This paper advances the field of Machine Learning by providing a comprehensive evaluation of sampling methods. It offers a standardized framework for comparing different techniques and highlights the importance of addressing mode collapse. The findings can guide the development of more effective sampling methods and contribute to the establishment of standardized evaluation protocols.