Generating Diverse High-Fidelity Images with VQ-VAE-2
2 Jun 2019 | Ali Razavi, Aäron van den Oord, Oriol Vinyals
This paper presents a method for generating high-fidelity, diverse images using a Vector Quantized Variational AutoEncoder (VQ-VAE) paired with a powerful autoregressive prior. A hierarchical VQ-VAE is first trained to compress images into a discrete latent space; a PixelCNN prior is then fit over those latents, enabling efficient sampling and high-quality image generation.

The hierarchical structure lets the model capture both global and local information, with the top-level latent code modeling global features and the bottom-level latent code capturing local details. Equipping the PixelCNN prior with self-attention further improves the quality and diversity of generated images. Because the priors operate in the compressed latent space rather than pixel space, training and sampling are roughly 30x faster.
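The core VQ-VAE step, mapping each continuous encoder output vector to the nearest entry of a learned codebook, can be sketched in a few lines of NumPy. This is a minimal illustration under assumed shapes (`z_e` as a flat batch of vectors, a small toy codebook), not the paper's actual implementation:

```python
import numpy as np

def vector_quantize(z_e, codebook):
    """Map each encoder output vector to its nearest codebook entry.

    z_e:      (N, D) continuous encoder outputs.
    codebook: (K, D) learned embedding vectors.
    Returns (indices, z_q): discrete latent codes and quantized vectors.
    """
    # Squared Euclidean distance from every z_e row to every codebook row.
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)   # (N,) discrete codes fed to the prior
    z_q = codebook[indices]          # (N, D) quantized vectors fed to the decoder
    return indices, z_q

# Toy usage: 4 encoder vectors, codebook of 3 entries in 2-D.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]])
z_e = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, 0.6], [0.0, 0.1]])
indices, z_q = vector_quantize(z_e, codebook)
# indices → [0, 1, 2, 0]
```

It is these integer `indices` grids (one per hierarchy level) that the PixelCNN prior models autoregressively; in training, gradients are passed through the non-differentiable lookup with a straight-through estimator.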
Evaluated on ImageNet and other datasets, the model is competitive with state-of-the-art generative adversarial networks (GANs) in image quality and diversity, while avoiding the mode collapse and limited diversity that GANs often exhibit. The paper also introduces a classifier-based rejection sampling scheme to trade off diversity against quality in the generated samples. Overall, the method produces high-resolution images with high fidelity and diversity, outperforms other models on several metrics, and its fast, low-overhead encoding and decoding make it well suited to applications involving large images.
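The classifier-based rejection sampling idea, scoring each generated sample with a pretrained classifier's probability for the intended class and keeping only the most confident fraction, might be sketched as follows. The function name, inputs, and `keep_fraction` knob are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def rejection_sample(samples, class_probs, keep_fraction=0.5):
    """Keep the samples a classifier scores most confidently.

    samples:       list of N generated samples.
    class_probs:   (N,) classifier probability of the intended class
                   for each sample (assumed precomputed).
    keep_fraction: the diversity/quality knob — lower values keep only
                   the highest-scoring samples (better quality, less
                   diversity); higher values keep more of the batch.
    """
    n_keep = max(1, int(len(samples) * keep_fraction))
    order = np.argsort(class_probs)[::-1]  # most confident first
    return [samples[i] for i in order[:n_keep]]

# Toy usage with hypothetical classifier scores.
probs = np.array([0.9, 0.2, 0.7, 0.4])
kept = rejection_sample(["a", "b", "c", "d"], probs, keep_fraction=0.5)
# kept → ["a", "c"]
```

Sweeping `keep_fraction` traces out the quality/diversity trade-off the paper uses this mechanism to control.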