Neural Discrete Representation Learning

30 May 2018 | Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu
This paper introduces the Vector Quantised-Variational AutoEncoder (VQ-VAE), a generative model that learns discrete latent representations. Unlike traditional Variational AutoEncoders (VAEs), which use continuous latent variables, the VQ-VAE employs discrete latent variables, a more natural fit for modalities such as language, speech, and images. The model uses vector quantisation to learn discrete latent codes, which helps avoid the "posterior collapse" issue in which the latent variables are ignored. The VQ-VAE is trained with a combination of a reconstruction loss and a commitment loss so that the latent variables are effectively used.

The model performs as well as continuous VAEs in terms of log-likelihood and can generate high-quality images, videos, and speech. It also learns phonemes and performs speaker conversion without supervision. The discrete latent variables are used to train an autoregressive prior, enabling the generation of diverse samples.

The VQ-VAE is evaluated on image, audio, and video generation tasks, showing its effectiveness in learning meaningful representations. Its discrete latent space captures important features of the data in an unsupervised manner, and it achieves likelihoods comparable to continuous models on CIFAR10. The paper also discusses related work and compares the VQ-VAE with other discrete and continuous VAE models, highlighting its advantages in performance and flexibility.
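To make the quantisation and loss terms concrete, here is a minimal PyTorch sketch of a VQ-VAE-style bottleneck: the encoder output is snapped to its nearest codebook vector, the codebook and commitment terms are added to the reconstruction loss, and a straight-through estimator copies gradients past the discrete step. The shapes, the toy linear encoder/decoder, and the hyperparameters are illustrative assumptions, not the authors' architecture; only the quantisation step and the structure of the loss follow the paper.

```python
# Minimal VQ-VAE bottleneck sketch (PyTorch). Illustrative only; the encoder,
# decoder, shapes, and hyperparameters below are assumptions for demonstration.
import torch
import torch.nn.functional as F


class VectorQuantizer(torch.nn.Module):
    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = torch.nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment cost

    def forward(self, z_e):
        # z_e: encoder output, shape (batch, code_dim).
        # Nearest-neighbour lookup: pick the closest codebook vector for each input.
        distances = torch.cdist(z_e, self.codebook.weight)  # (batch, num_codes)
        indices = distances.argmin(dim=1)                    # discrete latent codes
        z_q = self.codebook(indices)                         # quantised vectors

        # Codebook loss moves embeddings toward encoder outputs (stop-gradient on z_e);
        # commitment loss keeps encoder outputs close to their chosen embeddings.
        codebook_loss = F.mse_loss(z_q, z_e.detach())
        commitment_loss = self.beta * F.mse_loss(z_e, z_q.detach())

        # Straight-through estimator: pass decoder gradients straight to the encoder.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices, codebook_loss + commitment_loss


if __name__ == "__main__":
    # Toy usage on random data, purely to show how the pieces fit together.
    torch.manual_seed(0)
    encoder = torch.nn.Linear(784, 64)
    decoder = torch.nn.Linear(64, 784)
    vq = VectorQuantizer()
    x = torch.rand(8, 784)

    z_q, indices, vq_loss = vq(encoder(x))
    recon = decoder(z_q)
    loss = F.mse_loss(recon, x) + vq_loss  # reconstruction + VQ terms
    loss.backward()
    print("loss:", float(loss), "codes used:", indices.tolist())
```

In the full model, the discrete indices produced by this bottleneck form the latent space over which an autoregressive prior (e.g. a PixelCNN-style model) is trained, enabling sampling of new images, audio, or video from the learned codes.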