HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

23 Oct 2020 | Jungil Kong, Jaehyeon Kim, Jaekyoung Bae
HiFi-GAN is a generative adversarial network (GAN) designed for efficient and high-fidelity speech synthesis. It addresses a limitation of previous GAN-based methods, which improved sampling efficiency and memory usage but failed to match the quality of autoregressive and flow-based models. HiFi-GAN achieves both efficiency and high fidelity by explicitly modeling the periodic patterns of speech audio, which are crucial for generating realistic speech. In a subjective human evaluation (mean opinion score, MOS) on a single-speaker dataset, HiFi-GAN generates 22.05 kHz high-fidelity audio 167.9 times faster than real-time on a single V100 GPU. It also generalizes to mel-spectrogram inversion of unseen speakers and to end-to-end speech synthesis, and a small-footprint version generates samples 13.4 times faster than real-time on CPU with quality comparable to an autoregressive counterpart.

HiFi-GAN consists of one generator and two discriminators: a multi-scale discriminator (MSD) and a multi-period discriminator (MPD). The generator is a fully convolutional neural network that upsamples mel-spectrograms until the temporal resolution matches that of raw waveforms; its multi-receptive field fusion (MRF) module observes patterns of various lengths in parallel. The MPD consists of several sub-discriminators, each handling a different periodic part of the input audio, while the MSD evaluates audio samples at different scales.

The generator and discriminators are trained adversarially, together with two additional losses that improve training stability and model performance. Training thus combines a GAN loss, a mel-spectrogram loss, and a feature matching loss. The GAN loss is based on least-squares objectives; the mel-spectrogram loss improves the fidelity of the generated audio; and the feature matching loss measures the similarity between discriminator features of ground-truth and generated samples.
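The core idea of the MPD is that a 1D waveform can be reshaped into a 2D array so that a sub-discriminator for period p sees only equally spaced samples of that period in each column. A minimal NumPy sketch of this reshaping (not the paper's implementation, which applies 2D convolutions over the reshaped tensor; zero-padding is used here for simplicity):

```python
import numpy as np

def reshape_for_period(audio: np.ndarray, period: int) -> np.ndarray:
    """Reshape a 1D waveform of length T into a 2D array of shape
    (ceil(T / period), period). Each column then contains samples spaced
    exactly `period` apart, which is what an MPD sub-discriminator inspects."""
    t = len(audio)
    if t % period != 0:
        # pad the tail so the length is divisible by the period
        audio = np.pad(audio, (0, period - (t % period)))
    return audio.reshape(-1, period)

# HiFi-GAN's MPD uses the periods [2, 3, 5, 7, 11], chosen prime so that
# the sub-discriminators overlap as little as possible.
wave = np.arange(10, dtype=np.float32)
print(reshape_for_period(wave, 5).shape)  # (2, 5)
```

Reshaping rather than striding lets each sub-discriminator process the periodic samples with ordinary 2D convolutions.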
The final objective combines these components to optimize the generator and discriminators. Experiments show that HiFi-GAN outperforms competing models in both audio quality and synthesis speed. V1, the largest version, has 13.92M parameters and achieves the highest MOS, with a gap of only 0.09 from the ground-truth audio. V2, a smaller version, has 0.92M parameters and achieves a MOS of 4.23 while significantly reducing memory requirements. V3, the smallest version, synthesizes speech 13.44 times faster than real-time on CPU and 1,186 times faster than real-time on a single V100 GPU, with quality comparable to an autoregressive counterpart. HiFi-GAN also demonstrates strong generalization to unseen speakers and to end-to-end speech synthesis, producing speech comparable to human quality from noisy inputs in an end-to-end setting.
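The way the three losses combine for the generator can be sketched as follows. This is an illustrative NumPy version, not the paper's code: the least-squares adversarial term, an L1 feature matching term summed over discriminator feature maps, and an L1 mel-spectrogram term, with the paper's weights λ_fm = 2 and λ_mel = 45.

```python
import numpy as np

def generator_loss(d_fake, feats_real, feats_fake, mel_real, mel_fake,
                   lambda_fm=2.0, lambda_mel=45.0):
    """Combined HiFi-GAN generator objective (sketch).
    d_fake:     discriminator scores on generated audio
    feats_*:    lists of intermediate discriminator feature maps
    mel_*:      mel-spectrograms of real / generated audio
    The weights follow the paper: lambda_fm = 2, lambda_mel = 45."""
    l_adv = np.mean((d_fake - 1.0) ** 2)  # least-squares GAN loss
    l_fm = sum(np.mean(np.abs(r - f)) for r, f in zip(feats_real, feats_fake))
    l_mel = np.mean(np.abs(mel_real - mel_fake))  # L1 mel-spectrogram loss
    return l_adv + lambda_fm * l_fm + lambda_mel * l_mel

# toy check: a perfectly fooled discriminator (scores of 1) with identical
# features and mel-spectrograms yields zero total loss
d = np.ones(4)
f = [np.zeros((2, 2))]
m = np.zeros((8, 8))
print(generator_loss(d, f, [x.copy() for x in f], m, m.copy()))  # 0.0
```

The large mel-spectrogram weight reflects how strongly the spectral reconstruction term anchors training; the adversarial and feature matching terms refine realism on top of it.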