HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis


23 Oct 2020 | Jungil Kong, Jaehyeon Kim, Jaekyoung Bae
**Authors:** Jungil Kong, Jaehyeon Kim, Jaekyoung Bae

**Abstract:** Recent advances in speech synthesis have used generative adversarial networks (GANs) to produce raw waveforms, improving sampling efficiency and memory usage, but their sample quality has not yet matched that of autoregressive and flow-based models. This paper introduces HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. The key insight is that modeling the periodic patterns of an audio signal is crucial for enhancing sample quality. In subjective human evaluations on a single-speaker dataset, HiFi-GAN generates 22.05 kHz high-fidelity audio 167.9 times faster than real time on a single V100 GPU, with a MOS comparable to human quality. The model also generalizes to unseen speakers and to end-to-end speech synthesis, and a small-footprint version generates samples 13.4 times faster than real time on CPU with quality comparable to autoregressive models.

**Introduction:** Speech synthesis has grown in importance with the rise of AI voice assistants and smart devices. Most neural speech synthesis pipelines have two stages: predicting a mel-spectrogram from text, then synthesizing the raw waveform from it. HiFi-GAN addresses the second stage, aiming to synthesize high-fidelity waveforms from mel-spectrograms efficiently. Prior work improved synthesis speed and quality with WaveNet, flow-based models, and GANs, but GAN-based vocoders still lagged behind autoregressive and flow-based models in sample quality.

**HiFi-GAN:** HiFi-GAN consists of one generator and two discriminators: a multi-period discriminator and a multi-scale discriminator. The generator is a fully convolutional network that upsamples a mel-spectrogram until its temporal resolution matches that of the raw waveform. Its multi-receptive field fusion (MRF) module observes patterns of various lengths in parallel, improving both synthesis efficiency and quality. The multi-period discriminator (MPD) captures the diverse periodic patterns in the audio, while the multi-scale discriminator (MSD) captures consecutive patterns and long-term dependencies. A sketch of the MPD's core reshaping idea follows.
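To make the MPD concrete, here is a minimal PyTorch sketch of one of its sub-discriminators. The reshaping of the 1D waveform into a 2D grid of width p follows the paper's description; the channel sizes, strides, and activation slope are illustrative assumptions, not the published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PeriodDiscriminator(nn.Module):
    """Sketch of one MPD sub-discriminator for a fixed period p.

    The 1D waveform is padded to a multiple of p and reshaped into a
    2D "image" of shape (T/p, p), so that strided 2D convolutions with
    kernel width 1 only mix samples that are exactly p steps apart.
    Channel sizes and strides here are illustrative assumptions.
    """

    def __init__(self, period: int):
        super().__init__()
        self.period = period
        self.convs = nn.ModuleList([
            nn.Conv2d(1, 32, (5, 1), stride=(3, 1), padding=(2, 0)),
            nn.Conv2d(32, 128, (5, 1), stride=(3, 1), padding=(2, 0)),
            nn.Conv2d(128, 512, (5, 1), stride=(3, 1), padding=(2, 0)),
        ])
        self.out = nn.Conv2d(512, 1, (3, 1), padding=(1, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, T) raw waveform
        b, c, t = x.shape
        if t % self.period != 0:  # pad so T is a multiple of p
            pad = self.period - (t % self.period)
            x = F.pad(x, (0, pad), mode="reflect")
            t = t + pad
        x = x.view(b, c, t // self.period, self.period)  # (B, 1, T/p, p)
        for conv in self.convs:
            x = F.leaky_relu(conv(x), 0.1)
        return self.out(x)  # patch-wise real/fake scores
```

The full MPD runs several such sub-discriminators in parallel with different prime periods (the paper uses 2, 3, 5, 7, and 11), so each one attends to a different periodic structure that a plain 1D convolution would blur together.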
**Experiments:** HiFi-GAN was evaluated on the single-speaker LJSpeech dataset and the multi-speaker VCTK dataset, where it outperformed competing vocoders in both MOS and synthesis speed. An ablation study confirmed the contribution of each component: the MPD, the MRF module, and the mel-spectrogram loss (a sketch of the latter appears after the Conclusion). HiFi-GAN also generalized well to unseen speakers and performed well in end-to-end speech synthesis, showing that it adapts to different settings.

**Conclusion:** HiFi-GAN achieves high-quality speech synthesis with significant gains in efficiency, generalizes to unseen speakers, and works well in end-to-end settings. The model is released as open source, providing a foundation for future research in speech synthesis.
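As a closing illustration of the mel-spectrogram loss highlighted in the ablation, here is a minimal sketch using PyTorch and torchaudio. The loss itself is the L1 distance between the mel-spectrograms of real and generated waveforms, as in the paper; the STFT/mel parameters below are illustrative assumptions for 22.05 kHz audio, not necessarily the paper's exact settings.

```python
import torch
import torch.nn.functional as F
import torchaudio

# Illustrative analysis settings for 22.05 kHz audio (assumed, not the
# paper's exact configuration).
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80)

def mel_spectrogram_loss(real_wav: torch.Tensor,
                         fake_wav: torch.Tensor,
                         weight: float = 45.0) -> torch.Tensor:
    """L1 distance between mel-spectrograms of real and generated audio.

    `weight` scales the reconstruction term relative to the adversarial
    losses; the HiFi-GAN paper reports using 45 for this coefficient.
    """
    # Log-compress with a small floor for numerical stability.
    real_mel = torch.log(mel(real_wav).clamp(min=1e-5))
    fake_mel = torch.log(mel(fake_wav).clamp(min=1e-5))
    return weight * F.l1_loss(fake_mel, real_mel)
```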