Understanding WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications

A vocoder-based speech synthesis system named WORLD was developed to improve sound quality and enable real-time processing. WORLD consists of three analysis algorithms, which estimate the fundamental frequency (F0), the spectral envelope, and the aperiodic parameters, along with a synthesis algorithm that takes these three parameters as input. The system was evaluated against the conventional systems STRAIGHT and TANDEM-STRAIGHT in terms of sound quality and processing speed. In processing speed, measured by the real-time factor (RTF), WORLD was more than ten times faster than these systems, which indicates that it is capable of real-time operation. In sound quality, WORLD was superior for male speech in particular and comparable to STRAIGHT for female speech. Performance was tested on a database of four-mora words that include consonants. The results suggest that WORLD is a high-quality speech synthesis system suitable for real-time applications. Future work includes improving noise robustness and incorporating efficient phase modeling for better sound quality. The system is available in C and MATLAB implementations and is being developed for use in voice conversion and other applications.
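The real-time factor used in the evaluation can be illustrated with a short sketch. It assumes the usual definition, processing time divided by the duration of the processed audio, so that a value below 1 means the system runs faster than real time. The function names (`real_time_factor`, `measure_rtf`) and the stand-in workload are hypothetical and are not part of the WORLD distribution; the paper's own measurement procedure may differ in detail.

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """Return processing time divided by audio duration (assumed RTF definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(process, samples, sample_rate):
    """Time one call of `process` on `samples` and return its RTF.

    `process` is any callable standing in for an analysis or synthesis step;
    it is a placeholder here, not a WORLD function.
    """
    start = time.perf_counter()
    process(samples)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)


if __name__ == "__main__":
    # One second of silence at 16 kHz, processed by a trivial stand-in workload.
    one_second = [0.0] * 16000
    rtf = measure_rtf(lambda x: sum(v * v for v in x), one_second, 16000)
    print("RTF:", rtf, "(real-time capable)" if rtf < 1.0 else "(slower than real time)")
```

Under this definition, a system that needs 0.5 s to process 2 s of speech has an RTF of 0.25. A system that is ten times faster than another has one tenth of its RTF on the same input and hardware.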

WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications

July 2016 | Masanori MORISE, Fumiya YOKOMORI, Kenji OZAWA