STATISTICAL PARAMETRIC SPEECH SYNTHESIS USING DEEP NEURAL NETWORKS

Heiga Zen, Andrew Senior, Mike Schuster
This paper presents a deep neural network (DNN)-based approach to statistical parametric speech synthesis that improves on conventional hidden Markov model (HMM)-based methods. Conventional HMM-based systems use decision trees to cluster linguistic contexts, but trees fragment the training data and are inefficient at expressing complex context dependencies. The DNN-based approach instead models the mapping from linguistic contexts to acoustic realizations directly, and experiments show that it outperforms HMM-based systems with similar numbers of parameters.

In the proposed system, a DNN maps input features derived from the text (binary and numerical features describing linguistic and phonetic contexts) to output acoustic features (spectral and excitation parameters). Speech parameter trajectories are generated from these outputs and then converted into a waveform. Because every weight in the network is trained on all of the data, the DNN generalizes better than a decision tree, whose leaves each see only a fraction of the training examples.

The deep architecture, with multiple hidden layers, can represent complex functions more compactly than traditional shallow architectures. Training uses back-propagation with a GPU-based stochastic gradient descent algorithm.

The paper compares DNN-based and HMM-based systems in both objective and subjective evaluations. Objective metrics include mel-cepstral distortion, aperiodicity distortion, and log F0 error; the DNN-based systems predict spectral and excitation parameters more accurately. In subjective listening tests, the DNN-based systems are preferred for producing clearer, more natural speech.
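The feed-forward mapping from linguistic input features to acoustic output features described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's trained model: the layer sizes, sigmoid hidden activations, linear output layer, and random weights are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 342 linguistic input features (binary context
# answers plus numerical features such as positions and durations),
# three hidden layers, 127 acoustic output features (spectral and
# excitation parameters). Real systems learn these weights by
# back-propagation; random values stand in for trained parameters.
sizes = [342, 1024, 1024, 1024, 127]
weights = [rng.normal(0.0, 0.05, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

def forward(x):
    """Map one frame's linguistic feature vector to acoustic features."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))  # sigmoid hidden units
    return h @ weights[-1] + biases[-1]          # linear output layer

# One synthetic input frame: 300 binary context features + 42 numerical.
x = np.concatenate([rng.integers(0, 2, 300).astype(float),
                    rng.normal(size=42)])
y = forward(x)  # predicted acoustic feature vector for this frame
```

At synthesis time such a network is evaluated once per frame, and the predicted acoustic features are passed to a parameter generation step and a vocoder to produce the waveform.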
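Mel-cepstral distortion, one of the objective metrics mentioned above, is conventionally computed per frame from the difference between reference and synthesized mel-cepstral coefficients. The sketch below follows the standard formula, (10/ln 10)·sqrt(2·Σ_d (c_d − ĉ_d)²), averaged over frames; the array shapes and the exclusion of the 0th (energy) coefficient are assumptions about how the features are laid out.

```python
import numpy as np

def mel_cepstral_distortion(c_ref, c_syn):
    """Mean mel-cepstral distortion in dB between two utterances.

    c_ref, c_syn: arrays of shape (frames, order) holding time-aligned
    mel-cepstral coefficients, with the 0th (energy) coefficient
    assumed already excluded.
    """
    diff = c_ref - c_syn
    # Per-frame distortion: (10 / ln 10) * sqrt(2 * sum_d diff_d^2)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```

Lower values indicate that the synthesized spectral envelope is closer to the natural one; identical coefficient sequences give a distortion of zero.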
The paper concludes that DNN-based systems have significant potential for statistical parametric speech synthesis, offering improved performance and better handling of complex context dependencies. Future work includes reducing computational costs and exploring better log F0 modeling schemes.