This paper explores the use of deep neural networks (DNNs) in statistical parametric speech synthesis, an approach that typically employs decision tree-clustered context-dependent hidden Markov models (HMMs) to model speech parameter probability densities. The authors highlight the limitations of HMMs, such as their inefficiency in handling complex context dependencies and the fragmentation of training data. They propose an alternative scheme based on DNNs, which can better address these issues. The DNN-based system models the relationship between input texts and their acoustic realizations, using a deep architecture to map linguistic contexts to speech parameters. Experimental results show that DNN-based systems outperform HMM-based systems with similar numbers of parameters, demonstrating the potential of DNNs in improving speech synthesis quality.
The paper also discusses the trade-offs between computational efficiency and performance, suggesting future work on reducing computations and incorporating more input features.
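
To make the mapping from linguistic contexts to speech parameters concrete, the following is a minimal sketch of a feed-forward network that takes one frame's linguistic feature vector and returns a vector of acoustic parameters. It is an illustration only, not the authors' implementation: the layer sizes, the tanh activations, the linear output layer, the random initialisation, and all names (`init_dnn`, `forward`, `N_LINGUISTIC`, `N_ACOUSTIC`) are assumptions chosen for brevity, and training is omitted entirely.

```python
import math
import random


def init_dnn(layer_sizes, seed=0):
    """Create randomly initialised (weights, biases) pairs for each layer.

    layer_sizes is e.g. [n_input, n_hidden, ..., n_output]. The scaled
    Gaussian initialisation is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def affine(weights, biases, x):
    """Compute W x + b for one input vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def forward(layers, x):
    """Map a linguistic feature vector to acoustic parameters.

    Hidden layers use tanh (an assumed choice); the output layer is
    linear so it can produce real-valued speech parameters.
    """
    h = x
    for weights, biases in layers[:-1]:
        h = [math.tanh(a) for a in affine(weights, biases, h)]
    weights, biases = layers[-1]
    return affine(weights, biases, h)


# Hypothetical dimensions, kept small so the example runs instantly.
N_LINGUISTIC = 12   # size of the linguistic context vector for one frame
N_ACOUSTIC = 4      # number of predicted speech parameters per frame

layers = init_dnn([N_LINGUISTIC, 16, 16, N_ACOUSTIC])

# A toy input frame: a single active binary context feature.
frame = [0.0] * N_LINGUISTIC
frame[0] = 1.0

acoustic = forward(layers, frame)
print(len(acoustic))
```

In this sketch every frame passes through the same set of weights, so all training examples would contribute to all parameters. This contrasts with decision-tree clustering, where each leaf is estimated from only the portion of the data assigned to it, which is the fragmentation issue noted above.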