The paper "UNIT SELECTION IN A CONCATENATIVE SPEECH SYNTHESIS SYSTEM USING A LARGE SPEECH DATABASE" by Andrew J. Hunt and Alan W. Black discusses an approach to generating natural-sounding synthesized speech by selecting and concatenating units from a large speech database. The authors propose treating the synthesis database as a state transition network, where the state occupancy cost is the distance between a database unit and a target, and the transition cost is an estimate of the quality of concatenation between two consecutive units. This framework is similar to HMM-based speech recognition.
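The state transition view above can be illustrated with a small dynamic-programming search. This is a minimal sketch, not the CHATR implementation: the function name `select_units`, the toy `(pitch, database_index)` unit representation, and both cost functions are assumptions made for illustration. It shows how the cheapest path through the network balances state occupancy (target) cost against transition (concatenation) cost.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")  # target specification (hypothetical type)
U = TypeVar("U")  # database unit (hypothetical type)


def select_units(
    targets: Sequence[T],
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[T, U], float],
    concat_cost: Callable[[U, U], float],
) -> Tuple[float, List[U]]:
    """Viterbi-style search for the unit sequence with the lowest total cost."""
    if not targets or len(targets) != len(candidates):
        raise ValueError("need one non-empty candidate list per target")
    if any(len(c) == 0 for c in candidates):
        raise ValueError("every target needs at least one candidate unit")

    # best[i][k] = (cost of the cheapest path ending at candidates[i][k], backpointer)
    best = [[(target_cost(targets[0], u), -1) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            # Cheapest way to arrive at u from any unit in the previous position.
            arrive, back = min(
                (best[i - 1][k][0] + concat_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((arrive + target_cost(targets[i], u), back))
        best.append(row)

    # Trace back from the cheapest final state.
    k = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    total = best[-1][k][0]
    path: List[U] = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][k])
        k = best[i][k][1]
    path.reverse()
    return total, path


# Toy data (assumed): targets are desired pitch values, units are
# (pitch, database_index) pairs.
targets = [100, 110, 120]
candidates = [
    [(100, 0), (104, 7)],
    [(118, 1), (109, 8)],
    [(121, 2), (120, 20)],
]


def toy_target_cost(t, u):
    # Distance between the unit's pitch and the target pitch.
    return abs(t - u[0])


def toy_concat_cost(p, u):
    # Units that were adjacent in the database join for free.
    return 0.0 if u[1] == p[1] + 1 else 5.0


total, path = select_units(targets, candidates, toy_target_cost, toy_concat_cost)
```

In this toy example the search picks the three units that were contiguous in the database, even though other candidates match the targets more closely, because their joins cost nothing.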
The paper describes the unit selection procedure used in the CHATR speech synthesis system, which extends the ATR ν-Talk principle to consider both the prosodic and the phonetic appropriateness of units. It introduces two novel methods for training the cost functions that control unit selection: weight space search and regression training. Both methods use natural speech to train the weights, and both yield better-quality synthesis than hand-tuned weights.
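The idea behind regression training can be sketched as a linear least-squares fit. The summary does not give the details, so the following rests on an assumption: each training example pairs a vector of sub-costs for a candidate unit with an objective distance between that candidate and the corresponding natural-speech unit, and the weights are the coefficients that best predict the distance. The function name `fit_weights` and the toy data are hypothetical.

```python
from typing import List, Sequence


def fit_weights(sub_costs: Sequence[Sequence[float]], distances: Sequence[float]) -> List[float]:
    """Least-squares weights w minimising sum((d - w . c)^2), with no intercept."""
    if not sub_costs or len(sub_costs) != len(distances):
        raise ValueError("need one distance per sub-cost vector")
    n = len(sub_costs[0])

    # Normal equations: (X^T X) w = X^T d
    a = [[sum(r[i] * r[j] for r in sub_costs) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * d for r, d in zip(sub_costs, distances)) for i in range(n)]

    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("sub-costs are linearly dependent")
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]

    # Back substitution.
    w = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (b[i] - s) / a[i][i]
    return w


# Toy data (assumed): two sub-costs per candidate, e.g. a pitch difference and a
# duration difference, with distances generated as 2.0 * c0 + 0.5 * c1.
toy_sub_costs = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0]]
toy_distances = [2.0, 0.5, 4.5, 3.5]
weights = fit_weights(toy_sub_costs, toy_distances)
```

A closed-form fit of this kind needs a single pass over the training data, whereas a search over weight space has to evaluate many candidate weight settings; that is one way to read the lower computational cost the summary attributes to regression training.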
The authors present results from subjective assessments of synthesized speech produced using these methods, showing consistent but small preferences for weights obtained by regression training. Although the margin is small, regression training is preferred because of its lower computational requirements and greater flexibility. The paper concludes by discussing potential improvements to the regression training method and future research directions.