Understanding Unit Selection in a Concatenative Speech Synthesis System Using a Large Speech Database

This paper presents a method for unit selection in a concatenative speech synthesis system that draws on a large speech database. Units (phonemes) are selected from the database and concatenated to produce a natural realization of a target phoneme sequence, which is derived from text and annotated with prosodic and phonetic context information. The synthesis database is treated as a state transition network: the state occupancy cost (the target cost) is the distance between a database unit and the target, and the state transition cost (the concatenation cost) is an estimate of the quality of the join between two consecutive units. This framework is similar to HMM-based speech recognition, and a pruned Viterbi search is used to select the best units for synthesis from the database.

This approach allows the cost functions that control unit selection to be trained from natural speech, and the paper presents two training methods. The first, weight space search, performs a limited search of the weight space. The second, regression training, uses linear regression to determine the weights for the concatenation and target costs. The results show that regression training produces better quality synthesis than hand-tuned weights. The paper also discusses the advantages of regression training, including its lower computational requirements and greater flexibility. It concludes that this view of the synthesis database allows automated training from natural speech and provides better quality synthesis than hand-tuned weights.
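The search over the state transition network can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `select_units`, the `beam` parameter, and the choice to prune each candidate list by target cost are assumptions made for illustration, as are the toy cost functions (absolute pitch difference as target cost, and a join cost that is zero only when two units were adjacent in the database).

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")  # target specification (hypothetical type)
U = TypeVar("U")  # database unit (hypothetical type)


def select_units(
    targets: Sequence[T],
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[T, U], float],
    concat_cost: Callable[[U, U], float],
    beam: int = 10,
) -> Tuple[List[U], float]:
    """Return the lowest-cost unit sequence and its total cost."""
    if not targets or len(targets) != len(candidates):
        raise ValueError("need exactly one candidate list per target")

    # Pruning (an assumed strategy): keep the `beam` candidates with the
    # lowest target cost for each target position.
    pruned = []
    for target, cands in zip(targets, candidates):
        if not cands:
            raise ValueError("every target needs at least one candidate")
        scored = sorted(((target_cost(target, u), u) for u in cands),
                        key=lambda pair: pair[0])
        pruned.append(scored[:beam])

    # best[i][j]: lowest accumulated cost of any path ending in pruned[i][j].
    # back[i][j]: index of the predecessor on that path.
    best = [[tc for tc, _ in pruned[0]]]
    back = [[-1] * len(pruned[0])]
    for i in range(1, len(pruned)):
        row_cost, row_back = [], []
        for tc, unit in pruned[i]:
            k, cost = min(
                ((k, best[i - 1][k] + concat_cost(prev, unit))
                 for k, (_, prev) in enumerate(pruned[i - 1])),
                key=lambda pair: pair[1],
            )
            row_cost.append(cost + tc)  # transition cost + occupancy cost
            row_back.append(k)
        best.append(row_cost)
        back.append(row_back)

    # Trace the cheapest path back from the final position.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path: List[U] = []
    for i in range(len(pruned) - 1, -1, -1):
        path.append(pruned[i][j][1])
        j = back[i][j]
    path.reverse()
    return path, total


# Toy units are (pitch, database_index) pairs; both cost functions are
# illustrative assumptions, not the costs used in the paper.
def toy_target_cost(target_pitch: float, unit: Tuple[float, int]) -> float:
    return abs(target_pitch - unit[0])


def toy_concat_cost(a: Tuple[float, int], b: Tuple[float, int]) -> float:
    return 0.0 if b[1] == a[1] + 1 else 5.0


if __name__ == "__main__":
    targets = [100.0, 110.0, 120.0]
    candidates = [
        [(100.0, 0), (98.0, 10)],
        [(115.0, 1), (110.0, 20)],
        [(121.0, 2), (120.0, 30)],
    ]
    print(select_units(targets, candidates, toy_target_cost, toy_concat_cost))
```

In the toy data, the units at database indices 0, 1 and 2 match the targets slightly worse than the alternatives but join for free, so the search prefers them. With a beam of 1 those units are pruned away before join quality is considered, which shows the trade-off any pruning strategy has to manage.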
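Regression training, as summarized above, fits cost weights by linear regression. The summary does not say what is regressed against what, so the following Python sketch rests on an assumption: each training example pairs a vector of feature sub-costs with a measured distance between a candidate unit and a natural-speech target, and the weights are those that best predict the distance in the least-squares sense. The function name `fit_weights` and the data are hypothetical, and the solver is a plain normal-equations implementation chosen to keep the example dependency-free.

```python
from typing import List, Sequence


def fit_weights(
    sub_costs: Sequence[Sequence[float]],
    distances: Sequence[float],
) -> List[float]:
    """Least-squares weights w minimising
    sum_n (sum_j w[j] * sub_costs[n][j] - distances[n]) ** 2."""
    if not sub_costs or len(sub_costs) != len(distances):
        raise ValueError("need one distance per sub-cost vector")
    n_feat = len(sub_costs[0])

    # Normal equations (X^T X) w = X^T y, stored as an augmented matrix.
    a = [
        [sum(row[i] * row[j] for row in sub_costs) for j in range(n_feat)]
        + [sum(row[i] * y for row, y in zip(sub_costs, distances))]
        for i in range(n_feat)
    ]

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n_feat):
        pivot = max(range(col, n_feat), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("sub-costs are linearly dependent")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n_feat):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]

    return [a[i][n_feat] / a[i][i] for i in range(n_feat)]


if __name__ == "__main__":
    # Hypothetical data: two sub-costs per example, distances generated
    # from weights (2.0, 0.5) with no noise.
    sub_costs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
    distances = [2.0, 0.5, 2.5, 4.5]
    print(fit_weights(sub_costs, distances))
```

A closed-form fit like this needs only one pass over the training pairs, in contrast to repeatedly synthesizing and scoring utterances for each trial weight setting, which is one plausible reading of the lower computational requirements mentioned above.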

UNIT SELECTION IN A CONCATENATIVE SPEECH SYNTHESIS SYSTEM USING A LARGE SPEECH DATABASE

1996 | Andrew J. Hunt and Alan W. Black