ED-TTS: MULTI-SCALE EMOTION MODELING USING CROSS-DOMAIN EMOTION DIARIZATION FOR EMOTIONAL SPEECH SYNTHESIS

16 Jan 2024 | Haobin Tang, Xulong Zhang, Ning Cheng, Jing Xiao, Jianzong Wang
ED-TTS is a multi-scale emotional speech synthesis model that leverages speech emotion diarization (SED) and speech emotion recognition (SER) to model emotion at different levels. It integrates utterance-level emotion embeddings from SER with frame-level emotion embeddings from SED to condition the reverse process of a denoising diffusion probabilistic model (DDPM). Additionally, a cross-domain SED model predicts frame-level soft labels, addressing the scarcity of fine-grained emotion-annotated datasets for supervising emotional TTS training.

ED-TTS builds on the design of Grad-TTS, adding a multi-scale style encoder that uses a SER model to extract utterance-level features and a pre-trained SED model to extract frame-level features. The extracted multi-scale style embeddings condition the reverse process of the DDPM, while the frame-level soft emotion labels predicted by the pre-trained cross-domain SED model supervise TTS training. Cross-domain training reduces the distribution shift between the SED and TTS datasets, improving SED performance on TTS data.

Results from both subjective and objective evaluations indicate that ED-TTS outperforms baseline models in audio quality and expressiveness, achieving significant improvements in emotion diarization error rate (EDER) and emotion reclassification accuracy (ERA) over other models.
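The conditioning described above can be sketched as follows. This is a minimal illustration under assumed shapes and names, not the paper's actual code: the single utterance-level embedding is broadcast across frames, combined with the frame-level embeddings, and concatenated to the noisy mel-spectrogram before it enters the denoiser.

```python
import numpy as np

def condition_denoiser_input(noisy_mel, utt_emb, frame_embs):
    """Combine utterance- and frame-level emotion embeddings with the noisy
    mel-spectrogram to form a conditioned denoiser input.

    Assumed shapes: noisy_mel (T, n_mel), utt_emb (d,), frame_embs (T, d).
    """
    T = noisy_mel.shape[0]
    # Broadcast the single utterance-level embedding across all T frames,
    # then add the frame-level embeddings to get a per-frame style vector.
    style = frame_embs + np.tile(utt_emb, (T, 1))
    # Concatenate along the feature axis so each frame carries both its
    # acoustic content and its multi-scale emotion conditioning.
    return np.concatenate([noisy_mel, style], axis=1)

x = condition_denoiser_input(np.zeros((100, 80)), np.ones(64), np.zeros((100, 64)))
print(x.shape)  # (100, 144)
```

In the actual model the combined embedding would feed a learned network at each reverse diffusion step; the sketch only shows how the two granularities of emotion information can be merged per frame.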
The ablation study shows that the use of SED and cross-domain training is crucial for achieving high-quality emotional speech synthesis.