16 Jan 2024 | Haobin Tang1,2†, Xulong Zhang1†, Ning Cheng1*, Jing Xiao1, Jianzong Wang1
The paper introduces ED-TTS, a multi-scale emotional speech synthesis model that leverages Speech Emotion Diarization (SED) and Speech Emotion Recognition (SER) to model emotions at different levels. ED-TTS integrates utterance-level emotion embeddings from SER with fine-grained frame-level emotion embeddings from SED, conditioning the reverse process of the denoising diffusion probabilistic model (DDPM). The model uses cross-domain SED to predict soft labels, addressing the challenge of limited fine-grained emotion-annotated datasets. Key contributions include:
1. **Multi-scale Style Encoder**: ED-TTS employs a multi-scale style encoder to capture and transfer diverse style attributes, including speaker and utterance-level emotional characteristics, and nuanced frame-level prosodic representations.
2. **Frame-level Soft Label Supervision**: Frame-level soft emotion labels predicted by SED are used to supervise TTS model training.
3. **Cross-domain Training**: Cross-domain training is applied to improve the performance of SED on TTS datasets by reducing distribution shift.
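The first two contributions can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it shows (a) how an utterance-level SER embedding and frame-level SED embeddings might jointly condition the input of a DDPM denoiser, and (b) a soft-label cross-entropy loss between SED-predicted frame-level soft labels and the model's frame-level emotion predictions. All function and variable names are hypothetical.

```python
import numpy as np

def condition_denoiser_input(noisy_mel, utt_emb, frame_emb):
    """Multi-scale conditioning (illustrative sketch).

    Broadcasts the utterance-level SER embedding across all frames,
    adds the frame-level SED embeddings, and concatenates the result
    with the noisy mel-spectrogram to form the denoiser input.
    """
    num_frames = noisy_mel.shape[0]
    utt = np.broadcast_to(utt_emb, (num_frames, utt_emb.shape[-1]))
    cond = utt + frame_emb  # (T, D) combined emotion condition
    return np.concatenate([noisy_mel, cond], axis=-1)

def soft_label_loss(pred_logits, soft_labels):
    """Frame-level soft-label supervision (illustrative sketch).

    Cross-entropy between SED-predicted soft emotion labels and the
    TTS model's per-frame emotion logits, averaged over frames.
    """
    log_probs = pred_logits - np.log(
        np.sum(np.exp(pred_logits), axis=-1, keepdims=True)
    )
    return -np.mean(np.sum(soft_labels * log_probs, axis=-1))
```

Because the soft labels are distributions rather than one-hot classes, this loss lets the TTS model learn from frames whose emotion is ambiguous, which is the point of using SED predictions instead of scarce hard frame-level annotations.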
The paper evaluates ED-TTS using subjective and objective metrics, showing superior audio quality and expressiveness compared to baseline models. Ablation studies confirm the importance of multi-scale emotion modeling and cross-domain training techniques.