[slides] VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

VOICECRAFT is a neural codec language model that achieves state-of-the-art (SotA) performance in speech editing and zero-shot text-to-speech (TTS). It uses a Transformer decoder architecture with a novel token rearrangement procedure that combines causal masking and delayed stacking, which enables autoregressive generation within an existing sequence. The model is evaluated on challenging, realistic data with diverse accents, speaking styles, and background noise, and it performs consistently well compared to other models and to real recordings.

For speech editing, the model is tested on the REALEDIT dataset, which contains 310 real-world speech editing examples with diverse content and editing scenarios. In human evaluation, speech edited by VOICECRAFT is nearly indistinguishable from unedited recordings in terms of naturalness.

For zero-shot TTS, VOICECRAFT generalizes without any fine-tuning and outperforms prior SotA models, including VALL-E and the popular commercial model XTTS v2. It achieves SotA performance on a dataset comprising audiobooks and YouTube videos, and it is also evaluated on a dataset of 250 prompt-transcript pairs.

Performance is measured with objective metrics such as word error rate (WER) and speaker similarity, and with subjective human listening tests. VOICECRAFT outperforms the compared models in naturalness, intelligibility, and speaker similarity. The code and model weights are open-sourced, and the research community is encouraged to use the model and dataset for further research and development.
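WER, one of the objective metrics named in the summary, is the word-level edit distance between a reference transcript and a hypothesis transcript, divided by the number of reference words. The sketch below is a minimal, generic implementation, assuming the transcript of the generated speech is already available as text (for example from a speech recognizer, which is not shown). The function name `word_error_rate` and the lowercase-and-split normalization are assumptions for illustration, not the evaluation pipeline used for VOICECRAFT.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.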

VOICECRAFT: Zero-Shot Speech Editing and Text-to-Speech in the Wild

14 Jun 2024 | Puyuan Peng, Po-Yao Huang, Shang-Wen Li, Abdelrahman Mohamed, David Harwath