14 Jun 2024 | Puyuan Peng, Po-Yao Huang, Shang-Wen Li, Abdelrahman Mohamed, David Harwath
**VOICECRAFT: Zero-Shot Speech Editing and Text-to-Speech in the Wild**
**Authors:** Puyuan Peng, Po-Yao Huang, Shang-Wen Li, Abdelrahman Mohamed, David Harwath
**Abstract:**
VOICECRAFT is a neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) tasks on audiobooks, internet videos, and podcasts. It employs a Transformer decoder architecture and introduces a token rearrangement procedure that combines causal masking and delayed stacking to enable generation within an existing sequence. On speech editing tasks, VOICECRAFT produces edited speech that is nearly indistinguishable from unedited recordings in terms of naturalness, as evaluated by humans. For zero-shot TTS, the model outperforms prior state-of-the-art models including VALL-E and the popular commercial model XTTS v2. The models are evaluated on challenging and realistic datasets that consist of diverse accents, speaking styles, recording conditions, and background noise and music. A high-quality, realistic, and challenging dataset named REALEDIT is introduced for speech editing evaluation. The dataset consists of 310 real-world speech editing examples sourced from audiobooks, YouTube videos, and Spotify podcasts.
**Introduction:**
VOICECRAFT is a Transformer-based neural codec language model that performs infilling generation of neural speech codec tokens autoregressively conditioned on bidirectional context. It achieves state-of-the-art performance on both speech editing and zero-shot TTS tasks. The method involves a two-step token rearrangement procedure that includes causal masking and delayed stacking. The causal masking technique is inspired by the success of causal masked multimodal models in joint text-image modeling, and the delayed stacking technique ensures efficient multi-codebook modeling. A high-quality, realistic, and challenging dataset named REALEDIT is introduced for speech editing evaluation, consisting of 310 manually crafted speech editing examples.
**Related Work:**
The paper reviews related work in neural codec language models, speech editing, and zero-shot TTS. It discusses the challenges and advancements in these areas, including the use of causal masking and delayed stacking techniques.
**Method:**
The method involves a two-step token rearrangement procedure: causal masking and delayed stacking. Causal masking enables autoregressive continuation/infilling with bidirectional context, while delayed stacking ensures efficient multi-codebook modeling. The model is trained with an autoregressive sequence prediction loss and evaluated on speech editing and zero-shot TTS tasks.
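The two rearrangement steps can be illustrated with a small sketch. It is not the authors' implementation: the function names, the `<M1>` marker string, and the `EMPTY` padding id are hypothetical, and it assumes each timestep is a list of K codebook tokens. Causal masking moves the span to be generated to the end of the sequence and leaves a marker in its place, so that context from both sides precedes the tokens being predicted. Delayed stacking shifts codebook k right by k steps.

```python
from typing import List, Tuple

Frame = List[int]   # one timestep: K codebook tokens
EMPTY = -1          # hypothetical placeholder id used for delay padding


def causal_mask(frames: List[Frame], span: Tuple[int, int], mask_id: int) -> List[object]:
    """Move frames[start:end] to the end, leaving a mask marker in its place."""
    start, end = span
    marker = f"<M{mask_id}>"  # hypothetical marker format
    # Unmasked context first, then the marker again, then the span to generate.
    return frames[:start] + [marker] + frames[end:] + [marker] + frames[start:end]


def delayed_stack(frames: List[Frame], num_codebooks: int) -> List[Frame]:
    """Shift codebook k right by k steps, padding the gaps with EMPTY."""
    length = len(frames) + num_codebooks - 1
    out = [[EMPTY] * num_codebooks for _ in range(length)]
    for t, frame in enumerate(frames):
        for k in range(num_codebooks):
            out[t + k][k] = frame[k]
    return out


frames = [[10, 11], [20, 21], [30, 31], [40, 41]]
rearranged = causal_mask(frames, span=(1, 3), mask_id=1)
delayed = delayed_stack(frames[:2], num_codebooks=2)
```

In this toy example the masked frames at positions 1 and 2 end up after the second marker, and the delayed output has one extra step because the second codebook lags the first by one position.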
**Experiments:**
The paper presents experimental results on speech editing and zero-shot TTS tasks, comparing VOICECRAFT with state-of-the-art models. It includes ablation studies, human preference evaluations, and performance metrics such as word error rate (WER), speaker similarity, and naturalness mean opinion score (MOS).
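For reference, WER is conventionally defined as the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below shows that standard definition only. The ASR system and text normalization used in the paper are not described in this summary, so the function name and the whitespace tokenization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)
```

Lower values are better. Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.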
**Conclusion:**
VOICECRAFT achieves state-of-the-art performance on speech editing and zero-shot TTS tasks on in-the-wild data such as audiobooks, internet videos, and podcasts.