This paper presents a method for generating high-fidelity text-to-speech (TTS) audio using natural language guidance. The authors address the limitations of existing TTS models, which often require reference speech recordings for speaker identity and style control, limiting their practical applications. They propose a scalable approach to label various aspects of speaker identity, style, and recording conditions, using a 45k-hour dataset. This method is applied to train a speech language model, enabling the generation of diverse accents, prosodic styles, and acoustic conditions with intuitive natural language conditioning.
Key contributions include:
1. Efficiently labeling the dataset with attributes such as gender, accent, speaking rate, pitch, and recording conditions (illustrated in the sketch after this list).
2. Training a speech language model to control these attributes independently, creating new speaker identities and styles.
3. Demonstrating that high-fidelity audio can be generated using only a small amount of high-quality training data together with advanced audio codec models.
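To make the first two contributions concrete, the sketch below illustrates how per-utterance attributes such as pitch and speaking rate could be derived automatically and rendered as a natural-language description. This is a minimal illustration, not the authors' actual pipeline: it assumes librosa for pitch estimation and a paired transcript for speaking rate, and the `label_utterance` helper and its bin thresholds are hypothetical.

```python
# Hypothetical sketch of automatic attribute labeling: NOT the authors'
# pipeline, only an illustration of how pitch and speaking-rate labels
# might be derived at scale and turned into a text description.
import librosa
import numpy as np

def label_utterance(wav_path: str, transcript: str) -> dict:
    """Assign coarse pitch and speaking-rate labels to one utterance."""
    audio, sr = librosa.load(wav_path, sr=16000)
    duration_s = len(audio) / sr

    # Speaking rate: words per second, taken from the paired transcript.
    wps = len(transcript.split()) / max(duration_s, 1e-6)

    # Pitch: median F0 over voiced frames via librosa's pYIN estimator.
    f0, voiced, _ = librosa.pyin(
        audio,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C7"),
        sr=sr,
    )
    median_f0 = float(np.nanmedian(f0[voiced])) if voiced.any() else float("nan")

    # Bin continuous measurements into coarse, promptable categories.
    # Thresholds here are illustrative guesses, not values from the paper.
    rate = "slow" if wps < 2.0 else "fast" if wps > 3.5 else "moderate"
    pitch = "low-pitched" if median_f0 < 140 else (
        "high-pitched" if median_f0 > 220 else "mid-pitched"
    )
    return {"speaking_rate": rate, "pitch": pitch}

def to_description(labels: dict) -> str:
    """Render attribute labels as a natural-language conditioning prompt."""
    return (
        f"A speaker talking at a {labels['speaking_rate']} pace "
        f"in a {labels['pitch']} voice."
    )
```

In a setup like the one summarized above, each generated description could then be paired with the utterance's audio tokens during training, so the model learns to honor each attribute independently.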
The authors compare their method to existing approaches through both objective and subjective evaluations, showing superior performance in naturalness and audio fidelity. The results demonstrate that the model generates high-quality speech that closely matches the natural language descriptions, outperforming baseline methods on both kinds of metrics. The paper concludes by highlighting the broad potential of the approach and plans to extend it to more languages and acoustic conditions.