This paper presents a method for generating high-fidelity text-to-speech (TTS) guided by natural language. The authors propose a scalable approach for labeling many aspects of speaker identity, style, and recording conditions, apply it to a 45k-hour dataset, and use the result to train a speech language model that offers intuitive natural language control over a wide range of speaking styles and recording conditions. Trained on this synthetically annotated corpus together with a small amount of high-fidelity audio, the model achieves high audio fidelity with as little as 1% high-fidelity audio in the training mix.
The authors address the challenge of controlling speaker identity and style in TTS, which traditionally requires conditioning on reference speech recordings. Instead, they guide generation with natural language descriptions, which allows more intuitive and flexible control. They also demonstrate that, by building on a recent state-of-the-art neural audio codec, the model can generate high-fidelity audio even with limited high-fidelity training data.
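To make the conditioning idea concrete, here is a minimal sketch (not the authors' code) of description-conditioned generation: the description is embedded once, and an autoregressive decoder over audio codec tokens attends to it through cross-attention. All names and sizes (`DescriptionConditionedTTS`, the vocabulary sizes, the layer counts) are illustrative assumptions, and the single token stream is a simplification of real multi-codebook codecs.

```python
# Minimal sketch of description-conditioned codec-token generation.
# Assumption: a single codec token stream; real neural codecs emit
# several parallel codebooks that need an interleaving scheme.
import torch
import torch.nn as nn

class DescriptionConditionedTTS(nn.Module):
    def __init__(self, codec_vocab=1024, text_vocab=32000, d_model=256):
        super().__init__()
        # Stand-in for a pretrained text encoder (e.g. a frozen T5).
        self.text_emb = nn.Embedding(text_vocab, d_model)
        self.token_emb = nn.Embedding(codec_vocab, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, codec_vocab)

    def forward(self, codec_tokens, description_ids):
        memory = self.text_emb(description_ids)         # (B, T_text, d)
        x = self.token_emb(codec_tokens)                # (B, T_audio, d)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.decoder(x, memory, tgt_mask=causal)    # cross-attends to the description
        return self.head(h)                             # next-token logits

model = DescriptionConditionedTTS()
desc = torch.randint(0, 32000, (1, 12))   # tokenized "A woman speaks slowly in a quiet room..."
audio = torch.randint(0, 1024, (1, 50))   # codec-token prefix
print(model(audio, desc).shape)           # torch.Size([1, 50, 1024])
```

At inference time the description stays fixed while codec tokens are sampled one step at a time and finally decoded back to a waveform by the codec.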
The authors collect metadata for several aspects of speech, including accent, recording quality, pitch, and speaking rate. They label these aspects with automatically computed statistical measures mapped to descriptive keywords, and then use a language model to compose those keywords into natural language sentences that condition the TTS model; a sketch of this binning step follows below. The model is trained on a large speech corpus comprising the English portion of Multilingual LibriSpeech and LibriTTS-R, a version of LibriTTS restored for higher audio quality.
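As an illustration of the labeling step, the sketch below computes two of the statistics mentioned (pitch and speaking rate), bins them into keywords, and fills a template sentence. The bin edges, labels, and template are hypothetical stand-ins; per the summary, the paper uses a language model rather than a fixed template to turn keywords into sentences.

```python
# Hedged sketch: per-utterance statistics -> keywords -> description.
import librosa
import numpy as np

def bin_value(value, edges, labels):
    # Map a scalar to a keyword via increasing bin edges.
    return labels[int(np.searchsorted(edges, value))]

def annotate(wav_path, transcript):
    y, sr = librosa.load(wav_path, sr=16000)
    f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    mean_f0 = float(np.nanmean(f0))                 # pitch statistic (Hz)
    rate = len(transcript.split()) / (len(y) / sr)  # words per second

    pitch = bin_value(mean_f0, [140, 200],
                      ["low-pitched", "moderately pitched", "high-pitched"])
    pace = bin_value(rate, [2.0, 3.5],
                     ["slowly", "at a moderate pace", "quickly"])
    return f"A {pitch} voice speaking {pace}."

# annotate("utt.wav", "the quick brown fox jumps over the lazy dog")
# -> e.g. "A low-pitched voice speaking at a moderate pace."
```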
The authors evaluate their model with both objective and subjective measures. Objective evaluations show that the model produces speech with high audio fidelity and naturalness, outperforming baseline models on speech quality and intelligibility. Subjective evaluations show that the generated speech closely matches the provided descriptions, with listeners rating its match to the description above both Audiobox and the ground-truth audio.
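The summary does not spell out the objective metrics, but a common intelligibility proxy, consistent with the description here, is to transcribe the generated audio with an off-the-shelf ASR model and score word error rate against the input text. The model choice (`openai/whisper-small`) and file names below are assumptions, not the paper's setup.

```python
# Hedged sketch of an ASR-based intelligibility check (WER).
from transformers import pipeline
import jiwer

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

def intelligibility_wer(wav_path, reference_text):
    hypothesis = asr(wav_path)["text"]
    # Lower WER means the generated speech is easier to transcribe correctly.
    return jiwer.wer(reference_text.lower(), hypothesis.lower())

# intelligibility_wer("generated.wav", "the quick brown fox jumps over the lazy dog")
```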
The authors conclude that their method is a simple but highly effective approach to high-fidelity TTS that can be intuitively guided by natural language descriptions. They note that, to their knowledge, this is the first method to control such a wide range of speech and channel-condition attributes while maintaining high audio fidelity and overall naturalness, and they suggest extending it to more languages, speaking styles, levels of vocal effort, and channel conditions in future work.