8 Feb 2024 | Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-jussa, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Gabriel Synnaeve, Juan Pino, Benoit Sagot, Emmanuel Dupoux
SPIRIT-LM is a multimodal foundation language model that integrates text and speech. It starts from a pre-trained text language model and extends it to speech by continuing to train it on both text and speech units. Speech and text sequences are concatenated into a single stream of tokens, and the model is trained with a word-level interleaving method on a small curated speech-text parallel corpus. SPIRIT-LM comes in two versions: BASE, which uses speech semantic units, and EXPRESSIVE, which adds pitch and style units to the semantic units. The model combines the semantic abilities of text models with the expressive capabilities of speech models. It can learn new tasks in a few-shot fashion across modalities, such as ASR, TTS, and speech classification.
The paper introduces the SPEECH-TEXT SENTIMENT PRESERVATION benchmark (STSP) to evaluate how well generative models preserve sentiment within and across modalities. SPIRIT-LM-EXPRESSIVE is shown to be the first language model capable of preserving the sentiment of both text and speech prompts. The paper also discusses responsible AI aspects, including toxicity detection and mitigation strategies.
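
The word-level interleaving described above can be illustrated with a minimal sketch. It is not the authors' implementation: the modality markers `[TEXT]` and `[SPEECH]`, the unit names such as `u12`, the `switch_prob` parameter, and the `interleave` helper are all assumptions made for illustration. The sketch assumes each word of a parallel utterance is already aligned with its speech units, and it switches modality only at word boundaries, so the output is a single token stream mixing text words and speech units.

```python
import random
from typing import List, Optional, Sequence, Tuple

# Hypothetical modality markers; the summary does not name the real special tokens.
TEXT_TAG = "[TEXT]"
SPEECH_TAG = "[SPEECH]"

# One aligned word: (text form, speech units covering that word).
AlignedWord = Tuple[str, List[str]]


def interleave(
    words: Sequence[AlignedWord],
    switch_prob: float,
    rng: random.Random,
) -> List[str]:
    """Build one token stream, changing modality only at word boundaries."""
    stream: List[str] = []
    modality: Optional[str] = None
    for text_word, speech_units in words:
        if modality is None:
            # Pick the starting modality at random.
            new_modality = rng.choice(["text", "speech"])
        elif rng.random() < switch_prob:
            # Flip modality at this word boundary.
            new_modality = "speech" if modality == "text" else "text"
        else:
            new_modality = modality
        if new_modality != modality:
            # Emit a marker whenever the modality changes (including the start).
            stream.append(TEXT_TAG if new_modality == "text" else SPEECH_TAG)
            modality = new_modality
        if modality == "text":
            stream.append(text_word)
        else:
            stream.extend(speech_units)
    return stream


# Toy aligned utterance; the unit IDs are made up for the example.
example = [
    ("the", ["u12", "u7"]),
    ("cat", ["u3", "u3", "u41"]),
    ("sat", ["u9"]),
]
print(interleave(example, switch_prob=0.5, rng=random.Random(0)))
```

Setting `switch_prob` to 0 yields a single-modality sequence, while higher values produce more frequent switches between text and speech within one utterance.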