[slides] SpiRit-LM: Interleaved Spoken and Written Language Model

SPIRIT-LM is a multimodal language model that integrates spoken and written language. It is based on a pretrained text language model that is extended to the speech modality by continued training on text and speech units. Speech and text sequences are concatenated into a single token sequence and trained with a word-level interleaving method, using a small, automatically curated speech-text parallel corpus. The training data mixes text-only, speech-only, and interleaved speech-text sequences. SPIRIT-LM comes in two versions: BASE and EXPRESSIVE. BASE uses speech semantic units, while EXPRESSIVE adds pitch and style units to the semantic units in order to model expressivity. Both versions encode text with subword BPE tokens. The model combines the semantic abilities of text models with the expressive abilities of speech models, and it can learn new tasks in a few-shot fashion across modalities, including ASR, TTS, and speech classification. SPIRIT-LM is evaluated on comprehension tasks in both speech and text, with few-shot prompting extended to speech-text tasks. It is also evaluated on a new benchmark, the SPEECH-TEXT SENTIMENT PRESERVATION benchmark, which measures how well generative models preserve the sentiment of the prompt within and across modalities; SPIRIT-LM is shown to preserve sentiment in both speech and text. Finally, the model is evaluated for responsible AI, including toxicity detection in speech and text.
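
The word-level interleaving described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `[TEXT]` and `[SPEECH]` markers, the `[Hu…]` unit naming, the `AlignedWord` type, the `switch_prob` parameter, and the use of whole words in place of subword BPE tokens are all assumptions made for illustration. The idea shown is only that the modality may change at word boundaries, so one training sequence mixes text tokens and speech units.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

TEXT_MARKER = "[TEXT]"      # assumed marker announcing a text span
SPEECH_MARKER = "[SPEECH]"  # assumed marker announcing a speech span


@dataclass
class AlignedWord:
    """One word of a speech-text aligned utterance (hypothetical type)."""
    text: str                # written form (stand-in for its BPE tokens)
    speech_units: List[int]  # hypothetical discrete speech unit IDs


def interleave(words: List[AlignedWord],
               switch_prob: float = 0.3,
               start: Optional[str] = None,
               seed: Optional[int] = None) -> List[str]:
    """Build one token sequence, switching modality only at word boundaries."""
    rng = random.Random(seed)
    modality = start if start is not None else rng.choice(["text", "speech"])
    tokens: List[str] = []
    previous = None  # modality of the span currently being written
    for word in words:
        # After the first word, flip modality with probability switch_prob.
        if previous is not None and rng.random() < switch_prob:
            modality = "speech" if modality == "text" else "text"
        # Emit a modality marker whenever a new span begins.
        if modality != previous:
            tokens.append(TEXT_MARKER if modality == "text" else SPEECH_MARKER)
            previous = modality
        if modality == "text":
            tokens.append(word.text)
        else:
            tokens.extend(f"[Hu{unit}]" for unit in word.speech_units)
    return tokens


words = [
    AlignedWord("the", [34, 301]),
    AlignedWord("cat", [7]),
    AlignedWord("sat", [200, 12]),
]
# Forcing a switch at every word boundary makes the result deterministic.
print(interleave(words, switch_prob=1.0, start="text"))
# → ['[TEXT]', 'the', '[SPEECH]', '[Hu7]', '[TEXT]', 'sat']
```

Setting `switch_prob=0.0` yields a text-only or speech-only sequence, which corresponds to the other two kinds of training data mentioned in the summary.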

SPIRIT-LM: Interleaved Spoken and Written Language Model

8 Feb 2024 | Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-jussa, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Gabriel Synnaeve, Juan Pino, Benoît Sagot, Emmanuel Dupoux