FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
11 Jul 2024 | Tongyi SpeechTeam, Alibaba Group
FunAudioLLM is a framework designed to enable natural voice interactions between humans and large language models (LLMs). It consists of two core models: SenseVoice and CosyVoice. SenseVoice handles multilingual speech recognition, emotion recognition, and audio event detection, while CosyVoice facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity. SenseVoice-Small supports 5 languages with low latency, and SenseVoice-Large supports over 50 languages with high precision. CosyVoice excels in multilingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction following. These models have been open-sourced on ModelScope and HuggingFace, along with the corresponding training, inference, and fine-tuning code. FunAudioLLM enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, pushing the boundaries of voice interaction technology. Demos and code are available at <https://fun-audio-llm.github.io> and <https://github.com/FunAudioLLM>, respectively.