Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition

10 Jul 2024 | Seed Team, ByteDance*
Seed-ASR is a large language model (LLM)-based speech recognition model designed to transcribe diverse speech signals from various domains, languages, accents, and dialects. It leverages the capabilities of LLMs by inputting continuous speech representations and contextual information into the LLM. The model is developed on the audio conditioned LLM (AcLLM) framework and undergoes a stage-wise large-scale training process, including self-supervised learning (SSL), supervised fine-tuning (SFT), context SFT, and reinforcement learning (RL). Seed-ASR demonstrates significant improvements over end-to-end models on comprehensive evaluation sets covering multiple domains, accents/dialects, and languages. It achieves a 10%-40% reduction in word (or character, for Chinese) error rates on Chinese and English public test sets, showcasing its powerful performance. The model supports multiple languages, including Mandarin, 13 Chinese dialects, and English, and can be further deployed to support specific needs in various scenarios without requiring extra language models.
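The headline 10%-40% figures above are relative error-rate reductions. As an illustration only (this is not Seed-ASR's evaluation code), the sketch below shows how word error rate (WER) is computed from edit distance and how a relative reduction is derived from two error rates:

```python
# Illustrative sketch: WER from edit distance, and relative error-rate
# reduction as reported in ASR papers. Not the paper's actual tooling.

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words
    # (substitutions, insertions, deletions all cost 1).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error-rate reduction, e.g. 10.0% -> 7.0% WER is a 30% reduction."""
    return (baseline - improved) / baseline

# One substitution against a 6-word reference -> WER of 1/6.
baseline = wer("the cat sat on the mat", "the cat sat on a mat")
```

For character error rate (CER), used for Chinese in the paper, the same edit distance is computed over characters instead of words.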