Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition


10 Jul 2024 | Seed Team, ByteDance
Seed-ASR is an LLM-based speech recognition model developed under the audio-conditioned LLM (AcLLM) framework. It leverages the capabilities of large language models (LLMs) by feeding continuous speech representations, together with contextual information, into the LLM. Seed-ASR demonstrates significant improvements over end-to-end models on comprehensive evaluation sets covering multiple domains, accents, and languages, and achieves a 10%-40% reduction in word error rate on Chinese and English public test sets compared to recently released large ASR models.

Seed-ASR supports Mandarin and 13 Chinese dialects, and is being extended to more than 40 languages. It is also context-aware: it uses historical dialogues, video editing history, and meeting participation details to improve keyword recall on ASR evaluation sets.

Seed-ASR is trained in a stage-wise process: self-supervised learning (SSL) of the audio encoder, supervised fine-tuning (SFT), context SFT, and reinforcement learning (RL). The large-scale SSL stage trains an audio encoder with nearly 2 billion parameters, which is paired with a Mixture-of-Experts (MoE) LLM with tens of billions of parameters.

Performance is evaluated on a variety of datasets, including public and internal multi-domain sets, showing significant improvements over other models. Seed-ASR (CN) achieves state-of-the-art results on Chinese ASR benchmarks, with a significant reduction in word error rate compared to other models. Seed-ASR (ML) performs strongly on multilingual public sets, achieving improvements of over 42% and 40% on English and multilingual multi-domain evaluation sets, respectively, compared to the strongest baselines. The model's ability to handle diverse speech inputs and contexts is further demonstrated on multi-domain, multi-accent, and hard-case evaluation sets.
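The AcLLM conditioning described above can be sketched as follows. The encoder, text embedder, and all shapes here are hypothetical stand-ins (the summary does not specify the real interfaces), but the overall flow matches the description: contextual information is embedded and prepended to the continuous speech representations, and the combined sequence is what the LLM conditions on when decoding the transcript.

```python
# Minimal sketch of the AcLLM input pipeline. All names, shapes, and the
# pooling-based "encoder" are illustrative assumptions, not Seed-ASR's
# actual implementation (the real audio encoder has ~2B parameters).
import numpy as np

EMBED_DIM = 16  # toy embedding dimension

def encode_audio(frames: np.ndarray, stride: int = 4) -> np.ndarray:
    """Stand-in audio encoder: average-pool raw frame features into
    continuous speech representations (one vector per `stride` frames)."""
    n = (len(frames) // stride) * stride
    return frames[:n].reshape(-1, stride, frames.shape[1]).mean(axis=1)

def embed_context(tokens: list[str]) -> np.ndarray:
    """Stand-in text embedder for contextual information
    (e.g. historical dialogue or meeting participant names)."""
    rng = np.random.default_rng(0)
    table = {t: rng.standard_normal(EMBED_DIM) for t in tokens}
    if not tokens:
        return np.empty((0, EMBED_DIM))
    return np.stack([table[t] for t in tokens])

def build_llm_input(audio_frames: np.ndarray, context_tokens: list[str]) -> np.ndarray:
    """AcLLM-style conditioning: context embeddings are prepended to the
    continuous speech representations; the LLM would then decode the
    transcript autoregressively from this combined sequence."""
    speech = encode_audio(audio_frames)
    context = embed_context(context_tokens)
    return np.concatenate([context, speech], axis=0)

frames = np.random.default_rng(1).standard_normal((40, EMBED_DIM))  # 40 raw frames
seq = build_llm_input(frames, ["meeting", "participants"])
print(seq.shape)  # → (12, 16): 2 context vectors + 10 pooled speech vectors
```

Keeping the speech representations continuous (rather than discretizing them into tokens) is what lets the LLM attend directly to acoustic detail while still exploiting its language-modeling strengths.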
Overall, Seed-ASR shows strong capabilities in speech recognition across various scenarios, including different languages, accents, and domains.
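The stage-wise training recipe mentioned above (SSL of the audio encoder, then SFT, context SFT, and RL) can be illustrated with a minimal driver. Only the ordering of the stages comes from the summary; the stage bodies are placeholders, not Seed-ASR's actual training code.

```python
# Hypothetical sketch of the stage-wise training order; each stage is a
# placeholder that records its effect on a toy model state.
def run_training_pipeline(model: dict) -> list[str]:
    def ssl(m):          # self-supervised pretraining of the audio encoder
        m["encoder_pretrained"] = True
    def sft(m):          # supervised fine-tuning on paired speech-text data
        m["sft_done"] = True
    def context_sft(m):  # fine-tuning with contextual inputs (dialogue history, etc.)
        m["context_aware"] = True
    def rl(m):           # reinforcement learning stage
        m["rl_done"] = True

    completed = []
    for name, stage in [("SSL", ssl), ("SFT", sft),
                        ("context SFT", context_sft), ("RL", rl)]:
        stage(model)
        completed.append(name)
    return completed

stages = run_training_pipeline({})
print(stages)  # → ['SSL', 'SFT', 'context SFT', 'RL']
```

The ordering matters: the encoder is pretrained before any supervised data is used, and context-awareness is added only after the base ASR capability is in place.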