16 Jun 2024 | Yifan Peng, Jinchuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee Choi, Jiatong Shi, Xuankai Chang, Jee-weon Jung, Shinji Watanabe
OWSM v3.1 is an improved version of the Open Whisper-style Speech Model (OWSM), built on the E-Branchformer architecture. It delivers better performance and faster inference than its predecessor, OWSM v3. The models range from 100M to 1B parameters and are trained on the same amount of data as OWSM v3. OWSM v3.1 outperforms OWSM v3 on most evaluation benchmarks while achieving up to 25% faster inference. It also demonstrates emergent abilities in zero-shot contextual biasing speech recognition. A smaller model trained on a subset of the data with low license restrictions is also provided. The code, pre-trained models, and training logs will be publicly released to promote transparency and open science.

The paper evaluates OWSM v3.1 on a range of speech processing tasks, including speech recognition, speech translation, language identification, and spoken language understanding. OWSM v3.1 performs well across these tasks, particularly in multilingual and long-form speech recognition, and it offers faster inference than Whisper at each model scale. The study highlights the importance of training data quantity and quality for speech foundation models. Future work includes exploring the impact of data diversity on model performance and adding more public data to further improve results.
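Since the pre-trained models are released through ESPnet, a brief inference sketch may help illustrate how such a checkpoint would be used. The snippet below is a minimal, hedged example: it assumes ESPnet's `Speech2Text` interface in `espnet2.bin.s2t_inference`, a Hugging Face model tag of the form `espnet/owsm_v3.1_ebf`, and language/task token values like `<eng>` and `<asr>`; all of these should be checked against the official release notes rather than taken as the paper's prescribed usage.

```python
# Minimal inference sketch for an OWSM v3.1 checkpoint.
# Assumptions: ESPnet's espnet2.bin.s2t_inference.Speech2Text interface,
# the model tag "espnet/owsm_v3.1_ebf", and the token symbols below.
import soundfile as sf
from espnet2.bin.s2t_inference import Speech2Text

# Load a pre-trained model (the exact tag and keyword names are assumptions).
s2t = Speech2Text.from_pretrained(
    "espnet/owsm_v3.1_ebf",
    device="cpu",
    beam_size=5,
    lang_sym="<eng>",   # language token, e.g. English
    task_sym="<asr>",   # task token: speech recognition (translation uses a different token)
)

# Read a 16 kHz mono waveform and decode it.
speech, rate = sf.read("example.wav")
assert rate == 16000, "OWSM models expect 16 kHz input"

# The first hypothesis is typically a (text, tokens, ...) tuple; take the text.
text = s2t(speech)[0][0]
print(text)
```

Switching the task token (for example, to a speech-translation token) would exercise the multitask capabilities described above without changing the rest of the pipeline.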