16 Jun 2024 | Yifan Peng, Jinchuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee Choi, Jiatong Shi, Xuankai Chang, Jee-weon Jung, Shinji Watanabe
This paper presents OWSM v3.1, an improved version of the Open Whisper-style Speech Model (OWSM) built on the E-Branchformer architecture. OWSM v3.1 aims to improve the performance and efficiency of OWSM without any additional training data, addressing a limitation of previous versions, which used standard Transformers. The models, ranging from 100M to 1B parameters, outperform their predecessors on most evaluation benchmarks while running up to 25% faster at inference. Key contributions include:
1. **Model Architecture**: OWSM v3.1 adopts the E-Branchformer encoder, which captures global and local context in parallel branches (self-attention for global context, a convolutional gating MLP for local context), leading to faster convergence and better performance; a minimal sketch of such a block appears after this list.
2. **Performance Improvements**: OWSM v3.1 outperforms OWSM v3 in English ASR, multilingual ASR, speech translation (ST), and language identification (LID) tasks.
3. **Inference Speed**: The models are 24% faster for English ASR and 16% to 25% faster for ST due to a smaller decoder.
4. **Zero-Shot Contextual Biasing**: OWSM v3.1 demonstrates an emergent ability to perform zero-shot contextual biasing, improving ASR accuracy on rare words (a hypothetical sketch also appears after this list).
5. **Data Accessibility**: A small model trained on a data subset with low-restriction licenses is provided to enhance accessibility.
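To make contribution 1 concrete, here is a minimal, self-contained PyTorch sketch of an E-Branchformer-style encoder block: a global self-attention branch and a local convolutional-gating (cgMLP) branch run in parallel, their outputs are merged by a depthwise convolution, and the whole block is wrapped in macaron-style half-step feed-forward layers. All layer sizes and details here are illustrative assumptions, not the exact ESPnet implementation.

```python
# Minimal E-Branchformer-style block sketch (illustrative, not the ESPnet code).
import torch
import torch.nn as nn

class ConvGatingMLP(nn.Module):
    """Local branch: MLP with a depthwise-convolutional gating unit (cgMLP)."""
    def __init__(self, d_model: int, d_hidden: int, kernel_size: int = 31):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden * 2)
        self.norm = nn.LayerNorm(d_hidden)
        self.dwconv = nn.Conv1d(d_hidden, d_hidden, kernel_size,
                                padding=kernel_size // 2, groups=d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x):                       # x: (batch, time, d_model)
        a, b = self.up(x).chunk(2, dim=-1)      # split into value and gate
        b = self.norm(b).transpose(1, 2)        # (batch, d_hidden, time)
        b = self.dwconv(b).transpose(1, 2)      # local context via depthwise conv
        return self.down(a * b)                 # gated output

class EBranchformerBlock(nn.Module):
    """Parallel global (attention) and local (cgMLP) branches, merged by a
    depthwise conv, wrapped in macaron-style half-step feed-forward layers."""
    def __init__(self, d_model=256, n_heads=4, d_ff=1024, kernel_size=31):
        super().__init__()
        self.ff1 = nn.Sequential(nn.LayerNorm(d_model),
                                 nn.Linear(d_model, d_ff), nn.SiLU(),
                                 nn.Linear(d_ff, d_model))
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cgmlp_norm = nn.LayerNorm(d_model)
        self.cgmlp = ConvGatingMLP(d_model, d_ff, kernel_size)
        self.merge_dwconv = nn.Conv1d(2 * d_model, 2 * d_model, kernel_size,
                                      padding=kernel_size // 2,
                                      groups=2 * d_model)
        self.merge_proj = nn.Linear(2 * d_model, d_model)
        self.ff2 = nn.Sequential(nn.LayerNorm(d_model),
                                 nn.Linear(d_model, d_ff), nn.SiLU(),
                                 nn.Linear(d_ff, d_model))
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):                        # x: (batch, time, d_model)
        x = x + 0.5 * self.ff1(x)                # macaron half-step FFN
        g = self.attn_norm(x)
        glb, _ = self.attn(g, g, g)              # global branch
        loc = self.cgmlp(self.cgmlp_norm(x))     # local branch
        m = torch.cat([glb, loc], dim=-1)        # concatenate the two branches
        m = m + self.merge_dwconv(m.transpose(1, 2)).transpose(1, 2)
        x = x + self.merge_proj(m)               # enhanced merge + residual
        x = x + 0.5 * self.ff2(x)                # second half-step FFN
        return self.final_norm(x)

x = torch.randn(2, 100, 256)                     # (batch, frames, features)
print(EBranchformerBlock()(x).shape)             # torch.Size([2, 100, 256])
```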
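Contribution 4 amounts to conditioning decoding on textual context that contains the rare words. The sketch below is purely hypothetical: `OWSMModel` and its `decode` signature are placeholders for whatever inference API the released models expose; only the idea of biasing through prior text context comes from the paper.

```python
# Hypothetical sketch of zero-shot contextual biasing via a text prompt.
# `OWSMModel.decode` is a stand-in, NOT the real OWSM/ESPnet API.
from dataclasses import dataclass

@dataclass
class DecodeResult:
    text: str

class OWSMModel:
    """Placeholder; a real model would condition its attention decoder on the
    prompt tokens before decoding the speech."""
    def decode(self, speech, prompt: str = "") -> DecodeResult:
        return DecodeResult(text=f"<decoded with prompt: {prompt!r}>")

def biased_transcribe(model, speech, bias_words):
    # Rare words joined into the prompt act as prior textual context,
    # nudging the decoder toward emitting them when they occur in the audio.
    return model.decode(speech, prompt=" ".join(bias_words))

result = biased_transcribe(OWSMModel(), speech=None,
                           bias_words=["E-Branchformer", "OWSM"])
print(result.text)
```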
The paper also discusses the training setup, including a piecewise-linear learning rate schedule to improve convergence, and provides detailed results on various benchmarks. The authors plan to release the code, pre-trained models, and training logs to promote transparency and open science.
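To illustrate the piecewise-linear schedule: the learning rate ramps slowly to an intermediate value, then faster to the peak, and finally decays. The step counts, rates, and the inverse-square-root decay below are illustrative assumptions rather than the paper's exact settings.

```python
# Sketch of a piecewise-linear warmup schedule (all constants are assumptions).
import math

def piecewise_linear_lr(step, lr_mid=2e-5, lr_peak=2e-4,
                        warmup1=30_000, warmup2=30_000):
    if step <= warmup1:                          # slow ramp: 0 -> lr_mid
        return lr_mid * step / warmup1
    if step <= warmup1 + warmup2:                # fast ramp: lr_mid -> lr_peak
        return lr_mid + (lr_peak - lr_mid) * (step - warmup1) / warmup2
    # after warmup: inverse-square-root decay (a common Transformer choice)
    return lr_peak * math.sqrt((warmup1 + warmup2) / step)

for step in (0, 15_000, 30_000, 45_000, 60_000, 240_000):
    print(step, f"{piecewise_linear_lr(step):.2e}")
```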