This study evaluates the performance of OpenAI's Whisper ASR system across diverse English accents, focusing on both native and non-native speakers. The research aims to understand how speaker characteristics, such as accent, age, sex, and language proficiency, influence the accuracy of ASR systems. Key findings include:
1. **Native English Accents**: Whisper performs better on American English than on British and Australian English; Canadian English performs comparably to American English.
2. **Non-Native English Accents**: Non-native accents generally show higher word error rates (WER) than native accents, with Vietnamese and Thai speakers faring worst.
3. **Speaker Characteristics**:
- **English Proficiency**: Higher proficiency is associated with lower WER.
- **Vowel Inventory**: Speakers with smaller vowel inventories tend to have higher WER.
- **Sex**: Female speakers exhibit lower WER than male speakers.
- **L1 Typology**: Speakers from pitch-accent languages have lower WER than those from tone languages.
4. **Speech Type**:
- **Conversational vs. Read Speech**: Conversational speech yields higher WER, largely due to its inherent disfluencies (fillers, repairs, restarts).
- **Sex and Speech Type**: Male speakers show higher WER than female speakers in conversational speech.
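All of the findings above are expressed in terms of WER, which is word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. A minimal sketch of the computation is shown below; this is an illustration, not the study's evaluation code, and production evaluations typically use a toolkit such as `jiwer` that also handles text normalization.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over word sequences, via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```

Because WER is normalized by reference length, a score of 0.167 means one word in six was mistranscribed; scores above 1.0 are possible when the hypothesis contains many insertions.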
The study highlights the importance of considering regional accent diversity and speaker characteristics in developing inclusive and accurate ASR systems. Future research could explore advanced machine learning techniques to enhance ASR robustness and address biases in training datasets.