This study evaluates the performance of OpenAI's Whisper automatic speech recognition (ASR) system across diverse English accents and speaker characteristics. The research investigates how Whisper performs with native and non-native English accents, focusing on factors such as speaker sex, native language (L1) typology, second language (L2) proficiency, and age. The study uses two datasets: the Speech Accent Archive (SAA) and the Cambridge English Corpus (CEC). Whisper's performance is measured using match error rate (MER), which expresses transcription errors as a proportion of all word matches in the alignment between reference and hypothesis, so lower values indicate better recognition.
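To make the metric concrete, here is a minimal sketch of MER, assuming the standard definition MER = (S + D + I) / (H + S + D + I), where H, S, D, and I are hits, substitutions, deletions, and insertions counted over a minimum-edit-distance word alignment. The function name and the whitespace tokenization are illustrative choices, not taken from the study.

```python
def match_error_rate(reference: str, hypothesis: str) -> float:
    """Compute match error rate (MER) between two transcripts.

    MER = (S + D + I) / (H + S + D + I) over a minimum-edit-distance
    word alignment. Unlike word error rate, MER is bounded above by 1.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edit cost aligning ref[:i] with hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = i
    for j in range(1, len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Backtrack through the table to count hits, substitutions,
    # deletions, and insertions along one optimal alignment.
    h = s = d = ins = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            if ref[i - 1] == hyp[j - 1]:
                h += 1
            else:
                s += 1
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            d += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    total = h + s + d + ins
    return (s + d + ins) / total if total else 0.0
```

For example, `match_error_rate("the quick brown fox", "the quick fox")` yields 0.25: three hits and one deletion give one error out of four alignment matches.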
Results show that native English accents, particularly American English, are recognized more accurately than non-native accents, with Canadian English performing on par with American English. Non-native accents, especially those associated with smaller L1 vowel inventories, show higher MER. Female speakers are recognized more accurately than male speakers, and speakers of tone languages (e.g., Mandarin, Thai, Vietnamese) have higher MER than speakers of stress-accent languages (e.g., Arabic, Dutch, French, Polish). Additionally, Whisper performs better on read speech than on conversational speech, indicating challenges with spontaneous speech.
The study also finds that English experience (proficiency) is negatively correlated with MER: higher proficiency leads to better recognition. The findings highlight the importance of considering regional accent diversity and speaker characteristics when developing and testing ASR systems, and underscore the need for inclusive ASR technologies that can handle diverse accents and speech patterns. The results suggest that future research should focus on improving ASR systems through techniques such as accent-agnostic meta-learning and transfer learning to enhance performance across different accents and speech types. The study also emphasizes the importance of considering environmental noise and other contextual factors in real-world applications of ASR.