15 Oct 2021 | Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-yi Lee
The paper introduces the Speech processing Universal PERformance Benchmark (SUPERB), a standardized framework for evaluating the generalizability and reusability of self-supervised learning (SSL) models across speech processing tasks. SUPERB addresses the lack of a shared evaluation protocol in speech research by measuring how well a single pre-trained model transfers to a wide range of tasks with minimal architectural changes and labeled data. The benchmark comprises ten tasks grouped into four aspects of speech: content, speaker, semantics, and paralinguistics, covering problems such as phoneme recognition, automatic speech recognition, speaker identification, intent classification, and emotion recognition. The authors propose a simple framework that keeps a shared pre-trained model frozen and trains only a lightweight prediction head for each task, and show that this setup is competitive with traditional supervised pipelines. The paper also details the datasets, evaluation metrics, and downstream models used for each task, as well as the SSL models evaluated. The results show that SSL representations, particularly those from wav2vec 2.0 and HuBERT, achieve highly competitive performance across multiple tasks, highlighting the potential of SSL models in speech processing. SUPERB is released as a challenge with a public leaderboard and a benchmark toolkit to facilitate further research and development in the field.
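To make the evaluation protocol concrete, here is a minimal PyTorch sketch of the frozen-upstream setup the paper describes: the pre-trained model's parameters are excluded from optimization, and only a small task head is trained on top of its representations. The `UpstreamSSL`-style module, its `hidden_dim`, and the linear classifier head (illustrated here for an utterance-level task such as speaker identification) are placeholders, not the SUPERB toolkit's actual API; any SSL encoder such as wav2vec 2.0 or HuBERT could stand in.

```python
import torch
import torch.nn as nn

class FrozenUpstreamClassifier(nn.Module):
    """Sketch of the SUPERB-style protocol: frozen SSL encoder + lightweight head.

    `upstream` is assumed to be any pre-trained SSL model that maps a waveform
    batch to frame-level features of size `hidden_dim`; both are placeholders
    standing in for a real upstream, not names from the paper or its toolkit.
    """

    def __init__(self, upstream: nn.Module, hidden_dim: int, num_classes: int):
        super().__init__()
        self.upstream = upstream
        # Freeze the shared pre-trained model: no gradients flow into it.
        for p in self.upstream.parameters():
            p.requires_grad = False
        # Lightweight, task-specific prediction head (here: a linear classifier).
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():  # upstream is used purely as a feature extractor
            feats = self.upstream(wav)  # (batch, frames, hidden_dim)
        pooled = feats.mean(dim=1)      # mean-pool frames for utterance-level tasks
        return self.head(pooled)        # (batch, num_classes)

# Only the head's parameters are handed to the optimizer:
# model = FrozenUpstreamClassifier(my_ssl_encoder, hidden_dim=768, num_classes=1251)
# optim = torch.optim.Adam(model.head.parameters(), lr=1e-3)
```

Keeping the upstream frozen is what makes the benchmark a test of representation reusability rather than fine-tuning capacity: per-task training stays cheap, and every task probes the same shared features.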