29 May 2024 | Shu-wen Yang, Heng-Jui Chang*, Zili Huang*, Andy T. Liu*, Cheng-I Lai*, Haibin Wu*, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, Tzu-hsun Feng, Po-Han Chi, Yist Y. Lin, Yung-Sung Chuang, Tzu-Hsien Huang, Wei-Cheng Tseng, Kushal Lakhotia, Shang-Wen Li, Abdelrahman Mohamed, Shinji Watanabe, Hung-yi Lee
This paper introduces the Speech Processing Universal PERformance Benchmark (SUPERB), a comprehensive framework for evaluating speech foundation models across 15 diverse tasks spanning content, speaker characteristics, prosody, semantics, and generation. The authors propose a unified multi-tasking framework in which a frozen foundation model is followed by task-specialized, lightweight prediction heads. They validate the effectiveness of the foundation model paradigm for speech processing, demonstrating competitive generalizability across most SUPERB tasks. The paper also argues for a learnable weighted-sum approach over traditional evaluation protocols, showing that it generalizes better across tasks. Additionally, the authors provide a detailed analysis of layer-wise performance and the contributions of different layers within the foundation models, highlighting the limitations of using layer weights for interpretation. The paper concludes with a series of analyses that deepen the understanding of SUPERB and speech foundation models, covering information flows, the correctness of the weighted-sum benchmarking protocol, and the statistical significance and robustness of the benchmark.
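To make the evaluation setup concrete, below is a minimal PyTorch sketch of the frozen-upstream, lightweight-head pattern with a learnable weighted sum over hidden layers, as described above. The class name, the linear classifier head, and the tensor shapes are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class WeightedSumHead(nn.Module):
    """Learnable weighted sum over a frozen foundation model's hidden layers,
    followed by a lightweight task-specific head (a linear classifier here).
    Hypothetical sketch: names and the choice of head are assumptions."""

    def __init__(self, num_layers: int, hidden_dim: int, num_classes: int):
        super().__init__()
        # One scalar weight per upstream layer; normalized via softmax at forward time.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, hidden_states: list[torch.Tensor]) -> torch.Tensor:
        # hidden_states: one (batch, time, hidden_dim) tensor per layer,
        # taken from the frozen upstream model (no gradients flow into it).
        stacked = torch.stack(hidden_states, dim=0)               # (layers, B, T, D)
        weights = torch.softmax(self.layer_weights, dim=0)        # (layers,)
        mixed = (weights[:, None, None, None] * stacked).sum(0)   # (B, T, D)
        return self.head(mixed)                                   # (B, T, num_classes)
```

Only the scalar layer weights and the small head are trained per task, which keeps the per-task cost low while letting each task draw on whichever layers of the frozen model are most informative.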