29 May 2024 | Shu-wen Yang, Heng-Jui Chang*, Zili Huang*, Andy T. Liu*, Cheng-I Lai*, Haibin Wu*, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, Tzu-hsun Feng, Po-Han Chi, Yist Y. Lin, Yung-Sung Chuang, Tzu-Hsien Huang, Wei-Cheng Tseng, Kushal Lakhotia, Shang-Wen Li, Abdelrahman Mohamed, Shinji Watanabe, Hung-yi Lee
This paper introduces the Speech Processing Universal PERformance Benchmark (SUPERB), a comprehensive framework for evaluating speech foundation models across 15 diverse tasks spanning content, speaker characteristics, prosody, semantics, and generation. The authors propose a unified multi-tasking framework in which a frozen foundation model is followed by task-specialized, lightweight prediction heads. They validate the effectiveness of the foundation model paradigm for speech processing, demonstrating competitive generalizability across most SUPERB tasks. The paper also argues for a learnable weighted-sum approach over traditional evaluation protocols, showing that it generalizes better across tasks. Additionally, the authors provide a detailed analysis of layer-wise performance and the contributions of different layers within the foundation models, highlighting the limitations of using layer weights for interpretation. The paper concludes with a series of analyses that deepen the understanding of SUPERB and speech foundation models, covering information flows, the correctness of the weighted-sum benchmarking protocol, and the statistical significance and robustness of the benchmark.
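To make the evaluation setup concrete, below is a minimal PyTorch sketch of the frozen-upstream, lightweight-head pattern with a learnable weighted sum over hidden layers, as described above. The class name, the linear classifier head, and the tensor shapes are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class WeightedSumHead(nn.Module):
    """Learnable weighted sum over a frozen foundation model's hidden layers,
    followed by a lightweight task-specific head (a linear classifier here).
    Hypothetical sketch: names and the choice of head are assumptions."""

    def __init__(self, num_layers: int, hidden_dim: int, num_classes: int):
        super().__init__()
        # One scalar weight per upstream layer; normalized via softmax at forward time.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, hidden_states: list[torch.Tensor]) -> torch.Tensor:
        # hidden_states: one (batch, time, hidden_dim) tensor per layer,
        # taken from the frozen upstream model (no gradients flow into it).
        stacked = torch.stack(hidden_states, dim=0)               # (layers, B, T, D)
        weights = torch.softmax(self.layer_weights, dim=0)        # (layers,)
        mixed = (weights[:, None, None, None] * stacked).sum(0)   # (B, T, D)
        return self.head(mixed)                                   # (B, T, num_classes)
```

Only the scalar layer weights and the small head are trained per task, which keeps the per-task cost low while letting each task draw on whichever layers of the frozen model are most informative.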