18 Sep 2024 | Haibin Wu, Ho-Lam Chung, Yi-Cheng Lin, Yuan-Kuei Wu, Xuanjun Chen, Yu-Chi Pai, Hsiu-Hsuan Wang, Kai-Wei Chang, Alexander H. Liu, Hung-yi Lee
The study introduces Codec-SUPERB, a comprehensive, community-driven benchmark for evaluating sound codec models. Codec-SUPERB addresses the limitations of previous codec evaluations by providing a holistic and fair comparison across diverse experimental settings and signal-level metrics. The platform includes an online leaderboard for result sharing and collaboration, and it covers six distinct types of codec models, each with unique training specifications, yielding 19 codec models for comparison. The evaluation spans four applications (content, speaker, paralinguistic, and audio information preservation) and 20 datasets covering speech, audio, and music.

Signal-level sound quality is assessed with PESQ, STOI, STFT distance, Mel distance, and F0CORR, and an overall score is introduced to integrate these metrics. Application-level evaluations are conducted with pre-trained models for automatic speech recognition (ASR), automatic speaker verification (ASV), emotion recognition (ER), and audio event classification (AEC). The results show that the Descript-audio-codec (DAC) achieves a well-balanced trade-off between performance and bitrate, while AcademiCodec excels at low bitrates. The study also highlights the importance of preserving emotional information even at very low bitrates and the potential of diverse speech data for maintaining audio information. The authors commit to releasing the code, leaderboard, and data to accelerate community progress.