DASB - Discrete Audio and Speech Benchmark

21 Jun 2024 | Pooneh Mousavi, Luca Della Libera, Jarod Duret, Artem Ploujnikov, Cem Subakan, Mirco Ravanelli
The paper introduces the Discrete Audio and Speech Benchmark (DASB), a framework for evaluating discrete audio tokens across a range of speech processing tasks. Discrete audio tokens, which map audio signals onto a finite set of vectors, have gained attention because they can connect audio and language processing and so enable multi-modal large language models. The benchmark covers speech recognition, speaker identification, emotion recognition, keyword spotting, intent classification, speech enhancement, speech separation, and text-to-speech.
The authors benchmark a diverse set of discrete audio encoders, including semantic, compression, and hybrid tokenizers, and consider different downstream architectures for each task to make the evaluation more reliable. The results show that semantic tokens outperform compression tokens on most tasks, but a significant performance gap remains relative to continuous representations. The benchmark also highlights the impact of bitrate on performance, with medium bitrates achieving the best results. The paper concludes by emphasizing the need for further research on preserving information in discrete audio tokens, and suggests expanding the benchmark to music and sound processing tasks.
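To make the notion of bitrate concrete, the sketch below uses the standard relation for a codebook-based tokenizer: bits per second equal the token frame rate, times the number of codebooks per frame, times the bits needed to index one codebook entry. The function name and the example configurations are illustrative assumptions and are not taken from the paper.

```python
import math


def token_bitrate(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Return the bitrate in bits per second of a discrete token stream.

    Each frame carries one index per codebook, and each index costs
    log2(codebook_size) bits.
    """
    bits_per_index = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_index


# Hypothetical configurations, chosen only to illustrate the arithmetic.
low = token_bitrate(frame_rate_hz=50, num_codebooks=2, codebook_size=1024)
medium = token_bitrate(frame_rate_hz=50, num_codebooks=8, codebook_size=1024)

print(f"low: {low / 1000:.1f} kbps, medium: {medium / 1000:.1f} kbps")
```

Under this relation, adding codebooks raises the bitrate linearly, which is one way a tokenizer can be moved between low, medium, and high bitrate settings.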