Understanding DASB - Discrete Audio and Speech Benchmark

The Discrete Audio and Speech Benchmark (DASB) is an evaluation framework for discrete audio tokens. It assesses them across a wide range of speech processing tasks: discriminative tasks such as speech recognition, speaker identification, emotion recognition, keyword spotting, and intent classification, and generative tasks such as speech enhancement, separation, and text-to-speech. The benchmark covers three families of discrete audio tokenizers: semantic (e.g., Discrete HuBERT, Discrete WavLM, Discrete Wav2Vec2), compression (e.g., EnCodec, DAC), and hybrid (e.g., SpeechTokenizer). DASB is built on the SpeechBrain toolkit and is publicly available under the Apache 2.0 license.

The results show that semantic tokens generally outperform compression tokens across most discriminative and generative tasks. However, the performance gap between semantic tokens and standard continuous representations remains significant, highlighting the need for further research. DASB also evaluates the impact of bitrate on tokenizer performance and finds that a medium bitrate achieves the best results for both discriminative and generative tasks.

In generated outputs, compression tokens preserve speaker identity more effectively, while semantic tokens provide better overall quality and intelligibility. The benchmark also compares audio decoders and finds that the built-in decoders of compression tokenizers outperform other models in preserving speaker similarity. Semantic tokens still produce high-quality audio, albeit with a slight risk of semantic degradation. The study highlights the importance of decoder architecture for high-fidelity audio generation and suggests that further research is needed to improve the performance of discrete audio tokens.
DASB provides a standardized evaluation framework for comparing different audio tokenizers and is intended to help the research community establish a shared benchmark and evaluation protocol for discrete audio representations.
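
The bitrate comparison mentioned in the summary can be made concrete with a small calculation. The sketch below assumes a tokenizer that emits a fixed number of codebook indices per frame, so that its bitrate is the number of codebooks times the bits per index times the frame rate. The function name `token_bitrate_bps` and the low, medium, and high configurations are hypothetical and chosen only for illustration; they are not taken from DASB.

```python
import math


def token_bitrate_bps(num_codebooks: int, codebook_size: int, frame_rate_hz: float) -> float:
    """Bits per second for a tokenizer that emits `num_codebooks` indices per frame.

    Each index drawn from a codebook of `codebook_size` entries carries
    log2(codebook_size) bits.
    """
    bits_per_index = math.log2(codebook_size)
    return num_codebooks * bits_per_index * frame_rate_hz


# Hypothetical settings (assumed for illustration, not DASB's configuration):
# (number of codebooks, codebook size, frames per second)
configs = {
    "low": (2, 1024, 75.0),
    "medium": (8, 1024, 75.0),
    "high": (32, 1024, 75.0),
}

for name, (num_codebooks, codebook_size, frame_rate_hz) in configs.items():
    kbps = token_bitrate_bps(num_codebooks, codebook_size, frame_rate_hz) / 1000
    print(f"{name}: {kbps:.1f} kbps")
# low: 1.5 kbps
# medium: 6.0 kbps
# high: 24.0 kbps
```

Under these assumptions, changing the number of codebooks is the only thing that moves a tokenizer between bitrate settings, which is why comparisons across tokenizers are usually made at matched bitrates.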

DASB - Discrete Audio and Speech Benchmark

21 Jun 2024 | Pooneh Mousavi, Luca Della Libera, Jarod Duret, Artem Ploujnikov, Cem Subakan, Mirco Ravanelli