The Interspeech 2024 Challenge on Speech Processing Using Discrete Units aims to explore the potential of discrete speech units in speech processing tasks. The challenge includes three main tasks: multilingual automatic speech recognition (ASR), text-to-speech (TTS), and singing voice synthesis (SVS). The challenge provides a benchmark for evaluating the effectiveness of discrete units in these tasks, with the goal of promoting innovation and research in this area.
The ASR task focuses on multilingual speech recognition, incorporating data from the ML-SUPERB challenge. The TTS task includes two tracks: a single-speaker TTS track and a vocoder track for multi-speaker speech resynthesis. The SVS task involves synthesizing singing voices from musical score information. These tasks cover the full speech processing pipeline and encourage holistic innovation in discrete unit processing.
The challenge defines discrete units as a sequence of tokens derived from speech signals, with a focus on efficiency and performance. The bitrate is a key metric for evaluating the efficiency of discrete representations. The challenge also includes baseline systems and preliminary results, which provide insights into the performance of different approaches.
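To make the bitrate metric concrete, the following is a minimal sketch of one common way to compute it: for each token stream, multiply the tokens emitted per second by the bits needed per token (log2 of the vocabulary size), then sum over streams. The exact formula used by the challenge is not given in this summary, so the function name `discrete_bitrate`, its parameters, and the formula itself are assumptions for illustration.

```python
import math


def discrete_bitrate(streams, vocab_sizes, duration_sec):
    """Estimate bits per second for one or more discrete token streams.

    Assumed formula (not taken from the challenge text):
        sum over streams of (num_tokens / duration_sec) * log2(vocab_size)
    """
    if duration_sec <= 0:
        raise ValueError("duration_sec must be positive")
    if len(streams) != len(vocab_sizes):
        raise ValueError("need exactly one vocabulary size per stream")

    total = 0.0
    for tokens, vocab_size in zip(streams, vocab_sizes):
        tokens_per_sec = len(tokens) / duration_sec
        bits_per_token = math.log2(vocab_size)
        total += tokens_per_sec * bits_per_token
    return total


# Hypothetical example: 2 seconds of audio, one stream of 100 tokens
# drawn from a 1024-entry codebook (50 tokens/s * 10 bits/token).
single_stream = discrete_bitrate([[0] * 100], [1024], 2.0)
```

Under this definition, a smaller vocabulary or a lower token rate both reduce the bitrate, which is why the metric is useful for comparing the efficiency of different discrete representations.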
The ASR baseline uses a joint CTC/attention-based encoder-decoder architecture, while the TTS baseline combines an acoustic model with a vocoder. The SVS baseline likewise consists of an acoustic model and a vocoder, adapted for singing voice synthesis. Preliminary results show that systems using discrete units outperform traditional methods in terms of performance and efficiency.
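As a rough illustration of what "joint CTC/attention" means during training, such models are commonly optimized with a weighted sum of a CTC loss and an attention-decoder loss. The sketch below shows only that interpolation step on already-computed scalar losses. The function name `joint_ctc_attention_loss` and the default weight of 0.3 are assumptions for illustration, not values taken from the challenge baseline.

```python
def joint_ctc_attention_loss(ctc_loss, attention_loss, ctc_weight=0.3):
    """Interpolate two scalar losses for joint CTC/attention training.

    loss = ctc_weight * ctc_loss + (1 - ctc_weight) * attention_loss

    ctc_weight=0.3 is a hypothetical default chosen for this example.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


# Hypothetical per-batch loss values, weighted equally.
combined = joint_ctc_attention_loss(2.0, 1.0, ctc_weight=0.5)
```

Setting `ctc_weight` to 0 reduces this to a pure attention-based objective, and setting it to 1 reduces it to pure CTC.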
The challenge highlights the potential of discrete units in speech processing, with applications in ASR, TTS, and SVS. The results indicate that discrete units can improve performance while maintaining efficiency, making them a promising approach for future research in speech processing. The challenge encourages further exploration and innovation in this field.