Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

8 Dec 2015 | Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Tony Han, Awni Hannun, Billy Jun, Patrick LeGresley, Libby Lin, Sharan Narang, Andrew Ng, Sherjil Ozair, Ryan Prenger, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Yi Wang, Zhiqian Wang, Chong Wang, Bo Xiao, Dani Yogatama, Jun Zhan, Zhenyao Zhu
Deep Speech 2 is an end-to-end speech recognition system that handles both English and Mandarin. It replaces the hand-engineered components of a traditional recognition pipeline with neural networks, which lets a single system cope with diverse speech, including noisy environments, accents, and multiple languages. A heavily optimized training system delivers a 7x speedup over the previous Deep Speech system, cutting training time from weeks to days, enabling faster iteration and yielding results competitive with human transcribers on several standard benchmarks.

The models are trained on large corpora: 11,940 hours of English speech and 9,400 hours of Mandarin speech, with data augmentation used to further improve robustness.

The architecture is a deep network that stacks convolutional layers over the input spectrogram followed by recurrent layers (bidirectional or unidirectional, depending on the deployment constraints). Batch Normalization applied to the deep recurrent stack and a curriculum strategy called SortaGrad, which orders training utterances by length in the first epoch, improve both accuracy and training stability. The network is trained with the CTC loss, which predicts transcriptions directly from audio without requiring frame-level alignments.
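As a rough illustration of this kind of architecture, the sketch below assumes PyTorch; the layer counts, kernel sizes, and the 29-class output alphabet are illustrative placeholders, not the configuration reported in the paper. It stacks a 2-D convolution over the spectrogram, bidirectional GRU layers, and a linear output layer whose per-frame scores feed a CTC loss.

import torch
import torch.nn as nn

class SpeechModel(nn.Module):
    """Minimal sketch of a Deep Speech 2-style acoustic model.

    Layer sizes here are illustrative, not the paper's exact configuration.
    """
    def __init__(self, n_mels=161, n_hidden=512, n_classes=29):
        super().__init__()
        # 2-D convolution over the (time, frequency) axes of the spectrogram
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(11, 41), stride=(2, 2), padding=(5, 20)),
            nn.BatchNorm2d(32),
            nn.ReLU(),
        )
        conv_out = 32 * ((n_mels + 1) // 2)  # features per frame after the conv stack
        # Bidirectional recurrent layers over time
        self.rnn = nn.GRU(conv_out, n_hidden, num_layers=3,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * n_hidden, n_classes)

    def forward(self, spectrogram):
        # spectrogram: (batch, 1, time, n_mels)
        x = self.conv(spectrogram)                    # (batch, 32, time', freq')
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x, _ = self.rnn(x)
        return self.fc(x)                             # per-frame class scores

# CTC ties per-frame predictions to the transcript without frame-level alignment
ctc_loss = nn.CTCLoss(blank=0)

The CTC loss consumes the per-frame log-probabilities together with the unaligned target transcript, which is what allows the whole model to be trained end-to-end from audio and text alone.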
Training uses synchronous SGD across multiple GPUs, which the authors find more efficient and easier to debug than asynchronous methods. Together with efficient memory allocation and other GPU optimizations, this keeps the system scalable and reduces training time to 3-5 days, allowing rapid iteration on models and data. Evaluated on public benchmarks and internal datasets, the system improves substantially on previous systems, and on some benchmarks it outperforms crowdsourced human workers.

For deployment, the system runs in an online, low-latency setting using Batch Dispatch, which groups requests from concurrent user streams into batches for GPU evaluation; on a single GPU the 98th-percentile compute latency is 67 milliseconds. A language model is applied during inference to further improve transcription accuracy. The result is a system optimized end to end, for both training and deployment, through efficient use of GPUs and memory.
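To illustrate the serving idea only (this is not the production implementation described in the paper), the sketch below shows a Batch Dispatch-style queue in Python: requests from concurrent user streams are collected into a small batch, one batched forward pass runs on the GPU, and results are handed back to each caller. The batch size, timeout, and model_fn hook are assumptions made for the example.

import queue
import threading

class BatchDispatcher:
    """Illustrative sketch of Batch Dispatch-style serving: incoming audio
    chunks are grouped so one batched forward pass serves several streams.
    Batch size and timeout are placeholder values."""

    def __init__(self, model_fn, max_batch=8, wait_ms=5):
        self.model_fn = model_fn          # runs one batched forward pass on the GPU
        self.max_batch = max_batch
        self.wait_s = wait_ms / 1000.0
        self.requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, audio_chunk):
        """Called per user stream; returns a slot whose 'done' event can be waited on."""
        slot = {"input": audio_chunk, "output": None, "done": threading.Event()}
        self.requests.put(slot)
        return slot

    def _loop(self):
        while True:
            batch = [self.requests.get()]           # block for the first request
            while len(batch) < self.max_batch:
                try:                                # then gather more, briefly
                    batch.append(self.requests.get(timeout=self.wait_s))
                except queue.Empty:
                    break
            outputs = self.model_fn([s["input"] for s in batch])
            for slot, out in zip(batch, outputs):
                slot["output"] = out
                slot["done"].set()

The design trade-off is the same one the paper describes: larger batches use the GPU more efficiently, while the short wait bound keeps per-request latency low.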