8 Dec 2015 | Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Tony Han, Awni Hannun, Billy Jun, Patrick LeGresley, Libby Lin, Sharan Narang, Andrew Ng, Sherjil Ozair, Ryan Prenger, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Yi Wang, Zhiqian Wang, Chong Wang, Bo Xiao, Dani Yogatama, Jun Zhan, Zhenyao Zhu
The paper presents Deep Speech 2 (DS2), an end-to-end deep learning approach to speech recognition in English and Mandarin. DS2 replaces traditional hand-engineered pipeline components with neural networks, enabling it to handle diverse speech conditions such as noisy environments, accents, and different languages. Key advances include the application of high-performance computing (HPC) techniques, which cut training time from weeks to days and allow rapid iteration on models. The system approaches the accuracy of human workers on several standard benchmarks and can be deployed in an online setting with low latency using a Batch Dispatch technique on GPUs. The architecture explores deep RNNs with convolutional input layers, batch normalization, and curriculum learning (SortaGrad), yielding substantial reductions in error rate over previous systems. The paper also details optimizations of the training process, including an efficient GPU implementation of the CTC loss function and custom memory allocation, which further improve throughput and scalability.
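To make the described architecture concrete, here is a minimal PyTorch sketch of a DS2-style acoustic model: a convolutional front-end with batch normalization, a stack of bidirectional recurrent layers, and per-frame character scores trained with the CTC loss. Layer sizes, kernel shapes, and the class name `DS2Sketch` are illustrative assumptions, and `nn.CTCLoss` stands in for the paper's custom GPU implementation; this is a sketch of the general approach, not the authors' code.

```python
# A minimal sketch of a DS2-style model (illustrative; not the paper's implementation).
import torch
import torch.nn as nn

class DS2Sketch(nn.Module):
    def __init__(self, n_mels=161, n_classes=29, hidden=512, rnn_layers=3):
        super().__init__()
        # 2D convolution over (time, frequency), with batch normalization,
        # mirroring the paper's convolutional input layers.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(11, 41), stride=(2, 2), padding=(5, 20)),
            nn.BatchNorm2d(32),
            nn.ReLU(),
        )
        conv_out = 32 * ((n_mels + 2 * 20 - 41) // 2 + 1)  # frequency dim after conv
        # Stack of bidirectional GRUs; the paper also explores plain RNN cells.
        self.rnn = nn.GRU(conv_out, hidden, num_layers=rnn_layers,
                          bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_classes)  # per-frame character scores

    def forward(self, spec):                # spec: (batch, 1, time, n_mels)
        x = self.conv(spec)                 # (batch, 32, time', freq')
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x, _ = self.rnn(x)
        return self.fc(x).log_softmax(-1)   # CTC expects log-probabilities

# One training step with the CTC loss on a dummy batch of spectrograms.
model = DS2Sketch()
ctc = nn.CTCLoss(blank=0)
spec = torch.randn(4, 1, 200, 161)
logp = model(spec)                                          # (batch, time', classes)
targets = torch.randint(1, 29, (4, 20))                     # label 0 is the blank
input_lens = torch.full((4,), logp.size(1), dtype=torch.long)
target_lens = torch.full((4,), 20, dtype=torch.long)
loss = ctc(logp.transpose(0, 1), targets, input_lens, target_lens)
loss.backward()
```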
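The Batch Dispatch idea for low-latency deployment can also be sketched: incoming requests are queued briefly and run through the GPU as one batch, trading a small wait for much higher throughput. The queue sizes, timeout, and `reply` callback below are hypothetical, and the sketch assumes same-length spectrograms (a real system would pad); it illustrates the technique, not the paper's production serving code.

```python
# A toy sketch of Batch Dispatch for serving (assumptions noted in the lead-in).
import queue
import threading
import torch

requests = queue.Queue()  # each item: (spectrogram tensor, reply callback)

def dispatcher(model, max_batch=8, wait_s=0.01):
    while True:
        batch = [requests.get()]               # block until one request arrives
        try:
            while len(batch) < max_batch:      # gather whatever else arrived in time
                batch.append(requests.get(timeout=wait_s))
        except queue.Empty:
            pass
        specs = torch.stack([spec for spec, _ in batch])  # assumes equal lengths
        with torch.no_grad():
            outputs = model(specs)             # one forward pass for the whole batch
        for (_, reply), out in zip(batch, outputs):
            reply(out)                         # hand each caller its result
```

In use, the dispatcher would run in a background thread, e.g. `threading.Thread(target=dispatcher, args=(model,), daemon=True).start()`; batching several streams into one forward pass is what keeps the GPU well utilized while each request still sees low latency.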