Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

8 Dec 2015 | Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Tony Han, Awni Hannun, Billy Jun, Patrick LeGresley, Libby Lin, Sharan Narang, Andrew Ng, Sherjil Ozair, Ryan Prenger, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Yi Wang, Zhiqian Wang, Chong Wang, Bo Xiao, Dani Yogatama, Jun Zhan, Zhenyao Zhu
Deep Speech 2 is an end-to-end speech recognition system that handles both English and Mandarin. It replaces the hand-engineered components of a traditional recognition pipeline with neural networks, which lets a single system cope with diverse speech, including noisy environments, accents, and multiple languages. A heavily optimized training system delivers a 7x speedup over the previous Deep Speech system, cutting training time from weeks to days, enabling faster iteration and yielding results competitive with human transcribers on several standard benchmarks.

The models are trained on large corpora: 11,940 hours of English speech and 9,400 hours of Mandarin speech, with data augmentation used to further improve robustness.

The architecture is a deep network that stacks convolutional layers over the input spectrogram followed by recurrent layers (bidirectional or unidirectional, depending on the deployment constraints). Batch Normalization applied to the deep recurrent stack and a curriculum strategy called SortaGrad, which orders training utterances by length in the first epoch, improve both accuracy and training stability. The network is trained with the CTC loss, which predicts transcriptions directly from audio without requiring frame-level alignments.
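As a rough illustration of this kind of architecture, the sketch below assumes PyTorch; the layer counts, kernel sizes, and the 29-class output alphabet are illustrative placeholders, not the configuration reported in the paper. It stacks a 2-D convolution over the spectrogram, bidirectional GRU layers, and a linear output layer whose per-frame scores feed a CTC loss.

import torch
import torch.nn as nn

class SpeechModel(nn.Module):
    """Minimal sketch of a Deep Speech 2-style acoustic model.

    Layer sizes here are illustrative, not the paper's exact configuration.
    """
    def __init__(self, n_mels=161, n_hidden=512, n_classes=29):
        super().__init__()
        # 2-D convolution over the (time, frequency) axes of the spectrogram
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(11, 41), stride=(2, 2), padding=(5, 20)),
            nn.BatchNorm2d(32),
            nn.ReLU(),
        )
        conv_out = 32 * ((n_mels + 1) // 2)  # features per frame after the conv stack
        # Bidirectional recurrent layers over time
        self.rnn = nn.GRU(conv_out, n_hidden, num_layers=3,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * n_hidden, n_classes)

    def forward(self, spectrogram):
        # spectrogram: (batch, 1, time, n_mels)
        x = self.conv(spectrogram)                    # (batch, 32, time', freq')
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x, _ = self.rnn(x)
        return self.fc(x)                             # per-frame class scores

# CTC ties per-frame predictions to the transcript without frame-level alignment
ctc_loss = nn.CTCLoss(blank=0)

The CTC loss consumes the per-frame log-probabilities together with the unaligned target transcript, which is what allows the whole model to be trained end-to-end from audio and text alone.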
Training uses synchronous SGD across multiple GPUs, which the authors find more efficient and easier to debug than asynchronous methods. Together with efficient memory allocation and other GPU optimizations, this keeps the system scalable and reduces training time to 3-5 days, allowing rapid iteration on models and data. Evaluated on public benchmarks and internal datasets, the system improves substantially on previous systems, and on some benchmarks it outperforms crowdsourced human workers.

For deployment, the system runs in an online, low-latency setting using Batch Dispatch, which groups requests from concurrent user streams into batches for GPU evaluation; on a single GPU the 98th-percentile compute latency is 67 milliseconds. A language model is applied during inference to further improve transcription accuracy. The result is a system optimized end to end, for both training and deployment, through efficient use of GPUs and memory.
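To illustrate the serving idea only (this is not the production implementation described in the paper), the sketch below shows a Batch Dispatch-style queue in Python: requests from concurrent user streams are collected into a small batch, one batched forward pass runs on the GPU, and results are handed back to each caller. The batch size, timeout, and model_fn hook are assumptions made for the example.

import queue
import threading

class BatchDispatcher:
    """Illustrative sketch of Batch Dispatch-style serving: incoming audio
    chunks are grouped so one batched forward pass serves several streams.
    Batch size and timeout are placeholder values."""

    def __init__(self, model_fn, max_batch=8, wait_ms=5):
        self.model_fn = model_fn          # runs one batched forward pass on the GPU
        self.max_batch = max_batch
        self.wait_s = wait_ms / 1000.0
        self.requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, audio_chunk):
        """Called per user stream; returns a slot whose 'done' event can be waited on."""
        slot = {"input": audio_chunk, "output": None, "done": threading.Event()}
        self.requests.put(slot)
        return slot

    def _loop(self):
        while True:
            batch = [self.requests.get()]           # block for the first request
            while len(batch) < self.max_batch:
                try:                                # then gather more, briefly
                    batch.append(self.requests.get(timeout=self.wait_s))
                except queue.Empty:
                    break
            outputs = self.model_fn([s["input"] for s in batch])
            for slot, out in zip(batch, outputs):
                slot["output"] = out
                slot["done"].set()

The design trade-off is the same one the paper describes: larger batches use the GPU more efficiently, while the short wait bound keeps per-request latency low.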