Deep Speech: Scaling up end-to-end speech recognition

19 Dec 2014 | Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, Andrew Y. Ng
The paper presents Deep Speech, an end-to-end deep learning-based speech recognition system that outperforms traditional speech systems in both clear and noisy environments. The system is significantly simpler than traditional methods, which rely on complex engineered processing pipelines and struggle in noisy conditions. Deep Speech does not require hand-designed components for modeling background noise, reverberation, or speaker variation; instead, it learns these effects directly from data. Key to its success is a well-optimized RNN training system using multiple GPUs and novel data synthesis techniques to generate a large amount of varied training data. The system achieves 16.0% error on the Switchboard Hub5'00 corpus, outperforming previously published results, and performs better than commercial systems in noisy speech recognition tests.
The paper discusses the architecture of the RNN, GPU optimizations, data capture and synthesis strategies, and experimental results, demonstrating the system's superior performance and scalability.
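The data synthesis strategy the summary mentions amounts to superimposing noise tracks onto clean utterances to simulate noisy environments. A minimal sketch of that idea (the function name, SNR parameterization, and signal shapes here are illustrative assumptions, not details taken from the paper):

```python
import numpy as np

def synthesize_noisy(clean, noise, snr_db):
    """Superimpose a noise track on a clean utterance at a target SNR (dB).

    Illustrative sketch of additive-noise data synthesis; not the paper's code.
    """
    # Tile or trim the noise clip so it spans the whole utterance.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    # Scale the noise so the clean/noise power ratio matches the requested SNR.
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Stand-ins for real audio: 1 s of "speech" at 16 kHz and a short noise clip.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noise = rng.standard_normal(4000)
noisy = synthesize_noisy(clean, noise, snr_db=10.0)
```

In practice many such mixtures, drawn from thousands of hours of noise at varying SNRs, can multiply the effective amount of labeled training data without new transcription effort.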