30 Mar 2018 | Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, Tsubasa Ochiai
ESPnet is an open-source end-to-end speech processing toolkit designed for automatic speech recognition (ASR) and other speech processing tasks. It uses the widely adopted deep learning frameworks Chainer and PyTorch as its main deep learning engines, and it follows the Kaldi ASR toolkit style for data processing, feature extraction, and recipe setup, providing a complete environment for speech recognition experiments.

The toolkit supports a range of end-to-end ASR techniques, including hybrid CTC/attention architectures, RNNLM integration, and fast CTC computation via the warp-ctc library. It provides recipes for major ASR benchmarks such as WSJ, LibriSpeech, TED-LIUM, CSJ, AMI, HKUST Mandarin CTS, VoxForge, and CHiME-4/5, and it also supports multilingual ASR and noise-robust/far-field speech recognition. ESPnet's architecture consists of a Kaldi-style data preprocessing module, an attention-based encoder-decoder, and hybrid CTC/attention training and decoding. By implementing the ASR pipeline in Python, the toolkit keeps training and recognition efficient with far fewer lines of code than comparable systems.

Experimental results show that ESPnet achieves performance competitive with state-of-the-art hybrid HMM/DNN systems, particularly on tasks such as CSJ and HKUST, and it serves as an official baseline for the CHiME-5 challenge. The toolkit is actively developed, with features such as multi-GPU support, data augmentation, and multilingual ASR experiments.
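To make the hybrid CTC/attention idea concrete, the sketch below shows how a multi-task objective can interpolate a frame-level CTC loss and a label-level attention (cross-entropy) loss computed on a shared encoder. This is a minimal illustration in PyTorch, not ESPnet's actual API: the 0.3 interpolation weight, the toy vocabulary size, the tensor shapes, and the `hybrid_loss` helper are all assumptions made for the example.

```python
# Minimal sketch of a hybrid CTC/attention multi-task objective:
#   loss = lambda * CTC + (1 - lambda) * attention cross-entropy.
# Weights, shapes, and names below are illustrative assumptions,
# not ESPnet's actual configuration or code.
import torch
import torch.nn as nn

ctc_weight = 0.3      # assumed interpolation weight between the two objectives
vocab_size = 50       # assumed number of output units; CTC blank at index 0

ctc_criterion = nn.CTCLoss(blank=0, zero_infinity=True)
att_criterion = nn.CrossEntropyLoss(ignore_index=-1)  # -1 marks padded label slots

def hybrid_loss(ctc_logits, att_logits, labels, input_lengths, label_lengths):
    """Combine CTC and attention losses over shared encoder outputs.

    ctc_logits:  (T, B, vocab_size) frame-level scores from the CTC branch
    att_logits:  (B, L, vocab_size) label-level scores from the attention decoder
    labels:      (B, L) reference token ids, padded with -1
    input_lengths, label_lengths: (B,) valid lengths per utterance
    """
    log_probs = ctc_logits.log_softmax(dim=-1)
    # CTCLoss takes unpadded targets; concatenate only the valid labels.
    targets = torch.cat(
        [labels[b, : label_lengths[b]] for b in range(labels.size(0))]
    )
    l_ctc = ctc_criterion(log_probs, targets, input_lengths, label_lengths)
    l_att = att_criterion(att_logits.reshape(-1, vocab_size), labels.reshape(-1))
    return ctc_weight * l_ctc + (1 - ctc_weight) * l_att
```

In this formulation the CTC branch enforces monotonic input-output alignment while the attention decoder models label dependencies; the same interpolation idea is also used at decoding time, where CTC and attention scores are combined during beam search.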