31 Jan 2017 | Suyoun Kim, Takaaki Hori, and Shinji Watanabe
This paper presents a novel approach to end-to-end speech recognition that combines the strengths of Connectionist Temporal Classification (CTC) and attention-based encoder-decoder models within a multi-task learning (MTL) framework. The authors address the limitations of both CTC and attention models, particularly in noisy conditions and long input sequences, by jointly training a shared encoder using both CTC and attention objectives. The proposed method leverages the monotonic alignment property of CTC and the flexibility of attention models to improve robustness and convergence speed. Experiments on the WSJ and CHiME-4 datasets demonstrate significant improvements in Character Error Rate (CER) compared to both CTC and attention models, with relative reductions of 5.4-14.6%. The MTL approach also accelerates the learning process, as evidenced by faster convergence and better alignment estimation.
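The joint objective described above interpolates the two losses with a tunable weight. A minimal sketch of that combination, assuming per-batch scalar losses `ctc_loss` and `att_loss` (the function name `mtl_loss` and the weight value are illustrative, not from the paper):

```python
def mtl_loss(ctc_loss: float, att_loss: float, lam: float = 0.2) -> float:
    """Combine CTC and attention losses for multi-task training.

    lam weights the CTC branch; (1 - lam) weights the attention
    decoder branch. lam = 0 recovers pure attention training and
    lam = 1 recovers pure CTC training.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * ctc_loss + (1.0 - lam) * att_loss


# Example: CTC branch reports a loss of 10.0, attention branch 5.0.
combined = mtl_loss(10.0, 5.0, lam=0.2)
```

In a real system both branches share the same encoder, so backpropagating through this weighted sum updates the encoder with gradients from both objectives at once, which is what drives the faster, better-aligned convergence the authors report.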