JOINT CTC-ATTENTION BASED END-TO-END SPEECH RECOGNITION USING MULTI-TASK LEARNING

31 Jan 2017 | Suyoun Kim, Takaaki Hori, and Shinji Watanabe
This paper presents a novel end-to-end speech recognition method based on multi-task learning, combining Connectionist Temporal Classification (CTC) and attention-based encoder-decoder models to improve robustness and accelerate learning. The proposed method uses a shared encoder trained with both CTC and attention objectives, addressing the limitations of each individual model. The attention model, while effective in many cases, struggles with noisy data and long input sequences because it imposes no left-to-right constraint on the alignment. CTC, on the other hand, provides monotonic alignment but lacks flexibility. The joint CTC-attention model mitigates these issues by leveraging the strengths of both approaches. The model was evaluated on the WSJ and CHiME-4 tasks, achieving significant improvements in Character Error Rate (CER) over both CTC and attention-based baselines, with relative improvements of 5.4-14.6%. The model also converges faster, making training more efficient. The results show that the joint CTC-attention model outperforms both CTC and attention models in both clean and noisy conditions. The method is implemented with a shared encoder and is applicable to a variety of sequence-to-sequence learning tasks.
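The multi-task objective described above can be written as a weighted sum of the two losses, L_MTL = lambda * L_CTC + (1 - lambda) * L_att, where lambda in [0, 1] trades off the monotonic CTC constraint against the more flexible attention decoder. The sketch below illustrates one way such a joint loss could be wired up over a shared encoder; it is a minimal illustration, not the authors' implementation, and the PyTorch module name (JointCTCAttentionLoss), the tensor shapes, and the default lambda = 0.2 are assumptions made for the example.

import torch
import torch.nn as nn

class JointCTCAttentionLoss(nn.Module):
    # Hypothetical sketch of the multi-task objective:
    #   L = lambda * L_ctc + (1 - lambda) * L_att,
    # with both branches reading the output of one shared encoder.
    def __init__(self, mtl_lambda=0.2, blank=0, pad_id=-1):
        super().__init__()
        self.mtl_lambda = mtl_lambda
        self.ctc = nn.CTCLoss(blank=blank, zero_infinity=True)
        self.att = nn.CrossEntropyLoss(ignore_index=pad_id)

    def forward(self, ctc_log_probs, att_logits, ctc_targets, att_targets,
                input_lengths, target_lengths):
        # CTC branch: (T, N, C) log-probabilities over characters; enforces
        # a monotonic left-to-right alignment between frames and labels.
        loss_ctc = self.ctc(ctc_log_probs, ctc_targets,
                            input_lengths, target_lengths)
        # Attention branch: (N, S, C) decoder logits scored against padded
        # character targets; flexible, but with no monotonicity constraint.
        loss_att = self.att(att_logits.transpose(1, 2), att_targets)
        # Multi-task interpolation of the two objectives.
        return self.mtl_lambda * loss_ctc + (1.0 - self.mtl_lambda) * loss_att

Because the encoder is shared, the CTC term acts as an alignment regularizer during training: it pushes the encoder toward monotonic frame-to-character correspondences, which is one plausible reading of why the paper reports faster convergence and better robustness in noise.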