20 Sep 2015 | Minh-Thang Luong, Hieu Pham, Christopher D. Manning
This paper explores two simple and effective attentional mechanisms for neural machine translation (NMT): a *global* approach that always attends to all source words and a *local* approach that only considers a subset of source words at each time step. The authors evaluate these models on the WMT translation tasks between English and German in both directions. The local attention model yields a gain of 5.0 BLEU points over non-attentional systems that already incorporate known techniques such as dropout. An ensemble of models with different attention architectures establishes a new state-of-the-art result of 25.9 BLEU points on the WMT'15 English-to-German translation task, outperforming the existing best system by 1.0 BLEU point. The paper also compares several alignment (scoring) functions and analyzes the models in terms of learning curves, handling of long sentences, and alignment quality.
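As a rough illustration of the two mechanisms (a minimal sketch, not the authors' reference implementation), the code below shows Luong-style global attention with the simple dot-product score and local-p attention with a Gaussian window in NumPy. The weight matrices `W_c` and `W_p`, the vector `v_p`, and the window half-width `D` are assumed model parameters introduced here for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def global_attention(h_t, source_states, W_c):
    """Global attention: score every source state with the 'dot' alignment function."""
    scores = source_states @ h_t                          # (S,) one score per source word
    a_t = softmax(scores)                                 # alignment weights over all source words
    c_t = a_t @ source_states                             # context vector = weighted average
    h_tilde = np.tanh(W_c @ np.concatenate([c_t, h_t]))   # attentional hidden state
    return h_tilde, a_t

def local_attention(h_t, source_states, W_c, W_p, v_p, D=10):
    """Local-p attention: predict an aligned position p_t and attend only to a 2D+1 window."""
    S = source_states.shape[0]
    p_t = S * sigmoid(v_p @ np.tanh(W_p @ h_t))           # predicted source position in [0, S]
    lo, hi = max(0, int(p_t) - D), min(S, int(p_t) + D + 1)
    window = source_states[lo:hi]
    a_t = softmax(window @ h_t)                           # dot-score alignment inside the window
    positions = np.arange(lo, hi)
    a_t *= np.exp(-((positions - p_t) ** 2) / (2 * (D / 2.0) ** 2))  # Gaussian centered at p_t
    c_t = a_t @ window
    h_tilde = np.tanh(W_c @ np.concatenate([c_t, h_t]))
    return h_tilde, a_t

# Toy usage with random states (d = 4 hidden units, S = 12 source words)
rng = np.random.default_rng(0)
d, S = 4, 12
h_t = rng.standard_normal(d)
src = rng.standard_normal((S, d))
W_c = rng.standard_normal((d, 2 * d))
W_p = rng.standard_normal((d, d))
v_p = rng.standard_normal(d)
print(global_attention(h_t, src, W_c)[1])                 # weights over all 12 source words
print(local_attention(h_t, src, W_c, W_p, v_p, D=3)[1])   # weights over the local window only
```

The key design difference is visible in the weight vectors: global attention distributes probability mass over every source position, while local attention restricts and reweights it around the predicted position, which keeps the cost per target word constant for long sentences.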