25 Jul 2017 | Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin
This paper introduces a fully convolutional architecture for sequence-to-sequence learning that outperforms strong recurrent models on large benchmark datasets while running roughly an order of magnitude faster. Replacing recurrent networks with convolutional networks allows computations over a sequence to be fully parallelized during training, and optimization is easier because the number of non-linearities each input passes through is fixed and independent of sequence length. Gated linear units ease gradient propagation, and each decoder layer is equipped with its own attention module.

Evaluated on several large machine translation and summarization datasets, the model sets a new state of the art on WMT'16 English-Romanian and outperforms the deep LSTM setup of Wu et al. (2016) on WMT'14 English-German and WMT'14 English-French, both in accuracy and in translation speed on GPU and CPU hardware.

The convolutional structure lets attention be computed as a single batched operation over all elements of a sequence, and the separate attention module in each decoder layer acts as a multi-step mechanism, giving the decoder multiple attention 'hops' per time step. Residual connections and careful weight initialization further stabilize learning.
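To make the decoder mechanics concrete, below is a minimal sketch of one decoder layer in the spirit of the paper: a causal 1D convolution with a gated linear unit, one attention 'hop' over the encoder outputs, and a scaled residual connection. This is an illustration written in PyTorch under my own assumptions, not the authors' fairseq implementation; the class name ConvGLUAttentionBlock and its parameters are hypothetical.

```python
# Illustrative sketch of a ConvS2S-style decoder layer (not the authors' code).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvGLUAttentionBlock(nn.Module):
    def __init__(self, d_model, kernel_size=3, embed_dim=None):
        super().__init__()
        embed_dim = embed_dim or d_model
        # The convolution outputs 2*d channels so the GLU can halve them back to d.
        self.conv = nn.Conv1d(d_model, 2 * d_model, kernel_size,
                              padding=kernel_size - 1)
        # Projections into/out of the embedding space used for attention.
        self.to_embed = nn.Linear(d_model, embed_dim)
        self.from_embed = nn.Linear(embed_dim, d_model)

    def forward(self, x, target_embed, enc_keys, enc_values):
        # x:            (batch, tgt_len, d_model) decoder states from the layer below
        # target_embed: (batch, tgt_len, embed_dim) target-side input embeddings
        # enc_keys:     (batch, src_len, embed_dim) encoder outputs (attention keys)
        # enc_values:   (batch, src_len, embed_dim) attention values
        residual = x

        # Causal convolution followed by a gated linear unit.
        h = self.conv(x.transpose(1, 2))         # (batch, 2*d, tgt_len + k - 1)
        h = h[:, :, : x.size(1)]                 # trim right padding to keep causality
        h = F.glu(h, dim=1).transpose(1, 2)      # (batch, tgt_len, d_model)

        # One attention "hop" for this layer: combine the conv output with the
        # target embedding, score against encoder keys, and sum the values.
        # This is a single batched matrix product over the whole sequence.
        query = self.to_embed(h) + target_embed
        scores = torch.bmm(query, enc_keys.transpose(1, 2))
        attn = F.softmax(scores, dim=-1)
        context = torch.bmm(attn, enc_values)    # (batch, tgt_len, embed_dim)

        # Residual connection; scaling by sqrt(0.5) keeps the variance of the sum
        # roughly constant, in line with the paper's careful initialization scheme.
        return (h + self.from_embed(context) + residual) * math.sqrt(0.5)
```

Stacking several such layers is what yields the multi-step attention described above: each layer re-attends to the source conditioned on what previous layers already attended to.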
Overall, the convolutional model beats previous approaches in both accuracy and speed, and ensembling several models improves performance further. On abstractive summarization it also outperforms prior models in ROUGE scores. The authors conclude that a fully convolutional architecture is a promising approach for sequence-to-sequence learning, particularly for tasks that benefit from hierarchical representations.