12 Apr 2018 | Peter Shaw, Jakob Uszkoreit, Ashish Vaswani
This paper introduces a method to incorporate relative position representations into the self-attention mechanism of the Transformer, improving machine translation performance. Unlike recurrent and convolutional models, the Transformer does not explicitly model relative or absolute position information in its structure; instead, it requires adding representations of absolute positions to its inputs. The proposed method extends self-attention to efficiently consider relative positions between sequence elements. On the WMT 2014 English-to-German and English-to-French translation tasks, this approach yields improvements of 1.3 BLEU and 0.3 BLEU over absolute position representations, respectively. Notably, combining relative and absolute position representations does not yield further improvement in translation quality.
The method models the input as a labeled, directed, fully-connected graph, where edges between input elements are represented by vectors. The edge representations are learned from relative positions, with relative positions clipped to a maximum absolute value of k. This clipping allows the model to generalize to sequence lengths not seen during training. The method is implemented efficiently, reducing space complexity by sharing relative position representations across attention heads and sequences.
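As a rough illustration, the following NumPy sketch shows how clipped relative positions can be turned into learned embeddings and folded into both the attention logits and the attention output for a single head. Parameter names such as rel_k_emb and rel_v_emb, the single-head layout, and the random weights in the usage example are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def relative_position_matrix(length, k):
    """Pairwise relative positions j - i, clipped to [-k, k] and shifted to [0, 2k]."""
    positions = np.arange(length)
    rel = positions[None, :] - positions[:, None]        # (length, length)
    return np.clip(rel, -k, k) + k                       # indices into 2k+1 embeddings

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relative_self_attention(x, Wq, Wk, Wv, rel_k_emb, rel_v_emb, k):
    """Single-head self-attention with relative position representations (a sketch).

    x: (length, d_model); rel_k_emb, rel_v_emb: (2k + 1, d_head).
    """
    length = x.shape[0]
    q, kx, v = x @ Wq, x @ Wk, x @ Wv                    # each (length, d_head)
    d_head = q.shape[-1]

    idx = relative_position_matrix(length, k)            # (length, length)
    a_k = rel_k_emb[idx]                                 # (length, length, d_head)
    a_v = rel_v_emb[idx]                                 # (length, length, d_head)

    # Logits: query-key dot products plus query-edge dot products, scaled.
    logits = (q @ kx.T + np.einsum('id,ijd->ij', q, a_k)) / np.sqrt(d_head)
    alpha = softmax(logits, axis=-1)

    # Output: weighted sum of values plus weighted sum of edge representations.
    return alpha @ v + np.einsum('ij,ijd->id', alpha, a_v)

# Tiny usage example with random parameters (shapes only, not trained weights).
L, d_model, d_head, k = 7, 12, 4, 3
rng = np.random.default_rng(0)
x = rng.standard_normal((L, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
rel_k_emb = rng.standard_normal((2 * k + 1, d_head))
rel_v_emb = rng.standard_normal((2 * k + 1, d_head))
print(relative_self_attention(x, Wq, Wk, Wv, rel_k_emb, rel_v_emb, k).shape)  # (7, 4)
```

Because the same (2k + 1) embedding vectors are indexed for every query, attention head, and sequence in a batch, the extra memory cost is small relative to storing a distinct representation per position pair.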
Experiments show that the proposed method improves performance on two machine translation tasks. The results indicate that relative position representations are effective for machine translation, although further research is needed to determine their utility for other tasks. The mechanism can in principle be generalized to arbitrary graph-labeled inputs, and future work will explore such extensions. The paper also mentions nonlinear compatibility functions for combining input and edge representations as another direction for future work.
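To give a sense of the graph-labeled generalization the authors point to, the hypothetical snippet below replaces the clipped relative-distance indices with an arbitrary integer edge-label matrix (e.g. dependency-relation ids); the resulting tensor could be used exactly like the edge representations in the sketch above. All names and shapes here are assumptions for illustration:

```python
import numpy as np

num_labels, d_head, length = 8, 16, 5
rng = np.random.default_rng(0)

# edge_labels[i, j] is an integer label for the edge i -> j (hypothetical input).
edge_labels = rng.integers(0, num_labels, size=(length, length))
rel_emb = rng.standard_normal((num_labels, d_head))   # one learned vector per label

a = rel_emb[edge_labels]   # (length, length, d_head), analogous to a^K / a^V above
print(a.shape)
```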