Self-Attention with Relative Position Representations

12 Apr 2018 | Peter Shaw, Jakob Uszkoreit, Ashish Vaswani
This paper introduces an extension to the self-attention mechanism in the Transformer model that incorporates relative position representations to improve machine translation performance. Unlike recurrent and convolutional neural networks, the Transformer does not explicitly model relative or absolute position information in its structure; instead, it relies on sinusoidal position encodings, which can be limiting for tasks that depend on precise relative positions. The proposed approach extends self-attention to consider pairwise relationships between input elements, treating the input as a labeled, directed, fully-connected graph. This extension introduces two types of edge representations, $a_{ij}^K$ and $a_{ij}^V$, which are used in the compatibility function and in the sublayer output, respectively. Relative positions are clipped to a maximum absolute value $k$, and a representation is learned for each unique edge label within this range. The authors demonstrate that this approach significantly improves translation quality on the WMT 2014 English-to-German and English-to-French tasks, with gains of 1.3 and 0.3 BLEU, respectively, over the baseline Transformer with sinusoidal position encodings. They also show that combining relative and absolute position representations does not further improve performance. An efficient implementation is described that reduces the space complexity of the added terms and allows the same model and batch sizes as the baseline on P100 GPUs. Future work will explore extending the mechanism to arbitrary directed, labeled graph inputs and developing nonlinear compatibility functions.
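To make the mechanism concrete, below is a minimal NumPy sketch of single-head self-attention with relative position representations, following the paper's formulation: the compatibility function adds $a_{ij}^K$ to the key before the dot product, and the output adds $a_{ij}^V$ to the value inside the weighted sum, with relative positions clipped to $[-k, k]$. All variable names and shapes are illustrative, and the sketch materializes the full $n \times n \times d$ tensors for clarity rather than using the paper's memory-efficient reshaping trick.

```python
import numpy as np

def relative_self_attention(x, W_q, W_k, W_v, a_k, a_v, k):
    """Single-head self-attention with relative position representations.

    x:            (n, d_model) input sequence
    W_q, W_k, W_v: (d_model, d) projection matrices
    a_k, a_v:     (2k + 1, d) learned embeddings, one per clipped relative
                  position in [-k, k] (a_ij^K and a_ij^V in the paper)
    k:            maximum relative distance before clipping
    """
    n = x.shape[0]
    d = W_q.shape[1]

    q, key, v = x @ W_q, x @ W_k, x @ W_v                 # (n, d) each

    # Relative position j - i for every pair, clipped to [-k, k],
    # then shifted to [0, 2k] to index the embedding tables.
    rel = np.clip(np.arange(n)[None, :] - np.arange(n)[:, None], -k, k) + k
    r_k = a_k[rel]                                         # (n, n, d): a_ij^K
    r_v = a_v[rel]                                         # (n, n, d): a_ij^V

    # Compatibility: e_ij = q_i . (k_j + a_ij^K) / sqrt(d)
    logits = (q @ key.T + np.einsum('id,ijd->ij', q, r_k)) / np.sqrt(d)
    alpha = np.exp(logits - logits.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)             # softmax over j

    # Output: z_i = sum_j alpha_ij (v_j + a_ij^V)
    return alpha @ v + np.einsum('ij,ijd->id', alpha, r_v)

# Toy usage with random weights (shapes only; not trained).
rng = np.random.default_rng(0)
n, d_model, d, k = 5, 8, 4, 2
x = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d)) for _ in range(3))
a_k = rng.normal(size=(2 * k + 1, d))
a_v = rng.normal(size=(2 * k + 1, d))
z = relative_self_attention(x, W_q, W_k, W_v, a_k, a_v, k)
print(z.shape)  # (5, 4)
```

In practice the paper shares the relative position embeddings across attention heads and splits the compatibility term so that the extra computation reduces to a batched matrix multiply, which is what keeps the memory overhead small enough to preserve the baseline batch size.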