Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

2 Jun 2019 | Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov
Transformer-XL is a novel neural architecture that enables language models to learn dependencies beyond a fixed-length context without disrupting temporal coherence. It introduces a segment-level recurrence mechanism and a novel relative positional encoding scheme. Together, these allow the model to capture longer-term dependencies, resolve context fragmentation, and achieve better performance on both short and long sequences. Transformer-XL outperforms RNNs and vanilla Transformers in terms of dependency length, performance, and evaluation speed. It achieves state-of-the-art results on multiple language modeling benchmarks, including enwik8, text8, WikiText-103, One Billion Word, and Penn Treebank, and it can generate coherent, novel text articles with thousands of tokens. During evaluation it is up to 1,800+ times faster than vanilla Transformers. The key technical contributions are introducing recurrence into a self-attentive model and developing a relative positional encoding scheme that generalizes to attention lengths longer than those seen in training. These innovations enable the model to effectively capture long-term dependencies and improve performance on a range of language modeling tasks.
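To make the segment-level recurrence idea concrete, here is a minimal sketch (not the authors' released code) of the core mechanism: hidden states from the previous segment are cached with gradients stopped and prepended as extra memory, so keys and values span both segments while queries come only from the current segment. The class name SegmentRecurrentAttention and the shapes used below are illustrative assumptions, and the relative positional encoding term is omitted for brevity.

```python
# Sketch of Transformer-XL style segment-level recurrence (illustrative only).
from typing import Optional
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentRecurrentAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.kv_proj = nn.Linear(d_model, 2 * d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor, memory: Optional[torch.Tensor] = None):
        # x: (batch, seg_len, d_model); memory: (batch, mem_len, d_model) or None.
        if memory is None:
            context = x
        else:
            # Stop gradients through the cached previous segment, as in the paper.
            context = torch.cat([memory.detach(), x], dim=1)

        b, q_len, _ = x.shape
        k_len = context.shape[1]

        # Queries from the current segment only; keys/values over memory + current.
        q = self.q_proj(x).view(b, q_len, self.n_heads, self.d_head).transpose(1, 2)
        k, v = self.kv_proj(context).chunk(2, dim=-1)
        k = k.view(b, k_len, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, k_len, self.n_heads, self.d_head).transpose(1, 2)

        # Causal mask: position i may attend to all memory plus current positions <= i.
        mem_len = k_len - q_len
        mask = torch.ones(q_len, k_len).tril(diagonal=mem_len).bool()

        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        scores = scores.masked_fill(~mask, float("-inf"))
        out = F.softmax(scores, dim=-1) @ v
        out = out.transpose(1, 2).reshape(b, q_len, -1)
        return self.out_proj(out)

# Usage: process a long sequence segment by segment, carrying the cache forward.
layer = SegmentRecurrentAttention(d_model=64, n_heads=4)
segments = torch.randn(3, 2, 16, 64)   # 3 segments, batch 2, length 16, width 64
memory = None
for seg in segments:
    out = layer(seg, memory)           # attends over cached memory + current segment
    memory = out                       # cache this segment's states as new memory
```

Because the cached memory is detached, backpropagation stays within the current segment while the effective context still grows across segments, which is what lets the model look beyond a fixed-length window at both training and evaluation time.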