2 Jan 2020 | Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le
XLNet is a generalized autoregressive pretraining method that addresses the limitations of both autoregressive language modeling and autoencoding (BERT-style) pretraining. It enables bidirectional context modeling by maximizing the expected likelihood of a sequence over all permutations of the factorization order, so each position learns from both left and right contexts. Unlike BERT, which corrupts the input with masked tokens and therefore suffers from a pretrain-finetune discrepancy, XLNet avoids this by relying on a permutation-based autoregressive objective that factorizes the joint probability with the standard product rule. XLNet also integrates ideas from Transformer-XL, namely relative positional encoding and the segment recurrence mechanism, to improve performance on tasks involving long text sequences. Empirically, XLNet outperforms BERT on 20 tasks, including question answering, natural language inference, sentiment analysis, and document ranking. The method uses a two-stream self-attention mechanism to obtain target-aware representations, and incorporates a bidirectional data input pipeline and span-based prediction to further improve results. XLNet is evaluated on a range of natural language understanding benchmarks, including GLUE, SQuAD, RACE, and several text classification datasets, demonstrating its effectiveness across a wide range of applications. The Transformer-XL architecture is adapted to work seamlessly with the permutation-based autoregressive objective, and the resulting model achieves substantial improvements over previous pretraining methods.
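In symbols, the permutation language modeling objective summarized above samples a factorization order z from the set Z_T of all permutations of the index sequence [1, ..., T] and maximizes the expected autoregressive log-likelihood under that order:

    \max_\theta \; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}\!\left[ \sum_{t=1}^{T} \log p_\theta\!\left( x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}} \right) \right]

Because the same parameters are shared across all factorization orders, each position in expectation conditions on every other position during pretraining, which is what yields bidirectional context without ever masking the input.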
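To make the factorization-order idea concrete, the following is a minimal PyTorch sketch (the helper names permutation_masks and masked_attention are ours, not from the XLNet codebase) of how one sampled order translates into the two attention masks behind two-stream self-attention: the content stream may attend to every token predicted at or before its own step, while the query stream excludes the token's own content so the representation stays target-aware without leaking the token it must predict.

    import torch
    import torch.nn.functional as F

    def permutation_masks(z: torch.Tensor):
        """Attention masks for one sampled factorization order z.

        z[k] is the original sequence position predicted at step k.
        Returns two (T, T) boolean masks; entry (i, j) is True when
        original position i is allowed to attend to original position j.
        """
        T = z.size(0)
        rank = torch.empty(T, dtype=torch.long)
        rank[z] = torch.arange(T)  # rank[p] = step at which position p is predicted
        # Content stream: sees tokens predicted no later than itself (including its own content).
        content_mask = rank.unsqueeze(1) >= rank.unsqueeze(0)
        # Query stream: sees only strictly earlier tokens, never its own content,
        # which keeps the prediction target-aware without revealing the answer.
        # Note: in the full model the query stream also attends to cached memory from the
        # previous segment (Transformer-XL recurrence), so no row ends up fully masked.
        query_mask = rank.unsqueeze(1) > rank.unsqueeze(0)
        return content_mask, query_mask

    def masked_attention(q, k, v, mask):
        """Single-head scaled dot-product attention restricted by a boolean visibility mask."""
        scores = (q @ k.transpose(-2, -1)) / k.size(-1) ** 0.5
        scores = scores.masked_fill(~mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ v

    # Example: factorization order 2 -> 0 -> 3 -> 1 for a length-4 sequence.
    z = torch.tensor([2, 0, 3, 1])
    content_mask, query_mask = permutation_masks(z)
    h = torch.randn(4, 8)  # toy hidden states
    out = masked_attention(h, h, h, content_mask)

Only the masks change between sampled orders; the underlying Transformer parameters are shared, which is why a single model can be trained over the expectation of all factorization orders.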