DeBERTa (Decoding-enhanced BERT with disentangled attention) is a model architecture that improves on BERT and RoBERTa using two novel techniques: disentangled attention and an enhanced mask decoder. Disentangled attention represents each word with two vectors, one encoding its content and one encoding its position, and computes attention weights from disentangled matrices over contents and relative positions; a simplified sketch of this computation is given below. The enhanced mask decoder incorporates absolute position information into the decoding layer when predicting masked tokens during pre-training. For fine-tuning, a new virtual adversarial training method, scale-invariant fine-tuning (SiFT), is used to improve generalization.

Together, these techniques yield significant gains in pre-training efficiency and in performance on both natural language understanding (NLU) and natural language generation (NLG) tasks. A DeBERTa model trained on half of the training data used for RoBERTa-Large consistently outperforms it across a wide range of NLP tasks, improving MNLI by +0.9%, SQuAD v2.0 by +2.3%, and RACE by +3.6%. A larger DeBERTa model with 1.5 billion parameters is the first to surpass human performance on the SuperGLUE benchmark in terms of macro-average score, and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard, outperforming the human baseline by a decent margin. DeBERTa is also more energy-efficient to train and maintain than much larger models such as T5. A comprehensive empirical study in the paper shows that the two techniques substantially improve both pre-training efficiency and downstream task performance.
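To make the disentangled attention mechanism concrete, here is a minimal, single-head PyTorch sketch of how the attention scores can be assembled from separate content and relative-position projections. The function name, tensor shapes, and the `rel_idx` lookup are illustrative assumptions, not the paper's actual implementation, which adds multi-head projections, relative-position bucketing, and further efficiency optimizations.

```python
# Minimal single-head sketch of DeBERTa-style disentangled attention scores.
# Shapes, names, and the relative-index construction are illustrative
# assumptions, not the actual DeBERTa code.
import torch

def disentangled_attention_scores(H, P, rel_idx, Wq_c, Wk_c, Wq_r, Wk_r):
    """
    H:       (seq, d)   content vectors of the tokens
    P:       (2k, d)    relative-position embeddings
    rel_idx: (seq, seq) long tensor, rel_idx[i, j] = delta(i, j) in [0, 2k)
    W*_c / W*_r: (d, d) content / relative-position projection matrices
    """
    Qc, Kc = H @ Wq_c, H @ Wk_c      # content query / key
    Qr, Kr = P @ Wq_r, P @ Wk_r      # relative-position query / key

    # content-to-content term: Qc_i . Kc_j
    c2c = Qc @ Kc.T
    # content-to-position term: Qc_i . Kr_{delta(i, j)}
    c2p = torch.gather(Qc @ Kr.T, 1, rel_idx)
    # position-to-content term: Kc_j . Qr_{delta(j, i)}
    p2c = torch.gather(Kc @ Qr.T, 1, rel_idx).T
    # (the position-to-position term is omitted in DeBERTa)
    d = H.size(-1)
    return (c2c + c2p + p2c) / (3 * d) ** 0.5

# Toy usage with random tensors.
seq, d, k = 8, 16, 4
pos = torch.arange(seq)
rel = (pos[:, None] - pos[None, :]).clamp(-k, k - 1) + k   # delta(i, j)
scores = disentangled_attention_scores(
    torch.randn(seq, d), torch.randn(2 * k, d), rel,
    *(torch.randn(d, d) for _ in range(4)))
attn = torch.softmax(scores, dim=-1)                        # (seq, seq)
```

The three terms correspond to content-to-content, content-to-position, and position-to-content attention; because relative positions are shared across tokens, the position projections are computed once per layer rather than per token pair.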
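The virtual adversarial fine-tuning idea can likewise be illustrated with a short sketch. The version below is a simplified approximation, assuming a hypothetical `model` callable that maps embedding tensors to logits; the key property it imitates is that SiFT applies the perturbation to normalized word embeddings rather than to the raw embeddings, which is what the layer-norm step stands in for.

```python
# Simplified sketch of SiFT-style virtual adversarial regularization.
# `model`, `sift_regularizer`, and the hyperparameters are hypothetical.
import torch
import torch.nn.functional as F

def sift_regularizer(model, embeddings, eps=1e-3, step=1e-3):
    # 1. Perturb *normalized* embeddings (the scale-invariant part of SiFT).
    normed = F.layer_norm(embeddings, embeddings.shape[-1:])
    with torch.no_grad():
        clean_logits = model(normed)

    # 2. Take one gradient step on a random perturbation to make it
    #    approximately adversarial with respect to the KL divergence.
    delta = torch.zeros_like(normed).uniform_(-eps, eps).requires_grad_()
    adv_logits = model(normed + delta)
    kl = F.kl_div(F.log_softmax(adv_logits, -1),
                  F.softmax(clean_logits, -1), reduction="batchmean")
    grad, = torch.autograd.grad(kl, delta)
    delta = (delta + step * grad.sign()).clamp(-eps, eps).detach()

    # 3. Penalize divergence between clean and perturbed predictions,
    #    encouraging the model to be smooth around each input.
    adv_logits = model(normed + delta)
    return F.kl_div(F.log_softmax(adv_logits, -1),
                    F.softmax(clean_logits, -1), reduction="batchmean")
```

In practice the returned regularizer would be scaled and added to the task loss during fine-tuning; the published SiFT procedure differs in its exact normalization and perturbation schedule.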