DeBERTa (Decoding-enhanced BERT with disentangled attention) is a model architecture that improves upon BERT and RoBERTa by incorporating two novel techniques: disentangled attention and an enhanced mask decoder. Disentangled attention represents each word with two vectors, one encoding its content and the other its position, and computes attention weights from these representations using disentangled matrices. The enhanced mask decoder incorporates absolute word positions in the decoding layer when predicting masked tokens during pre-training. In addition, a virtual adversarial training method is used during fine-tuning to improve the model's generalization. Together, these techniques significantly improve the efficiency of pre-training and the performance of downstream tasks, spanning both natural language understanding (NLU) and natural language generation (NLG). DeBERTa outperforms RoBERTa-Large on a range of NLU tasks and achieves human-level performance on the SuperGLUE benchmark, marking a significant milestone toward general AI.
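
To make the disentangled attention idea concrete, the sketch below shows how the attention score for a query/key pair decomposes into content-to-content, content-to-position, and position-to-content terms. It is a simplified, illustrative PyTorch sketch, not the reference implementation: the tensor names, the single-head shapes, and the relative-position bucketing helper are assumptions made here for clarity.

```python
import torch
import torch.nn.functional as F

def disentangled_attention_scores(Qc, Kc, Qr, Kr, rel_idx):
    """Illustrative sketch of disentangled attention scores (single head).

    Qc, Kc : content projections of queries/keys, shape (seq_len, d)
    Qr, Kr : relative-position projections, shape (2k, d) for 2k position buckets
    rel_idx: (seq_len, seq_len) long tensor mapping (i, j) to the bucket of their
             relative distance (a hypothetical helper; distances are clamped to [-k, k))
    """
    d = Qc.size(-1)
    c2c = Qc @ Kc.t()                                # content-to-content
    c2p = torch.gather(Qc @ Kr.t(), 1, rel_idx)      # content-to-position
    p2c = torch.gather(Kc @ Qr.t(), 1, rel_idx).t()  # position-to-content
    # Three terms are summed, hence the 1/sqrt(3d) scaling.
    return (c2c + c2p + p2c) / (3 * d) ** 0.5

# Toy usage with made-up dimensions.
seq_len, d, k = 8, 16, 4
Qc, Kc = torch.randn(seq_len, d), torch.randn(seq_len, d)
Qr, Kr = torch.randn(2 * k, d), torch.randn(2 * k, d)
pos = torch.arange(seq_len)
rel_idx = (pos[:, None] - pos[None, :]).clamp(-k, k - 1) + k  # buckets in [0, 2k)
weights = F.softmax(disentangled_attention_scores(Qc, Kc, Qr, Kr, rel_idx), dim=-1)
```

The key design point this illustrates is that the position-dependent terms are computed from a small table of relative-position embeddings shared across the sequence, rather than being baked into the word representations, which is what lets content and position information stay disentangled until the score is formed.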