DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing

24 Mar 2023 | Pengcheng He, Jianfeng Gao, Weizhu Chen
DeBERTaV3 is a pre-trained language model that improves on the original DeBERTa by replacing masked language modeling (MLM) with replaced token detection (RTD), a more sample-efficient ELECTRA-style pre-training task. The paper introduces gradient-disentangled embedding sharing (GDES), a new method that avoids the "tug-of-war" dynamics between the generator and the discriminator caused by ELECTRA's vanilla embedding sharing, improving both training efficiency and the quality of the pre-trained model. DeBERTaV3 is pre-trained with the same settings as DeBERTa and achieves superior performance on a wide range of natural language understanding (NLU) tasks.

For example, DeBERTaV3 Large achieves a 91.37% average score on the GLUE benchmark, 1.37% higher than DeBERTa and 1.91% higher than ELECTRA. A multilingual version, mDeBERTaV3, shows similarly significant gains, reaching 79.8% zero-shot cross-lingual accuracy on XNLI, 3.6% higher than XLM-R Base. The paper further demonstrates that DeBERTaV3 outperforms previous state-of-the-art models on a variety of NLU tasks, including GLUE, SQuAD, and MNLI. The models and code are publicly available at https://github.com/microsoft/DeBERTa.

The key contributions are the GDES method, which lets the discriminator benefit from the generator's embeddings without the discriminator's gradients interfering with the generator, and the successful application of DeBERTaV3 to both monolingual and multilingual settings. The results show that DeBERTaV3 is more efficient and effective than previous models in terms of both training and downstream performance.
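To make the GDES idea concrete, below is a minimal PyTorch-style sketch of gradient-disentangled embedding sharing as summarized above: the discriminator reuses the generator's token embeddings through a stop-gradient, plus a zero-initialized residual table that only the discriminator's RTD loss updates. The module and variable names (GDESEmbedding, residual_delta) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class GDESEmbedding(nn.Module):
    """Illustrative sketch of gradient-disentangled embedding sharing (GDES).

    The discriminator reuses the generator's token embedding table, but the
    shared table is detached (stop-gradient) so the RTD loss cannot pull it
    in a direction that conflicts with the MLM loss; a residual delta table,
    initialized to zero, absorbs the discriminator-specific updates.
    """

    def __init__(self, generator_embedding: nn.Embedding):
        super().__init__()
        # Shared with the MLM generator; updated only by the generator's loss.
        self.generator_embedding = generator_embedding
        # Residual table trained only by the discriminator's RTD loss.
        self.residual_delta = nn.Embedding(
            generator_embedding.num_embeddings,
            generator_embedding.embedding_dim,
        )
        nn.init.zeros_(self.residual_delta.weight)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # sg(E_G): block gradients flowing from the discriminator into E_G.
        shared = self.generator_embedding(input_ids).detach()
        # E_D = sg(E_G) + E_delta
        return shared + self.residual_delta(input_ids)
```

In this sketch, each training step would first update the generator with the MLM loss and then the discriminator (whose input embeddings come from GDESEmbedding) with the RTD loss, so the shared table receives only MLM gradients while the residual table captures the discriminator's adjustments. This keeps the two models' embeddings close without the conflicting gradient updates that naive embedding sharing produces.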