DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing

24 Mar 2023 | Pengcheng He, Jianfeng Gao, Weizhu Chen
This paper introduces DeBERTaV3, a new pre-trained language model that improves upon the original DeBERTa model by replacing masked language modeling (MLM) with replaced token detection (RTD), a more sample-efficient pre-training task. The authors analyze the limitations of vanilla embedding sharing in ELECTRA, which can lead to a "tug-of-war" dynamic between the discriminator and generator, reducing training efficiency and model quality. To address this issue, they propose *gradient-disentangled embedding sharing* (GDES), which allows the discriminator to leverage the semantic information encoded in the generator's embedding layer without interfering with the generator's gradients. This method improves both training efficiency and the quality of the pre-trained model.

DeBERTaV3 is pre-trained with the same settings as DeBERTa and demonstrates superior performance on a wide range of downstream natural language understanding (NLU) tasks. On the GLUE benchmark with eight tasks, DeBERTaV3 Large achieves a 91.37% average score, outperforming DeBERTa by 1.37% and ELECTRA by 1.91% and setting a new state-of-the-art (SOTA) record. Additionally, a multilingual model, mDeBERTaV3, is trained on the CC100 dataset and achieves 79.8% zero-shot cross-lingual accuracy on XNLI, a 3.6% improvement over XLM-R Base, further establishing DeBERTaV3 as a leading model in multilingual NLP. The models and code are publicly available at <https://github.com/microsoft/DeBERTa>.
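To make the GDES idea concrete, below is a minimal PyTorch sketch, assuming the re-parameterization described in the paper: the discriminator's token embedding is formed as the stop-gradient of the shared generator embedding plus a trainable residual, so the discriminator's RTD loss never back-propagates into the generator's embedding table. The class name `GDESEmbedding` and all variable names are illustrative, not the released DeBERTa API.

```python
# Minimal sketch of gradient-disentangled embedding sharing (GDES).
# E_D = stop_gradient(E_G) + E_Delta: RTD gradients update only E_Delta,
# while the generator's MLM loss remains the sole updater of E_G.

import torch
import torch.nn as nn


class GDESEmbedding(nn.Module):
    """Discriminator embedding that reuses the generator's embedding weights
    through a stop-gradient and adds a trainable residual (initialized to zero)."""

    def __init__(self, generator_embedding: nn.Embedding):
        super().__init__()
        self.generator_embedding = generator_embedding            # shared E_G
        self.delta = nn.Embedding(                                 # residual E_Delta
            generator_embedding.num_embeddings,
            generator_embedding.embedding_dim,
        )
        nn.init.zeros_(self.delta.weight)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # detach() blocks the discriminator-side gradient from reaching E_G,
        # avoiding the "tug-of-war" between the MLM and RTD objectives.
        shared = self.generator_embedding(input_ids).detach()
        return shared + self.delta(input_ids)


if __name__ == "__main__":
    vocab_size, hidden = 128, 16
    gen_emb = nn.Embedding(vocab_size, hidden)
    disc_emb = GDESEmbedding(gen_emb)

    ids = torch.randint(0, vocab_size, (2, 8))
    disc_emb(ids).sum().backward()   # stand-in for the discriminator's RTD loss

    print(gen_emb.weight.grad)                      # None: E_G untouched by RTD
    print(disc_emb.delta.weight.grad is not None)   # True: residual receives gradient
```

Compared with ELECTRA's vanilla embedding sharing, where both losses pull on the same embedding matrix, this arrangement lets the discriminator still benefit from the semantics learned by the generator while keeping the two training signals disentangled.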