ALBERT: A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS

9 Feb 2020 | Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut
ALBERT is a lite version of BERT designed to reduce parameter count while maintaining performance. It introduces two parameter-reduction techniques: factorized embedding parameterization and cross-layer parameter sharing (both illustrated in the sketches below). Factorized embedding parameterization decouples the size of the vocabulary embeddings from the size of the hidden layers by first mapping tokens into a lower-dimensional embedding space and then projecting that space into the hidden space, so the hidden size can grow without inflating the embedding matrix. Cross-layer parameter sharing reuses the same weights across all transformer layers, so the parameter count no longer grows with network depth. Together, these techniques sharply reduce the number of parameters: an ALBERT configuration comparable to BERT-large has about 18 times fewer parameters and can be trained about 1.7 times faster.

ALBERT also replaces BERT's next-sentence prediction objective with a self-supervised sentence-order prediction (SOP) loss, which focuses on modeling inter-sentence coherence and improves performance on downstream tasks with multi-sentence inputs.

The models are pretrained on BOOKCORPUS and English Wikipedia. Experiments show that ALBERT outperforms BERT in parameter efficiency and training speed, and that an ALBERT model trained for the same wall-clock time as BERT-large achieves better downstream performance. Removing dropout further improves results. ALBERT establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters than BERT-large, reaching a GLUE score of 89.4, a SQuAD 2.0 test F1 of 92.2, and a RACE test accuracy of 89.4. The code and pretrained models are available at https://github.com/google-research/ALBERT.
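To make the two parameter-reduction ideas concrete, here is a minimal PyTorch sketch (not the authors' implementation; the class names `FactorizedEmbedding` and `SharedEncoder` and the chosen sizes are illustrative assumptions) showing how a factorized embedding and a single reused transformer layer cut the parameter count:

```python
# Minimal sketch of ALBERT's two parameter-reduction ideas (illustrative, not the official code).
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Replace a V x H embedding with V x E plus an E x H projection, where E << H."""
    def __init__(self, vocab_size=30000, embed_dim=128, hidden_dim=768):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, embed_dim)  # V x E
        self.proj = nn.Linear(embed_dim, hidden_dim)          # E x H

    def forward(self, token_ids):
        # (batch, seq) -> (batch, seq, H)
        return self.proj(self.word_emb(token_ids))

class SharedEncoder(nn.Module):
    """One transformer layer whose weights are reused for every pass through the stack."""
    def __init__(self, hidden_dim=768, num_heads=12, num_layers=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):  # same weights applied at every depth
            x = self.layer(x)
        return x

# Parameter comparison: factorized embedding vs. a direct V x H embedding.
factorized = FactorizedEmbedding()
direct = nn.Embedding(30000, 768)
print(sum(p.numel() for p in factorized.parameters()))  # ~3.9M (30000*128 + 128*768 + 768)
print(direct.weight.numel())                            # 23.0M (30000*768)
```

Because the encoder shares a single layer, its parameter count is independent of depth; together with the smaller embedding matrix, this is where most of the roughly 18x reduction relative to BERT-large comes from.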
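The SOP objective can be illustrated with a simple data-construction routine (a hypothetical helper, not from the paper's codebase): positive examples are two consecutive segments from the same document, and negative examples are the same two segments with their order swapped.

```python
import random

def make_sop_examples(document_segments):
    """Build sentence-order-prediction pairs from consecutive text segments.

    Label 1: segments in their original order (coherent).
    Label 0: the same two segments with order swapped (incoherent).
    """
    examples = []
    for a, b in zip(document_segments, document_segments[1:]):
        if random.random() < 0.5:
            examples.append((a, b, 1))  # original order
        else:
            examples.append((b, a, 0))  # swapped order
    return examples

segments = ["ALBERT shares parameters across layers.",
            "This keeps the model small as depth grows.",
            "It also factorizes the embedding matrix."]
for first, second, label in make_sop_examples(segments):
    print(label, "|", first, "|", second)
```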