GLM: General Language Model Pretraining with Autoregressive Blank Infilling

May 22-27, 2022 | Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, Jie Tang
GLM is a general language model pretraining framework that uses autoregressive blank infilling to address the challenge of achieving strong performance across natural language understanding (NLU), unconditional generation, and conditional generation tasks. GLM improves on blank infilling pretraining by adding 2D positional encodings and allowing spans to be predicted in an arbitrary order, resulting in performance gains over BERT and T5 on NLU tasks. GLM can be pretrained for different types of tasks by varying the number and lengths of the blanks. On a wide range of tasks across NLU, conditional generation, and unconditional generation, GLM outperforms BERT, T5, and GPT given the same model size and data, and achieves the best performance from a single pretrained model with 1.25× the parameters of BERT-Large, demonstrating its generalizability to different downstream tasks.

GLM is trained by optimizing an autoregressive blank infilling objective. Given an input text, multiple text spans are sampled, and each span is replaced with a single [MASK] token, forming a corrupted text. The model then predicts the missing tokens of the spans from the corrupted text in an autoregressive manner. To fully capture the interdependencies between different spans, the order in which the spans are predicted is randomly permuted, and the pretraining objective maximizes the expected log-likelihood of the masked spans given the corrupted text and the previously predicted spans, averaged over these permutations.

The model is implemented with a single Transformer, with several modifications to the architecture: the order of layer normalization and the residual connection is rearranged, a single linear layer is used for the output token prediction, and ReLU activation functions are replaced with GeLUs.

GLM uses 2D positional encoding to encode positional information. Each token is assigned two positional ids: one for its position in the corrupted text and one for its intra-span position. The two ids are projected into two vectors via learnable embedding tables, and both vectors are added to the input token embeddings. Because the tokens of a masked span share the corrupted-text position of its single [MASK] token, the model is not aware of the length of a masked span while reconstructing it.

GLM is trained in a multi-task pretraining setup, in which a second objective of generating longer text spans is jointly optimized with the blank infilling objective.

Evaluated on a wide range of tasks across NLU, conditional generation, and unconditional generation, GLM achieves better performance than BERT, T5, and GPT. It outperforms BERT on the SuperGLUE benchmark by a large margin of 4.6%–5.0% and outperforms RoBERTa and BART when pretrained on a corpus of similar size. GLM also significantly outperforms T5 on NLU and generation tasks while using fewer parameters and less data. These results show that GLM effectively shares model parameters across natural language understanding and generation tasks, achieving better performance than a standalone BERT, encoder-decoder, or GPT model.
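To make the blank infilling setup concrete, here is a minimal Python sketch that builds one training example: the corrupted text with [MASK] tokens, the permuted span targets, and the two positional-id sequences used by the 2D positional encoding. This is an illustration, not the authors' released code; the token names, span boundaries, and special-token strings are assumptions, and the attention masking that lets the corrupted text attend bidirectionally while the spans are decoded left to right is omitted for brevity.

```python
# Hypothetical helper illustrating GLM-style blank infilling preprocessing.
# Spans are assumed to be given as sorted, non-overlapping (start, end) pairs;
# in the paper they are sampled (e.g., span lengths from a Poisson distribution).
import random

def build_example(tokens, spans, seed=0):
    rng = random.Random(seed)

    # Corrupted text: each sampled span is collapsed to a single [MASK] token.
    corrupted, mask_positions = [], []
    i, span_iter = 0, iter(spans)
    next_span = next(span_iter, None)
    while i < len(tokens):
        if next_span is not None and i == next_span[0]:
            mask_positions.append(len(corrupted))
            corrupted.append("[MASK]")
            i = next_span[1]
            next_span = next(span_iter, None)
        else:
            corrupted.append(tokens[i])
            i += 1

    # The masked spans are appended in a randomly permuted order. Each span is
    # fed to the model prefixed with [START] and predicted followed by [END].
    inputs = list(corrupted)
    targets = [None] * len(corrupted)     # corrupted-text tokens are not predicted
    pos1 = list(range(len(corrupted)))    # position in the corrupted text
    pos2 = [0] * len(corrupted)           # intra-span position (0 outside spans)
    order = list(range(len(spans)))
    rng.shuffle(order)
    for k in order:
        start, end = spans[k]
        span = tokens[start:end]
        inputs += ["[START]"] + span
        targets += span + ["[END]"]
        # Span tokens reuse the position of their [MASK], so the model cannot
        # infer a span's length before generating it.
        pos1 += [mask_positions[k]] * (len(span) + 1)
        pos2 += list(range(1, len(span) + 2))
    return inputs, targets, pos1, pos2

# Example: mask "x3" and "x5 x6" out of a six-token text.
inputs, targets, pos1, pos2 = build_example(
    ["x1", "x2", "x3", "x4", "x5", "x6"], spans=[(2, 3), (4, 6)]
)
print(inputs)   # ['x1', 'x2', '[MASK]', 'x4', '[MASK]', '[START]', ...]
print(pos1, pos2)
```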
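The 2D positional encoding itself reduces to two learnable embedding tables whose outputs are summed with the token embeddings. Below is a minimal PyTorch sketch under that reading; the class name, dimensions, and table sizes are assumptions for illustration rather than the released implementation.

```python
# Sketch of the 2D positional encoding: two learnable embedding tables,
# one indexed by the position in the corrupted text and one by the
# intra-span position, both added to the token embeddings.
import torch
import torch.nn as nn

class GLMEmbedding(nn.Module):
    def __init__(self, vocab_size, hidden_size, max_seq_len, max_span_len):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden_size)
        self.pos1 = nn.Embedding(max_seq_len, hidden_size)   # position in corrupted text
        self.pos2 = nn.Embedding(max_span_len, hidden_size)  # intra-span position

    def forward(self, token_ids, pos1_ids, pos2_ids):
        # All three id tensors have shape (batch, seq_len); summing the two
        # positional vectors with the token embedding keeps the model unaware
        # of how long each masked span is until it has been generated.
        return self.tok(token_ids) + self.pos1(pos1_ids) + self.pos2(pos2_ids)

# Usage with position ids like those produced by the preprocessing sketch above:
emb = GLMEmbedding(vocab_size=100, hidden_size=16, max_seq_len=32, max_span_len=8)
token_ids = torch.randint(0, 100, (1, 10))
pos1_ids = torch.tensor([[0, 1, 2, 3, 4, 4, 4, 4, 2, 2]])
pos2_ids = torch.tensor([[0, 0, 0, 0, 0, 1, 2, 3, 1, 2]])
print(emb(token_ids, pos1_ids, pos2_ids).shape)  # torch.Size([1, 10, 16])
```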
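Finally, the architectural tweaks listed above (rearranged layer normalization and residual connection, GeLU activations, and a single linear output layer) can be sketched as follows. This is a hedged illustration: it assumes the rearrangement places LayerNorm before each sublayer, and the hidden size and head count are placeholders rather than GLM's actual configuration.

```python
# Illustrative Transformer layer with the modifications described in the summary.
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, hidden_size=16, num_heads=2):
        super().__init__()
        self.ln1 = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(hidden_size)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),                                  # GeLU instead of ReLU
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, x, attn_mask=None):
        # LayerNorm applied before each sublayer, residual added afterwards.
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
        x = x + self.ffn(self.ln2(x))
        return x

class GLMHead(nn.Module):
    """Single linear layer mapping hidden states to vocabulary logits."""
    def __init__(self, hidden_size=16, vocab_size=100):
        super().__init__()
        self.proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, hidden):
        return self.proj(hidden)

layer, head = TransformerLayer(), GLMHead()
x = torch.randn(1, 10, 16)
print(head(layer(x)).shape)  # torch.Size([1, 10, 100])
```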