9 Nov 2020 | Anna Rogers, Olga Kovaleva, Anna Rumshisky
This paper provides the first comprehensive survey of over 150 studies on the BERT model, reviewing current understanding of how BERT works, the types of knowledge it learns, and how it is represented. It discusses common modifications to BERT's training objectives and architecture, the overparameterization issue, and approaches to compression. The paper outlines directions for future research.
BERT is a stack of Transformer encoder layers that use self-attention to process input sequences. Its training consists of two stages: pre-training and fine-tuning. Pre-training involves two tasks: masked language modeling (MLM), in which a fraction of the input tokens is hidden and the model predicts them, and next sentence prediction (NSP), in which the model predicts whether two segments are adjacent in the original text. Fine-tuning adds one or more fully-connected layers on top of the final encoder layer for a downstream task.
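To make the MLM objective concrete, here is a minimal sketch of the token-corruption step, assuming the standard 80/10/10 split described in the original BERT paper; the vocabulary size and [MASK] id are those of bert-base-uncased and are stated as assumptions in the comments, so this is an illustrative sketch rather than a definitive implementation.

```python
# Sketch of BERT's MLM corruption step in plain PyTorch.
import torch

VOCAB_SIZE = 30522      # assumed: WordPiece vocabulary size of bert-base-uncased
MASK_ID = 103           # assumed: [MASK] token id in bert-base-uncased
IGNORE_INDEX = -100     # positions the MLM loss should ignore

def mask_tokens(input_ids: torch.Tensor, mlm_prob: float = 0.15):
    """Corrupt a batch of token ids the way BERT's MLM objective does:
    select ~15% of positions; of those, 80% become [MASK], 10% become a
    random token, and 10% stay unchanged. Returns (inputs, labels)."""
    inputs = input_ids.clone()
    labels = input_ids.clone()

    # Choose which positions the model must predict (~15% of tokens).
    selected = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~selected] = IGNORE_INDEX  # loss is computed only on selected positions

    # 80% of the selected positions are replaced with [MASK].
    masked = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & selected
    inputs[masked] = MASK_ID

    # 10% are replaced with a random token (half of the remaining 20%)...
    random_repl = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & selected & ~masked
    inputs[random_repl] = torch.randint(VOCAB_SIZE, labels.shape)[random_repl]

    # ...and the last 10% keep their original token.
    return inputs, labels

# Toy usage with random ids standing in for tokenized sentences:
batch = torch.randint(VOCAB_SIZE, (2, 8))
inputs, labels = mask_tokens(batch)
```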
BERT has been shown to acquire syntactic, semantic, and world knowledge. Syntactic information is recoverable from BERT's token representations, although it is not directly encoded in the self-attention weights. Semantic knowledge includes knowledge of semantic roles and entity types, but BERT struggles with representations of numbers and is brittle to named entity replacements. BERT also captures some commonsense and world knowledge, yet it cannot reliably reason on the basis of that knowledge.
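A common way to probe this kind of knowledge is to phrase a fact as a cloze query and let the MLM head fill in the blank. The sketch below assumes the Hugging Face transformers library and the bert-base-uncased checkpoint; the specific prompts are illustrative, not taken from the survey.

```python
# Cloze-style probing of BERT's factual and numerical knowledge.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# World-knowledge probe: simple facts are often retrieved correctly.
for pred in fill_mask("The capital of France is [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))

# Numerical probe: queries involving numbers tend to be much less reliable,
# consistent with the reported difficulty with representations of numbers.
for pred in fill_mask("Two plus three equals [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```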
BERT's embeddings are contextual and capture phenomena such as polysemy and homonymy. Individual self-attention heads have been studied in detail: some heads specialize in particular types of syntactic relations, but no single head contains complete information about the syntactic tree.
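A small experiment illustrates the contextualization claim: compare the vectors BERT assigns to the same surface word in different contexts. The sketch below assumes the Hugging Face transformers library and bert-base-uncased; the helper word_vector and the example sentences are illustrative, and the expectation (not a guarantee) is that same-sense occurrences of "bank" end up closer in cosine space than cross-sense ones.

```python
# Comparing contextual embeddings of a polysemous word across sentences.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Return the final-layer hidden state of the first occurrence of `word`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # [seq_len, hidden_size]
    idx = enc.input_ids[0].tolist().index(tokenizer.convert_tokens_to_ids(word))
    return hidden[idx]

river = word_vector("He sat on the bank of the river.", "bank")
money = word_vector("She deposited the money in the bank.", "bank")
money2 = word_vector("The bank approved her loan.", "bank")

cos = torch.nn.functional.cosine_similarity
# Same-sense pair is expected to score higher than the cross-sense pair.
print("finance vs finance:", cos(money, money2, dim=0).item())
print("finance vs river:  ", cos(river, money, dim=0).item())
```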
Different BERT layers capture different kinds of information: the middle layers are the most transferable across tasks, while the final layers are the most task-specific and change the most during fine-tuning. BERT's performance on benchmarks such as GLUE is influenced by the number of layers and the size of the hidden representation.
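Layer-wise transferability is typically measured with probing classifiers: freeze BERT, extract each layer's representation, and fit a small classifier per layer. The sketch below assumes the Hugging Face transformers library and scikit-learn; the toy sentences and labels stand in for a real probing dataset with a proper train/test split.

```python
# Layer-wise probing sketch: which layers carry the most task-relevant signal?
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

texts = ["a gripping, beautifully shot film", "a dull and lifeless sequel",
         "one of the year's best surprises", "tedious from start to finish"]
labels = [1, 0, 1, 0]  # toy sentiment labels; placeholder for a real dataset

with torch.no_grad():
    enc = tokenizer(texts, padding=True, return_tensors="pt")
    # hidden_states: tuple of (embeddings + 12 layers), each [batch, seq, hidden]
    hidden_states = model(**enc).hidden_states

for layer, states in enumerate(hidden_states):
    cls = states[:, 0, :].numpy()  # [CLS] vector for each sentence
    probe = LogisticRegression(max_iter=1000).fit(cls, labels)
    # With a held-out set, the per-layer accuracies trace out which layers
    # are most transferable for the task at hand.
    print(f"layer {layer:2d} train accuracy: {probe.score(cls, labels):.2f}")
```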
BERT's training has been optimized through various methods, including large-batch training, different masking strategies, and alternative pre-training objectives. Pre-training is the most expensive part of training BERT, but it provides significant benefits for fine-tuning. BERT can be compressed through techniques like knowledge distillation, quantization, and pruning.
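As an example of the compression techniques mentioned above, here is a minimal sketch of a knowledge-distillation loss in the spirit of DistilBERT-style setups: the student is trained to match the teacher's temperature-softened output distribution in addition to the usual hard-label loss. The temperature and mixing weight are illustrative values, not prescriptions from any particular paper.

```python
# Knowledge-distillation loss for compressing a large teacher into a small student.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Weighted sum of (a) KL divergence between temperature-softened teacher
    and student distributions and (b) ordinary cross-entropy on the labels."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Example with random tensors standing in for real model outputs:
student = torch.randn(8, 2)   # student logits for a batch of 8, 2 classes
teacher = torch.randn(8, 2)   # teacher (full BERT) logits for the same batch
labels = torch.randint(2, (8,))
print(distillation_loss(student, teacher, labels))
```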
BERTology has made significant progress, but there are still many unanswered questions about how BERT works. Future research directions include developing benchmarks for verbal reasoning, comprehensive stress tests for linguistic competence, and methods to teach reasoning to BERT. Additionally, further research is needed to understand what knowledge is actually used during inference and how to improve BERT's performance on complex tasks.