BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Date: 2019-05-24
Author: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Pages: 16
Summary: BERT (Bidirectional Encoder Representations from Transformers) is a language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. Unlike previous language representation models, BERT uses a "masked language model" (MLM) pre-training objective: a fraction of input tokens is randomly masked, and the model is trained to predict the original vocabulary IDs of the masked tokens from their context. BERT additionally employs a "next sentence prediction" (NSP) task to jointly pre-train text-pair representations. These pre-training techniques enable BERT to achieve state-of-the-art performance on a wide range of natural language processing tasks, including question answering, natural language inference, and named entity recognition, without substantial task-specific architecture modifications. BERT's unified architecture across tasks allows for simple fine-tuning, in which all parameters are updated using labeled data from the downstream task. The paper demonstrates the importance of bidirectional pre-training and shows that pre-trained representations greatly reduce the need for heavily engineered task-specific architectures. BERT advances the state of the art on eleven NLP tasks, including GLUE (80.5% score, 7.7 points absolute improvement), MultiNLI (86.7% accuracy), SQuAD v1.1 (93.2 Test F1), and SQuAD v2.0 (83.1 Test F1).
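
To make the MLM objective concrete, here is a minimal sketch of the input-corruption step described in the paper: about 15% of tokens are selected as prediction targets, and of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. This is an illustrative toy implementation, not the authors' code: the whitespace tokenizer, the small vocabulary, and the function name mask_tokens are assumptions for the example, whereas the real model operates on WordPiece subword tokens.

```python
import random

MASK_TOKEN = "[MASK]"
# Toy vocabulary used only for the "replace with a random token" branch.
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran", "fast"]

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Corrupt a token sequence for MLM pre-training.

    Returns (corrupted_tokens, labels), where labels holds the original
    token at each selected position and None elsewhere; the model's loss
    is computed only over the selected positions.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:          # token chosen as a prediction target
            labels.append(tok)                # model must recover the original token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK_TOKEN)         # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.choice(VOCAB))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: keep the original token
        else:
            labels.append(None)               # not a prediction target
            corrupted.append(tok)
    return corrupted, labels

if __name__ == "__main__":
    corrupted, labels = mask_tokens("the cat sat on the mat".split())
    print(corrupted)  # input sequence, possibly with some tokens replaced by [MASK]
    print(labels)     # original tokens at the selected positions, None elsewhere
```

Because only the selected positions contribute to the loss while the full (mostly intact) sequence is visible to every layer, the encoder can condition on both left and right context without trivially "seeing" the answer, which is what distinguishes this objective from standard left-to-right language modeling.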