Deep contextualized word representations


22 Mar 2018 | Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer
This paper introduces ELMo, a new type of deep contextualized word representation that models both complex characteristics of word use (e.g., syntax and semantics) and how these uses vary across linguistic contexts (i.e., to model polysemy). ELMo representations are learned functions of the internal states of a deep bidirectional language model (biLM), which is pretrained on a large text corpus. They can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment, and sentiment analysis. The paper also presents an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.

Unlike traditional word type embeddings, ELMo assigns each token a representation that is a function of the entire input sentence. The representations are derived from a bidirectional LSTM trained with a coupled language model objective on a large text corpus, and they are deep in the sense that they are a function of all of the internal layers of the biLM. Combining the internal states in this manner yields very rich word representations: intrinsic evaluations show that higher-level LSTM states capture context-dependent aspects of word meaning, while lower-level states model aspects of syntax. Exposing all of these signals simultaneously is highly beneficial, allowing the learned models to select the types of semi-supervision that are most useful for each end task.

Extensive experiments demonstrate that ELMo representations work extremely well in practice. Adding ELMo alone to existing models for six diverse and challenging language understanding problems significantly improves the state of the art in every case, with relative error reductions of up to 20%. For tasks where direct comparisons are possible, ELMo outperforms CoVe, which computes contextualized representations using a neural machine translation encoder; an analysis of both shows that deep representations outperform those derived from just the top layer of an LSTM.

The paper also examines which biLM layers matter, showing that different layers encode different types of information. For example, prior work has found that introducing multi-task syntactic supervision at the lower levels of a deep LSTM can improve the overall performance of higher-level tasks such as dependency parsing or CCG supertagging. Adding ELMo representations from the pre-trained biLM to supervised models significantly improves performance on question answering, textual entailment, semantic role labeling, coreference resolution, and named entity recognition, and fine-tuning the biLM on domain-specific data yields further gains on downstream tasks. The paper concludes that ELMo provides significant improvements across a broad range of NLP tasks.
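For reference, the coupled language model objective mentioned above jointly maximizes the log-likelihood of a forward and a backward LSTM language model, sharing the token-representation and softmax parameters across directions, and ELMo then collapses the resulting layer activations into a single vector using task-specific softmax-normalized weights and a scalar. A brief restatement in the paper's notation (a summary of the published formulas, not a new derivation):

\[
\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \ldots, t_{k-1};\, \Theta_x, \overrightarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s)
+ \log p(t_k \mid t_{k+1}, \ldots, t_N;\, \Theta_x, \overleftarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s) \Big)
\]

\[
\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, \mathbf{h}_{k,j}^{LM}
\]

Here \(\mathbf{h}_{k,j}^{LM}\) is the layer-\(j\) biLM activation for token \(k\), \(s^{task}\) are softmax-normalized layer weights, and \(\gamma^{task}\) is a learned scalar.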
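To make the layer-mixing step concrete, below is a minimal PyTorch sketch, not the authors' released implementation: the module name (ScalarMix), tensor layout, and shapes are assumptions for illustration. It learns one softmax-normalized weight per biLM layer plus a global scale, which is exactly the weighting written above.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Hypothetical ELMo-style layer mixing: ELMo_k = gamma * sum_j softmax(s)_j * h_{k,j}."""

    def __init__(self, num_layers: int):
        super().__init__()
        # One unnormalized weight per biLM layer, plus a global scale gamma.
        self.scalar_weights = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_activations: torch.Tensor) -> torch.Tensor:
        # layer_activations: (num_layers, batch, seq_len, dim) -- assumed layout.
        weights = torch.softmax(self.scalar_weights, dim=0)            # (num_layers,)
        mixed = (weights.view(-1, 1, 1, 1) * layer_activations).sum(dim=0)
        return self.gamma * mixed                                      # (batch, seq_len, dim)

# Usage sketch with fake activations: a 2-layer biLSTM plus its token layer gives 3 layers.
layers = torch.randn(3, 8, 20, 1024)          # 3 layers, batch 8, 20 tokens, 1024 dims
elmo_vectors = ScalarMix(num_layers=3)(layers)  # (8, 20, 1024)
```

In the paper's setup, a two-layer biLSTM plus the context-insensitive token layer gives L + 1 = 3 layers to mix, and the resulting vector is concatenated with the task model's ordinary word embeddings before being fed to the task-specific architecture.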