July 28 - August 2, 2019 | Ganesh Jawahar, Benoît Sagot, Djamé Seddah
This paper investigates the structural information learned by BERT, a state-of-the-art language representation model, through a series of probing experiments. The authors demonstrate that BERT captures phrase-level information in its lower layers and encodes a rich hierarchy of linguistic features, with surface features in lower layers, syntactic features in middle layers, and semantic features in higher layers. They show that BERT requires deeper layers to handle long-distance dependency information, such as subject-verb agreement, and that its representations capture linguistic information compositionally, mimicking classical tree-like structures. The study also probes the interpretability of BERT's representations using Tensor Product Decomposition Networks (TPDNs) and concludes that BERT implicitly implements a tree-based compositionality scheme. The findings contribute to the understanding of BERT's success in diverse language understanding tasks as well as its limitations, providing insights for the design of improved architectures.
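To make the layer-wise probing idea concrete, here is a minimal sketch of how one might extract per-layer BERT representations and train a simple linear probe on each layer, assuming the HuggingFace `transformers` and scikit-learn libraries. The probing task, sentences, and labels below are purely illustrative and not the authors' exact experimental setup.

```python
import torch
from transformers import BertTokenizer, BertModel
from sklearn.linear_model import LogisticRegression

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def layer_representations(sentence):
    """Return one mean-pooled sentence vector per layer (embeddings + 12 layers)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.hidden_states: tuple of 13 tensors, each of shape [1, seq_len, 768]
    return [h.mean(dim=1).squeeze(0).numpy() for h in outputs.hidden_states]

# Toy probing data (hypothetical labels, e.g. plural vs. singular subject).
sentences = ["The keys to the cabinet are on the table.",
             "The key to the cabinets is on the table."]
labels = [1, 0]

# Train one linear probe per layer; comparing probe accuracies across layers
# indicates where a given linguistic property is most readily decodable.
per_layer = list(zip(*[layer_representations(s) for s in sentences]))
for layer_idx, feats in enumerate(per_layer):
    probe = LogisticRegression(max_iter=1000).fit(list(feats), labels)
    print(f"layer {layer_idx}: train accuracy = {probe.score(list(feats), labels):.2f}")
```

With a real probing dataset, the pattern reported in the paper would show surface properties peaking at lower layers and semantic properties at higher ones.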