June 2 - June 7, 2019 | John Hewitt, Christopher D. Manning
This paper introduces a *structural probe* to evaluate whether syntax trees are embedded in the word representation space of neural networks. The probe identifies a linear transformation under which squared L2 distance encodes the distance between words in the parse tree and squared L2 norm encodes each word's depth in the parse tree. The authors demonstrate that such transformations exist for both ELMo and BERT but not for baselines, indicating that entire syntax trees are implicitly embedded in the deep models' vector geometry.

The structural probe tests whether a neural network embeds each sentence's dependency parse tree in its contextual word representations. It learns a single linear transformation of the word representation space that embeds parse trees across all sentences, capturing both the distances and the norms that encode syntax tree structure. Experiments show that ELMo and BERT embed parse trees with high consistency, while baselines fail to do so. The study also explores the rank of the linear transformation required to encode syntax, finding that a low rank suffices for both models. The results suggest that syntax tree structure emerges through properly defined distances and norms in the word representation spaces of deep models.
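The distance half of the probe can be sketched as follows. This is a minimal PyTorch illustration under the paper's description, not the authors' released implementation: the class and function names, the initialization scale, and the `model_dim`/`probe_rank` hyperparameters are assumptions, while the squared-L2 parametrization and the L1 objective normalized by squared sentence length follow the formulation summarized above.

```python
import torch
import torch.nn as nn


class StructuralDistanceProbe(nn.Module):
    """Learns B so that ||B(h_i - h_j)||^2 approximates the parse-tree
    distance between words w_i and w_j (hypothetical class name)."""

    def __init__(self, model_dim: int, probe_rank: int):
        super().__init__()
        # B maps the model's representation space to a (possibly low-rank) probe space.
        self.B = nn.Parameter(torch.randn(probe_rank, model_dim) * 0.01)

    def forward(self, reps: torch.Tensor) -> torch.Tensor:
        # reps: (seq_len, model_dim) contextual representations of one sentence.
        transformed = reps @ self.B.T                      # (seq_len, probe_rank)
        diffs = transformed.unsqueeze(1) - transformed.unsqueeze(0)
        return (diffs ** 2).sum(dim=-1)                    # (seq_len, seq_len) squared distances


def distance_probe_loss(pred_dists: torch.Tensor, tree_dists: torch.Tensor) -> torch.Tensor:
    # L1 gap between predicted squared distances and gold tree distances,
    # normalized by the squared sentence length.
    n = pred_dists.size(0)
    return torch.abs(pred_dists - tree_dists).sum() / (n * n)
```

In use, `reps` would come from a frozen layer of ELMo or BERT and `tree_dists` from gold dependency parses; only the matrix B is trained, so the probe itself stays deliberately simple.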
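The depth half works the same way on single words: the squared norm of the transformed representation stands in for the word's depth in the parse tree. Again a hedged sketch with hypothetical names; a small `probe_rank` reflects the finding, noted above, that a low-rank transformation suffices.

```python
class StructuralDepthProbe(nn.Module):
    """Learns B so that ||B h_i||^2 approximates the depth of w_i in the parse tree."""

    def __init__(self, model_dim: int, probe_rank: int):
        super().__init__()
        self.B = nn.Parameter(torch.randn(probe_rank, model_dim) * 0.01)

    def forward(self, reps: torch.Tensor) -> torch.Tensor:
        transformed = reps @ self.B.T           # (seq_len, probe_rank)
        return (transformed ** 2).sum(dim=-1)   # (seq_len,) squared norms ~ tree depths


def depth_probe_loss(pred_depths: torch.Tensor, tree_depths: torch.Tensor) -> torch.Tensor:
    # L1 gap between predicted squared norms and gold tree depths, normalized by length.
    n = pred_depths.size(0)
    return torch.abs(pred_depths - tree_depths).sum() / n
```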