10 Mar 2017 | Timothy Dozat, Christopher D. Manning
This paper presents a neural dependency parser based on deep biaffine attention, achieving state-of-the-art performance on six languages. The parser uses biaffine classifiers for arc and label prediction, with a larger but more thoroughly regularized network than previous BiLSTM-based approaches. On the English Penn Treebank (PTB) it outperforms Kiperwasser & Goldberg (2016) by 1.8% UAS and 2.2% LAS, reaching 95.7% UAS and 94.1% LAS, comparable to the best transition-based parser (Kuncoro et al., 2016). The results also show that hyperparameter choices significantly affect parsing accuracy.
The paper contrasts transition-based and graph-based parsers. Transition-based parsers process a sentence sequentially, using a classifier to predict the next transition action. Graph-based parsers instead use machine learning to score candidate edges between words and then construct a maximum spanning tree over those scores. Kiperwasser & Goldberg (2016) built a neural graph-based parser with an attention mechanism similar to that used in neural machine translation; Hashimoto et al. (2016) used a bilinear attention mechanism, while Cheng et al. (2016) proposed a graph-based parser that attempts to condition arc scores on previous parsing decisions.
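To make the graph-based formulation concrete, here is a minimal NumPy sketch (illustrative only, not the authors' code; the matrix layout and function name are assumptions). It picks the highest-scoring head for every word from a matrix of edge scores; a real parser would decode with a maximum-spanning-tree algorithm such as Chu-Liu/Edmonds to guarantee a well-formed tree.

```python
import numpy as np

def greedy_heads(arc_scores: np.ndarray) -> np.ndarray:
    """Pick the highest-scoring head for each word.

    arc_scores[i, j] is the score of word i being the head of word j;
    index 0 is the artificial ROOT token, which never receives a head.
    Greedy argmax can produce cycles, so graph-based parsers normally
    decode with a maximum-spanning-tree algorithm (Chu-Liu/Edmonds).
    """
    return arc_scores[:, 1:].argmax(axis=0)

# Toy example: 3 words plus ROOT, random edge scores.
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))
print(greedy_heads(scores))  # predicted head index for each of the 3 words
```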
The proposed parser modifies the graph-based architectures of Kiperwasser & Goldberg (2016), Hashimoto et al. (2016), and Cheng et al. (2016) by using biaffine attention in place of bilinear or traditional MLP-based attention. It also uses a biaffine dependency label classifier and applies dimension-reducing MLPs to each recurrent output vector before the biaffine transformation. The dimension reduction makes the model more efficient, and the biaffine scorer directly models both the prior probability of a word receiving any dependents and the likelihood of it receiving a specific word as a dependent.
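As a rough sketch of the biaffine arc scorer (a NumPy illustration under assumed names and dimensions, not the authors' released code): each BiLSTM output vector is passed through two small ReLU MLPs to obtain a "head" view and a "dependent" view of the word, and the arc score combines a bilinear term with a head-only bias term that captures how likely each word is to receive dependents at all.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def biaffine_arc_scores(R, W_head, b_head, W_dep, b_dep, U, u):
    """Compute an (n x n) matrix of arc scores from BiLSTM outputs R (n x d).

    H_head[i] / H_dep[j] are dimension-reduced views of a word acting as a
    head or as a dependent. Score of head i for dependent j:
        s[i, j] = H_head[i] @ U @ H_dep[j] + H_head[i] @ u
    The second term depends only on the head, modeling its prior
    probability of receiving dependents.
    """
    H_head = relu(R @ W_head + b_head)   # (n, k)
    H_dep = relu(R @ W_dep + b_dep)      # (n, k)
    bilinear = H_head @ U @ H_dep.T      # (n, n)
    head_bias = (H_head @ u)[:, None]    # (n, 1), broadcast over dependents
    return bilinear + head_bias

# Toy dimensions: 5 words, 800-dim BiLSTM outputs, 500-dim arc MLPs.
n, d, k = 5, 800, 500
rng = np.random.default_rng(1)
R = rng.normal(size=(n, d))
W_head, W_dep = rng.normal(size=(d, k)) * 0.01, rng.normal(size=(d, k)) * 0.01
b_head, b_dep = np.zeros(k), np.zeros(k)
U, u = rng.normal(size=(k, k)) * 0.01, rng.normal(size=k) * 0.01
print(biaffine_arc_scores(R, W_head, b_head, W_dep, b_dep, U, u).shape)  # (5, 5)
```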
The parser uses 100-dimensional word and POS tag vectors, three BiLSTM layers, and 500- and 100-dimensional ReLU MLP layers for the arc and label classifiers, respectively. Dropout is applied at every stage of the model to prevent overfitting, and the network is trained with Adam and an annealed learning rate for about 50,000 steps (the key values are collected in the configuration sketch below). The parser achieves high accuracy on the PTB-SD 3.5.0 dataset, outperforming other graph-based parsers and matching the best transition-based parser. As a graph-based model it also handles non-projective dependencies, which are challenging for transition-based parsers. The paper concludes that the proposed parser is a significant step forward in neural dependency parsing, achieving high performance with a simple architecture, and that future work will focus on improving the handling of out-of-vocabulary tokens in morphologically rich languages.
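For reference, the hyperparameters described above can be collected into a simple configuration sketch (values follow this summary only; anything not stated here, such as the BiLSTM hidden size, would need to be checked against the paper).

```python
# Hyperparameters as reported in the summary above; a sketch of a config,
# not the authors' released configuration file.
PARSER_CONFIG = {
    "word_embedding_dim": 100,   # pretrained word vectors
    "pos_embedding_dim": 100,    # POS tag vectors
    "num_bilstm_layers": 3,
    "arc_mlp_dim": 500,          # ReLU MLP feeding the arc biaffine scorer
    "label_mlp_dim": 100,        # ReLU MLP feeding the label biaffine scorer
    "dropout": "applied at every stage of the model",
    "optimizer": "Adam with an annealed learning rate",
    "training_steps": 50_000,
}
```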