Learning Accurate, Compact, and Interpretable Tree Annotation


July 2006 | Slav Petrov, Leon Barrett, Romain Thibaux, Dan Klein
This paper presents an automatic method for tree annotation that alternately splits and merges nonterminal symbols to maximize the likelihood of the training treebank. Starting from a simple X-bar grammar, the method learns a refined grammar whose nonterminals are subsymbols of the original ones. Unlike previous work, this approach allows varying degrees of splitting depending on the complexity of the data: a split-merge cycle adaptively allocates subsymbols where they are most effective and merges back annotations that do not contribute significantly to accuracy. Empirically, hierarchical splitting increases accuracy and reduces the variance of the learned grammars, and smoothing the split models prevents overfitting, allowing more effective splitting before the oversplitting effect sets in.

The learned grammar has 1,043 symbols and achieves an F1 score of 90.2% on the Penn Treebank, higher than fully lexicalized systems and a 27% reduction in error over previous work. The resulting grammars are also far more compact than those of previous automatic annotation methods, and they are human-interpretable: they recover many of the annotations introduced manually in previous work while also learning other linguistic phenomena. The approach thus combines the strengths of manual and automatic annotation, and the resulting parser is competitive with the best lexicalized parsers in both accuracy and efficiency.
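The split-merge cycle described above can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: a "grammar" is reduced to a dict mapping each original nonterminal to its current subsymbols, the EM step that fits subsymbol rule probabilities is elided, and the likelihood-loss estimate used to rank splits is a placeholder parameter.

```python
# Hypothetical sketch of the split-merge cycle (not the authors' code).
# A "grammar" maps each original nonterminal to its current subsymbols;
# EM training of rule probabilities is omitted, and likelihood_loss is
# a stand-in for the paper's estimate of the loss from undoing a split.

def split(grammar):
    """Split every subsymbol in two (hierarchical binary splitting)."""
    return {sym: [s + suffix for s in subs for suffix in ("-0", "-1")]
            for sym, subs in grammar.items()}

def merge(grammar, likelihood_loss, keep_fraction=0.5):
    """Undo the splits whose estimated likelihood loss is smallest,
    keeping only the most useful fraction of the new subsymbols."""
    merged = {}
    for sym, subs in grammar.items():
        parents = sorted({s.rsplit("-", 1)[0] for s in subs})
        ranked = sorted(parents, key=likelihood_loss, reverse=True)
        keep = set(ranked[:max(1, int(len(ranked) * keep_fraction))])
        merged[sym] = ([s for s in subs if s.rsplit("-", 1)[0] in keep]
                       + [p for p in parents if p not in keep])
    return merged

grammar = {"NP": ["NP"]}
for cycle in range(3):
    grammar = split(grammar)
    # run_em(grammar, treebank)  # fit subsymbol rule probabilities (omitted)
    grammar = merge(grammar, likelihood_loss=len)  # placeholder loss function
```

After three cycles this toy run keeps 4 subsymbols of NP rather than the 2³ = 8 that uniform splitting would produce, which is the sense in which the merge step keeps the grammar compact.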
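The smoothing mentioned above can be illustrated as shrinkage of each subsymbol's rule probability toward the mean over its sibling subsymbols; the interpolation form and the weight `alpha` here are illustrative assumptions, not details taken from this summary.

```python
# Illustrative smoothing of subsymbol rule probabilities (assumption:
# linear interpolation toward the mean over sibling subsymbols; the
# default alpha is an arbitrary illustrative value).

def smooth(rule_probs, alpha=0.01):
    """rule_probs: probabilities of the same rule under each subsymbol
    of one original symbol. Each is shrunk toward their mean, so that
    rarely observed subsymbols fall back on pooled statistics instead
    of overfitting their few training occurrences."""
    mean = sum(rule_probs) / len(rule_probs)
    return [(1 - alpha) * p + alpha * mean for p in rule_probs]
```

With a heavy weight such as `alpha=0.5`, two subsymbols with probabilities 1.0 and 0.0 are pulled to 0.75 and 0.25, showing how smoothing limits how far sibling subsymbols can drift apart.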