Head-Driven Statistical Models for Natural Language Parsing

2003 | Michael Collins
This article presents three statistical models for natural language parsing that extend probabilistic context-free grammars (PCFGs) to lexicalized grammars. The models represent parse trees as sequences of decisions in a head-centered, top-down derivation, and independence assumptions lead to parameters that encode the X-bar schema, subcategorization, ordering of complements, placement of adjuncts, lexical dependencies, wh-movement, and preferences for close attachment. All of these preferences are expressed as probabilities conditioned on lexical heads, making these head-driven statistical models.

Evaluated on the Penn Wall Street Journal Treebank, the models show accuracy competitive with other contemporary parsers. Model 1 achieves 87.7% constituent precision and 87.5% recall on sentences of up to 100 words; Models 2 and 3 improve these figures to 88.3% precision and 88.0% recall. Model 2 adds parameters for subcategorization frames, while Model 3 gives a probabilistic treatment of wh-movement based on analyses from generalized phrase structure grammar (GPSG).

After background on PCFGs and lexicalized PCFGs, the article describes the three models in detail. Model 1 breaks the generation of each rule down into smaller steps and incorporates distance features that capture preferences for close attachment. Model 2 distinguishes complements from adjuncts, improving parsing accuracy. Model 3 handles wh-movement by passing slash features through the parse tree.
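To make the head-driven decomposition concrete: in Model 1 the head child is generated first, and the left and right modifiers are then generated outward from the head, with each modifier sequence terminated by a STOP symbol. Roughly following the paper's notation (P is the parent nonterminal with headword h, H is the head child, and the Δ terms are the distance features), the probability of a rule P(h) → L_n(l_n) … L_1(l_1) H(h) R_1(r_1) … R_m(r_m) factors as

```latex
P(\mathrm{rule}) =
  \mathcal{P}_h(H \mid P, h)
  \times \prod_{i=1}^{n+1} \mathcal{P}_l\bigl(L_i(l_i) \mid P, h, H, \Delta_l(i-1)\bigr)
  \times \prod_{j=1}^{m+1} \mathcal{P}_r\bigl(R_j(r_j) \mid P, h, H, \Delta_r(j-1)\bigr)
```

where L_{n+1}(l_{n+1}) = R_{m+1}(r_{m+1}) = STOP. Because each factor conditions on only a few symbols, its parameters can be estimated from treebank counts far more reliably than whole-rule probabilities.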
The article also addresses practical issues, including parameter estimation, the handling of unknown words, and the parsing algorithm, and it describes refinements for special cases such as nonrecursive NPs, coordination, punctuation, and sentences with empty subjects. Overall, the models are shown to perform well on the Penn Treebank, improving parsing accuracy while capturing complex linguistic phenomena.
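As an illustration of the parameter-estimation issue mentioned above, the sketch below shows interpolated, backed-off relative-frequency estimation of the kind such lexicalized models rely on: each parameter is estimated at several levels of context specificity, and the levels are mixed with count-derived confidence weights. This is a minimal sketch, not the paper's implementation; in particular, the Witten-Bell-style weight n/(n + 5u) and the names BackedOffEstimator, observe, and estimate are illustrative assumptions.

```python
from collections import defaultdict

class BackedOffEstimator:
    """Interpolated relative-frequency estimation over backoff levels.

    A minimal sketch in the spirit of the paper's backed-off estimation,
    NOT its exact smoothing: the weight lam = n / (n + 5*u) below is a
    Witten-Bell-style assumption, where n is the context count and u is
    the number of distinct outcomes seen with that context."""

    def __init__(self, num_levels):
        self.num_levels = num_levels
        self.context_counts = [defaultdict(int) for _ in range(num_levels)]
        self.joint_counts = [defaultdict(int) for _ in range(num_levels)]
        self.outcomes_seen = [defaultdict(set) for _ in range(num_levels)]

    def observe(self, outcome, contexts):
        # contexts: one tuple per backoff level, most specific first.
        for level, ctx in enumerate(contexts):
            self.context_counts[level][ctx] += 1
            self.joint_counts[level][(ctx, outcome)] += 1
            self.outcomes_seen[level][ctx].add(outcome)

    def estimate(self, outcome, contexts):
        # Combine levels as lam1*e1 + (1-lam1)*(lam2*e2 + (1-lam2)*e3) ...
        prob, remaining = 0.0, 1.0
        for level, ctx in enumerate(contexts):
            n = self.context_counts[level][ctx]
            u = len(self.outcomes_seen[level][ctx])
            e = self.joint_counts[level][(ctx, outcome)] / n if n else 0.0
            lam = n / (n + 5.0 * u) if n else 0.0
            if level == len(contexts) - 1:
                lam = 1.0  # back off no further; last level takes the remaining mass
            prob += remaining * lam * e
            remaining *= 1.0 - lam
        return prob

# Hypothetical usage for the head-child parameter P_h(H | parent, headword, headtag),
# backing off from the full context to (parent, headtag) and then (parent,) alone.
est = BackedOffEstimator(num_levels=3)
ctxs = [("S", "bought", "VBD"), ("S", "VBD"), ("S",)]
est.observe("VP", ctxs)
print(est.estimate("VP", ctxs))  # 1.0 here: "VP" is the only outcome observed
```

The design point is the one the article makes: specific contexts (e.g., conditioning on the headword) are accurate but sparse, while coarser contexts are dense but less informative, so the estimator trusts each level in proportion to how much evidence it has.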