This paper presents an unlexicalized probabilistic context-free grammar (PCFG) that achieves high parsing accuracy, surpassing early lexicalized models. The key insight is that simple, linguistically motivated state splits can substantially improve parsing accuracy by breaking false independence assumptions in a vanilla treebank grammar. The unlexicalized PCFG reaches an F1 of 86.36%, better than early lexicalized models and surprisingly close to the current state of the art. The result matters beyond raising the lower bound on what unlexicalized models can achieve: unlexicalized PCFGs are more compact, easier to replicate, and easier to interpret than complex lexicalized models, and their parsing algorithms are simpler, more widely understood, and of lower asymptotic complexity.
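For reference (a standard definition rather than something spelled out in the summary itself), the F1 figure cited here is the harmonic mean of labeled bracketing precision P and recall R, F1 = 2PR / (P + R), so a single number reflects both how many predicted constituents are correct and how many gold constituents are recovered.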
The paper situates this work in the history of PCFGs in natural language processing (NLP): early results on their utility were disappointing, and lexicalized PCFGs, influenced by the success of word n-gram models in speech recognition, came to be seen as the key to high-performance parsing. More recent results call that view into question: Johnson (1998) showed that simple structural annotation of a treebank grammar yields large gains, and Gildea (2001) found that bilexical dependencies contribute surprisingly little to a lexicalized parser's accuracy. The paper argues that the benefits of lexicalization have been overestimated and that unlexicalized models can achieve comparable performance with simpler, more interpretable structure.
The paper then describes a sequence of simple, linguistically motivated annotations that markedly improve an unlexicalized PCFG, including parent annotation, unary internal/external annotation, tag splitting, and head annotation. The best model, combining these annotations, reaches a final F1 of 86.36%. The paper also stresses the role of markovization in controlling sparsity: vertical markovization determines how much ancestor history each node's label encodes, while horizontal markovization determines how much sibling history is remembered when rules are binarized, so together they trade contextual sensitivity against data sparsity. A small sketch of the simplest such annotation follows.
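To make the idea of a structural state split concrete, here is a minimal sketch of parent annotation (the order-2 case of vertical markovization), in the spirit of Johnson (1998). The tuple-based tree format and the function name are illustrative assumptions for this summary, not the paper's actual implementation.

```python
# Minimal sketch of parent annotation: every nonterminal label is split by
# appending its parent's label, so an NP directly under S becomes NP^S.
# The tree encoding and function name here are assumptions for illustration.

def parent_annotate(tree, parent_label=None):
    """Return a copy of `tree` with each node's label split by its parent.

    Internal nodes are (label, children); leaves are plain word strings.
    """
    if isinstance(tree, str):              # leaf word: left unchanged
        return tree
    label, children = tree
    new_label = f"{label}^{parent_label}" if parent_label else label
    return (new_label, [parent_annotate(child, label) for child in children])

# (S (NP (DT the) (NN dog)) (VP (VBD barked)))
tree = ("S", [("NP", [("DT", ["the"]), ("NN", ["dog"])]),
              ("VP", [("VBD", ["barked"])])])
print(parent_annotate(tree))
# ('S', [('NP^S', [('DT^NP', ['the']), ('NN^NP', ['dog'])]),
#        ('VP^S', [('VBD^VP', ['barked'])])])
```

Reading grammar rules off trees annotated this way weakens the independence assumptions of the raw PCFG: the expansion probabilities of NP^S can now differ from those of NP^VP, capturing distinctions such as subject versus object noun phrases. Horizontal markovization works in the opposite direction, forgetting distant sibling context when rules are binarized so that the split grammar does not become too sparse to estimate.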
The paper concludes that unlexicalized grammars are not only easier to estimate and parse, but also cheaper in time and space. While a raw treebank PCFG performs poorly, the paper shows that with appropriate annotation an unlexicalized model can match the performance of early lexicalized parsers. Lexicalization remains valuable for some purposes, but unlexicalized models offer a simpler, more interpretable alternative.