Building a Large Annotated Corpus of English: The Penn Treebank

1993 | Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz
The Penn Treebank is a large annotated corpus of American English comprising over 4.5 million words. It was developed by researchers at the University of Pennsylvania and Northwestern University, and it is annotated for part-of-speech (POS) information and skeletal syntactic structure.

For POS tagging, the project designed a simplified tagset that reduces redundancy, which makes tagging faster and more consistent. Each text was first tagged automatically, and annotators then corrected the output by hand. In the authors' comparison, this semi-automated approach outperformed tagging from scratch on speed, consistency, and accuracy: fully manual tagging took about twice as long, with roughly twice the inter-annotator disagreement rate and an error rate about 50% higher.

Bracketing followed the same pattern. The text was parsed automatically, the parser output was simplified into a skeletal syntactic representation, and human annotators corrected the result. This task was more complex than tagging, because annotators had to identify and repair syntactic structures rather than individual tags.

The corpus has supported a range of research, including the development of statistical models, work on formal theories of grammar, the evaluation of parsing models, the bootstrapping of lexicons, and the study of linguistic phenomena. It is available to members of the Linguistic Data Consortium. At the time of publication, the authors planned to continue the project, with a focus on a richer analysis of the corpus and a parallel corpus of predicate-argument structures. The Penn Treebank has since been widely used and has contributed significantly to natural language processing and computational linguistics.
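
To make the two annotation layers concrete, the following is a minimal Python sketch of how a skeletal bracketing with POS tags might be read. It assumes a simplified format in which each tree has a single labeled root and every word sits under exactly one POS tag; the files actually distributed with the corpus may differ in detail. The example sentence and the helper names (`tokenize`, `parse`, `tagged_words`) are hypothetical and are not taken from the paper or the corpus.

```python
def tokenize(text):
    """Split a bracketed string into parentheses, labels, and words."""
    # Pad parentheses with spaces so split() separates them from other tokens.
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Parse one bracketed constituent into a (label, children) tuple."""
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())      # nested constituent
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def tagged_words(tree):
    """Collect (word, POS tag) pairs from the leaves, left to right."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(tagged_words(child))
    return pairs


# Hypothetical example sentence, invented for illustration only.
EXAMPLE = ("(S (NP (DT The) (NN corpus)) "
           "(VP (VBZ contains) (NP (JJ annotated) (NNS sentences))))")

tree = parse(tokenize(EXAMPLE))
pairs = tagged_words(tree)
print(pairs)
```

The nested tuples carry the syntactic layer, while `tagged_words` recovers the POS layer from the same structure, which mirrors how the bracketing in the corpus is built on top of the tagged text.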