1993 | Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz
The paper introduces the Penn Treebank, a large annotated corpus of American English consisting of over 4.5 million words. The corpus is annotated for part-of-speech (POS) information and skeletal syntactic structure. The authors discuss the design of the POS tagset, which is based on the Brown Corpus tagset but simplified to reduce redundancy and improve consistency. They describe a two-stage tagging process: initial automatic tagging using PARTS, followed by manual correction by human annotators. Their results show that this semi-automated process is superior to fully manual tagging in speed, consistency, and accuracy. For the bracketing task, the authors use a deterministic parser, Fidditch, to provide an initial parse, which annotators then simplify and correct. The paper also outlines the syntactic tagset used and discusses the methodology and challenges of the bracketing process.
The Penn Treebank has been widely used in various research projects, and the authors plan to enrich the annotation scheme to address limitations and provide a richer analysis of the corpus.
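To make the two annotation layers concrete, the following is a minimal sketch of how a bracketed skeletal parse with POS tags might be read back into a nested structure. The example sentence, the helper names (`tokenize`, `parse`, `tagged_words`), and the exact bracket layout are assumptions chosen for illustration; they are not taken from the paper or from the corpus distribution, although `DT`, `NN`, and `VBD` are tags in the Penn Treebank POS tagset.

```python
from typing import List, Tuple, Union

# A tree is either a leaf token (str) or a list: [label, child, child, ...].
Tree = Union[str, list]


def tokenize(text: str) -> List[str]:
    """Split a bracketed string into '(' , ')' and symbol tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens: List[str]) -> Tree:
    """Parse one bracketed constituent into nested lists."""
    pos = 0

    def read() -> Tree:
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok  # a label or a word
        node = []
        while tokens[pos] != ")":
            node.append(read())
        pos += 1  # consume the closing ')'
        return node

    tree = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the first constituent")
    return tree


def tagged_words(tree: Tree) -> List[Tuple[str, str]]:
    """Collect (word, POS tag) pairs from preterminal nodes of the form [TAG, word]."""
    if not isinstance(tree, list):
        return []
    if len(tree) == 2 and all(isinstance(x, str) for x in tree):
        return [(tree[1], tree[0])]
    pairs: List[Tuple[str, str]] = []
    for child in tree[1:]:  # skip the constituent label at index 0
        pairs.extend(tagged_words(child))
    return pairs


# Hypothetical example sentence, not drawn from the corpus.
example = "(S (NP (DT The) (NN dog)) (VP (VBD barked)))"
tree = parse(tokenize(example))
print(tagged_words(tree))
```

In this sketch the phrase labels (`S`, `NP`, `VP`) carry the skeletal syntactic structure, while the innermost two-element nodes carry the POS layer, so the same bracketed string serves both annotation levels.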