11 Apr 2011 | Slav Petrov, Dipanjan Das, Ryan McDonald
The paper proposes a universal part-of-speech (POS) tagset consisting of twelve categories to facilitate unsupervised induction of syntactic structure and standardize best practices. The tagset includes NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, ‘.’ (punctuation marks), and X (catch-all). A mapping from 25 different treebank tagsets to this universal set is developed, enabling the creation of a dataset with common POS tags for 22 languages. The resource is made available for download and is used in two experiments: one comparing POS tagging accuracies across languages and another combining cross-lingual projection POS taggers with an unsupervised grammar induction system. The results show competitive accuracies in unsupervised grammar induction without gold standard POS tags.
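To make the idea of mapping a treebank-specific tagset onto the twelve universal categories concrete, here is a minimal Python sketch. The twelve tag names come from the summary above; everything else is an assumption made for illustration. In particular, `EXAMPLE_MAPPING` is a small hand-written fragment for Penn-Treebank-style tags, not the released mapping files, and `to_universal` with its fallback to X for unknown tags is a hypothetical helper rather than anything the paper specifies.

```python
# The twelve universal tags listed in the summary above.
UNIVERSAL_TAGS = frozenset(
    ["NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
     "ADP", "NUM", "CONJ", "PRT", ".", "X"]
)

# Hypothetical, hand-written fragment of a fine-to-coarse mapping for
# Penn-Treebank-style tags. Illustrative only; the released mapping files
# are the authoritative source.
EXAMPLE_MAPPING = {
    "NN": "NOUN", "NNS": "NOUN",
    "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "RB": "ADV",
    "PRP": "PRON", "DT": "DET",
    "IN": "ADP", "CD": "NUM",
    "CC": "CONJ", "RP": "PRT",
    ",": ".", ".": ".",
}


def to_universal(tagged_tokens, mapping=EXAMPLE_MAPPING, fallback="X"):
    """Convert (word, fine_tag) pairs into (word, universal_tag) pairs.

    Fine tags missing from the mapping fall back to the catch-all
    category X. That fallback is an assumption of this sketch.
    """
    converted = []
    for word, fine_tag in tagged_tokens:
        coarse_tag = mapping.get(fine_tag, fallback)
        if coarse_tag not in UNIVERSAL_TAGS:
            raise ValueError(f"{coarse_tag!r} is not a universal tag")
        converted.append((word, coarse_tag))
    return converted


sentence = [("The", "DT"), ("dogs", "NNS"), ("barked", "VBD"),
            ("loudly", "RB"), (".", ".")]
print(to_universal(sentence))
```

Because every treebank is reduced to the same twelve labels, the same downstream code (for example, an accuracy comparison across languages) can run unchanged on data from different treebanks once each has its own mapping table.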