A Universal Part-of-Speech Tagset

A universal part-of-speech (POS) tagset has been proposed to facilitate future research in unsupervised induction of syntactic structure and to standardize best practices. The tagset consists of twelve universal POS categories: NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, . (punctuation), and X (a catch-all for other categories). A mapping from 25 different treebank tagsets to this universal set has also been developed. Combined with the original treebank data, the tagset and mapping yield a dataset of common parts of speech for 22 different languages. Both are available for download at http://code.google.com/p/universal-pos-tags/.

The resource serves several purposes: developing and evaluating unsupervised and cross-lingual taggers, comparing the accuracy of supervised taggers across languages on a more equal footing, and letting language technology practitioners train POS taggers with a common tagset across multiple languages. Two experiments were conducted to demonstrate its usefulness. First, a language comparison was performed by training a supervised POS tagging model on all treebanks and evaluating tagging accuracy on the universal tagset. Second, the universal POS tags were used as the starting point for unsupervised grammar induction, producing completely unsupervised parsers for several languages. The results showed that the universal POS categories generalize well across language boundaries, giving competitive parsing accuracies without relying on gold POS tags.
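To make the idea of a tagset mapping concrete, the following is a minimal Python sketch of converting treebank-specific tags to the universal categories. The dictionary is a small illustrative subset of Penn Treebank tags and is an assumption of this example, not the published mapping file; the entries should be checked against the files at the download URL. The function name `to_universal` and the choice to fall back to X for unmapped tags are likewise hypothetical.

```python
# Illustrative subset of a Penn Treebank -> universal tag mapping.
# Assumption: these entries are for demonstration only; consult the
# downloadable mapping files for the authoritative version.
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB", "MD": "VERB",
    "JJ": "ADJ",
    "RB": "ADV",
    "PRP": "PRON",
    "DT": "DET",
    "IN": "ADP",
    "CD": "NUM",
    "CC": "CONJ",
    "RP": "PRT", "TO": "PRT",
    ".": ".", ",": ".",
}


def to_universal(tagged_tokens, mapping=PTB_TO_UNIVERSAL, unknown="X"):
    """Replace each fine-grained tag with its universal category.

    Tags missing from the mapping fall back to `unknown` (a choice made
    for this sketch, not necessarily how the released resource behaves).
    """
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged_tokens]


sentence = [("The", "DT"), ("dog", "NN"), ("barked", "VBD"),
            ("loudly", "RB"), (".", ".")]
print(to_universal(sentence))
```

Because every treebank is projected onto the same twelve labels, tagger output for different languages can be scored against one shared inventory, which is what makes the cross-language comparison described above possible.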

11 Apr 2011 | Slav Petrov, Dipanjan Das, Ryan McDonald