Universal Dependencies v1: A Multilingual Treebank Collection


2016 | Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, Daniel Zeman
The Universal Dependencies (UD) project aims to develop cross-linguistically consistent treebank annotation for many languages, in support of multilingual parsing research and practical NLP systems. The UD v1 release of November 2015 includes 37 treebanks for 33 languages, annotated with parts of speech and dependency relations. Annotation has two layers. The morphological layer combines part-of-speech tags based on the Google universal tag set, which grew out of cross-linguistic error analysis, with morphological features drawn from the Interset system. The syntactic layer is adapted from Stanford Dependencies. The UD guidelines follow a lexicalist approach: words are the basic units of grammatical annotation, each carrying morphological properties and entering into syntactic relations with other words. The project also provides for enhanced dependency representations and language-specific relation subtypes to capture phenomena particular to individual languages. The data is encoded in the CoNLL-U format, and tools are available for reading and validating the annotations. The UD project plans to release treebanks twice a year and to improve the consistency and completeness of annotation across languages.
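To make the CoNLL-U encoding mentioned above more concrete, here is a minimal sketch of a reader in Python. It assumes the standard layout of ten tab-separated columns per word, `#` comment lines, and a blank line between sentences. The function names (`read_conllu`, `parse_feats`, `dependency_edges`) and the three-word sample sentence are illustrative inventions, not part of the official UD tools or of any released treebank.

```python
from typing import Dict, Iterator, List, Tuple

# The ten CoNLL-U columns, using the v1 column names.
FIELDS = ["id", "form", "lemma", "upostag", "xpostag",
          "feats", "head", "deprel", "deps", "misc"]

Token = Dict[str, str]


def read_conllu(text: str) -> Iterator[List[Token]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Token] = []
    for line in text.splitlines():
        if not line.strip():           # a blank line ends the sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):       # sentence-level comment line
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:                       # input may lack a trailing blank line
        yield sentence


def parse_feats(feats: str) -> Dict[str, str]:
    """Turn 'Number=Sing|Person=3' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def dependency_edges(sentence: List[Token]) -> List[Tuple[str, str, str]]:
    """Return (head form, relation, dependent form) triples.

    Multiword-token range lines such as '1-2' are skipped because
    they carry no head of their own.
    """
    words = {tok["id"]: tok for tok in sentence if tok["id"].isdigit()}
    edges = []
    for tok in words.values():
        head = "ROOT" if tok["head"] == "0" else words[tok["head"]]["form"]
        edges.append((head, tok["deprel"], tok["form"]))
    return edges


# Hypothetical sample sentence, written by hand for illustration.
SAMPLE = "\n".join([
    "# sent_id = example-1",
    "1\tThe\tthe\tDET\t_\tDefinite=Def|PronType=Art\t2\tdet\t_\t_",
    "2\tdog\tdog\tNOUN\t_\tNumber=Sing\t3\tnsubj\t_\t_",
    "3\tbarks\tbark\tVERB\t_\tNumber=Sing|Person=3|Tense=Pres\t0\troot\t_\t_",
    "",
])

if __name__ == "__main__":
    for sent in read_conllu(SAMPLE):
        for head, rel, dep in dependency_edges(sent):
            print(f"{rel}({head}, {dep})")
```

Each word line carries both annotation layers: the tag and feature columns hold the morphological layer, while the head and relation columns hold the syntactic layer. A real pipeline would more likely rely on the project's own validation tooling than on a hand-rolled reader like this one.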