[slides] Universal Dependencies v1: A Multilingual Treebank Collection

The Universal Dependencies (UD) project aims to create cross-linguistically consistent treebank annotations for many languages within a dependency-based lexicalist framework. This paper describes version 1 of the universal guidelines, the underlying design principles, and the currently available treebanks for 33 languages. UD addresses the problem of inconsistent annotation schemes across languages, which has hindered comparative evaluation and cross-lingual learning experiments. A unified annotation framework is meant to support both multilingual natural language processing and comparative linguistic studies.

The project is a merger of several initiatives, including the Stanford dependencies, the Google universal tag set, and the Interset interlingua for morphosyntactic tag sets. The resulting framework is based on common usage and existing de facto standards, and is intended to replace all previous versions with a single coherent standard. The guidelines rest on two ideas: dependency, which is widely used in contemporary NLP, and lexicalism, the view that words are the basic units of grammatical annotation. The relation inventory includes a rich taxonomy of noun dependents, as well as relations for phenomena that appear in non-edited or informal text, and language-specific subtypes are allowed for special phenomena in individual languages. The project also provides the CoNLL-U format, tools for reading and validating it, and annotation visualizations.

Treebanks have been released for 33 languages; the latest version (v1.2) contains 37 treebanks. The project aims to continue releasing treebanks twice a year to keep up its momentum. In the near future, the main priority is to improve the consistency and completeness of the annotations for all languages, while also expanding the sample of languages and welcoming new contributors. As a medium-term goal, an improved version of the universal guidelines is envisaged, based on an analysis of the issues that have arisen in the work on improving consistency across languages. Ideally, the next version of the guidelines should also cover the enhanced dependencies. In parallel with the development of guidelines and annotated corpora, the project hopes to release tools for tokenization, morphological analysis and syntactic parsing for all languages, as well as large-scale parsebanks (automatically parsed corpora).
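
As a rough illustration of the CoNLL-U format mentioned above, the following Python sketch reads a small hand-written sentence and lists its dependency edges. It assumes the usual CoNLL-U layout: ten tab-separated columns per word, comment lines starting with `#`, and a blank line after each sentence. The names `Token`, `parse_conllu` and `dependency_edges`, as well as the sample sentence, are hypothetical and are not part of the UD tools; this is a minimal reader, not a validator.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    """One word line of a CoNLL-U file (ten tab-separated columns)."""
    id: str
    form: str
    lemma: str
    upostag: str
    xpostag: str
    feats: str
    head: str
    deprel: str
    deps: str
    misc: str


def parse_conllu(text: str) -> List[List[Token]]:
    """Split CoNLL-U text into sentences, each a list of Token rows."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Comment / metadata line; ignored in this sketch.
            continue
        fields = line.split("\t")
        if len(fields) != 10:
            raise ValueError(f"expected 10 columns, got {len(fields)}: {line!r}")
        current.append(Token(*fields))
    if current:
        sentences.append(current)
    return sentences


def dependency_edges(sentence: List[Token]) -> List[Tuple[str, str, str]]:
    """Return (dependent form, relation, head form) for each syntactic word."""
    forms = {tok.id: tok.form for tok in sentence}
    edges = []
    for tok in sentence:
        if not tok.id.isdigit():
            # Skip multiword-token ranges such as "1-2", which carry no head.
            continue
        head_form = "ROOT" if tok.head == "0" else forms[tok.head]
        edges.append((tok.form, tok.deprel, head_form))
    return edges


# Hand-written sample for illustration only, not taken from a released treebank.
SAMPLE = "\n".join([
    "# illustrative sentence",
    "\t".join(["1", "Dogs", "dog", "NOUN", "_", "Number=Plur", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "bark", "bark", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
])

if __name__ == "__main__":
    for sentence in parse_conllu(SAMPLE):
        for dependent, relation, head in dependency_edges(sentence):
            print(f"{dependent} --{relation}--> {head}")
```

Keeping every column as a string keeps the sketch short; a fuller reader would split the FEATS column into attribute-value pairs and check tags and relations against the universal inventories.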

Universal Dependencies v1: A Multilingual Treebank Collection

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, Daniel Zeman