06 October 2018 | Kenneth Benoit, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo
Quanteda is an R package for quantitative analysis of textual data, offering a comprehensive workflow and toolkit for natural language processing (NLP) tasks such as corpus management, tokenization, analysis, and visualization. It provides extensive functions for dictionary analysis, keyword-in-context exploration, document and feature similarity computation, and multi-word expression discovery through collocation scoring. Based on sparse operations, quanteda is highly efficient for compiling document-feature matrices and manipulating them for further analysis. It is faster and more efficient than other R and Python packages in processing large textual data.
The package is designed for R users needing to apply NLP to texts, from documents to final analysis. Its capabilities match or exceed those of many end-user software applications, many of which are expensive and not open source. Quanteda is of great benefit to researchers, students, and analysts with fewer financial resources. It is designed to enable powerful, efficient analysis with a minimum of steps, and emphasizes consistent design to lower the barriers to learning and using NLP and quantitative text analysis even for proficient R programmers.
Quanteda makes it easy to manage texts in the form of a "corpus", which includes document-level variables and metadata. It allows users to segment texts by words, paragraphs, sentences, or user-supplied delimiters and tags, group them into larger documents by document-level variables, or subset them based on logical conditions or combinations of document-level variables.
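The corpus operations described above (subsetting on document-level variables, grouping into larger documents, and segmenting into sentences) can be sketched in a few lines. This is a conceptual illustration in Python, not quanteda's R API: the `Document` class, the helper names, the example texts, and the naive sentence splitter are all assumptions made for the sake of the example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    """A text plus its document-level variables (hypothetical structure)."""
    text: str
    docvars: dict = field(default_factory=dict)


def subset(corpus, predicate):
    # Keep only documents whose document-level variables satisfy the condition.
    return [doc for doc in corpus if predicate(doc.docvars)]


def group(corpus, key):
    # Concatenate the texts of all documents sharing the same value of `key`.
    grouped = {}
    for doc in corpus:
        grouped.setdefault(doc.docvars[key], []).append(doc.text)
    return {value: " ".join(texts) for value, texts in grouped.items()}


def segment_sentences(doc):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    # Each sentence becomes a new document that inherits the parent's variables.
    parts = re.split(r"(?<=[.!?])\s+", doc.text.strip())
    return [Document(part, dict(doc.docvars)) for part in parts if part]


corpus = [
    Document("Taxes must fall. Spending must fall too.", {"party": "A", "year": 2016}),
    Document("Invest in schools.", {"party": "B", "year": 2016}),
    Document("Cut red tape.", {"party": "A", "year": 2018}),
]

recent = subset(corpus, lambda v: v["year"] >= 2018)
by_party = group(corpus, "party")
sentences = [s for doc in corpus for s in segment_sentences(doc)]
```

Copying the document-level variables onto each sentence is the design choice that lets a segmented corpus still be subset or regrouped by the same variables afterwards.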
Quanteda is principally designed to allow users to construct a document-feature matrix from a corpus and perform common NLP tasks such as tokenizing, stemming, forming n-grams, and selecting and weighting features. It uses the ICU library in the stringi package for text processing, correctly handling Unicode character sets for regular expression matching and detecting word boundaries for tokenization. Once texts are tokenized, quanteda maps tokens to a hash table of integers to increase processing speed while reducing memory usage.
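To make the idea of mapping tokens to integers and building a sparse document-feature matrix concrete, here is a minimal sketch in Python. It is not quanteda's implementation: the whitespace tokenizer stands in for ICU word-boundary detection, and the function names and example texts are hypothetical.

```python
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"


def tokenize(text):
    # Naive tokenizer (assumption): split on whitespace, strip punctuation, lowercase.
    stripped = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [word for word in stripped if word]


def build_vocabulary(tokenized_docs):
    # Give each distinct token type an integer ID in order of first appearance.
    vocab = {}
    for tokens in tokenized_docs:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(tokens, vocab):
    # Replace each token string with its integer ID.
    return [vocab[token] for token in tokens]


def sparse_dfm(encoded_docs):
    # One Counter per document: only features that occur are stored.
    return [Counter(doc) for doc in encoded_docs]


docs = ["The cat sat on the mat.", "The dog chased the cat!"]
tokenized = [tokenize(d) for d in docs]
vocab = build_vocabulary(tokenized)
encoded = [encode(tokens, vocab) for tokens in tokenized]
dfm = sparse_dfm(encoded)
```

After encoding, every later operation compares and counts small integers rather than strings, and each distinct token string is stored only once in the vocabulary, which is where the speed and memory savings come from.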
Quanteda is especially suited to research because it was designed for the social scientific analysis of textual data. It provides native, highly efficient implementations of several text analytic scaling methods, such as Wordscores, Wordfish, class affinity scaling, and correspondence analysis. It also provides a variety of text statistics, such as frequency analysis, "keyness", lexical diversity, readability, and similarity and distance of documents or features.
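As an illustration of the kind of text statistics mentioned above, the following Python sketch computes one simple lexical diversity measure (the type-token ratio) and one document similarity measure (cosine similarity over token counts). These are generic textbook definitions, not quanteda's code; the function names and example sentences are assumptions.

```python
import math
from collections import Counter


def type_token_ratio(tokens):
    # Lexical diversity: distinct token types divided by total tokens.
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def cosine_similarity(counts_a, counts_b):
    # Cosine of the angle between two sparse count vectors.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[k] * counts_b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no defined direction
    return dot / (norm_a * norm_b)


doc_a = "the cat sat on the mat".split()
doc_b = "the cat lay on the rug".split()

ttr_a = type_token_ratio(doc_a)
similarity = cosine_similarity(Counter(doc_a), Counter(doc_b))
```

Here `doc_a` has 5 distinct types among 6 tokens, and the two documents share "the", "cat", and "on", giving a dot product of 6 against vector norms of √8 each, so the similarity is 0.75 up to floating-point rounding.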
Quanteda provides extensive methods for visualizing textual analyses via its family of textplot_* functions. It is carefully designed with several key aims in mind: consistency, accessibility, performance, transparency and reproducibility, and compatibility with other packages. Quanteda is supported by the Quanteda Initiative, a non-profit organization founded in 2018 to provide ongoing support for the "quanteda ecosystem" of open-source text analysis software.