The Natural Language Toolkit (NLTK) is a suite of Python modules, data sets, and tutorials supporting research and teaching in computational linguistics and natural language processing (NLP). Distributed under the GPL, NLTK has been rewritten to simplify linguistic data structures and take advantage of recent Python enhancements. This paper presents the simplified version of NLTK and explains its use in teaching NLP.
NLTK provides a wide range of NLP data types, processing tasks, corpus samples, and readers, along with animated algorithms, tutorials, and problem sets. It includes data types such as tokens, tags, chunks, trees, and feature structures. Interface definitions and reference implementations are provided for tokenizers, stemmers, taggers, chunkers, parsers, clusterers, and classifiers. NLTK is ideal for students learning NLP or conducting related research. It has been successfully used as a teaching tool, individual study tool, and platform for prototyping and building research systems.
Python was chosen for its shallow learning curve, transparent syntax, and good string handling. Its interactive interpreter supports exploration, and its object-oriented features allow data and code to be encapsulated and reused easily. Python also comes with an extensive standard library, including tools for graphical programming and numerical processing.
Over the past four years, NLTK has grown rapidly, with data structures becoming significantly more complex. Each new processing task added new requirements on input and output representations. It was not clear how to generalize tasks so they could be applied independently of each other. To address this, NLTK 1.4 introduced a blackboard architecture for tokens, unifying many data types and allowing distinct tasks to be run independently. However, this architecture came with significant overhead for programmers.
This paper presents a brief overview and tutorial on a new, simplified toolkit and describes how it is used in teaching. It includes examples of simple processing tasks such as tokenization, stemming, tagging, chunking, and parsing. NLTK also provides support for conditional frequency distributions, making it easy to count items of interest in specified contexts. It includes a Brill tagger and an HMM tagger.
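The idea behind a conditional frequency distribution can be sketched in a few lines of plain Python. This is an illustrative sketch only, not NLTK's own implementation or API: the function name `conditional_freq_dist` and the toy tagged text are assumptions introduced here.

```python
from collections import Counter, defaultdict


def conditional_freq_dist(pairs):
    """Count each item separately under the condition it was observed with.

    Hypothetical helper for illustration; not part of NLTK.
    """
    cfd = defaultdict(Counter)
    for condition, item in pairs:
        cfd[condition][item] += 1
    return cfd


# Toy tagged text, invented for this example: (word, tag) pairs.
tagged = [
    ("the", "DT"), ("run", "NN"), ("they", "PRP"),
    ("run", "VB"), ("a", "DT"), ("run", "NN"),
]

# Condition on the word and count the tags it appears with.
cfd = conditional_freq_dist(tagged)

# The most frequent tag for a word is the basis of a simple unigram tagger.
most_likely_tag = cfd["run"].most_common(1)[0][0]
```

Here the condition is the word and the counted item is its tag; swapping the two would instead count which words occur with each tag.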
NLTK provides several parsers for context-free phrase-structure grammars. It also includes a recursive descent parser and a chart parser. NLTK is a unique framework for teaching NLP, providing comprehensive support for a first course in NLP that tightly couples theory and practice. Its extensive documentation maximizes the potential for independent learning. For more information, see http://nltk.sourceforge.net/.
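To make the recursive descent strategy mentioned above concrete, the following is a minimal sketch of a backtracking recursive descent parser for a context-free grammar. It is not NLTK's parser or its interface: the grammar, the tuple-based tree representation, and the names `expand`, `expand_seq`, and `parse` are all assumptions made for illustration. Like any naive recursive descent parser, it would not terminate on a left-recursive grammar.

```python
# Toy grammar, invented for illustration: nonterminal -> alternative right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
    "Det": [["the"]],
    "N": [["dog"], ["cat"]],
    "V": [["saw"], ["slept"]],
}


def expand(symbol, tokens, pos):
    """Yield (tree, next_pos) for each way `symbol` can match starting at pos."""
    if symbol not in GRAMMAR:
        # Terminal: it must match the next input token exactly.
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    # Nonterminal: try every production in turn (backtracking via generators).
    for rhs in GRAMMAR[symbol]:
        for children, end in expand_seq(rhs, tokens, pos):
            yield (symbol, children), end


def expand_seq(symbols, tokens, pos):
    """Yield (trees, next_pos) for each way to match a sequence of symbols."""
    if not symbols:
        yield [], pos
        return
    for tree, mid in expand(symbols[0], tokens, pos):
        for rest, end in expand_seq(symbols[1:], tokens, mid):
            yield [tree] + rest, end


def parse(tokens, start="S"):
    """Return all trees rooted in `start` that cover the whole input."""
    return [tree for tree, end in expand(start, tokens, 0) if end == len(tokens)]


trees = parse("the dog saw the cat".split())
```

Each tree is a `(label, children)` tuple with plain strings at the leaves. A chart parser avoids the repeated work this approach does by recording partial results instead of re-deriving them on every backtrack.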