A Survey of Text Classification Algorithms

This chapter presents a survey of text classification algorithms. Text classification is a widely studied problem in data mining, machine learning, databases, and information retrieval, with applications in domains such as target marketing, medical diagnosis, newsgroup filtering, and document organization. The classification problem involves assigning a label to a record based on its features. In the hard version, a specific label is assigned; in the soft version, a probability is assigned instead. Other variations allow ranking of the class choices or assignment of multiple labels. Text classification is closely related to classification of records with set-valued features, but the set-valued model uses only word presence/absence information. In reality, word frequency also plays a role, and the domain size of text data is much larger than in typical set-valued classification problems. A broad survey of classification methods is found in [42, 62], and a survey specific to text is found in [111]. A relative evaluation of text classification methods is found in [132]. Several techniques discussed in this chapter have been implemented in software and are publicly available through toolkits such as BOW, MALLET, WEKA, and LingPipe. Within text mining, example applications include news filtering and organization, where automated methods are useful for categorizing news articles, and document organization and retrieval, where supervised methods are used to organize documents in digital libraries, web collections, and scientific literature.
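
The distinctions drawn above (hard versus soft labels, and word presence/absence versus word frequency) can be made concrete with a small sketch. The code below is illustrative only and is not taken from the survey: it uses a toy multinomial naive Bayes classifier with add-one smoothing as a stand-in for any classifier, and the function names (`featurize`, `train`, `predict_soft`, `predict_hard`) and the four-document corpus are hypothetical.

```python
import math
from collections import Counter


def featurize(text, binary=False):
    """Map a document to word counts, or to 0/1 presence flags if binary=True."""
    counts = Counter(text.lower().split())
    if binary:
        # Set-valued view: keep only whether each word occurs.
        return Counter({word: 1 for word in counts})
    return counts


def train(docs, labels, binary=False):
    """Collect per-class word totals and class frequencies (toy naive Bayes)."""
    word_counts = {}
    class_counts = Counter(labels)
    vocab = set()
    for text, label in zip(docs, labels):
        feats = featurize(text, binary)
        word_counts.setdefault(label, Counter()).update(feats)
        vocab.update(feats)
    return word_counts, class_counts, vocab


def predict_soft(model, text, binary=False):
    """Soft version: return a probability for every class."""
    word_counts, class_counts, vocab = model
    feats = featurize(text, binary)
    n_docs = sum(class_counts.values())
    log_scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / n_docs)
        for word, count in feats.items():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing over the training vocabulary.
            p = (word_counts[label][word] + 1) / (total + len(vocab))
            score += count * math.log(p)
        log_scores[label] = score
    # Normalize log scores into probabilities in a numerically stable way.
    top = max(log_scores.values())
    unnormalized = {lab: math.exp(s - top) for lab, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {lab: v / z for lab, v in unnormalized.items()}


def predict_hard(model, text, binary=False):
    """Hard version: return only the single most probable class."""
    probs = predict_soft(model, text, binary)
    return max(probs, key=probs.get)


# Hypothetical toy corpus, used only to exercise the functions above.
docs = [
    "stocks fell as markets closed",
    "team wins the final match",
    "markets rally on earnings",
    "coach praises team after match",
]
labels = ["business", "sports", "business", "sports"]

model = train(docs, labels)
query = "markets and stocks rally"
soft = predict_soft(model, query)   # dict mapping each class to a probability
hard = predict_hard(model, query)   # single label
ranking = sorted(soft, key=soft.get, reverse=True)  # ranked class choices
print(hard, soft, ranking)
```

Passing `binary=True` to `train` and the prediction functions switches to the presence/absence representation, so the two feature models can be compared on the same data. Sorting the soft output, as in `ranking`, gives the ranked-choice variation mentioned above.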

2012 | Charu C. Aggarwal, ChengXiang Zhai