[slides] Fast and effective text mining using linear-time document clustering

This paper presents a fast and effective text mining system for discovering topic hierarchies in large document collections. The system uses feature extraction to map documents to points in a high-dimensional space, and clustering algorithms to group these points into a hierarchy of clusters.

The authors describe an unsupervised, near-linear time text clustering system with several algorithm choices for both feature extraction and clustering. They introduce a methodology for measuring cluster quality using F-Measure and compare the algorithms experimentally. The evaluation considers feature selection parameters (tf-idf weighting and feature vector length) but focuses on the clustering algorithms, including the Scatter/Gather techniques (buckshot, fractionation, and split/join) and k-means. The results suggest that continuous center adjustment contributes more to cluster quality than seed selection, and that a simpler seed selection algorithm provides a better time/quality tradeoff. The authors also introduce "vector average damping," a refinement to center adjustment that further improves cluster quality. To quantify the time/quality tradeoff, the paper compares the near-linear time algorithms with a group average greedy agglomerative clustering algorithm. The system is designed to handle gigabytes of documents per day and presents its results in an intuitive GUI. The evaluation uses two distinct test corpora and measures the quality of the generated cluster hierarchies by comparing them to human-labeled topics. The paper concludes with a discussion of the best time/quality tradeoff and future directions, including user-oriented evaluations and assigning a single document to multiple topic clusters.
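To make the quality measure concrete, here is a minimal Python sketch of how an F-Measure can score clusters against human-labeled topics. It assumes the commonly used clustering formulation: for each topic, take the best F value (the harmonic mean of precision and recall) achieved by any cluster, then average those best values weighted by topic size. The summary above does not spell out the formula, so treat this as an illustration rather than the authors' exact procedure. The function name `f_measure` and the input layout are hypothetical.

```python
from collections import Counter


def f_measure(labels, clusters):
    """Score clusters against human-assigned topic labels.

    labels:   dict mapping document id -> topic label (assumed input layout)
    clusters: iterable of sets of document ids; for a hierarchy, pass the
              document set of every node in the tree
    """
    n_docs = len(labels)
    topic_sizes = Counter(labels.values())
    # Best F value found so far for each topic, over all clusters.
    best = dict.fromkeys(topic_sizes, 0.0)

    for cluster in clusters:
        if not cluster:
            continue
        # How many documents of each topic fall inside this cluster.
        overlap = Counter(labels[doc] for doc in cluster)
        for topic, shared in overlap.items():
            precision = shared / len(cluster)
            recall = shared / topic_sizes[topic]
            f = 2 * precision * recall / (precision + recall)
            best[topic] = max(best[topic], f)

    # Weight each topic's best F value by its share of the corpus.
    return sum(topic_sizes[t] / n_docs * best[t] for t in topic_sizes)


if __name__ == "__main__":
    labels = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
    clusters = [{1, 2, 3, 4}, {5, 6}]
    print(round(f_measure(labels, clusters), 4))
```

A clustering that reproduces the labeled topics exactly scores 1.0 under this formulation. Passing every node of a hierarchy lets each topic match its best cluster at any level of the tree.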

Fast and Effective Text Mining Using Linear-time Document Clustering

1999 | Bjornar Larsen and Chinatsu Aone