This paper presents a fast and effective text mining system for large-scale topic discovery using linear-time document clustering. The system is designed to process large volumes of text efficiently and present results in an intuitive graphical user interface. It uses a combination of feature extraction and clustering algorithms to generate hierarchical topic structures. The system evaluates the quality of the generated hierarchies using F-Measure, which combines precision and recall. The evaluation considers various feature selection parameters, such as tf-idf and feature vector length, and focuses on clustering algorithms, including techniques from Scatter/Gather and k-means.
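As a rough illustration of how F-Measure combines precision and recall when scoring clusters against reference topics, consider the sketch below. The source does not give the exact formulation, so two things here are assumptions: the balanced harmonic mean, and the hierarchy-level score that takes, for each topic, the best-matching cluster and weights it by topic size. The function names are hypothetical.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def cluster_f_measure(cluster, topic):
    """F-measure of one cluster against one reference topic (sets of doc ids)."""
    overlap = len(cluster & topic)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)  # share of the cluster that is on-topic
    recall = overlap / len(topic)       # share of the topic the cluster captures
    return f_measure(precision, recall)


def hierarchy_f_measure(clusters, topics):
    """Assumed scoring rule: size-weighted average, over topics, of the best
    F-measure achieved by any cluster in the hierarchy."""
    total = sum(len(t) for t in topics)
    return sum(
        len(t) * max(cluster_f_measure(c, t) for c in clusters) for t in topics
    ) / total
```

Here `clusters` would hold every node of the generated hierarchy as a set of document ids, so a topic can be matched at whichever level fits it best.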
Feature extraction maps documents into a high-dimensional space using the vector space model, where each dimension corresponds to a unique word or concept. Terms are weighted by tf-idf, which generally outperforms raw term frequency except on very small document sets. Feature vector length also affects clustering quality: longer vectors improve quality but require more computation time and memory.
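A minimal sketch of tf-idf weighting with a capped feature vector length follows. It assumes the common `tf * log(N / df)` form of the weight and truncates each document to its `max_features` highest-weighted terms. Both the exact weighting variant and the truncation rule are assumptions, since the source does not specify them, and `tfidf_vectors` is a hypothetical name.

```python
import math
from collections import Counter


def tfidf_vectors(docs, max_features):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: raw term frequency times log inverse document frequency.
        weights = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        # Keep only the highest-weighted terms; ties are broken alphabetically.
        top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:max_features]
        vectors.append(dict(top))
    return vectors
```

Raising `max_features` keeps more terms per document, which is the quality versus time and memory tradeoff described above.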
The clustering process includes seed selection, center adjustment, and cluster refinement. The evaluation compares several seed selection techniques, including random, buckshot, and fractionation, and finds that continuous center adjustment significantly improves cluster quality. A novel technique called vector average damping is introduced to further improve cluster quality at no additional computational cost. The near-linear time clustering algorithms are also compared with quadratic-time alternatives: the latter produce higher scores but are prohibitively slow and do not scale to large datasets.
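One way to picture continuous center adjustment is a single assignment pass in which a cluster center is updated immediately after each document joins it, rather than once at the end of the pass. The sketch below assumes cosine similarity and a running-mean update, neither of which is confirmed by the source. It does not implement vector average damping, whose definition is not given here. The function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_with_continuous_adjustment(docs, seeds):
    """Assign each document to its most similar center, adjusting that
    center right away so later assignments see the updated position."""
    centers = [list(s) for s in seeds]
    counts = [1] * len(centers)  # assumption: each seed counts as one member
    labels = []
    for doc in docs:
        best = max(range(len(centers)), key=lambda i: cosine(doc, centers[i]))
        labels.append(best)
        counts[best] += 1
        # Running mean: shift the center toward the newly assigned document.
        centers[best] = [
            c + (x - c) / counts[best] for c, x in zip(centers[best], doc)
        ]
    return labels, centers
```

Each document is compared against a fixed number of centers, so one pass is linear in the number of documents, in keeping with the near-linear time behavior described above.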
The system's performance is evaluated on two distinct test corpora, and the results show that the near-linear time clustering algorithms provide a better time/quality tradeoff than the quadratic-time alternatives. The system is designed to handle large volumes of text efficiently and is capable of processing gigabytes of documents per day. Future work includes improving the evaluation method, enhancing the system to handle multiple topic clusters for a single document, and applying the clustering algorithms to structured data instead of text.