Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections

1992 | Douglass R. Cutting, David R. Karger, Jan O. Pedersen, John W. Tukey
This paper introduces Scatter/Gather, a cluster-based approach to browsing large document collections. The authors argue that the usual objections to document clustering arise when it is used to improve conventional search, and that clustering can be effective as an information access tool in its own right. They present a document browsing method that uses clustering as its primary operation, along with fast linear-time clustering algorithms that support interactive browsing.

The paper discusses two limitations of traditional clustering methods: their quadratic running time, which makes them too slow for large corpora, and the fact that they often do not significantly improve retrieval. In particular, using clustering to enhance near-neighbor search has not been widely adopted because of its indifferent performance. The authors propose an alternative application for clustering in information access, inspired by conventional textbook access methods. They describe Scatter/Gather as a browsing method that uses a cluster-based, dynamic table-of-contents metaphor for navigating a collection of documents. It lets users explore a collection with non-specific goals and complements more focused techniques.

Scatter/Gather has two requirements: clustering fast enough to recluster online, and automatic summarization of document groups. To meet the first, the paper presents two new near-linear time clustering algorithms, Buckshot and Fractionation. It also outlines the document clustering process, including similarity measures and the distinction between flat partitions and hierarchical clusters, and discusses the performance of various clustering algorithms and the trade-offs between speed and accuracy. The authors conclude that Scatter/Gather demonstrates that document clustering can be an effective information access tool in its own right.
The method is particularly helpful in situations where it is difficult or undesirable to specify a query formally. The paper emphasizes the importance of fast clustering algorithms for supporting Scatter/Gather and highlights the potential for further improvements in handling extremely large corpora.
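The summary names Buckshot and the scatter, select, gather, recluster loop without showing them, so here is a minimal Python sketch of how they might fit together. It is an illustration under stated assumptions, not the authors' implementation. Buckshot is taken in its commonly described form: cluster a random sample of about sqrt(k*n) documents with a quadratic-time agglomerative method, then use the resulting k centers as seeds and assign every document to its nearest center. All function names (`cosine`, `agglomerate`, `buckshot`, `scatter_gather_step`, `digest`) are hypothetical. Comparing clusters by their summed term vectors is a simplification, and any refinement passes are omitted.

```python
import math
import random
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def agglomerate(vectors, k):
    """Greedily merge the most similar pair of clusters until k remain.

    Simplification (assumption): clusters are compared by the cosine of
    their summed term vectors. Each pass scans all pairs, which is only
    acceptable because it runs on a small sample.
    """
    clusters = [Counter(v) for v in vectors]
    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = max(pairs, key=lambda p: cosine(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


def buckshot(docs, k, rng):
    """Partition docs (a list of Counters) into k groups of document indices."""
    n = len(docs)
    # Sample about sqrt(k*n) documents so the pairwise work stays on the order of k*n.
    sample_size = min(n, max(k, math.ceil(math.sqrt(k * n))))
    centers = agglomerate(rng.sample(docs, sample_size), k)
    groups = [[] for _ in centers]
    for idx, doc in enumerate(docs):
        best = max(range(len(centers)), key=lambda c: cosine(doc, centers[c]))
        groups[best].append(idx)
    return groups


def scatter_gather_step(docs, selected_ids, k, rng):
    """Gather the selected documents, then scatter them into k new clusters."""
    subset = [docs[i] for i in selected_ids]
    groups = buckshot(subset, k, rng)
    # Map positions within the subset back to ids in the full collection.
    return [[selected_ids[i] for i in g] for g in groups]


def digest(docs, group, m):
    """Summarize a cluster by its m most frequent terms (a crude stand-in
    for the automatic summaries the paper calls for)."""
    total = Counter()
    for i in group:
        total.update(docs[i])
    return [term for term, _ in total.most_common(m)]
```

In an interactive session, the user would see `digest` output for each group returned by `buckshot`, pick the groups of interest, concatenate their ids, and pass them to `scatter_gather_step` to get a finer-grained view. Passing an explicit `random.Random` instance keeps runs reproducible.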