[slides] Scatter/Gather: a cluster-based approach to browsing large document collections

The paper "Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections" by Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey responds to two standing criticisms of document clustering as a retrieval tool: that it is too slow for large collections and that it does not significantly improve retrieval performance. The authors propose Scatter/Gather, which treats document clustering as a primary operation for information access in its own right. The method is designed to work even when users have no specific query in mind, which makes it particularly useful for browsing large document collections.

Scatter/Gather works by repeatedly scattering the collection into clusters and gathering the clusters the user selects into a smaller subcollection, which is then scattered again into finer clusters. The process continues until the clusters are small enough to list individual documents. Two fast clustering algorithms, Buckshot and Fractionation, support this interactive browsing paradigm. Buckshot is the faster of the two and suits online reclustering, while Fractionation is more accurate but slower and suits offline partitioning of the entire corpus. The paper also discusses the challenges of document clustering, including the need for efficient similarity measures and the trade-off between speed and accuracy. The authors describe their algorithms in detail and demonstrate them in a case study on a corpus of 5000 articles from the *New York Times News Service* in August 1990. In conclusion, Scatter/Gather shows that document clustering can be an effective tool for information access, especially when precise queries are difficult to formulate.
The method's intuitive table-of-contents metaphor and fast clustering algorithms make it a powerful tool for browsing large document collections.
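
The scatter and gather steps described above can be illustrated with a short Python sketch. The summary does not spell out how Buckshot works internally, so the outline used here is an assumption based on how the algorithm is commonly described: cluster a random sample of roughly sqrt(k·n) documents agglomeratively to obtain k centers, then assign every document to its nearest center. The function names (`cosine`, `centroid`, `agglomerate`, `buckshot`, `gather`), the dict-based term vectors, and the centroid-based merge rule are all illustrative choices, not the authors' implementation.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Average a non-empty list of sparse vectors into one sparse vector."""
    total = {}
    for v in vectors:
        for t, w in v.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}


def agglomerate(vectors, k):
    """Merge the two groups with the most similar centroids until k remain.

    Simplification (assumption): centroid similarity is used as the merge
    criterion; this is quadratic per merge and only meant for small samples.
    """
    groups = [[v] for v in vectors]
    while len(groups) > k:
        cents = [centroid(g) for g in groups]
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                s = cosine(cents[i], cents[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return [centroid(g) for g in groups]


def buckshot(docs, k, seed=0):
    """Scatter step: partition docs into at most k clusters of doc indices."""
    rng = random.Random(seed)
    # Assumed sample size: about sqrt(k * n), but never fewer than k docs.
    sample_size = min(len(docs), max(k, math.isqrt(k * len(docs))))
    centers = agglomerate(rng.sample(docs, sample_size), k)
    clusters = [[] for _ in centers]
    for idx, doc in enumerate(docs):
        nearest = max(range(len(centers)), key=lambda c: cosine(doc, centers[c]))
        clusters[nearest].append(idx)
    return clusters


def gather(docs, clusters, chosen):
    """Gather step: pool the documents of the clusters the user selected."""
    return [docs[i] for c in chosen for i in clusters[c]]


# Toy corpus (made-up term weights) for one scatter/gather round.
docs = [
    {"court": 2.0, "judge": 1.0}, {"judge": 2.0, "trial": 1.0},
    {"oil": 2.0, "price": 1.0}, {"price": 2.0, "market": 1.0},
]
clusters = buckshot(docs, k=2)
subcollection = gather(docs, clusters, chosen=[0])
```

In an interactive session, `subcollection` would be passed back to `buckshot` to produce finer clusters, and the loop would repeat until the clusters are small enough to list their documents directly.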

Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections

1992 | Douglass R. Cutting, David R. Karger, Jan O. Pedersen, John W. Tukey