Understanding A Vector Space Model for Automatic Indexing

The paper "A Vector Space Model for Automatic Indexing" by G. Salton, A. Wong, and C. S. Yang from Cornell University explores the relationship between document space configurations and retrieval performance in automatic information retrieval systems. The authors propose that the best indexing space is one in which documents lie as far apart as possible, and that retrieval performance can be expressed as a function of the density of the object space: the lower the density, the better the performance.

The study uses a vector space model in which each document is represented by a vector in a multi-dimensional space, with each term weighted according to its importance. The similarity between two documents is measured by the inner product of their vectors or by an inverse function of the angle between them. The authors define space density as the sum of all pairwise document similarities and show that minimizing this measure can improve retrieval performance. The paper evaluates different clustering methods and term weighting schemes, finding that effective indexing methods tend to produce a less dense, more spread-out document space.

The authors also introduce the "term discrimination" model, which measures the extent to which a term increases the differences among document vectors. Terms with high discrimination values are found to be more effective for content identification. The study concludes that the term discrimination model, and the automatic indexing theory based on it, are useful for improving retrieval performance. The authors provide practical strategies for selecting index terms: use terms with medium document frequency directly, transform high-frequency terms into phrases, and combine low-frequency terms into thesaurus classes. These methods have been tested on various document collections and shown to improve recall and precision.
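The quantities described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes cosine similarity as the angle-based measure, sums similarities over unordered pairs of distinct documents, and scores a term as the density with the term removed minus the density with it present, so that a positive value marks a term that spreads documents apart. The function names and the toy `docs` matrix are hypothetical and chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def space_density(docs):
    """Sum of cosine similarities over all unordered pairs of distinct documents."""
    return sum(cosine(docs[i], docs[j])
               for i in range(len(docs))
               for j in range(i + 1, len(docs)))

def discrimination_value(docs, k):
    """Density with term k removed minus density with term k present.

    Assumed sign convention: a positive value means that removing term k
    packs the documents closer together, i.e. term k helps tell them apart.
    """
    without_k = [d[:k] + d[k + 1:] for d in docs]
    return space_density(without_k) - space_density(docs)

# Toy collection (made-up weights): rows are documents, columns are terms.
# Term 0 occurs in every document; terms 1 and 2 occur in only some of them.
docs = [
    [1, 3, 0],
    [1, 0, 3],
    [1, 3, 3],
]

for k in range(3):
    print(f"term {k}: discrimination value = {discrimination_value(docs, k):+.3f}")
```

In this toy collection the term shared by every document receives a negative value, while the two more selective terms receive positive values, which matches the intuition that very high-frequency terms are poor discriminators on their own.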

A Vector Space Model for Automatic Indexing

November 1975 | G. Salton, A. Wong and C. S. Yang