This paper presents a vector space model for automatic indexing. It proposes that the best indexing system is one in which documents lie as far apart as possible in the index space, on the hypothesis that retrieval performance is inversely related to the density of the document space. On that basis, the paper describes a method for choosing an optimal indexing vocabulary from the density of the document space.
The model represents each document as a vector in a multi-dimensional space, with one dimension per index term. The similarity between two documents is computed from their vectors, so documents that share heavily weighted terms lie close together. The paper argues that maximizing the separation between documents in this space improves retrieval performance.
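The vector representation can be illustrated with a minimal sketch. It assumes cosine similarity as the similarity measure, which this summary does not specify; the three-term vocabulary and the document vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length term-weight vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a document with no index terms matches nothing
    return dot / (norm_a * norm_b)

# Hypothetical vocabulary: ["retrieval", "index", "cluster"].
# Each entry is the weight of that term in the document.
doc_a = [2, 1, 0]
doc_b = [1, 1, 0]
doc_c = [0, 0, 3]

print(cosine_similarity(doc_a, doc_b))  # shared terms: high similarity
print(cosine_similarity(doc_a, doc_c))  # no shared terms: zero similarity
```

Under this measure, documents with no terms in common are maximally separated, which is the situation the model treats as ideal for retrieval.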
The paper introduces clustered document spaces, in which documents are grouped into clusters and each cluster is represented by a centroid. Space density is measured as the sum of the similarities between each document and the main centroid. The paper evaluates several indexing methods, including term-frequency weighting and inverse document frequency weighting, and finds that inverse document frequency weighting improves recall and precision.
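The density measure and the two weighting schemes can be sketched as follows. Several details are assumptions rather than statements from this summary: the main centroid is taken to be the component-wise mean of all document vectors, similarity is cosine similarity, and inverse document frequency uses the common form log(N / df), where N is the number of documents and df is the number of documents containing the term.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(docs):
    """Component-wise mean of all document vectors (assumed 'main centroid')."""
    n = len(docs)
    return [sum(column) / n for column in zip(*docs)]

def space_density(docs):
    """Sum of similarities between each document and the main centroid."""
    c = centroid(docs)
    return sum(cosine(d, c) for d in docs)

def idf_weight(docs):
    """Scale raw term frequencies by log(N / df); an assumed IDF form."""
    n = len(docs)
    df = [sum(1 for d in docs if d[k] > 0) for k in range(len(docs[0]))]
    idf = [math.log(n / f) if f else 0.0 for f in df]
    return [[tf * w for tf, w in zip(d, idf)] for d in docs]

# Hypothetical collections of raw term-frequency vectors.
tight = [[1, 1, 0], [1, 1, 0], [1, 1, 0]]   # identical documents
spread = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # no shared terms

print(space_density(tight), space_density(spread))
print(idf_weight([[2, 1, 0], [1, 0, 3], [1, 0, 0]]))
```

The collection of identical documents has the higher density, and under this IDF form a term that occurs in every document receives zero weight.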
The paper further defines the discrimination value of a term, which measures the term's ability to increase the differences between document vectors. Terms with high discrimination values are good discriminators because they help to separate documents in the space. The paper also evaluates indexing strategies that use phrases and thesaurus classes, and finds that these strategies improve retrieval performance.
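One way to make the discrimination value concrete is sketched below. It assumes the value of a term is the space density with that term removed minus the density with all terms present, so that a positive value means the term spreads documents apart. The helper names, the cosine measure, and the sample vectors are illustrative assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def space_density(docs):
    """Sum of cosine similarities between each document and the mean vector."""
    n = len(docs)
    c = [sum(column) / n for column in zip(*docs)]
    return sum(cosine(d, c) for d in docs)

def discrimination_value(docs, k):
    """Density without term k minus density with every term present."""
    without_k = [d[:k] + d[k + 1:] for d in docs]
    return space_density(without_k) - space_density(docs)

# Hypothetical collection: term 0 occurs heavily in every document,
# while terms 1 and 2 occur in only some of them.
docs = [[3, 1, 0], [3, 0, 1], [3, 1, 1]]

for k in range(3):
    print(k, discrimination_value(docs, k))
```

In this example term 0 has a negative value, since removing it makes the space less dense, while term 1 has a positive value and so acts as a discriminator.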
The paper concludes that the vector space model for automatic indexing is effective in improving retrieval performance, and that the model has been successfully applied to various document collections. The model is considered to be a useful tool for automatic indexing and retrieval.