The paper "Term Weighting Approaches in Automatic Text Retrieval" by Gerard Salton and Chris Buckley, published in 1987, reviews the evolution and effectiveness of term weighting systems in automatic text retrieval. The authors observe that although more complex text representations have been proposed, single-term indexing systems with appropriate term weights produce retrieval results superior to those of the more elaborate representations. They emphasize that retrieval quality depends on an effective term weighting system, since the weights are what distinguish relevant from irrelevant documents.
The paper outlines the development of term weighting methods, from early experiments with unweighted single terms to more sophisticated approaches that combine term frequency (tf), inverse document frequency (idf), and normalization factors. These methods aim to improve both recall and precision, and the product tf × idf (often written tf-idf) is a widely used composite term weighting factor.
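The three components named above can be illustrated with a minimal sketch. It assumes raw term counts for tf, idf computed as log(N/n) (N documents in the collection, n of them containing the term), and cosine normalization of each document vector. The function name `tfidf_vectors` and the toy documents are illustrative, not taken from the paper.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Raw term frequency times idf = log(N / n).
        raw = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # Cosine normalization; an all-zero vector is left as zeros.
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors

# Toy collection (hypothetical data, already tokenized).
docs = [
    "term weighting improves retrieval".split(),
    "retrieval of text documents".split(),
    "term frequency and document frequency".split(),
]
vectors = tfidf_vectors(docs)
```

In this sketch a term that occurs in fewer documents receives a larger weight than an equally frequent term that occurs in many, and every nonzero document vector has unit length, so long documents are not favored merely for their length.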
The authors conduct a series of experiments on six document collections of varying sizes and subjects, evaluating different term weighting combinations. The results show that the best performance is achieved by methods that use enhanced frequency weights for technical vocabulary and meaningful terms, conventional frequency weights for varied vocabulary, and fully weighted terms for short document vectors. For query vectors, they recommend using inverse document frequency factors and enhanced query term weights.
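One way to read the query-side recommendation is sketched below. It assumes that the "enhanced" weight is the augmented normalized term frequency 0.5 + 0.5 · tf / max tf, multiplied by idf = log(N/n), with no length normalization applied to the query. The helper names and the document-frequency counts are hypothetical and serve only to make the example runnable.

```python
import math
from collections import Counter

def augmented_tf(tokens):
    """Map each term to 0.5 + 0.5 * tf / max_tf, giving weights in (0.5, 1.0]."""
    tf = Counter(tokens)
    if not tf:
        return {}
    max_tf = max(tf.values())
    return {t: 0.5 + 0.5 * count / max_tf for t, count in tf.items()}

def query_weights(query_tokens, doc_freq, n_docs):
    """Augmented tf times idf, without length normalization (an assumption)."""
    weights = {}
    for term, atf in augmented_tf(query_tokens).items():
        n = doc_freq.get(term, 0)
        # A term absent from the collection cannot match any document.
        weights[term] = atf * math.log(n_docs / n) if n else 0.0
    return weights

# Hypothetical document frequencies in a 100-document collection.
doc_freq = {"term": 40, "weighting": 5, "retrieval": 20}
weights = query_weights("term weighting term retrieval".split(), doc_freq, n_docs=100)
```

Under these assumptions the rare term "weighting" outweighs the repeated but common term "term", which shows how the idf factor can dominate the term frequency factor in short queries.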
The paper concludes with recommendations for standard single-term weighting systems to compare with more complex text analysis methods, emphasizing the importance of term frequency, collection frequency, and normalization components in different contexts.