Understanding Term-Weighting Approaches in Automatic Text Retrieval

This paper discusses term weighting approaches in automatic text retrieval, emphasizing the importance of effective term weighting systems in achieving superior retrieval results. It summarizes insights from automatic term weighting and provides baseline single-term indexing models for comparison with more complex content analysis methods. The paper outlines the use of term vectors to represent documents and queries, with weights assigned to terms to distinguish their importance. It also explores various term weighting strategies, including term frequency, inverse document frequency, and length normalization, and evaluates their effectiveness in different contexts. The paper highlights that while complex term representations can be useful, they often fail to produce consistent results across different document collections. Single-term indexing, when appropriately weighted, is often more effective and reliable. The paper recommends specific term weighting strategies based on the length and nature of the query and document vectors. For short query vectors, enhanced query term weights are preferred, while for longer vectors, more discriminative term weighting is needed. For document vectors, enhanced frequency weights are recommended for technical vocabulary, while conventional frequency weights are suitable for more varied vocabularies. Length normalization is also recommended when vector lengths vary significantly. The paper also discusses the use of probabilistic models and the importance of relevance in term weighting. It concludes that while probabilistic models can be effective, they may not always outperform single-term indexing in conventional natural language retrieval scenarios. 
The paper provides recommendations for term weighting systems based on experimental results, emphasizing the importance of appropriate term weighting in achieving effective text retrieval.
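
To make the three weighting components named above concrete (term frequency, inverse document frequency, and length normalization), here is a minimal Python sketch. The summary does not state any formulas, so the ones used here are assumptions: raw term frequency times log(N/n) with cosine normalization for documents, and an augmented term frequency (0.5 + 0.5 * tf / max_tf) times log(N/n) for queries, as one common way to realize "enhanced" query term weights. The function names and the toy collection are hypothetical and serve only as illustration.

```python
import math
from collections import Counter


def idf(term, docs):
    """Inverse document frequency log(N / n); 0.0 if the term occurs in no document."""
    n = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n) if n else 0.0


def document_vector(doc, docs):
    """Assumed document weighting: tf * idf, cosine-normalized to unit length."""
    tf = Counter(doc)
    raw = {t: f * idf(t, docs) for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else raw


def query_vector(query, docs):
    """Assumed query weighting: augmented tf times idf, no length normalization."""
    tf = Counter(query)
    max_tf = max(tf.values(), default=1)
    return {t: (0.5 + 0.5 * f / max_tf) * idf(t, docs) for t, f in tf.items()}


def similarity(q, d):
    """Inner product of query and document weights over shared terms."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())


# Toy collection of tokenized documents (hypothetical data).
docs = [
    ["term", "weighting", "retrieval"],
    ["probabilistic", "retrieval", "model"],
    ["vector", "length", "normalization"],
]
doc_vectors = [document_vector(d, docs) for d in docs]
q = query_vector(["term", "weighting"], docs)
scores = [similarity(q, v) for v in doc_vectors]

# Document indices ordered from best to worst match.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

In this sketch the term "retrieval" appears in two of the three documents and therefore gets a lower idf than "term" or "weighting", which each appear in one. Cosine normalization keeps longer documents from scoring higher merely because they contain more terms.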

Term Weighting Approaches in Automatic Text Retrieval

November 1987 | Gerard Salton, Chris Buckley