A Markov Random Field Model for Term Dependencies. SIGIR '05, August 15-19, 2005, Salvador, Brazil | Donald Metzler, W. Bruce Croft
This paper presents a Markov random field (MRF) model for capturing term dependencies in information retrieval. The model allows arbitrary text features to be incorporated as evidence, including single terms, ordered phrases, and unordered phrases. The authors explore three variants of the model: full independence (FI), which treats query terms as independent; sequential dependence (SD), which models dependencies between adjacent query terms; and full dependence (FD), which models dependencies among all query terms. Rather than maximizing the likelihood of the training data, the model is trained to directly maximize mean average precision (MAP), using a parameter sweep over the feature weights (sketched below).

Ad hoc retrieval experiments are conducted on several newswire and web collections, including the GOV2 collection used in the TREC 2004 Terabyte Track. The results show that modeling term dependencies significantly improves retrieval effectiveness, especially on the larger web collections. The SD variant outperforms the FI variant on every collection, while the FD variant performs best on the larger, less homogeneous collections. The authors hypothesize that dependence models help more on larger collections because their phrase and proximity features are more informative there, and that incorporating multiple types of evidence into the model further improves effectiveness.

The MRF model is also shown to be able to emulate a range of existing retrieval and dependence models, including unigram, bigram, and biterm language models, as well as the binary independence model (BIM). Finally, parameters trained on one collection generalize well to others, suggesting that the features used in the model capture general aspects of text.
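For reference, the paper's ranking function admits a compact form. The sketch below reconstructs it from the published presentation rather than quoting it verbatim; notation follows the paper, where G is the MRF graph over the query terms and document D, C(G) is the set of cliques of G, f(c) is a real-valued feature over clique c with weight λ_c, and T, O, and U are the clique sets covering single query terms, contiguous query terms, and arbitrary query-term subsets, respectively.

```latex
% Rank-equivalent MRF scoring function:
P_\Lambda(D \mid Q) \;\stackrel{\text{rank}}{=}\; \sum_{c \in C(G)} \lambda_c f(c)

% Instantiated with the three feature classes used in the paper:
P_\Lambda(D \mid Q) \;\stackrel{\text{rank}}{=}\;
    \lambda_T \sum_{c \in T} f_T(c)
  + \lambda_O \sum_{c \in O} f_O(c)
  + \lambda_U \sum_{c \in O \cup U} f_U(c),
\qquad \lambda_T + \lambda_O + \lambda_U = 1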
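To make the sequential dependence variant concrete, here is a minimal Python sketch of SD scoring. The interfaces are assumptions, not the paper's implementation: doc and coll are hypothetical objects exposing term and window counts (doc.tf, doc.ordered_count, doc.unordered_count, and their collection analogues). The Dirichlet-smoothed log feature mirrors the paper's f_T, and the default weights (0.85, 0.10, 0.05) are the values the paper reports as robust across collections.

```python
import math

def lm_feature(count, doc_len, coll_count, coll_len, mu=2500.0):
    """Dirichlet-smoothed log feature, analogous to the paper's f_T.
    Assumes coll_count > 0 so the smoothed probability is nonzero."""
    return math.log((count + mu * coll_count / coll_len) / (doc_len + mu))

def sd_score(query_terms, doc, coll, lt=0.85, lo=0.10, lu=0.05, window=8):
    """Score one document with term, ordered-window, and unordered-window
    features over adjacent query-term pairs (the SD clique structure)."""
    score = 0.0
    for t in query_terms:  # single-term features (f_T)
        score += lt * lm_feature(doc.tf(t), doc.length,
                                 coll.cf(t), coll.length)
    for a, b in zip(query_terms, query_terms[1:]):  # adjacent pairs only
        # Exact ordered phrase "a b" (f_O).
        score += lo * lm_feature(doc.ordered_count(a, b), doc.length,
                                 coll.ordered_cf(a, b), coll.length)
        # Co-occurrence of a and b within an unordered window (f_U);
        # a width of 8 for pairs matches the paper's #uwN convention.
        score += lu * lm_feature(doc.unordered_count(a, b, window), doc.length,
                                 coll.unordered_cf(a, b, window), coll.length)
    return score
```

In Indri's query language, the same SD scoring for a two-term query is typically expressed as roughly #weight( 0.85 #combine(t1 t2) 0.1 #combine(#1(t1 t2)) 0.05 #combine(#uw8(t1 t2)) ).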
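The training procedure is simple enough to sketch as well. Below is a minimal Python version of a grid sweep over the weight simplex that directly maximizes MAP, in the spirit of the paper's approach; evaluate_map is an assumed callback (not from the paper) that runs retrieval with the given weights on the training queries and returns their mean average precision.

```python
from itertools import product

def sweep(evaluate_map, step=0.05):
    """Grid-search (lambda_T, lambda_O, lambda_U) on the simplex,
    keeping whichever setting yields the highest MAP."""
    best_weights, best_map = None, -1.0
    n = int(round(1.0 / step))
    for i, j in product(range(n + 1), repeat=2):
        lt, lo = i * step, j * step
        lu = 1.0 - lt - lo      # remaining mass goes to the unordered weight
        if lu < -1e-9:          # skip points outside the simplex
            continue
        current = evaluate_map(lt, lo, max(lu, 0.0))
        if current > best_map:
            best_weights, best_map = (lt, lo, max(lu, 0.0)), current
    return best_weights, best_map
```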
The authors conclude that modeling term dependencies can significantly improve retrieval effectiveness and that the MRF model provides a flexible and effective framework for this purpose.