Probabilistic Latent Semantic Indexing

8/99 Berkeley, CA USA | Thomas Hofmann
Probabilistic Latent Semantic Indexing (PLSI) is a novel approach to automated document indexing that leverages a statistical latent class model for factor analysis of count data. Unlike traditional Latent Semantic Indexing (LSI), which uses Singular Value Decomposition (SVD), PLSI has a solid statistical foundation and defines a proper generative data model. PLSI can handle domain-specific synonyms and polysemous words, making it more robust and accurate in information retrieval tasks.
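The latent class model referred to above can be sketched as a mixture decomposition of the joint document-word probability, with topics z as the unobserved class variable (this is the standard symmetric formulation of the aspect model):

```latex
P(d, w) = \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z)
```

Each observed co-occurrence of a word w in a document d is thus explained by first choosing a latent class z, then generating d and w independently given z; this is the "proper generative data model" that distinguishes PLSI from the purely algebraic SVD of LSI.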
The core of PLSI is the aspect model, a statistical mixture model that associates each word occurrence with an unobserved class variable. This model is fitted using a generalized Expectation Maximization (EM) algorithm, known as Tempered EM (TEM), which helps to avoid overfitting and improve generalization. PLSI can be applied to both term-based and latent space representations, and it offers advantages over LSI in terms of precision and robustness. Experiments on various test collections demonstrate that PLSI outperforms standard term matching methods and LSI, particularly in handling polysemous words and in combining models of different dimensionalities. The method's effectiveness is further highlighted by its ability to handle queries not present in the training data through a process called "folding-in." Overall, PLSI provides a more principled and flexible approach to document indexing, offering significant improvements in retrieval performance and robustness.
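The TEM fitting procedure described above can be sketched as follows. This is a minimal NumPy illustration, not Hofmann's original implementation: the function name `plsi_tem`, the random initialization, and the fixed inverse temperature `beta` are choices made here for illustration (the paper anneals `beta` on held-out data).

```python
import numpy as np

rng = np.random.default_rng(0)

def plsi_tem(counts, n_topics=2, beta=0.9, n_iter=50):
    """Tempered EM for the PLSI aspect model.

    counts: (n_docs, n_words) term-frequency matrix n(d, w).
    beta:   inverse temperature; beta < 1 dampens the E-step
            posteriors, which is what curbs overfitting in TEM.
    Returns P(z), P(d|z), P(w|z).
    """
    n_docs, n_words = counts.shape
    # Random initialization, normalized along the proper axes.
    p_z = np.full(n_topics, 1.0 / n_topics)
    p_d_z = rng.random((n_topics, n_docs))
    p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: tempered posterior P(z|d,w) ∝ [P(z)P(d|z)P(w|z)]^beta
        joint = (p_z[:, None, None] * p_d_z[:, :, None]
                 * p_w_z[:, None, :]) ** beta
        post = joint / joint.sum(axis=0, keepdims=True)   # shape (z, d, w)
        # M-step: re-estimate parameters from expected counts n(d,w) P(z|d,w)
        weighted = counts[None, :, :] * post              # shape (z, d, w)
        p_d_z = weighted.sum(axis=2)
        p_d_z /= p_d_z.sum(axis=1, keepdims=True)
        p_w_z = weighted.sum(axis=1)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z = weighted.sum(axis=(1, 2))
        p_z /= p_z.sum()
    return p_z, p_d_z, p_w_z
```

Folding-in a new query works the same way, except that only the query's mixing proportions are updated in the M-step while `p_w_z` is held fixed.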