Text Clustering with LLM Embeddings


9 Aug 2024 | Alina Petukhova, João P. Matos-Carvalho, Nuno Fachada
This study investigates the effectiveness of large language model (LLM) embeddings in text clustering, comparing them with traditional methods such as TF-IDF and BERT. The research evaluates how different embeddings and clustering algorithms influence the clustering of text datasets. Experiments were conducted to assess the impact of embeddings on clustering results, the role of dimensionality reduction through summarisation, and the effect of model size. The findings indicate that LLM embeddings are superior at capturing subtleties in structured language: OpenAI's GPT-3.5 Turbo model yields better results in three out of five clustering metrics across most tested datasets. Most LLM embeddings improve cluster purity and provide a more informative silhouette score, reflecting a refined structural understanding of text data compared to traditional methods. Among the more lightweight models, BERT demonstrates the leading performance. Additionally, increasing model dimensionality and employing summarisation techniques do not consistently enhance clustering efficiency, suggesting that these strategies require careful consideration before practical application. These results highlight a complex balance between the need for refined text representation and computational feasibility in text clustering applications.
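A minimal sketch of the embed-then-cluster comparison described above, assuming scikit-learn and sentence-transformers are installed; the sample texts, model name, and cluster count are illustrative and not taken from the paper. An LLM embedding such as OpenAI's would slot into the same loop via its embeddings API.

```python
# Sketch: compare a sparse TF-IDF baseline against dense BERT-family
# sentence embeddings on the same k-means clustering task.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sentence_transformers import SentenceTransformer

texts = [
    "Stocks rallied after the central bank held rates steady.",
    "The striker scored twice in the final minutes of the match.",
    "Bond yields fell as inflation data came in below forecasts.",
    "The home team clinched the title with a late penalty goal.",
]

# Traditional baseline: sparse TF-IDF vectors, densified for scoring.
tfidf = TfidfVectorizer().fit_transform(texts).toarray()

# Dense embeddings from a compact BERT-family sentence encoder.
bert = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)

for name, X in [("tf-idf", tfidf), ("bert", bert)]:
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(name, labels, silhouette_score(X, labels))
```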
The study extends traditional text clustering frameworks by integrating embeddings from LLMs, offering improved methodologies and suggesting new avenues for future research in various types of textual analysis. It also explores the impact of summarisation as a dimensionality-reduction technique and the effect of model size on clustering performance. Results show that summarisation does not consistently benefit all models, and that higher-dimensional models display mixed results. The study concludes that while larger models may offer improved clustering performance, their computational demands must be weighed against practical resource constraints. The findings emphasise the importance of balancing embedding quality with computational efficiency in text clustering tasks.
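Of the metrics cited above, cluster purity has no ready-made function in scikit-learn. The sketch below shows the standard formulation (each cluster is credited with its most frequent ground-truth label), not the paper's own code.

```python
# Cluster purity from a contingency matrix: for each predicted cluster,
# count the samples carrying its majority gold label, then normalise.
from sklearn.metrics.cluster import contingency_matrix

def purity(y_true, y_pred):
    m = contingency_matrix(y_true, y_pred)  # shape: (classes, clusters)
    return m.max(axis=0).sum() / m.sum()    # majority label per cluster

print(purity([0, 0, 1, 1], [1, 1, 0, 0]))  # perfect clustering -> 1.0
```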