9 Aug 2024 | Alina Petukhova, João P. Matos-Carvalho, Nuno Fachada
This study investigates the impact of different textual embeddings, particularly those from large language models (LLMs), and of various clustering algorithms on text clustering performance. It evaluates the effectiveness of the embeddings themselves, the role of dimensionality reduction through summarization, and the impact of model size; a minimal sketch of the underlying embed-then-cluster pipeline follows the list below. Key findings include:
1. **LLM Embeddings**: OpenAI's GPT-3.5 Turbo model outperforms the other models on three of the five clustering metrics across most tested datasets, showing a superior ability to capture the nuances of structured language.
2. **BERT**: Among the lightweight models, BERT performs best, underscoring its continued effectiveness for text clustering.
3. **Dimensionality Reduction**: Summarization techniques do not consistently improve clustering results, so their practical benefit should be weighed case by case (see the second sketch below).
4. **Model Size**: Increasing model dimensionality does not always improve clustering results, indicating a need to balance computational feasibility with text representation quality.
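To ground these findings, here is a minimal sketch of the embed-then-cluster pipeline the study evaluates. The specific choices here are illustrative assumptions, not the paper's exact setup: `all-MiniLM-L6-v2` stands in for the lightweight BERT-style encoders, k-means for the clustering algorithms, and the silhouette score for the evaluation metrics.

```python
# Minimal sketch of the embed-then-cluster pipeline (assumed components).
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

texts = [
    "The central bank raised interest rates again.",
    "Inflation data pushed bond yields higher.",
    "The striker scored twice in the final match.",
    "The home team won the championship on penalties.",
]

# 1. Turn each document into a dense embedding vector.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(texts)

# 2. Cluster the embeddings; k is assumed known here for simplicity.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

# 3. Score cluster cohesion and separation with an internal metric.
print("labels:", labels)
print("silhouette:", silhouette_score(embeddings, labels))
```

Swapping the encoder for an LLM embedding endpoint changes only step 1, which is what makes the study's comparison across embedding models possible.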
The study extends traditional text clustering frameworks by integrating LLM embeddings, offering improved methodologies and suggesting new avenues for future research in textual analysis. The findings underscore the importance of balancing detailed text representation with computational feasibility in text clustering tasks.
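To make the dimensionality-reduction finding concrete, below is a hedged sketch of the summarize-then-embed variant: each document is compressed to a short summary before embedding, so the encoder sees less (and ideally more salient) text. The distilBART checkpoint and the helper `summarize_then_embed` are assumptions for illustration, not the summarization models used in the study.

```python
# Hedged sketch of the summarize-then-embed variant from finding 3.
# Model choices are illustrative assumptions only.
from transformers import pipeline
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def summarize_then_embed(texts, max_length=60):
    # Compress each document to a short abstractive summary, reducing the
    # amount of text the encoder has to represent...
    summaries = [
        summarizer(t, max_length=max_length, min_length=5,
                   do_sample=False)[0]["summary_text"]
        for t in texts
    ]
    # ...then embed the summaries in place of the full documents.
    return encoder.encode(summaries)

# Usage: cluster the summary embeddings exactly as in the first sketch.
docs = ["first long document ...", "second long document ..."]
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    summarize_then_embed(docs)
)
```

Per finding 3, the gains from this extra summarization pass were inconsistent in the study, so its cost needs to be weighed against any clustering improvement.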