TSpec-LLM: An Open-source Dataset for LLM Understanding of 3GPP Specifications

3 Jun 2024 | Rasoul Nikbakht, Mohamed Benzaghta, and Giovanni Geraci
TSpec-LLM is an open-source dataset containing all 3GPP documents from Release 8 to Release 19 (1999–2023), totaling 13.5 GB with 30,137 documents and 535 million words. It is designed for research on large language models (LLMs) in the telecommunications domain. The dataset preserves the original content of the 3GPP specifications, including tables, formulas, and figures, and retains the original document structure. It is available on Hugging Face and can be used to train and fine-tune LLMs on telecom standards.

To evaluate the dataset's effectiveness, a questionnaire was created from 3GPP Releases 15–17, focusing on series 36 and 38. The questionnaire was generated in three steps: prompt engineering, verification by an open-source LLM, and human validation. On this questionnaire, the state-of-the-art LLMs GPT-3.5, Gemini Pro 1.0, and GPT-4 reached accuracies of 44%, 46%, and 51%, respectively.

Combining these models with a retrieval-augmented generation (RAG) framework raised their accuracies to 71%, 75%, and 72%, respectively. The framework follows the naive-RAG paradigm: it retrieves relevant documents from the TSpec-LLM dataset and uses them to build the prompt given to the LLM, which substantially improves accuracy on complex telecom-related questions.

The TSpec-LLM dataset is a comprehensive, well-organized collection of 3GPP specifications for LLM work in the telecom domain. It is designed to support telecommunications research and development by enabling LLMs to better understand and process complex technical documents.
The dataset's inclusion of all 3GPP documents, along with their original structure and content, makes it a valuable resource for training and fine-tuning LLMs to effectively understand and respond to telecom-related queries.
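
The naive-RAG flow summarized above (retrieve relevant text, then build a prompt around it) can be illustrated with a minimal Python sketch. This is not the authors' pipeline. The function names, the prompt wording, and the placeholder chunks are all hypothetical, and a toy bag-of-words cosine similarity stands in for whatever embedding model a real system would use.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question (toy retriever)."""
    q = Counter(tokenize(question))
    ranked = sorted(
        chunks,
        key=lambda c: cosine(q, Counter(tokenize(c))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, chunks, k=2):
    """Place the retrieved chunks ahead of the question in one prompt."""
    context = "\n\n".join(retrieve(question, chunks, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical placeholder chunks, NOT real 3GPP specification text.
chunks = [
    "Placeholder chunk A: text about subcarrier spacing and numerology.",
    "Placeholder chunk B: text about handover procedures.",
    "Placeholder chunk C: text about channel models and path loss.",
]
question = "Which subcarrier spacing does each numerology use?"
prompt = build_prompt(question, chunks, k=1)
```

The resulting `prompt` string would then be sent to an LLM of choice. In a real setting the chunks would come from the TSpec-LLM documents, and the retriever would typically use dense embeddings rather than word counts.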