Developing named entity recognition algorithms for Uzbek: Dataset insights and implementation

2024 | Davlatyor Mengliev, Vladimir Barakhnin, Nilufar Abdurakhmonova, Mukhriddin Eshkulov
This paper presents a dataset and two approaches for named entity recognition (NER) in Uzbek, a low-resource language. The dataset contains 1,160 sentences with nearly 19,000 word forms annotated for parts of speech and named entities, making it a useful resource for linguistic research and machine learning applications in Uzbek. It was compiled from legal documents, a source the authors chose for its grammatical correctness.
The dataset has five columns: sentence order, word, part of speech, named entity category, and entity type according to the BIOES scheme. BIOES tags mark whether a word begins (B), is inside (I), or ends (E) a multi-word entity, is a single-word entity (S), or lies outside any entity (O), so entity boundaries are delineated explicitly, which can help NER systems perform better. The annotated entity types include city, country, date, organization, person, position, and street.
The authors developed two algorithms for NER in Uzbek texts. The first uses a dictionary to identify named entities, while the second is based on neural networks and uses the spaCy library. The paper describes the methodology for creating the dataset, the design of the algorithms, and their application to Uzbek. The dataset and algorithms can be used to build applications such as improved chatbot systems, text mining applications, and other analytical tools for the Uzbek language.
The study provides a dataset for future NER tasks in Uzbek and offers a methodological basis for applying dictionary-based or machine learning NER to other low-resource languages. The authors also discuss the limitations of the dataset and algorithms and opportunities for their further development. The dataset is available for download from Google Drive.
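To make the dictionary-based approach and the BIOES scheme described above more concrete, here is a minimal Python sketch. It is not the authors' implementation: the gazetteer entries, the uppercase label names, the function names, and the longest-match lookup strategy are all assumptions chosen for illustration. It tags a tokenized sentence by exact dictionary lookup and then reads the entities back out of the BIOES tags.

```python
from typing import Dict, List, Tuple

# Hypothetical mini-gazetteer (an assumption, not taken from the paper's dataset):
# each key is a tuple of tokens, each value an entity category.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("Toshkent",): "CITY",
    ("O'zbekiston", "Respublikasi"): "COUNTRY",
    ("Oliy", "Majlis"): "ORGANIZATION",
}


def bioes_tags(length: int, label: str) -> List[str]:
    """Return the BIOES tags for one entity that spans `length` tokens."""
    if length == 1:
        return [f"S-{label}"]
    return [f"B-{label}"] + [f"I-{label}"] * (length - 2) + [f"E-{label}"]


def tag_sentence(tokens: List[str],
                 gazetteer: Dict[Tuple[str, ...], str] = GAZETTEER) -> List[str]:
    """Tag tokens by greedy longest-match lookup in the gazetteer."""
    max_len = max(len(key) for key in gazetteer)
    tags: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = gazetteer.get(tuple(tokens[i:i + n]))
            if label is not None:
                tags.extend(bioes_tags(n, label))
                i += n
                break
        else:
            # No dictionary entry starts at this token.
            tags.append("O")
            i += 1
    return tags


def decode_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Recover (entity text, category) pairs from a BIOES tag sequence."""
    spans: List[Tuple[str, str]] = []
    start = None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            spans.append((tokens[i], label))
        elif prefix == "B":
            start = i
        elif prefix == "E" and start is not None:
            spans.append((" ".join(tokens[start:i + 1]), label))
            start = None
    return spans


if __name__ == "__main__":
    sentence = ["Oliy", "Majlis", "Toshkent", "shahrida", "joylashgan"]
    sentence_tags = tag_sentence(sentence)
    print(list(zip(sentence, sentence_tags)))
    print(decode_spans(sentence, sentence_tags))
```

Exact matching of this kind only finds surface forms that appear in the dictionary verbatim, so a practical Uzbek system would likely also need to handle suffixed word forms; that step is left out of this sketch.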
The study emphasizes the importance of ethical considerations in data collection, processing, and use, ensuring that all data handling activities adhere to ethical standards and respect the rights and privacy of participants. The authors also mention that the dataset is derived from the National Database of Legislation of the Republic of Uzbekistan and that any use of the data must comply with the terms of information provision of the lex.uz website.