2024 | Davlatory Mengliev, Vladimir Barakhnin, Nilufar Abdurakhmonova, Mukhriddin Eshkulov
This paper presents a dataset and approaches to named entity recognition (NER) in the Uzbek language, addressing the shortage of NLP resources for this low-resource language. The dataset includes 1,160 sentences with nearly 19,000 word forms annotated for parts of speech and named entities, making it a valuable resource for linguistic research and machine learning applications. Two algorithms were developed to identify named entities using this dataset. The first relies on a dictionary-based approach, while the second uses a neural network built with the spaCy library. The study also describes the methodology for creating the dataset, the design of the algorithms, and their application to the Uzbek language. The dataset and algorithms can be used to develop applications such as improved chatbot systems, text mining, and other analytical tools for Uzbek. The paper discusses the value of the data, the data processing methods, the implementation of the algorithms, and the results of experiments. It also addresses limitations and ethical considerations in the research. The dataset and algorithms can be adapted for other low-resource Turkic languages, such as Karakalpak, or for the Oghuz dialect of Uzbek.
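To make the dictionary-based idea concrete, here is a minimal sketch in Python of gazetteer lookup with longest-match-first search. It is not the authors' algorithm: the gazetteer entries, the label names, the suffix list, and the function names are all hypothetical, chosen only to illustrate how a lookup-based tagger might cope with Uzbek case suffixes attached to names.

```python
# Hypothetical gazetteer: entries and labels are illustrative assumptions,
# not taken from the paper's dataset.
GAZETTEER = {
    ("alisher", "navoiy"): "PERSON",
    ("samarqand",): "LOC",
    ("toshkent",): "LOC",
}

# Illustrative subset of Uzbek case suffixes (an assumption, not a full list).
SUFFIXES = ("ning", "dan", "da", "ga", "ni")

# Longest entry length in tokens, used to bound the search window.
MAX_LEN = max(len(key) for key in GAZETTEER)


def normalize(token):
    """Lowercase a token and strip surrounding punctuation."""
    return token.lower().strip(".,!?;:\"()")


def strip_suffix(word):
    """Remove one known case suffix, if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tag_entities(sentence):
    """Return (surface text, label) pairs found by longest-match lookup."""
    tokens = sentence.split()
    norm = [normalize(t) for t in tokens]
    entities = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shrink it.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            span = norm[i : i + n]
            # Try the span as written, then with the last word's suffix removed.
            candidates = (
                tuple(span),
                tuple(span[:-1] + [strip_suffix(span[-1])]),
            )
            for key in candidates:
                if key in GAZETTEER:
                    entities.append((" ".join(tokens[i : i + n]), GAZETTEER[key]))
                    i += n
                    matched = True
                    break
            if matched:
                break
        if not matched:
            i += 1
    return entities


print(tag_entities("Alisher Navoiy Samarqandda o'qigan."))
# → [('Alisher Navoiy', 'PERSON'), ('Samarqandda', 'LOC')]
```

A lookup of this kind is transparent and needs no training data, but it only finds names already in the dictionary and cannot resolve ambiguous words, which is the usual motivation for pairing it with a trained statistical model.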