NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data


23 Feb 2024 | Sergei Bogdanov, Alexandre Constantin, Timothée Bernard, Benoit Crabbé, Etienne Bernard
This paper introduces NuNER, a compact language representation model specialized in Named Entity Recognition (NER). The model is trained on data annotated by large language models (LLMs), which makes it cheap to fine-tune for a wide range of downstream NER tasks. NuNER outperforms similar-sized foundation models in few-shot scenarios and competes with much larger LLMs. It is designed as a task-specific foundation model: a general NER encoder intended to be fine-tuned on specific NER problems.

The pre-training dataset is built by prompting GPT-3.5 to annotate a subset of the C4 corpus, yielding 4.38 million entity annotations spanning 200,000 unique concepts. This dataset is then used to further pre-train a RoBERTa base model with a contrastive learning objective, producing NuNER (illustrative sketches of the annotation and pre-training steps follow this summary).

Evaluated on a range of NER tasks, NuNER shows strong few-shot performance, outperforming both its RoBERTa base model and a model pre-trained on the NER-BERT data. Compared with other systems, it clearly beats GPT-3.5, becomes competitive with GPT-4 once sufficient training examples are available, and matches the transfer-learning performance of UniversalNER while being substantially smaller.

An ablation study examines text diversity, concept diversity, pre-training dataset size, and model size. Concept diversity and dataset size turn out to be the main drivers of performance, text diversity matters less, and model size has only a modest impact.

The authors present NuNER as a drop-in alternative to RoBERTa for NER and open-source both the model and the LLM-annotated pre-training dataset. They conclude that LLM-based data annotation can substantially improve the efficiency and effectiveness of NER models.
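The annotation step can be pictured with a short script like the one below. This is a minimal sketch assuming the current OpenAI Python client; the prompt wording and the `entity <|> concept` output format are illustrative choices, not the paper's actual prompt.

```python
"""Hypothetical sketch of the LLM-annotation step: ask GPT-3.5 to list
entities and open-ended concept labels for a raw C4 passage."""
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Extract all entities from the text below. For each entity, output one "
    "line formatted as: entity <|> concept. Use any concept name that fits; "
    "do not restrict yourself to a fixed tag set.\n\nText:\n{passage}"
)

def annotate(passage: str) -> list[tuple[str, str]]:
    """Return (entity, concept) pairs proposed by the LLM for one passage."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT.format(passage=passage)}],
        temperature=0.0,
    )
    pairs = []
    for line in response.choices[0].message.content.splitlines():
        if "<|>" in line:
            entity, concept = (part.strip() for part in line.split("<|>", 1))
            pairs.append((entity, concept))
    return pairs

if __name__ == "__main__":
    print(annotate("Apple unveiled the Vision Pro headset in Cupertino."))
```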
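The contrastive pre-training step scores token embeddings from a text encoder against embeddings of the LLM-produced concept names from a second encoder, and pulls tokens toward the concepts they were annotated with. The sketch below is one plausible reading of that setup, assuming RoBERTa-base for both encoders, a 128-dimensional shared projection, and a binary cross-entropy loss over the token-by-concept similarity matrix; the paper's exact architecture and hyperparameters may differ.

```python
"""Sketch of a token-vs-concept contrastive pre-training objective."""
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TokenConceptAligner(nn.Module):
    def __init__(self, model_name: str = "roberta-base", dim: int = 128):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained(model_name)
        self.concept_encoder = AutoModel.from_pretrained(model_name)
        hidden = self.text_encoder.config.hidden_size
        self.text_proj = nn.Linear(hidden, dim)
        self.concept_proj = nn.Linear(hidden, dim)

    def forward(self, text_inputs, concept_inputs, labels):
        # Token embeddings: (batch, seq_len, dim)
        tokens = self.text_proj(self.text_encoder(**text_inputs).last_hidden_state)
        # One embedding per concept name, taken from the first position: (num_concepts, dim)
        concepts = self.concept_proj(
            self.concept_encoder(**concept_inputs).last_hidden_state[:, 0]
        )
        # Similarity of every token to every concept: (batch, seq_len, num_concepts)
        logits = tokens @ concepts.T
        # labels[b, t, c] = 1 if token t in example b was annotated with concept c
        return nn.functional.binary_cross_entropy_with_logits(logits, labels.float())

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = TokenConceptAligner()
text_inputs = tokenizer(["NuNER was trained on C4 text."], return_tensors="pt")
concept_inputs = tokenizer(["model", "dataset"], return_tensors="pt", padding=True)
labels = torch.zeros(1, text_inputs["input_ids"].shape[1], 2)  # toy target matrix
loss = model(text_inputs, concept_inputs, labels)
loss.backward()
```

After this pre-training stage, the concept encoder is discarded and only the text encoder (NuNER) is kept for downstream fine-tuning.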
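Because NuNER is positioned as a drop-in alternative to RoBERTa, downstream use reduces to ordinary token-classification fine-tuning on a handful of labeled examples. A hedged sketch follows; the Hugging Face checkpoint identifier and the label scheme are assumptions, so substitute whatever the authors actually release.

```python
"""Sketch of few-shot fine-tuning on top of the released NuNER encoder."""
from transformers import AutoModelForTokenClassification, AutoTokenizer

checkpoint = "numind/NuNER-v1.0"  # assumed repo id; check the authors' release
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint,
    num_labels=5,  # e.g. O, B-PER, I-PER, B-ORG, I-ORG for a small custom scheme
)
# From here, apply the usual token-classification recipe (Trainer or a manual
# training loop) on a few labeled sentences per entity type -- the few-shot setting.
```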