NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data


23 Feb 2024 | Sergei Bogdanov, Alexandre Constantin, Timothée Bernard, Benoît Crabbé, Etienne Bernard
The paper introduces NuNER, a compact language representation model specialized for Named Entity Recognition (NER). NuNER leverages large language models (LLMs) to build a task-specific foundation model: GPT-3.5 annotates a subset of the C4 corpus, producing a dataset of 4.38 million entity annotations covering 200k unique concepts. A RoBERTa encoder is then pre-trained on this annotated data via contrastive learning, yielding NuNER.

The paper demonstrates that NuNER outperforms similarly sized foundation models and competes with much larger LLMs in few-shot settings. Key findings include the importance of annotation diversity and pre-training dataset size for achieving good performance. NuNER is open-sourced and can be used as a drop-in replacement for RoBERTa in NER tasks. The paper also compares NuNER with other LLMs and specialized NER models, showing competitive performance despite its smaller size.
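To make the contrastive pre-training step concrete, the sketch below shows one plausible form of the objective: token embeddings from the text encoder are scored against embeddings of the LLM-generated concept names, and a binary cross-entropy loss pushes annotated (token, concept) pairs together and unannotated pairs apart. This is a minimal illustration, not the paper's exact architecture; the function name, shapes, and loss formulation here are assumptions for exposition.

```python
import torch
import torch.nn.functional as F

def contrastive_ner_loss(token_emb: torch.Tensor,
                         concept_emb: torch.Tensor,
                         targets: torch.Tensor) -> torch.Tensor:
    """Binary contrastive loss over token-concept pairs (illustrative sketch).

    token_emb:   (num_tokens, dim)   token representations from the text encoder.
    concept_emb: (num_concepts, dim) embeddings of the LLM-generated concept names.
    targets:     (num_tokens, num_concepts) binary matrix; 1 where the LLM
                 annotated a token with that concept, 0 otherwise.
    """
    logits = token_emb @ concept_emb.T  # pairwise similarity scores
    return F.binary_cross_entropy_with_logits(logits, targets.float())

# Toy usage: 5 tokens, 3 candidate concepts, 16-dim embeddings.
tokens = torch.randn(5, 16)
concepts = torch.randn(3, 16)
labels = torch.zeros(5, 3)
labels[0, 1] = 1.0  # token 0 was annotated with concept 1
loss = contrastive_ner_loss(tokens, concepts, labels)
```

Since NuNER is positioned as a drop-in replacement for RoBERTa, fine-tuning it for a downstream NER task would look like any standard Hugging Face token-classification setup. The model identifier below is an assumption based on the paper's open-source release; substitute the actual checkpoint name.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Assumed Hub identifier for the released checkpoint.
model_name = "numind/NuNER-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=9 corresponds to e.g. a CoNLL-style BIO label set.
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=9)

inputs = tokenizer("NuNER was pre-trained on a subset of C4.", return_tensors="pt")
predictions = model(**inputs).logits.argmax(dim=-1)  # one label id per sub-token
```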