This paper explores the application of Large Language Models (LLMs) in the biomedical domain, specifically focusing on Named Entity Recognition (NER). The study highlights the challenges LLMs face in biomedical NER due to data scarcity and the complexity of medical language. Key findings include:
1. **Prompt Engineering**: Carefully designed prompts significantly enhance LLM performance. The TANL and DICE formats are adapted for biomedical NER, with the TANL format proving more effective in most cases.
2. **In-Context Learning (ICL)**: Strategic selection of in-context examples via nearest-neighbor search (KATE) improves ICL outcomes, outperforming random example selection. Biomedical text encoders, such as BioClinicalRoBERTa, outperform general-domain encoders for this retrieval step.
3. **LLM Selection**: The study compares the performance and cost of fine-tuning open-source LLMs against prompting closed-source LLMs for biomedical NER. Fine-tuned Llama2 achieves superior results on the NCBI-disease dataset, while GPT-4, enhanced with KATE and a biomedical encoder, performs better on the I2B2 and BC2GM datasets.
4. **Dictionary-Infused RAG**: A novel method, DiRAG, is proposed to integrate external medical knowledge (e.g., UMLS) into LLMs. This method significantly improves zero-shot NER performance on the I2B2 and NCBI-disease datasets, achieving state-of-the-art results.
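The TANL-style prompt format mentioned in point 1 marks entities inline as `[ mention | type ]` within the sentence. A minimal sketch of such a formatter (the function name and span representation are illustrative, not from the paper):

```python
def to_tanl(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Annotate character spans (start, end, label) in TANL style:
    each entity mention is wrapped as [ mention | label ]."""
    out, prev = [], 0
    for start, end, label in sorted(spans):
        out.append(text[prev:start])               # untouched text before the span
        out.append(f"[ {text[start:end]} | {label} ]")
        prev = end
    out.append(text[prev:])                        # trailing text after the last span
    return "".join(out)

print(to_tanl("BRCA1 mutations cause breast cancer.",
              [(0, 5, "gene"), (22, 35, "disease")]))
# → [ BRCA1 | gene ] mutations cause [ breast cancer | disease ].
```

The target string the LLM is asked to produce is then this annotated sentence, which is easy to parse back into spans.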
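The KATE selection strategy from point 2 can be sketched as a cosine-similarity nearest-neighbor search over sentence embeddings. In the paper's setup the embeddings would come from a biomedical encoder such as BioClinicalRoBERTa; here they are assumed to be precomputed vectors:

```python
import numpy as np

def kate_select(query_vec: np.ndarray, example_vecs: np.ndarray, k: int = 2) -> list[int]:
    """Return indices of the k training examples most similar to the query
    under cosine similarity (KATE-style in-context example selection)."""
    q = query_vec / np.linalg.norm(query_vec)
    E = example_vecs / np.linalg.norm(example_vecs, axis=1, keepdims=True)
    sims = E @ q                       # cosine similarity of each example to the query
    return np.argsort(-sims)[:k].tolist()
```

The selected examples are then placed in the prompt as demonstrations, replacing random sampling.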
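The dictionary-infused idea in point 4 can be illustrated by retrieving matching dictionary entries for terms in the input and injecting them into the prompt as grounding context. This is a rough sketch only: the paper's actual DiRAG retrieval over UMLS is not detailed here, so a toy in-memory dictionary and naive substring matching stand in for it:

```python
def build_dirag_prompt(sentence: str, dictionary: dict[str, str],
                       task: str = "Extract all disease entities") -> str:
    """Sketch of dictionary-infused prompting: look up terms from a medical
    dictionary in the sentence and prepend matching entries as context."""
    hits = {term: desc for term, desc in dictionary.items()
            if term.lower() in sentence.lower()}      # naive retrieval stand-in
    context = "\n".join(f"- {t}: {d}" for t, d in sorted(hits.items()))
    return (f"Relevant dictionary entries:\n{context}\n\n"
            f"{task} in: {sentence}")
```

In a zero-shot setting, the retrieved entries give the LLM type definitions it would otherwise lack, which is the mechanism credited for the gains on I2B2 and NCBI-disease.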
The paper concludes by discussing the limitations of the study, including computational constraints and the need for more diverse knowledge bases. The code for the proposed methods is available at <https://github.com/masoud-monajati/LLM_Bio_NER>.