BMRETRIEVER: Tuning Large Language Models as Better Biomedical Text Retrievers

29 Apr 2024 | Ran Xu, Wenqi Shi, Yue Yu, Yuchen Zhuang, Yanqiao Zhu, May D. Wang, Joyce C. Ho, Chao Zhang, Carl Yang
**Institution:** Emory University, Georgia Tech, UCLA

**Abstract:** Developing effective biomedical retrieval models is essential for knowledge-intensive biomedical tasks, but remains challenging due to the scarcity of publicly annotated data and limited computational resources. BMRETRIEVER is a series of dense retrievers that improves biomedical retrieval through unsupervised pre-training on large biomedical corpora, followed by instruction fine-tuning on a combination of labeled datasets and synthetic pairs. Experiments on five biomedical tasks across 11 datasets demonstrate BMRETRIEVER's efficacy and strong parameter efficiency: the 410M variant outperforms baselines up to 11.7 times larger, and the 2B variant matches the performance of models with over 5B parameters. The training data and model checkpoints are released to ensure transparency and reproducibility and to support domain-specific adaptation.

**Introduction:** Effective biomedical retrieval models are essential for knowledge-intensive biomedical tasks, but building them is difficult given limited annotated data and computational resources. BMRETRIEVER addresses these issues by pre-training on large biomedical corpora and then fine-tuning on labeled datasets and synthetic pairs. Evaluated on five biomedical tasks across 11 datasets, it outperforms existing dense retrievers while using significantly fewer parameters.

**Method:** BMRETRIEVER uses pre-trained autoregressive transformers as its backbone, which allows the retriever to be scaled flexibly. The model is first pre-trained on a diverse collection of biomedical corpora and then instruction fine-tuned on labeled data together with synthetic examples generated by LLMs. This two-stage framework adapts the model to a variety of biomedical downstream tasks (generic sketches of the contrastive objective commonly used for such training, and of querying the released checkpoints, are given below).

**Experimental Results:** Extensive experiments on five biomedical tasks across 11 datasets validate BMRETRIEVER's efficacy. The model outperforms existing dense retrievers with significantly fewer parameters, demonstrating strong parameter efficiency and domain adaptation. It also generalizes robustly across diverse tasks and input formats, including long-context retrieval, long-answer retrieval, and entity linking.

**Conclusion:** BMRETRIEVER is a powerful tool for biomedical text retrieval, achieving state-of-the-art performance with efficient parameter usage. Its transparency, reproducibility, and potential for domain-specific adaptation make it a valuable contribution to biomedical NLP research.
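The summary above does not reproduce the paper's training objective. Dense retrievers trained on query-passage pairs of the kind described in the Method section are commonly optimized with an in-batch contrastive (InfoNCE-style) loss; the snippet below is a minimal, generic sketch of that objective under this assumption, not the authors' exact recipe, and the function name and temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  passage_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """Generic in-batch contrastive loss for dense retrieval.

    query_emb:   (B, d) query embeddings.
    passage_emb: (B, d) embeddings of the matching (positive) passages;
                 the other passages in the batch act as negatives.
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature                      # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)   # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```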
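For retrieval with the released checkpoints, a decoder-only backbone is typically queried by encoding text and pooling the final hidden state of the last non-padding token. The sketch below assumes the checkpoint is published on the Hugging Face Hub under an identifier such as `BMRetriever/BMRetriever-410M` (illustrative; consult the authors' release for the exact name and any instruction prompt format the model expects).

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint id; substitute the identifier from the authors' release.
MODEL_ID = "BMRetriever/BMRetriever-410M"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:                 # decoder-only tokenizers often lack a pad token
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"                # last-token pooling below assumes right padding
model = AutoModel.from_pretrained(MODEL_ID).eval()

def embed(texts):
    """Encode texts and pool the hidden state of the last non-padding token."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state            # (B, L, d)
    last_idx = batch["attention_mask"].sum(dim=1) - 1         # index of last real token
    emb = hidden[torch.arange(hidden.size(0)), last_idx]      # (B, d)
    return torch.nn.functional.normalize(emb, dim=-1)

query = ["What gene mutations are associated with cystic fibrosis?"]
docs = ["Cystic fibrosis is caused by mutations in the CFTR gene.",
        "Influenza is a viral respiratory infection."]
scores = embed(query) @ embed(docs).T   # cosine similarities after normalization
print(scores)
```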