BMRETRIEVER: Tuning Large Language Models as Better Biomedical Text Retrievers


29 Apr 2024 | Ran Xu, Wenqi Shi, Yue Yu, Yuchen Zhuang, Yanqiao Zhu, May D. Wang, Joyce C. Ho, Chao Zhang, Carl Yang
BMRETRIEVER is a series of dense retrievers that adapts large language models (LLMs) to biomedical text retrieval. Training proceeds in two stages: unsupervised contrastive pre-training on large-scale query-passage pairs constructed from biomedical corpora, followed by instruction fine-tuning on diverse labeled datasets augmented with synthetic pairs generated by LLMs. This two-stage framework allows BMRETRIEVER to adapt to a variety of biomedical downstream tasks and input formats. The training data and model checkpoints are released on Hugging Face for transparency and reproducibility.
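To make the first stage concrete, the sketch below shows the standard in-batch negative contrastive (InfoNCE) objective commonly used to pre-train dense retrievers on query-passage pairs. This is a minimal PyTorch illustration of the general technique, not the authors' implementation; the temperature value is an assumed placeholder.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q_emb: torch.Tensor, p_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """In-batch negative contrastive loss for retriever pre-training.

    q_emb: (B, d) query embeddings; p_emb: (B, d) passage embeddings.
    The passage at index i is the positive for query i; all other
    passages in the batch act as negatives.
    """
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(p_emb, dim=-1)
    logits = q @ p.T / temperature                 # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)  # diagonal = positives
    return F.cross_entropy(logits, labels)
```

In practice, larger batches supply more in-batch negatives, which generally sharpens the learned embedding space; the paper's exact objective and hyperparameters may differ from this sketch.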
Evaluated on five biomedical tasks across eleven datasets, BMRETRIEVER outperforms existing dense retrievers with significantly more parameters: the 410M variant beats baselines up to 11.7 times larger, and the 2B variant matches models with over 5B parameters. It also offers a lightweight yet high-performing domain-adaptation option, with the 1B variant reaching over 98% of E5-Mistral's performance while using only 14.3% of its parameters.

BMRETRIEVER generalizes across diverse tasks and input formats, including retrieving long contexts from short questions, long answers from patient questions, definitions from entity names, and relevant abstracts given an abstract. It further handles unseen tasks such as entity linking and paper recommendation, and it is data-efficient, requiring significantly less training data than many baselines. Case studies confirm that it retrieves more relevant information than strong baselines. Overall, BMRETRIEVER is a promising approach for biomedical text retrieval with the potential to advance biomedical NLP research.
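Since the checkpoints are public, retrieval at inference time reduces to embedding queries and passages and ranking by similarity. The sketch below shows one plausible way to do this with Hugging Face transformers; the repository id and the last-token pooling strategy are assumptions to verify against the released checkpoints and project documentation.

```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "BMRetriever/BMRetriever-410M"  # assumed repo id; check the project page

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:            # decoder-only tokenizers often lack one
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    # Last-token pooling: use the hidden state of each sequence's final
    # non-padding token (common for decoder-only retrievers; pooling may differ).
    hidden = out.last_hidden_state                     # (B, T, d)
    last = batch["attention_mask"].sum(dim=1) - 1      # index of last real token
    emb = hidden[torch.arange(hidden.size(0)), last]
    return torch.nn.functional.normalize(emb, dim=-1)

query = embed(["What are the side effects of metformin?"])
passages = embed(["Metformin commonly causes gastrointestinal upset ...",
                  "Aspirin is an antiplatelet agent ..."])
scores = query @ passages.T  # cosine similarity; higher = more relevant
```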