This paper introduces the MIRAGE benchmark and the MEDRAG toolkit for evaluating retrieval-augmented generation (RAG) systems in medicine. MIRAGE is a first-of-its-kind benchmark consisting of 7,663 questions drawn from five medical QA datasets. The MEDRAG toolkit enables large-scale experiments spanning more than 1.8 trillion prompt tokens across 41 combinations of corpora, retrievers, and backbone LLMs.

The results show that MEDRAG improves the accuracy of six different LLMs by up to 18% over chain-of-thought prompting, lifting GPT-3.5 and Mixtral to GPT-4-level performance. The study also identifies a log-linear scaling property and "lost-in-the-middle" effects in medical RAG, and finds that combining multiple medical corpora and retrievers achieves the best performance. Based on these results, the paper offers practical recommendations for implementing RAG systems in medicine, covering the selection of corpora, retrievers, and LLMs, and it highlights the importance of up-to-date, reliable information sources for accurate medical QA. Overall, RAG substantially enhances the zero-shot ability of LLMs to answer medical questions, making it a more efficient choice than larger-scale pre-training. The paper concludes that the MIRAGE benchmark and MEDRAG toolkit together provide a comprehensive evaluation framework for RAG systems in medicine, offering practical guidelines for their implementation and future research.
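As a concrete illustration of what "combining retrievers" can mean in practice, the sketch below fuses ranked results from several retrievers using reciprocal rank fusion (RRF), one common way to merge rankings. This is a minimal, self-contained example under the assumption that each retriever returns a ranked list of snippet IDs; the function name, corpus IDs, and data are illustrative and not taken from the MEDRAG codebase, whose exact fusion strategy may differ.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked snippet-ID lists from multiple retrievers.

    Each document's fused score is the sum over retrievers of
    1 / (k + rank), so snippets ranked highly by several
    retrievers rise to the top of the combined list.
    """
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative only: top-5 snippet IDs from three hypothetical retrievers
# (e.g., a lexical retriever and two dense retrievers over medical corpora).
lexical_hits = ["pubmed_12", "textbook_3", "statpearls_7", "pubmed_44", "wiki_9"]
dense_hits_a = ["textbook_3", "pubmed_12", "wiki_2", "statpearls_7", "pubmed_8"]
dense_hits_b = ["statpearls_7", "pubmed_12", "textbook_3", "pubmed_91", "wiki_2"]

fused = reciprocal_rank_fusion([lexical_hits, dense_hits_a, dense_hits_b])
print(fused[:5])  # snippets to place in the LLM prompt, best-ranked first
```

In this toy setup, snippets retrieved by more than one retriever are promoted, which is the intuition behind the finding that combining corpora and retrievers improves downstream QA accuracy.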