The paper introduces the Medical Information Retrieval-Augmented Generation Evaluation (MIRAGE) benchmark and the MEDRAG toolkit to systematically evaluate Retrieval-Augmented Generation (RAG) systems for medical question answering (QA). MIRAGE comprises 7,663 questions drawn from five medical QA datasets and is evaluated under zero-shot, question-only retrieval settings. MEDRAG is a toolkit that combines different corpora, retrievers, and large language models (LLMs) into end-to-end RAG pipelines whose performance can be measured on MIRAGE. The study finds that combining different corpora and retrievers significantly improves accuracy, with GPT-3.5 and Mixtral reaching performance comparable to GPT-4. The results also underscore the importance of corpus and retriever selection, with PubMed and MedCorp proving most effective. In addition, the study observes a log-linear scaling relationship between model performance and the number of retrieved snippets, as well as a "lost-in-the-middle" effect in which the position of the ground-truth snippet within the retrieved context affects performance. The paper closes with practical recommendations for corpus, retriever, and LLM selection to improve the effectiveness of RAG systems in medical QA.
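To make the evaluated setup concrete, the sketch below shows a minimal RAG pipeline of the kind MEDRAG benchmarks: retrieve the top-k snippets using only the question, prepend them to the prompt, and ask an LLM to answer the multiple-choice question zero-shot. The tiny in-memory corpus and the names `retrieve`, `build_prompt`, and `ask_llm` are illustrative assumptions for this sketch, not the MEDRAG toolkit API; the term-overlap scoring stands in for real retrievers such as BM25 or MedCPT.

```python
# Minimal sketch of a retrieval-augmented medical QA pipeline (illustrative only;
# not the MEDRAG implementation).

from collections import Counter

# Toy "corpus" standing in for chunked PubMed/StatPearls/Textbooks/Wikipedia snippets.
CORPUS = [
    "Metformin is a first-line oral agent for type 2 diabetes mellitus.",
    "Aspirin irreversibly inhibits cyclooxygenase, reducing thromboxane A2.",
    "Warfarin inhibits vitamin K epoxide reductase, affecting factors II, VII, IX, X.",
]

def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank snippets by simple term overlap (a stand-in for BM25/MedCPT retrievers)."""
    q_terms = Counter(question.lower().split())
    def score(snippet: str) -> int:
        s_terms = Counter(snippet.lower().split())
        return sum(min(q_terms[t], s_terms[t]) for t in q_terms)
    return sorted(corpus, key=score, reverse=True)[:k]

def build_prompt(question: str, options: dict[str, str], snippets: list[str]) -> str:
    """Question-only retrieval: snippets come from the question alone, not the options."""
    context = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    choices = "\n".join(f"{key}. {text}" for key, text in options.items())
    return (
        f"Relevant documents:\n{context}\n\n"
        f"Question: {question}\n{choices}\n"
        "Answer with the letter of the correct option."
    )

def ask_llm(prompt: str) -> str:
    """Placeholder for a zero-shot call to GPT-3.5, GPT-4, Mixtral, or a similar LLM."""
    return "A"  # a real system would send `prompt` to an LLM API here

if __name__ == "__main__":
    question = "Which drug is a first-line oral agent for type 2 diabetes?"
    options = {"A": "Metformin", "B": "Aspirin", "C": "Warfarin"}
    snippets = retrieve(question, CORPUS, k=2)
    prompt = build_prompt(question, options, snippets)
    print(prompt)
    print("Predicted answer:", ask_llm(prompt))
```

Varying `k` in this pipeline is the knob behind the paper's log-linear scaling observation, and reordering `snippets` before building the prompt is how the "lost-in-the-middle" position effect would be probed.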