RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models

6 Jul 2024 | Peng Xia, Kangyu Zhu, Haoran Li, Hongtu Zhu, Yun Li, Gang Li, Linjun Zhang, Huaxiu Yao
RULE is a reliable multimodal Retrieval-Augmented Generation (RAG) method designed to enhance the factual accuracy of Medical Large Vision Language Models (Med-LVLMs). It addresses two major challenges in applying RAG to this setting: (1) the risk of factual inaccuracies when too few or too many contexts are retrieved, and (2) over-reliance on retrieved contexts, which can lead to incorrect answers even when the model's original response was correct.

RULE consists of two key components. First, a factuality risk control strategy calibrates the number of retrieved contexts without additional training: it estimates the risk of factual errors for each candidate number of retrieved contexts and selects those that meet a predefined risk tolerance. Second, a knowledge-retrieval balance tuning strategy uses preference optimization on a curated preference dataset to fine-tune the model, balancing its reliance on its own knowledge against the retrieved contexts so that it does not depend on them excessively.

The approach is validated on three medical Visual Question Answering (VQA) datasets (MIMIC-CXR, IU-Xray, and Harvard-FairVLMed), achieving an average improvement of 20.8% in factual accuracy. The implementation combines a retrieval strategy, statistical methods for risk control, and preference optimization. Results show that RULE significantly improves the model's ability to generate accurate medical responses, particularly in cases where it would otherwise rely too heavily on retrieved information, and that the method is compatible with different models and datasets, indicating its general applicability to medical multimodal diagnosis.
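The retrieval stage pairs each input image with the most similar reports from a reference corpus. The paper's exact retriever and checkpoint are not reproduced here; the snippet below is a minimal sketch using Hugging Face's generic CLIP model (`openai/clip-vit-base-patch32`) for image-to-report retrieval, assuming that in practice a retriever fine-tuned on medical image-report pairs would be substituted.

```python
# Hedged sketch: top-k report retrieval with a generic CLIP model.
# RULE's actual retriever/checkpoint may differ; a domain-tuned
# (medical) CLIP would replace the generic one in practice.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve_reports(image: Image.Image, reports: list[str], k: int) -> list[str]:
    """Return the k reference reports most similar to the query image."""
    inputs = processor(text=reports, images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_image  # shape (1, len(reports))
    top = sims[0].topk(min(k, len(reports))).indices
    return [reports[int(i)] for i in top]
```

The retrieved reports are then concatenated into the Med-LVLM's prompt as supporting context; how many of them to include is exactly the quantity the next component calibrates.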
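The factuality risk control strategy treats the number of retrieved contexts k as the quantity to calibrate: on a held-out calibration set, it measures the factual-error rate for each candidate k and keeps only the values whose true risk can be statistically guaranteed to stay below a tolerance alpha. The sketch below assumes a Hoeffding-style upper confidence bound with a Bonferroni correction over candidates; the paper's exact statistical test may differ in detail.

```python
# Hedged sketch of calibrating k (number of retrieved contexts) under a
# risk tolerance alpha, via a Hoeffding upper confidence bound with a
# Bonferroni correction; the paper's exact procedure may differ.
import math

def admissible_k(errors_by_k: dict[int, list[int]],
                 alpha: float, delta: float = 0.1) -> list[int]:
    """errors_by_k[k] is a list of 0/1 factual-error indicators measured
    on a calibration set when k contexts are retrieved. Return every k
    whose true risk is <= alpha with confidence 1 - delta."""
    candidates = sorted(errors_by_k)
    delta_each = delta / len(candidates)  # Bonferroni split over candidates
    selected = []
    for k in candidates:
        errs = errors_by_k[k]
        n = len(errs)
        risk_hat = sum(errs) / n
        # Hoeffding: true risk <= risk_hat + sqrt(log(1/d)/(2n)) w.p. 1 - d
        ucb = risk_hat + math.sqrt(math.log(1 / delta_each) / (2 * n))
        if ucb <= alpha:
            selected.append(k)
    return selected
```

At inference, any k in the returned set preserves the factual-error guarantee; choosing the largest admissible value, for instance, supplies as much supporting evidence as the bound allows.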
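The knowledge-retrieval balance tuning stage fine-tunes the Med-LVLM with preference optimization on pairs where the preferred response is the factually correct answer and the dispreferred one is the answer produced when the model over-relied on (or wrongly ignored) the retrieved context. The loss below is the standard Direct Preference Optimization (DPO) objective; the pairing scheme described here and the beta value are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of the standard DPO loss as used for knowledge-retrieval
# balance tuning. Inputs are summed token log-probs of the chosen
# (factually correct) and rejected (over-reliant) responses under the
# policy and a frozen reference model; beta = 0.1 is illustrative.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO: increase the margin by which the policy prefers the chosen
    response over the rejected one, relative to the reference model."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

Because the rejected responses are drawn from cases where retrieval misled the model, minimizing this loss pushes the model to weigh its own visual and medical knowledge against the retrieved reports rather than deferring to them by default.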