6 Jul 2024 | Peng Xia, Kangyu Zhu, Haoran Li, Hongtu Zhu, Yun Li, Gang Li, Linjun Zhang, Huaxiu Yao
The paper "Reliable Multimodal RAG for Factuality in Medical Vision Language Models" addresses the issue of factual accuracy in Medical Large Vision Language Models (Med-LVLMs). Med-LVLMs have shown promise in medical diagnosis but often generate responses that deviate from established medical facts. The authors propose RULE, a method that enhances the factual accuracy of Med-LVLMs through two main components:
1. **Factuality Risk Control**: This component controls the risk of factual inaccuracies through calibrated selection of the number of retrieved contexts. Rather than retraining the Med-LVLM, it applies a post-processing step that performs a hypothesis test for each candidate number of retrieved contexts, keeping only those certified to meet the accuracy target.
2. **Knowledge-Balanced Preference Tuning**: This component addresses the over-reliance on retrieved contexts by fine-tuning the model using a preference dataset. The dataset is curated from samples where the model initially responds correctly but gives incorrect answers after incorporating retrieved contexts. This fine-tuning balances the model's reliance on its own knowledge and retrieved contexts.
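The first component above can be illustrated with a small sketch. This is not the paper's exact procedure; it assumes a Hoeffding-bound p-value per candidate context count and a Bonferroni correction across candidates, with hypothetical names (`select_num_contexts`, `errors_by_k`) invented for illustration:

```python
import math

def select_num_contexts(errors_by_k, alpha=0.1, delta=0.05):
    """For each candidate number of retrieved contexts k, test
    H0: true factual-error risk > alpha on a calibration set.
    Keep only the k values whose Hoeffding p-value survives a
    Bonferroni correction at level delta."""
    valid = []
    for k, errors in errors_by_k.items():
        n = len(errors)
        r_hat = sum(errors) / n  # empirical error rate with k contexts
        # One-sided Hoeffding p-value: small when r_hat is well below alpha.
        if r_hat < alpha:
            p = math.exp(-2 * n * (alpha - r_hat) ** 2)
        else:
            p = 1.0
        if p <= delta / len(errors_by_k):  # Bonferroni across candidates
            valid.append(k)
    return valid

# Hypothetical calibration errors (1 = factual error) for k = 1, 2, 3 contexts:
errs = {1: [1] * 200 + [0] * 800,   # 20% error rate, above alpha
        2: [1] * 50 + [0] * 950,    # 5% error rate, certifiably below alpha
        3: [1] * 150 + [0] * 850}   # 15% error rate, above alpha
print(select_num_contexts(errs))  # → [2]
```

Only k = 2 is retained here: its empirical risk is far enough below the target that the test rejects the null even after correction, matching the idea of selecting context counts with statistically controlled factual risk.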
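The curation step in the second component can likewise be sketched. This is a minimal, assumed version: the field names (`answer_no_rag`, `answer_with_rag`, `gold`) and the prompt/chosen/rejected pair format are illustrative conventions, not the paper's specification:

```python
def build_preference_pairs(samples):
    """Curate preference pairs from cases where retrieval hurts:
    the model answers correctly on its own but incorrectly once
    retrieved contexts are added (over-reliance on retrieval)."""
    pairs = []
    for s in samples:
        if s["answer_no_rag"] == s["gold"] and s["answer_with_rag"] != s["gold"]:
            pairs.append({
                "prompt": s["question"],
                "chosen": s["answer_no_rag"],      # correct answer from the model's own knowledge
                "rejected": s["answer_with_rag"],  # wrong answer caused by the retrieved contexts
            })
    return pairs

# Hypothetical VQA samples: only the first shows retrieval-induced failure.
samples = [
    {"question": "Is cardiomegaly present?", "gold": "yes",
     "answer_no_rag": "yes", "answer_with_rag": "no"},
    {"question": "Is there pleural effusion?", "gold": "no",
     "answer_no_rag": "no", "answer_with_rag": "no"},
]
print(build_preference_pairs(samples))  # one pair, from the first sample only
```

Fine-tuning on such pairs (e.g., with a DPO-style objective) penalizes the failure mode where retrieved contexts override correct internal knowledge, which is the balance the component targets.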
The effectiveness of RULE is demonstrated on three medical VQA datasets, achieving an average improvement of 20.8% in factual accuracy. The paper also includes experimental results, ablation studies, and a case study to illustrate the improvements and the compatibility of RULE with different models and datasets.