Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization

3 Apr 2024 | Renjie Pi*, Tianyang Han*, Wei Xiong, Jipeng Zhang, Runtao Liu, Rui Pan, and Tong Zhang
This paper introduces Bootstrapped Preference Optimization (BPO), a method that strengthens the visual grounding of Multimodal Large Language Models (MLLMs) by mitigating their bias toward pretraining statistics. The key idea is preference learning over datasets in which the negative responses are generated by the model itself. Two strategies are proposed: 1) feeding distorted image inputs to elicit responses dominated by pretraining bias, and 2) injecting erroneous but plausible elements into the original responses with a text-based LLM. These negatives are paired with the annotated responses to form a preference dataset, which is then used for preference optimization; a minimal sketch of this step appears below. Suppressing the pretraining bias in this way yields better alignment with the visual input.

Extensive experiments show significant performance improvements across multiple benchmarks, advancing the state of the art in multimodal conversational systems. The paper also reviews related work on alignment techniques for large language models and on hallucination mitigation in MLLMs. BPO is compared with supervised fine-tuning (SFT) and Direct Preference Optimization (DPO), and it outperforms both in visual truthfulness, helpfulness, and sample efficiency. The authors conclude that BPO is a promising route to better-grounded MLLMs.
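To make the mechanics concrete, the sketch below shows a DPO-style preference loss of the kind BPO applies to its bootstrapped (annotated vs. self-generated negative) response pairs. The function and variable names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a DPO-style preference loss over bootstrapped pairs.
# Inputs are summed log-probabilities of each full response under the trained
# policy and a frozen reference model; names/shapes are illustrative only.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each argument is a (batch,) tensor.

    "Chosen" is the annotated ground-truth response; "rejected" is the
    bootstrapped negative (e.g., a response to a distorted image, or an
    error-injected rewrite of the original answer).
    """
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    # Push the policy to prefer the grounded response more strongly
    # than the reference model does.
    logits = beta * (policy_logratio - ref_logratio)
    return -F.logsigmoid(logits).mean()
```

In this framing, the only BPO-specific ingredient is where the rejected responses come from (distorted-image generations and error-injected rewrites); the optimization step itself follows standard preference learning.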