Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization

3 Apr 2024 | Renjie Pi*, Tianyang Han*, Wei Xiong, Jipeng Zhang, Runtao Liu, Rui Pan, and Tong Zhang
This paper introduces Bootstrapped Preference Optimization (BPO), a method that strengthens the visual grounding of Multimodal Large Language Models (MLLMs) by mitigating their bias toward pretraining statistics. The key idea is preference learning over datasets in which the negative responses are generated by the model itself. Two strategies are proposed: 1) feeding distorted image inputs to elicit responses dominated by pretraining bias, and 2) injecting erroneous but plausible elements into the original responses with a text-based LLM. These negatives are paired with the annotated responses to form a preference dataset, which is then used for preference optimization; a minimal sketch of this step appears below. Suppressing the pretraining bias in this way yields better alignment with the visual input.

Extensive experiments show significant performance improvements across multiple benchmarks, advancing the state of the art in multimodal conversational systems. The paper also reviews related work on alignment techniques for large language models and on hallucination mitigation in MLLMs. BPO is compared with supervised fine-tuning (SFT) and Direct Preference Optimization (DPO), and it outperforms both in visual truthfulness, helpfulness, and sample efficiency. The authors conclude that BPO is a promising route to better-grounded MLLMs.
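To make the mechanics concrete, the sketch below shows a DPO-style preference loss of the kind BPO applies to its bootstrapped (annotated vs. self-generated negative) response pairs. The function and variable names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a DPO-style preference loss over bootstrapped pairs.
# Inputs are summed log-probabilities of each full response under the trained
# policy and a frozen reference model; names/shapes are illustrative only.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each argument is a (batch,) tensor.

    "Chosen" is the annotated ground-truth response; "rejected" is the
    bootstrapped negative (e.g., a response to a distorted image, or an
    error-injected rewrite of the original answer).
    """
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    # Push the policy to prefer the grounded response more strongly
    # than the reference model does.
    logits = beta * (policy_logratio - ref_logratio)
    return -F.logsigmoid(logits).mean()
```

In this framing, the only BPO-specific ingredient is where the rejected responses come from (distorted-image generations and error-injected rewrites); the optimization step itself follows standard preference learning.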